
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ensemble/plot_bagging_classifier.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ensemble_plot_bagging_classifier.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ensemble_plot_bagging_classifier.py:


=================================
Bagging classifiers using sampler
=================================

In this example, we show how
:class:`~imblearn.ensemble.BalancedBaggingClassifier` can be used to create a
large variety of classifiers by giving different samplers.

We will give several examples that have been published in the passed year.

.. GENERATED FROM PYTHON SOURCE LINES 12-16

.. code-block:: default


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 17-19

.. code-block:: default

    print(__doc__)








.. GENERATED FROM PYTHON SOURCE LINES 20-26

Generate an imbalanced dataset
------------------------------

For this example, we will create a synthetic dataset using the function
:func:`~sklearn.datasets.make_classification`. The problem will be a toy
classification problem with a ratio of 1:9 between the two classes.

.. GENERATED FROM PYTHON SOURCE LINES 28-38

.. code-block:: default

    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=10_000,
        n_features=10,
        weights=[0.1, 0.9],
        class_sep=0.5,
        random_state=0,
    )








.. GENERATED FROM PYTHON SOURCE LINES 39-43

.. code-block:: default

    import pandas as pd

    pd.Series(y).value_counts(normalize=True)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    1    0.8977
    0    0.1023
    Name: proportion, dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 44-48

In the following sections, we will show a couple of algorithms that have
been proposed over the years. We intend to illustrate how one can reuse the
:class:`~imblearn.ensemble.BalancedBaggingClassifier` by passing different
sampler.

.. GENERATED FROM PYTHON SOURCE LINES 48-51

.. code-block:: default


    from sklearn.ensemble import BaggingClassifier








.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: default

    from sklearn.model_selection import cross_validate

    ebb = BaggingClassifier()
    cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")

    print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.710 +/- 0.011




.. GENERATED FROM PYTHON SOURCE LINES 60-68

Exactly Balanced Bagging and Over-Bagging
-----------------------------------------

The :class:`~imblearn.ensemble.BalancedBaggingClassifier` can use in
conjunction with a :class:`~imblearn.under_sampling.RandomUnderSampler` or
:class:`~imblearn.over_sampling.RandomOverSampler`. These methods are
referred as Exactly Balanced Bagging and Over-Bagging, respectively and have
been proposed first in [1]_.

.. GENERATED FROM PYTHON SOURCE LINES 70-79

.. code-block:: default

    from imblearn.ensemble import BalancedBaggingClassifier
    from imblearn.under_sampling import RandomUnderSampler

    # Exactly Balanced Bagging
    ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
    cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")

    print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.758 +/- 0.011




.. GENERATED FROM PYTHON SOURCE LINES 80-88

.. code-block:: default

    from imblearn.over_sampling import RandomOverSampler

    # Over-bagging
    over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
    cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")

    print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.702 +/- 0.014




.. GENERATED FROM PYTHON SOURCE LINES 89-96

SMOTE-Bagging
-------------

Instead of using a :class:`~imblearn.over_sampling.RandomOverSampler` that
make a bootstrap, an alternative is to use
:class:`~imblearn.over_sampling.SMOTE` as an over-sampler. This is known as
SMOTE-Bagging [2]_.

.. GENERATED FROM PYTHON SOURCE LINES 98-106

.. code-block:: default

    from imblearn.over_sampling import SMOTE

    # SMOTE-Bagging
    smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
    cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")

    print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.746 +/- 0.013




.. GENERATED FROM PYTHON SOURCE LINES 107-120

Roughly Balanced Bagging
------------------------
While using a :class:`~imblearn.under_sampling.RandomUnderSampler` or
:class:`~imblearn.over_sampling.RandomOverSampler` will create exactly the
desired number of samples, it does not follow the statistical spirit wanted
in the bagging framework. The authors in [3]_ proposes to use a negative
binomial distribution to compute the number of samples of the majority
class to be selected and then perform a random under-sampling.

Here, we illustrate this method by implementing a function in charge of
resampling and use the :class:`~imblearn.FunctionSampler` to integrate it
within a :class:`~imblearn.pipeline.Pipeline` and
:class:`~sklearn.model_selection.cross_validate`.

.. GENERATED FROM PYTHON SOURCE LINES 122-166

.. code-block:: default

    from collections import Counter

    import numpy as np

    from imblearn import FunctionSampler


    def roughly_balanced_bagging(X, y, replace=False):
        """Implementation of Roughly Balanced Bagging for binary problem."""
        # find the minority and majority classes
        class_counts = Counter(y)
        majority_class = max(class_counts, key=class_counts.get)
        minority_class = min(class_counts, key=class_counts.get)

        # compute the number of sample to draw from the majority class using
        # a negative binomial distribution
        n_minority_class = class_counts[minority_class]
        n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

        # draw randomly with or without replacement
        majority_indices = np.random.choice(
            np.flatnonzero(y == majority_class),
            size=n_majority_resampled,
            replace=replace,
        )
        minority_indices = np.random.choice(
            np.flatnonzero(y == minority_class),
            size=n_minority_class,
            replace=replace,
        )
        indices = np.hstack([majority_indices, minority_indices])

        return X[indices], y[indices]


    # Roughly Balanced Bagging
    rbb = BalancedBaggingClassifier(
        sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
    )
    cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")

    print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.761 +/- 0.014




.. GENERATED FROM PYTHON SOURCE LINES 167-179

.. topic:: References:

   .. [1] R. Maclin, and D. Opitz. "An empirical evaluation of bagging and
          boosting." AAAI/IAAI 1997 (1997): 546-551.

   .. [2] S. Wang, and X. Yao. "Diversity analysis on imbalanced data sets by
          using ensemble models." 2009 IEEE symposium on computational
          intelligence and data mining. IEEE, 2009.

   .. [3] S. Hido, H. Kashima, and Y. Takahashi. "Roughly balanced bagging
         for imbalanced data." Statistical Analysis and Data Mining: The ASA
         Data Science Journal 2.5‐6 (2009): 412-426.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  15.168 seconds)


.. _sphx_glr_download_auto_examples_ensemble_plot_bagging_classifier.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example




    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_bagging_classifier.py <plot_bagging_classifier.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_bagging_classifier.ipynb <plot_bagging_classifier.ipynb>`


.. include:: plot_bagging_classifier.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
