
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_impact_imbalanced_classes.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_impact_imbalanced_classes.py:


===========================================================
Fitting models on imbalanced datasets and how to fight bias
===========================================================

This example illustrates the problems induced by learning from datasets with
imbalanced classes. We then compare several approaches that alleviate these
negative effects.

.. GENERATED FROM PYTHON SOURCE LINES 10-14

.. code-block:: default


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT








.. GENERATED FROM PYTHON SOURCE LINES 15-17

.. code-block:: default

    print(__doc__)








.. GENERATED FROM PYTHON SOURCE LINES 18-27

Problem definition
------------------

We drop the following features:

- "fnlwgt": this feature was created while studying the "adult" dataset.
  It was not acquired during the survey, so we will not use it.
- "education-num": it encodes the same information as "education".
  Thus, we remove one of these 2 features.

.. GENERATED FROM PYTHON SOURCE LINES 29-34

.. code-block:: default

    from sklearn.datasets import fetch_openml

    df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
    df = df.drop(columns=["fnlwgt", "education-num"])





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /Users/glemaitre/Documents/packages/scikit-learn/sklearn/datasets/_openml.py:1002: FutureWarning: The default value of `parser` will change from `'liac-arff'` to `'auto'` in 1.4. You can set `parser='auto'` to silence this warning. Therefore, an `ImportError` will be raised from 1.4 if the dataset is dense and pandas is not installed. Note that the pandas parser may return different data types. See the Notes Section in fetch_openml's API doc for details.
      warn(




.. GENERATED FROM PYTHON SOURCE LINES 35-36

The "adult" dataset has a class ratio of about 3:1.

.. GENERATED FROM PYTHON SOURCE LINES 38-41

.. code-block:: default

    classes_count = y.value_counts()
    classes_count





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    class
    <=50K    37155
    >50K     11687
    Name: count, dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 42-44

This dataset is only slightly imbalanced. To better highlight the effect of
learning from an imbalanced dataset, we will increase its ratio to 30:1.

.. GENERATED FROM PYTHON SOURCE LINES 46-56

.. code-block:: default

    from imblearn.datasets import make_imbalance

    ratio = 30
    df_res, y_res = make_imbalance(
        df,
        y,
        sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
    )
    y_res.value_counts()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    class
    <=50K    37155
    >50K      1238
    Name: count, dtype: int64
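
The minority-class target passed to `make_imbalance` above is plain integer
arithmetic. The following is a minimal sketch that reuses the class counts
printed earlier; the variable names are illustrative and are not part of the
example script.

```python
# Class counts reported above for the original "adult" dataset.
n_majority = 37155  # "<=50K"
n_minority = 11687  # ">50K"
ratio = 30

# Same expression as in the example: integer division of the majority count.
n_minority_target = n_majority // ratio

print(f"original ratio: {n_majority / n_minority:.2f}:1")
print(f"minority samples kept: {n_minority_target}")
print(f"resulting ratio: {n_majority / n_minority_target:.2f}:1")
```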



.. GENERATED FROM PYTHON SOURCE LINES 57-62

We will perform a cross-validation evaluation to get an estimate of the test
score.

As a baseline, we could use a classifier that always predicts the majority
class, regardless of the features provided.

.. GENERATED FROM PYTHON SOURCE LINES 62-65

.. code-block:: default


    from sklearn.dummy import DummyClassifier








.. GENERATED FROM PYTHON SOURCE LINES 66-73

.. code-block:: default

    from sklearn.model_selection import cross_validate

    dummy_clf = DummyClassifier(strategy="most_frequent")
    scoring = ["accuracy", "balanced_accuracy"]
    cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
    print(f"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Accuracy score of a dummy classifier: 0.968




.. GENERATED FROM PYTHON SOURCE LINES 74-76

Instead of the accuracy, we can use the balanced accuracy, which is the
average of the recall obtained on each class and therefore accounts for the
class imbalance.

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: default

    print(
        f"Balanced accuracy score of a dummy classifier: "
        f"{cv_result['test_balanced_accuracy'].mean():.3f}"
    )





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Balanced accuracy score of a dummy classifier: 0.500
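
To see why the two scores of the dummy classifier differ so much, the sketch
below recomputes both metrics in pure Python. It assumes the class counts of
the resampled dataset shown earlier and scores the whole dataset at once
rather than averaging over cross-validation folds, so it only approximates
the procedure used above. The helper functions are hypothetical and written
for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recall."""
    recalls = []
    for label in sorted(set(y_true)):
        preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(preds_for_label.count(label) / len(preds_for_label))
    return sum(recalls) / len(recalls)


# Labels with the class counts of the resampled dataset.
y_true = ["<=50K"] * 37155 + [">50K"] * 1238
# A "most frequent" baseline predicts the majority class for every sample.
y_pred = ["<=50K"] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

The recall is 1 on the majority class and 0 on the minority class, so the
balanced accuracy is 0.5 no matter how imbalanced the dataset is.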




.. GENERATED FROM PYTHON SOURCE LINES 84-88

Strategies to learn from an imbalanced dataset
----------------------------------------------
We will use a dictionary and a list to continuously store the results of
our experiments and show them as a pandas dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 90-93

.. code-block:: default

    index = []
    scores = {"Accuracy": [], "Balanced accuracy": []}








.. GENERATED FROM PYTHON SOURCE LINES 94-99

Dummy baseline
..............

Before training a real machine learning model, we can store the results
obtained with our :class:`~sklearn.dummy.DummyClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 101-111

.. code-block:: default

    import pandas as pd

    index += ["Dummy classifier"]
    cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.5</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 112-122

Linear classifier baseline
..........................

We will create a machine learning pipeline using a
:class:`~sklearn.linear_model.LogisticRegression` classifier. To do so, we
need to one-hot encode the categorical columns and standardize the numerical
columns before feeding the data into the
:class:`~sklearn.linear_model.LogisticRegression` classifier.

First, we define our numerical and categorical pipelines.

.. GENERATED FROM PYTHON SOURCE LINES 124-136

.. code-block:: default

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    num_pipe = make_pipeline(
        StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
    )
    cat_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder(handle_unknown="ignore"),
    )
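
To make concrete what these two preprocessing steps do to the data, here is a
toy pure-Python sketch of standardization and one-hot encoding. The column
values are made up for illustration, and the code only mirrors the idea behind
the scikit-learn transformers, not their implementation.

```python
from statistics import mean, pstdev

# Made-up toy columns (not taken from the dataset).
ages = [25.0, 38.0, 52.0, 45.0]
workclass = ["Private", "State-gov", "Private", "Local-gov"]

# Standardization: subtract the mean, divide by the population standard deviation.
mu, sigma = mean(ages), pstdev(ages)
ages_scaled = [(age - mu) / sigma for age in ages]

# One-hot encoding: one binary column per category seen in the training data.
categories = sorted(set(workclass))


def one_hot(value):
    # An unseen category gives an all-zero row, in the spirit of
    # handle_unknown="ignore".
    return [int(value == category) for category in categories]


workclass_encoded = [one_hot(value) for value in workclass]
print(categories)
print(workclass_encoded)
print(one_hot("Never-worked"))
```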








.. GENERATED FROM PYTHON SOURCE LINES 137-140

Then, we can create a preprocessor which will dispatch the categorical
columns to the categorical pipeline and the numerical columns to the
numerical pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 142-151

.. code-block:: default

    from sklearn.compose import make_column_selector as selector
    from sklearn.compose import make_column_transformer

    preprocessor_linear = make_column_transformer(
        (num_pipe, selector(dtype_include="number")),
        (cat_pipe, selector(dtype_include="category")),
        n_jobs=2,
    )








.. GENERATED FROM PYTHON SOURCE LINES 152-155

Finally, we connect our preprocessor with our
:class:`~sklearn.linear_model.LogisticRegression`. We can then evaluate our
model.

.. GENERATED FROM PYTHON SOURCE LINES 157-161

.. code-block:: default

    from sklearn.linear_model import LogisticRegression

    lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))








.. GENERATED FROM PYTHON SOURCE LINES 162-170

.. code-block:: default

    index += ["Logistic regression"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 171-178

We can see that our linear model performs only slightly better than our dummy
baseline: it is still strongly affected by the class imbalance.

We can verify that something similar is happening with a tree-based model
such as :class:`~sklearn.ensemble.RandomForestClassifier`. With this type of
classifier, we will not need to scale the numerical data, and we will only
need to ordinal encode the categorical data.

.. GENERATED FROM PYTHON SOURCE LINES 178-181

.. code-block:: default


    from sklearn.ensemble import RandomForestClassifier








.. GENERATED FROM PYTHON SOURCE LINES 182-200

.. code-block:: default

    from sklearn.preprocessing import OrdinalEncoder

    num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
    cat_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    )

    preprocessor_tree = make_column_transformer(
        (num_pipe, selector(dtype_include="number")),
        (cat_pipe, selector(dtype_include="category")),
        n_jobs=2,
    )

    rf_clf = make_pipeline(
        preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)
    )
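
The ordinal encoding used for the tree-based model can be illustrated with a
toy pure-Python sketch. It assumes that categories are numbered in sorted
order and that any category unseen at fit time is mapped to `-1`, which is
the behaviour requested with `unknown_value=-1`. The column values and the
helper are made up for illustration.

```python
# Made-up toy training column, after imputing missing values with "missing".
train_values = ["Private", "State-gov", "missing", "Private"]

# Assign one integer code per category, in sorted order.
categories = sorted(set(train_values))
mapping = {category: code for code, category in enumerate(categories)}


def ordinal_encode(value, unknown_value=-1):
    """Return the integer code of a category, or `unknown_value` if unseen."""
    return mapping.get(value, unknown_value)


# "Never-worked" was not seen in the training column.
encoded = [ordinal_encode(value) for value in train_values + ["Never-worked"]]
print(mapping)
print(encoded)
```

Unlike one-hot encoding, this keeps a single column per feature, which is
sufficient for trees since they split on thresholds.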








.. GENERATED FROM PYTHON SOURCE LINES 201-209

.. code-block:: default

    index += ["Random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 210-224

The :class:`~sklearn.ensemble.RandomForestClassifier` is also affected by
the class imbalance, although slightly less than the linear model. Now, we
will present different approaches to improve the performance of these 2
models.

Use `class_weight`
..................

Many models in `scikit-learn` have a `class_weight` parameter. This
parameter affects the computation of the loss in linear models, or of the
criterion in tree-based models, so that misclassifications of the minority
and majority classes are penalized differently. We can set
`class_weight="balanced"` such that the weight applied is inversely
proportional to the class frequency. We test this parametrization with both
the linear model and the tree-based model.
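
As a rough illustration of what "inversely proportional to the class
frequency" means, the sketch below applies the heuristic documented by
scikit-learn for `class_weight="balanced"`, namely
`n_samples / (n_classes * n_samples_in_class)`. It uses the class counts of
the full resampled dataset as an assumption; during cross-validation the
weights are computed on each training fold, so the actual values differ
slightly.

```python
from collections import Counter

# Toy labels with the class counts of the resampled dataset (illustrative).
y = ["<=50K"] * 37155 + [">50K"] * 1238

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * number of samples in the class).
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weight.items():
    weighted_count = weight * counts[label]
    print(f"{label!r}: weight={weight:.3f}, weighted count={weighted_count:.1f}")
```

With these weights, both classes contribute the same total weight to the
loss, even though the minority class has about 30 times fewer samples.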

.. GENERATED FROM PYTHON SOURCE LINES 226-236

.. code-block:: default

    lr_clf.set_params(logisticregression__class_weight="balanced")

    index += ["Logistic regression with balanced class weights"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 237-247

.. code-block:: default

    rf_clf.set_params(randomforestclassifier__class_weight="balanced")

    index += ["Random forest with balanced class weights"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
        <tr>
          <th>Random forest with balanced class weights</th>
          <td>0.963743</td>
          <td>0.620920</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 248-260

We can see that using `class_weight` was really effective for the linear
model, alleviating the issue of learning from imbalanced classes. However,
the :class:`~sklearn.ensemble.RandomForestClassifier` is still biased toward
the majority class, mainly because reweighting the split criterion is not
enough to counter the class imbalance.

Resample the training set during learning
.........................................

Another way is to resample the training set by under-sampling the majority
class or over-sampling the minority class. `imbalanced-learn` provides
samplers for this purpose.
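
Conceptually, random under-sampling keeps all samples of the smallest class
and a random subset of the same size from every other class. The function
below is a simplified pure-Python sketch of that idea; it is a hypothetical
helper written for illustration and is not the implementation used by
`imbalanced-learn`.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    indices_per_class = {}
    for idx, label in enumerate(y):
        indices_per_class.setdefault(label, []).append(idx)
    n_keep = min(len(indices) for indices in indices_per_class.values())

    kept = []
    for indices in indices_per_class.values():
        # Sampling without replacement; the smallest class is kept entirely.
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 20 majority samples and 4 minority samples.
X = list(range(24))
y = ["majority"] * 20 + ["minority"] * 4

X_under, y_under = random_under_sample(X, y, seed=42)
print(Counter(y_under))
```

In a cross-validation setting, such resampling must only be applied to the
training folds, which is what a pipeline that is aware of samplers takes
care of.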

.. GENERATED FROM PYTHON SOURCE LINES 262-271

.. code-block:: default

    from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
    from imblearn.under_sampling import RandomUnderSampler

    lr_clf = make_pipeline_with_sampler(
        preprocessor_linear,
        RandomUnderSampler(random_state=42),
        LogisticRegression(max_iter=1000),
    )








.. GENERATED FROM PYTHON SOURCE LINES 272-280

.. code-block:: default

    index += ["Under-sampling + Logistic regression"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
        <tr>
          <th>Random forest with balanced class weights</th>
          <td>0.963743</td>
          <td>0.620920</td>
        </tr>
        <tr>
          <th>Under-sampling + Logistic regression</th>
          <td>0.789285</td>
          <td>0.807170</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 281-287

.. code-block:: default

    rf_clf = make_pipeline_with_sampler(
        preprocessor_tree,
        RandomUnderSampler(random_state=42),
        RandomForestClassifier(random_state=42, n_jobs=2),
    )








.. GENERATED FROM PYTHON SOURCE LINES 288-296

.. code-block:: default

    index += ["Under-sampling + Random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
        <tr>
          <th>Random forest with balanced class weights</th>
          <td>0.963743</td>
          <td>0.620920</td>
        </tr>
        <tr>
          <th>Under-sampling + Logistic regression</th>
          <td>0.789285</td>
          <td>0.807170</td>
        </tr>
        <tr>
          <th>Under-sampling + Random forest</th>
          <td>0.787566</td>
          <td>0.798879</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 297-316

Applying a random under-sampler before training the linear model or the
random forest prevents the model from focusing on the majority class, at the
cost of making more mistakes on samples of the majority class (i.e. decreased
accuracy).

We could try any type of sampler and find which one works best on the
current dataset.

Instead, we will present another approach: using classifiers that apply
sampling internally.

Use of specific balanced algorithms from imbalanced-learn
.........................................................

We already showed that random under-sampling can be effective with tree-based
models. However, instead of under-sampling the dataset once, one could
under-sample the original dataset before taking each bootstrap sample. This is
the basis of the :class:`~imblearn.ensemble.BalancedRandomForestClassifier` and
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.
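
The idea described above can be sketched in pure Python: each member of the
ensemble gets its own balanced subset, from which a bootstrap sample is then
drawn. This is a conceptual illustration under simplified assumptions (toy
indices, a hypothetical `balanced_bootstrap` helper); it is not the actual
implementation of the `imbalanced-learn` estimators.

```python
import random


def balanced_bootstrap(majority_idx, minority_idx, rng):
    """Under-sample the majority class, then draw a bootstrap sample."""
    # Step 1: under-sample so that both classes have the same size.
    balanced = rng.sample(majority_idx, len(minority_idx)) + minority_idx
    # Step 2: bootstrap, i.e. sample with replacement from the balanced subset.
    return rng.choices(balanced, k=len(balanced))


rng = random.Random(42)
majority_idx = list(range(1000))        # toy indices of majority samples
minority_idx = list(range(1000, 1030))  # toy indices of minority samples

# One training set per ensemble member (5 members in this toy example).
training_sets = [
    balanced_bootstrap(majority_idx, minority_idx, rng) for _ in range(5)
]

seen_majority = {i for training_set in training_sets for i in training_set if i < 1000}
print(f"majority samples seen by one member: at most {len(minority_idx)}")
print(f"distinct majority samples seen by the ensemble: {len(seen_majority)}")
```

Because every member draws a different subset of the majority class, the
ensemble as a whole sees many more majority samples than a single
under-sampled training set would contain.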

.. GENERATED FROM PYTHON SOURCE LINES 318-325

.. code-block:: default

    from imblearn.ensemble import BalancedRandomForestClassifier

    rf_clf = make_pipeline(
        preprocessor_tree,
        BalancedRandomForestClassifier(random_state=42, n_jobs=2),
    )








.. GENERATED FROM PYTHON SOURCE LINES 326-334

.. code-block:: default

    index += ["Balanced random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
        <tr>
          <th>Random forest with balanced class weights</th>
          <td>0.963743</td>
          <td>0.620920</td>
        </tr>
        <tr>
          <th>Under-sampling + Logistic regression</th>
          <td>0.789285</td>
          <td>0.807170</td>
        </tr>
        <tr>
          <th>Under-sampling + Random forest</th>
          <td>0.787566</td>
          <td>0.798879</td>
        </tr>
        <tr>
          <th>Balanced random forest</th>
          <td>0.784987</td>
          <td>0.817841</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 335-339

The balanced accuracy obtained with the
:class:`~imblearn.ensemble.BalancedRandomForestClassifier` is better than the
one obtained with a single random under-sampling. Next, we will use a
gradient-boosting classifier within a
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 339-362

.. code-block:: default


    from sklearn.ensemble import HistGradientBoostingClassifier

    from imblearn.ensemble import BalancedBaggingClassifier

    bag_clf = make_pipeline(
        preprocessor_tree,
        BalancedBaggingClassifier(
            estimator=HistGradientBoostingClassifier(random_state=42),
            n_estimators=10,
            random_state=42,
            n_jobs=2,
        ),
    )

    index += ["Balanced bag of histogram gradient boosting"]
    cv_result = cross_validate(bag_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Accuracy</th>
          <th>Balanced accuracy</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>Dummy classifier</th>
          <td>0.967755</td>
          <td>0.500000</td>
        </tr>
        <tr>
          <th>Logistic regression</th>
          <td>0.970724</td>
          <td>0.568307</td>
        </tr>
        <tr>
          <th>Random forest</th>
          <td>0.970724</td>
          <td>0.622584</td>
        </tr>
        <tr>
          <th>Logistic regression with balanced class weights</th>
          <td>0.796630</td>
          <td>0.814884</td>
        </tr>
        <tr>
          <th>Random forest with balanced class weights</th>
          <td>0.963743</td>
          <td>0.620920</td>
        </tr>
        <tr>
          <th>Under-sampling + Logistic regression</th>
          <td>0.789285</td>
          <td>0.807170</td>
        </tr>
        <tr>
          <th>Under-sampling + Random forest</th>
          <td>0.787566</td>
          <td>0.798879</td>
        </tr>
        <tr>
          <th>Balanced random forest</th>
          <td>0.784987</td>
          <td>0.817841</td>
        </tr>
        <tr>
          <th>Balanced bag of histogram gradient boosting</th>
          <td>0.827912</td>
          <td>0.814246</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 363-366

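This last approach offers the best trade-off in this run: its balanced
accuracy is on par with the best of the other strategies, while its accuracy
is the highest among the approaches that reach a balanced accuracy of about
0.8. The different under-sampled subsets bring diversity to the individual
gradient-boosting models, so that the ensemble does not focus on a single
portion of the majority class.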


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  38.115 seconds)


.. _sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example




    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_impact_imbalanced_classes.py <plot_impact_imbalanced_classes.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_impact_imbalanced_classes.ipynb <plot_impact_imbalanced_classes.ipynb>`


.. include:: plot_impact_imbalanced_classes.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
