Gradient boosting

gradient_boosting_classifier_train(X, y, validation_method='split', metrics=['accuracy'], split_size=0.2, cv_folds=5, loss='log_loss', learning_rate=0.1, n_estimators=100, max_depth=3, subsample=1.0, verbose=0, random_state=None, **kwargs)

Train and optionally validate a Gradient Boosting classifier model using Sklearn.

Various options and configurations for model performance evaluation are available. The choices are no validation, a single split into training and validation parts, and cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the details of the validation process, the output metrics dictionary can be empty, one-dimensional or nested.

For more information about the Sklearn Gradient Boosting classifier, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
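
As a quick illustration, here is a minimal usage sketch. The import path is inferred from the source path shown below; the synthetic data and the chosen parameter values are illustrative assumptions, not part of the toolkit.

import numpy as np

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_classifier_train

# Synthetic binary classification data, for illustration only.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary labels derived from the features

# Train with a 20% holdout split and score the model with two metrics.
model, metrics = gradient_boosting_classifier_train(
    X,
    y,
    validation_method="split",
    metrics=["accuracy", "f1"],
    split_size=0.2,
    random_state=42,
)
print(metrics)  # e.g. a dictionary with "accuracy" and "f1" scores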

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| X | Union[ndarray, DataFrame] | Training data. | required |
| y | Union[ndarray, Series] | Target labels. | required |
| validation_method | Literal['split', 'kfold_cv', 'skfold_cv', 'loo_cv', 'none'] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" will not validate the model at all (in this case, all of X and y is used solely for training). | 'split' |
| metrics | Sequence[Literal['accuracy', 'precision', 'recall', 'f1', 'auc']] | Metrics to use for scoring the model. Defaults to "accuracy". | ['accuracy'] |
| split_size | float | Fraction of the dataset to be used as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
| cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
| loss | Literal['log_loss', 'exponential'] | The loss function to be optimized. Defaults to "log_loss" (the same as in logistic regression). | 'log_loss' |
| learning_rate | Number | Shrinks the contribution of each tree. Values must be >= 0. Defaults to 0.1. | 0.1 |
| n_estimators | int | The number of boosting stages to run. Gradient boosting is fairly robust to over-fitting, so a large number can result in better performance. Values must be >= 1. Defaults to 100. | 100 |
| max_depth | Optional[int] | Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to 3. | 3 |
| subsample | Number | The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0, this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators; choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0. | 1.0 |
| verbose | int | Specifies if modeling progress and performance should be printed. 0 doesn't print, 1 prints once in a while depending on the number of trees, 2 or above prints for every tree. | 0 |
| random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
| **kwargs |  | Additional parameters for Sklearn's GradientBoostingClassifier. | {} |

Returns:

| Type | Description |
| ---- | ----------- |
| Tuple[GradientBoostingClassifier, dict] | The trained GradientBoostingClassifier and metric scores as a dictionary. |

Raises:

| Type | Description |
| ---- | ----------- |
| InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
| NonMatchingParameterLengthsException | If X and y have mismatching sizes. |

Source code in eis_toolkit/prediction/gradient_boosting.py
@beartype
def gradient_boosting_classifier_train(
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    validation_method: Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"] = "split",
    metrics: Sequence[Literal["accuracy", "precision", "recall", "f1", "auc"]] = ["accuracy"],
    split_size: float = 0.2,
    cv_folds: int = 5,
    loss: Literal["log_loss", "exponential"] = "log_loss",
    learning_rate: Number = 0.1,
    n_estimators: int = 100,
    max_depth: Optional[int] = 3,
    subsample: Number = 1.0,
    verbose: int = 0,
    random_state: Optional[int] = None,
    **kwargs,
) -> Tuple[GradientBoostingClassifier, dict]:
    """
    Train and optionally validate a Gradient Boosting classifier model using Sklearn.

    Various options and configurations for model performance evaluation are available. The choices
    are no validation, a single split into training and validation parts, and cross-validation.
    If validation is performed, the metric(s) to calculate can be defined and the validation process
    configured (cross-validation method, number of folds, size of the split). Depending on the details
    of the validation process, the output metrics dictionary can be empty, one-dimensional or nested.

    For more information about the Sklearn Gradient Boosting classifier, read the documentation here:
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.

    Args:
        X: Training data.
        y: Target labels.
        validation_method: Validation method to use. "split" divides data into two parts, "kfold_cv"
            performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation,
            "loo_cv" performs leave-one-out cross-validation and "none" will not validate the model at all
            (in this case, all of X and y is used solely for training).
        metrics: Metrics to use for scoring the model. Defaults to "accuracy".
        split_size: Fraction of the dataset to be used as validation data (rest is used for training).
            Used only when validation_method is "split". Defaults to 0.2.
        cv_folds: Number of folds used in cross-validation. Used only when validation_method is "kfold_cv"
            or "skfold_cv". Defaults to 5.
        loss: The loss function to be optimized. Defaults to "log_loss" (same as in logistic regression).
        learning_rate: Shrinks the contribution of each tree. Values must be >= 0. Defaults to 0.1.
        n_estimators: The number of boosting stages to run. Gradient boosting is fairly robust to over-fitting
            so a large number can result in better performance. Values must be >= 1. Defaults to 100.
        max_depth: Maximum depth of the individual regression estimators. The maximum depth limits the number
            of nodes in the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves
            are pure or until all leaves contain less than min_samples_split samples. Defaults to 3.
        subsample: The fraction of samples to be used for fitting the individual base learners.
            If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the
            parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
            Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0.
        verbose: Specifies if modeling progress and performance should be printed. 0 doesn't print,
            1 prints once in a while depending on the number of trees, 2 or above will print for every tree.
        random_state: Seed for random number generation. Defaults to None.
        **kwargs: Additional parameters for Sklearn's GradientBoostingClassifier.

    Returns:
        The trained GradientBoostingClassifier and metric scores as a dictionary.

    Raises:
        InvalidParameterValueException: If some of the numeric parameters are given invalid input values.
        NonMatchingParameterLengthsException: X and y have mismatching sizes.
    """
    if not learning_rate >= 0:
        raise InvalidParameterValueException("Learning rate must be non-negative.")
    if not n_estimators >= 1:
        raise InvalidParameterValueException("N-estimators must be at least 1.")
    if max_depth is not None and not max_depth >= 1:
        raise InvalidParameterValueException("Max depth must be at least 1 or None.")
    if not (0 < subsample <= 1):
        raise InvalidParameterValueException("Subsample must be more than 0 and at most 1.")
    if verbose < 0:
        raise InvalidParameterValueException("Verbose must be a non-negative number.")

    model = GradientBoostingClassifier(
        loss=loss,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        max_depth=max_depth,
        subsample=subsample,
        random_state=random_state,
        verbose=verbose,
        **kwargs,
    )

    model, metrics = _train_and_validate_sklearn_model(
        X=X,
        y=y,
        model=model,
        validation_method=validation_method,
        metrics=metrics,
        split_size=split_size,
        cv_folds=cv_folds,
        random_state=random_state,
    )

    return model, metrics
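
Because the returned model is a regular Sklearn GradientBoostingClassifier, the standard Sklearn prediction API applies after training. Continuing the sketch above (new_X is a hypothetical array of unseen data):

# Standard Sklearn prediction methods on the trained model.
new_X = rng.normal(size=(10, 5))     # placeholder for unseen samples
labels = model.predict(new_X)        # hard class predictions
probas = model.predict_proba(new_X)  # class probabilities per sample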

gradient_boosting_regressor_train(X, y, validation_method='split', metrics=['mse'], split_size=0.2, cv_folds=5, loss='squared_error', learning_rate=0.1, n_estimators=100, max_depth=3, subsample=1.0, verbose=0, random_state=None, **kwargs)

Train and optionally validate a Gradient Boosting regressor model using Sklearn.

Various options and configurations for model performance evaluation are available. The choices are no validation, a single split into training and validation parts, and cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the details of the validation process, the output metrics dictionary can be empty, one-dimensional or nested.

For more information about the Sklearn Gradient Boosting regressor, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.
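
As with the classifier, here is a minimal cross-validated usage sketch; the import path is inferred from the source path shown below, and the synthetic data is an illustrative assumption.

import numpy as np

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_regressor_train

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)  # noisy linear target

# 5-fold cross-validation scored with two regression metrics;
# subsample < 1.0 switches on stochastic gradient boosting.
model, metrics = gradient_boosting_regressor_train(
    X,
    y,
    validation_method="kfold_cv",
    metrics=["mse", "r2"],
    cv_folds=5,
    subsample=0.8,
    random_state=0,
)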

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| X | Union[ndarray, DataFrame] | Training data. | required |
| y | Union[ndarray, Series] | Target labels. | required |
| validation_method | Literal['split', 'kfold_cv', 'skfold_cv', 'loo_cv', 'none'] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" will not validate the model at all (in this case, all of X and y is used solely for training). | 'split' |
| metrics | Sequence[Literal['mse', 'rmse', 'mae', 'r2']] | Metrics to use for scoring the model. Defaults to "mse". | ['mse'] |
| split_size | float | Fraction of the dataset to be used as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
| cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
| loss | Literal['squared_error', 'absolute_error', 'huber', 'quantile'] | The loss function to be optimized. Defaults to "squared_error". | 'squared_error' |
| learning_rate | Number | Shrinks the contribution of each tree. Values must be >= 0. Defaults to 0.1. | 0.1 |
| n_estimators | int | The number of boosting stages to run. Gradient boosting is fairly robust to over-fitting, so a large number can result in better performance. Values must be >= 1. Defaults to 100. | 100 |
| max_depth | Optional[int] | Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to 3. | 3 |
| subsample | Number | The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0, this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators; choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0. | 1.0 |
| verbose | int | Specifies if modeling progress and performance should be printed. 0 doesn't print, 1 prints once in a while depending on the number of trees, 2 or above prints for every tree. | 0 |
| random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
| **kwargs |  | Additional parameters for Sklearn's GradientBoostingRegressor. | {} |

Returns:

| Type | Description |
| ---- | ----------- |
| Tuple[GradientBoostingRegressor, dict] | The trained GradientBoostingRegressor and metric scores as a dictionary. |

Raises:

| Type | Description |
| ---- | ----------- |
| InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
| NonMatchingParameterLengthsException | If X and y have mismatching sizes. |

Source code in eis_toolkit/prediction/gradient_boosting.py
@beartype
def gradient_boosting_regressor_train(
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    validation_method: Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"] = "split",
    metrics: Sequence[Literal["mse", "rmse", "mae", "r2"]] = ["mse"],
    split_size: float = 0.2,
    cv_folds: int = 5,
    loss: Literal["squared_error", "absolute_error", "huber", "quantile"] = "squared_error",
    learning_rate: Number = 0.1,
    n_estimators: int = 100,
    max_depth: Optional[int] = 3,
    subsample: Number = 1.0,
    verbose: int = 0,
    random_state: Optional[int] = None,
    **kwargs,
) -> Tuple[GradientBoostingRegressor, dict]:
    """
    Train and optionally validate a Gradient Boosting regressor model using Sklearn.

    Various options and configurations for model performance evaluation are available. The choices
    are no validation, a single split into training and validation parts, and cross-validation.
    If validation is performed, the metric(s) to calculate can be defined and the validation process
    configured (cross-validation method, number of folds, size of the split). Depending on the details
    of the validation process, the output metrics dictionary can be empty, one-dimensional or nested.

    For more information about the Sklearn Gradient Boosting regressor, read the documentation here:
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.

    Args:
        X: Training data.
        y: Target labels.
        validation_method: Validation method to use. "split" divides data into two parts, "kfold_cv"
            performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation,
            "loo_cv" performs leave-one-out cross-validation and "none" will not validate the model at all
            (in this case, all of X and y is used solely for training).
        metrics: Metrics to use for scoring the model. Defaults to "mse".
        split_size: Fraction of the dataset to be used as validation data (rest is used for training).
            Used only when validation_method is "split". Defaults to 0.2.
        cv_folds: Number of folds used in cross-validation. Used only when validation_method is "kfold_cv"
            or "skfold_cv". Defaults to 5.
        loss: The loss function to be optimized. Defaults to "squared_error".
        learning_rate: Shrinks the contribution of each tree. Values must be >= 0. Defaults to 0.1.
        n_estimators: The number of boosting stages to run. Gradient boosting is fairly robust to over-fitting
            so a large number can result in better performance. Values must be >= 1. Defaults to 100.
        max_depth: Maximum depth of the individual regression estimators. The maximum depth limits the number
            of nodes in the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves
            are pure or until all leaves contain less than min_samples_split samples. Defaults to 3.
        subsample: The fraction of samples to be used for fitting the individual base learners.
            If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the
            parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
            Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0.
        verbose: Specifies if modeling progress and performance should be printed. 0 doesn't print,
            1 prints once in a while depending on the number of trees, 2 or above will print for every tree.
        random_state: Seed for random number generation. Defaults to None.
        **kwargs: Additional parameters for Sklearn's GradientBoostingRegressor.

    Returns:
        The trained GradientBoostingRegressor and metric scores as a dictionary.

    Raises:
        InvalidParameterValueException: If some of the numeric parameters are given invalid input values.
        NonMatchingParameterLengthsException: X and y have mismatching sizes.
    """
    if not learning_rate >= 0:
        raise InvalidParameterValueException("Learning rate must be non-negative.")
    if not n_estimators >= 1:
        raise InvalidParameterValueException("N-estimators must be at least 1.")
    if max_depth is not None and not max_depth >= 1:
        raise InvalidParameterValueException("Max depth must be at least 1 or None.")
    if not (0 < subsample <= 1):
        raise InvalidParameterValueException("Subsample must be more than 0 and at most 1.")
    if verbose < 0:
        raise InvalidParameterValueException("Verbose must be a non-negative number.")

    model = GradientBoostingRegressor(
        loss=loss,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        max_depth=max_depth,
        subsample=subsample,
        random_state=random_state,
        verbose=verbose,
        **kwargs,
    )

    model, metrics = _train_and_validate_sklearn_model(
        X=X,
        y=y,
        model=model,
        validation_method=validation_method,
        metrics=metrics,
        split_size=split_size,
        cv_folds=cv_folds,
        random_state=random_state,
    )

    return model, metrics