Gradient boosting
```python
gradient_boosting_classifier_train(X, y, validation_method='split', metrics=['accuracy'], split_size=0.2, cv_folds=5, loss='log_loss', learning_rate=0.1, n_estimators=100, max_depth=3, subsample=1.0, verbose=0, random_state=None, **kwargs)
```
Train and optionally validate a Gradient Boosting classifier model using Sklearn.
Several options for model performance evaluation are available: no validation, a split into training and validation parts, or cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the validation configuration, the output metrics dictionary can be empty, one-dimensional or nested.
For more information about Sklearn Gradient Boosting classifier read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Union[ndarray, DataFrame] | Training data. | required |
y | Union[ndarray, Series] | Target labels. | required |
validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" does not validate the model at all (in this case, all of X and y is used solely for training). | 'split' |
metrics | Sequence[Literal[accuracy, precision, recall, f1, auc]] | Metrics to use for scoring the model. Defaults to "accuracy". | ['accuracy'] |
split_size | float | Fraction of the dataset to be used as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
loss | Literal[log_loss, exponential] | The loss function to be optimized. Defaults to "log_loss" (the same loss as in logistic regression). | 'log_loss' |
learning_rate | Number | Shrinks the contribution of each tree. Values must be >= 0. Defaults to 0.1. | 0.1 |
n_estimators | int | The number of boosting stages to run. Gradient boosting is fairly robust to overfitting, so a large number can result in better performance. Values must be >= 1. Defaults to 100. | 100 |
max_depth | Optional[int] | Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Values must be >= 1, or None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to 3. | 3 |
subsample | Number | The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0, this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators; choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0. | 1.0 |
verbose | int | Specifies whether modeling progress and performance should be printed: 0 prints nothing, 1 prints occasionally depending on the number of trees, and 2 or above prints for every tree. Defaults to 0. | 0 |
random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
**kwargs | | Additional parameters for Sklearn's GradientBoostingClassifier. | {} |
Returns:
Type | Description |
---|---|
Tuple[GradientBoostingClassifier, dict] | The trained GradientBoostingClassifier and metric scores as a dictionary. |
Raises:
Type | Description |
---|---|
InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
NonMatchingParameterLengthsException | If X and y have mismatching sizes. |
Source code in eis_toolkit/prediction/gradient_boosting.py
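The sketch below is a minimal, illustrative usage example, not taken from the eis_toolkit documentation: the synthetic data and parameter values are assumptions, and the import path is inferred from the source file location given above. Only parameters documented in the table are used.

```python
import numpy as np

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_classifier_train

# Synthetic binary classification data, for illustration only.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the data for validation and score with two metrics.
model, metrics = gradient_boosting_classifier_train(
    X,
    y,
    validation_method="split",
    metrics=["accuracy", "f1"],
    split_size=0.2,
    subsample=0.8,  # < 1.0 enables Stochastic Gradient Boosting
    random_state=42,
)
print(metrics)  # dictionary shape depends on the validation configuration

# Stratified k-fold cross-validation instead of a simple split.
model_cv, metrics_cv = gradient_boosting_classifier_train(
    X, y, validation_method="skfold_cv", cv_folds=5, random_state=42
)

# No validation: all of X and y is used solely for training.
model_full, _ = gradient_boosting_classifier_train(X, y, validation_method="none")
```

As noted above, the returned metrics dictionary can be empty, one-dimensional or nested depending on the chosen validation method, so inspect it for your configuration rather than assuming a fixed structure.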
```python
gradient_boosting_regressor_train(X, y, validation_method='split', metrics=['mse'], split_size=0.2, cv_folds=5, loss='squared_error', learning_rate=0.1, n_estimators=100, max_depth=3, subsample=1.0, verbose=0, random_state=None, **kwargs)
```
Train and optionally validate a Gradient Boosting regressor model using Sklearn.
Several options for model performance evaluation are available: no validation, a split into training and validation parts, or cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the validation configuration, the output metrics dictionary can be empty, one-dimensional or nested.
For more information about Sklearn Gradient Boosting regressor read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Union[ndarray, DataFrame] | Training data. | required |
y | Union[ndarray, Series] | Target labels. | required |
validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" does not validate the model at all (in this case, all of X and y is used solely for training). | 'split' |
metrics | Sequence[Literal[mse, rmse, mae, r2]] | Metrics to use for scoring the model. Defaults to "mse". | ['mse'] |
split_size | float | Fraction of the dataset to be used as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
loss | Literal[squared_error, absolute_error, huber, quantile] | The loss function to be optimized. Defaults to "squared_error". | 'squared_error' |
learning_rate | Number | Shrinks the contribution of each tree. Values must be > 0. Defaults to 0.1. | 0.1 |
n_estimators | int | The number of boosting stages to run. Gradient boosting is fairly robust to overfitting, so a large number can result in better performance. Values must be >= 1. Defaults to 100. | 100 |
max_depth | Optional[int] | Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Values must be >= 1, or None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to 3. | 3 |
subsample | Number | The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0, this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators; choosing subsample < 1.0 leads to a reduction of variance and an increase in bias. Values must be in the range 0.0 < x <= 1.0. Defaults to 1.0. | 1.0 |
verbose | int | Specifies whether modeling progress and performance should be printed: 0 prints nothing, 1 prints occasionally depending on the number of trees, and 2 or above prints for every tree. Defaults to 0. | 0 |
random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
**kwargs | | Additional parameters for Sklearn's GradientBoostingRegressor. | {} |
Returns:
Type | Description |
---|---|
Tuple[GradientBoostingRegressor, dict] | The trained GradientBoostingRegressor and metric scores as a dictionary. |
Raises:
Type | Description |
---|---|
InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
NonMatchingParameterLengthsException | If X and y have mismatching sizes. |
Source code in eis_toolkit/prediction/gradient_boosting.py
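Similarly, a minimal regressor sketch under the same assumptions (synthetic data, illustrative parameter values, import path inferred from the source file location above):

```python
import numpy as np

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_regressor_train

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 5-fold cross-validation with two regression metrics; a smaller
# learning rate is paired with more boosting stages to compensate.
model, metrics = gradient_boosting_regressor_train(
    X,
    y,
    validation_method="kfold_cv",
    metrics=["mse", "r2"],
    cv_folds=5,
    learning_rate=0.05,
    n_estimators=200,
    random_state=0,
)
print(metrics)  # structure depends on the validation configuration (see above)
```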