Random forests
random_forest_classifier_train(X, y, validation_method='split', metrics=['accuracy'], split_size=0.2, cv_folds=5, n_estimators=100, criterion='gini', max_depth=None, verbose=0, random_state=None, **kwargs)
Train and optionally validate a Random Forest classifier model using Sklearn.
Several options for evaluating model performance are available: no validation, a single split into training and validation parts, or cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the validation method, the output metrics dictionary can be empty, one-dimensional or nested.
For more information about Sklearn Random Forest classifier, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
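The validation options differ mainly in how rows are divided between training and validation. The following is a minimal pure-Python sketch of that idea, not the toolkit's implementation. The helper names `holdout_split` and `kfold_indices` are hypothetical, and shuffling and stratification are deliberately left out.

```python
from typing import List, Tuple


def holdout_split(n_samples: int, split_size: float = 0.2) -> Tuple[List[int], List[int]]:
    """Return (train, validation) row indices for a single split.

    Illustrative only: the last `split_size` fraction of rows becomes the
    validation part, without the shuffling a real implementation would apply.
    """
    n_val = max(1, round(n_samples * split_size))
    indices = list(range(n_samples))
    return indices[:-n_val], indices[-n_val:]


def kfold_indices(n_samples: int, cv_folds: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return one (train, validation) index pair per fold.

    Every row appears in exactly one validation fold. Leave-one-out is the
    special case cv_folds == n_samples.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, cv_folds)
    folds = []
    start = 0
    for i in range(cv_folds):
        stop = start + base + (1 if i < extra else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, validation))
        start = stop
    return folds


train, val = holdout_split(10, split_size=0.2)   # one train/validation pair
folds = kfold_indices(10, cv_folds=5)            # five train/validation pairs
loo = kfold_indices(4, cv_folds=4)               # leave-one-out on four rows
```

With a single split the model is scored once, whereas each cross-validation fold yields its own score, which is one reason the returned metrics can take different shapes.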
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Union[ndarray, DataFrame] | Training data. | required |
y | Union[ndarray, Series] | Target labels. | required |
validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides the data into training and validation parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" skips validation entirely (in this case, all of X and y is used for training). | 'split' |
metrics | Sequence[Literal[accuracy, precision, recall, f1]] | Metrics to use for scoring the model. Defaults to "accuracy". | ['accuracy'] |
split_size | float | Fraction of the dataset to use as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
n_estimators | int | The number of trees in the forest. Defaults to 100. | 100 |
criterion | Literal[gini, entropy, log_loss] | The function used to measure the quality of a split. Defaults to "gini". | 'gini' |
max_depth | Optional[int] | The maximum depth of each tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None. | None |
verbose | int | Controls whether modeling progress and performance are printed. 0 prints nothing; values of 1 or above produce output. | 0 |
random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
**kwargs | | Additional parameters for Sklearn's RandomForestClassifier. | {} |
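To make the four classification metric names concrete, here is a minimal sketch of their standard definitions for the binary case, with labels in {0, 1}. It is an illustration only and says nothing about how the toolkit computes them; the function name `binary_classification_metrics` is hypothetical and multiclass averaging is not covered.

```python
from typing import Dict, Sequence


def binary_classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Fall back to 0.0 when a denominator is zero (no positives predicted or present).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


scores = binary_classification_metrics(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 1, 0],
)
```

Accuracy alone can be misleading on imbalanced labels, which is a common reason to request precision, recall or F1 alongside it.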
Returns:
Type | Description |
---|---|
Tuple[RandomForestClassifier, dict] | The trained RandomForestClassifier and metric scores as a dictionary. |
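Because the metrics dictionary can be empty, one-dimensional or nested, code that consumes it may want to handle all three shapes uniformly. The sketch below flattens any of them into a single-level dictionary. The helper `flatten_metrics` and the example key names ("mean", "std") are hypothetical; the actual keys returned by the function may differ.

```python
from typing import Any, Dict


def flatten_metrics(metrics: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a possibly nested metrics dictionary into dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested case: recurse and join the key path with dots.
            flat.update(flatten_metrics(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical example shapes, one per validation style.
no_validation: Dict[str, Any] = {}
split_validation = {"accuracy": 0.91}
cv_validation = {"accuracy": {"mean": 0.89, "std": 0.02}}

for name, value in flatten_metrics(cv_validation).items():
    print(f"{name}: {value}")
```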
Raises:
Type | Description |
---|---|
InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
NonMatchingParameterLengthsException | X and y have mismatching sizes. |
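The two error conditions can be sketched as simple pre-training checks. The exception classes below are local stand-ins that only share their names with the toolkit's exceptions, and `check_training_inputs` is a hypothetical helper. Apart from the documented `max_depth >= 1` rule, the numeric bounds shown are assumptions about what counts as an invalid value.

```python
from typing import Optional, Sequence


class InvalidParameterValueException(Exception):
    """Local stand-in for the toolkit exception of the same name."""


class NonMatchingParameterLengthsException(Exception):
    """Local stand-in for the toolkit exception of the same name."""


def check_training_inputs(
    X: Sequence,
    y: Sequence,
    split_size: float = 0.2,
    cv_folds: int = 5,
    n_estimators: int = 100,
    max_depth: Optional[int] = None,
) -> None:
    """Raise if the inputs look unusable; return None otherwise."""
    if len(X) != len(y):
        raise NonMatchingParameterLengthsException(
            f"X has {len(X)} rows but y has {len(y)} values."
        )
    # Assumed bounds: the source only says "invalid input values".
    if not 0.0 < split_size < 1.0:
        raise InvalidParameterValueException("split_size must be between 0 and 1.")
    if cv_folds < 2:
        raise InvalidParameterValueException("cv_folds must be at least 2.")
    if n_estimators < 1:
        raise InvalidParameterValueException("n_estimators must be at least 1.")
    # Documented rule: max_depth must be >= 1 or None.
    if max_depth is not None and max_depth < 1:
        raise InvalidParameterValueException("max_depth must be >= 1 or None.")


try:
    check_training_inputs(X=[[0.1], [0.4]], y=[0])
    message = ""
except NonMatchingParameterLengthsException as error:
    message = str(error)
```

Catching the two exception types separately lets calling code distinguish a data problem (mismatched X and y) from a configuration problem (a bad numeric parameter).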
Source code in eis_toolkit/prediction/random_forests.py
random_forest_regressor_train(X, y, validation_method='split', metrics=['mse'], split_size=0.2, cv_folds=5, n_estimators=100, criterion='squared_error', max_depth=None, verbose=0, random_state=None, **kwargs)
Train and optionally validate a Random Forest regressor model using Sklearn.
Several options for evaluating model performance are available: no validation, a single split into training and validation parts, or cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the validation method, the output metrics dictionary can be empty, one-dimensional or nested.
For more information about Sklearn Random Forest regressor, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Union[ndarray, DataFrame] | Training data. | required |
y | Union[ndarray, Series] | Target values. | required |
validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides the data into training and validation parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" skips validation entirely (in this case, all of X and y is used for training). | 'split' |
metrics | Sequence[Literal[mse, rmse, mae, r2]] | Metrics to use for scoring the model. Defaults to "mse". | ['mse'] |
split_size | float | Fraction of the dataset to use as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
n_estimators | int | The number of trees in the forest. Defaults to 100. | 100 |
criterion | Literal[squared_error, absolute_error, friedman_mse, poisson] | The function used to measure the quality of a split. "absolute_error" results in significantly longer training time than "squared_error". Defaults to "squared_error". | 'squared_error' |
max_depth | Optional[int] | The maximum depth of each tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None. | None |
verbose | int | Controls whether modeling progress and performance are printed. 0 prints nothing; values of 1 or above produce output. | 0 |
random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
**kwargs | | Additional parameters for Sklearn's RandomForestRegressor. | {} |
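The four regression metric names correspond to standard definitions, sketched here in pure Python. This is an illustration of the formulas rather than the toolkit's own computation, and the function name `regression_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined for constant targets; 0.0 is used here as a placeholder.
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
    }


scores = regression_metrics(
    y_true=[3.0, 5.0, 7.0, 9.0],
    y_pred=[2.0, 5.0, 8.0, 9.0],
)
```

MSE and RMSE penalize large errors more heavily than MAE, while R^2 is unitless, so the choice depends on whether scores need to be comparable across targets with different scales.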
Returns:
Type | Description |
---|---|
Tuple[RandomForestRegressor, dict] | The trained RandomForestRegressor and metric scores as a dictionary. |
Raises:
Type | Description |
---|---|
InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
NonMatchingParameterLengthsException | X and y have mismatching sizes. |
Source code in eis_toolkit/prediction/random_forests.py