
Random forests

random_forest_classifier_train(X, y, validation_method='split', metrics=['accuracy'], split_size=0.2, cv_folds=5, n_estimators=100, criterion='gini', max_depth=None, verbose=0, random_state=None, **kwargs)

Train and optionally validate a Random Forest classifier model using Sklearn.

Model performance can be evaluated in several ways: no validation, a single split into training and validation parts, or cross-validation. When validation is performed, the metrics to calculate and the validation process (cross-validation method, number of folds, size of the split) can be configured. Depending on the validation setup, the returned metrics dictionary can be empty, one-dimensional or nested.

For more information about Sklearn Random Forest classifier, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
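To make the validation options above concrete, the following is a minimal, standard-library sketch of how "split", "kfold_cv" and "loo_cv" partition sample indices. It is an illustration only and not the toolkit's implementation: the helper names (`holdout_split`, `kfold_splits`, `loo_splits`) are hypothetical, the rounding of the validation size is an assumption, and the stratified variant ("skfold_cv"), which additionally preserves class proportions in each fold, is not shown.

```python
import random


def holdout_split(n_samples, split_size=0.2, random_state=None):
    """Shuffle indices and reserve a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    # Assumption: the validation size is rounded to the nearest whole sample.
    n_val = max(1, round(n_samples * split_size))
    return indices[n_val:], indices[:n_val]  # (train, validation)


def kfold_splits(n_samples, cv_folds=5):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, cv_folds)
    start = 0
    for fold in range(cv_folds):
        stop = start + base + (1 if fold < extra else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


def loo_splits(n_samples):
    """Leave-one-out is k-fold with one fold per sample."""
    return kfold_splits(n_samples, cv_folds=n_samples)


train, val = holdout_split(10, split_size=0.2, random_state=0)
folds = list(kfold_splits(10, cv_folds=5))
loo = list(loo_splits(4))
print(len(train), len(val))            # sizes of the two parts of the split
print([len(v) for _, v in folds])      # validation fold sizes
print([v for _, v in loo])             # one validation sample per fold
```

With a single split the model is scored once, while the cross-validation variants score it once per fold, which is one reason the shape of the returned metrics dictionary can differ between methods.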

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | Union[ndarray, DataFrame] | Training data. | required |
| y | Union[ndarray, Series] | Target labels. | required |
| validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" will not validate model at all (in this case, all X and y will be used solely for training). | 'split' |
| metrics | Sequence[Literal[accuracy, precision, recall, f1]] | Metrics to use for scoring the model. Defaults to "accuracy". | ['accuracy'] |
| split_size | float | Fraction of the dataset to be used as validation data (rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
| cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
| n_estimators | int | The number of trees in the forest. Defaults to 100. | 100 |
| criterion | Literal[gini, entropy, log_loss] | The function to measure the quality of a split. Defaults to "gini". | 'gini' |
| max_depth | Optional[int] | The maximum depth of the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to None. | None |
| verbose | int | Specifies if modeling progress and performance should be printed. 0 doesn't print, values 1 or above will produce prints. | 0 |
| random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
| **kwargs | | Additional parameters for Sklearn's RandomForestClassifier. | {} |
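The `criterion` parameter selects the impurity measure used to score candidate splits. As a rough illustration of what "gini" and "entropy" measure, here is a small standard-library sketch; the function names are hypothetical and the code is not scikit-learn's implementation. In scikit-learn, "log_loss" and "entropy" both refer to the Shannon information gain, so they are not shown separately.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """1 - sum(p_k^2): zero for a pure node, larger for mixed nodes."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy_impurity(labels):
    """sum(-p_k * log2(p_k)): zero for a pure node, 1 bit for a 50/50 binary node."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


mixed_node = ["ore", "ore", "barren", "barren"]
pure_node = ["ore", "ore", "ore"]
print(gini_impurity(mixed_node), entropy_impurity(mixed_node))
print(gini_impurity(pure_node), entropy_impurity(pure_node))
```

Both measures are lowest for pure nodes, so a split that separates the classes well reduces the weighted impurity of the child nodes under either criterion.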

Returns:

| Type | Description |
| --- | --- |
| Tuple[RandomForestClassifier, dict] | The trained RandomForestClassifier and metric scores as a dictionary. |
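For orientation, the sketch below computes the four supported classification metrics for a binary problem in plain Python and collects them in a dictionary. It rests on several assumptions: the function name is hypothetical, the key names simply mirror the `metrics` literals, only the binary case is covered, and the exact layout of the dictionary returned by the toolkit (especially for cross-validation) is not specified here.

```python
def binary_classification_scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


scores = binary_classification_scores(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 1, 0],
)
print(scores)
```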

Raises:

| Type | Description |
| --- | --- |
| InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
| NonMatchingParameterLengthsException | X and y have mismatching sizes. |
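The error conditions above can be pictured with a small stand-alone sketch. The three numeric checks mirror the conditions in the source listing for this function; the exception classes and function names are hypothetical stand-ins, and the size check is an assumption about what "mismatching sizes" means, since that check is not part of the listing.

```python
class InvalidParameterValueError(ValueError):
    """Hypothetical stand-in for eis_toolkit's InvalidParameterValueException."""


class NonMatchingLengthsError(ValueError):
    """Hypothetical stand-in for eis_toolkit's NonMatchingParameterLengthsException."""


def check_training_inputs(X, y, n_estimators=100, max_depth=None, verbose=0):
    """Raise early on invalid arguments instead of failing deep inside model fitting."""
    if n_estimators < 1:
        raise InvalidParameterValueError("N-estimators must be at least 1.")
    if max_depth is not None and max_depth < 1:
        raise InvalidParameterValueError("Max depth must be at least 1 or None.")
    if verbose < 0:
        raise InvalidParameterValueError("Verbose must be a non-negative number.")
    # Assumption: "mismatching sizes" means a different number of samples (rows).
    if len(X) != len(y):
        raise NonMatchingLengthsError("X and y must contain the same number of samples.")


def describe_failure(**arguments):
    """Return the error message for the given arguments, or None if they are valid."""
    try:
        check_training_inputs(**arguments)
    except ValueError as error:
        return str(error)
    return None


print(describe_failure(X=[[0.1], [0.2]], y=[0, 1]))
print(describe_failure(X=[[0.1], [0.2]], y=[0, 1], n_estimators=0))
print(describe_failure(X=[[0.1], [0.2]], y=[0]))
```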

Source code in eis_toolkit/prediction/random_forests.py
@beartype
def random_forest_classifier_train(
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    validation_method: Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"] = "split",
    metrics: Sequence[Literal["accuracy", "precision", "recall", "f1"]] = ["accuracy"],
    split_size: float = 0.2,
    cv_folds: int = 5,
    n_estimators: int = 100,
    criterion: Literal["gini", "entropy", "log_loss"] = "gini",
    max_depth: Optional[int] = None,
    verbose: int = 0,
    random_state: Optional[int] = None,
    **kwargs,
) -> Tuple[RandomForestClassifier, dict]:
    """
    Train and optionally validate a Random Forest classifier model using Sklearn.

    Various options and configurations for model performance evaluation are available. No validation,
    split to train and validation parts, and cross-validation can be chosen. If validation is performed,
    metric(s) to calculate can be defined and validation process configured (cross-validation method,
    number of folds, size of the split). Depending on the details of the validation process,
    the output metrics dictionary can be empty, one-dimensional or nested.

    For more information about Sklearn Random Forest classifier, read the documentation here:
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

    Args:
        X: Training data.
        y: Target labels.
        validation_method: Validation method to use. "split" divides data into two parts, "kfold_cv"
            performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation,
            "loo_cv" performs leave-one-out cross-validation and "none" will not validate model at all
            (in this case, all X and y will be used solely for training).
        metrics: Metrics to use for scoring the model. Defaults to "accuracy".
        split_size: Fraction of the dataset to be used as validation data (rest is used for training).
            Used only when validation_method is "split". Defaults to 0.2.
        cv_folds: Number of folds used in cross-validation. Used only when validation_method is "kfold_cv"
            or "skfold_cv". Defaults to 5.
        n_estimators: The number of trees in the forest. Defaults to 100.
        criterion: The function to measure the quality of a split. Defaults to "gini".
        max_depth: The maximum depth of the tree. Values must be >= 1 or None, in which case nodes are
            expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
            Defaults to None.
        verbose: Specifies if modeling progress and performance should be printed. 0 doesn't print,
            values 1 or above will produce prints.
        random_state: Seed for random number generation. Defaults to None.
        **kwargs: Additional parameters for Sklearn's RandomForestClassifier.

    Returns:
        The trained RandomForestClassifier and metric scores as a dictionary.

    Raises:
        InvalidParameterValueException: If some of the numeric parameters are given invalid input values.
        NonMatchingParameterLengthsException: X and y have mismatching sizes.
    """
    if not n_estimators >= 1:
        raise InvalidParameterValueException("N-estimators must be at least 1.")
    if max_depth is not None and not max_depth >= 1:
        raise InvalidParameterValueException("Max depth must be at least 1 or None.")
    if verbose < 0:
        raise InvalidParameterValueException("Verbose must be a non-negative number.")

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        criterion=criterion,
        max_depth=max_depth,
        random_state=random_state,
        verbose=verbose,
        **kwargs,
    )

    model, metrics = _train_and_validate_sklearn_model(
        X=X,
        y=y,
        model=model,
        validation_method=validation_method,
        metrics=metrics,
        split_size=split_size,
        cv_folds=cv_folds,
        random_state=random_state,
    )

    return model, metrics

random_forest_regressor_train(X, y, validation_method='split', metrics=['mse'], split_size=0.2, cv_folds=5, n_estimators=100, criterion='squared_error', max_depth=None, verbose=0, random_state=None, **kwargs)

Train and optionally validate a Random Forest regressor model using Sklearn.

Model performance can be evaluated in several ways: no validation, a single split into training and validation parts, or cross-validation. When validation is performed, the metrics to calculate and the validation process (cross-validation method, number of folds, size of the split) can be configured. Depending on the validation setup, the returned metrics dictionary can be empty, one-dimensional or nested.

For more information about Sklearn Random Forest regressor, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | Union[ndarray, DataFrame] | Training data. | required |
| y | Union[ndarray, Series] | Target labels. | required |
| validation_method | Literal[split, kfold_cv, skfold_cv, loo_cv, none] | Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" will not validate model at all (in this case, all X and y will be used solely for training). | 'split' |
| metrics | Sequence[Literal[mse, rmse, mae, r2]] | Metrics to use for scoring the model. Defaults to "mse". | ['mse'] |
| split_size | float | Fraction of the dataset to be used as validation data (rest is used for training). Used only when validation_method is "split". Defaults to 0.2. | 0.2 |
| cv_folds | int | Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5. | 5 |
| n_estimators | int | The number of trees in the forest. Defaults to 100. | 100 |
| criterion | Literal[squared_error, absolute_error, friedman_mse, poisson] | The function to measure the quality of a split. "absolute_error" results in significantly longer training time than "squared_error". Defaults to "squared_error". | 'squared_error' |
| max_depth | Optional[int] | The maximum depth of the tree. Values must be >= 1 or None, in which case nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to None. | None |
| verbose | int | Specifies if modeling progress and performance should be printed. 0 doesn't print, values 1 or above will produce prints. | 0 |
| random_state | Optional[int] | Seed for random number generation. Defaults to None. | None |
| **kwargs | | Additional parameters for Sklearn's RandomForestRegressor. | {} |
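To give a feel for the difference between the "squared_error" and "absolute_error" criteria, the sketch below scores a single node under each one using only the standard library. The function names are hypothetical and this is not scikit-learn's implementation. It follows the usual definitions: squared error is measured around the node mean, absolute error around the node median. A likely contributor to the longer training time of "absolute_error" is that medians are more expensive to maintain than means while candidate splits are scanned, though that explanation is an assumption here.

```python
import statistics


def squared_error_impurity(values):
    """Mean squared deviation from the node mean (the L2-optimal prediction)."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def absolute_error_impurity(values):
    """Mean absolute deviation from the node median (the L1-optimal prediction)."""
    median = statistics.median(values)
    return sum(abs(v - median) for v in values) / len(values)


# A node with one outlying target value.
node = [1.0, 2.0, 3.0, 10.0]
print(squared_error_impurity(node))
print(absolute_error_impurity(node))
```

The outlier dominates the squared-error score far more than the absolute-error score, which is the usual argument for considering "absolute_error" when targets contain extreme values.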

Returns:

| Type | Description |
| --- | --- |
| Tuple[RandomForestRegressor, dict] | The trained RandomForestRegressor and metric scores as a dictionary. |
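As a reference for the regression metrics, the following plain-Python sketch computes "mse", "rmse", "mae" and "r2" for a set of predictions and returns them in a dictionary. The function name is hypothetical, the key names mirror the `metrics` literals as an assumption, and the handling of constant targets in R2 is a choice made for this sketch only.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined for constant targets; this sketch returns 0.0 in that case.
    r2 = 1.0 - squared_error_sum / total_ss if total_ss else 0.0
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


scores = regression_scores(
    y_true=[3.0, 5.0, 7.0, 9.0],
    y_pred=[2.0, 5.0, 8.0, 9.0],
)
print(scores)
```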

Raises:

| Type | Description |
| --- | --- |
| InvalidParameterValueException | If some of the numeric parameters are given invalid input values. |
| NonMatchingParameterLengthsException | X and y have mismatching sizes. |

Source code in eis_toolkit/prediction/random_forests.py
@beartype
def random_forest_regressor_train(
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    validation_method: Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"] = "split",
    metrics: Sequence[Literal["mse", "rmse", "mae", "r2"]] = ["mse"],
    split_size: float = 0.2,
    cv_folds: int = 5,
    n_estimators: int = 100,
    criterion: Literal["squared_error", "absolute_error", "friedman_mse", "poisson"] = "squared_error",
    max_depth: Optional[int] = None,
    verbose: int = 0,
    random_state: Optional[int] = None,
    **kwargs,
) -> Tuple[RandomForestRegressor, dict]:
    """
    Train and optionally validate a Random Forest regressor model using Sklearn.

    Various options and configurations for model performance evaluation are available. No validation,
    split to train and validation parts, and cross-validation can be chosen. If validation is performed,
    metric(s) to calculate can be defined and validation process configured (cross-validation method,
    number of folds, size of the split). Depending on the details of the validation process,
    the output metrics dictionary can be empty, one-dimensional or nested.

    For more information about Sklearn Random Forest regressor, read the documentation here:
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

    Args:
        X: Training data.
        y: Target labels.
        validation_method: Validation method to use. "split" divides data into two parts, "kfold_cv"
            performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation,
            "loo_cv" performs leave-one-out cross-validation and "none" will not validate model at all
            (in this case, all X and y will be used solely for training).
        metrics: Metrics to use for scoring the model. Defaults to "mse".
        split_size: Fraction of the dataset to be used as validation data (rest is used for training).
            Used only when validation_method is "split". Defaults to 0.2.
        cv_folds: Number of folds used in cross-validation. Used only when validation_method is "kfold_cv"
            or "skfold_cv". Defaults to 5.
        n_estimators: The number of trees in the forest. Defaults to 100.
        criterion: The function to measure the quality of a split. "absolute_error" results in significantly
            longer training time than "squared_error". Defaults to "squared_error".
        max_depth: The maximum depth of the tree. Values must be >= 1 or None, in which case nodes are
            expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
            Defaults to None.
        verbose: Specifies if modeling progress and performance should be printed. 0 doesn't print,
            values 1 or above will produce prints.
        random_state: Seed for random number generation. Defaults to None.
        **kwargs: Additional parameters for Sklearn's RandomForestRegressor.

    Returns:
        The trained RandomForestRegressor and metric scores as a dictionary.

    Raises:
        InvalidParameterValueException: If some of the numeric parameters are given invalid input values.
        NonMatchingParameterLengthsException: X and y have mismatching sizes.
    """
    if not n_estimators >= 1:
        raise InvalidParameterValueException("N-estimators must be at least 1.")
    if max_depth is not None and not max_depth >= 1:
        raise InvalidParameterValueException("Max depth must be at least 1 or None.")
    if verbose < 0:
        raise InvalidParameterValueException("Verbose must be a non-negative number.")

    model = RandomForestRegressor(
        n_estimators=n_estimators,
        criterion=criterion,
        max_depth=max_depth,
        random_state=random_state,
        verbose=verbose,
        **kwargs,
    )

    model, metrics = _train_and_validate_sklearn_model(
        X=X,
        y=y,
        model=model,
        validation_method=validation_method,
        metrics=metrics,
        split_size=split_size,
        cv_folds=cv_folds,
        random_state=random_state,
    )

    return model, metrics