
Class balancing

balance_SMOTETomek(X, y, sampling_strategy='auto', random_state=None)

Balances the classes of the input dataset using the SMOTETomek resampling method.
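The two ingredients of the method can be illustrated with a small standard-library sketch. It assumes the usual definition of SMOTETomek: SMOTE oversampling (synthetic points interpolated between same-class neighbours) followed by a cleaning step based on Tomek links (mutual nearest neighbours with different labels). The helpers `smote_point` and `tomek_links` are hypothetical names used for illustration only; they are not part of eis_toolkit, and the actual resampling is delegated to the SMOTETomek class called in the source below.

```python
import math
import random


def smote_point(sample, neighbor, rng):
    """Interpolate a synthetic point on the segment between two same-class samples."""
    lam = rng.random()  # Random position along the segment, in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]


def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""

    def nearest(i):
        # Brute-force nearest neighbour by Euclidean distance (illustration only)
        return min((j for j in range(len(X)) if j != i), key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]


# Toy data: samples 2 and 3 sit next to each other but carry different labels
X_toy = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.1, 0.0], [3.0, 0.0]]
y_toy = [0, 0, 0, 1, 1]

links = tomek_links(X_toy, y_toy)
synthetic = smote_point(X_toy[3], X_toy[4], random.Random(0))
```

In this toy data, samples 0 and 1 are also mutual nearest neighbours, but they share a label and therefore do not form a Tomek link. Samples that take part in a link lie on the class boundary, which is why the cleaning step targets them.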

Parameters:

Name Type Description Default
X Union[DataFrame, ndarray]

The feature matrix (input data as a DataFrame or NumPy array).

required
y Union[Series, ndarray]

The target labels corresponding to the feature matrix.

required
sampling_strategy Union[float, str, dict]

Controls how the resampling is performed. If a float, the desired ratio of minority-class samples to majority-class samples. If a str, the classes to be resampled: "minority", "not minority", "not majority", "all" or "auto". If a dict, the keys are the targeted classes and the values are the desired number of samples for each class. Defaults to "auto", which resamples all classes except the majority class.

'auto'
random_state Optional[int]

Controls the randomization of the algorithm. Pass an integer seed to make the resampling reproducible. Defaults to None, in which case no fixed seed is used and results can differ between runs.

None
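As a rough illustration of the sampling_strategy options described above, the following sketch computes the per-class sample counts that the oversampling step would aim for. The helper `target_counts` is hypothetical and written only for this example; it is not how eis_toolkit or the underlying library computes these values. It assumes that a float applies to two-class problems only. Because the Tomek-link cleaning step removes samples afterwards, the final counts returned by balance_SMOTETomek can be somewhat lower than these targets.

```python
from collections import Counter
from typing import Union


def target_counts(y, sampling_strategy: Union[float, str, dict] = "auto") -> dict:
    """Return the per-class counts that oversampling would aim for (illustration only)."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    targets = dict(counts)

    if isinstance(sampling_strategy, dict):
        # Keys are the targeted classes, values the desired number of samples
        targets.update(sampling_strategy)
    elif isinstance(sampling_strategy, float):
        # Ratio of minority to majority samples; assumed valid for two classes only
        if len(counts) != 2:
            raise ValueError("A float strategy assumes exactly two classes.")
        targets[minority] = int(sampling_strategy * counts[majority])
    elif sampling_strategy in ("auto", "not majority"):
        # Every class except the majority is raised to the majority count
        for label in counts:
            if label != majority:
                targets[label] = counts[majority]
    elif sampling_strategy == "minority":
        targets[minority] = counts[majority]
    elif sampling_strategy in ("not minority", "all"):
        # Every targeted class is raised to the majority count; the majority is unchanged
        for label in counts:
            if label != minority or sampling_strategy == "all":
                targets[label] = counts[majority]
    else:
        raise ValueError(f"Unknown sampling strategy: {sampling_strategy!r}")
    return targets


labels = [0] * 90 + [1] * 10
auto_targets = target_counts(labels)            # default "auto"
ratio_targets = target_counts(labels, 0.5)      # minority at half the majority size
dict_targets = target_counts(labels, {1: 30})   # explicit count for class 1
```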

Returns:

Type Description
tuple[Union[DataFrame, ndarray], Union[Series, ndarray]]

Resampled feature matrix and target labels.

Raises:

Type Description
NonMatchingParameterLengthsException

If X and y have different lengths.
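The length check compares `len(X)` with `len(y)`, which for a DataFrame or a two-dimensional array means the number of rows (samples), not the number of feature columns. The sketch below reproduces that check with plain lists so it runs without the toolkit installed. The exception class defined here is a local stand-in with the same name as the toolkit exception, and `check_lengths` is a hypothetical helper; the assumption is only that the real exception can be caught like any other Python exception.

```python
class NonMatchingParameterLengthsException(Exception):
    """Local stand-in for the toolkit exception of the same name (illustration only)."""


def check_lengths(X, y) -> None:
    """Raise if the number of samples in X differs from the number of labels in y."""
    if len(X) != len(y):
        raise NonMatchingParameterLengthsException(
            "Feature matrix X and target labels y must have the same length."
        )


# Three samples with two features each, but only two labels
features = [[0.1, 1.2], [0.4, 0.9], [0.8, 0.3]]
labels_short = [0, 1]

try:
    check_lengths(features, labels_short)
    error_message = None
except NonMatchingParameterLengthsException as error:
    error_message = str(error)
```

Two samples with three features each and two labels would pass the check, since only the first dimension is compared.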

Source code in eis_toolkit/training_data_tools/class_balancing.py
@beartype
def balance_SMOTETomek(
    X: Union[pd.DataFrame, np.ndarray],
    y: Union[pd.Series, np.ndarray],
    sampling_strategy: Union[float, str, dict] = "auto",
    random_state: Optional[int] = None,
) -> tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, np.ndarray]]:
    """Balances the classes of input dataset using SMOTETomek resampling method.

    Args:
        X: The feature matrix (input data as a DataFrame or NumPy array).
        y: The target labels corresponding to the feature matrix.
        sampling_strategy: Parameter controlling how to perform the resampling.
            If float, specifies the ratio of samples in minority class to samples of majority class,
            if str, specifies classes to be resampled ("minority", "not minority", "not majority", "all", "auto"),
            if dict, the keys should be targeted classes and values the desired number of samples for the class.
            Defaults to "auto", which will resample all classes except the majority class.
        random_state: Parameter controlling randomization of the algorithm. Can be given a seed (number).
            Defaults to None, which randomizes the seed.

    Returns:
        Resampled feature matrix and target labels.

    Raises:
        NonMatchingParameterLengthsException: If X and y have different lengths.
    """

    if len(X) != len(y):
        raise NonMatchingParameterLengthsException("Feature matrix X and target labels y must have the same length.")

    X_res, y_res = SMOTETomek(sampling_strategy=sampling_strategy, random_state=random_state).fit_resample(X, y)
    return X_res, y_res