Logistic regression

logistic_regression_train(X, y, validation_method='split', metrics=['accuracy'], split_size=0.2, cv_folds=5, penalty='l2', max_iter=100, solver='lbfgs', verbose=0, random_state=None, **kwargs)

Train and optionally validate a Logistic Regression classifier model using Sklearn.

Various options and configurations for model performance evaluation are available. The options are no validation, a split into train and validation parts, and cross-validation. If validation is performed, the metric(s) to calculate can be defined and the validation process configured (cross-validation method, number of folds, size of the split). Depending on the details of the validation process, the output metrics dictionary can be empty, one-dimensional or nested.
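
A minimal usage sketch of the different validation modes (assuming synthetic data from scikit-learn and the import path implied by the source location shown below; the exact keys and nesting of the returned metrics dictionary depend on the chosen validation method):

from sklearn.datasets import make_classification

from eis_toolkit.prediction.logistic_regression import logistic_regression_train

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 80/20 train/validation split, scored with accuracy and F1.
model, scores = logistic_regression_train(
    X, y, validation_method="split", metrics=["accuracy", "f1"], split_size=0.2, random_state=42
)

# 5-fold stratified cross-validation instead of a single split.
model_cv, scores_cv = logistic_regression_train(
    X, y, validation_method="skfold_cv", metrics=["accuracy"], cv_folds=5, random_state=42
)

# No validation: all X and y are used solely for training (the metrics dictionary may be empty).
model_all, scores_all = logistic_regression_train(X, y, validation_method="none")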

The choice of the algorithm depends on the penalty chosen. Supported penalties by solver:

'lbfgs' - ['l2', None]
'liblinear' - ['l1', 'l2']
'newton-cg' - ['l2', None]
'newton-cholesky' - ['l2', None]
'sag' - ['l2', None]
'saga' - ['elasticnet', 'l1', 'l2', None]
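
For example, the default 'lbfgs' solver does not accept an 'l1' penalty, so an L1-regularised model needs 'liblinear' or 'saga'. A hedged sketch (synthetic data again; the keyword C is scikit-learn's regularisation strength, forwarded through **kwargs):

from sklearn.datasets import make_classification

from eis_toolkit.prediction.logistic_regression import logistic_regression_train

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Pair the 'l1' penalty with a solver that supports it.
model_l1, _ = logistic_regression_train(X, y, penalty="l1", solver="liblinear")

# 'saga' supports every listed penalty; extra LogisticRegression keywords go through **kwargs.
model_saga, _ = logistic_regression_train(X, y, penalty="l1", solver="saga", C=0.5, max_iter=500)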

For more information about Sklearn Logistic Regression, read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

Parameters:

X (Union[ndarray, DataFrame]): Training data. Required.

y (Union[ndarray, Series]): Target labels. Required.

validation_method (Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"]): Validation method to use. "split" divides data into two parts, "kfold_cv" performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation, "loo_cv" performs leave-one-out cross-validation and "none" will not validate the model at all (in this case, all X and y will be used solely for training). Defaults to 'split'.

metrics (Sequence[Literal["accuracy", "precision", "recall", "f1", "auc"]]): Metrics to use for scoring the model. Defaults to ['accuracy'].

split_size (float): Fraction of the dataset to be used as validation data (the rest is used for training). Used only when validation_method is "split". Defaults to 0.2.

cv_folds (int): Number of folds used in cross-validation. Used only when validation_method is "kfold_cv" or "skfold_cv". Defaults to 5.

penalty (Literal["l1", "l2", "elasticnet", None]): Specifies the norm of the penalty. Defaults to 'l2'.

max_iter (int): Maximum number of iterations taken for the solvers to converge. Defaults to 100.

solver (Literal["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"]): Algorithm to use in the optimization problem. Defaults to 'lbfgs'.

verbose (int): Specifies if modeling progress and performance should be printed. 0 doesn't print; values 1 or above produce prints. Defaults to 0.

random_state (Optional[int]): Seed for random number generation. Defaults to None.

**kwargs: Additional parameters for Sklearn's LogisticRegression.

Returns:

Tuple[LogisticRegression, dict]: The trained Logistic Regression classifier and metric scores as a dictionary.

Raises:

InvalidParameterValueException: If some of the numeric parameters are given invalid input values.

NonMatchingParameterLengthsException: If X and y have mismatching sizes.

Source code in eis_toolkit/prediction/logistic_regression.py
@beartype
def logistic_regression_train(
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    validation_method: Literal["split", "kfold_cv", "skfold_cv", "loo_cv", "none"] = "split",
    metrics: Sequence[Literal["accuracy", "precision", "recall", "f1", "auc"]] = ["accuracy"],
    split_size: float = 0.2,
    cv_folds: int = 5,
    penalty: Literal["l1", "l2", "elasticnet", None] = "l2",
    max_iter: int = 100,
    solver: Literal["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"] = "lbfgs",
    verbose: int = 0,
    random_state: Optional[int] = None,
    **kwargs
) -> Tuple[LogisticRegression, dict]:
    """
    Train and optionally validate a Logistic Regression classifier model using Sklearn.

    Various options and configurations for model performance evaluation are available. The options are
    no validation, a split into train and validation parts, and cross-validation. If validation is performed,
    the metric(s) to calculate can be defined and the validation process configured (cross-validation method,
    number of folds, size of the split). Depending on the details of the validation process,
    the output metrics dictionary can be empty, one-dimensional or nested.

    The choice of the algorithm depends on the penalty chosen. Supported penalties by solver:
    'lbfgs' - ['l2', None]
    'liblinear' - ['l1', 'l2']
    'newton-cg' - ['l2', None]
    'newton-cholesky' - ['l2', None]
    'sag' - ['l2', None]
    'saga' - ['elasticnet', 'l1', 'l2', None]

    For more information about Sklearn Logistic Regression, read the documentation here:
    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

    Args:
        X: Training data.
        y: Target labels.
        validation_method: Validation method to use. "split" divides data into two parts, "kfold_cv"
            performs k-fold cross-validation, "skfold_cv" performs stratified k-fold cross-validation,
            "loo_cv" performs leave-one-out cross-validation and "none" will not validate model at all
            (in this case, all X and y will be used solely for training).
        metrics: Metrics to use for scoring the model. Defaults to "accuracy".
        split_size: Fraction of the dataset to be used as validation data (rest is used for training).
            Used only when validation_method is "split". Defaults to 0.2.
        cv_folds: Number of folds used in cross-validation. Used only when validation_method is "kfold_cv"
            or "skfold_cv". Defaults to 5.
        penalty: Specifies the norm of the penalty. Defaults to 'l2'.
        max_iter: Maximum number of iterations taken for the solvers to converge. Defaults to 100.
        solver: Algorithm to use in the optimization problem. Defaults to 'lbfgs'.
        verbose: Specifies if modeling progress and performance should be printed. 0 doesn't print,
            values 1 or above will produce prints.
        random_state: Seed for random number generation. Defaults to None.
        **kwargs: Additional parameters for Sklearn's LogisticRegression.

    Returns:
        The trained Logistic Regression classifier and metric scores as a dictionary.

    Raises:
        InvalidParameterValueException: If some of the numeric parameters are given invalid input values.
        NonMatchingParameterLengthsException: X and y have mismatching sizes.
    """
    if max_iter < 1:
        raise InvalidParameterValueException("Max iter must be > 0.")
    if verbose < 0:
        raise InvalidParameterValueException("Verbose must be a non-negative number.")

    model = LogisticRegression(
        penalty=penalty, max_iter=max_iter, random_state=random_state, solver=solver, verbose=verbose, **kwargs
    )

    model, metrics = _train_and_validate_sklearn_model(
        X=X,
        y=y,
        model=model,
        validation_method=validation_method,
        metrics=metrics,
        split_size=split_size,
        cv_folds=cv_folds,
        random_state=random_state,
    )

    return model, metrics
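
Because the first return value is a fitted scikit-learn LogisticRegression, the usual estimator API applies after training. A short sketch (X_new is a hypothetical stand-in for unseen data with the same features as the training data):

import numpy as np
from sklearn.datasets import make_classification

from eis_toolkit.prediction.logistic_regression import logistic_regression_train

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model, scores = logistic_regression_train(X, y, validation_method="split", random_state=42)

# Hypothetical unseen data; in practice this would come from the actual feature data.
X_new = np.random.default_rng(0).normal(size=(10, 5))

labels = model.predict(X_new)                # predicted class labels
probabilities = model.predict_proba(X_new)   # class membership probabilities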