Chi-square test

chi_square_test(data, target_column, columns=None)

Perform a Chi-square test of independence between a target variable and one or more other variables.

Input data should be categorical. Continuous or other non-categorical data should be discretized or binned before using this function, because Chi-square tests cannot be applied directly to continuous variables.

The test assumes that the observed frequencies in each category are independent.
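
As a rough illustration of the binning step mentioned above, the sketch below discretizes a continuous column with `pandas.cut` and then builds the contingency table that a Chi-square test works on. The column names, values, bin count and labels are all made up for this example, and the sample is far too small for a reliable test; it only shows the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical example data: one continuous measurement and one categorical target.
df = pd.DataFrame(
    {
        "grade": [0.2, 0.4, 0.5, 1.1, 1.3, 1.8, 2.2, 2.5, 2.9, 3.4, 3.6, 3.9],
        "deposit": ["no", "no", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"],
    }
)

# Discretize the continuous column into three equal-width bins (the bin count is an arbitrary choice).
df["grade_class"] = pd.cut(df["grade"], bins=3, labels=["low", "medium", "high"])

# Each row contributes to exactly one cell of the contingency table.
table = pd.crosstab(df["deposit"], df["grade_class"])

# The binned column can now be tested against the target.
chi_square, p_value, degrees_of_freedom, expected = chi2_contingency(table)
```

After binning, `grade_class` could be passed in `columns` in place of the original continuous `grade` column.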

Parameters:

Name | Type | Description | Default
data | DataFrame | Dataframe containing the input data. | required
target_column | str | Variable against which independence of other variables is tested. | required
columns | Optional[Sequence[str]] | Variables that are tested against the variable in target_column. If None (or empty), every column except target_column is used. | None

Returns:

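Type | Description
Dict[str, Dict[str, float]] | Chi-square statistic, p-value and degrees of freedom for each tested variable, keyed by column name. Each inner dictionary has the keys "chi_square", "p-value" and "degrees_of_freedom".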
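
To show how a result of this shape might be consumed, here is a minimal sketch. It does not import eis_toolkit: `chi_square_sketch` is a hypothetical stand-in that reproduces the documented dictionary layout with pandas and SciPy, and the data and the significance level `alpha` are assumptions chosen for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data, made up for this example.
df = pd.DataFrame(
    {
        "deposit": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"] * 5,
        "rock_type": ["granite", "granite", "schist", "schist", "granite", "schist", "schist", "granite"] * 5,
        "fault_zone": ["in", "in", "in", "in", "out", "out", "out", "out"] * 5,
    }
)


def chi_square_sketch(data, target_column, columns):
    """Hypothetical stand-in that returns a result shaped like the documented return value."""
    results = {}
    for column in columns:
        table = pd.crosstab(data[target_column], data[column])
        chi_square, p_value, dof, _ = chi2_contingency(table)
        results[column] = {"chi_square": chi_square, "p-value": p_value, "degrees_of_freedom": dof}
    return results


results = chi_square_sketch(df, "deposit", ["rock_type", "fault_zone"])

# Assumed significance level; pick one that suits the analysis.
alpha = 0.05

# Variables for which independence from the target is rejected at the chosen level.
dependent = [name for name, stats in results.items() if stats["p-value"] < alpha]
```

Note that the p-value key is spelled with a hyphen (`"p-value"`), unlike the other two keys, which use underscores.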

Raises:

Type | Description
EmptyDataFrameException | Input Dataframe is empty.
InvalidParameterValueException | The target column or one of the given columns is not found in the Dataframe.

Source code in eis_toolkit/exploratory_analyses/chi_square_test.py
@beartype
def chi_square_test(
    data: pd.DataFrame, target_column: str, columns: Optional[Sequence[str]] = None
) -> Dict[str, Dict[str, float]]:
    """Perform a Chi-square test of independence between a target variable and one or more other variables.

    Input data should be categorical. Continuous or other non-categorical data should be discretized or
    binned before using this function, because Chi-square tests cannot be applied directly to continuous variables.

    The test assumes that the observed frequencies in each category are independent.

    Args:
        data: Dataframe containing the input data.
        target_column: Variable against which independence of other variables is tested.
        columns: Variables that are tested against the variable in target_column. If None (or empty), every
            column except target_column is used.

    Returns:
        Chi-square statistic, p-value and degrees of freedom for each tested variable, keyed by column name.

    Raises:
        EmptyDataFrameException: Input Dataframe is empty.
        InvalidParameterValueException: The target column or one of the given columns is not found in the Dataframe.
    """
    if check_empty_dataframe(data):
        raise EmptyDataFrameException("The input Dataframe is empty.")

    if not check_columns_valid(data, [target_column]):
        raise InvalidParameterValueException("Target column not found in the Dataframe.")

    if columns:
        invalid_columns = [column for column in columns if column not in data.columns]
        if invalid_columns:
            raise InvalidParameterValueException(f"Invalid columns: {invalid_columns}")
    else:
        columns = [col for col in data.columns if col != target_column]

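    statistics = {}
    for column in columns:
        # Cross-tabulate observed counts of the target variable against the tested variable.
        contingency_table = pd.crosstab(data[target_column], data[column])
        # The fourth return value (the expected frequencies) is not used.
        chi_square, p_value, degrees_of_freedom, _ = chi2_contingency(contingency_table)
        statistics[column] = {"chi_square": chi_square, "p-value": p_value, "degrees_of_freedom": degrees_of_freedom}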

    return statistics
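
One detail worth knowing when interpreting the statistic: the source above calls `chi2_contingency` with its default arguments, and SciPy's default is `correction=True`, which applies Yates' continuity correction whenever the table has one degree of freedom (a 2x2 table). The sketch below compares the two settings on a hypothetical 2x2 table; the counts are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[15, 5], [5, 15]])

# Default call, matching the source above: Yates' correction is applied because dof == 1.
corrected, p_corrected, dof, expected = chi2_contingency(observed)

# Same table without the continuity correction, for comparison.
uncorrected, p_uncorrected, _, _ = chi2_contingency(observed, correction=False)

print(f"with correction:    chi2={corrected:.2f}, p={p_corrected:.4f}")
print(f"without correction: chi2={uncorrected:.2f}, p={p_uncorrected:.4f}")
```

For binary variables the reported statistic is therefore somewhat smaller, and the p-value somewhat larger, than the uncorrected Pearson value. Tables with more than one degree of freedom are not affected.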