PCA

compute_pca(data, number_of_components, columns=None, scaler_type='standard', nodata_handling='remove', nodata=None)

Compute the defined number of principal components for numeric input data.

Before computation, the data is scaled with the specified scaler, and NaN values are removed or replaced. Optionally, a nodata value can be given to be handled in the same way as NaN values.

If the input data is a Numpy array, its interpretation depends on its dimensions. A 3D array is interpreted as a multiband raster / stacked rasters in (bands, rows, columns) format. A 2D array is interpreted as table-like data, where each column represents a variable/raster band and each row a data point (similar to a DataFrame).
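
As a minimal sketch (the array shapes below are purely illustrative), the two accepted Numpy layouts look like this:

import numpy as np

# 3D input: multiband raster / stacked rasters, shape (bands, rows, columns)
raster_stack = np.random.rand(4, 100, 120)  # 4 bands of a 100 x 120 raster

# 2D input: table-like data, shape (rows, columns); each column is one variable
table = np.random.rand(500, 4)  # 500 data points with 4 variables each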

Parameters:

    data: Union[ndarray, DataFrame, GeoDataFrame], required
        Input data for PCA.

    number_of_components: int, required
        The number of principal components to compute. Should be >= 1 and at most the number of
        numeric columns if the input is a (Geo)DataFrame.

    columns: Optional[Sequence[str]], defaults to None
        Columns to use for the PCA. Other columns are excluded from the PCA but added back to the
        result DataFrame intact. Only relevant if the input is a (Geo)DataFrame.

    scaler_type: Literal["standard", "min_max", "robust"], defaults to "standard"
        Transform data with the specified Sklearn scaler. Options are "standard", "min_max" and "robust".

    nodata_handling: Literal["remove", "replace"], defaults to "remove"
        Whether observations with nodata (NaN and the given nodata value) are removed for the duration
        of the PCA computation or replaced with the column/band mean.

    nodata: Optional[Number], defaults to None
        A nodata value to handle in the same way as NaN values.

Returns:

    Union[ndarray, DataFrame, GeoDataFrame]
        The computed principal components, in the same format as the input data.

    ndarray
        The explained variance ratios for each component.

Raises:

    EmptyDataException
        The input is empty.

    InvalidColumnException
        Selected columns are not found in the input DataFrame.

    InvalidNumberOfPrincipalComponents
        The number of principal components is less than 1 or greater than the number of columns if
        the input is a (Geo)DataFrame.

    InvalidParameterValueException
        The value for number_of_components is invalid.
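
A hypothetical usage sketch for DataFrame input (the data values are made up; the import path follows the source file listed below):

import numpy as np
import pandas as pd
from eis_toolkit.exploratory_analyses.pca import compute_pca

# Made-up data: three numeric variables, one observation with a NaN
df = pd.DataFrame({
    "magnetics": [10.2, 9.8, 11.5, 10.9],
    "gravity": [0.4, 0.6, 0.5, 0.7],
    "em_ratio": [2.5, 3.1, np.nan, 2.9],
})

# With the default nodata_handling="remove", the NaN row is dropped for the
# fit and returned as a NaN row, so the output keeps the input's length
pca_df, explained_variances = compute_pca(df, number_of_components=2)
print(list(pca_df.columns))  # ['principal_component_1', 'principal_component_2']
print(explained_variances)   # explained variance ratio per component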

Source code in eis_toolkit/exploratory_analyses/pca.py
@beartype
def compute_pca(
    data: Union[np.ndarray, pd.DataFrame, gpd.GeoDataFrame],
    number_of_components: int,
    columns: Optional[Sequence[str]] = None,
    scaler_type: Literal["standard", "min_max", "robust"] = "standard",
    nodata_handling: Literal["remove", "replace"] = "remove",
    nodata: Optional[Number] = None,
) -> Tuple[Union[np.ndarray, pd.DataFrame, gpd.GeoDataFrame], np.ndarray]:
    """
    Compute the defined number of principal components for numeric input data.

    Before computation, the data is scaled with the specified scaler, and NaN values are removed
    or replaced. Optionally, a nodata value can be given to be handled in the same way as NaN values.

    If the input data is a Numpy array, its interpretation depends on its dimensions.
    A 3D array is interpreted as a multiband raster / stacked rasters in (bands, rows, columns) format.
    A 2D array is interpreted as table-like data, where each column represents a variable/raster band
    and each row a data point (similar to a DataFrame).

    Args:
        data: Input data for PCA.
        number_of_components: The number of principal components to compute. Should be >= 1 and at most
            the number of numeric columns if the input is a (Geo)DataFrame.
        columns: Columns to use for the PCA. Other columns are excluded from the PCA but added back
            to the result DataFrame intact. Only relevant if the input is a (Geo)DataFrame. Defaults to None.
        scaler_type: Transform data according to a specified Sklearn scaler.
            Options are "standard", "min_max" and "robust". Defaults to "standard".
        nodata_handling: Whether observations with nodata (NaN and the given `nodata`) are removed for the
            duration of the PCA computation or replaced with the column/band mean. Defaults to "remove".
        nodata: A nodata value to handle in the same way as NaN values. Defaults to None.

    Returns:
        The computed principal components in corresponding format as the input data and the
        explained variance ratios for each component.

    Raises:
        EmptyDataException: The input is empty.
        InvalidColumnException: Selected columns are not found in the input DataFrame.
        InvalidNumberOfPrincipalComponents: The number of principal components is less than 1 or greater
            than the number of columns if the input is a (Geo)DataFrame.
        InvalidParameterValueException: If value for `number_of_components` is invalid.
    """
    if scaler_type not in SCALERS:
        raise InvalidParameterValueException(f"Invalid scaler. Choose from: {list(SCALERS.keys())}")

    if number_of_components < 1:
        raise InvalidParameterValueException("The number of principal components should be >= 1.")

    # Get feature matrix (Numpy array) from various input types
    if isinstance(data, np.ndarray):
        feature_matrix = data
        feature_matrix = feature_matrix.astype(float)
        if feature_matrix.ndim == 2:  # Table-like data (assume it is a DataFrame transformed to a Numpy array)
            feature_matrix, nan_mask = _prepare_array_data(
                feature_matrix, nodata_handling=nodata_handling, nodata_value=nodata, reshape=False
            )
        elif feature_matrix.ndim == 3:  # Assume data represents multiband raster data
            rows, cols = feature_matrix.shape[1], feature_matrix.shape[2]
            feature_matrix, nan_mask = _prepare_array_data(
                feature_matrix, nodata_handling=nodata_handling, nodata_value=nodata, reshape=True
            )
        else:
            raise InvalidParameterValueException(
                f"Unsupported input data format. {feature_matrix.ndim} dimensions detected for given array."
            )

    elif isinstance(data, pd.DataFrame):
        df = data.copy()
        if df.empty:
            raise EmptyDataException("Input DataFrame is empty.")
        if isinstance(data, gpd.GeoDataFrame):
            geometries = data.geometry
            crs = data.crs
            df = df.drop(columns=["geometry"])
        if columns is not None and columns != []:
            if not check_columns_valid(df, columns):
                raise InvalidColumnException("All selected columns were not found in the input DataFrame.")
            df = df[columns]

        df = df.convert_dtypes()
        df = df.apply(pd.to_numeric, errors="ignore")
        df = df.select_dtypes(include=np.number)
        df = df.astype(dtype=np.number)
        feature_matrix = df.to_numpy()
        feature_matrix = feature_matrix.astype(float)
        feature_matrix, nan_mask = _handle_missing_values(feature_matrix, nodata_handling, nodata)

    if number_of_components > feature_matrix.shape[1]:
        raise InvalidParameterValueException("The number of principal components is too high for the given input data.")
    # Core PCA computation
    principal_components, explained_variances = _compute_pca(feature_matrix, number_of_components, scaler_type)

    if nodata_handling == "remove" and nan_mask is not None:
        principal_components_with_nans = np.full((nan_mask.size, principal_components.shape[1]), np.nan)
        principal_components_with_nans[~nan_mask, :] = principal_components
        principal_components = principal_components_with_nans

    # Convert PCA output to proper format
    if isinstance(data, np.ndarray):
        if data.ndim == 3:
            result_data = principal_components.reshape(rows, cols, -1).transpose(2, 0, 1)
        else:
            result_data = principal_components

    elif isinstance(data, pd.DataFrame):
        component_names = [f"principal_component_{i+1}" for i in range(number_of_components)]
        result_data = pd.DataFrame(data=principal_components, columns=component_names)
        if columns is not None:
            old_columns = [column for column in data.columns if column not in columns]
            for column in old_columns:
                result_data[column] = data[column]
        if isinstance(data, gpd.GeoDataFrame):
            result_data = gpd.GeoDataFrame(result_data, geometry=geometries, crs=crs)

    return result_data, explained_variances
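
The "remove" branch above keeps the output the same length as the input by reinserting NaN rows through the stored boolean mask. A standalone sketch of that pattern (all names here are illustrative):

import numpy as np

components = np.array([[0.1, 0.2], [0.3, 0.4]])  # PCA output for the valid rows only
nan_mask = np.array([False, True, False])        # True where an input row was dropped

restored = np.full((nan_mask.size, components.shape[1]), np.nan)
restored[~nan_mask, :] = components              # valid rows go back to their original positions
# restored -> [[0.1, 0.2], [nan, nan], [0.3, 0.4]]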

plot_pca(pca_df, explained_variances=None, color_column_name=None, save_path=None)

Plot a scatter matrix of different principal component combinations.

Columns whose names do not start with "principal_component" are automatically filtered out before plotting. This tool is designed to work smoothly on compute_pca outputs.

Parameters:

    pca_df: DataFrame, required
        A DataFrame containing computed principal components.

    explained_variances: Optional[ndarray], defaults to None
        The explained variance ratios for each principal component. Used for labeling axes in the plot.

    color_column_name: Optional[str], defaults to None
        Name of the column used for color-coding data points, typically a categorical variable in the
        original data. If not provided, data points are not colored.

    save_path: Optional[str], defaults to None
        The save path for the plot. If not provided, the plot is not saved.

Returns:

    PairGrid
        A Seaborn PairGrid containing the PCA scatter matrix.

Raises:

    InvalidColumnException
        DataFrame does not contain the given color column.

Source code in eis_toolkit/exploratory_analyses/pca.py
@beartype
def plot_pca(
    pca_df: pd.DataFrame,
    explained_variances: Optional[np.ndarray] = None,
    color_column_name: Optional[str] = None,
    save_path: Optional[str] = None,
) -> sns.PairGrid:
    """
    Plot a scatter matrix of different principal component combinations.

    Columns whose names do not start with "principal_component" are automatically filtered out
    before plotting. This tool is designed to work smoothly on `compute_pca` outputs.

    Args:
        pca_df: A DataFrame containing computed principal components.
        explained_variances: The explained variance ratios for each principal component. Used for labeling
            axes in the plot. Optional parameter. Defaults to None.
        color_column_name: Name of the column that will be used for color-coding data points. Typically a
            categorical variable in the original data. Optional parameter, no colors if not provided.
            Defaults to None.
        save_path: The save path for the plot. Optional parameter, no saving if not provided. Defaults to None.

    Returns:
        A Seaborn pairgrid containing the PCA scatter matrix.

    Raises:
        InvalidColumnException: DataFrame does not contain the given color column.
    """

    if color_column_name and color_column_name not in pca_df.columns:
        raise InvalidColumnException("DataFrame does not contain the given color column.")

    filtered_df = pca_df.filter(regex="^principal_component")
    if color_column_name is not None:  # Include the color column only when one was given
        filtered_df = pd.concat([filtered_df, pca_df[[color_column_name]]], axis=1)

    pair_grid = sns.pairplot(filtered_df, hue=color_column_name)

    # Add explained variances to axis labels if provided
    if explained_variances is not None:
        labels = [f"PC {i+1} ({var:.1f}%)" for i, var in enumerate(explained_variances * 100)]
    else:
        labels = [f"PC {i+1}" for i in range(len(pair_grid.axes))]

    # Iterate over axes objects and set the labels
    for i, ax_row in enumerate(pair_grid.axes):
        for j, ax in enumerate(ax_row):
            if j == 0:  # Only the first column
                ax.set_ylabel(labels[i], fontsize="large")
            if i == len(ax_row) - 1:  # Only the last row
                ax.set_xlabel(labels[j], fontsize="large")

    if save_path is not None:
        plt.savefig(save_path)

    return pair_grid
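
An end-to-end sketch chaining the two functions. The Iris data, the "species" label and the file name are illustrative; the point is that a column excluded from the PCA via columns is carried over intact and can then serve as the color column:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from eis_toolkit.exploratory_analyses.pca import compute_pca, plot_pca

# Illustrative data: four numeric measurements plus a categorical label
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df.pop("target").map(dict(enumerate(iris.target_names)))

# "species" is excluded from the PCA and added back to the result intact
pca_df, explained_variances = compute_pca(
    df, number_of_components=3, columns=list(iris.feature_names)
)

grid = plot_pca(pca_df, explained_variances, color_column_name="species", save_path="pca_scatter_matrix.png")
plt.show()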