One-hot encoding
one_hot_encode(data, columns=None, drop_original_columns=True, drop_category=None, sparse_output=True, out_dtype=int, handle_unknown='infrequent_if_exist', min_frequency=None, max_categories=None)
Perform one-hot (or one-of-K or dummy) encoding on categorical data in a DataFrame or NumPy array.
This function converts categorical variables into a numeric form that machine learning algorithms can consume. For each unique category in a feature, a new binary column is created.
Do not pass continuous data to this function: every distinct value would become its own binary column, producing an excessive number of features. If the input is a DataFrame, continuous data can be excluded from encoding by specifying the columns to encode.
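As an illustration of the idea only, the following is a minimal pure-Python sketch of one-hot encoding. The helper name `one_hot` and the alphabetical column order are assumptions made for this example; it is not the toolkit's implementation.

```python
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 at its category's position."""
    # One output column per unique category, in a fixed (sorted) order.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["granite", "basalt", "granite", "schist"])
print(categories)  # ['basalt', 'granite', 'schist']
print(rows)        # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each input value maps to exactly one row with exactly one non-zero entry, which is why the number of output columns grows with the number of distinct values.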
The function allows control over aspects like handling unknown categories, controlling sparsity of the output, and setting data type of the encoded columns.
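A possible call might look like the sketch below. The import path is inferred from the source file location given on this page, the column names and values are made up for illustration, and the imports are guarded so the snippet still runs where pandas or the toolkit is not installed.

```python
def encode_rock_types():
    """Encode one categorical column; return None if optional dependencies are missing."""
    try:
        import pandas as pd
        # Import path assumed from eis_toolkit/transformations/one_hot_encoding.py.
        from eis_toolkit.transformations.one_hot_encoding import one_hot_encode
    except ImportError:
        return None

    # Hypothetical data: one categorical column and one continuous column.
    df = pd.DataFrame(
        {
            "rock_type": ["granite", "basalt", "granite"],
            "depth_m": [12.5, 40.0, 7.3],
        }
    )
    # Restrict encoding to the categorical column so depth_m is left untouched.
    return one_hot_encode(df, columns=["rock_type"], sparse_output=False)


encoded = encode_rock_types()
```

Passing `columns` explicitly keeps the continuous `depth_m` column out of the encoding, in line with the advice above.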
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Union[DataFrame, ndarray] | Input data as a DataFrame or NumPy array. If a DataFrame is provided, the operation can be restricted to specified columns. | required |
columns | Optional[Sequence[str]] | Specifies the columns to encode if 'data' is a DataFrame. If None, all columns are considered for encoding. Ignored if 'data' is a NumPy array. Defaults to None. | None |
drop_original_columns | bool | If True and 'data' is a DataFrame, the original columns being encoded will be dropped from the output. Defaults to True. | True |
drop_category | Optional[Literal[first, if_binary]] | Specifies a method to drop one of the categories to avoid multicollinearity. 'first' drops the first category, 'if_binary' drops one category only if the feature is binary. If None, no category is dropped. Defaults to None. | None |
sparse_output | bool | Determines whether the output matrix is sparse or dense. Defaults to True (sparse). | True |
out_dtype | Union[type, dtype] | Numeric data type of the output. Defaults to int. | int |
handle_unknown | Literal[error, ignore, infrequent_if_exist] | Specifies how to handle unknown categories encountered during transform. 'error' raises an error, 'ignore' ignores unknown categories, and 'infrequent_if_exist' treats them as infrequent. Defaults to 'infrequent_if_exist'. | 'infrequent_if_exist' |
min_frequency | Optional[Number] | The minimum frequency (as a float or an int) needed to include a category in encoding. Optional parameter. Defaults to None. | None |
max_categories | Optional[int] | The maximum number of categories to include in encoding. Optional parameter. Defaults to None. | None |
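To make the `drop_category` options concrete, here is a small pure-Python sketch of which category columns would remain under each setting. The helper `kept_categories` is hypothetical, and dropping the first category in the 'if_binary' case is an assumption; the description above only says that one category is dropped.

```python
def kept_categories(categories, drop_category=None):
    """Return the categories that would still get a column after dropping."""
    if drop_category == "first":
        # Always drop the first category.
        return list(categories[1:])
    if drop_category == "if_binary" and len(categories) == 2:
        # Assumption: for a binary feature the first category is the one dropped.
        return list(categories[1:])
    # None, or 'if_binary' on a non-binary feature: keep everything.
    return list(categories)


print(kept_categories(["basalt", "granite", "schist"], "first"))      # ['granite', 'schist']
print(kept_categories(["basalt", "granite", "schist"], "if_binary"))  # ['basalt', 'granite', 'schist']
print(kept_categories(["no", "yes"], "if_binary"))                    # ['yes']
```

Dropping one column removes the exact linear dependence between the encoded columns (they otherwise sum to 1 in every row), which is the multicollinearity the parameter description refers to.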
Returns:
Type | Description |
---|---|
Union[DataFrame, ndarray, csr_matrix] | Encoded data as a DataFrame if the input was a DataFrame, or as a NumPy array (dense or sparse) if the input was a NumPy array. |
Raises:
Type | Description |
---|---|
EmptyDataFrameException | If the input DataFrame is empty. |
InvalidDatasetException | If the input NumPy array is empty. |
InvalidColumnException | If any specified column to encode does not exist in the input DataFrame. |
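A caller could handle the documented exceptions roughly as sketched below. The module `eis_toolkit.exceptions` is an assumption about where the exception classes live, the column names are invented, and the imports are guarded so the snippet runs even without pandas or the toolkit installed.

```python
def try_encode_missing_column():
    """Request a column that does not exist and report what happened."""
    try:
        import pandas as pd
        # Both import paths are assumptions; adjust them to the installed package layout.
        from eis_toolkit.exceptions import InvalidColumnException
        from eis_toolkit.transformations.one_hot_encoding import one_hot_encode
    except ImportError:
        return "dependencies unavailable"

    df = pd.DataFrame({"rock_type": ["granite", "basalt"]})
    try:
        # 'lithology' is not a column of df, so the call is expected to fail.
        one_hot_encode(df, columns=["lithology"])
    except InvalidColumnException:
        return "invalid column"
    return "encoded"


outcome = try_encode_missing_column()
```

Checking `columns` against `df.columns` before calling the function is an alternative when raising is not desirable.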
Source code in eis_toolkit/transformations/one_hot_encoding.py