Winsorize

winsorize(raster, percentiles, bands=None, inside=False, nodata=None)

Winsorize data based on specified percentile values.

Accepts a single nodata value, which is excluded from the calculations. Values in the intervals [minimum, lower percentile] and [upper percentile, maximum] are replaced for whichever percentiles are provided. The function works one-sided or two-sided, but raises an error if no percentile value is provided.

Percentiles are symmetrical: percentile_lower = 10 corresponds to the interval [min, 10%], and percentile_upper = 10 corresponds to the interval [90%, max]. In this convention, percentile_lower = 0 refers to the data minimum and percentile_upper = 0 to the data maximum.

The percentile calculation is ambiguous when a percentile falls between two data values. Users can choose whether the replacement value is taken from inside or outside of the respective interval. Example: given np.array([5, 10, 12, 15, 20, 24, 27, 30, 35]) and percentiles (10, 10), the calculated percentiles are (5, 35) for inside and (10, 30) for outside. This results in [5 10 12 15 20 24 27 30 35] and [10 10 12 15 20 24 27 30 30], respectively.
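The inside/outside choice in the example above can be reproduced with plain NumPy. The following is a minimal sketch, not the toolkit's private `_winsorize` helper (which is not shown on this page): it assumes the choice maps onto the `method="lower"` and `method="higher"` options of `np.percentile` (available from NumPy 1.22), and the function name `winsorize_1d` is hypothetical.

```python
import numpy as np


def winsorize_1d(values, lower, upper, inside=False):
    """Clip a 1-D array at the given lower/upper percentiles (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    finite = values[np.isfinite(values)]  # NaN stands in for nodata and is ignored

    # Assumption: "inside" picks the neighbouring data value within the tail interval,
    # "outside" picks the neighbour on the other side of the percentile position.
    lower_method, upper_method = ("lower", "higher") if inside else ("higher", "lower")

    low = np.percentile(finite, lower, method=lower_method)
    high = np.percentile(finite, 100 - upper, method=upper_method)
    return np.clip(values, low, high), (low, high)


data = [5, 10, 12, 15, 20, 24, 27, 30, 35]
inside_result, inside_bounds = winsorize_1d(data, 10, 10, inside=True)
outside_result, outside_bounds = winsorize_1d(data, 10, 10, inside=False)
```

With `inside=True` the bounds coincide with the data minimum and maximum, so the array is unchanged; with `inside=False` the first and last values are pulled in to 10 and 30.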

If no band/column selection is specified, all bands/columns are used. If a parameter contains only one entry, it is applied to all bands. Percentiles can be set for each band individually, but the inside parameter is the same for all bands.
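The rule that a single entry applies to every band can be illustrated as follows. This is a minimal sketch built around a hypothetical helper, `expand_percentiles`; the toolkit itself relies on its own `check_parameter_length` and `expand_and_zip` utilities, whose implementations are not shown on this page.

```python
from typing import List, Optional, Sequence, Tuple

Percentiles = Tuple[Optional[float], Optional[float]]


def expand_percentiles(bands: Sequence[int], percentiles: Sequence[Percentiles]) -> List[Percentiles]:
    """Return one (lower, upper) tuple per band (hypothetical helper)."""
    if len(percentiles) == 1:
        # A single tuple is repeated for every selected band.
        return list(percentiles) * len(bands)
    if len(percentiles) != len(bands):
        raise ValueError("Need one percentile tuple, or one per selected band.")
    return list(percentiles)


shared = expand_percentiles([1, 2, 3], [(10, 10)])
per_band = expand_percentiles([1, 2], [(5, None), (None, 20)])
```

Here `shared` holds the same tuple three times, while `per_band` keeps a one-sided setting for each of the two bands.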

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raster | DatasetReader | Data object to be transformed. | required |
| bands | Optional[Sequence[int]] | Selection of bands to be transformed. | None |
| percentiles | Sequence[Tuple[Optional[Number], Optional[Number]]] | Lower and upper percentile values (lower, upper). Each value must lie strictly between 0 and 100, and a pair must sum to less than 100. | required |
| inside | bool | Whether the replacement value is taken from inside (True) or outside (False) of the respective interval. | False |
| nodata | Optional[Number] | Nodata value to be considered. If None, the nodata value of the raster is used. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| out_array | ndarray | The transformed data. |
| out_meta | dict | Updated metadata. |
| out_settings | dict | Log of input settings and calculated statistics if available. |

Raises:

| Type | Description |
| --- | --- |
| InvalidRasterBandException | The input contains invalid band numbers. |
| NonMatchingParameterLengthsException | The number of percentile tuples does not match the number of selected bands. |
| InvalidParameterValueException | A percentile tuple does not meet the requirements (values, order of values). |
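The value checks behind InvalidParameterValueException can be sketched in isolation. This minimal example mirrors the conditions in the source listing on this page, but raises the built-in `ValueError` so that it stays self-contained; the names `validate_percentile_pair` and `is_valid` are hypothetical.

```python
from typing import Optional, Tuple


def validate_percentile_pair(pair: Tuple[Optional[float], Optional[float]]) -> None:
    """Raise ValueError if a (lower, upper) pair breaks the rules (hypothetical helper)."""
    lower, upper = pair
    if lower is None and upper is None:
        raise ValueError(f"Percentile values all None: {pair}.")
    if lower is not None and upper is not None and lower + upper >= 100:
        raise ValueError(f"Sum >= 100: {pair}.")
    if lower is not None and not (0 < lower < 100):
        raise ValueError(f"Invalid lower percentile value: {pair}.")
    if upper is not None and not (0 < upper < 100):
        raise ValueError(f"Invalid upper percentile value: {pair}.")


def is_valid(pair: Tuple[Optional[float], Optional[float]]) -> bool:
    """Return True if the pair passes validation."""
    try:
        validate_percentile_pair(pair)
    except ValueError:
        return False
    return True


checks = {pair: is_valid(pair) for pair in [(10, 10), (None, 5), (None, None), (60, 40), (0, 10)]}
```

One-sided pairs such as `(None, 5)` pass, while `(None, None)`, `(60, 40)` and `(0, 10)` are rejected.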

Source code in eis_toolkit/transformations/winsorize.py
@beartype
def winsorize(  # type: ignore[no-any-unimported]
    raster: rasterio.io.DatasetReader,
    percentiles: Sequence[Tuple[Optional[Number], Optional[Number]]],
    bands: Optional[Sequence[int]] = None,
    inside: bool = False,
    nodata: Optional[Number] = None,
) -> Tuple[np.ndarray, dict, dict]:
    """
    Winsorize data based on specified percentile values.

    Takes one nodata value that will be ignored in calculations.
    Replaces values between [minimum, lower percentile] and [upper percentile, maximum] if provided.
    Works both one-sided and two-sided but raises error if no percentile values provided.

    Percentiles are symmetrical, i.e. percentile_lower = 10 corresponds to the interval [min, 10%].
    And percentile_upper = 10 corresponds to the interval [90%, max].
    I.e. percentile_lower = 0 refers to the minimum and percentile_upper = 0 to the data maximum.

    Calculation of percentiles is ambiguous. Users can choose whether to use the value
    for replacement from inside or outside of the respective interval. Example:
    Given the np.array[5 10 12 15 20 24 27 30 35] and percentiles(10, 10), the calculated
    percentiles are (5, 35) for inside and (10, 30) for outside.
    This results in [5 10 12 15 20 24 27 30 35] and [10 10 12 15 20 24 27 30 30], respectively.

    If no band/column selection specified, all bands/columns will be used.
    If a parameter contains only 1 entry, it will be applied for all bands.
    The percentiles can be set for each band individually, but inside parameter is same for all bands.

    Args:
        raster: Data object to be transformed.
        bands: Selection of bands to be transformed.
        percentiles: Lower and upper percentile values (lower, upper) between [0, 100].
        inside: Whether to use the value for replacement from the left or right of the calculated percentile.
        nodata: Nodata value to be considered.

    Returns:
        out_array: The transformed data.
        out_meta: Updated metadata.
        out_settings: Log of input settings and calculated statistics if available.

    Raises:
        InvalidRasterBandException: The input contains invalid band numbers.
        NonMatchingParameterLengthsException: The input does not match the number of selected bands.
        InvalidParameterValueException: The input does not match the requirements (values, order of values)
    """
    bands = list(range(1, raster.count + 1)) if bands is None else bands
    nodata = raster.nodata if nodata is None else nodata

    if check_raster_bands(raster, bands) is False:
        raise InvalidRasterBandException("Invalid band selection")

    if check_parameter_length(bands, percentiles) is False:
        raise NonMatchingParameterLengthsException("Invalid length for percentiles.")

    for item in percentiles:
        if item.count(None) == len(item):
            raise InvalidParameterValueException(f"Percentile values all None: {item}.")

        if None not in item and sum(item) >= 100:
            raise InvalidParameterValueException(f"Sum >= 100: {item}.")

        if item[0] is not None and not (0 < item[0] < 100):
            raise InvalidParameterValueException(f"Invalid lower percentile value: {item}.")

        if item[1] is not None and not (0 < item[1] < 100):
            raise InvalidParameterValueException(f"Invalid upper percentile value: {item}.")

    expanded_args = expand_and_zip(bands, percentiles)
    percentiles = [element[1] for element in expanded_args]

    out_settings = {}

    for i in range(0, len(bands)):
        band_array = raster.read(bands[i])
        initial_dtype = band_array.dtype

        band_array = cast_array_to_float(band_array, cast_int=True)
        band_array = nodata_to_nan(band_array, nodata_value=nodata)

        band_array, calculated_lower, calculated_upper = _winsorize(
            band_array, percentiles=percentiles[i], inside=inside
        )

        band_array = nan_to_nodata(band_array, nodata_value=nodata)
        band_array = cast_array_to_int(band_array, scalar=nodata, initial_dtype=initial_dtype)

        band_array = np.expand_dims(band_array, axis=0)

        if i == 0:
            out_array = band_array.copy()
        else:
            out_array = np.vstack((out_array, band_array))

        current_transform = f"transformation {i + 1}"
        current_settings = {
            "band_origin": bands[i],
            "percentile_lower": cast_scalar_to_int(percentiles[i][0]),
            "percentile_upper": cast_scalar_to_int(percentiles[i][1]),
            "calculated_lower": cast_scalar_to_int(calculated_lower),
            "calculated_upper": cast_scalar_to_int(calculated_upper),
            "nodata": cast_scalar_to_int(nodata),
        }

        out_settings[current_transform] = current_settings

    out_meta = raster.meta.copy()
    out_meta.update({"count": len(bands), "nodata": nodata, "dtype": out_array.dtype.name})

    return out_array, out_meta, out_settings
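
The per-band loop above masks nodata before the calculation and restores it afterwards. A minimal NumPy sketch of that round trip follows; `mask_nodata` and `restore_nodata` are hypothetical stand-ins for the toolkit's `nodata_to_nan` and `nan_to_nodata` helpers, whose implementations are not shown here, and the percentile bounds assume the "outside" convention via `np.nanpercentile` (NumPy 1.22 or newer).

```python
import numpy as np


def mask_nodata(array, nodata):
    """Return a float copy with nodata cells set to NaN (hypothetical helper)."""
    out = np.array(array, dtype=float)  # np.array always copies, so the input is untouched
    if nodata is not None:
        out[out == nodata] = np.nan
    return out


def restore_nodata(array, nodata):
    """Return a copy with NaN cells set back to nodata (hypothetical helper)."""
    out = np.array(array, dtype=float)
    if nodata is not None:
        out[np.isnan(out)] = nodata
    return out


band = np.array([[-9999, 5, 10, 12], [20, 30, 35, -9999]])
masked = mask_nodata(band, -9999)

# NaN-aware percentiles ignore the masked cells.
low = np.nanpercentile(masked, 10, method="higher")
high = np.nanpercentile(masked, 90, method="lower")

clipped = np.clip(masked, low, high)  # NaN cells stay NaN
result = restore_nodata(clipped, -9999).astype(band.dtype)
```

Because the nodata cells are NaN during the calculation, the value -9999 never influences the percentile bounds and is written back unchanged.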