Fixes #190: GPU-enable all classification operations #852
Open
brendancol wants to merge 4 commits into master
…breaks now support all 4 backends (#190)
- Add Dask+CuPy backend for equal_interval via _run_dask_cupy_equal_interval
- Replace quantile Dask+CuPy NotImplementedError with working implementation that materializes data to CPU for percentile computation
- Add CuPy, Dask+NumPy, and Dask+CuPy backends for natural_breaks by extracting shared _compute_natural_break_bins helper
- Add 7 new tests covering all new backend combinations
- Update README feature matrix to reflect full backend support
- quantile dask+cupy: replace full materialization with map_blocks(cupy.asnumpy) to convert chunks to CPU one at a time, then delegate to dask's streaming approximate percentile
- natural_breaks dask backends: sample lazily from the dask array and only materialize the sample (default 20k points), not the entire dataset. Add _generate_sample_indices helper that uses O(num_sample) memory via RandomState.choice() for large datasets, falling back to the original linspace+shuffle for small datasets to preserve determinism with numpy
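The sampling strategy described above can be sketched roughly as follows. This is not the PR's `_generate_sample_indices`. The function name `sample_indices`, the `threshold` cutoff, and the seed value are all hypothetical, and the sketch assumes the large-input path draws with replacement (the default of `RandomState.choice`), which is what keeps memory proportional to `num_sample`.

```python
import numpy as np


def sample_indices(size, num_sample, seed=1234567890, threshold=10_000_000):
    """Hypothetical stand-in for an index-sampling helper.

    `threshold` and `seed` are illustrative values, not taken from the PR.
    """
    rng = np.random.RandomState(seed)
    if num_sample >= size:
        # Nothing to sample: use every index.
        return np.arange(size)
    if size <= threshold:
        # Small input: shuffle all indices and keep the first num_sample.
        # Costs O(size) memory but yields unique indices deterministically.
        idx = np.linspace(0, size, size, endpoint=False, dtype=np.uint32)
        rng.shuffle(idx)
        return idx[:num_sample]
    # Large input: draw indices directly. With replacement (the default),
    # only num_sample integers are allocated; duplicates are possible.
    return rng.choice(size, size=num_sample)
```

With a lazily evaluated array, only the elements at the returned indices would then need to be materialized, not the whole dataset.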
- Remove unnecessary .ravel() in _run_equal_interval; nanmin/nanmax work on 2D
- Combine double where(±inf) into single isinf pass in _run_equal_interval and _run_cupy_bin, halving temporary allocations
- Use dask.compute(min, max) instead of two separate .compute() calls so dask reads data once instead of twice
- Build cuts as numpy array for all backends (was needlessly dask for k elements)
- Replace boolean fancy indexing in dask natural_break functions with da.where + da.nanmax to preserve chunk structure
- Delete _run_dask_cupy_equal_interval; unified _run_equal_interval with module=da handles both dask+numpy and dask+cupy
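A NumPy-only sketch of the first, second, and fourth points in this commit: one `isinf` pass, reductions applied directly to the 2D array, and cut points built as a small NumPy array. The function name `equal_interval_cuts` is hypothetical and this is not the PR's `_run_equal_interval`. It only illustrates the idea under the assumption that the cuts are the k upper bin edges.

```python
import numpy as np


def equal_interval_cuts(data, k):
    """Hypothetical helper: return the k upper bin edges for equal intervals."""
    # One isinf pass masks both +inf and -inf, instead of two where() calls.
    finite = np.where(np.isinf(data), np.nan, data)
    # nanmin/nanmax reduce over all axes, so no ravel() is needed.
    lo = np.nanmin(finite)
    hi = np.nanmax(finite)
    width = (hi - lo) / k
    # k scalars only, so a plain NumPy array is enough for every backend.
    return lo + width * np.arange(1, k + 1)


data = np.array([[0.0, np.inf], [np.nan, 10.0], [-np.inf, 5.0]])
cuts = equal_interval_cuts(data, 5)
```

On a Dask-backed array, the two reductions could be passed together to `dask.compute(lo, hi)` so that both are evaluated in a single pass over the data.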
… consistency
- Missing backend: natural_breaks dask+cupy num_sample
- Input mutation: verify all 5 functions don't modify input DataArray
- Untested path: natural_breaks with num_sample=None
- Edge cases: equal_interval k=1, all-NaN input for equal_interval and natural_breaks
- Name parameter: verify default and custom name on all 5 functions
- Cross-backend: verify natural_breaks cupy and dask match numpy results on a separate 10x10 dataset
Summary
- Add Dask+CuPy backend for `equal_interval` via dedicated `_run_dask_cupy_equal_interval` function
- Replace `quantile` Dask+CuPy `NotImplementedError` with a working implementation that materializes data to CPU for percentile computation
- Add CuPy, Dask+NumPy, and Dask+CuPy backends for `natural_breaks` by extracting a shared `_compute_natural_break_bins` helper that runs the Jenks algorithm on CPU, then delegates to `_bin()` for GPU/Dask classification

All 5 classification functions (`binary`, `reclassify`, `quantile`, `equal_interval`, `natural_breaks`) now support all 4 backends (NumPy, Dask+NumPy, CuPy, Dask+CuPy).

Closes #190
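The split described for `natural_breaks` (compute a handful of bin edges on CPU, then classify the full array on whichever backend holds it) can be illustrated with a NumPy-only sketch of the classification step. The function `bin_values` below is hypothetical and is not the library's `_bin()`. It assumes the edges are sorted upper bounds, that a value equal to an edge belongs to that edge's class, and that NaN cells stay NaN.

```python
import numpy as np


def bin_values(data, bins):
    """Hypothetical classifier: map each value to the index of its bin."""
    data = np.asarray(data, dtype=float)
    bins = np.asarray(bins, dtype=float)
    # side="left" gives index i with bins[i-1] < value <= bins[i].
    idx = np.searchsorted(bins, data, side="left").astype(float)
    # Values above the last edge fall into the last class.
    idx = np.minimum(idx, len(bins) - 1)
    # Preserve missing data as NaN in the output.
    idx[np.isnan(data)] = np.nan
    return idx


classes = bin_values([[1.0, 2.0], [3.0, np.nan], [6.0, 9.0]], [2.0, 4.0, 6.0])
```

Because the edges are only k scalars, computing them on CPU and broadcasting them to a GPU or chunked classifier keeps the expensive part of the work on the array's own backend.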
Test plan