Conversation
for more information, see https://pre-commit.ci
Walkthrough

This pull request adds feature analysis capabilities to the codebase. Two runtime dependencies (liac-arff and pyarrow) are added to pyproject.toml. The src/core/feature_analysis.py module introduces analyze_arff() and analyze_parquet() functions that extract metadata from ARFF and Parquet files respectively. Both functions parse file structures, infer feature data types (NUMERIC, NOMINAL, STRING), determine nominal value sets where applicable, count missing values per feature, and generate Feature objects with metadata. The implementation includes edge case handling for sparse ARFF representations and dictionary-encoded Parquet columns. Comprehensive unit tests in tests/core/test_feature_analysis.py validate both normal and sparse data paths, metadata tagging, and missing value handling.

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Hey - I've found 5 issues, and left some high level feedback:
- The Parquet analyzer currently reads the entire table into memory and computes full unique sets for string/dictionary columns, which may not scale for large datasets; consider using `read_table` with column projections and row group/row limits, or sampling rows when inferring nominal values.
- Treating all non-numeric Parquet string columns as nominal and fully materializing their unique values can be very expensive and may misclassify high-cardinality text fields; you might want a heuristic (e.g., max distinct threshold) to fall back to `STRING` without collecting all unique values.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The Parquet analyzer currently reads the entire table into memory and computes full unique sets for string/dictionary columns, which may not scale for large datasets; consider using `read_table` with column projections and row group/row limits, or sampling rows when inferring nominal values.
- Treating all non-numeric Parquet string columns as nominal and fully materializing their unique values can be very expensive and may misclassify high‑cardinality text fields; you might want a heuristic (e.g., max distinct threshold) to fall back to `STRING` without collecting all unique values.
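A minimal sketch of the distinct-threshold heuristic described above, in pure Python. The `MAX_NOMINAL_CARDINALITY` value and the convention that `None` means "treat as STRING, not NOMINAL" are assumptions for illustration, not the PR's actual API:

```python
MAX_NOMINAL_CARDINALITY = 256  # assumed threshold; tune per project needs

def capped_uniques(values, cap=MAX_NOMINAL_CARDINALITY):
    """Collect distinct non-null values, bailing out once the cap is exceeded.

    Returning None signals "too many distinct values: treat the column as
    STRING, not NOMINAL" without materializing the full unique set.
    """
    uniques = set()
    for v in values:
        if v is None:
            continue
        uniques.add(v)
        if len(uniques) > cap:
            return None  # high cardinality: fall back to STRING
    return sorted(uniques)
```

For example, `capped_uniques(["x", "y", "x", None])` returns `["x", "y"]`, while a free-text column short-circuits to `None` after `cap + 1` distinct values regardless of row count.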
## Individual Comments
### Comment 1
<location> `src/core/feature_analysis.py:111-110` </location>
<code_context>
+ elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
</code_context>
<issue_to_address>
**suggestion (performance):** Treating all string columns as nominal and enumerating all unique values may not scale for high-cardinality text.
This branch treats all string-typed Parquet columns as nominal and builds `nominal_values` by calling `chunk.unique()` over all chunks. For high-cardinality or free-text columns this can be very slow and memory-intensive, producing huge and not very useful `nominal_values` lists. Consider introducing a cardinality/uniqueness threshold or sampling limit to fall back to `STRING` instead of `NOMINAL`, and/or capping the number of unique values collected to avoid pathological cases.
Suggested implementation:
```python
elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
    # For Parquet, strings might be nominal if they don't have a dictionary.
    # However, treating all string columns as nominal and enumerating every
    # unique value does not scale for high-cardinality / free-text columns.
    #
    # Heuristic:
    # - Booleans are always nominal: [False, True].
    # - For strings, sample up to a maximum number of rows and track unique
    #   values up to a cardinality cap. If the cap is exceeded, fall back
    #   to treating the column as STRING by not populating nominal_values.
    #
    # Downstream logic should interpret `nominal_values is None` as
    # "not nominal" for string-typed columns.
    MAX_NOMINAL_CARDINALITY = 1000  # Maximum distinct values to treat as nominal
    MAX_NOMINAL_SAMPLE_ROWS = 100000  # Maximum rows to scan per column
    if pa.types.is_boolean(pa_type):
        # Booleans are low-cardinality by definition.
        nominal_values = [False, True]
    else:
        column_data = table.column(i)
        unique_values = set()
        rows_seen = 0
        for chunk in column_data.chunks:
            if rows_seen >= MAX_NOMINAL_SAMPLE_ROWS:
                break
            if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                break
            # Respect the sampling limit by slicing the chunk if needed
            chunk_to_scan = chunk
            remaining_rows = MAX_NOMINAL_SAMPLE_ROWS - rows_seen
            if len(chunk_to_scan) > remaining_rows:
                chunk_to_scan = chunk_to_scan.slice(0, remaining_rows)
            # Convert the chunk to Python values; skip nulls
            for val in chunk_to_scan.to_pylist():
                if val is None:
                    continue
                unique_values.add(val)
                if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                    break
            rows_seen += len(chunk_to_scan)
            if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                break
        # If we exceeded the cardinality cap (or saw no values),
        # treat this as a generic string column (no nominal_values).
        if 0 < len(unique_values) <= MAX_NOMINAL_CARDINALITY:
            nominal_values = sorted(unique_values)
        else:
            nominal_values = None
```
1. Ensure that whatever structure you are building for each feature (e.g. an `attribute_type`/`feature_type` or similar) interprets `nominal_values is None` for string-typed columns as a signal to treat the column as `STRING` (or equivalent non-nominal type) rather than `NOMINAL`.
2. If you already have module-level configuration or constants for thresholds, you may want to hoist `MAX_NOMINAL_CARDINALITY` and `MAX_NOMINAL_SAMPLE_ROWS` to that configuration instead of hard-coding them inline.
</issue_to_address>
### Comment 2
<location> `tests/core/test_feature_analysis.py:7-16` </location>
<code_context>
+from core.feature_analysis import analyze_arff, analyze_parquet
+from schemas.datasets.openml import FeatureType
+
+def test_analyze_arff_dense():
+ arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b {x,y}
+@ATTRIBUTE c STRING
+@DATA
+1,x,hello
+2,y,world
+?,? ,?
+"""
+ features = analyze_arff(StringIO(arff_data))
+ assert len(features) == 3
+
+ assert features[0].name == "a"
+ assert features[0].data_type == FeatureType.NUMERIC
+ assert features[0].number_of_missing_values == 1
+ assert features[0].nominal_values is None
+
+ assert features[1].name == "b"
+ assert features[1].data_type == FeatureType.NOMINAL
+ assert features[1].nominal_values == ["x", "y"]
+ assert features[1].number_of_missing_values == 1
+
+ assert features[2].name == "c"
+ assert features[2].data_type == FeatureType.STRING
+ assert features[2].number_of_missing_values == 1
+
+def test_analyze_arff_sparse():
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for ARFF files with a schema but no data rows
Please add coverage for an ARFF file that declares attributes but has an empty `@DATA` section (no rows). This will confirm `analyze_arff` safely handles zero-row datasets and reports `number_of_missing_values == 0` for all features.
</issue_to_address>
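The zero-row contract can be pinned down concretely. A stdlib-only sketch of the shape of this edge case (the real test would call `analyze_arff(StringIO(arff_data))` as the other tests do; the manual header/data split below is only illustrative):

```python
# ARFF text that declares attributes but has an empty @DATA section.
arff_text = """@RELATION test
@ATTRIBUTE a NUMERIC
@ATTRIBUTE b {x,y}
@DATA
"""

header, _, body = arff_text.partition("@DATA")
rows = [line for line in body.splitlines() if line.strip()]
n_attrs = sum(
    1 for line in header.splitlines()
    if line.upper().startswith("@ATTRIBUTE")
)

# With zero rows, every feature's missing-value count must start (and stay) at 0.
expected_missing = [0] * n_attrs
assert rows == []
assert expected_missing == [0, 0]
```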
### Comment 3
<location> `tests/core/test_feature_analysis.py:55-58` </location>
<code_context>
+ assert features[2].data_type == FeatureType.STRING
+ assert features[2].number_of_missing_values == 1
+
+def test_analyze_arff_sparse():
+ arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b NUMERIC
+@ATTRIBUTE c {X,Y}
+@DATA
+{0 1, 2 X}
+{1 5}
+{0 ?, 2 ?}
+"""
+ features = analyze_arff(StringIO(arff_data))
+ assert len(features) == 3
+
+ # index 0: 1, missing(0), ? -> 1 missing
+ assert features[0].name == "a"
+ assert features[0].number_of_missing_values == 1
+
+ # index 1: missing(0), 5, missing(0) -> 0 missing
+ assert features[1].name == "b"
+ assert features[1].number_of_missing_values == 0
+
+ # index 2: X, missing(None?), ?
+ # Row 1: {1 5} -> index 2 is missing. In sparse ARFF, if it's missing it's the 0-th element for nominal.
+ assert features[2].name == "c"
+ assert features[2].number_of_missing_values == 1 # Only from row 2 {0 ?, 2 ?}
+
+def test_analyze_arff_sparse_all_missing():
</code_context>
<issue_to_address>
**suggestion (testing):** Also assert data type and nominal values for the sparse nominal attribute
In `test_analyze_arff_sparse`, feature `c` is declared as `{X,Y}`, so it should be validated as a nominal feature using the header’s allowed values. In addition to `number_of_missing_values` and the name, please also assert `features[2].data_type == FeatureType.NOMINAL` and `features[2].nominal_values == ["X", "Y"]` (or the expected order) to verify nominal metadata is handled correctly for sparse attributes.
```suggestion
# index 2: X, missing(None?), ?
# Row 1: {1 5} -> index 2 is missing. In sparse ARFF, if it's missing it's the 0-th element for nominal.
assert features[2].name == "c"
assert features[2].data_type == FeatureType.NOMINAL
assert features[2].nominal_values == ["X", "Y"]
assert features[2].number_of_missing_values == 1 # Only from row 2 {0 ?, 2 ?}
```
</issue_to_address>
### Comment 4
<location> `tests/core/test_feature_analysis.py:74-83` </location>
<code_context>
+ features = analyze_arff(StringIO(arff_data))
+ assert features[0].number_of_missing_values == 2
+
+def test_analyze_arff_metadata():
+ arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b NUMERIC
+@ATTRIBUTE c NUMERIC
+@DATA
+1,2,3
+"""
+ features = analyze_arff(
+ StringIO(arff_data),
+ target_features=["c"],
+ ignore_features=["b"],
+ row_id_features=["a"]
+ )
+ assert features[0].is_row_identifier is True
+ assert features[1].is_ignore is True
+ assert features[2].is_target is True
+ assert features[0].is_target is False
+
+def test_analyze_parquet():
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a similar metadata test for Parquet features (target/ignore/row identifier)
For Parquet, `test_analyze_parquet` currently only validates `target_features` and doesn’t cover `ignore_features` or `row_id_features`. Adding a dedicated metadata test for `analyze_parquet` would ensure these flags behave consistently with the ARFF path.
Suggested implementation:
```python
    assert features[0].is_target is False


def test_analyze_parquet_metadata(tmp_path):
    table = pa.table(
        {
            "a": pa.array([1, 2, 3]),
            "b": pa.array([10, 20, 30]),
            "c": pa.array([100, 200, 300]),
        }
    )
    parquet_path = tmp_path / "metadata_test.parquet"
    pq.write_table(table, parquet_path)
    features = analyze_parquet(
        parquet_path,
        target_features=["c"],
        ignore_features=["b"],
        row_id_features=["a"],
    )
    assert features[0].is_row_identifier is True
    assert features[1].is_ignore is True
    assert features[2].is_target is True
    assert features[0].is_target is False


def test_analyze_parquet():
```
1. Ensure `pyarrow.parquet` is imported as `pq` at the top of this test module if it isn’t already:
```python
import pyarrow.parquet as pq
```
2. Confirm that `analyze_parquet` in your codebase accepts the `ignore_features` and `row_id_features` keyword arguments and returns feature objects with `is_ignore`, `is_row_identifier`, and `is_target` attributes, consistent with the ARFF path.
</issue_to_address>
### Comment 5
<location> `tests/core/test_feature_analysis.py:93-102` </location>
<code_context>
+def test_analyze_parquet():
</code_context>
<issue_to_address>
**suggestion (testing):** Add a Parquet test case for columns that are entirely null or have only missing values
Please add a case where one column is entirely null (e.g., `pa.array([None, None, None])`) to verify that `null_count` is correct and `nominal_values` is empty/`None` for fully-null columns.
Suggested implementation:
```python
def test_analyze_parquet():
    data = [
        pa.array([1, 2, None]),
        pa.array(["cat", "dog", "cat"]),
        pa.array([True, False, None]),
        pa.array(["v1", "v2", "v3"], type=pa.dictionary(pa.int8(), pa.string())),
        pa.array([None, None, None]),
    ]
    schema = pa.schema([
        ("f1", pa.int64()),
        ("f2", pa.string()),
        ("f3", pa.bool_()),
        ("f4_all_null", pa.int64()),
```
To fully implement the requested test behavior, you should also:
1. Locate the rest of the `test_analyze_parquet` body (after the `schema = ...` block). It likely constructs a `pa.Table` (or writes Parquet) and calls your analysis function.
2. Identify the feature/column entry corresponding to `"f4_all_null"` in the returned analysis result (e.g., `features[3]` or `features["f4_all_null"]`, depending on your API).
3. Add assertions similar to:
- `assert feature_for_f4.null_count == 3`
- Ensure `nominal_values` is empty or `None`, following the conventions used in the other assertions (e.g., `assert feature_for_f4.nominal_values in (None, [])` or `assert feature_for_f4.nominal_values == []` depending on how non-null columns are validated elsewhere in the test file.
Make sure the index/name you use for the `f4_all_null` feature matches how the test currently asserts other columns.
</issue_to_address>
Actionable comments posted: 3
🧹 Nitpick comments (6)
tests/core/test_feature_analysis.py (1)
97-138: Good coverage of Parquet types, but missing edge cases.

The Parquet test covers int64, string, bool, and dictionary-encoded columns. Consider adding test cases for:

- Empty tables (0 rows) to verify both analyzers handle edge cases
- Decimal columns (which currently fall through to STRING — see review on feature_analysis.py)
- Parquet temporal types (timestamp, date)
These would help catch the type-inference gaps flagged in the main module.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_feature_analysis.py` around lines 97 - 138, Update the test_analyze_parquet test to add edge-case Parquet tables: include an empty table (0 rows) to assert analyzers return features with correct names, types and zero counts; add a decimal column (pa.decimal128 or pa.decimal256) and assert analyze_parquet returns a NUMERIC or DECIMAL-expected FeatureType (and adjust expectations if feature_analysis.py currently maps decimals to STRING); and add temporal columns (pa.timestamp, pa.date32/64) and assert they are recognized as TEMPORAL FeatureType with appropriate parsing of values; reference analyze_parquet in your assertions so the new checks will detect the type-inference gaps mentioned in feature_analysis.py.

src/core/feature_analysis.py (4)
88-88: Entire Parquet file is read into memory — consider schema-only reads for metadata and streaming for null counts.
`pq.read_table(source)` loads the entire file into memory. For large datasets, this can be expensive. If only schema + null counts are needed (for non-string/non-dictionary columns), `pq.ParquetFile(source).schema_arrow` can retrieve the schema without reading data, and `pq.read_table(source, columns=[col])` can be used selectively. This is a future optimization worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/feature_analysis.py` at line 88, The current use of pq.read_table(source) (assignment to variable table) reads the entire Parquet into memory; change this to a schema-only read via pq.ParquetFile(source).schema_arrow to obtain schema/metadata, and then perform per-column reads for any actual data needs (e.g., compute null counts by calling pq.read_table(source, columns=[col]) or streaming column reads) instead of loading everything at once; update the logic around table and any downstream code that expects the full table to handle metadata-first + selective column reads accordingly.
22-24: Parameter shadowing changes the type from `Iterable[str] | None` to `set`.

Both functions reassign the parameter names (`target_features`, `ignore_features`, `row_id_features`) from `Iterable[str] | None` to `set[str]`. This works at runtime but will confuse type checkers (and the project has mypy strict enabled per pyproject.toml). Use distinct local variable names.

♻️ Proposed fix

```diff
- target_features = set(target_features or [])
- ignore_features = set(ignore_features or [])
- row_id_features = set(row_id_features or [])
+ target_set = set(target_features or [])
+ ignore_set = set(ignore_features or [])
+ row_id_set = set(row_id_features or [])
```

Then update usages to reference `target_set`, `ignore_set`, `row_id_set`.

Also applies to: 91-93
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/feature_analysis.py` around lines 22 - 24, Function parameters target_features, ignore_features, and row_id_features are being reassigned to sets (changing their type from Iterable[str] | None to set) which breaks strict mypy checks; instead create new local variables (e.g., target_set, ignore_set, row_id_set) initialized as set(target_features or []), set(ignore_features or []), set(row_id_features or []), and update all subsequent usages in the functions (including the other occurrence around the second block where the same reassignment appears) to reference these new local set variables rather than mutating the parameter names.
115-115: `sorted(list(...))` — `list()` is redundant.

`sorted()` already returns a `list`, so wrapping in `list()` first is unnecessary on both lines 115 and 143.

♻️ Fix

```diff
- nominal_values = sorted(list(unique_values))
+ nominal_values = sorted(unique_values)
```

Also applies to: 143-143
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/feature_analysis.py` at line 115, Replace the redundant list() wrapping around sorted(...) — e.g., change nominal_values = sorted(list(unique_values)) to nominal_values = sorted(unique_values) and do the same for the other occurrence that uses sorted(list(...)) (the same pattern elsewhere in this file). This removes the unnecessary list() call while preserving behavior.
42-64: O(attributes × rows) complexity for missing value counting — refactor to single pass.

The missing value count loops over all rows for every attribute, making this O(attributes × rows). For a dataset with many attributes and rows, this is unnecessarily slow. A single pass over the data can compute missing counts for all attributes at once.
♻️ Proposed refactor: single-pass missing value counting
Compute the missing counts once before the per-attribute loop:
```diff
+ # Pre-compute missing counts in a single pass
+ missing_counts = [0] * len(attributes)
+ for row in data:
+     if isinstance(row, dict):
+         for idx in range(len(attributes)):
+             if idx in row and row[idx] is None:
+                 missing_counts[idx] += 1
+     else:
+         for idx, val in enumerate(row):
+             if val is None:
+                 missing_counts[idx] += 1
+
  features = []
  for i, (name, type_info) in enumerate(attributes):
      ...
-     # Count missing values
-     missing_count = 0
-     if data:
-         for row in data:
-             if isinstance(row, dict):
-                 if i not in row:
-                     pass
-                 elif row[i] is None:
-                     missing_count += 1
-             elif row[i] is None:
-                 missing_count += 1
+     missing_count = missing_counts[i]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/core/feature_analysis.py` around lines 42 - 64, The missing-value logic currently recomputes missing_count for each attribute (using variables/data structures like missing_count, data, row, i), producing O(attributes × rows) complexity; refactor by doing a single pass over data to build a missing_counts array (length = number of attributes) once: initialize missing_counts to zeros, iterate each row once, handle both dict (sparse) and list rows by only incrementing missing_counts[idx] when a value is explicitly None (and treating absent keys in sparse dict as default 0, not missing), then replace per-attribute missing_count uses with missing_counts[i] in the later attribute loop.

pyproject.toml (1)
23-24: Address unmaintained dependency and potential pyarrow type handling gap.
liac-arff has not been maintained upstream since August 31, 2020 (no PyPI releases since then). While actively used in this codebase, consider evaluating if it remains the best option for ARFF parsing, or plan to migrate if the library becomes problematic.

pyarrow adds string type flexibility in newer versions. The current code uses only `pa.types.is_string()` for type detection, which specifically matches the 32-bit UTF-8 string type. Newer PyArrow versions also support `large_string` (64-bit offsets) and `string_view` (variable-length view), which the current predicates would not catch. If Parquet files use these types, the type detection logic may misclassify them. Consider expanding type checks to handle these variants, or document the assumption that only `pa.string()` is used.

Both dependencies follow the project's pattern of unpinned versions, so version constraints are not strictly necessary, but the above functional concerns are worth addressing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 23 - 24, Update handling for the two dependency concerns: for the unmaintained liac-arff dependency, add a TODO in the project metadata and evaluate replacing it (e.g., migrate ARFF parsing to an actively maintained library or an internal parser) so there's a planned migration path; for pyarrow type detection, update the predicate that currently calls pa.types.is_string() to also accept pa.types.is_large_string() and pa.types.is_string_view() (or use a combined helper like is_arrow_text_type) so Parquet/Arrow columns using 64-bit or view-backed string types are correctly recognized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/core/feature_analysis.py`:
- Around line 108-115: The code builds nominal_values by iterating
chunk.dictionary which can include unreferenced entries; update the logic in the
feature_analysis code that handles column_data/chunk (the block that fills
unique_values and sets nominal_values) to derive values from the actual data —
either call chunk.unique() and use those results, or iterate the chunk's indices
to filter dictionary entries by referenced indices — then sort that result into
nominal_values so it only contains values present in the chunk.
- Around line 130-143: The current loop in feature_analysis.py that builds
unique_values for string columns (using column_data.chunks, unique_values,
FeatureType.NOMINAL and nominal_values) can blow up memory on high-cardinality
data; modify it to enforce a cardinality threshold (e.g., MAX_NOMINAL=256):
while iterating chunks/values stop adding once the set size exceeds the
threshold, immediately classify the column as FeatureType.STRING and set
nominal_values = None (or an empty/marker value) instead of continuing to
collect uniques; ensure the early-exit prevents scanning remaining chunks to
avoid the unbounded memory/CPU use.
- Around line 100-146: The current feature type detection in the feature
analysis block misclassifies decimal and temporal Parquet types as STRING;
update the logic in the same branch that checks pa_type (the block that
currently tests
pa.types.is_floating/is_integer/is_dictionary/is_string/is_boolean) to
explicitly detect pa.decimal128/pa.decimal256 (use pa.types.is_decimal128 and
pa.types.is_decimal256 or pa.types.is_decimal) and mark them as
FeatureType.NUMERIC (with appropriate conversion/scale handling for numeric
processing), and add explicit checks for temporal types (pa.types.is_date,
pa.types.is_time, pa.types.is_timestamp) and apply the chosen policy (e.g.,
treat timestamps as FeatureType.NUMERIC or FeatureType.STRING consistently),
ensuring nominal_values is set or left None as appropriate; adjust any
downstream code that expects numeric/date handling accordingly (refer to the
variables pa_type, data_type, nominal_values, and the table.column(i) logic to
extract values if needed).
---
Nitpick comments:
In `@pyproject.toml`:
- Around line 23-24: Update handling for the two dependency concerns: for the
unmaintained liac-arff dependency, add a TODO in the project metadata and
evaluate replacing it (e.g., migrate ARFF parsing to an actively maintained
library or an internal parser) so there's a planned migration path; for pyarrow
type detection, update the predicate that currently calls pa.types.is_string()
to also accept pa.types.is_large_string() and pa.types.is_string_view() (or use
a combined helper like is_arrow_text_type) so Parquet/Arrow columns using 64-bit
or view-backed string types are correctly recognized.
In `@src/core/feature_analysis.py`:
- Line 88: The current use of pq.read_table(source) (assignment to variable
table) reads the entire Parquet into memory; change this to a schema-only read
via pq.ParquetFile(source).schema_arrow to obtain schema/metadata, and then
perform per-column reads for any actual data needs (e.g., compute null counts by
calling pq.read_table(source, columns=[col]) or streaming column reads) instead
of loading everything at once; update the logic around table and any downstream
code that expects the full table to handle metadata-first + selective column
reads accordingly.
- Around line 22-24: Function parameters target_features, ignore_features, and
row_id_features are being reassigned to sets (changing their type from
Iterable[str] | None to set) which breaks strict mypy checks; instead create new
local variables (e.g., target_set, ignore_set, row_id_set) initialized as
set(target_features or []), set(ignore_features or []), set(row_id_features or
[]), and update all subsequent usages in the functions (including the other
occurrence around the second block where the same reassignment appears) to
reference these new local set variables rather than mutating the parameter
names.
- Line 115: Replace the redundant list() wrapping around sorted(...) — e.g.,
change nominal_values = sorted(list(unique_values)) to nominal_values =
sorted(unique_values) and do the same for the other occurrence that uses
sorted(list(...)) (the same pattern elsewhere in this file). This removes the
unnecessary list() call while preserving behavior.
- Around line 42-64: The missing-value logic currently recomputes missing_count
for each attribute (using variables/data structures like missing_count, data,
row, i), producing O(attributes × rows) complexity; refactor by doing a single
pass over data to build a missing_counts array (length = number of attributes)
once: initialize missing_counts to zeros, iterate each row once, handle both
dict (sparse) and list rows by only incrementing missing_counts[idx] when a
value is explicitly None (and treating absent keys in sparse dict as default 0,
not missing), then replace per-attribute missing_count uses with
missing_counts[i] in the later attribute loop.
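A single-pass version of that logic might look like the following sketch (the function name is hypothetical; the row shapes mirror how liac-arff returns dense rows as lists and sparse rows as index-to-value dicts):

```python
def count_missing(data, n_attributes):
    """One pass over all rows; handles dense (list) and sparse (dict) rows.

    In sparse ARFF rows, an absent key means the default value (0),
    not a missing value, so only explicit None entries are counted.
    """
    missing_counts = [0] * n_attributes
    for row in data:
        if isinstance(row, dict):  # sparse row: {attribute index: value}
            for idx, value in row.items():
                if value is None:
                    missing_counts[idx] += 1
        else:  # dense row: positional list of values
            for idx, value in enumerate(row):
                if value is None:
                    missing_counts[idx] += 1
    return missing_counts
```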
In `@tests/core/test_feature_analysis.py`:
- Around line 97-138: Update the test_analyze_parquet test to add edge-case
Parquet tables: include an empty table (0 rows) to assert analyzers return
features with correct names, types and zero counts; add a decimal column
(pa.decimal128 or pa.decimal256) and assert analyze_parquet returns a NUMERIC or
DECIMAL-expected FeatureType (and adjust expectations if feature_analysis.py
currently maps decimals to STRING); and add temporal columns (pa.timestamp,
pa.date32/64) and assert they are recognized as TEMPORAL FeatureType with
appropriate parsing of values; reference analyze_parquet in your assertions so
the new checks will detect the type-inference gaps mentioned in
feature_analysis.py.
```python
# Determine data_type and nominal_values
nominal_values = None
if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type):
    data_type = FeatureType.NUMERIC
elif pa.types.is_dictionary(pa_type):
    data_type = FeatureType.NOMINAL
    # Extract nominal values from dictionary
    # We need to look at the data to get the dictionary values
    column_data = table.column(i)
    # A column can have multiple chunks
    unique_values = set()
    for chunk in column_data.chunks:
        dictionary = chunk.dictionary
        for val in dictionary:
            unique_values.add(val.as_py())
    nominal_values = sorted(list(unique_values))
elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
    # For Parquet, strings might be nominal if they don't have a dictionary
    # We needed to "Extract unique values from the data" for nominals in non-ARFF
    # In OpenML, if it's used for classification, it's nominal.
    # If we don't know, we might have to guess or treat all strings as nominal if they have
    # few unique values.

    # For Parquet, let's assume if it's not numeric, it's nominal for now,
    # as that's common in ML datasets, unless it's explicitly string.

    # If it's boolean, it's definitely nominal [False, True].
    if pa.types.is_boolean(pa_type):
        data_type = FeatureType.NOMINAL
        nominal_values = ["false", "true"]
    else:
        # For string, let's extract unique values and see.
        column_data = table.column(i)
        unique_values = set()
        # For efficiency, we might not want to scan everything if it's huge
        for chunk in column_data.chunks:
            for val in chunk.unique():
                v = val.as_py()
                if v is not None:
                    unique_values.add(str(v))

        # OpenML usually has a threshold, but let's just call it nominal if it's string
        data_type = FeatureType.NOMINAL
        nominal_values = sorted(list(unique_values))
else:
    data_type = FeatureType.STRING
    nominal_values = None
```
Missing handling for pa.decimal128/pa.decimal256 and date/time types — they'll be classified as STRING.
Parquet files commonly contain decimal and temporal types. Currently these fall through to the else branch (line 144-146) and get classified as FeatureType.STRING with no nominal values. Decimal should be NUMERIC, and date/time types need a deliberate policy.
🔧 Proposed fix for decimal types

```diff
-        if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type):
+        if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type) or pa.types.is_decimal(pa_type):
             data_type = FeatureType.NUMERIC
```

For temporal types (pa.types.is_date, pa.types.is_time, pa.types.is_timestamp), decide on a classification policy (NUMERIC timestamp or STRING) and handle explicitly rather than silently falling through.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/feature_analysis.py` around lines 100 - 146, The current feature
type detection in the feature analysis block misclassifies decimal and temporal
Parquet types as STRING; update the logic in the same branch that checks pa_type
(the block that currently tests
pa.types.is_floating/is_integer/is_dictionary/is_string/is_boolean) to
explicitly detect pa.decimal128/pa.decimal256 (use pa.types.is_decimal128 and
pa.types.is_decimal256 or pa.types.is_decimal) and mark them as
FeatureType.NUMERIC (with appropriate conversion/scale handling for numeric
processing), and add explicit checks for temporal types (pa.types.is_date,
pa.types.is_time, pa.types.is_timestamp) and apply the chosen policy (e.g.,
treat timestamps as FeatureType.NUMERIC or FeatureType.STRING consistently),
ensuring nominal_values is set or left None as appropriate; adjust any
downstream code that expects numeric/date handling accordingly (refer to the
variables pa_type, data_type, nominal_values, and the table.column(i) logic to
extract values if needed).
```python
column_data = table.column(i)
# A column can have multiple chunks
unique_values = set()
for chunk in column_data.chunks:
    dictionary = chunk.dictionary
    for val in dictionary:
        unique_values.add(val.as_py())
nominal_values = sorted(list(unique_values))
```
Dictionary entries may include values not present in the data.
chunk.dictionary contains all entries stored in the dictionary, including those that may no longer be referenced by any index in the chunk (e.g., after filtering). This means nominal_values could list values that don't actually appear in the column data. If accurate nominal values are required, use chunk.unique() instead of chunk.dictionary, or filter dictionary entries by referenced indices.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/feature_analysis.py` around lines 108 - 115, The code builds
nominal_values by iterating chunk.dictionary which can include unreferenced
entries; update the logic in the feature_analysis code that handles
column_data/chunk (the block that fills unique_values and sets nominal_values)
to derive values from the actual data — either call chunk.unique() and use those
results, or iterate the chunk's indices to filter dictionary entries by
referenced indices — then sort that result into nominal_values so it only
contains values present in the chunk.
```python
else:
    # For string, let's extract unique values and see.
    column_data = table.column(i)
    unique_values = set()
    # For efficiency, we might not want to scan everything if it's huge
    for chunk in column_data.chunks:
        for val in chunk.unique():
            v = val.as_py()
            if v is not None:
                unique_values.add(str(v))

    # OpenML usually has a threshold, but let's just call it nominal if it's string
    data_type = FeatureType.NOMINAL
    nominal_values = sorted(list(unique_values))
```
Unbounded unique-value scan on string columns — potential performance/memory issue on high-cardinality data.
For string columns, every value across all chunks is scanned into a set. If a column has millions of distinct values (e.g., free-text, UUIDs), this will consume excessive memory and produce an impractically large nominal_values list. The inline comment on line 134 acknowledges this but doesn't mitigate it.
Consider adding a cardinality threshold: if unique values exceed a limit (e.g., 256), classify as FeatureType.STRING instead of NOMINAL and set nominal_values = None.
🔧 Sketch of a threshold-based approach

```diff
 for chunk in column_data.chunks:
     for val in chunk.unique():
         v = val.as_py()
         if v is not None:
             unique_values.add(str(v))
+        if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+            break
+    if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+        break
-data_type = FeatureType.NOMINAL
-nominal_values = sorted(unique_values)
+if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+    data_type = FeatureType.STRING
+    nominal_values = None
+else:
+    data_type = FeatureType.NOMINAL
+    nominal_values = sorted(unique_values)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/feature_analysis.py` around lines 130 - 143, The current loop in
feature_analysis.py that builds unique_values for string columns (using
column_data.chunks, unique_values, FeatureType.NOMINAL and nominal_values) can
blow up memory on high-cardinality data; modify it to enforce a cardinality
threshold (e.g., MAX_NOMINAL=256): while iterating chunks/values stop adding
once the set size exceeds the threshold, immediately classify the column as
FeatureType.STRING and set nominal_values = None (or an empty/marker value)
instead of continuing to collect uniques; ensure the early-exit prevents
scanning remaining chunks to avoid the unbounded memory/CPU use.
This PR implements a new feature analyzer in Python to calculate basic feature information for OpenML datasets. It supports both ARFF (dense and sparse) and Parquet formats.
Resolves #183
Key changes:

- `pyproject.toml`: Added `liac-arff` and `pyarrow` to project dependencies.
- `src/core/feature_analysis.py`: New module containing the analysis logic.
  - `analyze_arff`: Uses `liac-arff` to parse ARFF files and extract features.
  - `analyze_parquet`: Uses `pyarrow` to analyze Parquet files, including dictionary-encoded columns and automated unique value extraction for nominal features.
- `tests/core/test_feature_analysis.py`: New unit tests for both formats, covering various edge cases like sparse data and missing values.

The analyzer returns a list of `Feature` schema objects, matching the expected output of the `/data/features/{id}` endpoint.