
Comments

Build feature analyzer in Python #248

Open
Nidhicodes wants to merge 2 commits into openml:main from Nidhicodes:feature-analyzer

Conversation

@Nidhicodes

This PR implements a new feature analyzer in Python to calculate basic feature information for OpenML datasets. It supports both ARFF (dense and sparse) and Parquet formats.

Resolves #183

Key changes:

  • pyproject.toml: Added liac-arff and pyarrow to project dependencies.
  • src/core/feature_analysis.py: New module containing the analysis logic.
    • analyze_arff: Uses liac-arff to parse ARFF files and extract features.
    • analyze_parquet: Uses pyarrow to analyze Parquet files, including dictionary-encoded columns and automated unique value extraction for nominal features.
  • tests/core/test_feature_analysis.py: New unit tests for both formats, covering various edge cases like sparse data and missing values.

The analyzer returns a list of Feature schema objects, matching the expected output of the /data/features/{id} endpoint.

@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

Walkthrough

This pull request adds feature analysis capabilities to the codebase. Two runtime dependencies (liac-arff and pyarrow) are added to pyproject.toml. The src/core/feature_analysis.py module introduces analyze_arff() and analyze_parquet() functions that extract metadata from ARFF and Parquet files respectively. Both functions parse file structures, infer feature data types (NUMERIC, NOMINAL, STRING), determine nominal value sets where applicable, count missing values per feature, and generate Feature objects with metadata. The implementation includes edge case handling for sparse ARFF representations and dictionary-encoded Parquet columns. Comprehensive unit tests in tests/core/test_feature_analysis.py validate both normal and sparse data paths, metadata tagging, and missing value handling.
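
The inference and counting rules described above can be sketched with a small stdlib-only stand-in. The helper names here (`infer_type`, `count_missing`) are illustrative and are not the actual API of `src/core/feature_analysis.py`:

```python
# Illustrative stand-in for the rules the walkthrough describes; these
# helpers are hypothetical and not part of the PR's module.

def infer_type(arff_type: str) -> str:
    """Map an ARFF attribute type declaration to a coarse feature type."""
    t = arff_type.strip()
    if t.startswith("{") and t.endswith("}"):
        return "NOMINAL"   # e.g. "{x,y}" declares an enumerated value set
    if t.upper() in ("NUMERIC", "REAL", "INTEGER"):
        return "NUMERIC"
    return "STRING"

def count_missing(rows: list[list]) -> list[int]:
    """Count None entries per column in a single pass over dense rows."""
    counts: list[int] = []
    for row in rows:
        if not counts:
            counts = [0] * len(row)
        for i, val in enumerate(row):
            if val is None:
                counts[i] += 1
    return counts
```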

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.57%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title 'Build feature analyzer in Python' directly and concisely summarizes the main change: implementing a new feature analyzer module in Python.
  • Description check: ✅ Passed. The description clearly explains the feature analyzer implementation, lists key changes to pyproject.toml, feature_analysis.py, and tests, and references the resolved issue.
  • Linked Issues check: ✅ Passed. The PR fully implements the requirements from issue #183: a feature analyzer with ARFF/Parquet support, Feature schema objects with the required fields (index, name, data_type, flags, missing values, nominal values), and comprehensive test coverage.
  • Out of Scope Changes check: ✅ Passed. All changes are directly aligned with the stated objective (dependency additions, feature analysis implementation, test coverage); no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 5 issues and left some high-level feedback:

  • The Parquet analyzer currently reads the entire table into memory and computes full unique sets for string/dictionary columns, which may not scale for large datasets; consider using read_table with column projections and row group/row limits, or sampling rows when inferring nominal values.
  • Treating all non-numeric Parquet string columns as nominal and fully materializing their unique values can be very expensive and may misclassify high‑cardinality text fields; you might want a heuristic (e.g., max distinct threshold) to fall back to STRING without collecting all unique values.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The Parquet analyzer currently reads the entire table into memory and computes full unique sets for string/dictionary columns, which may not scale for large datasets; consider using `read_table` with column projections and row group/row limits, or sampling rows when inferring nominal values.
- Treating all non-numeric Parquet string columns as nominal and fully materializing their unique values can be very expensive and may misclassify high‑cardinality text fields; you might want a heuristic (e.g., max distinct threshold) to fall back to `STRING` without collecting all unique values.

## Individual Comments

### Comment 1
<location> `src/core/feature_analysis.py:111-110` </location>
<code_context>
+        elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
</code_context>

<issue_to_address>
**suggestion (performance):** Treating all string columns as nominal and enumerating all unique values may not scale for high-cardinality text.

This branch treats all string-typed Parquet columns as nominal and builds `nominal_values` by calling `chunk.unique()` over all chunks. For high-cardinality or free-text columns this can be very slow and memory-intensive, producing huge and not very useful `nominal_values` lists. Consider introducing a cardinality/uniqueness threshold or sampling limit to fall back to `STRING` instead of `NOMINAL`, and/or capping the number of unique values collected to avoid pathological cases.

Suggested implementation:

```python
        elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
            # For Parquet, strings might be nominal if they don't have a dictionary.
            # However, treating all string columns as nominal and enumerating every
            # unique value does not scale for high-cardinality / free-text columns.
            #
            # Heuristic:
            # - Booleans are always nominal: [False, True].
            # - For strings, sample up to a maximum number of rows and track unique
            #   values up to a cardinality cap. If the cap is exceeded, fall back
            #   to treating the column as STRING by not populating nominal_values.
            #
            # Downstream logic should interpret `nominal_values is None` as
            # "not nominal" for string-typed columns.

            MAX_NOMINAL_CARDINALITY = 1000   # Maximum distinct values to treat as nominal
            MAX_NOMINAL_SAMPLE_ROWS = 100000  # Maximum rows to scan per column

            if pa.types.is_boolean(pa_type):
                # Booleans are low-cardinality by definition.
                nominal_values = [False, True]
            else:
                column_data = table.column(i)
                unique_values = set()
                rows_seen = 0

                for chunk in column_data.chunks:
                    if rows_seen >= MAX_NOMINAL_SAMPLE_ROWS:
                        break
                    if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                        break

                    # Respect the sampling limit by slicing the chunk if needed
                    chunk_to_scan = chunk
                    remaining_rows = MAX_NOMINAL_SAMPLE_ROWS - rows_seen
                    if len(chunk_to_scan) > remaining_rows:
                        chunk_to_scan = chunk_to_scan.slice(0, remaining_rows)

                    # Convert the chunk to Python values; skip nulls
                    for val in chunk_to_scan.to_pylist():
                        if val is None:
                            continue
                        unique_values.add(val)
                        if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                            break

                    rows_seen += len(chunk_to_scan)

                    if len(unique_values) > MAX_NOMINAL_CARDINALITY:
                        break

                # If we exceeded the cardinality cap (or saw no values),
                # treat this as a generic string column (no nominal_values).
                if 0 < len(unique_values) <= MAX_NOMINAL_CARDINALITY:
                    nominal_values = sorted(unique_values)
                else:
                    nominal_values = None

```

1. Ensure that whatever structure you are building for each feature (e.g. an `attribute_type`/`feature_type` or similar) interprets `nominal_values is None` for string-typed columns as a signal to treat the column as `STRING` (or equivalent non-nominal type) rather than `NOMINAL`.
2. If you already have module-level configuration or constants for thresholds, you may want to hoist `MAX_NOMINAL_CARDINALITY` and `MAX_NOMINAL_SAMPLE_ROWS` to that configuration instead of hard-coding them inline.
</issue_to_address>

### Comment 2
<location> `tests/core/test_feature_analysis.py:7-16` </location>
<code_context>
+from core.feature_analysis import analyze_arff, analyze_parquet
+from schemas.datasets.openml import FeatureType
+
+def test_analyze_arff_dense():
+    arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b {x,y}
+@ATTRIBUTE c STRING
+@DATA
+1,x,hello
+2,y,world
+?,? ,?
+"""
+    features = analyze_arff(StringIO(arff_data))
+    assert len(features) == 3
+    
+    assert features[0].name == "a"
+    assert features[0].data_type == FeatureType.NUMERIC
+    assert features[0].number_of_missing_values == 1
+    assert features[0].nominal_values is None
+    
+    assert features[1].name == "b"
+    assert features[1].data_type == FeatureType.NOMINAL
+    assert features[1].nominal_values == ["x", "y"]
+    assert features[1].number_of_missing_values == 1
+    
+    assert features[2].name == "c"
+    assert features[2].data_type == FeatureType.STRING
+    assert features[2].number_of_missing_values == 1
+
+def test_analyze_arff_sparse():
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test for ARFF files with a schema but no data rows

Please add coverage for an ARFF file that declares attributes but has an empty `@DATA` section (no rows). This will confirm `analyze_arff` safely handles zero-row datasets and reports `number_of_missing_values == 0` for all features.
</issue_to_address>
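
A self-contained sketch of that zero-row case follows. The tiny parser below is a stdlib stand-in so the example runs without liac-arff or the PR's module; it is not the code under review:

```python
from io import StringIO

def parse_dense_arff(f):
    """Toy dense-ARFF reader: returns (attribute_names, data_rows)."""
    attributes, data, in_data = [], [], False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        if line.upper().startswith("@ATTRIBUTE"):
            attributes.append(line.split(None, 2)[1])
        elif line.upper().startswith("@DATA"):
            in_data = True
        elif in_data:
            data.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
    return attributes, data

# Schema declared, but the @DATA section has no rows:
arff_data = """@RELATION test
@ATTRIBUTE a NUMERIC
@ATTRIBUTE b {x,y}
@DATA
"""
attrs, rows = parse_dense_arff(StringIO(arff_data))

# With zero rows, every feature should report 0 missing values:
missing = [0] * len(attrs)
for row in rows:
    for i, val in enumerate(row):
        if val is None:
            missing[i] += 1
```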

### Comment 3
<location> `tests/core/test_feature_analysis.py:55-58` </location>
<code_context>
+    assert features[2].data_type == FeatureType.STRING
+    assert features[2].number_of_missing_values == 1
+
+def test_analyze_arff_sparse():
+    arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b NUMERIC
+@ATTRIBUTE c {X,Y}
+@DATA
+{0 1, 2 X}
+{1 5}
+{0 ?, 2 ?}
+"""
+    features = analyze_arff(StringIO(arff_data))
+    assert len(features) == 3
+    
+    # index 0: 1, missing(0), ? -> 1 missing
+    assert features[0].name == "a"
+    assert features[0].number_of_missing_values == 1
+    
+    # index 1: missing(0), 5, missing(0) -> 0 missing
+    assert features[1].name == "b"
+    assert features[1].number_of_missing_values == 0
+    
+    # index 2: X, missing(None?), ?
+    # Row 1: {1 5} -> index 2 is missing. In sparse ARFF, if it's missing it's the 0-th element for nominal.
+    assert features[2].name == "c"
+    assert features[2].number_of_missing_values == 1 # Only from row 2 {0 ?, 2 ?}
+
+def test_analyze_arff_sparse_all_missing():
</code_context>

<issue_to_address>
**suggestion (testing):** Also assert data type and nominal values for the sparse nominal attribute

In `test_analyze_arff_sparse`, feature `c` is declared as `{X,Y}`, so it should be validated as a nominal feature using the header’s allowed values. In addition to `number_of_missing_values` and the name, please also assert `features[2].data_type == FeatureType.NOMINAL` and `features[2].nominal_values == ["X", "Y"]` (or the expected order) to verify nominal metadata is handled correctly for sparse attributes.

```suggestion
    # index 2: X, missing(None?), ?
    # Row 1: {1 5} -> index 2 is missing. In sparse ARFF, if it's missing it's the 0-th element for nominal.
    assert features[2].name == "c"
    assert features[2].data_type == FeatureType.NOMINAL
    assert features[2].nominal_values == ["X", "Y"]
    assert features[2].number_of_missing_values == 1  # Only from row 2 {0 ?, 2 ?}
```
</issue_to_address>

### Comment 4
<location> `tests/core/test_feature_analysis.py:74-83` </location>
<code_context>
+    features = analyze_arff(StringIO(arff_data))
+    assert features[0].number_of_missing_values == 2
+
+def test_analyze_arff_metadata():
+    arff_data = """@RELATION test
+@ATTRIBUTE a NUMERIC
+@ATTRIBUTE b NUMERIC
+@ATTRIBUTE c NUMERIC
+@DATA
+1,2,3
+"""
+    features = analyze_arff(
+        StringIO(arff_data),
+        target_features=["c"],
+        ignore_features=["b"],
+        row_id_features=["a"]
+    )
+    assert features[0].is_row_identifier is True
+    assert features[1].is_ignore is True
+    assert features[2].is_target is True
+    assert features[0].is_target is False
+
+def test_analyze_parquet():
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding a similar metadata test for Parquet features (target/ignore/row identifier)

For Parquet, `test_analyze_parquet` currently only validates `target_features` and doesn’t cover `ignore_features` or `row_id_features`. Adding a dedicated metadata test for `analyze_parquet` would ensure these flags behave consistently with the ARFF path.

Suggested implementation:

```python
    assert features[0].is_target is False

def test_analyze_parquet_metadata(tmp_path):
    table = pa.table(
        {
            "a": pa.array([1, 2, 3]),
            "b": pa.array([10, 20, 30]),
            "c": pa.array([100, 200, 300]),
        }
    )
    parquet_path = tmp_path / "metadata_test.parquet"
    pq.write_table(table, parquet_path)

    features = analyze_parquet(
        parquet_path,
        target_features=["c"],
        ignore_features=["b"],
        row_id_features=["a"],
    )

    assert features[0].is_row_identifier is True
    assert features[1].is_ignore is True
    assert features[2].is_target is True
    assert features[0].is_target is False

def test_analyze_parquet():

```

1. Ensure `pyarrow.parquet` is imported as `pq` at the top of this test module if it isn’t already:

```python
import pyarrow.parquet as pq
```

2. Confirm that `analyze_parquet` in your codebase accepts the `ignore_features` and `row_id_features` keyword arguments and returns feature objects with `is_ignore`, `is_row_identifier`, and `is_target` attributes, consistent with the ARFF path.
</issue_to_address>

### Comment 5
<location> `tests/core/test_feature_analysis.py:93-102` </location>
<code_context>
+def test_analyze_parquet():
</code_context>

<issue_to_address>
**suggestion (testing):** Add a Parquet test case for columns that are entirely null or have only missing values

Please add a case where one column is entirely null (e.g., `pa.array([None, None, None])`) to verify that `null_count` is correct and `nominal_values` is empty/`None` for fully-null columns.

Suggested implementation:

```python
def test_analyze_parquet():
    data = [
        pa.array([1, 2, None]),
        pa.array(["cat", "dog", "cat"]),
        pa.array([True, False, None]),
        pa.array(["v1", "v2", "v3"], type=pa.dictionary(pa.int8(), pa.string())),
        pa.array([None, None, None]),
    ]
    schema = pa.schema([
        ("f1", pa.int64()),
        ("f2", pa.string()),
        ("f3", pa.bool_()),
        ("f4_all_null", pa.int64()),

```

To fully implement the requested test behavior, you should also:
1. Locate the rest of the `test_analyze_parquet` body (after the `schema = ...` block). It likely constructs a `pa.Table` (or writes Parquet) and calls your analysis function.
2. Identify the feature/column entry corresponding to `"f4_all_null"` in the returned analysis result (e.g., `features[3]` or `features["f4_all_null"]`, depending on your API).
3. Add assertions similar to:
   - `assert feature_for_f4.null_count == 3`
   - Ensure `nominal_values` is empty or `None`, following the conventions used in the other assertions (e.g., `assert feature_for_f4.nominal_values in (None, [])` or `assert feature_for_f4.nominal_values == []` depending on how non-null columns are validated elsewhere in the test file.
Make sure the index/name you use for the `f4_all_null` feature matches how the test currently asserts other columns.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@Nidhicodes Nidhicodes changed the title from "feature analyzer added" to "Build feature analyzer in Python" on Feb 20, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
tests/core/test_feature_analysis.py (1)

97-138: Good coverage of Parquet types, but missing edge cases.

The Parquet test covers int64, string, bool, and dictionary-encoded columns. Consider adding test cases for:

  • Empty tables (0 rows) to verify both analyzers handle edge cases
  • Decimal columns (which currently fall through to STRING — see review on feature_analysis.py)
  • Parquet temporal types (timestamp, date)

These would help catch the type-inference gaps flagged in the main module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_feature_analysis.py` around lines 97 - 138, Update the
test_analyze_parquet test to add edge-case Parquet tables: include an empty
table (0 rows) to assert analyzers return features with correct names, types and
zero counts; add a decimal column (pa.decimal128 or pa.decimal256) and assert
analyze_parquet returns a NUMERIC or DECIMAL-expected FeatureType (and adjust
expectations if feature_analysis.py currently maps decimals to STRING); and add
temporal columns (pa.timestamp, pa.date32/64) and assert they are recognized as
TEMPORAL FeatureType with appropriate parsing of values; reference
analyze_parquet in your assertions so the new checks will detect the
type-inference gaps mentioned in feature_analysis.py.
src/core/feature_analysis.py (4)

88-88: Entire Parquet file is read into memory — consider schema-only reads for metadata and streaming for null counts.

pq.read_table(source) loads the entire file into memory. For large datasets, this can be expensive. If only schema + null counts are needed (for non-string/non-dictionary columns), pq.ParquetFile(source).schema_arrow can retrieve the schema without reading data, and pq.read_table(source, columns=[col]) can be used selectively. This is a future optimization worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` at line 88, The current use of
pq.read_table(source) (assignment to variable table) reads the entire Parquet
into memory; change this to a schema-only read via
pq.ParquetFile(source).schema_arrow to obtain schema/metadata, and then perform
per-column reads for any actual data needs (e.g., compute null counts by calling
pq.read_table(source, columns=[col]) or streaming column reads) instead of
loading everything at once; update the logic around table and any downstream
code that expects the full table to handle metadata-first + selective column
reads accordingly.

22-24: Parameter shadowing changes the type from Iterable[str] | None to set.

Both functions reassign the parameter names (target_features, ignore_features, row_id_features) from Iterable[str] | None to set[str]. This works at runtime but will confuse type checkers (and the project has mypy strict enabled per pyproject.toml). Use distinct local variable names.

♻️ Proposed fix
-    target_features = set(target_features or [])
-    ignore_features = set(ignore_features or [])
-    row_id_features = set(row_id_features or [])
+    target_set = set(target_features or [])
+    ignore_set = set(ignore_features or [])
+    row_id_set = set(row_id_features or [])

Then update usages to reference target_set, ignore_set, row_id_set.

Also applies to: 91-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` around lines 22 - 24, Function parameters
target_features, ignore_features, and row_id_features are being reassigned to
sets (changing their type from Iterable[str] | None to set) which breaks strict
mypy checks; instead create new local variables (e.g., target_set, ignore_set,
row_id_set) initialized as set(target_features or []), set(ignore_features or
[]), set(row_id_features or []), and update all subsequent usages in the
functions (including the other occurrence around the second block where the same
reassignment appears) to reference these new local set variables rather than
mutating the parameter names.

115-115: In sorted(list(...)), the inner list() call is redundant.

sorted() already returns a list, so wrapping in list() first is unnecessary on both lines 115 and 143.

♻️ Fix
-            nominal_values = sorted(list(unique_values))
+            nominal_values = sorted(unique_values)

Also applies to: 143-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` at line 115, Replace the redundant list()
wrapping around sorted(...) — e.g., change nominal_values =
sorted(list(unique_values)) to nominal_values = sorted(unique_values) and do the
same for the other occurrence that uses sorted(list(...)) (the same pattern
elsewhere in this file). This removes the unnecessary list() call while
preserving behavior.

42-64: O(attributes × rows) complexity for missing value counting — refactor to single pass.

The missing value count loops over all rows for every attribute, making this O(attributes × rows). For a dataset with many attributes and rows, this is unnecessarily slow. A single pass over the data can compute missing counts for all attributes at once.

♻️ Proposed refactor: single-pass missing value counting

Compute the missing counts once before the per-attribute loop:

+    # Pre-compute missing counts in a single pass
+    missing_counts = [0] * len(attributes)
+    for row in data:
+        if isinstance(row, dict):
+            for idx in range(len(attributes)):
+                if idx in row and row[idx] is None:
+                    missing_counts[idx] += 1
+        else:
+            for idx, val in enumerate(row):
+                if val is None:
+                    missing_counts[idx] += 1
+
     features = []
     for i, (name, type_info) in enumerate(attributes):
         ...
-        # Count missing values
-        missing_count = 0
-        if data:
-            for row in data:
-                if isinstance(row, dict):
-                    if i not in row:
-                        pass
-                    elif row[i] is None:
-                        missing_count += 1
-                elif row[i] is None:
-                    missing_count += 1
+        missing_count = missing_counts[i]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` around lines 42 - 64, The missing-value logic
currently recomputes missing_count for each attribute (using variables/data
structures like missing_count, data, row, i), producing O(attributes × rows)
complexity; refactor by doing a single pass over data to build a missing_counts
array (length = number of attributes) once: initialize missing_counts to zeros,
iterate each row once, handle both dict (sparse) and list rows by only
incrementing missing_counts[idx] when a value is explicitly None (and treating
absent keys in sparse dict as default 0, not missing), then replace
per-attribute missing_count uses with missing_counts[i] in the later attribute
loop.
pyproject.toml (1)

23-24: Address unmaintained dependency and potential pyarrow type handling gap.

liac-arff has not been maintained upstream since August 31, 2020 (no PyPI releases since then). While actively used in this codebase, consider evaluating if it remains the best option for ARFF parsing, or plan to migrate if the library becomes problematic.

pyarrow adds string type flexibility in newer versions. The current code uses only pa.types.is_string() for type detection, which specifically matches the 32-bit UTF-8 string type. Newer PyArrow versions also support large_string (64-bit offsets) and string_view (variable-length view), which the current predicates would not catch. If Parquet files use these types, the type detection logic may misclassify them. Consider expanding type checks to handle these variants, or document the assumption that only pa.string() is used.

Both dependencies follow the project's pattern of unpinned versions, so version constraints are not strictly necessary, but the above functional concerns are worth addressing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 23 - 24, Update handling for the two dependency
concerns: for the unmaintained liac-arff dependency, add a TODO in the project
metadata and evaluate replacing it (e.g., migrate ARFF parsing to an actively
maintained library or an internal parser) so there's a planned migration path;
for pyarrow type detection, update the predicate that currently calls
pa.types.is_string() to also accept pa.types.is_large_string() and
pa.types.is_string_view() (or use a combined helper like is_arrow_text_type) so
Parquet/Arrow columns using 64-bit or view-backed string types are correctly
recognized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/feature_analysis.py`:
- Around line 108-115: The code builds nominal_values by iterating
chunk.dictionary which can include unreferenced entries; update the logic in the
feature_analysis code that handles column_data/chunk (the block that fills
unique_values and sets nominal_values) to derive values from the actual data —
either call chunk.unique() and use those results, or iterate the chunk's indices
to filter dictionary entries by referenced indices — then sort that result into
nominal_values so it only contains values present in the chunk.
- Around line 130-143: The current loop in feature_analysis.py that builds
unique_values for string columns (using column_data.chunks, unique_values,
FeatureType.NOMINAL and nominal_values) can blow up memory on high-cardinality
data; modify it to enforce a cardinality threshold (e.g., MAX_NOMINAL=256):
while iterating chunks/values stop adding once the set size exceeds the
threshold, immediately classify the column as FeatureType.STRING and set
nominal_values = None (or an empty/marker value) instead of continuing to
collect uniques; ensure the early-exit prevents scanning remaining chunks to
avoid the unbounded memory/CPU use.
- Around line 100-146: The current feature type detection in the feature
analysis block misclassifies decimal and temporal Parquet types as STRING;
update the logic in the same branch that checks pa_type (the block that
currently tests
pa.types.is_floating/is_integer/is_dictionary/is_string/is_boolean) to
explicitly detect pa.decimal128/pa.decimal256 (use pa.types.is_decimal128 and
pa.types.is_decimal256 or pa.types.is_decimal) and mark them as
FeatureType.NUMERIC (with appropriate conversion/scale handling for numeric
processing), and add explicit checks for temporal types (pa.types.is_date,
pa.types.is_time, pa.types.is_timestamp) and apply the chosen policy (e.g.,
treat timestamps as FeatureType.NUMERIC or FeatureType.STRING consistently),
ensuring nominal_values is set or left None as appropriate; adjust any
downstream code that expects numeric/date handling accordingly (refer to the
variables pa_type, data_type, nominal_values, and the table.column(i) logic to
extract values if needed).

---

Nitpick comments:
In `@pyproject.toml`:
- Around line 23-24: Update handling for the two dependency concerns: for the
unmaintained liac-arff dependency, add a TODO in the project metadata and
evaluate replacing it (e.g., migrate ARFF parsing to an actively maintained
library or an internal parser) so there's a planned migration path; for pyarrow
type detection, update the predicate that currently calls pa.types.is_string()
to also accept pa.types.is_large_string() and pa.types.is_string_view() (or use
a combined helper like is_arrow_text_type) so Parquet/Arrow columns using 64-bit
or view-backed string types are correctly recognized.

In `@src/core/feature_analysis.py`:
- Line 88: The current use of pq.read_table(source) (assignment to variable
table) reads the entire Parquet into memory; change this to a schema-only read
via pq.ParquetFile(source).schema_arrow to obtain schema/metadata, and then
perform per-column reads for any actual data needs (e.g., compute null counts by
calling pq.read_table(source, columns=[col]) or streaming column reads) instead
of loading everything at once; update the logic around table and any downstream
code that expects the full table to handle metadata-first + selective column
reads accordingly.
- Around line 22-24: Function parameters target_features, ignore_features, and
row_id_features are being reassigned to sets (changing their type from
Iterable[str] | None to set) which breaks strict mypy checks; instead create new
local variables (e.g., target_set, ignore_set, row_id_set) initialized as
set(target_features or []), set(ignore_features or []), set(row_id_features or
[]), and update all subsequent usages in the functions (including the other
occurrence around the second block where the same reassignment appears) to
reference these new local set variables rather than mutating the parameter
names.
- Line 115: Replace the redundant list() wrapping around sorted(...) — e.g.,
change nominal_values = sorted(list(unique_values)) to nominal_values =
sorted(unique_values) and do the same for the other occurrence that uses
sorted(list(...)) (the same pattern elsewhere in this file). This removes the
unnecessary list() call while preserving behavior.
- Around line 42-64: The missing-value logic currently recomputes missing_count
for each attribute (using variables/data structures like missing_count, data,
row, i), producing O(attributes × rows) complexity; refactor by doing a single
pass over data to build a missing_counts array (length = number of attributes)
once: initialize missing_counts to zeros, iterate each row once, handle both
dict (sparse) and list rows by only incrementing missing_counts[idx] when a
value is explicitly None (and treating absent keys in sparse dict as default 0,
not missing), then replace per-attribute missing_count uses with
missing_counts[i] in the later attribute loop.

In `@tests/core/test_feature_analysis.py`:
- Around line 97-138: Update the test_analyze_parquet test to add edge-case
Parquet tables: include an empty table (0 rows) to assert analyzers return
features with correct names, types and zero counts; add a decimal column
(pa.decimal128 or pa.decimal256) and assert analyze_parquet returns a NUMERIC or
DECIMAL-expected FeatureType (and adjust expectations if feature_analysis.py
currently maps decimals to STRING); and add temporal columns (pa.timestamp,
pa.date32/64) and assert they are recognized as TEMPORAL FeatureType with
appropriate parsing of values; reference analyze_parquet in your assertions so
the new checks will detect the type-inference gaps mentioned in
feature_analysis.py.

Comment on lines +100 to +146
        # Determine data_type and nominal_values
        nominal_values = None
        if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type):
            data_type = FeatureType.NUMERIC
        elif pa.types.is_dictionary(pa_type):
            data_type = FeatureType.NOMINAL
            # Extract nominal values from dictionary
            # We need to look at the data to get the dictionary values
            column_data = table.column(i)
            # A column can have multiple chunks
            unique_values = set()
            for chunk in column_data.chunks:
                dictionary = chunk.dictionary
                for val in dictionary:
                    unique_values.add(val.as_py())
            nominal_values = sorted(list(unique_values))
        elif pa.types.is_string(pa_type) or pa.types.is_boolean(pa_type):
            # For Parquet, strings might be nominal if they don't have a dictionary
            # We needed to "Extract unique values from the data" for nominals in non-ARFF
            # In OpenML, if it's used for classification, it's nominal.
            # If we don't know, we might have to guess or treat all strings as nominal if they have
            # few unique values.

            # For Parquet, let's assume if it's not numeric, it's nominal for now,
            # as that's common in ML datasets, unless it's explicitly string.

            # If it's boolean, it's definitely nominal [False, True].
            if pa.types.is_boolean(pa_type):
                data_type = FeatureType.NOMINAL
                nominal_values = ["false", "true"]
            else:
                # For string, let's extract unique values and see.
                column_data = table.column(i)
                unique_values = set()
                # For efficiency, we might not want to scan everything if it's huge
                for chunk in column_data.chunks:
                    for val in chunk.unique():
                        v = val.as_py()
                        if v is not None:
                            unique_values.add(str(v))

                # OpenML usually has a threshold, but let's just call it nominal if it's string
                data_type = FeatureType.NOMINAL
                nominal_values = sorted(list(unique_values))
        else:
            data_type = FeatureType.STRING
            nominal_values = None
⚠️ Potential issue | 🟠 Major

Missing handling for pa.decimal128/pa.decimal256 and date/time types — they'll be classified as STRING.

Parquet files commonly contain decimal and temporal types. Currently these fall through to the else branch (line 144-146) and get classified as FeatureType.STRING with no nominal values. Decimal should be NUMERIC, and date/time types need a deliberate policy.

🔧 Proposed fix for decimal types
-        if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type):
+        if pa.types.is_floating(pa_type) or pa.types.is_integer(pa_type) or pa.types.is_decimal(pa_type):
             data_type = FeatureType.NUMERIC

For temporal types (pa.types.is_date, pa.types.is_time, pa.types.is_timestamp), decide on a classification policy (NUMERIC timestamp or STRING) and handle explicitly rather than silently falling through.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` around lines 100 - 146, The current feature
type detection in the feature analysis block misclassifies decimal and temporal
Parquet types as STRING; update the logic in the same branch that checks pa_type
(the block that currently tests
pa.types.is_floating/is_integer/is_dictionary/is_string/is_boolean) to
explicitly detect pa.decimal128/pa.decimal256 (use pa.types.is_decimal128 and
pa.types.is_decimal256 or pa.types.is_decimal) and mark them as
FeatureType.NUMERIC (with appropriate conversion/scale handling for numeric
processing), and add explicit checks for temporal types (pa.types.is_date,
pa.types.is_time, pa.types.is_timestamp) and apply the chosen policy (e.g.,
treat timestamps as FeatureType.NUMERIC or FeatureType.STRING consistently),
ensuring nominal_values is set or left None as appropriate; adjust any
downstream code that expects numeric/date handling accordingly (refer to the
variables pa_type, data_type, nominal_values, and the table.column(i) logic to
extract values if needed).

Comment on lines +108 to +115
            column_data = table.column(i)
            # A column can have multiple chunks
            unique_values = set()
            for chunk in column_data.chunks:
                dictionary = chunk.dictionary
                for val in dictionary:
                    unique_values.add(val.as_py())
            nominal_values = sorted(list(unique_values))
⚠️ Potential issue | 🟡 Minor

Dictionary entries may include values not present in the data.

chunk.dictionary contains all entries stored in the dictionary, including those that may no longer be referenced by any index in the chunk (e.g., after filtering). This means nominal_values could list values that don't actually appear in the column data. If accurate nominal values are required, use chunk.unique() instead of chunk.dictionary, or filter dictionary entries by referenced indices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` around lines 108 - 115, The code builds
nominal_values by iterating chunk.dictionary which can include unreferenced
entries; update the logic in the feature_analysis code that handles
column_data/chunk (the block that fills unique_values and sets nominal_values)
to derive values from the actual data — either call chunk.unique() and use those
results, or iterate the chunk's indices to filter dictionary entries by
referenced indices — then sort that result into nominal_values so it only
contains values present in the chunk.

Comment on lines +130 to +143
            else:
                # For string, let's extract unique values and see.
                column_data = table.column(i)
                unique_values = set()
                # For efficiency, we might not want to scan everything if it's huge
                for chunk in column_data.chunks:
                    for val in chunk.unique():
                        v = val.as_py()
                        if v is not None:
                            unique_values.add(str(v))

                # OpenML usually has a threshold, but let's just call it nominal if it's string
                data_type = FeatureType.NOMINAL
                nominal_values = sorted(list(unique_values))
⚠️ Potential issue | 🟠 Major

Unbounded unique-value scan on string columns — potential performance/memory issue on high-cardinality data.

For string columns, every value across all chunks is scanned into a set. If a column has millions of distinct values (e.g., free-text, UUIDs), this will consume excessive memory and produce an impractically large nominal_values list. The inline comment on line 134 acknowledges this but doesn't mitigate it.

Consider adding a cardinality threshold: if unique values exceed a limit (e.g., 256), classify as FeatureType.STRING instead of NOMINAL and set nominal_values = None.

🔧 Sketch of a threshold-based approach
                 for chunk in column_data.chunks:
                     for val in chunk.unique():
                         v = val.as_py()
                         if v is not None:
                             unique_values.add(str(v))
+                        if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+                            break
+                    if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+                        break

-                data_type = FeatureType.NOMINAL
-                nominal_values = sorted(unique_values)
+                if len(unique_values) > MAX_NOMINAL_CARDINALITY:
+                    data_type = FeatureType.STRING
+                    nominal_values = None
+                else:
+                    data_type = FeatureType.NOMINAL
+                    nominal_values = sorted(unique_values)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/feature_analysis.py` around lines 130 - 143, The current loop in
feature_analysis.py that builds unique_values for string columns (using
column_data.chunks, unique_values, FeatureType.NOMINAL and nominal_values) can
blow up memory on high-cardinality data; modify it to enforce a cardinality
threshold (e.g., MAX_NOMINAL=256): while iterating chunks/values stop adding
once the set size exceeds the threshold, immediately classify the column as
FeatureType.STRING and set nominal_values = None (or an empty/marker value)
instead of continuing to collect uniques; ensure the early-exit prevents
scanning remaining chunks to avoid the unbounded memory/CPU use.
