Add Parquet dataset upload endpoint #252

Open

Vivekgupta008 wants to merge 2 commits into openml:main from Vivekgupta008:vivek/parquet-upload

Conversation

@Vivekgupta008

Hey @PGijsbers, please review this!
This PR adds POST /datasets/upload, a multipart-form endpoint that lets users upload .parquet datasets to OpenML.

What this does:

  • Accepts a .parquet file + JSON metadata (name, description, target attribute, visibility, licence, etc.)
  • Uses PyArrow to validate the file and automatically extract schema - column names, types (numeric/nominal/string), and per-column null counts. No manual feature specification needed.
  • Computes an MD5 checksum of the raw bytes for integrity.
  • Persists everything in a single consistent DB sequence: file → dataset → dataset_description → data_feature → data_quality → sets status to in_preparation.
  • Uploads the file to MinIO (credentials from env vars OPENML_MINIO_ACCESS_KEY / OPENML_MINIO_SECRET_KEY or config.toml).
  • Returns {"upload_dataset": {"id": <dataset_id>}}.
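The checksum and response shape from the list above can be sketched as follows (the byte payload is a placeholder, not a real Parquet file, and the dataset id is illustrative):

```python
import hashlib

# Placeholder payload; in the endpoint this would be the raw uploaded .parquet bytes.
file_bytes = b"example parquet payload"

# MD5 of the raw bytes, stored alongside the file record for integrity checks.
md5_checksum = hashlib.md5(file_bytes).hexdigest()

# Shape of the success response described above.
response = {"upload_dataset": {"id": 42}}
```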

Testing:

pytest tests/core/parquet_test.py tests/routers/openml/dataset_upload_test.py
# 23 passed
pre-commit run --all-files
# All hooks passed

Closes #198


coderabbitai bot commented Feb 21, 2026

Walkthrough

Adds Parquet dataset upload end-to-end support: new dependencies in pyproject.toml; MinIO configuration in config.toml and a config loader; a core.parquet module to read Parquet metadata and compute MD5; a core.storage module to upload bytes to MinIO via boto3; database helpers to insert files, datasets, descriptions, features, and qualities; a POST /datasets/upload FastAPI endpoint accepting multipart Parquet + JSON metadata that validates, persists records, uploads to storage, and updates references; Pydantic schemas for upload metadata/response; and unit tests for Parquet handling and the upload flow.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 48.78%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check: The title 'Add Parquet dataset upload endpoint' clearly and concisely summarizes the main change in the PR, which is the addition of a new POST /datasets/upload endpoint for Parquet file uploads.
  • Description check: The description is detailed and directly related to the changeset, explaining the endpoint functionality, metadata handling, PyArrow validation, MD5 checksum computation, DB persistence workflow, MinIO upload, and test results.
  • Linked Issues check: The PR implements the objectives from issue #198 to allow uploading of Parquet datasets, including validation, schema extraction, metadata persistence, and MinIO storage integration.
  • Out of Scope Changes check: All changes are directly related to implementing the Parquet dataset upload functionality and supporting infrastructure (dependencies, configuration, database helpers, schemas, and comprehensive tests).



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 5 issues and left some high-level feedback:

  • The upload flow performs multiple DB writes and then a MinIO upload without any explicit transaction/rollback handling, so consider wrapping the DB operations in a transaction and/or deferring persistence until after the object store upload to avoid partially created datasets when MinIO fails.
  • In insert_features and insert_qualities you execute one INSERT per item in Python; if these tables are expected to receive many rows per dataset, consider batching into a single bulk insert to reduce round-trips and improve performance.
  • The insert_file call stores an empty reference and the MinIO object key returned by upload_to_minio is never persisted, so the file record will not point to the actual uploaded object; consider either passing the key into insert_file or adding an update of the reference column after upload.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The upload flow performs multiple DB writes and then a MinIO upload without any explicit transaction/rollback handling, so consider wrapping the DB operations in a transaction and/or deferring persistence until after the object store upload to avoid partially created datasets when MinIO fails.
- In `insert_features` and `insert_qualities` you execute one `INSERT` per item in Python; if these tables are expected to receive many rows per dataset, consider batching into a single bulk insert to reduce round-trips and improve performance.
- The `insert_file` call stores an empty `reference` and the MinIO object key returned by `upload_to_minio` is never persisted, so the file record will not point to the actual uploaded object; consider either passing the key into `insert_file` or adding an update of the `reference` column after upload.
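The batching idea from the second point can be sketched with sqlite3's `executemany`, standing in for SQLAlchemy's executemany behaviour (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_feature(did INTEGER, `index` INTEGER, name TEXT)")

dataset_id = 42
features = [
    {"index": 0, "name": "sepal_length"},
    {"index": 1, "name": "sepal_width"},
    {"index": 2, "name": "label"},
]

# One round-trip for all rows instead of one INSERT per feature.
conn.executemany(
    "INSERT INTO data_feature(did, `index`, name) VALUES (:did, :index, :name)",
    [{"did": dataset_id, **feat} for feat in features],
)
row_count = conn.execute("SELECT COUNT(*) FROM data_feature").fetchone()[0]
```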

## Individual Comments

### Comment 1
<location> `src/routers/openml/datasets.py:93-95` </location>
<code_context>
+        ) from exc
+
+    # --- DB: insert file record (MinIO reference filled after we know the did) ---
+    file_id = database.datasets.insert_file(
+        file_name=filename,
+        reference="",
+        md5_hash=parquet_meta.md5_checksum,
+        connection=expdb_db,
</code_context>

<issue_to_address>
**issue (bug_risk):** The file `reference` is stored as an empty string and never updated with the MinIO object key.

`insert_file` is called with an empty `reference`, and the key returned by `upload_to_minio` is never persisted, so the DB never stores the real object location. Use the MinIO key either by (a) calling `insert_file` after `upload_to_minio` and passing the key, or (b) issuing a follow-up `UPDATE file SET reference = :key WHERE id = :file_id` with the returned key.
</issue_to_address>
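Option (b) could look roughly like this sketch, with sqlite3 standing in for the project's SQLAlchemy connection and a hypothetical object key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file(id INTEGER PRIMARY KEY, reference TEXT)")

# Mirrors the current insert_file call: reference starts out empty.
cur = conn.execute("INSERT INTO file(reference) VALUES ('')")
file_id = cur.lastrowid

# Hypothetical key as returned by upload_to_minio.
key = "datasets/42/dataset.parquet"
conn.execute("UPDATE file SET reference = ? WHERE id = ?", (key, file_id))

reference = conn.execute(
    "SELECT reference FROM file WHERE id = ?", (file_id,)
).fetchone()[0]
```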

### Comment 2
<location> `src/routers/openml/datasets.py:74-83` </location>
<code_context>
+
+    Raises ``ValueError`` if the bytes are not a valid Parquet file.
+    """
+    try:
+        buf = io.BytesIO(file_bytes)
+        pf = pq.ParquetFile(buf)
</code_context>

<issue_to_address>
**issue (bug_risk):** Dataset insert and MinIO upload are not wrapped in a transaction, so failures can leave inconsistent state.

If `upload_to_minio` fails after inserting the `file` and `dataset` rows, you’ll end up with a dataset record that has no corresponding object in MinIO. Consider treating the DB inserts and MinIO upload as a single transactional flow (DB transaction plus compensating delete on upload failure), or at least deleting the newly created DB records when the upload fails to avoid inconsistent state.
</issue_to_address>
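The transactional flow described above can be sketched with a plain DB transaction, using sqlite3 for illustration and a fake in place of the MinIO upload:

```python
import sqlite3

def upload_dataset(conn, file_bytes, upload_to_minio):
    """Insert rows and upload; roll everything back if the upload fails."""
    with conn:  # commits on success, rolls back the inserts on any exception
        cur = conn.execute("INSERT INTO file(name) VALUES ('data.parquet')")
        file_id = cur.lastrowid
        cur = conn.execute("INSERT INTO dataset(file_id) VALUES (?)", (file_id,))
        dataset_id = cur.lastrowid
        upload_to_minio(file_bytes, dataset_id)  # raising here aborts the inserts
        return dataset_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file(id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dataset(id INTEGER PRIMARY KEY, file_id INTEGER)")

def failing_upload(data, did):
    raise RuntimeError("MinIO unavailable")

try:
    upload_dataset(conn, b"...", failing_upload)
except RuntimeError:
    pass
# No orphaned rows remain after the failed upload.
orphans = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

# A successful upload commits both rows.
dataset_id = upload_dataset(conn, b"...", lambda data, did: None)
rows = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```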

### Comment 3
<location> `src/routers/openml/datasets.py:135-144` </location>
<code_context>
+        connection=expdb_db,
+    )
+
+    features = [
+        {
+            "index": col.index,
+            "name": col.name,
+            "data_type": col.data_type,
+            "is_target": col.name == upload_meta.default_target_attribute,
+            "is_row_identifier": False,
+            "is_ignore": False,
+            "number_of_missing_values": col.number_of_missing_values,
+        }
+        for col in parquet_meta.columns
+    ]
+    database.datasets.insert_features(
</code_context>

<issue_to_address>
**suggestion (bug_risk):** `default_target_attribute` is not validated against the Parquet columns.

`is_target` is derived from `upload_meta.default_target_attribute`, but we never verify that this attribute exists in `parquet_meta.columns`. If the client sends an invalid column name, the dataset is created without a target and no error. Please validate that all requested target columns exist in the Parquet schema and return a suitable 4xx error if they do not.

Suggested implementation:

```python
    # Validate that the requested default target attribute exists in the Parquet schema
    target_attribute = upload_meta.default_target_attribute
    if target_attribute is not None:
        parquet_column_names = {col.name for col in parquet_meta.columns}
        if target_attribute not in parquet_column_names:
            raise HTTPException(
                status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
                detail={
                    "code": "114",
                    "message": (
                        f"Default target attribute '{target_attribute}' "
                        "does not exist in the uploaded dataset columns."
                    ),
                },
            )

    features = [
        {
            "index": col.index,
            "name": col.name,
            "data_type": col.data_type,
            "is_target": col.name == target_attribute,
            "is_row_identifier": False,
            "is_ignore": False,
            "number_of_missing_values": col.number_of_missing_values,
        }
        for col in parquet_meta.columns
    ]

```

1. Ensure that `HTTPException` and `status` are already imported in this module (they likely are, given the surrounding code). If not, add:
   `from fastapi import HTTPException, status`.
2. If your API has a central error code registry, you may want to adjust `"code": "114"` to match your project's conventions or an existing code for invalid target attributes.
3. If there are additional ways to specify target columns (e.g. multiple targets) elsewhere in the payload schema, apply similar validation for those fields in the corresponding parts of the code.
</issue_to_address>

### Comment 4
<location> `src/config.toml:31-32` </location>
<code_context>
+bucket="datasets"
+# Credentials should be provided via environment variables:
+# OPENML_MINIO_ACCESS_KEY and OPENML_MINIO_SECRET_KEY
+access_key="minioadmin"
+secret_key="minioadmin"
</code_context>

<issue_to_address>
**🚨 suggestion (security):** Hard-coded MinIO credentials in config increase the risk of credential leakage.

The defaults are still committed in the file as `access_key`/`secret_key`. To avoid accidental use or leakage, remove these hard-coded values and require credentials to come only from environment variables (or an untracked local config).
</issue_to_address>

### Comment 5
<location> `tests/routers/openml/dataset_upload_test.py:109-118` </location>
<code_context>
+    assert response.status_code == HTTPStatus.UNPROCESSABLE_ENTITY
+
+
+def test_upload_parquet_success(api_client_authenticated: TestClient) -> None:
+    file_bytes = _make_parquet_bytes()
+
+    with (
+        patch("routers.openml.datasets.upload_to_minio", return_value="key"),
+        patch("database.datasets.insert_file", return_value=99),
+        patch("database.datasets.insert_dataset", return_value=42),
+        patch("database.datasets.insert_description"),
+        patch("database.datasets.insert_features"),
+        patch("database.datasets.insert_qualities"),
+        patch("database.datasets.update_status"),
+    ):
+        response = _upload(api_client_authenticated, file_bytes=file_bytes)
+
+    assert response.status_code == HTTPStatus.CREATED
+    body = response.json()
+    assert body["upload_dataset"]["id"] == 42
+
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test that validates the computed dataset qualities passed to `insert_qualities`

Currently we only assert the dataset id and that features are extracted correctly; the computed qualities passed to `insert_qualities` are never checked. Please add a test that patches `database.datasets.insert_qualities` with a capturing stub (like `capture_features`) and asserts that it is called with:
- `NumberOfInstances == num_rows` from the parquet file
- `NumberOfFeatures == num_columns`
- `NumberOfMissingValues` equal to the sum of per-column missing counts.

This will verify that `ParquetMeta` is correctly translated into OpenML data qualities.
</issue_to_address>
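The capturing-stub shape suggested above, sketched standalone with a stand-in class in place of the real `database.datasets` module (quality names and values are illustrative):

```python
from unittest.mock import patch

class datasets:
    """Stand-in for database.datasets in this sketch."""
    @staticmethod
    def insert_qualities(dataset_id, qualities, connection=None):
        raise AssertionError("should be patched in the test")

captured = {}

def capture_qualities(dataset_id, qualities, connection=None):
    captured["dataset_id"] = dataset_id
    captured["qualities"] = dict(qualities)

with patch.object(datasets, "insert_qualities", new=capture_qualities):
    # In the real test this call happens inside the upload endpoint; the
    # values would be derived from the uploaded Parquet file's metadata.
    datasets.insert_qualities(
        42,
        {"NumberOfInstances": 150, "NumberOfFeatures": 3, "NumberOfMissingValues": 1},
    )
```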



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (9)
src/routers/openml/datasets.py (1)

44-45: Annotated[UploadFile, ...] uses raw Ellipsis instead of File().

... (Ellipsis) is not a FastAPI form-field descriptor. FastAPI will infer this correctly for UploadFile, but the idiomatic declaration is Annotated[UploadFile, File()], which also generates richer OpenAPI docs.

♻️ Suggested change
-from fastapi import APIRouter, Body, Depends, Form, HTTPException, UploadFile
+from fastapi import APIRouter, Body, Depends, File, Form, HTTPException, UploadFile
 
 ...
 
 def upload_dataset(
-    file: Annotated[UploadFile, ...],
+    file: Annotated[UploadFile, File(description="The .parquet file to upload")],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 44 - 45, The parameter
annotation in upload_dataset uses Annotated[UploadFile, ...] (Ellipsis) which is
not the idiomatic FastAPI form-field descriptor; change the parameter to
Annotated[UploadFile, File()] so FastAPI treats it as an uploaded file and
generates richer OpenAPI docs, and add the necessary import for File from
fastapi; update the signature in upload_dataset and any related usages
accordingly.
src/core/storage.py (1)

11-11: Imports private symbols _config_file / _load_configuration instead of the public load_minio_configuration.

load_minio_configuration was added to config.py in this very PR precisely for this use case. _minio_config should call it instead of reaching into the private internals of another module.

♻️ Proposed refactor
-from config import _config_file, _load_configuration
+from config import _config_file, load_minio_configuration
 
 ...
 
 def _minio_config(file: Path = _config_file) -> dict[str, str]:
-    cfg = _load_configuration(file).get("minio", {})
+    cfg = load_minio_configuration(file)
     return {
         "endpoint_url": cfg.get("endpoint_url", "http://minio:9000"),
         "bucket": cfg.get("bucket", "datasets"),
         "access_key": os.environ.get(MINIO_ACCESS_KEY_ENV, cfg.get("access_key", "minioadmin")),
         "secret_key": os.environ.get(MINIO_SECRET_KEY_ENV, cfg.get("secret_key", "minioadmin")),
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/storage.py` at line 11, The file imports private symbols
_config_file and _load_configuration; update _minio_config to call the public
load_minio_configuration from config instead of accessing those internals:
replace uses of _config_file/_load_configuration with a call to
load_minio_configuration() (imported from config) and adapt any variable names
accordingly so _minio_config derives its settings from load_minio_configuration
rather than the private functions. Ensure you only change the import line and
the body of _minio_config to use the public API (load_minio_configuration) and
remove reliance on _config_file/_load_configuration.
src/database/datasets.py (1)

296-313: Per-row INSERT in a Python loop — consider executemany for bulk operations.

insert_features and insert_qualities execute one INSERT per row in a Python for loop. For datasets with many columns this causes N round-trips. Passing a list of parameter dicts to connection.execute(stmt, list_of_params) (SQLAlchemy's executemany) would reduce this to a single call.

♻️ Suggested refactor for insert_features
-    for feat in features:
-        connection.execute(
-            text("""
-                INSERT INTO data_feature(
-                    `did`, `index`, `name`, `data_type`,
-                    `is_target`, `is_row_identifier`, `is_ignore`,
-                    `NumberOfMissingValues`
-                )
-                VALUES (
-                    :did, :index, :name, :data_type,
-                    :is_target, :is_row_identifier, :is_ignore,
-                    :number_of_missing_values
-                )
-            """),
-            parameters={"did": dataset_id, **feat},
-        )
+    connection.execute(
+        text("""
+            INSERT INTO data_feature(
+                `did`, `index`, `name`, `data_type`,
+                `is_target`, `is_row_identifier`, `is_ignore`,
+                `NumberOfMissingValues`
+            )
+            VALUES (
+                :did, :index, :name, :data_type,
+                :is_target, :is_row_identifier, :is_ignore,
+                :number_of_missing_values
+            )
+        """),
+        [{"did": dataset_id, **feat} for feat in features],
+    )

Also applies to: 328-337

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/datasets.py` around lines 296 - 313, The current per-row loop in
insert_features (iterating over features and calling connection.execute for
each) should be converted to a single bulk insert using SQLAlchemy's
executemany: build a list of parameter dicts where each dict contains "did":
dataset_id plus the keys expected by the INSERT (:index, :name, :data_type,
:is_target, :is_row_identifier, :is_ignore, :number_of_missing_values), then
call connection.execute(text(<same INSERT SQL>), list_of_param_dicts) once; do
the analogous change in insert_qualities (use the qualities list, include "did":
dataset_id in each param dict, and execute one connection.execute with the
qualities parameter list) so both functions perform one bulk call instead of N
individual inserts.
src/core/parquet.py (3)

44-61: Reading the full table into memory to count nulls may be expensive for large files.

pf.read() (Line 61) materialises the entire dataset as an Arrow table on top of the file_bytes already in memory, roughly doubling peak RAM usage. For a v1 upload endpoint this is likely fine, but for large datasets consider extracting null counts from Parquet row-group statistics (pf.metadata.row_group(i).column(j).statistics) instead, which avoids loading column data. Statistics may not always be written, so you'd need a fallback — but it's worth a # TODO for future optimisation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/parquet.py` around lines 44 - 61, read_parquet_metadata currently
calls pf.read() which materialises the entire Parquet file into memory (doubling
peak RAM) to compute per-column null counts; change the implementation to first
attempt to compute null/nonnull counts from Parquet row-group statistics via
pf.metadata.row_group(i).column(j).statistics (iterating row groups and columns)
and only fall back to pf.read() if statistics are missing for any column; add a
TODO comment in read_parquet_metadata noting this optimization and implement the
statistics-first approach so memory-heavy pf.read() is avoided for large files.

13-25: Consider adding a comment explaining temporal type handling.

The map_arrow_type function correctly falls back to STRING for temporal types (timestamp, date, duration, etc.) because FeatureType only supports NUMERIC, NOMINAL, and STRING. Whether temporal types should instead map to NUMERIC is a design choice worth discussing, but the current implementation is correct given the available enum values. Adding a comment clarifying that temporal types intentionally map to STRING would help future maintainers understand this design decision.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/parquet.py` around lines 13 - 25, In map_arrow_type, add a brief
inline comment explaining that temporal PyArrow types (timestamp, date,
duration, etc.) are intentionally not mapped to NUMERIC or a separate temporal
FeatureType and therefore fall through to returning FeatureType.STRING because
FeatureType only supports NUMERIC, NOMINAL, and STRING; mention this is a
deliberate design choice to preserve temporal values as strings for downstream
processing and to aid future maintainers reviewing the mapping logic in
map_arrow_type.

49-54: Narrow exception handling to avoid conflating unrelated errors with invalid Parquet format.

The broad except Exception will catch and misreport unrelated failures (e.g., MemoryError, SystemError) as "not a valid Parquet file". While the suggested exceptions (ArrowInvalid, ArrowIOError) do exist and are commonly raised by ParquetFile, a more robust approach is to catch the parent ArrowException class, which covers all Parquet parsing errors (including ArrowTypeError and ArrowNotImplementedError).

Suggested narrower catch
     try:
         buf = io.BytesIO(file_bytes)
         pf = pq.ParquetFile(buf)
-    except Exception as exc:
+    except (pa.ArrowException, OSError) as exc:
         msg = "File is not a valid Parquet file."
         raise ValueError(msg) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/parquet.py` around lines 49 - 54, The current broad except around
creating pq.ParquetFile (buf = io.BytesIO(file_bytes); pf = pq.ParquetFile(buf))
should be narrowed to catch pyarrow parsing errors only: replace the generic
"except Exception as exc" with "except ArrowException as exc" (import
ArrowException from pyarrow.errors or pyarrow) so only Arrow-related Parquet
parsing errors are reported as "File is not a valid Parquet file."; keep raising
ValueError(msg) from exc to preserve chaining.
tests/routers/openml/dataset_upload_test.py (3)

149-174: Good feature-extraction validation; unused args in capture_features are expected.

The capture_features callback correctly matches the expected signature of insert_features. The static analysis warnings about unused dataset_id and connection are false positives — those parameters are required to match the patched function's call signature. The assertions for is_target based on default_target_attribute are a valuable check.

One minor note: the test only verifies that sepal_length and label are present and checks their is_target flags. It doesn't verify sepal_width or the total feature count. Consider adding assert len(captured) == 3 for completeness.

Optional: add a count assertion
     assert response.status_code == HTTPStatus.CREATED
+    assert len(captured) == 3
     names = [f["name"] for f in captured]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/dataset_upload_test.py` around lines 149 - 174, Add an
assertion to verify the total number of features captured in
test_upload_features_extracted_correctly: after the upload and before the
existing per-feature checks, assert that len(captured) == 3 to ensure
sepal_length, sepal_width and label were all extracted; this uses the existing
capture_features callback (patched as database.datasets.insert_features) and the
captured list used later in the test.

58-62: Single execute().one() return value may mask multi-query issues.

The mock returns (42,) for every execute().one() call. If the endpoint issues multiple queries expecting different results (e.g., insert_file expects a file ID, insert_dataset expects a dataset ID), this single canned response could either silently produce wrong data or accidentally make tests pass. Consider configuring side_effect with a sequence of return values matching the expected call order if the endpoint makes multiple DB calls through the connection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/dataset_upload_test.py` around lines 58 - 62, The
current mock_connection fixture always returns the same tuple from
execute().one(), which can hide multi-query bugs; update mock_connection (the
MagicMock named conn for spec=Connection) so conn.execute().one uses a
side_effect sequence of tuples matching the expected DB-call order (e.g., first
(file_id,), then (dataset_id,), etc.), or alternatively set
conn.execute.side_effect to a sequence of MagicMocks whose one() methods return
the different tuples in order, ensuring each DB call in the tested flow
(insert_file, insert_dataset, ...) gets the correct mock result.

104-110: Consider asserting the error code for invalid metadata JSON.

The other error-path tests assert both the HTTP status and the specific error code. This test only checks for 422 UNPROCESSABLE_ENTITY but doesn't verify the response body. Adding an error code assertion would keep the pattern consistent and guard against the wrong 422 being returned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/dataset_upload_test.py` around lines 104 - 110, The
test_upload_invalid_metadata_json currently only asserts
HTTPStatus.UNPROCESSABLE_ENTITY; update it to also parse response.json() and
assert the specific error code returned for invalid metadata JSON (e.g. assert
response.json()["error"]["code"] == "INVALID_METADATA_JSON" or the project’s
equivalent constant), matching the pattern used in other error-path tests for
/datasets/upload so the test verifies the correct error is returned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/config.toml`:
- Around line 23-27: The two config keys have inconsistent trailing
slashes—routing.minio_url vs minio.endpoint_url—causing divergent URL
construction; pick a canonical format and make both values consistent (either
both with a trailing slash or both without) so code that concatenates paths
using minio_url or minio.endpoint_url behaves identically, then update any call
sites that assume a specific trailing slash if needed; specifically modify the
values for minio_url and minio.endpoint_url to match each other.
- Around line 31-32: Replace the hard-coded MinIO credentials by removing the
plaintext values for access_key and secret_key and replacing them with
non-sensitive placeholders (e.g., ACCESS_KEY_PLACEHOLDER,
SECRET_KEY_PLACEHOLDER) so the repo no longer contains real secrets; ensure the
configuration continues to support the existing env-var override path (the env
comment above) so runtime credentials come from environment variables and not
these fields, and update any documentation or examples to instruct developers to
set the corresponding environment variables instead of committing real keys.

In `@src/database/datasets.py`:
- Line 205: The function definition for insert_dataset includes an unnecessary
"# noqa: PLR0913" directive; remove the trailing directive from the def line
(i.e., change "def insert_dataset(  # noqa: PLR0913" to simply "def
insert_dataset(") so the code no longer contains a non-applicable Ruff
suppression for PLR0913.

In `@src/routers/openml/datasets.py`:
- Around line 93-98: The file record's reference is never set because the return
value from upload_to_minio is discarded after calling
database.datasets.insert_file; capture the returned object key from
upload_to_minio (e.g., key = upload_to_minio(...)) and then persist it by adding
a new helper database.datasets.update_file_reference(*, file_id: int, reference:
str, connection: Connection) that executes an UPDATE on file SET reference =
:reference WHERE id = :file_id, and call this helper with the insert_file return
value (file_id) and the upload key immediately after a successful upload
(replace the current discarded call near insert_file/upload_to_minio locations).
- Around line 92-126: The insert_file and insert_dataset calls create DB rows
before upload_to_minio, so if upload_to_minio(file_bytes, dataset_id) raises,
file_id and dataset_id become orphaned; fix by performing the inserts and MinIO
upload inside a single transaction (use the expdb_db transaction context or
begin/rollback) so a failure rolls back both insert_file and insert_dataset, or
if transactions aren’t available, catch the RuntimeError in the except block and
delete the created rows using database.datasets.delete_file(file_id) and
database.datasets.delete_dataset(dataset_id) before re-raising; reference
insert_file, insert_dataset, upload_to_minio, file_id, dataset_id, and expdb_db
in your changes.
- Around line 56-63: The authentication branch that currently raises
HTTPException with HTTPStatus.UNAUTHORIZED in the user is None block should be
changed to use the existing create_authentication_failed_error() behavior so it
returns HTTPStatus.PRECONDITION_FAILED and the canonical code "103"; replace the
custom HTTPException (or change its status_code) in that user-is-None branch to
call or mirror create_authentication_failed_error() instead of using
HTTPStatus.UNAUTHORIZED to keep auth errors consistent across the API.

In `@tests/core/parquet_test.py`:
- Around line 84-86: The test currently hashes the same bytes object twice;
change test_read_parquet_metadata_md5_is_deterministic to call
_make_parquet_bytes twice (e.g., data1 = _make_parquet_bytes(...); data2 =
_make_parquet_bytes(...)) and then assert that
read_parquet_metadata(data1).md5_checksum ==
read_parquet_metadata(data2).md5_checksum (or assert data1 == data2) so the
determinism of _make_parquet_bytes / PyArrow output is actually verified;
references: test_read_parquet_metadata_md5_is_deterministic,
_make_parquet_bytes, read_parquet_metadata.

---

Nitpick comments:
In `@src/core/parquet.py`:
- Around line 44-61: read_parquet_metadata currently calls pf.read() which
materialises the entire Parquet file into memory (doubling peak RAM) to compute
per-column null counts; change the implementation to first attempt to compute
null/nonnull counts from Parquet row-group statistics via
pf.metadata.row_group(i).column(j).statistics (iterating row groups and columns)
and only fall back to pf.read() if statistics are missing for any column; add a
TODO comment in read_parquet_metadata noting this optimization and implement the
statistics-first approach so memory-heavy pf.read() is avoided for large files.
- Around line 13-25: In map_arrow_type, add a brief inline comment explaining
that temporal PyArrow types (timestamp, date, duration, etc.) are intentionally
not mapped to NUMERIC or a separate temporal FeatureType and therefore fall
through to returning FeatureType.STRING because FeatureType only supports
NUMERIC, NOMINAL, and STRING; mention this is a deliberate design choice to
preserve temporal values as strings for downstream processing and to aid future
maintainers reviewing the mapping logic in map_arrow_type.
- Around line 49-54: The current broad except around creating pq.ParquetFile
(buf = io.BytesIO(file_bytes); pf = pq.ParquetFile(buf)) should be narrowed to
catch pyarrow parsing errors only: replace the generic "except Exception as exc"
with "except ArrowException as exc" (import ArrowException from pyarrow.errors
or pyarrow) so only Arrow-related Parquet parsing errors are reported as "File
is not a valid Parquet file."; keep raising ValueError(msg) from exc to preserve
chaining.

In `@src/core/storage.py`:
- Line 11: The file imports private symbols _config_file and
_load_configuration; update _minio_config to call the public
load_minio_configuration from config instead of accessing those internals:
replace uses of _config_file/_load_configuration with a call to
load_minio_configuration() (imported from config) and adapt any variable names
accordingly so _minio_config derives its settings from load_minio_configuration
rather than the private functions. Ensure you only change the import line and
the body of _minio_config to use the public API (load_minio_configuration) and
remove reliance on _config_file/_load_configuration.

In `@src/database/datasets.py`:
- Around line 296-313: The current per-row loop in insert_features (iterating
over features and calling connection.execute for each) should be converted to a
single bulk insert using SQLAlchemy's executemany: build a list of parameter
dicts where each dict contains "did": dataset_id plus the keys expected by the
INSERT (:index, :name, :data_type, :is_target, :is_row_identifier, :is_ignore,
:number_of_missing_values), then call connection.execute(text(<same INSERT
SQL>), list_of_param_dicts) once; do the analogous change in insert_qualities
(use the qualities list, include "did": dataset_id in each param dict, and
execute one connection.execute with the qualities parameter list) so both
functions perform one bulk call instead of N individual inserts.
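The single-call shape suggested above, one parameter dict per row and one `execute`, is illustrated below with stdlib `sqlite3` as a stand-in for SQLAlchemy's `connection.execute(text(...), params_list)` executemany path; the table layout is simplified:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_feature (did INTEGER, idx INTEGER, name TEXT, data_type TEXT)"
)

dataset_id = 131
features = [("sepal_length", "numeric"), ("sepal_width", "numeric"), ("label", "nominal")]
params = [
    {"did": dataset_id, "idx": i, "name": name, "data_type": dtype}
    for i, (name, dtype) in enumerate(features)
]
# One round trip instead of len(features) individual INSERTs.
conn.executemany(
    "INSERT INTO data_feature VALUES (:did, :idx, :name, :data_type)",
    params,
)
row_count = conn.execute("SELECT COUNT(*) FROM data_feature").fetchone()[0]  # 3
```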

In `@src/routers/openml/datasets.py`:
- Around line 44-45: The parameter annotation in upload_dataset uses
Annotated[UploadFile, ...] (Ellipsis) which is not the idiomatic FastAPI
form-field descriptor; change the parameter to Annotated[UploadFile, File()] so
FastAPI treats it as an uploaded file and generates richer OpenAPI docs, and add
the necessary import for File from fastapi; update the signature in
upload_dataset and any related usages accordingly.

In `@tests/routers/openml/dataset_upload_test.py`:
- Around line 149-174: Add an assertion to verify the total number of features
captured in test_upload_features_extracted_correctly: after the upload and
before the existing per-feature checks, assert that len(captured) == 3 to ensure
sepal_length, sepal_width and label were all extracted; this uses the existing
capture_features callback (patched as database.datasets.insert_features) and the
captured list used later in the test.
- Around line 58-62: The current mock_connection fixture always returns the same
tuple from execute().one(), which can hide multi-query bugs; update
mock_connection (the MagicMock named conn for spec=Connection) so
conn.execute().one uses a side_effect sequence of tuples matching the expected
DB-call order (e.g., first (file_id,), then (dataset_id,), etc.), or
alternatively set conn.execute.side_effect to a sequence of MagicMocks whose
one() methods return the different tuples in order, ensuring each DB call in the
tested flow (insert_file, insert_dataset, ...) gets the correct mock result.
- Around line 104-110: The test_upload_invalid_metadata_json currently only
asserts HTTPStatus.UNPROCESSABLE_ENTITY; update it to also parse response.json()
and assert the specific error code returned for invalid metadata JSON (e.g.
assert response.json()["error"]["code"] == "INVALID_METADATA_JSON" or the
project’s equivalent constant), matching the pattern used in other error-path
tests for /datasets/upload so the test verifies the correct error is returned.
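The `side_effect` sequencing suggested above for `mock_connection` can be sketched with stdlib `unittest.mock`; the IDs and call order here are illustrative:

```python
from unittest.mock import MagicMock

conn = MagicMock()

# One result mock per expected DB call, in order: insert_file, insert_dataset.
file_result, dataset_result = MagicMock(), MagicMock()
file_result.one.return_value = (10,)      # file_id
dataset_result.one.return_value = (131,)  # dataset_id
conn.execute.side_effect = [file_result, dataset_result]

file_id = conn.execute("INSERT INTO file ...").one()[0]
dataset_id = conn.execute("INSERT INTO dataset ...").one()[0]
```

Any extra, unexpected `execute` call then raises `StopIteration` instead of silently returning a stale tuple, which is exactly what surfaces multi-query bugs.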

Comment on lines 23 to +27
minio_url="http://minio:9000/"
server_url="http://php-api:80/"

[minio]
endpoint_url="http://minio:9000"

⚠️ Potential issue | 🟡 Minor

Duplicate MinIO endpoint with mismatched trailing slash.

routing.minio_url (Line 23) ends with / while minio.endpoint_url (Line 27) does not. If both are used in URL construction these will diverge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.toml` around lines 23 - 27, The two config keys have inconsistent
trailing slashes—routing.minio_url vs minio.endpoint_url—causing divergent URL
construction; pick a canonical format and make both values consistent (either
both with a trailing slash or both without) so code that concatenates paths
using minio_url or minio.endpoint_url behaves identically, then update any call
sites that assume a specific trailing slash if needed; specifically modify the
values for minio_url and minio.endpoint_url to match each other.
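One way to make the divergence harmless regardless of which config value is picked: normalise the endpoint before building object URLs (the helper name is hypothetical):

```python
def object_url(endpoint: str, bucket: str, key: str) -> str:
    # rstrip makes "http://minio:9000/" and "http://minio:9000" equivalent.
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"


url_a = object_url("http://minio:9000/", "datasets", "131.pq")
url_b = object_url("http://minio:9000", "datasets", "131.pq")
```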

Comment on lines +56 to +63
if user is None:
raise HTTPException(
status_code=HTTPStatus.UNAUTHORIZED,
detail={
"code": "103",
"message": "Authentication failed. Please provide a valid API key.",
},
)

⚠️ Potential issue | 🟡 Minor

Authentication failure uses 401 UNAUTHORIZED; rest of API uses 412 PRECONDITION_FAILED.

create_authentication_failed_error() (Line 199) returns 412 with code: "103". The new endpoint raises 401 with the same code. Clients handling auth errors uniformly will break.

🔧 Proposed fix
-    if user is None:
-        raise HTTPException(
-            status_code=HTTPStatus.UNAUTHORIZED,
-            detail={
-                "code": "103",
-                "message": "Authentication failed. Please provide a valid API key.",
-            },
-        )
+    if user is None:
+        raise create_authentication_failed_error()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 56 - 63, The authentication
branch that currently raises HTTPException with HTTPStatus.UNAUTHORIZED in the
user is None block should be changed to use the existing
create_authentication_failed_error() behavior so it returns
HTTPStatus.PRECONDITION_FAILED and the canonical code "103"; replace the custom
HTTPException (or change its status_code) in that user-is-None branch to call or
mirror create_authentication_failed_error() instead of using
HTTPStatus.UNAUTHORIZED to keep auth errors consistent across the API.

Comment on lines +84 to +86
def test_read_parquet_metadata_md5_is_deterministic() -> None:
data = _make_parquet_bytes(x=pa.array([1, 2, 3]))
assert read_parquet_metadata(data).md5_checksum == read_parquet_metadata(data).md5_checksum

⚠️ Potential issue | 🟡 Minor

MD5 determinism test is trivially true — it proves nothing.

data is constructed once and the same bytes object is passed to both read_parquet_metadata calls. Hashing the same bytes instance twice will always match. To meaningfully verify that _make_parquet_bytes itself is deterministic (i.e., PyArrow produces byte-identical output across invocations), the test should build data twice independently:

♻️ Suggested fix
 def test_read_parquet_metadata_md5_is_deterministic() -> None:
-    data = _make_parquet_bytes(x=pa.array([1, 2, 3]))
-    assert read_parquet_metadata(data).md5_checksum == read_parquet_metadata(data).md5_checksum
+    data1 = _make_parquet_bytes(x=pa.array([1, 2, 3]))
+    data2 = _make_parquet_bytes(x=pa.array([1, 2, 3]))
+    assert read_parquet_metadata(data1).md5_checksum == read_parquet_metadata(data2).md5_checksum
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/parquet_test.py` around lines 84 - 86, The test currently hashes
the same bytes object twice; change
test_read_parquet_metadata_md5_is_deterministic to call _make_parquet_bytes
twice (e.g., data1 = _make_parquet_bytes(...); data2 = _make_parquet_bytes(...))
and then assert that read_parquet_metadata(data1).md5_checksum ==
read_parquet_metadata(data2).md5_checksum (or assert data1 == data2) so the
determinism of _make_parquet_bytes / PyArrow output is actually verified;
references: test_read_parquet_metadata_md5_is_deterministic,
_make_parquet_bytes, read_parquet_metadata.

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
src/core/storage.py (2)

11-11: Depending on private config symbols is fragile.

_config_file and _load_configuration are private symbols of config.py. If the config module ever changes its internals, storage.py breaks without a visible API boundary being crossed. Consider exposing a small public helper (e.g. get_minio_config()) from config.py instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/storage.py` at line 11, storage.py imports private symbols
_config_file and _load_configuration from config.py, which is brittle; add a
public helper in config.py (e.g. get_minio_config()) that returns the
MinIO-related settings and update storage.py to import and use that public
function instead of _config_file/_load_configuration; look for references to
_config_file and _load_configuration in storage.py (and any calls to them) and
replace them with the new get_minio_config() call so storage uses the public
API.

47-66: boto3 client is re-created on every upload call.

Each invocation of upload_to_minio constructs a new S3 client, which re-resolves credentials and allocates a new connection pool. For a low-volume endpoint this is acceptable, but a module-level or lazily-initialised cached client avoids repeated overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/storage.py` around lines 47 - 66, The upload_to_minio function
recreates a boto3 S3 client on every call (see upload_to_minio and
_minio_config), causing repeated credential resolution and connection pool
allocation; fix by creating a module-level cached client (or a lazy-initialised
singleton) that is created once using _minio_config and reused by
upload_to_minio (keep existing error handling and the same client.upload_fileobj
call), and ensure the cached client is refreshed if configuration changes or on
explicit reset.
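The caching pattern suggested above, shown with `functools.lru_cache` and a stand-in factory (in storage.py the body would instead call `boto3.client("s3", ...)` with the MinIO settings):

```python
from functools import lru_cache


class FakeS3Client:
    """Stand-in for the boto3 S3 client so the sketch stays dependency-free."""


@lru_cache(maxsize=1)
def get_s3_client() -> FakeS3Client:
    # Constructed once; later calls reuse the same client and connection pool.
    return FakeS3Client()


first = get_s3_client()
second = get_s3_client()      # same object, no re-construction
get_s3_client.cache_clear()   # explicit reset, e.g. after a config change
third = get_s3_client()
```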
tests/routers/openml/dataset_upload_test.py (1)

152-172: Add a test for uploading without a default_target_attribute.

The critical bug where an empty default_target_attribute (the default) always triggers a 422 is not caught by any test. A test like the one below would have caught the issue:

def test_upload_no_target_attribute_succeeds(api_client_authenticated: TestClient) -> None:
    """Uploading without default_target_attribute should succeed (no target is valid)."""
    with (
        patch("routers.openml.datasets.upload_to_minio", return_value="key"),
        patch("database.datasets.insert_file", return_value=_EXPECTED_FILE_ID),
        patch("database.datasets.insert_dataset", return_value=_EXPECTED_DATASET_ID),
        patch("database.datasets.update_file_reference"),
        patch("database.datasets.insert_description"),
        patch("database.datasets.insert_features"),
        patch("database.datasets.insert_qualities"),
        patch("database.datasets.update_status"),
    ):
        response = _upload(
            api_client_authenticated,
            file_bytes=_make_parquet_bytes(),
            extra_meta={"default_target_attribute": ""},
        )
    assert response.status_code == HTTPStatus.CREATED
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/dataset_upload_test.py` around lines 152 - 172, Add a
new test that verifies uploading without a default_target_attribute succeeds:
create a test named test_upload_no_target_attribute_succeeds in
tests/routers/openml/dataset_upload_test.py which uses _make_parquet_bytes() and
the same patch context as test_upload_parquet_success (patching
routers.openml.datasets.upload_to_minio,
database.datasets.insert_file/insert_dataset/update_file_reference/insert_description/insert_features/insert_qualities/update_status),
call _upload(api_client_authenticated, file_bytes=_make_parquet_bytes(),
extra_meta={"default_target_attribute": ""}), and assert response.status_code ==
HTTPStatus.CREATED and the returned upload_dataset id equals
_EXPECTED_DATASET_ID so the empty default_target_attribute case is covered.
src/schemas/datasets/upload.py (1)

17-20: default_target_attribute type contradicts the router's is not None guard.

The field is typed as str (never None), but the router checks if target_attribute is not None: (see src/routers/openml/datasets.py, Line 94), which is always True. As a result, any upload that omits default_target_attribute sends "" through the validation, which is not a valid column name and immediately raises a 422. Consider aligning the type with the intended optional semantics:

♻️ Proposed change
-    default_target_attribute: str = Field(
-        default="",
-        description="Comma-separated column name(s) to use as the prediction target.",
-    )
+    default_target_attribute: str | None = Field(
+        default=None,
+        description="Comma-separated column name(s) to use as the prediction target.",
+    )

And update the router guard accordingly:

-    if target_attribute is not None:
+    if target_attribute:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/datasets/upload.py` around lines 17 - 20, Change the schema field
default_target_attribute from a non-optional str to Optional[str] by setting its
type to Optional[str] and default to None (e.g., default_target_attribute:
Optional[str] = Field(default=None, ...)) so omitted values are None instead of
"". Then update the router code that reads target_attribute (the variable
checked as target_attribute) to keep the existing "if target_attribute is not
None:" guard (it will now behave correctly), and ensure any downstream
validation that assumes a string handles None appropriately (e.g., skip
validation or conversion when target_attribute is None).
src/routers/openml/datasets.py (1)

83-83: Unbounded in-memory read — no file-size limit.

file.file.read() buffers the entire upload into memory with no guard. A large or malicious file could exhaust worker memory. Consider enforcing an upper bound (e.g. 500 MB) before reading:

MAX_UPLOAD_BYTES = 500 * 1024 * 1024
file_bytes = file.file.read(MAX_UPLOAD_BYTES + 1)
if len(file_bytes) > MAX_UPLOAD_BYTES:
    raise HTTPException(status_code=HTTPStatus.REQUEST_ENTITY_TOO_LARGE, ...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` at line 83, The current unbounded read using
file.file.read() (producing file_bytes) can exhaust memory; replace it by
reading at most a configured MAX_UPLOAD_BYTES (e.g. 500*1024*1024) — call
file.file.read(MAX_UPLOAD_BYTES + 1), then if len(file_bytes) > MAX_UPLOAD_BYTES
raise an HTTPException with status HTTPStatus.REQUEST_ENTITY_TOO_LARGE and an
explanatory message; add the MAX_UPLOAD_BYTES constant near the top of the
module and use the same symbols (file.file.read, file_bytes, HTTPException,
HTTPStatus.REQUEST_ENTITY_TOO_LARGE) so the handler rejects oversized uploads
instead of buffering them into memory.
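A concrete, runnable version of the bounded-read guard sketched in the comment above (the 500 MB cap is the review's example, not an existing project constant):

```python
import io

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumed 500 MB cap


def read_bounded(stream: io.BufferedIOBase, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    # Reading one byte past the limit makes oversized uploads detectable
    # without ever buffering more than limit + 1 bytes.
    data = stream.read(limit + 1)
    if len(data) > limit:
        # In the router this would be an HTTPException with
        # HTTPStatus.REQUEST_ENTITY_TOO_LARGE instead of ValueError.
        raise ValueError(f"upload exceeds {limit} bytes")
    return data


small = read_bounded(io.BytesIO(b"x" * 10), limit=1024)
```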
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 135-149: Post-upload DB failures can leave the MinIO object
orphaned; after calling upload_to_minio(...) and getting minio_key you must wrap
the subsequent DB work (database.datasets.update_file_reference,
insert_description/insert_features/insert_qualities, update_status) in an outer
try/except so that if any exception occurs you call the MinIO cleanup (e.g.,
delete_from_minio(minio_key) or equivalent) before re-raising the error; use the
existing expdb_db/expdb_connection transaction semantics (allowing it to
rollback the DB) but ensure the MinIO delete runs in the except block, log any
delete errors, and then raise the appropriate HTTPException to preserve current
behavior.
- Around line 190-195: Replace the hard-coded status string in the update call
with the enum constant: in the call to database.datasets.update_status (the line
passing status="in_preparation"), use DatasetStatus.IN_PREPARATION instead;
ensure DatasetStatus is imported where this file defines dataset handling so the
call becomes status=DatasetStatus.IN_PREPARATION while leaving dataset_id,
user.user_id, and connection=expdb_db unchanged.
- Around line 92-106: The guard checking the default target uses "if
target_attribute is not None" but upload_meta.default_target_attribute is always
a str, so empty string slips through and triggers a false-positive validation;
change the check to skip validation for empty/falsy targets (e.g., use "if
target_attribute" or explicit empty-string check) so that only non-empty
target_attribute values are validated against parquet_column_names
(parquet_meta.columns) before raising the HTTPException in this block.

In `@tests/routers/openml/dataset_upload_test.py`:
- Line 74: The trailing "# noqa: ANN401" on the function return annotation (the
")-> Any:  # noqa: ANN401" line) is unnecessary because ANN401 is not enabled;
remove the "# noqa: ANN401" directive from that function signature (the return
type annotation that currently uses "-> Any") so the line reads simply ")->
Any:" and run the tests/linter to confirm the RUF100 warning is resolved.

---

Duplicate comments:
In `@src/database/datasets.py`:
- Line 218: The function definition for insert_dataset includes an unused noqa
directive "# noqa: PLR0913"; remove this directive from the def line (the
"insert_dataset" function signature) so the linter warning RUF100 is
resolved—simply delete the trailing "# noqa: PLR0913" token in the
insert_dataset declaration and ensure no other noqa comments are present for the
same rule.

In `@src/routers/openml/datasets.py`:
- Around line 56-63: The endpoint raises HTTPException with
HTTPStatus.UNAUTHORIZED when user is None, which is inconsistent with the rest
of the API that uses create_authentication_failed_error() (which returns
HTTPStatus.PRECONDITION_FAILED); update the error handling in the user is None
block to use the same mechanism/status as create_authentication_failed_error()
(or change the status to HTTPStatus.PRECONDITION_FAILED) so authentication
failures are returned uniformly across the API, referencing the user is None
branch and create_authentication_failed_error().

---

Nitpick comments:
In `@src/core/storage.py`:
- Line 11: storage.py imports private symbols _config_file and
_load_configuration from config.py, which is brittle; add a public helper in
config.py (e.g. get_minio_config()) that returns the MinIO-related settings and
update storage.py to import and use that public function instead of
_config_file/_load_configuration; look for references to _config_file and
_load_configuration in storage.py (and any calls to them) and replace them with
the new get_minio_config() call so storage uses the public API.
- Around line 47-66: The upload_to_minio function recreates a boto3 S3 client on
every call (see upload_to_minio and _minio_config), causing repeated credential
resolution and connection pool allocation; fix by creating a module-level cached
client (or a lazy-initialised singleton) that is created once using
_minio_config and reused by upload_to_minio (keep existing error handling and
the same client.upload_fileobj call), and ensure the cached client is refreshed
if configuration changes or on explicit reset.

In `@src/routers/openml/datasets.py`:
- Line 83: The current unbounded read using file.file.read() (producing
file_bytes) can exhaust memory; replace it by reading at most a configured
MAX_UPLOAD_BYTES (e.g. 500*1024*1024) — call file.file.read(MAX_UPLOAD_BYTES +
1), then if len(file_bytes) > MAX_UPLOAD_BYTES raise an HTTPException with
status HTTPStatus.REQUEST_ENTITY_TOO_LARGE and an explanatory message; add the
MAX_UPLOAD_BYTES constant near the top of the module and use the same symbols
(file.file.read, file_bytes, HTTPException, HTTPStatus.REQUEST_ENTITY_TOO_LARGE)
so the handler rejects oversized uploads instead of buffering them into memory.

In `@src/schemas/datasets/upload.py`:
- Around line 17-20: Change the schema field default_target_attribute from a
non-optional str to Optional[str] by setting its type to Optional[str] and
default to None (e.g., default_target_attribute: Optional[str] =
Field(default=None, ...)) so omitted values are None instead of "". Then update
the router code that reads target_attribute (the variable checked as
target_attribute) to keep the existing "if target_attribute is not None:" guard
(it will now behave correctly), and ensure any downstream validation that
assumes a string handles None appropriately (e.g., skip validation or conversion
when target_attribute is None).

In `@tests/routers/openml/dataset_upload_test.py`:
- Around line 152-172: Add a new test that verifies uploading without a
default_target_attribute succeeds: create a test named
test_upload_no_target_attribute_succeeds in
tests/routers/openml/dataset_upload_test.py which uses _make_parquet_bytes() and
the same patch context as test_upload_parquet_success (patching
routers.openml.datasets.upload_to_minio,
database.datasets.insert_file/insert_dataset/update_file_reference/insert_description/insert_features/insert_qualities/update_status),
call _upload(api_client_authenticated, file_bytes=_make_parquet_bytes(),
extra_meta={"default_target_attribute": ""}), and assert response.status_code ==
HTTPStatus.CREATED and the returned upload_dataset id equals
_EXPECTED_DATASET_ID so the empty default_target_attribute case is covered.

Comment on lines +92 to +106
# --- Validate target attribute exists in the Parquet schema ---
target_attribute = upload_meta.default_target_attribute
if target_attribute is not None:
parquet_column_names = {col.name for col in parquet_meta.columns}
if target_attribute not in parquet_column_names:
raise HTTPException(
status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
detail={
"code": "114",
"message": (
f"Default target attribute '{target_attribute}' "
"does not exist in the uploaded dataset columns."
),
},
)

⚠️ Potential issue | 🔴 Critical

if target_attribute is not None is always True — empty default always fails validation.

default_target_attribute is typed as str (never None), so this guard never skips the column-presence check. A user who omits default_target_attribute gets target_attribute="", "" not in parquet_column_names is True, and every such upload fails with a misleading 422 error — even though no target was intended.

🐛 Proposed fix
-    if target_attribute is not None:
+    if target_attribute:
         parquet_column_names = {col.name for col in parquet_meta.columns}
         if target_attribute not in parquet_column_names:
             raise HTTPException(
                 status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
                 detail={
                     "code": "114",
                     "message": (
                         f"Default target attribute '{target_attribute}' "
                         "does not exist in the uploaded dataset columns."
                     ),
                 },
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 92 - 106, The guard checking the
default target uses "if target_attribute is not None" but
upload_meta.default_target_attribute is always a str, so empty string slips
through and triggers a false-positive validation; change the check to skip
validation for empty/falsy targets (e.g., use "if target_attribute" or explicit
empty-string check) so that only non-empty target_attribute values are validated
against parquet_column_names (parquet_meta.columns) before raising the
HTTPException in this block.

Comment on lines +135 to +149
try:
minio_key = upload_to_minio(file_bytes, dataset_id)
except RuntimeError as exc:
expdb_db.rollback()
raise HTTPException(
status_code=HTTPStatus.INTERNAL_SERVER_ERROR,
detail={"code": "113", "message": str(exc)},
) from exc

# Persist the real object location now that we have the key
database.datasets.update_file_reference(
file_id=file_id,
reference=minio_key,
connection=expdb_db,
)

⚠️ Potential issue | 🟠 Major

Post-upload DB failures leave the MinIO object orphaned.

The rollback() on Line 138 correctly handles MinIO upload failure (pre-upload inserts are undone). However, if any of the operations after a successful MinIO upload fail (update_file_reference, insert_description, insert_features, insert_qualities, update_status) — e.g., due to a DB constraint violation or connectivity issue — the expdb_connection context manager will auto-rollback the entire DB transaction, leaving a file in MinIO with no DB record pointing to it.

Addressing this fully requires a compensation step (delete the MinIO object) in an outer try/except around the post-upload DB block, or a background cleanup job for orphaned objects.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 135 - 149, Post-upload DB
failures can leave the MinIO object orphaned; after calling upload_to_minio(...)
and getting minio_key you must wrap the subsequent DB work
(database.datasets.update_file_reference,
insert_description/insert_features/insert_qualities, update_status) in an outer
try/except so that if any exception occurs you call the MinIO cleanup (e.g.,
delete_from_minio(minio_key) or equivalent) before re-raising the error; use the
existing expdb_db/expdb_connection transaction semantics (allowing it to
rollback the DB) but ensure the MinIO delete runs in the except block, log any
delete errors, and then raise the appropriate HTTPException to preserve current
behavior.
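The compensation flow described above, with cleanup in the `except` and the original error preserved, can be sketched with stand-in callables. Note that `delete_from_minio` does not exist in this PR; it is the hypothetical cleanup helper the review proposes:

```python
def upload_with_compensation(upload, persist, cleanup) -> str:
    key = upload()            # upload_to_minio(file_bytes, dataset_id)
    try:
        persist(key)          # update_file_reference, insert_description, ...
    except Exception:
        try:
            cleanup(key)      # e.g. delete_from_minio(minio_key)
        except Exception:
            pass              # log the cleanup failure; keep the original error
        raise                 # DB transaction still rolls back as before
    return key


deleted: list[str] = []


def failing_persist(key: str) -> None:
    raise RuntimeError("constraint violation")


try:
    upload_with_compensation(
        upload=lambda: "datasets/131.pq",
        persist=failing_persist,
        cleanup=deleted.append,
    )
except RuntimeError:
    pass  # the endpoint would translate this into an HTTPException
```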

Comment on lines +190 to +195
database.datasets.update_status(
dataset_id=dataset_id,
status="in_preparation",
user_id=user.user_id,
connection=expdb_db,
)

⚠️ Potential issue | 🟡 Minor

Use the DatasetStatus enum constant instead of a bare string literal.

"in_preparation" is already defined as DatasetStatus.IN_PREPARATION; using the enum prevents silent breakage if the value changes.

♻️ Proposed fix
     database.datasets.update_status(
         dataset_id=dataset_id,
-        status="in_preparation",
+        status=DatasetStatus.IN_PREPARATION,
         user_id=user.user_id,
         connection=expdb_db,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 190 - 195, Replace the
hard-coded status string in the update call with the enum constant: in the call
to database.datasets.update_status (the line passing status="in_preparation"),
use DatasetStatus.IN_PREPARATION instead; ensure DatasetStatus is imported where
this file defines dataset handling so the call becomes
status=DatasetStatus.IN_PREPARATION while leaving dataset_id, user.user_id, and
connection=expdb_db unchanged.
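Assuming `DatasetStatus` is a `str`-backed enum (the members below are illustrative, not copied from the codebase), the swap is behaviour-preserving: a `str`-enum member compares and binds as its string value, so SQL parameters built from it are unchanged:

```python
from enum import Enum


class DatasetStatus(str, Enum):
    IN_PREPARATION = "in_preparation"
    ACTIVE = "active"
    DEACTIVATED = "deactivated"


# Usable anywhere the bare literal "in_preparation" was accepted:
status = DatasetStatus.IN_PREPARATION
```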


Successfully merging this pull request may close these issues.

Allow for uploading of parquet datasets
