Skip to content

feat(schema): ClickHouse index algorithms and engine SETTINGS#6

Merged
abnegate merged 17 commits intomainfrom
feat-clickhouse-skip-index-and-settings
Apr 30, 2026
Merged

feat(schema): ClickHouse index algorithms and engine SETTINGS#6
abnegate merged 17 commits intomainfrom
feat-clickhouse-skip-index-and-settings

Conversation

@lohanidamodar
Copy link
Copy Markdown
Contributor

@lohanidamodar lohanidamodar commented Apr 28, 2026

Summary

Adds two ClickHouse-only DDL features to the Schema builder so real ClickHouse-backed services can express the table layout they need:

  1. Index algorithm selection. Every ClickHouse INDEX is parameterised by a TYPE (minmax / set / bloom_filter / ngrambf_v1 / tokenbf_v1 / inverted) and a GRANULARITY. The schema previously hard-coded TYPE minmax GRANULARITY 3 for every index() call, which is the wrong choice for any high-cardinality column. Table::index() now accepts an optional algorithm, algorithmArgs, and granularity so callers can pick the correct algorithm per column.

  2. Engine SETTINGS clause. CREATE TABLE … SETTINGS k=v, … was unreachable. Real ClickHouse usage needs index_granularity, allow_nullable_key, etc. — Table::settings() now emits this after TTL in the rendered DDL.

The pattern follows engine() / ttl() precisely: ClickHouse-only DDL features live on Table and are silently ignored by the other dialect compilers.

Why

Both utopia-php/audit and utopia-php/usage currently hand-roll their ClickHouse setup() routines, emitting INDEX … TYPE bloom_filter GRANULARITY 1 and SETTINGS index_granularity = 8192, allow_nullable_key = 1 directly. Without these features the schema builder cannot express what those services already produce, so they cannot delegate setup() here.

API

use Utopia\Query\Schema\ClickHouse as Schema;
use Utopia\Query\Schema\ClickHouse\IndexAlgorithm;
use Utopia\Query\Schema\Table;

(new Schema())->create('events', function (Table $table) {
    $table->bigInteger('id')->primary();
    $table->string('user_id');
    $table->string('country');
    $table->string('text');

    // No algorithm specified → default `TYPE minmax GRANULARITY 3`
    $table->index(['id']);

    // BloomFilter — high-cardinality strings with `=` / `IN`
    $table->index(['user_id'], algorithm: IndexAlgorithm::BloomFilter);

    // Set(N) — small fixed value sets, custom granularity
    $table->index(['country'], algorithm: IndexAlgorithm::Set, algorithmArgs: [100], granularity: 4);

    // NgramBloomFilter(n, size_bytes, hashes, seed) — text search
    $table->index(['text'], algorithm: IndexAlgorithm::NgramBloomFilter, algorithmArgs: [4, 1024, 3, 0]);

    $table->settings([
        'index_granularity' => 8192,
        'allow_nullable_key' => true, // booleans become 1 / 0
    ]);
});

// CREATE TABLE `events` (
//   `id` Int64, `user_id` String, `country` String, `text` String,
//   INDEX `idx_id` `id` TYPE minmax GRANULARITY 3,
//   INDEX `idx_user_id` `user_id` TYPE bloom_filter GRANULARITY 1,
//   INDEX `idx_country` `country` TYPE set(100) GRANULARITY 4,
//   INDEX `idx_text` `text` TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1
// )
//   ENGINE = MergeTree() ORDER BY (`id`)
//   SETTINGS index_granularity = 8192, allow_nullable_key = 1

Indexes can also be added in alter() callbacks — they render as ALTER TABLE … ADD INDEX …. MinMax and Inverted reject algorithmArgs (no parenthesised args in DDL); float algorithmArgs and settings values use sprintf('%F', ...) so 1.0e-5 becomes 0.00001 instead of the rejected 1.0E-5.

Files

  • src/Query/Schema/ClickHouse/IndexAlgorithm.php — new enum (6 algorithms).
  • src/Query/Schema/Index.php — adds optional algorithm, algorithmArgs, granularity fields. Validates the no-args-on-MinMax/Inverted rule, granularity ≥ 1, and at-least-one-column. The strict identifier regex on name only fires when algorithm is set, since other dialects quote index names (so hyphens / dots are valid there).
  • src/Query/Schema/Table.phpindex() and uniqueIndex() accept the new fields. Auto-generated index names sanitise non-identifier characters in column names, so event-type produces idx_event_type. Adds settings() with a conservative key/value allow-list and float-safe formatting.
  • src/Query/Schema/ClickHouse.php — single compileSkipIndex(Index) helper used by both create() and alter(). SETTINGS rendered after TTL. alter() rejects SETTINGS (ClickHouse doesn't accept it in ALTER TABLE; callers must emit ALTER TABLE … MODIFY SETTING directly).

Test plan

  • composer lint passes
  • composer check (PHPStan max) passes
  • 16 new ClickHouse schema tests covering each algorithm shape, composite columns, alter ADD INDEX, all validation paths, float formatting, the algorithm-scoped name regex, and SETTINGS edge cases
  • Full Query test suite: 5086/5086 pass

🤖 Generated with Claude Code

The ClickHouse schema previously hard-coded `TYPE minmax GRANULARITY 3`
for every index and had no way to emit a `SETTINGS k=v` clause. Both are
needed by real ClickHouse-backed services that pick algorithms per
column and tune engine settings (e.g. `index_granularity`,
`allow_nullable_key`).

This change adds:

- `Schema\ClickHouse\SkipIndexAlgorithm` enum — `MinMax`, `Set`,
  `BloomFilter`, `NgramBloomFilter`, `TokenBloomFilter`, `Inverted`.
- `Schema\ClickHouse\SkipIndex` value object carrying the algorithm,
  granularity, and algorithm-specific args (e.g. `[100]` for `set(100)`,
  `[4, 1024, 3, 0]` for `ngrambf_v1(...)`).
- `Table::dataSkippingIndex()` — attach a skip index to the blueprint.
  Default granularity 1, default no algorithm args.
- `Table::settings()` — set table-level engine SETTINGS, with a
  conservative key/value allow-list to keep DDL safe.
- `Schema\ClickHouse::create()` renders both. SETTINGS is emitted after
  TTL. The existing minmax compile path is unchanged for backward
  compatibility — callers using `index()` keep their current output.

Other dialects ignore the new fields, matching how `engine()` and `ttl()`
are already handled.

Tests cover each skip algorithm shape, composite-column skip indexes,
SETTINGS rendering with and without TTL, and validation of bad
granularity, empty columns, bad setting names, and unsafe string values.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

📊 Coverage

Metric PR Baseline Δ
Lines 91.89% (7046/7668) 91.92% -0.03%
Methods 84.70% (1041/1229) 84.80% -0.10%
Classes 62.50% (105/168) 62.87% -0.37%

Full per-file breakdown in the job summary.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR adds two ClickHouse-only DDL features to the schema builder: per-index algorithm selection (minmax / set / bloom_filter / ngrambf_v1 / tokenbf_v1 / inverted with optional args and granularity) and a SETTINGS k=v, … clause rendered after TTL. All previously flagged issues from earlier review rounds have been addressed in this revision — the name-regex is correctly gated on algorithm !== null, no-arg algorithms reject algorithmArgs, float formatting uses sprintf('%F') throughout, alter() guards against non-skip index types and rejects SETTINGS, and index() retains its static return type for fluent chaining.

Confidence Score: 5/5

Safe to merge — no P0 or P1 issues found; all previously flagged defects have been resolved.

Every concern from prior review rounds is addressed in the current diff. The single remaining finding is a minor README/code discrepancy on the string-value regex quantifier (* vs +), which is P2-only and does not affect runtime behaviour.

No files require special attention.

Important Files Changed

Filename Overview
src/Query/Schema/ClickHouse.php compileSkipIndex() consolidates CREATE and ALTER paths; null-algorithm falls back to GRANULARITY 3; float args use sprintf('%F'); ALTER correctly rejects non-skip index types and SETTINGS.
src/Query/Schema/Table.php index() extended with algorithm/algorithmArgs/granularity and kept returning static; autoIndexName() sanitises column names; settings() validates keys/values and stores pre-formatted strings.
src/Query/Schema/Index.php Adds optional algorithm/algorithmArgs/granularity fields with correct validation: name regex gated on algorithm!=null, no-arg algorithms reject algorithmArgs, granularity >= 1 enforced.
src/Query/Schema/ClickHouse/IndexAlgorithm.php New enum with 6 ClickHouse skip-index algorithm values; straightforward and correct.
src/Query/Schema/Column.php Forwards new index() parameters and adds settings() proxy to Table; consistent with existing delegation pattern.
tests/Query/Schema/ClickHouseTest.php 16 new tests covering all algorithms, composite columns, ALTER ADD INDEX, validation paths, float formatting, name-regex scoping, and SETTINGS edge cases.
README.md Documents new API accurately; minor discrepancy: string-value pattern shown as * (allows empty) but code enforces + (requires non-empty).

Reviews (14): Last reviewed commit: "fix(schema): restore missing brace in se..." | Re-trigger Greptile

Comment thread src/Query/Schema/ClickHouse/SkipIndex.php Outdated
Comment thread src/Query/Schema/Table.php Outdated
Comment thread src/Query/Schema/ClickHouse.php Outdated
Four follow-ups based on the greptile review on PR #6:

- alter() now renders ADD INDEX for any skipIndexes set on the blueprint,
  matching ClickHouse's `ALTER TABLE ... ADD INDEX name expr TYPE algo
  GRANULARITY n` syntax. Previously dataSkippingIndex() inside an alter()
  callback was silently dropped (P1).

- alter() throws UnsupportedException if the blueprint has settings —
  ClickHouse does not accept SETTINGS in ALTER TABLE; callers must emit
  `ALTER TABLE ... MODIFY SETTING` directly. Surfaces what was previously
  a silent no-op.

- SkipIndex constructor rejects algorithmArgs for MinMax and Inverted
  (P2) — both are emitted without parentheses in ClickHouse DDL, so any
  args would have produced invalid SQL like `minmax(1)`.

- Table::dataSkippingIndex() now sanitises auto-generated index names
  when columns contain non-identifier characters (P2) — `event-type` no
  longer produces a confusing `Invalid skip index name: skip_event-type`
  exception. Non-identifier characters are collapsed to `_`.

- compileSkipAlgorithm() formats float args with sprintf('%F', ...)
  instead of (string) cast (P2) — the cast can produce scientific
  notation like `1.0E-5`, which ClickHouse rejects in index type
  arguments. Trailing zeros are trimmed for readability so 0.01 stays
  "0.01" rather than "0.010000".

Adds tests for each fix plus one for ALTER ADD INDEX with composite
columns and algorithm args.
The ClickHouse parts of this library are unreleased and have no external
consumers, so there is no need for a parallel `dataSkippingIndex()` /
`SkipIndex` API alongside the existing `index()` / `Index` API. Folding
them in produces a single consistent index API that works the same in
every dialect, with ClickHouse-specific parameters available as optional
named arguments (mirroring how `method`, `operatorClass`, `lengths`, and
`collations` are already dialect-specific options on `index()`).

Changes:

- `Table::dataSkippingIndex()` — removed.
- `Schema\ClickHouse\SkipIndex` value object — removed (folded into
  `Index`).
- `Schema\ClickHouse\SkipIndexAlgorithm` enum — kept; useful typed value.
- `Index` gains `algorithm`, `algorithmArgs`, and `granularity` optional
  fields. Validates name (must be a SQL identifier), at-least-one-column,
  granularity ≥ 1, and that no-arg algorithms (MinMax, Inverted) get no
  `algorithmArgs`.
- `Table::index()` and `uniqueIndex()` accept the new fields. Auto-name
  generation now sanitises non-identifier characters in column names so
  `event-type` produces `idx_event_type` instead of failing the
  identifier regex.
- `Schema\ClickHouse::create()` and `alter()` use a single
  `compileSkipIndex(Index)` helper that renders
  `INDEX <name> <expr> TYPE <algo>[(args)] GRANULARITY <n>`. Defaults
  to `TYPE minmax GRANULARITY 3` when no algorithm is set, matching the
  ClickHouse-canonical default for a generic `INDEX` without a chosen
  type.
- Float `algorithmArgs` are formatted with `sprintf('%F', ...)` and
  trimmed of trailing zeros, so `1.0e-5` becomes `0.00001` rather than
  `1.0E-5` (which ClickHouse rejects in index type arguments).

5083/5083 tests pass; lint and PHPStan max clean.
Comment thread src/Query/Schema/Table.php Outdated
Comment thread src/Query/Schema/Index.php Outdated
After collapsing `dataSkippingIndex()` into `index()` and `SkipIndex`
into `Index`, the `Skip` prefix on the algorithm enum is the last
holdout from the parallel-API design and is now inconsistent with the
new naming. The enum already lives in `Schema\ClickHouse\` so the
namespace says "ClickHouse"; the type name only needs to say "this is
the algorithm a ClickHouse index runs".

`SkipIndexAlgorithm` -> `IndexAlgorithm`. No behavioural change.
Two P1 follow-ups from the latest greptile review:

- The strict name regex on `Index` (`[A-Za-z_][A-Za-z0-9_]*`) was a
  cross-dialect breaking change: MySQL, MariaDB, PostgreSQL, and SQLite
  all quote index names, so hyphenated and dot-qualified names were
  valid before this PR. The strict identifier check now only fires when
  an `IndexAlgorithm` is set — i.e. for ClickHouse data-skipping indexes
  where the name appears unquoted in `INDEX <name> ... TYPE ...`.

- `Table::settings()` cast floats with `(string) $value`, which produces
  scientific notation for small values (e.g. `(string) 1.0e-5 ===
  "1.0E-5"`). ClickHouse rejects E-notation in SETTINGS values. Apply
  the same `sprintf('%F', ...)` + trim treatment that `compileSkipIndex`
  already uses for algorithm args.

Adds tests for both: hyphenated index names work without an algorithm
and throw with one; SETTINGS values formatted from `1.0e-5` come out as
`0.00001`.
@abnegate
Copy link
Copy Markdown
Member

@copilot Rebase

Co-authored-by: abnegate <5857008+abnegate@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot Rebase

Done — merged main into the branch in 34b2353. All merge conflicts resolved (the $blueprint$table rename from main, the fluent builder methods, and the updated index() signature).

Copilot AI requested a review from abnegate April 30, 2026 03:07
Comment thread src/Query/Schema/Table.php Outdated
@abnegate
Copy link
Copy Markdown
Member

@copilot address the greptile comment

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot address the greptile comment

Fixed in a3fcc2b — restored ): static { return type and return $this; on Table::index() so it matches uniqueIndex(), fulltextIndex(), and spatialIndex() and doesn't break fluent chaining.

Comment thread src/Query/Schema/ClickHouse.php
@abnegate
Copy link
Copy Markdown
Member

@copilot address the greptile comment

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot address the greptile comment

Fixed in b1df44e — both the compileAlter and compileCreate index loops now guard against non-IndexType::Index entries, throwing UnsupportedException instead of silently rendering wrong DDL for uniqueIndex(), fulltextIndex(), or spatialIndex() calls.

Comment thread src/Query/Schema/Table.php Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread src/Query/Schema/Table.php Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread src/Query/Schema/ClickHouse.php
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@abnegate
Copy link
Copy Markdown
Member

@copilot Fix the readme to use builder style for new additions instead of callback

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot Fix the readme to use builder style for new additions instead of callback

Updated in d60af4c — both the skip-index and SETTINGS code blocks now use $schema->table('events')->...->create() builder style, consistent with the engine and TTL examples above them.

Comment thread src/Query/Schema/ClickHouse.php
@abnegate
Copy link
Copy Markdown
Member

@copilot fix unit test and PHPstan checks

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot fix unit test and PHPstan checks

Fixed in 16ddd87 — all skip-index and SETTINGS tests were using a non-existent callback API ($schema->create('events', function (Table $table) {...})). Rewrote them to use the fluent builder style ($schema->table('events')->...->create()), which resolves both the PHPStan errors (undefined method, unknown class) and the unit test failures.

@lohanidamodar lohanidamodar changed the title feat(schema): ClickHouse data-skipping indexes and engine SETTINGS feat(schema): ClickHouse index algorithms and engine SETTINGS Apr 30, 2026
Tests in the fluent builder style call ->index(algorithm: ...) and
->settings([...]) on a Column returned by a column-creation method, but
Column lacked forwarders for the new ClickHouse skip-index params and
the table-level settings(). Add them so the chain compiles and runs.

Also hardcode GRANULARITY 3 for the no-algorithm fallback in
ClickHouse::compileSkipIndex (matches its docblock and existing test
expectations), and rewrite testTableSettingsWithTtlOrdering to call
ttl() on the Table directly — Column::ttl() is column-level and would
emit an inline TTL on the column, not the table-level TTL the test
asserts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Query/Schema/Table.php Outdated
abnegate and others added 3 commits April 30, 2026 18:40
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
The previous "hardcoded 3" fallback in compileSkipIndex discarded any
explicit granularity passed for a no-algorithm index. Make the field
nullable end-to-end (Index, Table::index, Column::index) and resolve
the dialect-specific default at compile time: 3 when no algorithm, 1
when an algorithm is set. Explicit values are now always honoured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@abnegate abnegate merged commit 34048c2 into main Apr 30, 2026
7 checks passed
@lohanidamodar lohanidamodar deleted the feat-clickhouse-skip-index-and-settings branch April 30, 2026 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants