Skip to content

feat: use text index to power filters and autocomplete#2376

Open
knudtty wants to merge 8 commits into
mainfrom
aaron/fts-index-as-aggregate-view
Open

feat: use text index to power filters and autocomplete#2376
knudtty wants to merge 8 commits into
mainfrom
aaron/fts-index-as-aggregate-view

Conversation

@knudtty
Copy link
Copy Markdown
Contributor

@knudtty knudtty commented May 29, 2026

Summary

The new text indexes can power filters and autocomplete and ease the metadataMVs. So let's do it!

References

  • Linear Issue: Closes HDX-4229

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 29, 2026

⚠️ No Changeset found

Latest commit: 5dc9015

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 29, 2026 11:05pm
hyperdx-storybook Ready Ready Preview, Comment May 29, 2026 11:05pm

Request Review

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1143 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 4
  • Production lines changed: 1143 (+ 880 in test files, excluded from tier calculation)
  • Branch: aaron/fts-index-as-aggregate-view
  • Author: knudtty

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

E2E Test Results

All tests passed • 188 passed • 3 skipped • 1236s

Status Count
✅ Passed 188
❌ Failed 0
⚠️ Flaky 7
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Deep Review

Scope: packages/common-utils/src/core/{clickhouseVersion,kvItems,metadata}.ts, packages/common-utils/src/queryParser.ts, plus tests. Adds mergeTreeTextIndex (CH ≥ 26.3) fast paths to getMapKeys / getMapValues / getAllKeyValues / getAllFieldsAndValues, extracts KV-items parsing into core/kvItems.ts, and broadens getMapValues to accept a keys: string[] batch returning Map<string, string[]>.

🔴 P0/P1 -- must fix

  • packages/common-utils/src/core/metadata.ts:449 -- getMapKeys now reads dateRange directly via partsOverlapFilter in the new text-index branches, but its cacheKey only encodes dateRange when timestampValueExpression is also set (dateRangeCacheSuffix = '' otherwise) or via alignedDateRange when metadataMVs is set; the internal getAllFieldsAndValues caller at metadata.ts:1884 passes dateRange with neither, so a second call with a different dateRange returns the first call's keys from the per-instance MetadataCache.
    • Fix: Include dateRange[0].getTime()dateRange[1].getTime() in cacheKey whenever dateRange is defined (or align it through partsOverlapFilter's implicit bucketing) so the cache slot can't return keys computed for a different range.

🟡 P2 -- recommended

  • packages/common-utils/src/core/metadata.ts:773 -- The rewrite of getMapValues drops the previous cache.getOrFetch(cacheKey, …) wrapper entirely, so neither successful results nor in-flight requests are deduplicated; repeated lookups for the same (column, key, dateRange) now hit ClickHouse every time, and the main-table fallback fans out Promise.all over keys.length separate SELECT DISTINCT scans per call.
    • Fix: Reintroduce per-(column, keys, dateRange) caching on non-empty results (mirroring the new getMapKeys pattern of cache.set only when length > 0) so empty index hits don't poison the cache but real results are still memoized.

🔵 P3 nitpicks (4)

🔵 P3 nitpicks (4)
  • packages/common-utils/src/core/metadata.ts:410 -- The comment claims partsOverlapFilter "Fails loudly via throwIf() if any active part has min_time at the epoch sentinel," but the implementation contains no throwIf and silently filters out epoch-min_time/max_time parts, which would skip a non-time-partitioned table without any signal to the caller.

    • Fix: Either add the promised throwIf(min_time = toDateTime(0), …) guard or rewrite the comment to describe the actual silent-fallthrough behavior.
  • packages/common-utils/src/core/metadata.ts:1880 -- The text-index outer cache key uses raw dateRange[0]/[1], while the MV cache key two blocks down uses alignedDateRange; identical visits whose range shifts by less than one MV bucket reuse the MV slot but allocate a fresh text-index slot.

    • Fix: Compute alignedDateRange once and use it consistently in both cacheKey strings so the bucketing is uniform.
  • packages/common-utils/src/core/metadata.ts:385 -- The expr.match(/^mapKeys\((.+)\)$/) capture is greedy and unanchored to a single identifier, so an index expression like mapKeys(toLowerCase(LogAttributes)) would patch a keysIndex keyed by the literal string toLowerCase(LogAttributes) — a column that doesn't exist — instead of falling through to the alias-lookup branch.

    • Fix: Tighten the regex to a bare identifier (/^mapKeys\(([A-Za-z_][\w]*)\)$/) or look the captured token up in columns before patching.
  • packages/common-utils/src/queryParser.ts:974 -- The for (const strategy of KV_ITEMS_STRATEGIES) loop in CustomSchemaSQLSerializerV2 now duplicates the body of parseKvItemsDefaultExpression exported from the new core/kvItems.ts.

    • Fix: Replace the local loop with a call to parseKvItemsDefaultExpression(candidate.default_expression) so both call sites stay in sync as new strategies are added.

Reviewers (5): correctness, maintainability, performance, testing, project-standards.

Testing gaps:

  • No test exercises getMapKeys being called twice with different dateRange values and no timestampValueExpression to lock in the cache-key contract.
  • No test covers partsOverlapFilter against a table whose system.parts rows carry epoch min_time/max_time (non-time-partitioned case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant