fix(sync-gbrain): generate gbrain-valid source ids for repos with dots or long names#1330
Closed
radubach wants to merge 1 commit into
Closed
fix(sync-gbrain): generate gbrain-valid source ids for repos with dots or long names#1330radubach wants to merge 1 commit into
radubach wants to merge 1 commit into
Conversation
…s or long names
`deriveCodeSourceId` previously concatenated the canonicalized remote with only `/`
and whitespace stripped, leaving dots from hostnames (`github.com`) and no length
cap. gbrain rejects any source id containing characters outside [a-z0-9-] or longer
than 32 chars, so `github.com/<org>/<repo>` produced `gstack-code-github.com-<org>-<repo>`
(40 chars, plus dots) and registration failed:
code source registration failed: Invalid source id
"gstack-code-github.com-radubach-platform". Must be 1-32 lowercase alnum
chars with optional interior hyphens.
Fix:
- Drop the host segment (`github.com` is the same for nearly every user and just
consumes the 32-char budget). Use only the last two path segments (org-repo).
- Sanitize any remaining non-alnum to hyphens, then collapse and trim.
- For genuinely long org/repo names that still exceed the budget, keep the tail
(most distinctive end of the slug) and append a 6-char sha1 hash for collision
resistance.
Adds a regression test that spawns the CLI in temp git repos with controlled
remotes (dot in hostname, SCP-style, multi-dot host, long names forcing
hash-truncation) and asserts every derived id is ≤32 chars and matches the
gbrain validator regex.
Merged
9 tasks
garrytan
added a commit
that referenced
this pull request
May 7, 2026
…n-valid source ids (#1344) * fix: use correct `gbrain put <slug>` CLI verb in memory ingest `put_page` is the MCP tool name, not a CLI subcommand. The actual gbrain verb is `put <slug>` with content via stdin and tags in YAML frontmatter. Every transcript / memory ingest fails today on clean installs. Switch to the right verb and inject title/type/tags into the frontmatter that buildTranscriptPage / buildArtifactPage already produce. Bundled in the same function: - timeout: 30s → 60s. Auto-link reconciliation hits 30s once the brain has a few hundred pages. - maxBuffer: 1MB → 16MB. Without it Node truncates gbrain's stderr and callers see only `Command failed:` with no detail. - Surface stderr/stdout in the returned error instead of the bare exception. Verified: bun test test/gstack-memory-ingest.test.ts -> 15/15 pass. bun test on the three test files touching this path -> 362/362. * fix(sync-gbrain): generate gbrain-valid source ids for repos with dots or long names `deriveCodeSourceId` previously concatenated the canonicalized remote with only `/` and whitespace stripped, leaving dots from hostnames (`github.com`) and no length cap. gbrain rejects any source id containing characters outside [a-z0-9-] or longer than 32 chars, so `github.com/<org>/<repo>` produced `gstack-code-github.com-<org>-<repo>` (40 chars, plus dots) and registration failed: code source registration failed: Invalid source id "gstack-code-github.com-radubach-platform". Must be 1-32 lowercase alnum chars with optional interior hyphens. Fix: - Drop the host segment (`github.com` is the same for nearly every user and just consumes the 32-char budget). Use only the last two path segments (org-repo). - Sanitize any remaining non-alnum to hyphens, then collapse and trim. - For genuinely long org/repo names that still exceed the budget, keep the tail (most distinctive end of the slug) and append a 6-char sha1 hash for collision resistance. Adds a regression test that spawns the CLI in temp git repos with controlled remotes (dot in hostname, SCP-style, multi-dot host, long names forcing hash-truncation) and asserts every derived id is ≤32 chars and matches the gbrain validator regex. * fix(memory-ingest): hybrid frontmatter writer + tightened gbrain availability probe PR #1328 (merged in the prior commit) correctly injects title/type/tags into the YAML frontmatter that buildTranscriptPage already prepends. But buildArtifactPage emits raw markdown without frontmatter, so design-docs, learnings, and builder-profile-entries were landing in gbrain with empty title/type/tags. Add the no-frontmatter wrap branch so artifact pages get the same metadata the inject branch provides for transcripts. Also bring in gbrainAvailable()'s --help probe (originally proposed in PR #1341 by Alex Medina), with the regex tightened from /(^|\s)put(\s|$)/m to /^\s+put\s/m. Anchoring on the indented subcommand format gbrain's help actually uses keeps the probe from matching "put" appearing as prose in help text, while still failing fast with one clean error if a future gbrain renames or removes the put subcommand. Updates the V1.5 NOTE doc block at the top of the file to describe the current put-via-stdin shape rather than the legacy put_page flag form. Co-Authored-By: Alex Medina <oficina@puntoverdemc.com> * test+fix(memory-ingest): strengthen regression tests, fix inject for malformed-close frontmatter Imports the shim-based regression tests from PR #1341 (Alex Medina) and strengthens them to assert title, type, and tags actually arrive in put stdin — not just `agent: claude-code`. Asserting the metadata fields matches the regression class that's caused this fix wave: writers can "succeed" while metadata is silently lost. The original PR #1341 tests would have passed even with title/type/tags missing. Strengthening the test surfaced a deeper issue. buildTranscriptPage joins frontmatter array elements with "\n" and does not append a trailing newline, so the close fence is "\n---<content>" directly, not "\n---\n". PR #1328's inject branch searched for "\n---\n" and never matched — which means even with PR #1328 alone, transcript pages were landing in gbrain with no title/type/tags. Two-line fix: search for "\n---" only, since the inject lands before the close fence regardless of what follows it. Also imports PR #1341's V1.5 NOTE doc-block update and the section comment refresh so the prose stays accurate against the new writer shape. Co-Authored-By: Alex Medina <oficina@puntoverdemc.com> * fix+test(gbrain-sync): handle empty-slug edge in constrainSourceId, add no-origin and basename-empty regression tests PR #1330 (merged in the prior commit) addressed the dot-in-host and length-overflow cases for source-id derivation, but constrainSourceId silently returned "${prefix}-" when the input sanitized to an empty slug — invalid per gbrain's `^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$` validator on the trailing hyphen. Adds an explicit empty-slug branch that falls back to a sha1-prefixed id ("gstack-code-<6hex>") so the output stays gbrain-valid for every input shape. Two new regression tests cover the corners PR #1330's coverage left exposed: - no-origin fallback: a cwd repo with no `origin` remote configured must still derive a valid id from the basename. - basename-sanitizes-to-empty: a repo whose path basename is all non-alnum (e.g. "___") must produce the hash-only fallback, not an invalid trailing-hyphen id. Both run the CLI inside temp git repos for genuine end-to-end coverage (matches the pattern PR #1330 established for its own four remote-shape cases). Co-Authored-By: Richard Dubach <radubach@gmail.com> * chore: bump VERSION to 1.26.5.0 + CHANGELOG entry for fix wave PATCH bump. Three bug fixes (memory-ingest put_page CLI verb mismatch, hybrid frontmatter writer for transcripts AND artifacts, gbrain-valid source-id derivation for github-hosted repos), no new user capability. CHANGELOG release-summary leads with what users can now do (clean- install transcripts populate the brain, github-hosted repos register code sources) and tabulates before/after numbers from real gbrain v0.25.1 smoke output. Itemized changes credit @smithjoshua, @AZ-1224, and @radubach for the originating PRs plus the additional hybrid branch + strengthened tests added on top per Codex plan-review. * docs(todos): file P2 (gbrain install-pin staleness) + P3 (source-id host-collision) follow-ups Two follow-ups surfaced during the v1.26.5.0 fix-wave plan review. P2 — Issue #1305 part 2: bin/gstack-gbrain-install pins gbrain to v0.18.2 (commit 08b3698) but doesn't move when gstack ships features that depend on newer gbrain ops or schema. Fresh /setup-gbrain on v1.26.x lands users on schema 24 with v1.26 features expecting 32+. Captured for a future fix-wave. P3 — Codex P1.3 from the v1.26.5.0 plan review: deriveCodeSourceId drops the host segment to fit gbrain's 32-char source-id budget, which means github.com/acme/foo and gitlab.com/acme/foo collapse to the same source id. Real but rare; PR #1330 author explicitly considered this and chose budget over cross-host uniqueness. Captured as a long-tail concern. --------- Co-authored-by: Joshua Smith <joshualowellsmith@gmail.com> Co-authored-by: Richard Dubach <radubach@gmail.com> Co-authored-by: Alex Medina <oficina@puntoverdemc.com>
Owner
|
Closing as superseded — gbrain-valid source ID generation shipped in v1.26.5.0 (#1344). Thanks for the contribution! |
gonnabe88
pushed a commit
to gonnabe88/gstack
that referenced
this pull request
May 9, 2026
…n-valid source ids (garrytan#1344) * fix: use correct `gbrain put <slug>` CLI verb in memory ingest `put_page` is the MCP tool name, not a CLI subcommand. The actual gbrain verb is `put <slug>` with content via stdin and tags in YAML frontmatter. Every transcript / memory ingest fails today on clean installs. Switch to the right verb and inject title/type/tags into the frontmatter that buildTranscriptPage / buildArtifactPage already produce. Bundled in the same function: - timeout: 30s → 60s. Auto-link reconciliation hits 30s once the brain has a few hundred pages. - maxBuffer: 1MB → 16MB. Without it Node truncates gbrain's stderr and callers see only `Command failed:` with no detail. - Surface stderr/stdout in the returned error instead of the bare exception. Verified: bun test test/gstack-memory-ingest.test.ts -> 15/15 pass. bun test on the three test files touching this path -> 362/362. * fix(sync-gbrain): generate gbrain-valid source ids for repos with dots or long names `deriveCodeSourceId` previously concatenated the canonicalized remote with only `/` and whitespace stripped, leaving dots from hostnames (`github.com`) and no length cap. gbrain rejects any source id containing characters outside [a-z0-9-] or longer than 32 chars, so `github.com/<org>/<repo>` produced `gstack-code-github.com-<org>-<repo>` (40 chars, plus dots) and registration failed: code source registration failed: Invalid source id "gstack-code-github.com-radubach-platform". Must be 1-32 lowercase alnum chars with optional interior hyphens. Fix: - Drop the host segment (`github.com` is the same for nearly every user and just consumes the 32-char budget). Use only the last two path segments (org-repo). - Sanitize any remaining non-alnum to hyphens, then collapse and trim. - For genuinely long org/repo names that still exceed the budget, keep the tail (most distinctive end of the slug) and append a 6-char sha1 hash for collision resistance. Adds a regression test that spawns the CLI in temp git repos with controlled remotes (dot in hostname, SCP-style, multi-dot host, long names forcing hash-truncation) and asserts every derived id is ≤32 chars and matches the gbrain validator regex. * fix(memory-ingest): hybrid frontmatter writer + tightened gbrain availability probe PR garrytan#1328 (merged in the prior commit) correctly injects title/type/tags into the YAML frontmatter that buildTranscriptPage already prepends. But buildArtifactPage emits raw markdown without frontmatter, so design-docs, learnings, and builder-profile-entries were landing in gbrain with empty title/type/tags. Add the no-frontmatter wrap branch so artifact pages get the same metadata the inject branch provides for transcripts. Also bring in gbrainAvailable()'s --help probe (originally proposed in PR garrytan#1341 by Alex Medina), with the regex tightened from /(^|\s)put(\s|$)/m to /^\s+put\s/m. Anchoring on the indented subcommand format gbrain's help actually uses keeps the probe from matching "put" appearing as prose in help text, while still failing fast with one clean error if a future gbrain renames or removes the put subcommand. Updates the V1.5 NOTE doc block at the top of the file to describe the current put-via-stdin shape rather than the legacy put_page flag form. Co-Authored-By: Alex Medina <oficina@puntoverdemc.com> * test+fix(memory-ingest): strengthen regression tests, fix inject for malformed-close frontmatter Imports the shim-based regression tests from PR garrytan#1341 (Alex Medina) and strengthens them to assert title, type, and tags actually arrive in put stdin — not just `agent: claude-code`. Asserting the metadata fields matches the regression class that's caused this fix wave: writers can "succeed" while metadata is silently lost. The original PR garrytan#1341 tests would have passed even with title/type/tags missing. Strengthening the test surfaced a deeper issue. buildTranscriptPage joins frontmatter array elements with "\n" and does not append a trailing newline, so the close fence is "\n---<content>" directly, not "\n---\n". PR garrytan#1328's inject branch searched for "\n---\n" and never matched — which means even with PR garrytan#1328 alone, transcript pages were landing in gbrain with no title/type/tags. Two-line fix: search for "\n---" only, since the inject lands before the close fence regardless of what follows it. Also imports PR garrytan#1341's V1.5 NOTE doc-block update and the section comment refresh so the prose stays accurate against the new writer shape. Co-Authored-By: Alex Medina <oficina@puntoverdemc.com> * fix+test(gbrain-sync): handle empty-slug edge in constrainSourceId, add no-origin and basename-empty regression tests PR garrytan#1330 (merged in the prior commit) addressed the dot-in-host and length-overflow cases for source-id derivation, but constrainSourceId silently returned "${prefix}-" when the input sanitized to an empty slug — invalid per gbrain's `^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$` validator on the trailing hyphen. Adds an explicit empty-slug branch that falls back to a sha1-prefixed id ("gstack-code-<6hex>") so the output stays gbrain-valid for every input shape. Two new regression tests cover the corners PR garrytan#1330's coverage left exposed: - no-origin fallback: a cwd repo with no `origin` remote configured must still derive a valid id from the basename. - basename-sanitizes-to-empty: a repo whose path basename is all non-alnum (e.g. "___") must produce the hash-only fallback, not an invalid trailing-hyphen id. Both run the CLI inside temp git repos for genuine end-to-end coverage (matches the pattern PR garrytan#1330 established for its own four remote-shape cases). Co-Authored-By: Richard Dubach <radubach@gmail.com> * chore: bump VERSION to 1.26.5.0 + CHANGELOG entry for fix wave PATCH bump. Three bug fixes (memory-ingest put_page CLI verb mismatch, hybrid frontmatter writer for transcripts AND artifacts, gbrain-valid source-id derivation for github-hosted repos), no new user capability. CHANGELOG release-summary leads with what users can now do (clean- install transcripts populate the brain, github-hosted repos register code sources) and tabulates before/after numbers from real gbrain v0.25.1 smoke output. Itemized changes credit @smithjoshua, @AZ-1224, and @radubach for the originating PRs plus the additional hybrid branch + strengthened tests added on top per Codex plan-review. * docs(todos): file P2 (gbrain install-pin staleness) + P3 (source-id host-collision) follow-ups Two follow-ups surfaced during the v1.26.5.0 fix-wave plan review. P2 — Issue garrytan#1305 part 2: bin/gstack-gbrain-install pins gbrain to v0.18.2 (commit 08b3698) but doesn't move when gstack ships features that depend on newer gbrain ops or schema. Fresh /setup-gbrain on v1.26.x lands users on schema 24 with v1.26 features expecting 32+. Captured for a future fix-wave. P3 — Codex P1.3 from the v1.26.5.0 plan review: deriveCodeSourceId drops the host segment to fit gbrain's 32-char source-id budget, which means github.com/acme/foo and gitlab.com/acme/foo collapse to the same source id. Real but rare; PR garrytan#1330 author explicitly considered this and chose budget over cross-host uniqueness. Captured as a long-tail concern. --------- Co-authored-by: Joshua Smith <joshualowellsmith@gmail.com> Co-authored-by: Richard Dubach <radubach@gmail.com> Co-authored-by: Alex Medina <oficina@puntoverdemc.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/sync-gbrainfailed for me ongithub.com/radubach/platformwith:deriveCodeSourceIdinbin/gstack-gbrain-sync.ts:170builds the id by takingcanonicalizeRemote(originUrl())and replacing only/and whitespace with-. That leavesdots from hostnames (
github.com) intact and applies no length cap, so anygithub.com/<org>/<repo>produces
gstack-code-github.com-<org>-<repo>(40 chars in my case) — both invalid.Fix
github.com-is the same for nearly every user and just consumes the 32-char budget. Use only the last two path segments (org-repo).Resulting ids for the cases I checked:
github.com/radubach/platformgstack-code-github.com-radubach-platform(40, has.)gstack-code-radubach-platform(29)github.com/garrytan/gstackgstack-code-github.com-garrytan-gstack(38, has.)gstack-code-garrytan-gstack(27)github.com/some-very-long-org-name/some-very-long-repo-name.gstack-code-ong-repo-name-4cead7(32)Verified by re-running
/sync-gbrain --fullonradubach/platform: code stage now registers and indexes (110 pages, 1396 chunks).Test Coverage
test/gstack-gbrain-sync.test.ts(derived source ids are gbrain-valid…) spawns the CLI inside temp git repos with four controlled remote shapes (dot in hostname, SCP-style, multi-dot host, long org/repo forcing hash-truncate). Asserts every derived id is ≤32 chars, matches the gbrain validator regex^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$, and starts withgstack-code-.Expected: <= 32, Received: 40on the first case.dry-run derives a stable source id from the canonical git remotetest still passes (same prefix/regex contract — its expected slug for the gstack repo just changed fromgstack-code-github-com-garrytan-gstacktogstack-code-garrytan-gstack, both match the regex).bun test test/gstack-gbrain-sync.test.ts: 14/14 pass (was 13).Notes
github-com-prefix, may add hash suffix for long names). Users whose previous registration was already failing — i.e., everyone with a hostname-dot or org/repo over the budget — get a working sync. Users on shorter names get a slightly shorter, equally stable id.gstack-code-<basename>branch, which is unchanged.🤖 Generated with Claude Code
Need help on this PR? Tag
@codesmithwith what you need.