Use JS-compatible Unicode casing for intrinsic string mappings #3495

Merged
jakebailey merged 16 commits into microsoft:main from Andarist:fix/intrinsic-lowercase-uppercase
Jun 10, 2026

@Andarist Andarist commented Apr 22, 2026

Contributor

Fixes #3489

@jakebailey jakebailey left a comment

Member

I am very uneasy about this. Why can't we use Go's unicode ranges to do this? How can we make sure this is up to date long term?

Comment thread internal/stringutil/generate.go Outdated
Comment thread internal/stringutil/js_case.go Outdated
@RyanCavanaugh RyanCavanaugh added this to the TypeScript 7.0 RC milestone May 7, 2026

@RyanCavanaugh RyanCavanaugh left a comment

Member

Per Jake's comments

Copilot AI review requested due to automatic review settings May 18, 2026 09:46

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes intrinsic string mapping types (Lowercase<>, Uppercase<>, Capitalize<>, Uncapitalize<>) to use ECMAScript-compatible Unicode casing semantics (not Go’s strings.ToLower/ToUpper), handling special cases such as dotted İ, ß expansion, ligatures, and Greek final sigma.

Changes:

  • Implemented JS-compatible ToLowerJS/ToUpperJS in internal/stringutil, backed by generated Unicode SpecialCasing + DerivedCoreProperties data (for Final_Sigma context handling).
  • Switched the checker’s intrinsic string mapping implementation to use the new JS-compatible casing functions.
  • Added compiler baseline coverage for special casing behavior + a Go unit test for the casing helpers; refactored a shared rune-range lookup helper into stringutil.
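The special cases named in the overview can be observed directly in a JavaScript engine, which is the behavior the intrinsic types are meant to mirror. The following is a minimal sketch that uses only `String.prototype.toLowerCase` and `toUpperCase`; the `codePoints` helper is hypothetical and exists only to make combining marks visible.

```typescript
// Hypothetical helper: render a string as U+XXXX code points so that
// invisible combining marks are easy to see.
function codePoints(s: string): string {
    return Array.from(s, ch => "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0")).join(" ");
}

const dottedI = "\u0130".toLowerCase();          // İ becomes "i" plus a combining dot above
const eszett = "\u00DF".toUpperCase();           // ß expands to two letters
const ligature = "\uFB01".toUpperCase();         // the fi ligature expands to two letters
const finalSigma = "\u0391\u03A3".toLowerCase(); // Σ at the end, after a cased letter
const loneSigma = "\u03A3".toLowerCase();        // Σ with no preceding cased letter

console.log(codePoints(dottedI));    // U+0069 U+0307
console.log(codePoints(eszett));     // U+0053 U+0053
console.log(codePoints(ligature));   // U+0046 U+0049
console.log(codePoints(finalSigma)); // U+03B1 U+03C2
console.log(codePoints(loneSigma));  // U+03C3
```

Three of these results change the string length and two depend on neighboring characters, so a one-rune-in, one-rune-out mapping cannot reproduce them.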

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| testdata/tests/cases/compiler/stringMappingSpecialCasing.ts | New compiler test covering special-casing string mappings (İ, Σ final sigma, ß, fi). |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.types | Reference baseline for types output of the new test. |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.symbols | Reference baseline for symbols output of the new test. |
| internal/stringutil/ranges.go | Introduces reusable rune-range binary search helper. |
| internal/stringutil/js_case.go | Adds JS-compatible case conversion with SpecialCasing + Final_Sigma context handling. |
| internal/stringutil/js_case_test.go | Unit tests for JS casing behavior (dotted i, final sigma, ß/ligatures). |
| internal/stringutil/js_case_generated.go | Generated Unicode special-casing mappings and property ranges used by JS casing. |
| internal/stringutil/generate.go | Adds go:generate directives to regenerate and format the Unicode casing tables. |
| internal/stringutil/_scripts/generate-special-casing.mts | Script to fetch/parse Unicode data and generate Go tables. |
| internal/scanner/scanner.go | Reuses stringutil.IsInRuneRanges instead of a scanner-local range helper. |
| internal/checker/checker.go | Uses JS-compatible casing for intrinsic string mapping types. |
Files not reviewed (1)
  • internal/stringutil/js_case_generated.go: Language not supported

Comment thread internal/stringutil/_scripts/generate-special-casing.mts Outdated
@Andarist

Contributor Author

> Why can't we use Go's unicode ranges to do this?

From what I understand, that's not possible. I added a code comment explaining this in my last commit.

> How can we make sure this is up to date long term?

For that reason, I implemented a committable code generator here rather than hand-rolling everything. It could be used to verify that the output file is up to date.

@Andarist Andarist requested a review from jakebailey May 18, 2026 09:56
@jakebailey

Member

See also #3930 which uses x/text for this, though I'm wary of that too...

@Andarist Andarist force-pushed the fix/intrinsic-lowercase-uppercase branch from 8f3cf1c to 9f9ed49 Compare May 19, 2026 10:17
@Andarist

Contributor Author

I won't lie, I used some heavy help from the agents now... I decided to write an audit script (I have not committed it yet):

internal/stringutil/_scripts/audit-js-casing.mts
#!/usr/bin/env -S node --experimental-strip-types --no-warnings

import {
    execFileSync,
    spawnSync,
} from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const repoRoot = path.resolve(import.meta.dirname, "../../..");
const helperSource = path.join(import.meta.dirname, "compare-js-casing.go");
const helperDir = fs.mkdtempSync(path.join(os.tmpdir(), "tsgo-js-casing-"));
const helperBin = path.join(helperDir, "compare-js-casing");

process.on("exit", () => {
    fs.rmSync(helperDir, { recursive: true, force: true });
});

execFileSync("go", ["build", "-o", helperBin, helperSource], {
    cwd: repoRoot,
    stdio: "inherit",
});

function firstCodePointSlice(s: string): [string, string] {
    if (!s) return ["", ""];
    const cp = s.codePointAt(0)!;
    const first = String.fromCodePoint(cp);
    const size = cp > 0xffff ? 2 : 1;
    return [first, s.slice(size)];
}

function jsCases(s: string) {
    const [first, rest] = firstCodePointSlice(s);
    return {
        upper: s.toUpperCase(),
        lower: s.toLowerCase(),
        cap: first ? first.toUpperCase() + rest : s,
        uncap: first ? first.toLowerCase() + rest : s,
    };
}

type GoResult = {
    upper: string;
    lower: string;
    cap: string;
    uncap: string;
};

function runBatch(inputs: string[]): GoResult[] {
    const proc = spawnSync(helperBin, {
        input: JSON.stringify(inputs.map(value => ({ value }))),
        encoding: "utf8",
        maxBuffer: 64 * 1024 * 1024,
    });
    if (proc.status !== 0) {
        throw new Error(proc.stderr || `helper exited with ${proc.status}`);
    }
    return JSON.parse(proc.stdout);
}

function escapeVisible(s: string): string {
    return JSON.stringify(s).slice(1, -1);
}

type Bucket = {
    count: number;
    samples: { input: string; expected: string; actual: string; meta?: string; }[];
};

function makeBucket(): Bucket {
    return { count: 0, samples: [] };
}

function noteMismatch(bucket: Bucket, input: string, expected: string, actual: string, meta?: string) {
    bucket.count++;
    if (bucket.samples.length < 20) {
        bucket.samples.push({
            input: escapeVisible(input),
            expected: escapeVisible(expected),
            actual: escapeVisible(actual),
            ...(meta ? { meta } : {}),
        });
    }
}

function makeResultStore() {
    return {
        upper: makeBucket(),
        lower: makeBucket(),
        cap: makeBucket(),
        uncap: makeBucket(),
    };
}

function compare(inputs: string[], metas: string[], store: ReturnType<typeof makeResultStore>) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const input = inputs[i];
        const meta = metas[i];
        const expected = jsCases(input);
        const actual = got[i];
        if (expected.upper !== actual.upper) noteMismatch(store.upper, input, expected.upper, actual.upper, meta);
        if (expected.lower !== actual.lower) noteMismatch(store.lower, input, expected.lower, actual.lower, meta);
        if (expected.cap !== actual.cap) noteMismatch(store.cap, input, expected.cap, actual.cap, meta);
        if (expected.uncap !== actual.uncap) noteMismatch(store.uncap, input, expected.uncap, actual.uncap, meta);
    }
}

function runSingleScalarSweep() {
    const store = makeResultStore();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    let testedScalars = 0;
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        batch.push(String.fromCodePoint(cp));
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compare(batch, metas, store);
            testedScalars += batch.length;
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compare(batch, metas, store);
        testedScalars += batch.length;
    }
    return { testedScalars, mismatches: store };
}

function runSigmaSweep(position: "before" | "after") {
    const lower = makeBucket();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        const ch = String.fromCodePoint(cp);
        batch.push(position === "before" ? `${ch}Σ` : `Σ${ch}`);
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compareSigmaBatch(batch, metas, lower);
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compareSigmaBatch(batch, metas, lower);
    }
    return lower;
}

function compareSigmaBatch(inputs: string[], metas: string[], lower: Bucket) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const expected = inputs[i].toLowerCase();
        const actual = got[i].lower;
        if (expected !== actual) {
            noteMismatch(lower, inputs[i], expected, actual, metas[i]);
        }
    }
}

const result = {
    singleScalarSweep: runSingleScalarSweep(),
    sigmaSweep: {
        before: runSigmaSweep("before"),
        after: runSigmaSweep("after"),
    },
};

console.log(JSON.stringify(result, null, 2));
internal/stringutil/_scripts/compare-js-casing.go
package main

import (
	"encoding/json"
	"io"
	"log"
	"os"
	"unicode/utf8"

	"github.com/microsoft/typescript-go/internal/stringutil"
)

type Input struct {
	Value string `json:"value"`
}

type Result struct {
	Upper string `json:"upper"`
	Lower string `json:"lower"`
	Cap   string `json:"cap"`
	Uncap string `json:"uncap"`
}

func main() {
	var inputs []Input
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	if err := json.Unmarshal(data, &inputs); err != nil {
		log.Fatal(err)
	}

	results := make([]Result, len(inputs))
	for i, in := range inputs {
		results[i] = Result{
			Upper: stringutil.ToUpperJS(in.Value),
			Lower: stringutil.ToLowerJS(in.Value),
			Cap:   transformFirstRune(in.Value, stringutil.ToUpperJS),
			Uncap: transformFirstRune(in.Value, stringutil.ToLowerJS),
		}
	}

	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		log.Fatal(err)
	}
}

func transformFirstRune(s string, transform func(string) string) string {
	if s == "" {
		return s
	}
	_, size := utf8.DecodeRuneInString(s)
	return transform(s[:size]) + s[size:]
}

This helped me find divergences between JS and x/text. So, based on this investigation, x/text isn't a suitable replacement for all of this if the goal is to match the JS semantics.

I also compared the v8 and SpiderMonkey implementations and decided to just match v8 (that was the easiest with the oracle scripts above ;p and it probably matches the most common expectations anyway).

@jakebailey

Member

What were the results of the analysis? What is the difference?

@Andarist

Contributor Author

I couldn't find any divergences between the current implementation and v8.

When it comes to comparing against x/text (where actual is x/text and expected is what v8 returns here):

'ʰΣ'.toLowerCase() // actual: "ʰς", expected: "ʰσ"
"ͅΣ".toLowerCase() // actual: "ͅς", expected: "ͅσ"

The only divergence class I found in testing was toLowerCase final-sigma context for strings of the form (the oracle found 267 such divergences but they were all of this shape).
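The thread reports the outputs but not the reason for them. One possible reading, which is an assumption and not something stated above, is that the engine first skips case-ignorable characters before Σ and only then looks for a cased letter, so a prefix that is both Cased and Case_Ignorable never counts as the preceding cased letter. The sketch below encodes that reading with the `\p{Cased}` and `\p{Case_Ignorable}` regular-expression property escapes; `isFinalSigma` and `check` are hypothetical names, not functions from this PR.

```typescript
// Property tests for a single code point (passed as a one-character string).
const cased = new RegExp("^\\p{Cased}$", "u");
const caseIgnorable = new RegExp("^\\p{Case_Ignorable}$", "u");

// Assumed rule: skip case-ignorable characters first, then require a cased
// letter before the sigma and no cased letter after it.
function isFinalSigma(chars: readonly string[], i: number): boolean {
    let before = i - 1;
    while (before >= 0 && caseIgnorable.test(chars[before])) before--;
    if (before < 0 || !cased.test(chars[before])) return false;

    let after = i + 1;
    while (after < chars.length && caseIgnorable.test(chars[after])) after++;
    return after >= chars.length || !cased.test(chars[after]);
}

// Convenience wrapper: split into code points and test the first Σ.
function check(s: string): boolean {
    const chars = Array.from(s);
    return isFinalSigma(chars, chars.indexOf("\u03A3"));
}

console.log(check("\u0391\u03A3"));       // true:  Α then Σ
console.log(check("\u02B0\u03A3"));       // false: ʰ is skipped as case-ignorable
console.log(check("\u0345\u03A3"));       // false: U+0345 is skipped as case-ignorable
console.log(check("\u0391\u03A3\u0391")); // false: a cased letter follows
```

Under this reading the two quoted inputs lowercase to σ rather than ς, which agrees with the expected values listed above.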

Comment thread internal/stringutil/_scripts/generate-special-casing.mts Outdated
@jakebailey

Member

> The only divergence class I found in testing was toLowerCase final-sigma context for strings of the form (the oracle found 267 such divergences but they were all of this shape).

Can you also ensure these are tested?

@Andarist

Andarist commented Jun 2, 2026

Contributor Author

> Can you also ensure these are tested?

Do you want tests for all 267 of them? The unit tests here already include a single example of that divergence (IIRC).

@jakebailey

Member

No, not all of them I don't think, but a few. Or some sort of way to generate them and then use https://github.com/microsoft/typescript-go/tree/main/internal/testutil/jstest to cross check?
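One way to act on this suggestion is to let a JS engine compute the expected values for a handful of inputs and emit them as a table that a Go test can consume. The sketch below only builds that table; how it would be wired into `jstest` is left out because the thread does not describe that API. The two prefixes are the ones quoted earlier in the conversation, and `sigmaCases` and the `SigmaCase` shape are hypothetical.

```typescript
type SigmaCase = { input: string; expected: string };

// Build "<prefix>Σ" inputs and record what the running engine returns
// for toLowerCase, so the engine acts as the oracle.
function sigmaCases(prefixes: readonly string[]): SigmaCase[] {
    return prefixes.map(prefix => {
        const input = prefix + "\u03A3";
        return { input, expected: input.toLowerCase() };
    });
}

// U+02B0 (ʰ) and U+0345 are the prefixes from the two examples in the thread.
const cases = sigmaCases(["\u02B0", "\u0345"]);
console.log(JSON.stringify(cases, null, 2));
```

Because the expected values come from whichever engine runs the script, the table reflects that engine's Unicode data; the thread settles on v8 as the reference.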

@RyanCavanaugh RyanCavanaugh requested a review from jakebailey June 3, 2026 18:04
@jakebailey

Member

Just to be clear, I think this PR is fine after my suggested changes. Unfortunate that this is required but obviously JS and Unicode are a terrible pair

@jakebailey

Member

I made a couple of changes:

  • Made this surrogate aware, now that #4181 (Fix a slew of UTF-8/UTF-16 related issues) is in
  • Made this not use Go's full string rune conversion, related to surrogates but also for perf
  • Pulled in the npm package that provides unicode data and use that, instead of fetching it.
  • While here, generated the identifier start data, which we previously did not have a code gen for and had copy/pasted from Strada.
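To illustrate why surrogate awareness matters for Capitalize<> and Uncapitalize<>, here is a TypeScript sketch of the general problem. It is not the Go change itself, and both function names are hypothetical. The input starts with U+10400, a character outside the Basic Multilingual Plane that has a lowercase form (U+10428).

```typescript
// Naive version: treats the first UTF-16 code unit as the first character.
function uncapitalizeByCodeUnit(s: string): string {
    return s.charAt(0).toLowerCase() + s.slice(1);
}

// Code-point-aware version: takes the whole first code point, which may
// span two UTF-16 code units (a surrogate pair).
function uncapitalizeByCodePoint(s: string): string {
    if (s === "") return s;
    const first = String.fromCodePoint(s.codePointAt(0)!);
    return first.toLowerCase() + s.slice(first.length);
}

const input = String.fromCodePoint(0x10400) + "x";
const naive = uncapitalizeByCodeUnit(input);
const aware = uncapitalizeByCodePoint(input);

console.log(naive === input);                              // true: nothing was lowercased
console.log(aware === String.fromCodePoint(0x10428) + "x"); // true
```

The naive version lowercases a lone high surrogate, which has no case mapping, so the string comes back unchanged.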

This is using Unicode 15.1, but the case rules actually changed slightly in later versions! tsc would have inconsistent behavior, but we pin it since we have to, and I chose not to update it in this PR to 17 or something.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 16 out of 19 changed files in this pull request and generated 1 comment.

Files not reviewed (2)
  • internal/stringutil/identifier_parts_generated.go: Language not supported
  • internal/stringutil/js_case_generated.go: Language not supported

Comment thread internal/stringutil/_scripts/generate-unicode-data.mts
@jakebailey

Member

Okay, one other change; finally switching to unicode.RangeTable away from our surrogate pair slices. In addition to using 40% less binary space, it's also 3x faster and less code.

| Workload | Old (isInRuneRanges, flat []rune) | New (unicode.Is, RangeTable) | Speedup |
| --- | --- | --- | --- |
| Mixed (ASCII-heavy) IdentifierStart | ~1145 ns | ~350 ns | 3.3× |
| Mixed (ASCII-heavy) IdentifierPart | ~1268 ns | ~390 ns | 3.2× |
| Honest non-ASCII hot path (ch ≥ 0x80) IdentifierPart | ~477 ns | ~185 ns | 2.5× |
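For readers unfamiliar with the "flat []rune" layout named in the Old column, the sketch below shows the general idea in TypeScript. It assumes the flat layout stores inclusive `[start, end]` pairs back to back and is searched with a binary search; that layout, the `isInRanges` name, and the tiny sample table are illustrative assumptions, not the actual data or code from this PR.

```typescript
// Assumed layout: [start0, end0, start1, end1, ...], sorted and non-overlapping.
// This is a tiny illustrative sample, not the real identifier table.
const sampleRanges: readonly number[] = [
    0x41, 0x5a, // A-Z
    0x61, 0x7a, // a-z
    0xaa, 0xaa, // ª
    0xb5, 0xb5, // µ
    0xc0, 0xd6, // À-Ö
];

// Binary search over range pairs; each step compares against one pair.
function isInRanges(cp: number, ranges: readonly number[]): boolean {
    let lo = 0;
    let hi = ranges.length / 2 - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        const start = ranges[2 * mid];
        const end = ranges[2 * mid + 1];
        if (cp < start) hi = mid - 1;
        else if (cp > end) lo = mid + 1;
        else return true;
    }
    return false;
}

console.log(isInRanges(0x41, sampleRanges)); // true
console.log(isInRanges(0x5b, sampleRanges)); // false
console.log(isInRanges(0xaa, sampleRanges)); // true
console.log(isInRanges(0xd7, sampleRanges)); // false
```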

@jakebailey jakebailey added this pull request to the merge queue Jun 10, 2026
Merged via the queue into microsoft:main with commit 7d5b8f5 Jun 10, 2026
37 of 39 checks passed

Development

Successfully merging this pull request may close these issues.

`Lowercase<"İ">` differs between tsgo and tsc (Go strings.ToLower vs JS String.prototype.toLowerCase)

5 participants