Unicode Handling #681

kg583 · 2026-04-18T17:19:39Z


Dealing with Unicode strings is in a weird spot at the moment, as strings are treated as a list of bytes by most functions... except for string:ord, which can only return the codepoint for an entire character. If you have rather structured strings, you can use slicing to properly analyze their contents with the current builtins, but the general case requires the user to parse UTF-8 (or whatever encoding) themselves, which I think is pretty untenable (and might actually be impossible as-is?).
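To make the mismatch concrete, here is a small illustration in Python rather than ArkScript. Python's `bytes`/`str` split is only standing in for the two views described above, and the variable names are made up for the example.

```python
# Illustration only: Python's bytes/str split stands in for the two views
# described above; none of this is ArkScript code.
text = "h\u00e9llo"                # 'héllo', with a precomposed U+00E9
raw = text.encode("utf-8")         # the byte view most builtins operate on

byte_count = len(raw)              # counts bytes: U+00E9 takes two in UTF-8
char_count = len(text)             # counts code points

# Indexing the byte view can land in the middle of a character.
second_byte = raw[1:2]             # only the lead byte of U+00E9
try:
    second_byte.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False     # a lone lead byte is not a full character

# A correctly aligned slice decodes fine (the "structured strings" case).
e_acute = raw[1:3].decode("utf-8")
code_point = ord(e_acute)          # U+00E9
```

The aligned slice only works because the caller already knows where the character boundaries are, which is exactly what the general case lacks.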

Given that strings are, to the user, nearly just lists of unstructured bytes already, I think making string handling explicitly 8-bit clean is an easy choice. Janet is a good model language for this approach: looping over a string loops over the bytes with no encoding knowledge, with a separate UTF-8 API if you need to encode/decode things yourself.

Janet returns a byte as an integer directly (so an ord function isn't even necessary), but in ArkScript's case, indexing into a string returns a string containing just that byte. string:ord would therefore need to be adapted to simply convert such a one-byte string to the byte it contains, which breaks its current ability (and requirement) to parse a full character; string:chr should be changed to do the reverse. Escape sequences would need to be expanded to make representing lone high bytes more user-friendly. A separate API would then be used to split up a string, given some encoding (UTF-8 by default), into separate Unicode characters, probably as a list of strings; string:codePoints or string:graphemes align with choices from other langs (though "grapheme" isn't technically the right term for this).
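A rough sketch of what that 8-bit-clean surface could look like, written in Python with `bytes` playing the role of an ArkScript string. The function names `byte_ord`, `byte_chr` and `code_points` are hypothetical stand-ins for the adapted string:ord, string:chr and the proposed string:codePoints; nothing here is existing ArkScript behavior.

```python
def byte_ord(s: bytes) -> int:
    """Hypothetical adapted string:ord: a one-byte string to its byte value."""
    if len(s) != 1:
        raise ValueError("expected a string of exactly one byte")
    return s[0]


def byte_chr(n: int) -> bytes:
    """Hypothetical adapted string:chr: a byte value to a one-byte string."""
    return bytes([n])


def code_points(s: bytes, encoding: str = "utf-8") -> list[bytes]:
    """Hypothetical string:codePoints: one string per Unicode character.

    Assumes a BOM-less codec such as utf-8 or latin-1.
    """
    return [ch.encode(encoding) for ch in s.decode(encoding)]


sample = "a\u20ac".encode("utf-8")   # 'a' followed by the euro sign
lead = byte_ord(sample[1:2])         # lead byte of the euro sign
back = byte_chr(lead)                # the same lone high byte, as a string
parts = code_points(sample)          # the string split per character
```

Indexing stays byte-level and cheap, while anything that needs character boundaries goes through the explicit splitting call.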

Another option, which follows from ArkScript's Python-like approach of indexing into a string returning another string, is to make strings always Unicode aware, like Python. Indexing into a string always returns a full character; string:ord and string:chr can stay as they are, and something like string:codePoints or string:graphemes becomes unnecessary (just loop over the string). The catch is that this breaks @, whose current byte-level behavior would instead be realized via an encode/decode API which converts between strings and lists of integers; string slicing would need a similar update.
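Since Python is the model here, its own `str` already shows what this option looks like in practice. The snippet below is plain Python, not a proposal for ArkScript syntax; `encode`/`decode` play the role of the encode/decode API mentioned above.

```python
text = "a\u20acb"                        # 'a', euro sign, 'b'

second = text[1]                         # indexing yields a whole character
chars = [c for c in text]                # looping yields characters, no extra API

# Byte-level access becomes an explicit encode/decode round trip
# through a list of integers.
as_bytes = list(text.encode("utf-8"))
round_trip = bytes(as_bytes).decode("utf-8")
```

The byte view is still reachable, but only after an explicit conversion, so the encoding is hidden by default rather than exposed.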

Both approaches are definitely familiar, though I'm unsure which people tend to prefer; it's mostly a matter of how exposed the encoding actually is. Both also require some amount of breaking changes, though given that looping over a purely ASCII string would be unaltered in either case, there would likely be few ramifications for current users.

(tagging @edsrzf, @GolfingSuccess, and @Steffan153 for their takes) (though I do not want golf considerations to dominate the discourse)

SuperFola · 2026-04-18T18:53:43Z

Maintainer

Thanks for your input!

I'd be in the second team, for this to be logical and helpful (abstracting over details like encoding). Changing how @ behaves will mean changing @=, @@=, and len (which would render string:utf8len useless but probably require a string:bytesLen to be added). string:slice should not be affected, nor should anything implemented in ArkScript; the C++ string builtins would have to be reviewed one by one.

I'll study the subject a bit more in the following weeks.

3 replies

kg583 Apr 18, 2026
Author

> I'd be in the second team, for this to be logical and helpful (abstracting over details like encoding). Changing how @ behaves will mean changing @=, @@=, and len (which would render string:utf8len useless but probably require a string:bytesLen to be added). string:slice should not be affected, nor should anything implemented in ArkScript; the C++ string builtins would have to be reviewed one by one.

Yeah, I was initially thinking 8-bit clean was the "simplest" transition, but the more I think about it the more I also fall in the second camp. string:bytesLen would probably be better realized by len on the output of a string:bytes function, which could accept an output encoding.

SuperFola Apr 18, 2026
Maintainer

> which could accept an output encoding.

For you, would that mean adding handling for Latin-1, Windows code pages and all? That should be doable, but seems like a lot of work for little use (as of right now, and I prefer adding features when they're needed).

kg583 Apr 18, 2026
Author

I agree it's pointless work if nobody needs it yet, but might as well leave the function signature open to the prospect.
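A signature left open in that way might look roughly like the Python sketch below. `string_bytes` is a hypothetical stand-in for the proposed string:bytes, and the optional `encoding` parameter is the assumption being illustrated; only UTF-8 would need to be supported at first.

```python
def string_bytes(s: str, encoding: str = "utf-8") -> list[int]:
    """Hypothetical string:bytes: the encoded bytes of s as a list of integers."""
    return list(s.encode(encoding))


word = "h\u00e9llo"                               # 'héllo'
utf8_len = len(string_bytes(word))                # what string:bytesLen would report
latin1_len = len(string_bytes(word, "latin-1"))   # only if other encodings ever land
```

Taking len of the result covers the string:bytesLen use case without adding a dedicated builtin.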

edsrzf · 2026-04-18T23:45:30Z


I broadly agree with the two options.

The tradeoffs are generally around:

- Usability vs implementation size and complexity
- Performance for certain operations (string code point length, random access for code points) vs memory usage
- Interop with other languages/libraries (e.g. fmt)

I don't think you'd need to worry about non-Unicode encodings or things like normalization at this point.

Option 1

Strings are arrays of bytes, and UTF-8 by convention. This is generally easier to implement because you're pushing the problem of dealing with Unicode to your users, though you can provide helpers for some operations.

Memory usage is optimal, but string code point length and random access are O(n), and you have to think about how to deal with invalid UTF-8.
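To show where the O(n) comes from, here is a minimal Python sketch over raw UTF-8 bytes. The helper names are invented for the example, and the first two functions assume the input is already valid UTF-8; the third shows one possible way to check that up front.

```python
def utf8_length(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx). O(n)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_index(data: bytes, i: int) -> bytes:
    """Return the bytes of the i-th code point by scanning from the start. O(n)."""
    if i < 0:
        raise IndexError("negative indices are not supported in this sketch")
    count = -1
    start = None
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # this byte starts a new code point
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return data[start:pos]
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:]                 # the requested code point was the last one


def is_valid_utf8(data: bytes) -> bool:
    """One possible policy for invalid input: detect it before doing anything else."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = "a\u20acb".encode("utf-8")       # 5 bytes, 3 code points
```

Both helpers have to walk the bytes from the start because a code point's position depends on the width of everything before it.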

Generally interoperates well with other things, since they also speak UTF-8.

Option 2

Strings are arrays of Unicode code points. The easiest thing you can do is to implement strings as arrays of 32-bit integers and call it a day. However, this requires up to 4x the memory in the common (Latin alphabet) case, where most characters would only require a byte of space if UTF-8-encoded.
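The 4x figure is easy to check by comparing encoded sizes. The Python snippet below uses `utf-32-le` as a stand-in for an array of 32-bit integers (it adds no byte order mark, so the size is exactly four bytes per code point).

```python
ascii_text = "hello world"
utf8_size = len(ascii_text.encode("utf-8"))        # one byte per character
utf32_size = len(ascii_text.encode("utf-32-le"))   # four bytes per character

accented = "h\u00e9llo"                            # 'héllo'
accented_utf8 = len(accented.encode("utf-8"))      # one character needs two bytes
accented_utf32 = len(accented.encode("utf-32-le"))
```

For pure ASCII the ratio is exactly 4; it shrinks as more multi-byte characters appear.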

CPython has an adaptable string implementation, where the underlying representation changes depending on the contents of the string. This is a pretty good way to navigate the memory tradeoff at the cost of implementation complexity.
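As a rough idea of how such a scheme picks a representation, the Python sketch below chooses a per-character width from the highest code point in the string, in the spirit of CPython's flexible string representation (PEP 393). `storage_width` is a made-up helper for illustration, not CPython's actual code.

```python
def storage_width(text: str) -> int:
    """Bytes per code point a PEP 393-style representation would pick."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:        # fits in Latin-1: one byte each
        return 1
    if highest < 0x10000:      # fits in the Basic Multilingual Plane: two bytes each
        return 2
    return 4                   # anything beyond needs four bytes each


widths = [
    storage_width("hello"),
    storage_width("a\u20ac"),        # euro sign forces two bytes per character
    storage_width("a\U0001F600"),    # an emoji forces four bytes per character
]
```

The cost is that a single wide character widens the whole string, and every string operation has to handle three layouts.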

Interop is likely to be trickier. You may have to convert strings to UTF-8 to pass them elsewhere, incurring overhead.

1 reply

SuperFola Apr 21, 2026
Maintainer

Thanks for the detailed write up!

An argument for option 1 would be to avoid degrading performance and to keep the basic API O(1) for len and element access.

As much as I'd love to make everything uniform, option 1 keeps @ at O(1). Since @ is the most common way to access an array as a user (or hidden in the stdlib), it shouldn't also be the easy way to query code points, since that's costly. That's what Rust does: first get the code points, then get the one you want (make the slow path hard to reach, so that using it is intentional).
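A minimal Python sketch of that shape, assuming a hypothetical `ByteString` type: plain indexing is the cheap byte-level path, and code points are only reachable through an explicitly named iterator, similar in spirit to Rust's `s.chars().nth(i)`.

```python
from itertools import islice


class ByteString:
    """Hypothetical string type: O(1) byte access, explicit O(n) code point access."""

    def __init__(self, data: bytes):
        self._data = data

    def __len__(self) -> int:
        return len(self._data)            # byte length, O(1)

    def __getitem__(self, i: int) -> int:
        return self._data[i]              # byte access, O(1)

    def code_points(self):
        """The slow path: decode as UTF-8 and yield one character at a time."""
        yield from self._data.decode("utf-8")


s = ByteString("a\u20acb".encode("utf-8"))
byte_len = len(s)
second_byte = s[1]
# Getting the second character requires asking for code points first.
second_char = next(islice(s.code_points(), 1, None))
```

Nothing prevents the slow path from being used, but the caller has to spell it out.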
