Replies: 2 comments 4 replies
-
Thanks for your input! I'd be in the second team, so that this is logical and helpful (abstracting over details like encoding). I'll study the subject a bit more in the following weeks.
-
I broadly agree with the two options. The tradeoffs are generally around:
I don't think you'd need to worry about non-Unicode encodings or things like normalization at this point.

Option 1

Strings are arrays of bytes, and UTF-8 by convention. This is generally easier to implement because you're pushing the problem of dealing with Unicode to your users, though you can provide helpers for some operations. Memory usage is optimal, but string code point length and random access are O(n), and you have to think about how to deal with invalid UTF-8. This generally interoperates well with other things, since they also speak UTF-8.

Option 2

Strings are arrays of Unicode code points. The easiest thing you can do is to implement strings as arrays of 32-bit integers and call it a day. However, this requires up to 4x the memory in the common (Latin alphabet) case, where most characters would only require a byte of space if UTF-8-encoded. CPython has an adaptable string implementation, where the underlying representation changes depending on the contents of the string. This is a pretty good way to navigate the memory tradeoff at the cost of implementation complexity. Interop is likely to be trickier. You may have to convert strings to UTF-8 to pass them elsewhere, incurring overhead.
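To make the tradeoffs between the two options concrete, here is a minimal Python sketch. It is only an illustration, not ArkScript code: `code_point_length` and `code_point_at` are hypothetical helper names, and Python's `bytes` type stands in for an Option 1 string. It shows why code point length and random access need a linear scan over UTF-8 bytes, what the 4x memory cost of 32-bit code points looks like, and two common ways of reacting to invalid UTF-8.

```python
def code_point_length(data: bytes) -> int:
    """Count code points in (assumed valid) UTF-8 data.

    Continuation bytes look like 10xxxxxx, so every other byte starts
    a new code point. The whole buffer must be scanned: O(n).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of (assumed valid) UTF-8 data.

    Also O(n): the start offsets are only known after scanning.
    """
    if index < 0:
        raise IndexError("negative indices are not supported in this sketch")
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    start = starts[index]  # raises IndexError when out of range
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


# Option 1: the string is its UTF-8 bytes.
data = "héllo €".encode("utf-8")
byte_len = len(data)                   # O(1), counts bytes
cp_len = code_point_length(data)       # O(n), counts code points
second = code_point_at(data, 1)        # the two-byte character "é"

# Option 2 (naive): one 32-bit integer per code point.
ascii_text = "hello world"
utf8_size = len(ascii_text.encode("utf-8"))      # 1 byte per character
utf32_size = len(ascii_text.encode("utf-32-le"))  # 4 bytes per character

# Invalid UTF-8 has to be handled somehow under Option 1.
invalid = bytes([0x68, 0xFF, 0x69])  # 0xFF never appears in valid UTF-8
lenient = invalid.decode("utf-8", errors="replace")  # substitutes U+FFFD
try:
    invalid.decode("utf-8")  # strict decoding raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Whether invalid input should raise or be replaced is a design decision in its own right; the sketch only shows that both behaviors are easy to offer as helpers on top of a byte-oriented string.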
-
Dealing with Unicode strings is in a weird spot at the moment, as strings are treated as a list of bytes by most functions... except for `string:ord`, which can only return the codepoint for an entire character. If you have rather structured strings, you can use slicing to properly analyze their contents with the current builtins, but the general case requires the user to parse UTF-8 (or whatever encoding) themselves, which I think is pretty untenable (and might actually be impossible as-is?).

Given that strings are, to the user, nearly just lists of unstructured bytes already, I think making string handling explicitly 8-bit clean is an easy choice. Janet is a good model language for this approach: looping over a string loops over the bytes with no encoding knowledge, with a separate UTF-8 API if you need to encode/decode things yourself.

Janet returns a byte as an integer directly (so an `ord` function isn't even necessary), but in ArkScript's case, indexing into a string itself returns a string containing just that byte, so `string:ord` would need to be adapted to simply convert such a string to the byte it contains (breaking its current ability (and requirement) to parse a full character) (and `string:chr` should be changed to do the reverse). Escape sequences would need to be expanded to make representing lone high bytes more user-friendly. A separate API would then be used to split up a string, given some encoding (UTF-8 by default), into separate Unicode characters, probably as a list of strings; `string:codePoints` or `string:graphemes` align with choices from other langs (though "grapheme" isn't technically the right term for this).

Another option, which follows from ArkScript's Python-like approach of indexing into a string returning another string, is to make strings always Unicode aware, like Python. Indexing into a string always returns a full character; `string:ord` and `string:chr` can stay as they are, and something like `string:codePoints` or `string:graphemes` becomes unnecessary (just loop over the string). The catch is that this breaks `@`, whose current behavior would be realized via an encode/decode API which converts between lists of integers; string slicing would need a similar update.

Both approaches are definitely familiar, though I'm unsure which people tend to prefer; it's mostly a matter of how exposed the encoding actually is. Both also require some amount of breaking changes, though given that looping over a purely ASCII string would be unaltered in either case, there would likely be few ramifications for current users.
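As a rough illustration of how the two approaches differ for the user, here is a Python sketch. None of it is ArkScript's actual API: `byte_ord`, `byte_chr`, `code_points`, `encode` and `decode` are hypothetical names chosen for this example, Python's `bytes` models an 8-bit clean string, and Python's `str` models a Unicode-aware one.

```python
# Approach 1: 8-bit clean strings (modeled with bytes).
def byte_ord(s: bytes) -> int:
    """Hypothetical adapted ord: one-byte string -> the byte it contains."""
    if len(s) != 1:
        raise ValueError("expected a string of exactly one byte")
    return s[0]


def byte_chr(n: int) -> bytes:
    """Hypothetical adapted chr: byte value -> one-byte string."""
    return bytes([n])  # raises ValueError outside range(256)


def code_points(s: bytes, encoding: str = "utf-8") -> list[bytes]:
    """Hypothetical separate API: split into one string per code point."""
    return [ch.encode(encoding) for ch in s.decode(encoding)]


# Approach 2: Unicode-aware strings (modeled with str).
def encode(s: str, encoding: str = "utf-8") -> list[int]:
    """Hypothetical encode API: string -> list of byte values."""
    return list(s.encode(encoding))


def decode(values: list[int], encoding: str = "utf-8") -> str:
    """Hypothetical decode API: list of byte values -> string."""
    return bytes(values).decode(encoding)


raw = "né€".encode("utf-8")  # 1 + 2 + 3 = 6 bytes

# Approach 1: indexing yields a one-byte string, even mid-character.
lone = raw[1:2]               # first byte of "é", a lone high byte
lone_value = byte_ord(lone)   # 0xC3
chars = code_points(raw)      # full characters only via the separate API

# Approach 2: indexing yields a full character; bytes need encode/decode.
text = "né€"
middle = text[1]              # "é"
as_bytes = encode(text)       # same six byte values as `raw`
round_trip = decode(as_bytes)

# Pure ASCII behaves the same way under both approaches.
ascii_bytes = [b"abc"[i:i + 1].decode("ascii") for i in range(3)]
ascii_chars = list("abc")

# Code points are not graphemes: "e" + combining acute accent is one
# user-perceived character but two code points.
combining = "e\u0301"
combining_length = len(combining)
```

The last two lines are why a name like `string:graphemes` would be slightly misleading for an API that splits on code points.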
(tagging @edsrzf, @GolfingSuccess, and @Steffan153 for their takes) (though I do not want golf considerations to dominate the discourse)