It's not that simple. Obviously, you'd choose UTF-8, but that exposes another issue: a codepoint is not a character. There's no simple & elegant solution to this at the language level.
Precisely. That's why I think representing strings as character slices and letting third party libraries handle them is not a good solution.
Deciding what features to put in the language vs. the stdlib vs. third-party libraries is one of the hardest parts of language design. Personally, I believe strings are important and common enough that they deserve special treatment at the language level.
Edit (late addition):
I think not treating strings specially is mostly fine in C, but Zig seems to aim at being a little less low-level.
Zig does not do this. It represents strings as byte slices. A single user-perceived character can be multiple codepoints, with each codepoint being multiple bytes in UTF-8.
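To make that concrete (Python here, purely for illustration; the byte counts are properties of UTF-8 itself, not of any language):

```python
# "é" two ways: one precomposed codepoint vs. e + a combining acute accent
precomposed = "\u00e9"
combining = "e\u0301"

assert len(precomposed) == 1                    # one codepoint...
assert len(precomposed.encode("utf-8")) == 2    # ...encoded as two bytes
assert len(combining) == 2                      # two codepoints...
assert len(combining.encode("utf-8")) == 3      # ...encoded as three bytes
assert precomposed != combining                 # and they don't even compare equal
```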
A big thing about Zig is not hiding complexity. If UTF-8 were implemented at a language level (whatever that means), then "language level" string operations would be non-linear, which would be very non-Ziggy. I could see value in a standard library UTF-8 implementation, but a LOT of forethought would need to be put into it. I think keeping UTF-8 string manipulation at the third-party library level is a good choice for now. Maybe once the language is finalized, the ecosystem is more developed, and lessons have been learned from the third-party libraries, then the standard library can implement this.
I'm not talking about language-level string operations like concatenation with + or anything like that, because that certainly wouldn't make sense in a language like Zig.
I'm not advocating for string functionality in the language, I'm advocating for a way to not allow byte slice functionality on a thing that is clearly not a byte slice.
> a way to not allow byte slice functionality on a thing that is clearly not a byte slice
This already exists in the form of structs or opaque types. Both of these approaches would end up being implemented in "userspace" anyways, whether that's standard library or third-party.
However, (UTF-8) strings are byte slices. You can do simple manipulation with them as byte slices safely and validly. Split on spaces? Sure. Tokenize? Sure. Find substring? Sure. You can't do things that depend on say UTF-8 graphemes, but you can safely do most things that depend on bytes. For most purposes, treating strings as byte slices is the safest and correct approach.
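For instance (Python bytes standing in for byte slices): because UTF-8 never reuses ASCII byte values inside multi-byte sequences, byte-level split and search behave correctly on well-formed input with a consistent encoding:

```python
s = "naïve café".encode("utf-8")

# Splitting on an ASCII byte is safe: UTF-8 continuation bytes are
# always >= 0x80, so a 0x20 byte can only ever be a real space.
words = s.split(b" ")
assert [w.decode("utf-8") for w in words] == ["naïve", "café"]

# Substring search is safe too, as long as both sides use the same bytes.
assert "café".encode("utf-8") in s
```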
Doing find-substring as find-byte-subsequence won't behave correctly in many cases, because semantically equivalent strings can have multiple different byte-sequence representations. Treating strings as byte slices exposes a lot of footguns; it shouldn't be easy, just as e.g. treating floating-point numbers as byte sequences shouldn't be easy.
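A concrete case of that footgun, sketched in Python (any byte-level search would fail the same way):

```python
import unicodedata

haystack = "café".encode("utf-8")      # é as the precomposed codepoint U+00E9
needle = "cafe\u0301".encode("utf-8")  # é as e + combining acute U+0301

# Semantically the same word, but the byte-level search finds nothing:
assert needle not in haystack

# Only after normalizing both sides to the same form does it match:
nfc = unicodedata.normalize("NFC", needle.decode("utf-8")).encode("utf-8")
assert nfc in haystack
```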
Technically the shortest UTF-8 representation is _the_ representation, and _correctly_normalized_ Unicode is uniquely represented, but fair enough: unknown input may be slightly malformed. Complexities like this are why one shouldn't underestimate the nuances (and runtime costs!) of implementing proper Unicode. As for representing strings as plain byte sequences, that is the most basic way to represent text without placing any assumptions on it. It's the assumption of potentially incorrect invariants that's the issue. If you have the faculties to handle Unicode correctly (and very few languages do), then something more opaque may be a better fit than a byte slice.
> Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented
Not necessarily the shortest (NFC means not using precomposed characters added in later revisions of the standard), and you only get a normalized representation if you've actually normalized it - if you've just accepted and maybe validated some UTF-8 from outside, it probably won't be in normalized form. IMO it's worth having separate types for Unicode strings and normalized Unicode strings, and maybe the latter should expose more of the codepoint-sequence representation, but I don't know of any language that implements that.
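The validated-but-not-normalized gap is easy to demonstrate; in Python terms (which has the machinery in `unicodedata`, though notably no separate type for normalized strings):

```python
import unicodedata

raw = b"cafe\xcc\x81"        # arrives off the wire
text = raw.decode("utf-8")   # decodes cleanly, i.e. it *is* valid UTF-8...
assert not unicodedata.is_normalized("NFC", text)   # ...but it is not in NFC

normalized = unicodedata.normalize("NFC", text)
assert unicodedata.is_normalized("NFC", normalized)
assert normalized == "caf\u00e9"
```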
> it shouldn't be easy just as e.g. treating floating-point numbers as byte sequences shouldn't be easy.
That's a nice analogy.
> Doing find substring by find byte subsequence won't behave correctly in many cases, where semantically equivalent strings have multiple different bytesequence representation.
Unfortunately that's nearly impossible to do sanely in the general case, no matter how the string is represented.
I'm curious: what would be a good reason to make treating floating-point numbers as byte sequences any harder than whatever is required to make the reinterpretation obvious (provided their binary format is well defined)?
There are footguns in making that representation easy to access: e.g. if you hash the byte sequence to use floats as hash-table keys, it will almost work, but you'll get a very subtle bug because 0.0 and -0.0 compare equal yet hash differently. And frankly, most of the things you'd do with the byte sequence have more semantically correct ways to be done. There should be a way to access that representation, but it shouldn't be something you stumble into accidentally, IMO.
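The zero case in Python (`float_bits` is a hypothetical helper for this sketch; `struct.pack` does the byte reinterpretation):

```python
import struct

def float_bits(x: float) -> bytes:
    # reinterpret the IEEE 754 double as its 8 raw bytes (hypothetical helper)
    return struct.pack("<d", x)

assert 0.0 == -0.0                          # equal as numbers
assert float_bits(0.0) != float_bits(-0.0)  # but different byte patterns
# A hash built on the bytes would put these "equal" keys in different
# buckets; Python's own hash respects numeric equality instead:
assert hash(0.0) == hash(-0.0)
```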
You are talking about what stringy things can be done with byte slices and I'm talking about all the byteslicy things that shouldn't be done with strings.
Like subslicing. And accessing individual bytes in it.