
Python (as of version 3.3) does something similar. Source code is assumed UTF-8 by default, and all other string creation requires explicit conversion from bytes. So when creating and storing a string object, Python looks at the widest code point that will be in the string, and chooses the narrowest encoding -- latin-1, UCS-2, or UCS-4 -- that can store the string as fixed-width units.

This gets all the benefits people usually want from "just use UTF-8", and then some -- strings containing only code points in the latin-1 range (not just the ASCII range) take one byte per code point -- and also keeps the property of code units being fixed width no matter what's in the string. Which means programmers don't have to deal with leaky abstractions that are all too common in languages that expose "Unicode" strings which are really byte sequences in some particular encoding of Unicode.
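A quick way to see this per-code-point cost directly (a sketch assuming CPython, where `sys.getsizeof` reflects the compact string representation introduced in 3.3):

    import sys

    # Measure the marginal storage cost per code point by growing a
    # string and watching how the object size changes.
    def per_code_point(ch, n=1000):
        return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n

    ascii_cost  = per_code_point("a")           # ASCII range
    latin1_cost = per_code_point("\u00e9")      # é: latin-1 range, still 1 byte
    bmp_cost    = per_code_point("\u0394")      # Δ: BMP, 2 bytes
    astral_cost = per_code_point("\U0001F600")  # emoji: astral plane, 4 bytes

On CPython this reports 1.0, 1.0, 2.0, and 4.0 bytes respectively -- the narrowest fixed-width encoding that fits the widest code point in the string.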



The tradeoff is that at unpredictable moments, memory requirements for string content can quadruple.
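For a concrete illustration of that blowup (again assuming CPython's compact representation): a single astral-plane code point forces the entire string into 4-byte units.

    import sys

    ascii_only = "a" * 1000
    with_emoji = "a" * 999 + "\U0001F600"  # one emoji at the end

    # Same length in code points, but the emoji pushes the whole
    # string from 1-byte to 4-byte storage units.
    assert len(ascii_only) == len(with_emoji)
    assert sys.getsizeof(with_emoji) > 3 * sys.getsizeof(ascii_only)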

Python is inexorably committed to idioms which assume fixed-width characters -- there's no persuading the community to use e.g. functions to obtain substrings rather than array indexes. So this is an understandable design decision.


> assume fixed width characters

Python strings are not iterables of characters. They're iterables of Unicode code points. This is why leaking the internal storage up to the programmer is problematic; prior to 3.3, you'd routinely see artifacts of the internal storage (like surrogate pairs) which broke the "strings are iterables of code points" abstraction.
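To see the abstraction holding on 3.3+: iterating a string containing an astral-plane character yields whole code points, not the surrogate pairs that narrow builds used to leak.

    s = "a\U0001F600b"  # 'a', an astral-plane emoji, 'b'

    # Three code points -- no surrogate pair appears, as it would
    # have on a pre-3.3 narrow build.
    assert len(s) == 3
    assert [hex(ord(c)) for c in s] == ["0x61", "0x1f600", "0x62"]
    assert s[1] == "\U0001F600"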

> e.g. functions to obtain substrings rather than array indexes

Strings are iterables of code points. Indexing into a string yields the code point at the requested index. While I'd like to have an abstraction for sequences of graphemes, strings-as-code-points is not the worst thing that a language could do. And all the "just use this thing that does exactly the same thing with a different name because I want indexing/length but also want to insist people don't call them that" is frankly pointless.
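The code-point/grapheme distinction is easy to demonstrate: a user-perceived character ("é") can be two code points, and `len()` and indexing count code points.

    import unicodedata

    cafe = "cafe\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
    assert len(cafe) == 5        # five code points, four graphemes
    assert cafe[-1] == "\u0301"  # indexing exposes the combining mark

    # NFC normalization composes it into the single code point U+00E9:
    assert unicodedata.normalize("NFC", cafe) == "caf\u00e9"
    assert len(unicodedata.normalize("NFC", cafe)) == 4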


> And all the "just use this thing that does exactly the same thing with a different name because I want indexing/length but also want to insist people don't call them that" is frankly pointless.

Array index syntax over variable-width data is problematic: either deceptively expensive -- O(n) for what looks like an O(1) operation -- or wrong. I suspect we agree on that.

As for the alternative, I'm talking about Tom Christiansen's argument here: https://bugs.python.org/msg142041

To paraphrase Tom's examples, this usage of array indexes is more idiomatic...

    s = "for finding the biggest of all the strings"
    x_at = s.index("big")
    y_at = s.index("the", x_at)
    some = s[x_at:y_at]
    print("GOT", some)
... than this, which the Python community would never adopt:

    import re
    s = "for finding the biggest of all the strings"
    some = re.search("(big.*?)the", s).group(1)
    print("GOT", some)
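For what it's worth, the two idioms above really do extract the same substring; a quick check combining them:

    import re

    s = "for finding the biggest of all the strings"

    # Index-based idiom:
    x_at = s.index("big")
    y_at = s.index("the", x_at)
    by_index = s[x_at:y_at]

    # Regex-based idiom:
    by_regex = re.search("(big.*?)the", s).group(1)

    assert by_index == by_regex == "biggest of all "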
The first involves more logical operations that index into the string. If those operations are both O(1) and correct, that's not a serious problem -- which is what justifies the Python 3.3+ design.

However, in terms of language design for handling Unicode strings, I prefer the tradeoffs of the second idiom: a single O(n) operation which is relatively easy to anticipate and plan for, rather than unpredictable memory blowups.


> Array index syntax over variable width data is problematic

Which is why Python uses a solution that ensures fixed-width data. There's never a need to worry if a code point will extend over multiple code units of the internal storage model, because the way Python now handles strings ensures that won't happen.

I really think a lot of your problem with this is not actually with the string type, but with the existence of a string type. You want to talk in terms of bytes and indexes into arrays of bytes and iterating over bytes. But that's fundamentally not what a well-implemented string type should ever be, and Python has a bytes type for you (rather than a kinda-string-ish-sometimes type that's actually bytes and will blow up if you try to work with it as a string) if you really want to go there.


No, I believe that we need a string type to encapsulate Unicode, but that it should encourage use of stream-processing idioms and discourage random-access idioms.


For what that's worth, Swift has string indexes, they're just opaque and only obtainable through searching or seeking operations.

Rust also has string indexes; they are not opaque -- they are byte indices into the backing buffer -- and slicing will panic outright if an index falls in the middle of a code point.
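A rough Python analogue of the hazard Rust is guarding against: slice a UTF-8 byte buffer in the middle of a code point and you get invalid data, which Python refuses to decode.

    s = "héllo"
    b = s.encode("utf-8")  # b'h\xc3\xa9llo' -- 'é' occupies two bytes

    # Slicing mid-code-point yields a truncated UTF-8 sequence:
    try:
        b[1:2].decode("utf-8")  # b'\xc3' alone is not valid UTF-8
        mid_codepoint_slice_decodes = True
    except UnicodeDecodeError:
        mid_codepoint_slice_decodes = False
    assert not mid_codepoint_slice_decodes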



