Why Rust strings seem hard (brandons.me)
254 points by brundolf on April 14, 2021 | hide | past | favorite | 241 comments


If you've been handling Unicode properly in other languages, then Rust strings seem easy in comparison.

• All the 2-byte-char languages which were designed for UCS-2 before the Unicode Consortium pulled the rug out from underneath them and obsolesced constant-width UCS-2 in favor of variable-width UTF-16.

• Languages which result in silent corruption when you concatenate encoded bytes with a string type (e.g. Perl but there are many examples.)

• C, where NUL-terminated strings are the rule, and the standard library is of no help and so Unicode string handling needs to be built from scratch.

All those checks which you have to fight to opt into, defying both the language and other lazy programmers (either inside your org, or at an org which develops dependencies you use)? Those checks either happen automatically or are much easier to use without making mistakes in Rust.


Alternatively: if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

Obviously the world is a big place and there is room for lots of paradigms and worldviews and we aren't supposed to judge too much.

But come on. If new code isn't working naturally in UTF-8 in 2021 then it's wrong, period.


> if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

Paradoxically, trying to do "the right thing" and being an "early adopter" of (the now called) UCS-2 was a "mistake", as both Java and Windows can attest, by getting "stuck" supporting the worst possible Unicode encoding ad-infinitum. UTF-8 is the "obviously correct" choice (from the hindsight afforded by us talking about this in 2021).

I still find it funny that emojis of all things are what finally got the anglosphere to actually write software that isn't completely broken for the other 5.5 billion people out there.


Correction to the date and timeline. UTF-8 has been the correct choice since:

" >From ken Tue Sep 8 03:22:07 EDT 1992 "

As discussed last week in the quoted history https://news.ycombinator.com/item?id=26735958 of UTF-8's mail messages: https://doc.cat-v.org/bell_labs/utf-8_history

It appears that Unicode arose between 1990 and 1991 for the initial versions https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#... with the first published version of the specification in 1993:

"ISO/IEC 10646-1:1993 = Unicode 1.1"

As discussed another time on Hacker News https://news.ycombinator.com/item?id=20600195 https://unascribed.com/b/2019-08-02-the-tragedy-of-ucs2.html

It was around 1996 when it became clearly obvious (software got shipped to end users who cared and they complained back) that UCS-2 (16 bit characters) would be insufficient.

+++

Pragmatically this is forever: as long as backwards compatibility must be maintained, the existing APIs built around the crazy 16-bit standard need to exist; but there's little reason they have to be native, rather than wrappers around UTF-8 compatible APIs.

It would even be a good time to standardize on a single user space programming API and have implementations on every operating system. Preferably including basic drawing and font layout functions. So that finally, most programs could be written once, compiled on a platform of choice, and work.


> actually write software that isn't completely broken for the other 5.5 billion people out there

I thought that Chinese and Japanese are the only languages that UCS-2 has trouble fully representing. I believe all the other living languages can actually be represented by UCS-2.

So using UCS-2 would actually work for almost everyone except maybe 1.5 billion people.


Code that treats UCS-2 as "wide ASCII" can be subtly wrong even for western european languages. For multiple reasons, Unicode has multiple representations for the same glyph[1], so if you have ü, it can either be the single code point U+00FC[2], or u followed by the combining diaeresis U+0308. If you don't account for this, things like "reversing a string" or "give me a substring" or "how long is the string on screen" will be subtly buggy. If you handle UCS-2 correctly, then it's fine and a reasonable technical limitation, but the emphasis of that sentence is on correctly.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence

[2]: https://www.compart.com/en/unicode/U+00FC
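The two spellings of ü can be reproduced in a few lines of Rust using only the standard library (making the two forms compare equal would require normalization, e.g. via a crate such as unicode-normalization):

```rust
fn main() {
    // Two valid Unicode spellings of the same visible glyph "ü":
    let precomposed = "\u{00FC}";  // LATIN SMALL LETTER U WITH DIAERESIS
    let decomposed = "u\u{0308}";  // 'u' followed by COMBINING DIAERESIS

    // They render identically, but naive comparisons disagree:
    assert_ne!(precomposed, decomposed);
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
}
```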


> Code that treats UCS-2 as "wide ASCII" can be subtly wrong even for western european languages.

It can be. But critically for this conversation, fixing your code to support emoji and other non-BMP characters doesn't necessarily fix those problems.


Except it does since a single emoji can also be made up of multiple code points.
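For illustration, a small Rust sketch of a single visible emoji built from multiple code points:

```rust
fn main() {
    // A single visible "family" emoji: three person emoji joined by
    // zero-width joiners (U+200D), five code points in total.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(family.chars().count(), 5);
    // This one on-screen glyph occupies 18 bytes in UTF-8:
    assert_eq!(family.len(), 18);
}
```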


But that only causes subtle wrongness, which can continue to go unnoticed.


> So using UCS-2 would actually work for almost everyone except maybe 1.5 billion people.

One of the great things about using Rust is that I don't have to have this argument. There doesn't have to be a debate about whether we should invest in fixing subtly broken code. Rust string-handling which generally works for Europeans will also work for Chinese!

This isn't the fault of the languages which were designed for UCS-2 (then known just as "Unicode"). But the fact that Rust emerged after UTF-8's ascendance means that Rust's users mostly get to avoid the UCS-2/UTF-16 legacy tarpit.


> Rust string-handling which generally works for Europeans will also work for Chinese!

I doubt that is any more true than Java. You can easily write code in Rust that assumes you can split a String anywhere and get two valid strings, that you can compute the length of a String and get information about how long the printed representation will be, that you can find a substring in that string by simply iterating through UTF-8 code points etc. All of these assumptions are about as wrong in UTF-8 Rust as they are in UTF-16 Java.
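As a sketch of those assumptions in Rust's standard library: len() is a byte count, and slicing at an arbitrary byte index can panic at runtime rather than silently corrupting, but the confusion is still available to write:

```rust
fn main() {
    let s = "日本語";
    // len() is the UTF-8 byte length, not the character count:
    assert_eq!(s.len(), 9);
    assert_eq!(s.chars().count(), 3);
    // Byte index 1 falls inside the first character, so &s[0..1]
    // would panic at runtime. Rust at least lets you check first:
    assert!(!s.is_char_boundary(1));
    assert!(s.is_char_boundary(3));
}
```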


Do we not consider emoji a language unto itself?

I was actually quite bemused to discover that some code review software I was using allowed me to "cursor" halfway through a smiley face emoji and enter a space (typing too fast to pay attention)... causing the infamous "box characters" because I'd accidentally split the smiley down the middle.

I get the need for extreme backward compat in browsers, but... this seems like one of those things that just might be worth fixing. Maybe a "use utf8" directive? :)


Even with proper UTF-8, you'll get situations where you can insert a space in the middle of an emoji and split it into other emoji + unprintable characters. The encoding is irrelevant, you need proper Unicode support to avoid these problems.


Unfortunately proper unicode support for many operations means carrying around tons of data since important properties of code points cannot be derived from the codepoints themselves.


Exactly. UTF-8 doesn't and can't fix this problem, you need a full Unicode library if you want to correctly handle human text. If you don't, why bother with UTF-8 instead of something simpler, like ASCII?


> If you don't, why bother with UTF-8 instead of something simpler, like ASCII?

Because unicode support isn't binary. Being able to pass along and not mangle blobs of unicode is already a lot better than ASCII-only.


Java hasn't been UCS-2 since Java 5.0 - nearly two decades.


Well if memory cost wasn't an issue then UCS-4 would be nicest. But 4x the memory for most strings is currently unacceptable.


That is true, but the benefits of UTF-32 are minor compared to UTF-8, and might not be worth the cost.

And I say that as someone who is developing a language which only has UTF-32 support.

The problem is that even with UTF-32, doing things like splitting strings is inherently unsafe, so you are still going to need a Unicode library to do proper splitting by grapheme cluster. In practice, almost all string splitting works on ASCII text, and assumes everything else is data that should not be manipulated. For this, UTF-8 is perfectly acceptable.


UCS-4 is still a variable-length encoding, because various accented characters and emoji use multiple code points. One advantage of UTF-8 is that it makes you confront variable-length characters head on.


There are a lot of instances where I wouldn't mind the memory cost, but would very much mind not automatically being compatible with ascii strings.


I wouldn't go that far, despite having made an early decision to support only UTF-8 in my own software.

UCS-2 is in fact broken, but UTF-16 is a valid encoding of Unicode, which can be implemented correctly. So is UTF-32, although I can't imagine why anyone would want to use that one.

I can imagine why someone would want to use UTF-16, though: interoperability with Windows, where it's the native encoding. It isn't "wrong, period" to do Unicode in a way which is more convenient for the platform.

There is, of course, a ton of work to really implement Unicode correctly, and UTF-16 and UTF-32 can make it tempting to do the wrong thing, instead of biting the bullet and implementing all of the many ways in which codepoints coalesce into grapheme clusters, and making sure all functions for working with strings can recognize the distinction.

But it certainly can be done in any of the full encodings.


> interoperability with Windows

Even on Windows, it's much easier to just use UTF-8 internally and convert to/from UTF-16 when you're about to make (or return from) a WinAPI call.
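A sketch of that pattern using only the Rust standard library (str::encode_utf16); a real Win32 call would take a pointer to the resulting NUL-terminated buffer:

```rust
fn main() {
    let s = "héllo";
    // Build a NUL-terminated UTF-16 buffer at the API boundary,
    // e.g. just before calling a wide-string Win32 function:
    let wide: Vec<u16> = s.encode_utf16().chain(std::iter::once(0)).collect();
    assert_eq!(wide.len(), 6); // five UTF-16 code units plus the NUL
    assert_eq!(wide[5], 0);
}
```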


My understanding is that UTF-8 is not a good representation for non-European alphabets.

So do you think UTF-8 is always the best internal string representation? Or just for English speakers?

For Mandarin, what would be optimal?


Mandarin is an interesting case. Most of the Han characters used by Mandarin fall within the basic multilingual plane and thus occupy 2 bytes in UTF-16 but 3 bytes in UTF-8. However, for web documents, most markup is ASCII which is only one byte. So for Mandarin web documents, the space requirements for UTF-8 and UTF-16 are about a wash.
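The byte counts are easy to check in Rust:

```rust
fn main() {
    let han = "\u{4E2D}"; // 中, a BMP Han character
    assert_eq!(han.len(), 3);                  // 3 bytes in UTF-8
    assert_eq!(han.encode_utf16().count(), 1); // 1 unit = 2 bytes in UTF-16

    let markup = "<p>"; // typical ASCII markup
    assert_eq!(markup.len(), 3);                  // 3 bytes in UTF-8
    assert_eq!(markup.encode_utf16().count(), 3); // 6 bytes in UTF-16
}
```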

When you add in interoperability concerns, since so much text these days is UTF-8, for Mandarin at least UTF-8 is a perfectly defensible choice.

(A harder problem is Japanese — Japan really got screwed over with Han unification, so choosing Shift-JIS over any Unicode encoding is often best.)

FWIW I covered the space requirements of various encodings and various languages in this talk for Papers We Love Seattle:

https://www.youtube.com/watch?v=mhvaeHoIE24&t=39m14s


I genuinely wonder: is the space requirement of text encodings really an important issue in this age of large photo and video content?


Looking at my browser's memory reporting, strings take up about ~2-3% of total memory usage, most of which is probably ASCII. If it were UCS-2, that would make it ~5% of total memory usage, and UCS-4 ~10%. That's small numbers, but as a whole-program impact, it's significant enough to motivate performance engineers to actually try to compress those strings down a bit.


It depends on how strings are counted. If every String object is atomized/interned then e.g. the string "div" is stored once, but on a 64bit system you have 8 bytes for a pointer and another 8 bytes for bookkeeping things such as length.


Memory for strings is often more important because there's a lot more of them, and image memory can often be file-backed, whereas strings need to be swapped to disk.


If you ever want to do stuff with text efficiently (full-text search and associated processing) I’d argue that it’s quite important.


If you want to do search, you need collation, and you can't use any standard encoding for that data.


It's important to distinguish between sorting in code point order and sorting according what a user would expect for their language. However, sorting in code point order (which is actually equivalent to sorting by memory comparison for UTF-8) is enough to build an inverted index data structure, commonly used for fulltext search. And just like FridgeSeal asserted, the memory footprint of the text representation has performance implications for such an application.

(Source: I wrote a search engine library.)
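That property (byte-wise comparison of UTF-8 equals code point order) can be demonstrated in Rust, where the default Ord for &str is byte-wise:

```rust
fn main() {
    // Sorting byte-wise gives exactly the code point order for UTF-8,
    // so an inverted index can use plain memcmp-style comparisons.
    let mut words = vec!["中文", "zebra", "züge"];
    words.sort();
    assert_eq!(words, ["zebra", "züge", "中文"]);
}
```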


Will that handle things like matching accented and unaccented characters? If I search for 'stefan' in a text that makes frequent references to 'Ștefan', will it correctly find those matches?


Typically you normalize both to a non-accented format, which is what you index...


Exactly. And a full-featured fulltext search library will allow for customizable normalization and tokenization.
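As a toy illustration only (a real engine would use full Unicode decomposition rather than a hand-written table), a normalization step applied to both the indexed text and the query might look like this; the fold function and its character table are made up for the example:

```rust
// Deliberately tiny accent folding: lowercase, then map a handful of
// accented Latin letters to their base letter. Real fulltext libraries
// use NFD decomposition plus stripping of combining marks instead.
fn fold(s: &str) -> String {
    s.chars()
        .flat_map(char::to_lowercase)
        .map(|c| match c {
            'ș' | 'š' => 's',
            'ț' => 't',
            'é' | 'è' | 'ê' => 'e',
            other => other,
        })
        .collect()
}

fn main() {
    // Index and query are both folded, so 'stefan' matches 'Ștefan':
    assert_eq!(fold("Ștefan"), "stefan");
    assert_eq!(fold("stefan"), "stefan");
}
```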


> (A harder problem is Japanese — Japan really got screwed over with Han unification, so choosing Shift-JIS over any Unicode encoding is often best.)

This statement needs more support. I think “screwed over” is a bit harsh, since I’m not aware the impact on Japanese was any more than on the rest of CJK. Despite the Han unification controversy, Unicode has been heavily adopted in Japan. The space requirements are basically the same as for all CJK. Half-width kana are heavier, since they are one byte in Shift-JIS, but they’re relatively uncommon.


As far as I'm aware one problem is displaying text. Japanese readers generally need a Japanese font to correctly display Japanese text if it's Unicode. This becomes a problem when you potentially have text that can come from different languages. E.g. a Japanese font will display Chinese incorrectly.

On the web you can work around this using the lang attribute to tell the browser how text should be interpreted.

It's notable that, for example, traditional and simplified Chinese does not have this problem because they are encoded separately.

Another problem is missing characters. Some people have complained of not being able to write their own name. I'm not sure to what extent this has been solved through Unicode updates.


I have been learning Japanese for about a year now, so I don't have that much experience reading Japanese text yet [1]. I'm aware of some of the visual differences between Chinese and Japanese fonts, but I have not yet had trouble reading Japanese text set in a Chinese font. If you have any specific examples for Kanji that are difficult to recognize in a Chinese font, I'd be interested.

[1] Although on the other hand you could argue that I'm spending more conscious effort reading Japanese text than a native speaker would.


Sorry, I can't speak confidently on this because I can't read Japanese (or Chinese or Korean). I can only report what I've been told by users.

Including language metadata with text was felt to be especially important for Japanese and Korean users. I was told the difference was like having "595 kg" displayed as "5P5 kg". That is, it's possible to decipher the intended meaning but it looks wrong and it takes a moment to work out what was meant. Depending on the language some glyphs can be mirror images, have extra strokes, strokes missing or in different places or at different angles.


So, the advantage of UTF-16 is that CJK text will use 33% less space.

Does this mean that “UTF-8 is not a good representation for non-European alphabets?” It may be less efficient but the difference does not seem shocking to me, considering that for most applications, the storage required for text is not a major concern—and when it is, you can use compression.


It's the edge-cases that get you.


> if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

How so? Delphi for example has wide character-based strings as default, what's wrong with that?


Wide character based strings have a .length field which is easy to reach for and never what you want, because its value is meaningless:

- It isn’t the number of bytes, unless your string only contains ASCII characters. Works in testing, fails in production.

- It isn’t the number of characters because 16 bits isn’t enough space to store the newer Unicode characters. And even if it could, many code sequences (eg emoji) turn multiple code points into a single glyph.

I know all this, and I still get tripped up on a regular basis because .length is right there and works with simple strings I type. I have muscle memory. But no, in javascript at least the correct approaches require thought and sometimes pulling in libraries from npm to just make simple string operations be correct.

Rust does the right thing here. Strings are UTF-8 internally. They check the encoding is valid when they’re created (so you always know if you have a string, it is valid). You have string.chars().count() and other standard ways to figure out byte length and codepoint length and all the other things you want to know, all right there, built into the standard library.
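For example, with the standard library alone:

```rust
fn main() {
    let s = "n\u{E9}e"; // "née" with a precomposed é
    assert_eq!(s.len(), 4);                  // UTF-8 byte length
    assert_eq!(s.chars().count(), 3);        // code points
    assert_eq!(s.encode_utf16().count(), 3); // UTF-16 code units
}
```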


What could string.chars().count() possibly be used for?

At least .length tells you how much memory the string will occupy - not likely to be very important in a memory-managed language, but at least it has one potential use. I don't see a use for the number of code points in a string.


In collaborative editing systems, we usually treat strings semantically as if they were arrays of code points. It's the only sensible way to do it, because the other options are bad:

- We don't use arrays of bytes because different languages have different native string encodings. Converting a UTF8 byte offset to a language that uses UCS2 is slow and complicated, and it opens the door to data corruption.

- And we don't use grapheme clusters because what counts as a grapheme cluster keeps changing with each unicode version. (And libraries are big and complicated).

So inserting an emoji into a document is treated the same way we would handle inserting a small list at some offset into a larger list.

> At least .length tells you how much memory the string will occupy

How much memory the string occupies isn't something I've ever wanted to know. I do sometimes want to know how many bytes the string will take up when I store it or send it over the network - but in those cases you pretty much always want UTF-8. And string.length doesn't help you at all with that.


If you don't handle grapheme clusters, do you allow users to position the caret between parts of a grapheme cluster (say, between the two regional indicator symbols that make up a country flag emoji)? Is this useful, or just an acceptable limitation? Would it be significantly different if you allowed them to place the caret between the parts of a UTF-16 surrogate pair?

I would have expected that your input method would normally insert grapheme clusters, and that these clusters would be treated as indivisible if they were inserted with a single 'keystroke'. I would also expect that you anyway need to count clusters when presenting something like a 'character count' to the user, as I don't think they would be very happy if a program reported that 'année' was 6 characters long.

> How much memory the string occupies isn't something I've ever wanted to know. I do sometimes want to know how many bytes the string will take up when I store it or send it over the network - but in those cases you pretty much always want UTF-8. And string.length doesn't help you at all with that.

This is what I was referring to, should have called it number of bytes instead of memory. Also note that probably the most used application protocol on the internet, HTTP, doesn't support UTF-8 and defaults to ISO-8859-1 (Latin-1). For example, HTTP headers are not UTF-8 strings and you can't treat them as such. And even when sending a UTF-8 body the Content-Length header needs to be set to the number of bytes, not the number of unique codepoints, so again you'd use .length.


> do you allow users to position the caret between parts of a grapheme cluster

Where the caret can go is dependent on the editor, not the underlying protocol.

> I don't think they would be very happy if a program reported that 'année' was 6 characters long.

CRDT / OT edits don't make a lot of sense to the user in their raw form no matter what format you use for offsets. "Insert x at position 5043" is equally meaningless if 5043 stores a byte offset, a grapheme cluster index or codepoint index.

> Would it be significantly different if you allowed them to place the caret between the parts of a UTF-16 surrogate pair?

Yes - if you managed to insert something in the middle of a UTF-16 surrogate pair, the string contents would become invalid - and that causes weird language-dependent problems. Rust will panic(). In comparison, a broken country flag just renders weirdly - which isn't ideal, but it's fine. Mind you, in both cases you'd need one of the editors to do something weird to insert those characters in the document. As you say, the input method will normally treat grapheme clusters as indivisible anyway. But I much prefer invalid states to be impossible to represent in the first place when I can. I don't want a lack of input validation to allow wonky edits to crash my rust server. Invalid grapheme clusters are a much smaller problem in comparison.

And this all skips over how difficult it is to efficiently convert a UCS-2 offset position in javascript into a position in a UTF-8 string in rust. It's much easier to just count in code points everywhere.

> And even when sending a UTF-8 body the Content-Length header needs to be set to the number of bytes, not the number of unique codepoints, so again you'd use .length.

No, that'll break as soon as you insert non-ASCII characters into your document. string.length will not tell you the number of bytes your string takes up. It will tell you the number of UCS-2 elements in your string, which is the number of code points + 1 for each UTF-16 surrogate pair. Which again, I've never wanted to know. You can't use that to calculate the UTF-8 byte length, where each code point takes somewhere between 1 and 4 bytes depending on its Unicode value.

Your example 'année' has a string.length of 5, but a UTF-8 byte length of 6. (From new TextEncoder().encode('année').length). If you set Content-Length to 5, bad things happen. Ask me how I know :/


> and that causes weird language dependant problems. Rust will panic(). In comparison, a broken country flag just renders weirdly - which isn't ideal, but its fine.

My point was exactly about whether languages should enforce string encoding. I maintain that strings should just be arbitrary byte arrays, and only text processing methods should enforce the appropriate encodings - that would mean that accidentally splitting a UTF-16 code point would be as painful as accidentally splitting a grapheme cluster: anything that wants to render the resulting string will have a problem, but anything in between won't.

> It will tell you the number of UCS-2 elements in your string, which is the number of codepoints + 1 for each UTF-16 surrogate pair.

Oops, here you are completely right. I was under the mistaken assumption that Java String.length() would return the number of 16-byte chars in the string, when in fact it returns the same useless number as the chars().count() method. Sorry about that!


The problem with languages treating strings as arbitrary byte arrays is that those byte arrays are different in each language. Java uses UCS-2 internally while rust uses UTF-8. Conversions happen somewhere, and converters don’t have a lot of good options when they see invalid data. For collaborative editing, if I send a byte level patch I made in rust for a UTF-8 string, it won’t make much sense in Java.

And a point of clarity - Java’s String.length() does not return the same value as rust’s chars().count(). The former returns the useless UCS-2 count. The latter returns the number of Unicode codepoints. Java’s length() will count many single codepoint emoji as having length 2 (same as javascript, C#) while rust will correctly, usefully count one codepoint as one character. (As will swift and go, depending on which methods you call.)


*16-bit


I was part of that. Delphi has all the string types you want, since you can declare your preferred code page. String is an alias for UnicodeString (to distinguish from COM WideString) and is UTF-16 for compatibility with Win32 API more than anything. UTF-8 would have meant a lot more temporaries and awkward memory management.


All in all, while the Unicode transition took its time, I must admit it was very smooth when it did happen.

At work we have a codebase that does a lot of string handling. Both in reading and writing all kinds of text files, as well as doing string operations on entered data. Several hundred kLOC of code across the project.

We had one guy who spent less than a week wall-time to move the whole project, and the only issue we've had since is when other people send us crappy data... if I got a dollar for each XML file with encoding="utf-8" in the header and Windows-1252 encoded data we've received I'd have a fair fortune.


The reasoning behind using UTF-16/UCS-2 is that then you can plug your ears and treat 1 char == 1 user-visible glyph on the screen, so programmers that acted as if ASCII was the only encoding in existence could continue treating strings in the same way (using their length to calculate their user-visible length, indexing directly on specific characters to change them, etc).

All of those practices became immediately wrong once Unicode outgrew 16 bits and UTF-16 became a variable-length encoding. But even if that hadn't happened, what you want to be operating on is not characters, but grapheme clusters, which are equivalent to a vector of chars. Otherwise you won't handle the distinction between ë and ë, or emojis, correctly.


But how is that different from the underlying encoding being UTF-8?

edit:

For example, we do a lot of string manipulation in Delphi. We might split a string in multiple pieces and glue them together again somehow. But our separators are fixed, say a tab character, or a semicolon. So this stitching and joining is oblivious to whatever emojis and other funky stuff might be in between.

How is this doing it wrong?

I mean yea sure you CAN screw it up by individually manipulating characters. But I don't see how a UTF-8 encoded string in itself prevents you from making the same kinds of mistakes.


Splitting and glueing is fine. But imagine 3 systems: system A is obviously wrong. It crashes on any input. System B is subtly wrong. It works most of the time, but you’re getting reports that it crashes if you input Korean characters and you don’t know Korean or how to type those characters. System C is correct.

Obviously C is better than A or B, because you want people to have a good experience with your software. But weirdly, system A (broken always) is usually better than system B (broken in weird hard to test ways). The reason is that code that’s broken can be easily debugged and fixed, and will not be shipped to customers until it works. Code that is broken in subtle ways will get shipped and cause user frustration, churn, support calls, and so on.

The problem with UCS-2 is it falls into system B. It works most of the time, for all the languages I can type. It breaks with some inputs I can’t type on my keyboard. So the bugs make it through to production.

UTF-8 is more like system A than system B. You get multibyte code sequences as soon as you leave ASCII, so it’s easier to break. (Though it really took emoji for people to be serious about making everything work.)


Right, but the claim was this:

if you have been handling Unicode and using wide characters, you have not been handling Unicode properly

I agree that UTF-8 is a better encoding overall for the majority of cases. I don't think that means UTF-16, which for example Delphi UnicodeStrings are[1], is not proper.

edit: maybe this is a language confusion thing. For historically tragic reasons, we're stuck with "char" as the basic element of string types in lots of languages. In Delphi a "widechar" is technically a code unit[2], and may or may not represent a code point. This is how I interpreted the OP. Maybe he meant wide characters as code points, in which I would agree.

[1]: http://docwiki.embarcadero.com/RADStudio/Sydney/en/Unicode_i...

[2]: https://en.wikipedia.org/wiki/Character_encoding#Terminology


Yeah I hear you. It's definitely possible to write correct code using UCS-2 (where each "char" sometimes represents only half of a codepoint). But it's easy to end up with subtly broken code, that only breaks for non-english speakers who don't know enough english to file a bug report.

The ergonomics of the language guide you in that direction when, as you say, a "char" doesn't actually represent a character. Or even an atomic unicode codepoint. And when string.length gives you an essentially meaningless value.

Luckily, code like this will also break when encountering emoji. That's great, because it means my local users will complain about these bugs and they're easy for me to reproduce. As a result these problems are slowly being fixed.


> The reason is that code that’s broken can be easily debugged and fixed, and will not be shipped to customers until it works.

Having worked in tech support for a piece of very expensive (~$100k per install annual support/license fee in the late 90s) enterprise software that had a GA release shipped to customers with a syntax error in an install script, I would state that more like “code that’s non-subtly broken is less likely be shipped to customers before it works.”


> If new code isn't working naturally in UTF-8 in 2021 then it's wrong, period.

UTF-8 is a nuisance as an in-memory representation because the characters are variable size. You can't get the length of a string without parsing it start to end, and you can't get a character by index without parsing and counting all the previous ones. 16-bit characters (wchar_t, Java char, whatever NSStrings are made of, etc) work fine in 99% of the cases.

UTF-8 is indisputably a good encoding for when you're sending something over the network or putting it into a file or a database.


> 16-bit characters... work fine in 99% of the cases.

In other words, they don't work :-).

UTF-16 is also variable-length. Sometimes a character fits in 16 bits, and sometimes it doesn't. From a practical view it's worse than UTF-8, because tests are less likely to detect bugs before shipping.

Even UTF-32 is, in reality, variable-length. Many code points are combining characters, so you need multiple code points to get a single grapheme.

If your language or API requires you to do something, then you'll need to do that. But unless there's an API requirement, in most situations UTF-8 is the best choice for network, storage, and processing. There are exceptions, but they're just that... exceptions.
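The "sometimes it doesn't fit" case is one assert away in Rust:

```rust
fn main() {
    let crab = "\u{1F980}"; // 🦀, a single code point outside the BMP
    assert_eq!(crab.chars().count(), 1);        // one code point
    assert_eq!(crab.len(), 4);                  // four UTF-8 bytes
    assert_eq!(crab.encode_utf16().count(), 2); // a UTF-16 surrogate pair
}
```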


UTF-16 is a variable length encoding. As is UTF-32 for that matter. A single character can be made of multiple code points.


True, but text processing is still easier with UTF-32, because UTF-8 and UTF-16 strings need to be converted to UTF-32 before you can do anything with them.


The D programming language has from the beginning built-in support for UTF-8, UTF-16 and UTF-32 code units as basic types (char, wchar, and dchar). This was when it wasn't clear which encoding would dominate.

It's pretty clear today that UTF-8 dominates, and the other two are useful only for interfacing to systems that need them.


This is my way of thinking about the topic these days. It's not that strings are more complicated in Rust than in other languages, it's that a lot of the other low-level languages are presenting an abstraction that assumes implicitly that a string is some type of sequence of uniform-sized cells, one cell per character, and that representation was an artifact of a specific time in computational history. It's like many other abstractions those languages provide... Seemingly simple at first glance, but if you do the details wrong you're just going to get undefined behavior and your program will be incorrect.

Languages that don't expose strings as that abstraction are, in my humble opinion, more reflective of the underlying concept in the modern era.


What can you actually do with a known-valid UTF-8 (but otherwise of unknown structure) string that you can't do with a UTF-16 string or even a byte-based string?

You can't concatenate, split, enumerate, assume they are valid human text, capitalize, count the number of characters, turn to lower/uppercase etc.

In general, you need a text handling library to do anything meaningful with text if you want to handle internationalization, and then the text handling library can also handle all of the encoding problems easily, all that matters is having a known encoding so you don't need to guess.

It's nice to have specific types for specific encodings of strings, but otherwise I think there is little to be gained by representing strings as anything other than either byte arrays or text.


Incorrect.

You CAN concatenate, assuming your inputs are valid the outputs will also be valid; BUT normalization is now ONLY correct if all inputs were of the same format. This can be fixed with libraries if you care, and if you don't, it often doesn't matter. Edit-Additional: If something DOES care, it SHOULD enforce the normalization format it wants on the input boundary.
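
A small illustration of both halves of this, using only the standard library (the strings chosen are arbitrary):

```rust
fn main() {
    let base = String::from("caf");
    let tail = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Concatenating two valid UTF-8 strings always yields valid UTF-8...
    let joined = base + tail;
    assert!(std::str::from_utf8(joined.as_bytes()).is_ok());

    // ...but it is not byte-equal to the precomposed "café" (U+00E9),
    // so comparisons still need normalization even though both are valid.
    assert_ne!(joined, "caf\u{e9}");
}
```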

UTF-16 cannot be split any differently than any other encoding, including UTF-32; all must pass through a library that understands compositing characters and sequences. There is no escaping this (other than your own code becoming such a library).

Most of the other issues you fault are shared by _any_ encoding of Unicode. However, a notable thing is that for very specific functions, e.g. searching for a given valid Unicode sequence and replacing it with another (e.g. replacing someone's name), you'll nearly always be able to do this without issue. To always do it without issue, additional checks must be made around combining characters at the boundary edges. (Which I wouldn't want to maintain, so I'd still call a library unless what I'm matching against can never be such a value, such as quotes or other configuration-file control characters.)


> You CAN concatenate, assuming your inputs are valid the outputs will also be valid

Is that true even if you mix LTR and RTL text? I have a suspicion there could be problems there, but I'm happy to be wrong. I would still say that safe concatenation is a minor boon for the cost of enforced UTF-8 compliance.

> Most of the other issues you fault are shared by _any_ encoding of Unicode.

That is exactly my point - that UTF-8, UTF-16, UTF-32, byte strings that could be invalid UTF-8 are all basically equally bad if you want to guarantee meaningful text. A text consisting of two valid UTF-8 code points representing combining characters is no more meaningful than a text consisting of 4 bytes that are invalid UTF-8 code points.

> searching for a given valid Unicode sequence and replacing it with another... (E.G. replace someone's name) you'll nearly always be able to do this in without issue.

I don't agree, unless you are talking about something extremely 'programmatic' like someone's username or email. For real text apps, you'd want something even more complex than Unicode, that could recognize that 'Ștefan' and 'Stefan' are the same name.


Blindly concatenating any two byte streams, irrespective of encoding, would also be invalid; however, at that point you've got massive design issues.

You cannot concatenate any random 16- or 32-bit character sequences, as the underlying byte forms might be stored little- or big-endian. UTF-8 does not need that consideration, BUT you must know (or at least contractually expect) valid UTF-8 input; anything else is garbage in, garbage out.

The LTR RTL issue... depends on if all RTL strings _must_ end with an LTR character; in your use case. https://en.wikipedia.org/wiki/Bidirectional_text

Your suggestion of delegating all concatenation operations to that is a safe default. Other options might also be valid, depending on the exact context of inputs and use case.


What's really interesting to me is that a Rust character is 32 bits even though a Rust string is encoded in UTF-8. I'm still just beginning to explore Rust, so I'd only gotten as far in the book as learning about characters but not strings and their relationship to UTF-8. This bit of Unicode handling has me especially intrigued now.


I believe the strings are still encoded internally as byte arrays; it's just that when you pull out a character, it can be multiple bytes (emoji, for example, are often 4 bytes in UTF-8), so you need a 4-byte datatype to store them.
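
A small sketch of that byte/char relationship, using only the standard library (the particular characters are arbitrary examples):

```rust
fn main() {
    let s = "é😀";
    assert_eq!(s.len(), 6);           // byte length: 2 for 'é', 4 for '😀'
    assert_eq!(s.chars().count(), 2); // code point count

    // Pulling a char out gives a 32-bit Unicode scalar value:
    let c: char = s.chars().nth(1).unwrap();
    assert_eq!(c as u32, 0x1F600);
    assert_eq!(std::mem::size_of::<char>(), 4);
}
```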


Yes, that's what I understand about how it works. There's a lot to be said against the idea of dealing with Unicode at the codepoint level in most applications though. I'm writing code that would allow commands to be either \ + non-letter character or \ + one or more letters (a la TeX). So that means that I would allow commands to include \bird, \pták, \طائر and \鳥, but what happens if the á in \pták is input as ´ + a rather than á? Is that supposed to be the same code? Or perhaps the user has \Spi¨nalTap (and their code editor has a type compositor that's willing to put the umlaut on the n?) Some of this can be dealt with through Unicode normalization, although there's also the question of whether, e.g., the ohm and angstrom symbols should be treated as symbols or letters if they're input at their symbolic code points rather than as Greek or Latin letters. Would \ + white man shrugging be a valid \ + non-letter command, since it's multiple code points? It's amazing how much a "simple" specification gets complicated when you start to look at all the ways that Unicode can complicate matters.


All of this is true, IF you assume that you want Unicode strings. Especially in system/embedded software (the kind of software that Rust is targeting), you often don't really care about Unicode and can simply treat strings as arrays of bytes.

And I live in a country where Unicode characters are in everyday use. But for the purpose of the software that I write, I mostly stick with ASCII. For example, I use strings to print debug messages to a serial terminal, read commands from the serial terminal, put URLs in the code, make HTTP requests, publish on MQTT topics... for all of these applications I just use ASCII strings.

Even if I have to represent something on the screen... as long as I have a compiler that supports Unicode in input files (all do these days), I can put Unicode string constants in the code and even print them on screen. It's the terminal (or the GUI, I guess, but I don't write software with a GUI) that interprets the bytes I send on the line as Unicode characters.

And yes, of course the length of the string doesn't correspond to the characters shown on the screen... but even with Unicode you can't say that! You can count (and that's what Rust does) how many Unicode code points you have, but a character can be made of multiple code points (stupid example: a black emoji is composed of a code point that says "make the following black" and then the emoji itself).

So to me it's pointless, and I care more about knowing how many bytes a string takes and being able to index the string in O(1), or take pointers in the middle of the string (useful when you are parsing some kind of structured data), and so on.

In conclusion, Rust is better when you have to handle Unicode strings, but most applications don't have to handle them, and by handling them I don't mean passing them around as a black box, not caring what they contain (yes, in theory you should care about not truncating the string in the middle of a code point... in reality, how often do you truncate strings?)


It's funny that you mention MQTT, as the spec requires strings (including topic names) to be encoded in UTF-8: http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/cos02/mqtt-v3.1....

Granted, ASCII is a subset of UTF-8, so as long as you control all the publishers and enforce the ASCII-only rule, you should be ok. But if some day you need to integrate third-party systems and they use characters outside of ASCII...


Literally nothing is stopping you from using `&[u8]` for 7-bit ASCII.


> So to me it's pointless, and I care more about knowing how many bytes a string takes and being able to index the string in O(1), or take pointers in the middle of the string (useful when you are parsing some kind of structured data), and so on.

Which is why you can still sub-slice Strings[1]

> Literally nothing is stopping you from using `&[u8]` for 7-bit ASCII.

Not only that, there's first party support for literals that are that[2].

[1]: https://play.rust-lang.org/?version=stable&mode=debug&editio...

[2]: https://play.rust-lang.org/?version=stable&mode=debug&editio...
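
For what it's worth, both points can be shown with standard-library code alone (the strings here are arbitrary examples):

```rust
fn main() {
    // Byte-indexed sub-slicing of a String is O(1):
    let s = String::from("key=value");
    let key = &s[..3];
    assert_eq!(key, "key");

    // And for pure-ASCII work there are first-class byte-string literals:
    let raw: &[u8] = b"HTTP/1.1";
    assert_eq!(raw[0], b'H');
}
```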


Can you explain more specifically how Rust's String/str type does not handle your ASCII case? They are just bytes and ASCII is a subset of UTF-8.


This article was absolutely fantastic.

It kind of lost me right up until the first example code, where I did in fact have different expectations. Then how it broke down why, and gave different potential solutions was just wonderful. I learned a lot.

I will add though that other languages have a distinction between string constants and string objects too (at least under the hood); they just go to great lengths to hide it from programmers. For example, a const + const call might do the same thing as Rust under the hood, but it's transparent. Rust seems to require the extra steps because it wants the programmer to make a choice here (or more specifically, to stop them from making mistakes: immutable data stays immutable, rather than being automatically converted to a mutable string, with the data-copying cost that involves; automatic conversion is more convenient but also a performance footgun).

I don't think Rust is wrong, I think it is opinionated and honestly as someone that like immutability I kinda dig it.


Glad you enjoyed it :)

And yeah, under the hood those other languages do all kinds of wild optimizations with their "immutable" strings like sharing substrings between different strings and pooling to reduce allocations. I intentionally left out those nuances because from the user's perspective, those are all implementation details (even if they can surface in the form of performance changes).


It is well-enough written where it treats Rust, but almost lost me, too, for its abuse of "C/C++", treating the two very different languages as if they were trivial variations. Where string handling is concerned, as in so many other places, they are fundamentally different.

This "C/C++" bad habit is very commonly used, around Rust, to slyly imply there is no effective difference, but in a way that permits an injured response to criticism that "it just means C or C++". But it doesn't, unless you are talking about object file symbols or compiler optimizers; and often enough not then. What it does is encourage sloppy thinking, and resultant falsehoods. These falsehoods show up in the article, revealed if you change "C/C++" to "C or C++" in each case.

In several places, it says just "C++" yet is clearly still talking about C. It is OK not to know C++, or not like it. But things are hard enough without active falsehoods. It is not hard to get it right: if you are talking about C, write "C". That will always be correct. If you are talking about C++, say "C++".

Rust stands or falls on its own. It doesn't need invidious editorializing.


The only thing I claimed about the two as a category was that "strings are not immutable, they're data structures" (which applies to Rust too). I purposely didn't go into much more detail than that because it wasn't really the point of the article. I did mention that C works with strings as raw char arrays, and C++ has a struct around a char array that manages length-tracking and reallocation automatically.

I believe these two statements are accurate, though I'm happy to be corrected if they aren't. It's been a few years since I wrote C++. Beyond that, I see my claims as "abstract" and not "falsehoods".


A false abstraction is no less false.

I don't know when you went to school, but most places still teach using C++98, which is two generations removed from current practice. It is not unusual for TAs assigned to teach C++ to resent being taken away from their glorious combinators, and teach their resentment.

If coding C++ was not immense fun, you may be certain you were Doing It Wrong.


I am not sure what you mean by "false abstractions" in this article. Could you cite some concrete examples in the article that support your claim?

The author isn't claiming to go into a deep dive into C or C++. They are trying to talk about Rust. Your argument seems to be against the author's use of "C/C++" vs. "C and/or C++", which seems to me like some nitpicking, as the author uses "C/C++/Rust" later on. It isn't like they are pooling the languages as being the "same". Rather, they are talking about having similar, lower-level of control over "strings" as a data structure, as opposed to the primitives they seem to be in other high-level languages like C#, Python, Java, Go, Kotlin, Swift, or JavaScript.


You're correct. I don't know why people downvote you. A lot of C++ comments are grayed out. Does HN have an agenda or what?


This "C/C++" bad habit is used everywhere across the C and C++ industry, including reference books, FAANG, WG14 and WG21 papers.

And yes even if you give them C++20, many devs will code in the C subset, because a large majority doesn't care, isn't in HN or Reddit, or even bothers to spend hours of their life watching CppCon, C++Now or ACCU talks, unless it is mandated by their boss.

I imagine you know who Lakos and Victorio are in the C++ community,

"C++11/14 at Scale - What Have We Learned?"

https://www.youtube.com/watch?v=H8wzuvynV78


One thing I've often wondered about Rust strings. I often hear that &str is 'a string slice'. But, Rust has a notation for slices -- &[T]. Why are strings the only thing (that I know of) that don't use the same slice notation as everything else?


So, there's like, a few things here. First is, technically they both do use the same notation, &T, where T=str and T=[u8]. This is the whole "unsized types" thing. &Path is another example of this, String : &str :: PathBuf : &Path.

Beyond that though, &[T] implies a slice of Ts, that is, multiple Ts in a row. But a &str is a slice of a single string. So &[str] would feel wrong; that is, a &str is a slice of a String or another &str or something else, but isn't like, a list of multiple things. It's String, not Vec<str>.

Basically, Strings are just weird.
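
To make the T=str vs. T=[u8] point concrete, a small standard-library-only sketch:

```rust
fn main() {
    // String owns a heap buffer; &str is an unsized view into string data,
    // just as &[u8] is a view into byte data (&T with T = str vs. T = [u8]).
    let owned: String = String::from("hello");
    let view: &str = &owned[1..4]; // a slice OF a string, not a list of strs
    assert_eq!(view, "ell");

    // Literals are &str views into the program's static data:
    let lit: &str = "hi";
    assert_eq!(lit.len(), 2);
}
```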


In particular: strings aren’t actually simple arrays of characters in Rust like they are in C, but there is an underlying array on the heap, and the notion of “slicing” that array of characters still makes sense semantically.


Slicing strings is a difficult concept when dealing with Unicode because slice indices and lengths can come in multiple flavors, but the only flavor that makes sense (grapheme or grapheme cluster) is difficult to count.


>strings aren’t actually simple arrays of characters in Rust like they are in C, but there is an underlying array on the heap

Not a Rust expert, but to my understanding, 'string' is an array of characters, not necessarily living in the heap.

The object 'String' (with capital S) might be, but with &str I can have a constant array of characters that is not in the heap and not even in the stack: it's in the code (.text code segment[1]?, or EEPROM or flash, for embedded folks).

If I understood it correctly, &str is a slice with a pointer to that piece of code. Of course being const, it cannot be changed. I can copy it into a String and manipulate it (append, insert, etc.), and that's where the heap is used.

&str, as an hypothetical struct, it may be in the stack, initialized in-scope with (probably) a pointer to somewhere in .text plus some bytes for length or other information.

[1] https://en.wikipedia.org/wiki/Code_segment


&str is not necessarily a static string literal; it's a slice of any string, living anywhere (including on the heap inside a String). Technically speaking, it can't be mutated because it's a non-mutable reference (like any other non-mutable reference). &mut str is a thing that can exist, and is mutable, it's just that you can't get one of those for a static (literal) string that lives in the code.
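
A sketch of a `&mut str` in action, using only standard-library methods:

```rust
fn main() {
    let mut owned = String::from("hello");

    // A &mut str view into the String's heap buffer:
    let view: &mut str = owned.as_mut_str();
    view.make_ascii_uppercase(); // in-place, length-preserving mutation

    assert_eq!(owned, "HELLO");
    // A literal, by contrast, only ever yields an immutable &str:
    // let bad: &mut str = "hello"; // would not compile
}
```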


Thanks. Yes, of course it can point to something in the heap. My comment was related to GP's "but there is an underlying array on the heap" and I wanted to say that &str is not always necessarily in the heap. I missed that part.


> strings aren’t actually simple arrays of characters in Rust

Are you talking about them being a container with a pointer to the actual array, and also a size and etc?


In C, strings can hold invalid unicode. However, in Rust a str is guaranteed to be valid utf-8.

For added confusion, Rust has a `char` type which is actually 32-bits. You can create arrays of them, but the resulting string would be in utf-32 and thus incompatible with the normal `str` type.


    let s = "hello world".to_string();
    for ch in s.chars() {
        print!("{}", ch);
    }
will iterate through a string character by character. That's the most common use of the "char" type - one at a time, not arrays of them.

Although the proper grapheme form is:

    use unicode_segmentation::UnicodeSegmentation; // 1.7.1
    let s = "hello world".to_string();
    for gr in UnicodeSegmentation::graphemes(s.as_str(), true) {
        print!("{}", gr)
    }
This will handle accented characters and emoji modifiers. A line break in the middle of a grapheme will mess up output.

By the way, open season for proposing new emoji starts tomorrow.[1]

[1] http://blog.unicode.org/2020/09/emoji-150-submissions-re-ope...


Thanks for the code snippet, however (kind of a Rust newbie here)

  $ cargo install unicode_segmentation (chokes)

  $ cargo install unicode-segmentation (seems to work)

  and in Cargo.toml
  [dependencies]
  unicode-segmentation = "1.7.1" (seems to work)

  yet in the code, its:
  use unicode_segmentation::UnicodeSegmentation;

  Why couldn't they be consistent using a dash vs. an underscore?


Identifiers in Rust can't have dashes (rustc will never rely on whitespace for proper tokenization). It is arguable that because of that crates should not have dashes in their name. cargo was changed to automatically convert dashes to underscores for convenience because some dislike underscores in names. I personally think this is an unnecessary source of confusion and now I would make all tools (mainly cargo) treat them interchangeably.


> rustc will never rely on whitespace for proper tokenization

Huh? That seems clearly untrue, eg `fnfoo` vs `fn foo` or `x && y` vs `x & &y` or `x<<shift_or_type > ::foo` vs `x < <shift_or_type>::foo`? Presumably for some or all of those, one version ends up being a error (eg bitwise and with a pointer from `x & &y` probably doesn't work), but that's not at the level of tokenization.


Given the context, I think they simply mean that a-b is interpreted the same as a - b.

It can't be interpreted as a single identifier.


You don’t cargo install libraries; the cargo.toml line is the right thing.

You can’t use -s in Rust identifiers, so they need to be normalized to _ to be referred to in code.


this is way too common in the Python world too - packages (crates) use dashes (they look nicer?), but the packaged library is named with underscores, because `-` is a math operation and can't be in a name


char is usually some single byte of data. Characters can be multiple bytes. Slicing a string on character boundaries is more coarse grained than slicing on byte boundaries.


char in Rust is 32bits, so it has a 1:1 mapping to Unicode glyphs. You might also want to care about grapheme clusters, but those are not part of the stdlib.


Unicode codepoints; glyphs may take multiple codepoints.


You are, of course, correct.


Right - the point is that &str isn't syntactic sugar / alias / etc for &[u8] and it would be confusing to have a notation that suggested otherwise.


Are they though? I've long wondered why the Rust team hasn't imitated C#'s (or other languages') ease of use with strings while also retaining the existing functionality for lower-level use cases. I suppose it's a kind of gauntlet that a Rust dev has to go through, which could be a good thing, but personally, hitting walls with strings really turned me off of Rust the first time I tried it, simply because my expectations were diametrically opposed to the reality of strings in Rust.


Different languages trade off different things. I don't know enough about C#'s string handling to reply effectively, but Rust does the best it can given the things it cares about. Other languages don't care about the same things, and therefore, can make different tradeoffs.


Related question (if you don't mind another).

Why is 'str' a "primitive type"? What about 'str' means it has to be primitive, instead of being a light-weight wrapper around a '&[u8]' (that obviously enforces UTF-8 requirements as appropriate)?


Well, &str is the type of string literals, so on some level, it has to be special.

There was a PR implementing it as a wrapper around &[u8], but it didn't really provide any actual advantages, so it was decided to not do that.

https://github.com/rust-lang/rust/pull/19612


You can't just index into a string.

    let hello: &[u8] = "Héllo".as_bytes();
    // hello[1] != 'é'; it's 0xC3, the first byte of the two-byte UTF-8 encoding of 'é'.
    // hello[2] != 'é'; it's 0xA9, the second byte of that encoding.
To use &[u8] would be very non-ergonomic.


One could, in theory, make an ExtendedGraphemeCluster type, and make str a slice of ExtendedGraphemeClusters. So &[ExtendedGraphemeCluster] could be indexed into without having things not make sense. Of course that's much more complicated than most other primitives, and most people don't have any idea what an Extended Grapheme Cluster even is. But since they're the Unicode notion that most naturally maps to a "character" you could just call the type Character or char, and confuse the hell out of the C programmers by having a variable-width char type.


Sure - but iterating by extended graphemes isn’t the only thing you want to do with strings. Sometimes you want to treat them as a bunch of UTF-8 bytes. Sometimes you want to iterate / index by Unicode code points (eg for CRDTs). And sometimes you want to render them, however the system fonts group them.

It makes sense to have a special type because it can supports all of this stuff through separate methods on the type. (Or through nearby APIs). It’s confusing, but I think it’s the right choice.

Although, I think the most confusing thing about rust strings isn’t that &str isn’t &[u8]. It’s that &str isn’t just &String or something like that.


Well, &String is ALSO a valid Rust type, it's a reference to a (heap-allocated) String.

Really, there's no single, good option that covers all use cases. Which is why Rust ended up with multiple string types, along with ways to convert between them.


> Well, &String is ALSO a valid Rust type

Yeah I know. This is really confusing when learning rust. Coming from another language where there's a single rust type its really tricky to figure out when to use what!

For reference, I now use String for large strings, string builders, and when the string is owned - like in a struct. And &str when taking in a string as an argument to a function. Or when iterating or something like that where you don't want to move the string. And then for small owned strings (like labels or IDs) I reach for the InlinableString crate. That handles small strings inline and puts large strings on the heap, like what Apple does for Swift and Objective-C.

I understand why its this way; but its a pity the best answer rust has is so complex and confusing.


It's worth noting that you can use &String anywhere that &str is accepted. The distinction is subtle and I agree it's sort of confusing, but in practice the former gets automatically coerced to the latter as needed
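
A small sketch of that coercion (the function name here is made up for illustration):

```rust
fn takes_str(s: &str) -> usize {
    s.len()
}

fn main() {
    let owned = String::from("hi");
    let r: &String = &owned;

    // &String deref-coerces to &str at the call site:
    assert_eq!(takes_str(r), 2);
    assert_eq!(takes_str(&owned), 2);
}
```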


The other issue is that Rust pretty strongly expects slices to be laid out in memory as a flat array of some fixed-size element type. If something isn’t laid out that way, it can’t support all the operations a slice is supposed to support, ranging from the high-level operation of O(1) indexing, to the low-level operation of getting a raw pointer to the slice contents. In principle you could make &[ExtendedGraphemeCluster] special, but it would hamper the ability to write generic code that works on &[T] for any T. Besides, Rust likes to be explicit. Things that are different should look different.

In addition, even if you gave up on O(1) indexing by grapheme cluster index, anything faster than O(n) indexing would require some sort of tree data structure to translate grapheme cluster indices to byte indices. Such a structure would be expensive to maintain and not needed the vast majority of the time; in a language that prioritizes performance, it simply couldn’t be part of the default string type no matter what the syntax.


The bstr crate is designed to make handling of ASCII-compatible (but not necessarily valid UTF-8) encodings a bit more ergonomic: https://docs.rs/bstr


Yea but i don't think that's what GP was asking, imo. Rather than `[str]`, i think they were asking why it's `str` and not `[char]`, no?

Just as `[u8]` is to `Vec<u8>`, `[char]` is hypothetically to `Vec<char>`.. and `Vec<char>` is basically a `String`, no?

_edit_: Though looking at the docs, `char` is always 4 bytes, so i guess that's where the breakdown would be? `char` would need to be unsized i guess, but then it would be an awkward `[unsized_char]`, which is like two unsized types.... hence `str` maybe?


`char` is (internally) a `u32` because it represents any single unicode character. `str` is not a `[char]`, because rust doesn't store strings as utf-32 (system APIs don't accept utf-32, and it tends to waste space in many cases)

`str`'s data layout happens to be `[u8]`, but its type provides additional guarantees about the structure of the data within its internal `[u8]` (for example, forbidding sequences of u8 that don't encode valid utf-8).
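
That guarantee is enforced at construction; a small sketch using only the standard library:

```rust
fn main() {
    // str's layout is a [u8], but building one checks the UTF-8 invariant:
    let good: &[u8] = &[0x68, 0x69]; // the bytes of "hi"
    let bad: &[u8] = &[0xFF, 0xFE];  // not valid UTF-8

    assert_eq!(std::str::from_utf8(good).unwrap(), "hi");
    assert!(std::str::from_utf8(bad).is_err());
}
```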


To be extra pedantic, char represents a single Unicode Scalar Value.


This is a big deal because adding an accent mark to a letter (often) means a single char can no longer store it. APIs should not orient around isolated scalar values or codepoints because most devs will misuse them, not being experts on combining and normalization.
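
A minimal illustration with a precomposed vs. decomposed 'é' (standard library only):

```rust
fn main() {
    let composed = "é";          // single scalar value, U+00E9
    let decomposed = "e\u{301}"; // 'e' plus U+0301 COMBINING ACUTE ACCENT

    // Same rendered character, different counts of char:
    assert_eq!(composed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // So naive equality fails without normalization:
    assert_ne!(composed, decomposed);
}
```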


Well yea, i wasn't saying `[char]` _is_ a `str`, rather i was positing that the GP comment was asking why it's a `str` than some hypothetical `[unsized_char]`.

I think `char` _would_ work, if it was similarly unsized like a single piece of `str`. The Problem is .. as i see it, that `[unsized_char]` seems odd.


Yes, the reason you can't use char here is that a char is always 4 bytes, so a &[char] is a type that already exists, and that type uses four bytes per character.


`str` is not `[char]`; there is no possible datatype `char` for which this would hold, and the name is already taken.

`str` is not a slice; that statement is itself already wrong. A slice is a dynamically sized type, a region of memory that contains any nonnegative number of elements of another type.

A `str` is dynamically sized, but is not guaranteed to contain a succession of any particular type. It's simply a dynamically sized sequence of bytes guaranteed to be valid UTF-8.

`str`s aren't slices; all they have in common with them is that both are dynamically sized types.

`Vec<char>` is also not the same as `String`; a string is not a vector of `char`, which is already a type that has a size of 4 bytes.

This all results from the fact that UTF-8 is a variable-width encoding, while slices are homogeneous: all elements have the same size.


Are there potentially other situations where `&[T + !Sized]` makes sense?

The majority of the functions on `&str` seem to make sense for all `&[T + !Sized]` where `type str = [unsized_char]`.


You’re right that what you’re suggesting could work as a way to treat &str as a type of slice, without breaking the type system. But there’s a fatal flaw. If you have `s: &[unsized_char]`, what should slicing, say, `&s[2..4]` do?

With the actual `&str`, this treats 2 and 4 as byte indices. But if you’re going to treat str as a slice of a specific type, the indices really should be in units of that type. `&s[2..4]` should return the second and third grapheme clusters, or codepoints, or whatever you want to define `unsized_char` to be. The same applies to `len()` and all the other methods that return or accept indices.

But if it’s in units of `unsized_char`, how do you actually implement indexing? You could do an O(n) scan from the beginning of the string, but that’s clearly unacceptable. Or you could set up some kind of acceleration data structure behind the scenes, but that would be expensive to maintain, and would conflict with Rust’s goals of explicitness and cheap FFI.

If on the other hand you keep indexing with byte offsets, you end up with almost no operations that work, and do the same thing, for both &[unsized_char] and normal slices. You wouldn’t be able to write any useful generic code that operates on &[T] for T: ?Sized, at least not without branching on which kind of slice it is. And even from a human user’s perspective, the syntactic similarity between two things that work differently could confuse more than it helps. Again, a matter of explicitness.
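
A sketch of the byte-offset behavior described above, using only the standard library:

```rust
fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3

    // Indexing is by byte offset, and it is O(1):
    assert_eq!(&s[0..1], "h");
    assert_eq!(&s[1..3], "é");

    // Offsets inside a multi-byte code point are not valid boundaries;
    // &s[0..2] would panic at runtime.
    assert!(!s.is_char_boundary(2));
    assert!(s.is_char_boundary(3));
}
```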


I don't really know what unsized_char would even mean, chars have a size, and str is not a sequence of chars.

That said, I'm also not sure in the general case.


unsized_char would be some phantom/opaque type that is not actually useable because it represents an object of variable length.

e.g. Think of it as an enumeration like :

    unsized_enum unsized_char {
        Char1(u8),
        Char2(u8, u8),
        Char3(u8, u8, u8),
    }

    let c: unsized_char = ...; // is a compile error: “cannot assign a variable stack size.” or something.

    let parsed: &[unsized_char] = unsafe { /* parse some bytes */}; // this compiles because the implementation of some hypothetical &[T + !Sized] appropriately abstracts the allocated memory region.
Unlike a normal enum it doesn’t always take 3 bytes of memory, it’s variable memory such that [unsized_char] might only take 1 byte if there is only 1 unsized_char::Char1 inside.

Of course this means that you can’t index into the slice because the boundary is not clear.

What I’m describing is not different than the internals of str. I’m just trying to armchair design a generic implementation such that str and slice are more unified.


I wish they had been named &path and Path instead, it would feel more consistent than String, &str, PathBuf, &Path


I joke the only way that I'll sign off on Rust 2.0 is if we get to do the exact opposite, rename String to StrBuf, haha! (Same reason though, for consistency.)


I'd actually love something like that, mostly because differentiating between "str" and "String" in, uh, speech can be pretty annoying. They both sound the same! :)

I've taken to calling "String" "owned string" but that's a little unwieldly and not quite suited for complete Rust beginners who don't know the concept of ownership yet. Same with "string reference" for "str".


`str` contractually guarantees UTF-8 contents, so because of multi-byte codepoints it cannot be sliced at arbitrary indexes like a `[u8]` can be.

As a side note, it is possible to define your own "unsized" slice type which wraps `[u8]`. This can be useful for binary serialization formats which can be subdivided / sliced into smaller data units.
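
A minimal sketch of such a wrapper; the `Ascii` type and its methods are hypothetical (not from the standard library or any crate), and the unsafe casts rely on `#[repr(transparent)]`:

```rust
// An "Ascii" slice type wrapping [u8], analogous to how str wraps bytes
// with an extra invariant (here: all bytes are 7-bit ASCII).
#[repr(transparent)]
struct Ascii([u8]);

impl Ascii {
    fn from_bytes(bytes: &[u8]) -> Option<&Ascii> {
        if bytes.is_ascii() {
            // SAFETY: #[repr(transparent)] gives Ascii the same layout as [u8].
            Some(unsafe { &*(bytes as *const [u8] as *const Ascii) })
        } else {
            None
        }
    }

    fn as_str(&self) -> &str {
        // SAFETY: ASCII bytes are always valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}

fn main() {
    let a = Ascii::from_bytes(b"hello").unwrap();
    assert_eq!(a.as_str(), "hello");
    assert!(Ascii::from_bytes(&[0xFF]).is_none());
}
```

This is the same trick the standard library uses for types like `Path` wrapping `OsStr`.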


> `str` contractually guarantees UTF-8 contents

I don’t think any other languages do that. Instead, most of them implement as much as they can while viewing the storage as a blob of UTF-8/UTF-16 bytes/words, and throw exceptions from the methods which interpret the data as codepoints.

Strings are used a lot in all kinds of APIs. For instance, strings are used for file and directory names. The OS kernels don’t require these strings to be valid UTF-8 (Linux) or UTF-16 (Windows).

To address this use case, the Rust standard library needs yet another string type, OsString. This adds to the complexity and learning curve.


The number of things that you have to learn remains constant. I could even make the argument that the number of things you need to learn up-front is lowered when only talking about the distinction between String and OsString/CString. The difference is that rustc will be pedantic and complain about all of these cases, asking you to specify exactly what you wanted, while other languages will fail at runtime.


> rustc will be pedantic and complain about all of these cases, asking you to specify exactly what you wanted

So, they're offloading complexity onto programmers. Being such a programmer, I don’t like their attitude.

> other languages will fail at runtime

In practice, other languages usually print squares, or sometimes backslash escape codes, for encoding errors in their strings. That’s not always the best thing to do, but I think it’s what people want in the majority of use cases.


The complexity already exists regardless of what the language does.

The only choice is whether it's explicit and managed by the language, or hidden, in which case you need knowledge and experience to handle it yourself without the language's help. If you want "squares" for broken encoding, Rust has `to_string_lossy()` for you. It's explicit, so you won't get that error by accident.

Avoiding "mojibake" in other languages is usually a major pain. For example, PHP is completely hands-off when it comes to string encodings. To actually encode characters properly you need to know which ini settings to tweak, remember to use mb_ functions when appropriate, and don't lose track of which string has what encoding. There's internal encoding, filesystem encoding, output encoding, etc. They may be incompatible, but PHP doesn't care and won't help you.
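For example, `String::from_utf8_lossy` is one of Rust's explicit lossy conversions; a minimal sketch:

```rust
fn main() {
    // 0xFF can never appear in valid UTF-8
    let bytes = [b'h', b'i', 0xFF];
    let lossy = String::from_utf8_lossy(&bytes);
    assert_eq!(lossy, "hi\u{FFFD}"); // invalid byte becomes U+FFFD '�'
}
```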


> It's explicit, so you won't get that error by accident

I would want it to be implicit.

Ideally, for rare 20% of cases when I care about UTF encoding errors, I’d want a compiler switch or something similar to re-introduce these checks, but I can live without that.

> For example, PHP

When you compare Rust with PHP it’s no surprise Rust is better, many people think PHP is notoriously bad language: https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/

I like C# strings the best, but I also have lots of experience with C++, and some experience with Java, Objective-C, Python, and a few others. None of them have Rust’s amount of various string types exposed to programmers; many higher-level languages have exactly 1 string type.

Interestingly, some languages like Swift use similar stuff internally, but they don’t expose the complexity to programmers; they manage to provide a higher-level abstraction over the memory layout. Compared to Rust, this improves usability a lot.


> I would want it to be implicit.

Then simply use the `lossy` function. Or create an alias to it if you don't like the name.


Because there's no possible choice for T. You can't use u8, because then you would allow non-utf8 data. You can't use char, because a &[char] uses four bytes per character, whereas a &str stores the characters in utf-8, which is a variable-width encoding.

A &str is really a different kind of thing from other slices. In any other slice, each element in the slice takes up a constant number of bytes, but this is not the case for a &str.
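A small illustration of the mismatch (plain Rust):

```rust
fn main() {
    let s = "héllo";
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(s.len(), 6);     // UTF-8 bytes: 'é' takes two
    assert_eq!(chars.len(), 5); // five codepoints
    // a &[char] would need a fixed 4 bytes per element
    assert_eq!(std::mem::size_of::<char>(), 4);
}
```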


That's a great question. I don't have a complete answer, but I do know that &str has lots of string-specific functionality that's really helpful. The .chars() method for example gives you an iterator over actual unicode chars, as opposed to bytes, because the former can have variable byte-widths. There may be other reasons; I'm not sure.


In Rust a slice `&[T]` is a fixed length sequence of `T`s laid out linearly in memory. Every `T` is required to be of the same size.

Strings in Rust are (normally) represented as UTF-8. Both `String` and `str` represent data that is guaranteed to be valid UTF-8.

This means that if Rust's UTF-8 strings were represented as normal slices, they would have to be slices of UTF-8 code units.

Rust wants to provide a safe and correct String data type, and therefore, indexing a string on a byte (code-unit) level would be incorrect behavior.

Having custom types `String` and `str` instead of just a `Vec<u8>` enables more correct behavior implemented on top of the data type, which doesn't allow normal slice indexing and such.

---

As a note, even though you probably don't want to normally, you can quite easily access the backing data of your string using `String::as_bytes`
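e.g. (a quick sketch):

```rust
fn main() {
    let s = String::from("héllo");
    let bytes: &[u8] = s.as_bytes(); // borrow the UTF-8 backing data
    assert_eq!(bytes.len(), 6);
    // and back, with validation:
    assert_eq!(std::str::from_utf8(bytes).unwrap(), "héllo");
}
```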


The regular slice type &[T] lets you access and manipulate individual elements. But Rust strings enforce the invariant that they are valid Unicode, which puts restrictions on element-wise operations.

Calling &str a "string slice" is really more about the contrast with String, and how the relationship there mirrors the relationship between &[T] and Vec<T>. It's more of an analogy than a concrete description of the interface.


What would they be, &[u8]? That's already a thing: it's an arbitrary byte sequence. &str is specifically UTF-8 data.

&OsStr and &Path are the same way.


Indeed &str has as_bytes which returns itself as &[u8]

But str is a subset of [u8]: the type's contract is that the data must be valid UTF-8, so it is unsafe to hold anything else. Hence https://doc.rust-lang.org/std/str/fn.from_utf8.html can error, and there's the unsafe variant https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...

This is all very different from &[char], which would be an array of 4-byte characters (or, a UCS-4 string)
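A tiny demonstration of the fallible conversion:

```rust
fn main() {
    assert!(std::str::from_utf8(b"hello").is_ok());
    assert!(std::str::from_utf8(&[0xFF_u8, 0xFE]).is_err()); // not UTF-8
}
```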


> But, Rust has a notation for slices -- &[T]

&[T] is an “array-slice”, even if it's called just “slice”.

See this example ([..] is the syntax to create a slice of something): https://play.rust-lang.org/?version=stable&mode=debug&editio...


Long story short, Rust slices are contiguous collections of identically-sized elements. Strings, by being UTF-8, are not guaranteed to be identically-sized. So they require a distinct type that lets you work with them in spite of this detail.


Another: String is immutable AND whole. &[T] is parts, and when declared as &mut [T] it's mutable.

Because of Unicode, &[T] makes it easy to write wrong code (assuming here that T = char).

It CAN'T be char, because char is larger than u8:

https://doc.rust-lang.org/std/primitive.char.html

and it means a Unicode code point.

In other words: Rust is using types to PREVENT the wrong behavior.


Fun fact: `&mut str` exists. You don't get random access, but in controlled scenarios it's fine to mutate str in-place, e.g. `make_ascii_lowercase`

https://doc.rust-lang.org/stable/std/primitive.str.html#meth...
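For example (a quick sketch):

```rust
fn main() {
    let mut s = String::from("HeLLo, Wörld");
    let slice: &mut str = s.as_mut_str();
    // Safe in-place mutation: ASCII-only changes can't break UTF-8 or change length
    slice.make_ascii_lowercase();
    assert_eq!(s, "hello, wörld"); // the non-ASCII 'ö' is left alone
}
```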


"string slice" is just the name of the one thing, and "slice" the name of the other, and they are different things with similar names and related features.


Incredibly clear and compassionate writing (callout boxes throw a bone to readers who aren’t well versed in concepts like the heap, character arrays etc.).

Big kudos to the author!


Thanks :)


> Then if your program later says "actually that wasn't enough, now I need Y bytes", it'll have to take the new chunk, copy everything into it from the old one, and then give the old one back to the system.

This is mostly true. If you get lucky, there may already be enough unused space past the end of the existing allocation, and then realloc() can return the same address again, no copying required.

But if you know you're going to be doing lots of realloc() (and you're not unusually-tightly memory-constrained) then instead of growing by 1 byte each time it's often worth starting with some sensible minimum size, and doubling the allocated size each time you need more space. That way you "waste" O(N) memory, but only spend O(N lg N) time on copying the data around instead of O(N^2).


> but only spend O(N lg N) time on copying the data around instead of O(N^2).

O(n) time copying, with doubling the buffer, not O(n log n)

It seems a bit counter intuitive, of course, that you can copy multiple times and have it still be O(n). The last copy will be n bytes. The second to last copy will be n / 2. The third to last, n / 4, etc.

  n + n/2 + n/4 + n/8 … = 2n
You end up copying, roughly, 2n bytes across all the buffer resizes, which is O(n).
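The same amortization can be checked empirically. Here's a toy buffer (hypothetical, not how `Vec` is implemented internally) that counts the bytes copied during resizes:

```rust
// Hypothetical growable buffer that counts how many bytes it copies on resize.
struct Buf {
    data: Vec<u8>,
    copied: usize,
}

impl Buf {
    fn new() -> Self {
        Buf { data: Vec::new(), copied: 0 }
    }

    fn push(&mut self, b: u8) {
        if self.data.len() == self.data.capacity() {
            // Double the capacity (minimum 1), copying the old contents over.
            let new_cap = (self.data.capacity() * 2).max(1);
            let mut new_data = Vec::with_capacity(new_cap);
            new_data.extend_from_slice(&self.data);
            self.copied += self.data.len();
            self.data = new_data;
        }
        self.data.push(b);
    }
}

fn main() {
    let n: usize = 1 << 16;
    let mut buf = Buf::new();
    for _ in 0..n {
        buf.push(0);
    }
    // Total bytes copied across all resizes stays under 2n.
    assert!(buf.copied < 2 * n);
}
```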


Funnily enough this exact topic came up on Hackernews just recently when a Googler started benchmarking AssemblyScript (a language for WebAssembly) and realized that AssemblyScript was increasing the size by +1 instead of doubling when reallocating...

Here is the HN thread: https://news.ycombinator.com/item?id=26804454


I personally use a 'stride' instead of doubling. On small sizes doubling works OK. But when you get past about 8k-16k the empty memory starts to stack up.


Except on non-embedded platforms, oftentimes large blocks of allocated memory aren't occupying physical memory until you write to them. There's not much reason to avoid using exponential buffer growth on a system with a robust virtual memory implementation.


I would like to add one personal example where this isn't very true and is kind of painful: for a while I was running a Windows system with quite a lot of RAM but not much disk space, enough that having swap space equal to my RAM significantly reduced the space available on an already nearly full disk. In Linux this didn't represent a problem, because if memory was allocated and not used, it didn't count at all. However, Windows does not over-allocate: if you request memory, there must be somewhere (swap or RAM) to put it. Without swap space this meant getting memory allocation errors with less than half of physical memory used, which was extremely annoying. I either needed to give up a huge chunk of my disk space (in practice I often needed 2× my RAM in swap to avoid memory exhaustion), or not be able to use all my actual RAM.


Hmm, I was under the impression that Windows would also overcommit but it seems you're right. I've been in Linux land for too long I guess. If Windows doesn't have a physical place that can back the allocation, it will fail :(

Seems crazy to me given how useful it is.


Never do this - you're introducing a hidden O(n^2).

Folks, this is why you take the algorithms class in CS.


I know it is a hidden one; I should have been more clear in my statement. I found stride to be better in my experiments for many of my use cases. It is a matter of finding the correct stride for your code (sometimes I make it tune itself). If you pick a stride of ×2 you will waste memory in some (many) cases. Some library calls touch the whole thing and make sure it is zeroed and paged in. Usually you want to find a balance of 'not a lot of waste and not a lot of speed loss'. Now in some cases, yeah, ×2 on every alloc where you bubble over is the perfect choice. Other times the O(n^2) is the right choice. I found stride to be a nice compromise, as many times you know up front what the block size is. Finding a good compromise on slack is sometimes tough, especially if you have buffers that are discarded and re-used over and over.

On one project I used a compromise stride algorithm. Basically, if it ever had to allocate two strides in a row to fulfill a request, it doubled the stride from then on. So it would start 'cold' at allocating 4 bytes (you get 4, 8, or 16, usually, at the bottom anyway). Then when you have to allocate 2 blocks in a row, you double the stride for the next time (which is coming in on the next tick of the for loop). It was not perfect, but it also did not balloon the memory wildly. This was because of the way the incoming data was structured, but it gave a decent average size. It ended up very nicely between O(n^2) and O(n), both speed- and size-wise.

This is why I have the fancy CS degree. To know the trade off between runtime and space usage. If you blindly pick algs you will have a 'bad time'. Sometimes space is of consideration (you have 256k total in your box), sometimes speed is a consideration (I need this to finish in 3ms). Pick one alg over the other and you can have a bad time in some cases.


If you multiply by two, you're wasting from 0 to 50% at any random time, for an average of 25%, right? So if you don't like that, then multiply by a smaller factor. Multiply by 1.02 if you can't abide by having more than 1% empty.


Could you elaborate? This seems very interesting.


Asymptotically, there's no difference between allocating an n+1 buffer and an n+k buffer before copying your old data in. You'll still get O(n^2).

In reality, it depends on the data you're handling. You may never end up handling sizes where the O(n^2) asymptote is significant and end up wasting memory in a potentially memory constrained situation. At the end of the day, it all depends on the actual application instead of blind adherence to textbook answers. Program both, benchmark, and use what the data tells you about your system.

If I've got a 500 MB buffer that I append some data to once every 5 minutes, I might want to reconsider before I spike my memory usage to fit a 1 GB buffer just to close the program 15 minutes later.


The O(n²) here is the time spent copying the data; it's not about the size of the buffer, or that you'll temporarily use 2× the space. The program would die by becoming unusably slow.

Take your 500 MiB example. Let's say we start with a 4KiB buffer (well in excess of what most vector types would start you with). If we grow it by a constant 4KiB each time it runs out of space, by the time the buffer is 500MiB we've copied ~30 TiB of data. If instead we grow the buffer by doubling it, we will have had to copy ~1000 MiB (1 GiB) by the time it hits 500 MiB, a difference of roughly 30,000×. (Which is why the program would slow to a crawl.)

I'm ignoring the time aspect of your supposition; at that rate, you'd never really get to 500MiB using the starting sizes & strides suggested upthread. That's a highly exceptional case; for nearly everything, the general recommendation from CS is going to be the right one. Yes, profile & optimize where and when needed, but the thread is clearly about defaults, and this is one of those defaults that holds pretty tightly.

Further, at that size, if you don't use the memory, most OSes will simply never allocate the page. (They'll assign a dummy zero-filled page that will get swapped with a real page of memory once the first write happens. So even at a large size, you'll end up reserving 1000 MiB of virtual memory, but not real RAM.)

O(n²) is playing with fire: it grows fast, fast enough relative to the data that it is a common source of bugs.


Yes I'm aware of how the algorithm works. I also know that if I allocated 500 MiB at the beginning of the program expecting my memory usage to be roughly that size, and my prediction was off by 50 MiB maybe I don't want to go hunting for another 500 MiB of space before my program ends or I stop using the buffer and free it.

But your point about the virtual memory makes that moot anyways. Thank god for modern OSes. I've clearly been spending too much time around microcontrollers.


The case here: we have a buffer, and it grows as we stick bytes into it. A "vector" in many languages.

If you double the underlying memory each time you need more space, you'll end up copying some number of bytes that is linearly proportional to the final size. The final copy will be n bytes, the second to last n/2, … n/4, n/8, etc. The sum of those copies is 2n; that is, we'll copy about twice the final amount of data during all the resizes of the buffer. 2n is called "O(n)" in order notation, which describes how a function (in software engineering, usually time or memory use) grows in response to its input; it doesn't care for the leading constant factor of 2 ([1]).

If you instead grow the buffer by some constant amount — and it can be a single byte, or a page, it doesn't matter — you'll end up doing O(n²) copies. One way to see this is to list the copies out; also, let's say there's an imaginary copy at the front of 0 bytes (and is free), but it'll simplify the sum here; let's call the amount we grow the buffer by "k"; each time we copy, up to size n, will be:

  0
  k
  2k
  3k
  ...
  n-3k
  n-2k
  n-k
  n
To sum this sequence, we can pair each of those up: n + 0 is n; n-k + k is n; n-2k + 2k is n, and so on. You'll get about n/(2k) such pairs, each summing to n, so the total copied is about n²/(2k). Again, the leading constant is unimportant, so it's O(n²).

This is a lot more work. Like, a lot a lot. Say you expand the buffer until you have 500 MiB of data. If you grow a constant 4KiB at a time, you end up doing:

  In [6]: sum(range(0, 500 * 1024 * 1024, 4096))
  Out[6]: 33554169856000
~30 TiB of copying; if instead you grow by doubling, you'll copy ~1 GiB, or roughly 30,000× less.

The larger point is that O(n²) grows viciously fast compared to O(n). To quote Bruce Dawson, whose blog is filled with amazing feats of debugging slow programs,

> Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

https://randomascii.wordpress.com/2019/12/08/on2-again-now-i...

There's also this blog: https://accidentallyquadratic.tumblr.com/

There are worse things than O(n²), but they usually are so bad they're caught in testing. But O(n²) can work well enough in testing. There are things better than O(n²), but worse than O(n), like O(n lg n), but O(n lg n) is usually fast enough in a full, production data set. (Sorting, in the general case, is O(n lg n).) This is what Dawson means by "sweet spot".

The even more meta conversation is that in HN discourse, there are often threads wondering what the value of a CS degree is. This is one of the times where there is value. And… it's not that uncommon; I use my degree fairly often. I also had a fairly good school, IMO, that balanced the theory and the practical aspects. E.g., my database professor taught us B-trees, relational algebra, etc., but also had us work with a real database, learn SQL, and joked that a B-tree lookup is O(1). (It's not, it's O(log n), but in a modern database the base of that log is huge, so it generally takes no more than a half dozen levels in the tree to find your item; above that, you're talking about databases with billions or trillions of rows. In theory, the number of levels keeps growing, but in practice, finding a dataset that large becomes exponentially harder, and soon you run out of hard disks on earth, so for practical purposes O(1) suffices. And/or, you'll know when it doesn't.)

[1]: https://en.wikipedia.org/wiki/Big_O_notation#Family_of_Bachm... ; specifically, note the |f(n)| ≤ k·g(n); that leading "k" is an arbitrary (in the sense that we're saying there exists some k) constant, g(n) is the thing inside the O(…). So O(2n) and O(n) are the same, because that leading constant of k is going to "eat" the 2; the constant we find for one will just be twice what we'd find for the other, so we remove it from g(n) when we discuss these things. The shape of the curve — how it responds to input size changes — is the point.


Thank you very much!


> and then realloc() can return the same address again, no copying required

Interesting, I didn't know that!

Regardless, though, I intentionally skimmed over certain nuances like exponential buffer growth for the sake of the main point


He's working too hard. You don't need all those type declarations.

    let a = "hello".to_string();
    let b = "world".to_string();
    let c = a + " " + &b;
    println!("Sentence: {}", c);
It's a bit surprising that you need "&b": "+" is defined as String + &str, which is somewhat confusing but convenient.

If you've been handling Unicode properly in other languages, then Rust strings seem easy in comparison.

Yes. C is awful. C++ is still sometimes UTF-16. C#/.net is still mostly UTF-16. Windows is still partly in the UTF-16 era. So is Java. So is Javascript. Python 2 came in 2-byte and 4-byte versions of Unicode. The UTF-16 systems use a "surrogate pair" kludge to deal with characters above 2^16. CPython 3 has 1, 2, and 4 byte Unicode, and switches dynamically, although the user does not see that.

Linux and Rust are pretty much UTF-8 everywhere now, but everyone else hasn't killed off the legacy stuff yet.


The type declarations are there for maximum clarity, since the inferred types may not be obvious to the target audience


Yes, please. One of my biggest peeves is posts that are meant to be educational, yet don't define the types used for variables nor the namespace/package they are from. Java posts are rife with the latter, and I'm really not looking forward to the `var` keyword making the former a thing too. And quite a few get bonus points for doing such on types that require a new dependency -- but good luck figuring out which dependency without knowing the dependency name nor the package name of the class!


You did it right, IMHO. The parent comment just confuses me while your article was quite clear.


If you can indulge a non-Rust point of view, if I'd been faced with this design problem, I think I would have put a "this is a rodata-backed string" bit into String, and then the representation would be

    {ptr, len, rodata-bit}
(or squeeze a bit out of the ptr or len if the space matters).

Then "abc" could have type String, and the only difference between

     let abc1: String = "abc"
and

     let a: String = "a"
     let abc2: String = a + "bc"
would be that (assuming the compiler doesn't get cute) abc1 is pointing at the rodata bytes for "abc" and abc2 is pointing at allocated bytes. (But it can tell them apart so the deallocation is not ambiguous.)

It seems like this would have avoided a lot of the ink spilled over &str vs String. I know it's too late now, but was this considered at all, and if so what were the reasons not to adopt it?

Thanks.


Cow<'static, str> is exactly what you are asking for.

In general the String type is not very good, and you should use something appropriate to your use of strings for any string that is used more than a constant times in your program, which can be, for instance:

- String

- Box<str>

- Cow<'static, str>

- Cow<'a, str>

- Interned strings

- Rc<str> or Arc<str>

- A rope data structure

In fact I think putting String in the standard library might have been a mistake since it's almost always a suboptimal choice for anything except string builders in local variables.
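Several of these are one-liners to build from a String or &str; a quick sketch:

```rust
use std::borrow::Cow;
use std::rc::Rc;
use std::sync::Arc;

fn main() {
    // Owned but not growable: no capacity field to carry around
    let boxed: Box<str> = String::from("hello").into_boxed_str();
    // Cheaply clonable shared strings (single-threaded / thread-safe)
    let rc: Rc<str> = Rc::from("hello");
    let arc: Arc<str> = Arc::from("hello");
    // Borrowed until something needs to own it
    let cow: Cow<'static, str> = Cow::Borrowed("hello");

    assert_eq!(&*boxed, "hello");
    assert_eq!(&*rc, &*arc);
    assert_eq!(cow, "hello");
}
```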


COW == Copy on Write

It's interesting that certain file systems use COW schemes: https://en.wikipedia.org/wiki/Btrfs


When you want a reference that can transparently become an owned value when mutated, Rust has a type Cow that implements this generally. It's an enum that has two variants, Owned and Borrowed. If you mutate a Cow::Borrowed, it first copies the data to a new owned allocation, and replaces itself with Cow::Owned.
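A sketch of that behavior:

```rust
use std::borrow::Cow;

fn main() {
    let mut cow: Cow<'static, str> = Cow::Borrowed("hello");
    assert!(matches!(cow, Cow::Borrowed(_)));
    // to_mut clones the borrowed data into an Owned variant before mutation
    cow.to_mut().push_str(" world");
    assert!(matches!(cow, Cow::Owned(_)));
    assert_eq!(cow, "hello world");
}
```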

However, the difference between String and &str has nothing to do with mutability, or whether the data is in a read-only page or not. If you have &mut str, you can mutate the values, and if you don't declare your String as mut, it's not mutable.

The difference is that String owns its allocation, whereas &str is a reference to memory that someone else owns. It's exactly the same as Vec vs &[u8], and if you check the source, you'll see that String is just a wrapper around Vec<u8>: https://doc.rust-lang.org/src/alloc/string.rs.html#278-280

The general principle is that if you own the allocation, you can reallocate it to change its size. If you only have a reference to memory that someone else owns, you can't do that.

Consider for example Go's slices, which work kind of like you describe, where they point to the original array's memory until someone grows the array, at which time they might or might not make a new allocation. Appending to a Go slice from some inner function can suddenly break code that calls it, because the slice it's operating on suddenly points to new memory.

Rust's Big Idea is to make ownership and borrowing more-explicit. Having your default stdlib text type be ambiguous about whether it's owning or borrowing is both weird, and also makes things a lot more awkward to deal with.

If you use Cow<str>, you'll see that its API declares that it borrows from some source, and can't outlive that source. That's fine if what it's borrowing from is static text in the binary, but that really constrains what you can do with a string that you've dynamically allocated.

Just like all other data structures, having a distinction between an owned value and a reference to the value is very useful. It's easy to build a variety of shared or ambiguous ownership structures on top of owned values and references, but it's much more complicated to go the other direction.


In addition to what the others have said, Rust's `String` is three pointer-sized words, `(ptr, len, capacity)`, while `&str` is just two, `(ptr, len)`. So a `String` has 50% more overhead than a `&str`.
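That's easy to check (quick sketch):

```rust
use std::mem::size_of;

fn main() {
    // &str is a fat pointer (ptr, len); String adds a capacity field
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
}
```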


I hadn't heard of rodata before, but based on a quick Google, I think what you're describing is similar to Cow<str>. I can't speak to the reasons why this wasn't made the default, but I believe it is at least possible.


It's probably the same reason why IIRC the default Rust String does not have a "small string optimization": it would add a lot of unpredictable branches. And it would be even worse, since unlike the "small string optimization", even small mutations which don't change the size would have to allocate when the original was read-only.


Another problem is that `&str` is more than just "read-only". It also tracks lifetime to prevent use-after-free.

A universal owned-or-readonly-borrowed String would be unsafe without also adding reference counting or GC to keep the borrowed-from string alive.


Where would abc allocate the bytes from, if a is in rodata?


I wish more languages had primitives for "simple strings". Most of the time when I use a string I could live with a restriction that it's ASCII only and fits in 64 bytes. "Programming string" vs "human string". For example, a textual value of some symbol, or the name of a resource file I control myself, or a translation lookup key (which I can make sure is always short and ASCII). An XML element name or JSON property in an enormous file with a schema I control. It seems weird to use the same type for the "human string", e.g. user input, a name, the value looked up from the translation key, and so on. For the simple strings it feels like making heap allocations, using two bytes per char (e.g. C#), or worrying about encoding in a UTF-8 string (e.g. Rust) are all wasteful.


I would say UTF-8, but I really do miss old-school Pascal strings (aka strings with a length field and a fixed allocation) sometimes.

Pascal strings could automatically #[derive] a whole bunch of the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq, ...) that would help sidestep a whole bunch of ownership issues when you start throwing strings around in Rust.

The downside would be that you would occasionally get a runtime panic!() if your strings overflowed.

Sometimes, I can live with that. Embedded in particular would mostly prefer Pascal strings.

I suspect that Rust is powerful enough to create a PString type like that and actually fit it into the language cleanly. The lifetime annotations may be the trickiest part (although--maybe not as everything is a fixed size).
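One plausible shape for such a type, sketched with const generics (a toy; crates like `arrayvec` offer a polished version of this idea):

```rust
// A hypothetical Pascal-style string: fixed capacity, length field, all on the stack.
#[derive(Copy, Clone)]
struct PString<const N: usize> {
    len: usize,
    buf: [u8; N],
}

impl<const N: usize> PString<N> {
    fn new() -> Self {
        PString { len: 0, buf: [0; N] }
    }

    // Panics on overflow, like a classic Pascal string.
    fn push_str(&mut self, s: &str) {
        let bytes = s.as_bytes();
        assert!(self.len + bytes.len() <= N, "PString overflow");
        self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        self.len += bytes.len();
    }

    fn as_str(&self) -> &str {
        // Only whole &str values were copied in, so the bytes are valid UTF-8.
        std::str::from_utf8(&self.buf[..self.len]).unwrap()
    }
}

fn main() {
    let mut s: PString<16> = PString::new();
    s.push_str("hello");
    let copy = s; // Copy works: no heap allocation involved
    assert_eq!(copy.as_str(), "hello");
    assert_eq!(s.as_str(), copy.as_str()); // original still usable
}
```

Because the whole thing is a fixed-size value, `Copy`/`Clone` derive cleanly and there are no lifetimes to annotate.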


> Pascal strings could automatically #[derive] a whole bunch of the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq, ...)

Worth noting that the explicitness of #[derive] was a design decision; particularly when exposing a library API, it's good to have control over the set of interfaces you actually support, so that (for example) if one of them stops being derivable later you won't break people's code downstream


> The downside would be that you would occasionally get a runtime panic!() if your strings overflowed.

VLQ/LEB128 would take care of encoding an arbitrary string size to avoid this issue.


You can pretty much do this in Rust:

  let simple_str: &[u8] = b"hello world";
  println!("{}", simple_str[6] as char);  // 'w'
Though I would advise against it in most cases, because even many "non-human" formats like XML and JSON do allow for Unicode characters


Filenames too.


There is no problem making filenames that are ASCII-only. Just because they can be non-ASCII on most OSes doesn't mean they are (when I myself control all input, that's the key).

I think it's the problem of generality here. If I make a game that has thousands of texture files I want to store in a lookup table and it saves me x ns on the lookup if those textures are in a dictionary with 1-byte characters vs 2-byte character keys, then I could just make sure my file names are ascii. So I have a collection of 2000 ascii file names. I don't care whether the rest of the files outside my game textures are, or could be, non-ascii.


You shouldn't be using strings in that case anyway, especially if performance matters.


This is, as I understand it, how Symbols work in Ruby. They are prefixed with a colon, and are interned and immutable.


I'm glad they don't have it. Many "programming" identifiers, filenames, and whatnot already support Unicode. You'd be setting yourself up for failure because you forgot to update the type you used somewhere.


But I want it to fail, and I'm happy for it to fail then. If I write a utility that is about to read a multi-gigabyte file in a JSON format I know has only ASCII identifiers into a dictionary, then it's silly that the dictionary has UTF-16 keys stored in arrays on the heap, with each value stored on the heap as well, a second pointer away.

Just because I'm using json or xml or something doesn't mean I'm parsing any json, or that I'm not happy for it to catch fire if I parse a file that has some non-ascii in it.


I love articles like this. I imagine this is the kind of stuff you learn if you study CS or Software Engineering in college. Maybe when I retire I will go and get a CS degree so I can learn all the things I should have learned before I began working professionally as a programmer 25 years ago.


Nah, I wish; you just learn how to implement quicksort in Java and stuff like that. The only languages I saw during my degree were Java, C, Python, Perl, and a little bit of Prolog. And I finished my Bachelor's last year.


We used C++ throughout most of my CS program, and I hated it at the time and never wanted to write it ever again (and haven't!), but good lord did it help me understand how computers work. I've benefitted from that perspective ever since.

I'm not sure you have to do a whole degree to get that perspective, though. Just learning C or C++ should get you a good chunk of it.


> understand how computers work

Nobody understands it all the way down to electron orbitals. Your understanding will be wrong in the next decade, if it isn't wrong already. Computer architecture is proprietary.

Give C++17 a try.

Zero-cost abstractions, RAII.

It's a beast.


Except that you forget pretty quickly everything you've learned in CS, because when you work in the real world you don't need it.


This was one of the more confusing parts of Rust when I first started using it. I find that I don't necessarily run into it a lot. After a while you sort of change how you organize your code and you find that you're not fighting strings or the borrow checker very often. I'm not really sure how I got to that point however, it's more just practice than anything else.


Yeah, in some ways it is kind of weird how much discussion there is about Rust's borrow checker compared to how much time practitioners actually spend dealing with it. I see less about string handling (this post being an exception) but that is also basically a non-issue once you get the hang of it.


Exactly. It's not actually hard once it clicks, but I think a certain subset of newcomers spend lots of time being frustrated with the fact that their understanding doesn't seem to work, when they could just be given the key bits of information and be able to move forward with things making sense. That was the motive of this post :)


I want Rust to be a popular language, but I can see how C and C++ are so much easier for beginners if you consider the basics: you first learn C, and then you go into C++ and use what you need.

I'm very curious whether there is feedback from uni teachers teaching Rust as a first compiled language to students, especially when dealing with borrowing etc.


I'm not sure I would call them "easier for beginners", but probably "better for beginners trying to learn underpinning concepts"

But overall I agree: I probably would not teach Rust as a first programming language, maybe not even a first low-level language. Its abstractions may be zero-cost in terms of performance, and airtight in terms of correctness, but many of them are not intuitive unless you understand what's really happening underneath. Some of them feel intuitive enough at first... until they aren't, and then people hit a wall.

I'm not sure what the solution is to this problem (or whether there is one). To its credit the team has done an incredible job documenting and writing guides, but I still think there's a gap here. It may simply be that at a certain point Rust is not a learning language, it's a productivity language for experienced programmers.


My way of thinking about this (which I think is correct) is: "&str" is like "&[u8]", except that "&str" is guaranteed to contain valid UTF-8.
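
To make that concrete, here's a small sketch of the relationship using the standard library's conversions between the two types:

```rust
fn main() {
    let bytes: &[u8] = &[0xE2, 0x9C, 0x93]; // the UTF-8 encoding of '✓'
    // Going from &[u8] to &str requires validation; invalid UTF-8 is an Err.
    let s: &str = std::str::from_utf8(bytes).expect("valid UTF-8");
    assert_eq!(s, "✓");
    // Going the other way is free: every &str is already valid bytes.
    assert_eq!(s.as_bytes(), bytes);
}
```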


This is a great article! The only omission is that you can concatenate two &str's at compile time using concat!().


I didn't know about that! It's pretty interesting, though also fairly niche because the most common usecase is a string addition where at least one member is a variable, not a literal


Hm, I feel Rust's strings are the easiest I have worked with in any programming language, but that might be due to my knowledge of Unicode and of C.

The only thing which surprised me was the &str type. Why isn't it just an ordinary struct (called e.g. Str) consisting of a pointer/reference and a length?


With dynamically sized types like `str`, Rust allows you to separate "what kind of pointer + metadata is this" from "what kind of data does it point at". So, for example, you can have the types `str`, `[T]` or `Path`, and can have them behind the pointer types `&T`, `Box<T>` or `Arc<T>`.

If Rust had defined a special struct `Str` for `&str`, then it would have to define special structs for all the combinations possible: Str, ArcSlice, BoxPath, etc.
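
A small sketch of that separation, just using types from the standard library:

```rust
use std::sync::Arc;

fn main() {
    // The same unsized `str` data can sit behind different pointer types,
    // without needing a distinct struct for each combination.
    let borrowed: &str = "hello";
    let boxed: Box<str> = Box::from("hello");
    let shared: Arc<str> = Arc::from("hello");
    assert_eq!(&*boxed, borrowed);
    assert_eq!(&*shared, borrowed);
}
```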


Rust's strings are one of the easiest to use correctly and one of the hardest to use incorrectly, but the vast majority of string handling is done incorrectly, so that makes Rust seem hard.


Yeah, this article was specifically aimed at people who haven't worked with C/C++, and instead have the higher-level mental model for what strings are and how they work.


I linked to a PR above which implemented this idea, and was rejected.


(see kimundi's answer; I misunderstood what you were asking. I'd have deleted this comment but I don't seem to be able to.)


Amusingly, Rust strings, whether &str or String, are unable to represent filenames, which in many, many programs is the overwhelmingly most common use for character sequences that people want to call strings.

The Rust people invented the wonderful "WTF-8" notion to talk about these things. It gets awkward when you want to display a filename in a box on the screen because those boxes like to hold actual strings qua strings, not these godforsaken abominations that show up in file system directory listings.

Handling WTF-8 will take a whole nother article. I don't know a name for WTF-8 sequences; I have been calling them sthrings, which is hard to say, and awkward, but that is kind of appropriate to the case.


It doesn't really make things any more complicated than they already are. If you take a filename in C, and then have to display it somewhere, you're facing the same problem - except that you might not even be aware of it, because all types are the same, and you won't notice the problem unless you happen to run into an unprintable filename.

Rust is doing the right thing here by forcing developers to deal with those issues explicitly, rather than sweeping them under the rug. The real issue is filenames that aren't proper strings - i.e. an OS/FS design defect - but this ship has sailed long ago.
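
A minimal sketch of what that explicit handling looks like in Rust, using the standard `OsString` type (which can hold non-Unicode file names):

```rust
use std::ffi::OsString;

fn main() {
    // File names come from the OS as OsString, which may not be valid Unicode.
    let name: OsString = std::env::args_os().next().unwrap_or_default();
    // to_string_lossy() makes the conversion explicit, substituting U+FFFD
    // for any ill-formed sequences instead of silently corrupting them.
    println!("program: {}", name.to_string_lossy());
}
```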


> If you take a filename in C, and then have to display it somewhere, you're facing the same problem

If you have such a problem, it is often better to ignore it and pass the data around without any conversion (i.e. without forcing it to be valid UTF-8). For example, if I have an ls-like tool that just reads filenames and displays them, then it will work even with non-UTF-8 filenames when used like:

for A in $(ls-like) ; do ...

But if the ls-like tool tries to be smart and interpret the data when displaying it to stdout, then it will break them.


Doing that is how you end up with binary data being interpreted as escape sequences, either mangling the screen or, when unlucky, putting your shell in a seemingly unresponsive state.

https://unix.stackexchange.com/questions/79684/fix-terminal-...


If you're merely passing the data, sure. But the scenario I described was specifically displaying data. For the terminal, your approach would only work if the expected encoding of the filename is the same as the output encoding of the terminal - which is not always the case. But then there's also GUI etc.

Ultimately, some bit of code (e.g. the terminal emulator) will still need to interpret that filename as a string in order to display it as one - and if it's not really a string, it won't be able to do that properly. Filenames are part of the UX - the filesystem is a user-facing artifact, not an implementation detail - so they have to be constrained in ways that make sense for humans, even if arrays of octets would do just fine as unique IDs.


The problem isn’t the forcing you to handle it, it’s the horrific API to handling it. It feels like all the OS representations were haphazardly bolted onto the core string API which is already kinda janky.

Rust just needs to sit down and make a good API for the whole set of string-ish types they have.


That sthrings are equally as awkward to handle correctly in C, C++, Python, Ruby, Javascript, and Perl as in Rust makes them no less awkward.

Nobody said Rust has done anything wrong to ban them from String. Many languages make the distinction, including (probably) C++23. Filename sthrings stink in any language.

With the usual solution, using "invalid character blocks", two different files with different names may show up in a list looking the same, so you are still in an awkward place.

Even with fully valid Unicode filenames, you can get a similar problem, from legitimate characters that just look the same. Even if filesystems forbade creating files with ill-formed names, we would still have the second problem. Filenames are just a hard problem.

It is too much to ask of a language or library to solve them. Even filesystems that forbid ill-formed filenames do not solve them all.



Nicely written!

I wish I had this article when I first started Rust. Would have saved me some trouble.


> "foofoo" takes up twice as much space in memory as "foo".

Not in the imaginary language you keep talking about as "C/C++".


That’s disingenuous. That quote is _immediately_ followed by a note

> Technically there will be a tiny amount of extra data, like a length, so it won't be precisely twice the size. But it's close enough for our purposes.

Also, nitpicking, those two strings _can_ take up the same amount of memory due to short string optimization (https://joellaity.com/2020/01/31/string.html shows that libc++ uses 24 bytes for strings up to 22 bytes long) or (for implementations that don’t do SSO) due to granularity of heap allocations.


Right, in case they're heap allocated, the worst-case alignment requirement of malloc, as well as free() without a size, is likely to force implementations to round the allocation up to at least 8 or 16 bytes for both.


I actually (and pedantically, I'm afraid) meant that neither in C or C++ is it true that "foofoo" takes up double the size of "foo" - both will have a zero character at the end. The sizeof operator when applied to each will also return values that indicate that one is not double of the value of the other.


I was approximating for the sake of demonstration (as stated in the note that followed)


No. The result of std::string("foo") is exactly the same size as of std::string("foofoo"). They take up exactly the same number of bytes in the free store, on the stack, in a hash table, what have you.

When the string is very long, it occupies two stretches of storage, one exactly the same size as std::string("foo"), plus another stretch big enough for the long sequence plus whatever overhead heap allocation adds.

But that is different from the examples given. For the cases cited, there is no indirection of any kind. The bytes are right there in the std::string object.


We just interpret "space in memory" at different levels of indirection. If you try to create std::string("verylong…") with more characters than you have memory, you run out of memory, so it "takes up space in memory", even though std::string has the same size as always.

Bonus fact: *"foo" and *"foofoo" in Rust dereference to 3 bytes and 6 bytes, respectively. It's neither a pointer nor the first char.
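
For anyone who wants to verify this, a quick sketch with `std::mem`:

```rust
fn main() {
    // A &str is a fat pointer: a data pointer plus a length.
    assert_eq!(std::mem::size_of::<&str>(), 2 * std::mem::size_of::<usize>());
    // The pointed-to str data is exactly the UTF-8 bytes.
    assert_eq!(std::mem::size_of_val("foo"), 3);
    assert_eq!(std::mem::size_of_val("foofoo"), 6);
}
```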


Why can't every literal "abc" just be instantiated as a heap String by default? You could have a separate notation like &"abc" when you want a slice, similar to Python's b"abc", r"abc", etc. modifiers. Heap Strings seem much more useful in general.


It would be pretty wasteful; in any read-only context you'll need a &str anyway, and making them all Strings would cause tons of unneeded allocations. Many Rust devs care a lot about avoiding unnecessary allocations: some people even use Rust on embedded systems that don't allow allocations at all, so building allocation into a language fundamental would likely be a mistake.


Ok, I can see why it's not the default, but do you know why they didn't give it some syntax sugar to at least make it less painful/verbose to create one?

    let a = "abc"    // this is a &str
    let b = s"abc"   // this is a String
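
For reference, the spellings available today, which are all equivalent but more verbose than the proposed sugar:

```rust
fn main() {
    let a: &str = "abc";               // borrowed; no allocation
    let b: String = "abc".to_owned();  // heap-allocated copies
    let c: String = String::from("abc");
    let d: String = "abc".to_string();
    assert_eq!(b, c);
    assert_eq!(c, d);
    assert_eq!(a, b);
}
```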


There have been conversations about this, but none that pushed forward an RFC that answered all the potential concerns. It is indeed a wart, one that I would personally like to see fixed, but I'm also ambivalent on whether I would like to make String nicer to use on literals or instead focus our efforts to make Cow<'_, str> a much nicer (and prevalent) type.

https://internals.rust-lang.org/t/pre-rfc-custom-string-lite...

https://internals.rust-lang.org/t/pre-rfc-allowing-string-li...

https://internals.rust-lang.org/t/pre-rfc-string-literals-th...


System programming languages avoid allocations and unnecessary resource consumption. I'd say it's one of their hallmark characteristics.

The programming convenience of higher-level languages comes at a very substantial cost of requiring a complex runtime, more memory for data structures, unpredictable performance, and pushing complexity where it's not visible. One philosophy favors visibility over resources, the other favors convenience of use.


You can be using a system that doesn't have a memory allocator (so no heap memory), like in many embedded systems.


This is a great intro, but clarifying one more thing might be useful: how do you return a string?


A String can be returned like any other owned value; whether or not a &str can be returned depends on lifetimes, as it does with any other reference.
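
A quick sketch of both cases (the `&str` version relies on lifetime elision tying the returned borrow to the input):

```rust
// Returning an owned String is straightforward: ownership moves to the caller.
fn make_greeting(name: &str) -> String {
    format!("Hello, {}!", name)
}

// Returning &str means returning a borrow, so it must point into an input
// (or into 'static data); lifetime elision infers the lifetime here.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    assert_eq!(make_greeting("world"), "Hello, world!");
    assert_eq!(first_word("hello world"), "hello");
}
```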

Lifetimes seem out of scope for this post, and the lifetimes story for strings isn't really strings-specific enough that it felt important to cover. There are other resources out there that thoroughly cover the topic of lifetimes; in fact I wrote a short summary myself in another post :)

https://www.brandons.me/blog/favorite-rust-function


Plug for The Rust Programming Language "book": https://doc.rust-lang.org/book/

This article is lovely. The book goes further, and is truly very readable.


More articles like this and I might get started with Rust again. Well async is still a problem but at least strings are much clearer now after reading this.


Once, when asked, “Why is your neck crooked?”, the Camel said, “What do I have that is not?”


An alternative to `"string".to_owned() + "foo"` is just using a macro such as `concat!("string", "other string")`.


That actually doesn't work in the general case, because concat!() only accepts static (compile-time) literals. i.e., this is allowed:

  concat!("string", "other string")
but this is not:

  let my_str: String = ...;
  concat!(my_str, "other string")
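
For runtime values like that, `format!` (or pushing onto a `String`) is the usual tool; a small sketch:

```rust
fn main() {
    let my_str: String = String::from("string");
    // format! builds a new String at runtime from any Display-able values.
    let joined = format!("{}{}", my_str, "other string");
    assert_eq!(joined, "stringother string");
}
```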


Very interesting write up. Would be interested in a comparison to a COW string too.



