One interesting/fun problem with text layout is that:
width(A) + width(B) != width(A+B)
...which some basic text layout engines assume. The post touches on this, and is one reason why line-breaking is so difficult.
If you add some text to a line, the width of the line may be longer or shorter(!) than if you measured (shaped) the parts separately.
This occurs for many reasons. The post mentions splitting a ligature; it can also happen with kerning (spaces can have kerning applied, for example).
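A toy sketch of why widths aren't additive. The advance and kern values here are made up for illustration; a real engine would get both from the font via a shaper such as HarfBuzz:

```python
# Toy measurement model: per-glyph advance widths plus a kerning table
# keyed by glyph pairs. All values are hypothetical, not from a real font.
ADVANCES = {"A": 10, "V": 10, " ": 5, "T": 10, "o": 8}
KERNING = {("A", "V"): -2, ("T", " "): -1, (" ", "o"): -1}

def width(text: str) -> int:
    """Width of `text` shaped as one run: sum of advances plus pair kerns."""
    total = sum(ADVANCES[c] for c in text)
    total += sum(KERNING.get(pair, 0) for pair in zip(text, text[1:]))
    return total

# Measuring the parts separately misses the kern across the join:
assert width("A") + width("V") == 20
assert width("AV") == 18  # the A/V pair kern only appears when shaped together

# Kerning against the space character means even splitting at a space
# changes the total ("T" + " o" vs. "T o"):
assert width("T") + width(" o") != width("T o")
```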
E.g. many text layout engines will incrementally "add" to a line. However, to do this correctly for all cases you need to re-shape (measure) the entire line (or re-shape from a known "safe to break" point in the shape result, see: https://github.com/harfbuzz/harfbuzz/issues/224 ). This is typically fine for Latin text, but begins to become slow for more complex scripts (e.g. Thai).
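A rough sketch of the incremental approach. The `shape` helper is a hypothetical stand-in for a real shaper call; in reality that call is the expensive step, which is why re-shaping the whole line for every added word hurts:

```python
def shape(text: str) -> int:
    # Stand-in for a real shaper (e.g. HarfBuzz): here, one unit per
    # character, so re-shaping is trivially cheap. In a real engine this
    # is the expensive step that "safe to break" points let you skip.
    return len(text)

def break_line_incremental(words: list[str], max_width: int) -> str:
    """Add words one at a time, re-shaping the whole prefix each time,
    since width(line + word) != width(line) + width(word) in general.
    Returns the text that fits on the first line."""
    line = ""
    for word in words:
        candidate = (line + " " + word) if line else word
        if shape(candidate) > max_width:  # full re-shape every iteration
            break
        line = candidate
    return line

print(break_line_incremental("the quick brown fox".split(), 15))
```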
Chromium kinda works backwards. It tries to fit everything in the paragraph on a single line, shapes (measures) it, then finds a line-break opportunity within the (potentially large) shape result which would "fit" on the line without re-shaping.
Then, given that line-break, it reshapes the entire line (using the "safe to break" API) and the remaining content for the next line, sees if it fits, and repeats the process[1].
By always "removing" content from the line you end up with a correct implementation which works for all the crazy things which can happen with text rendering.
[1] The content in the line may become bigger after taking a line-break, hence this needs to happen in a loop until there is one "unbreakable" piece of content in the line. This loop typically only goes through one iteration.
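The removal-based loop described above can be sketched like this. The `shape` and `break_opportunities` helpers are hypothetical simplifications (a real engine shapes once and reuses the result via its safe-to-break points, and finds break opportunities per UAX #14, not just at spaces):

```python
def shape(text: str) -> int:
    # Hypothetical shaper: width = one unit per character.
    return len(text)

def break_opportunities(text: str) -> list[int]:
    """Indices where a break is allowed (after each space, in this toy)."""
    return [i + 1 for i, c in enumerate(text) if c == " "]

def first_line(text: str, max_width: int) -> str:
    """Shape everything, pick the last break opportunity that fits,
    re-shape the shortened line, and loop: removing content can itself
    change the shaped width, so one pass isn't guaranteed to be enough."""
    line = text
    while shape(line) > max_width:
        fits = [i for i in break_opportunities(line)
                if shape(line[:i].rstrip()) <= max_width]
        if not fits:
            break  # one "unbreakable" piece left; let it overflow
        line = line[:max(fits)].rstrip()
        # Loop again in case the re-shaped line is still too wide.
    return line

print(first_line("the quick brown fox jumps", 16))
```

In this toy model widths are additive, so the loop runs once, which matches the footnote: the multi-iteration case only shows up when shaping across the removed boundary changes the remaining line's width.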
No. Most line breaks in English are at spaces, and the Computer Modern fonts don't have any complex shaping behavior across spaces (it's mostly complex scripts such as Nastaliq that have this behavior). I haven't checked to see whether it takes into account the kerning value with the hyphen added.
I doubt Knuth would claim to have "solved" the problem, but I do get the impression that it is a solution. In particular, it does tackle hyphenation of words. It's where I learned of the re-cord versus rec-ord problem.
I'm curious what you mean regarding complex shapes across spaces. If it is related to the displayed bug regarding how words like "office" would be split, I don't think TeX ever had the bug as described. Though, I could be wrong, easily. (Incidentally, what a fun bug. Kudos to whoever found that one!)
Edit: I also had the impression that TeX would adjust intercharacter spacing as a whole to keep things looking a bit more uniform. Essentially, the goal was to act a lot like a human would when setting out the characters. If needed, you could adjust spacing on all characters within a margin of "not going to be noticed" so that you could expand a line of text to take up the full width. Without just having more space between a few words.
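For the interword case, TeX's actual mechanism is "glue": each space has a natural width plus stretch and shrink limits, and justification distributes the line's slack across the spaces. A minimal sketch with illustrative numbers (not TeX's real units or badness calculation):

```python
def justify(word_widths: list[int], space: float, stretch: float,
            shrink: float, line_width: float) -> float:
    """Distribute slack across interword spaces, TeX-glue style.
    Returns the adjusted per-space width, clamped to the glue's
    [space - shrink, space + stretch] range (a simplification)."""
    n_spaces = len(word_widths) - 1
    natural = sum(word_widths) + n_spaces * space
    slack = line_width - natural
    per_space = max(-shrink, min(stretch, slack / n_spaces))
    return space + per_space

# Three words of width 30 each, natural space 10, on a 120-unit line:
# natural width is 110, so each of the two spaces stretches by 5.
print(justify([30, 30, 30], space=10, stretch=6, shrink=3, line_width=120))
```

Character-level adjustment (slightly expanding the glyphs themselves) came later, with pdfTeX's font expansion as used by microtype, rather than classic TeX.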
By "shaping across space" I mean that the shapes of the words might be very different depending on whether there's a space or whether they're on second lines. That basically doesn't happen in Latin script (though one could possibly imagine a very fancy cursive or a stunt). So I'll give an example in Nastaliq. One of the example images[1] on Wikipedia is "خط نستعلیق" (meaning "Nastaliq script") where the "خط" is tucked under the other word when they're on the same line. There's a space between them, so it's a valid line break. If they were on two separate lines, the total width would be a lot wider because of that shaping behavior.
This is not conceptually a difficult problem, you can just shape all possible substrings between candidate line breaks to get their widths, at which point Knuth-Plass will give the optimal line breaks (relative to your objective function). But given that shaping is expensive, you really want to avoid that in the 99.99% or so cases where the width of the word isn't altered by other words beyond space boundaries. That's what the "safe to break" logic is about - letting you know when you can make that assumption, as opposed to needing to reshape to get the precise metrics.
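The "shape all possible substrings" step can be sketched as follows (hypothetical `shape` stand-in; the O(n^2) shaper calls are exactly the cost you want safe-to-break points to let you skip):

```python
def shape(text: str) -> int:
    # Stand-in for an expensive shaper call; width = chars here.
    return len(text)

def candidate_line_widths(words: list[str]) -> dict[tuple[int, int], int]:
    """Shape every contiguous run of words between candidate breaks,
    giving Knuth-Plass the exact width of each possible line, even when
    cross-space shaping makes widths non-additive."""
    n = len(words)
    widths = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            widths[(i, j)] = shape(" ".join(words[i:j]))
    return widths

w = candidate_line_widths(["some", "words", "here"])
print(w[(0, 2)])  # width of "some words"
```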
Sadly, I don't understand the example. Is there a meaning/logic to whether the word is tucked under? I know you said it rarely happens in Latin, so I'm assuming it is almost akin to how 2nd is supposed to be narrower, as the "nd" should not be the same as the 2? Definitely not a typographic convention that happens in traditional text, but almost akin to dropped caps? Feels like it is a convention that is common in logos and such? (Though, scanning, I can't see too many logos that stack words, anymore. Olive Garden is about the best example I can find.)
That said, logos are a good example where TeX is not suited. Odds are high that it will not help stack words in a way that works for slogans and such.
Pedantry: note that "safe to break" optimisations can occur vertically (where I encountered them with Latin scripts) as well as horizontally.
Also, the original HTML tables algorithm was evidently meant for hardcopy output and not relayout-upon-typed-input, as it was horrendously slow without making some safe to break optimisations.
Pretty sure TeX considers vertical optimizations as well? I know it tries not to dangle a single sentence onto a page, at least. (Well, I "know" that... I would be far from shocked to find I'm wrong.)
A lot of folks think that Knuth-Plass is the optimal solution for good looking text, for all text. It really only considers Latin, and even then has restrictions.
Some fonts have the space character in the kerning table, and possibly (although super rare) in the "ligature" table (GSUB). (https://www.sansbullshitsans.com/ is an example of a font with a space in GSUB: "paradigm shift" maps to a single glyph).
Fair, a lot of folks do treat it with probably more praise than makes sense. I'm sure I'm one of those people. :D
That said, my understanding is that there really is no "optimal" way to arrange text. You can, effectively, make an objective function that you can optimize on; but there is no global "this will make text look good" algorithm.
Indeed, even ligatures are... debatable in utility. I personally like them; but I would scoff at any claims of objective superiority for them. They are fun and a bit of a "flex" for laying out text on a computer. Any other claim is going to be tough to hold up.