But you can't send the new emoji over basic SMS, because SMS uses a variant of ...

vardump · on Jan 4, 2016

> because SMS uses a variant of UTF-16 from the era when people thought 16 bits was big enough

SMS uses 7-bit by default. https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_al...

Hacks? They're called UTF-16 surrogate pairs, not hacks. Officially the encoding is UCS-2, but UCS-2 has been legacy for almost 20 years now. Modern phones select the encoding dynamically and use UTF-16 if the message cannot be encoded otherwise. You get up to 160 7-bit characters (some symbols take two) or 70 UTF-16 code units in Unicode mode.
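
A rough sketch of the counting described above, in Python. The function names are made up for illustration, and the GSM 7-bit alphabet is ignored entirely; this only counts UTF-16 code units against the 70-unit figure quoted for a single Unicode-mode message.

```python
# 70 UTF-16 code units per single Unicode-mode SMS, as quoted in the comment above.
UNICODE_SMS_LIMIT = 70


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units; code points above U+FFFF count as two."""
    return len(text.encode("utf-16-be")) // 2


def fits_single_unicode_sms(text: str) -> bool:
    """Hypothetical helper: does the text fit in one Unicode-mode SMS?"""
    return utf16_code_units(text) <= UNICODE_SMS_LIMIT


print(utf16_code_units("hi"))                       # 2
print(utf16_code_units("\U0001F600"))               # 2: one emoji, two code units
print(fits_single_unicode_sms("\U0001F600" * 35))   # True: exactly 70 units
print(fits_single_unicode_sms("\U0001F600" * 36))   # False: 72 units
```

So a message made only of astral-plane emoji holds 35 of them, not 70.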

I just wish everyone started to use UTF-8 already and dropped all the other nonsense.

alblue · on Jan 4, 2016

Emoji have always gone beyond 2 bytes. The Unicode spec also includes "surrogate pairs", which allow a higher-plane code point to be represented as 4 bytes in UTF-16.
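
For anyone curious how the pair is derived, here is a minimal Python sketch of the standard UTF-16 arithmetic. The function name is hypothetical; the constants 0xD800, 0xDC00 and the 10-bit split come from the UTF-16 definition.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # GRINNING FACE
print(f"{high:04X} {low:04X}")          # D83D DE00
```

Two 16-bit code units, hence 4 bytes.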

Animats · on Jan 4, 2016

There are a few 2-byte emoji:

0x2639 Frowning face

0x263A Smiling face

(Hacker News doesn't speak much Unicode; the Unicode symbols won't pass through.)

vardump · on Jan 4, 2016

So what happens if I put a standard smiley in a message? (Grinning face Unicode: U+1F600, UTF-8: F0 9F 98 80)

Edit: I see, it disappears.
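
To put numbers on the sizes being discussed, a small Python sketch comparing the two BMP faces mentioned earlier with U+1F600. The helper names are illustrative only.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as spaced upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def encoded_sizes(ch: str) -> dict:
    """Byte lengths of one character in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
    }


for ch in ("\u2639", "\u263A", "\U0001F600"):
    print(f"U+{ord(ch):04X}", utf8_hex(ch), encoded_sizes(ch))
```

The two BMP faces take 2 bytes in UTF-16 but 3 in UTF-8; U+1F600 takes 4 bytes in both.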

simoncion · on Jan 4, 2016

Hmm. Does HN use MySQL with "utf8" encoding as backend storage? ;)

riffraff · on Jan 4, 2016

For those not getting the joke: MySQL has a character set called "utf8" that is not in fact UTF-8 and will (depending on settings) either truncate text when it meets a 4-byte character or raise an error.

More recent versions also support real UTF-8, calling it "utf8mb4".
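
A toy Python simulation of the truncating behavior described above. This is not MySQL code and the function name is invented; it just cuts the string at the first character whose UTF-8 encoding needs more than 3 bytes.

```python
MAX_BYTES_PER_CHAR = 3  # assumption: the 3-byte ceiling of the "utf8" character set


def truncate_at_first_wide_char(text: str) -> str:
    """Drop everything from the first character needing more than 3 UTF-8 bytes."""
    for index, ch in enumerate(text):
        if len(ch.encode("utf-8")) > MAX_BYTES_PER_CHAR:
            return text[:index]
    return text


print(repr(truncate_at_first_wide_char("abc\U0001F600def")))  # 'abc'
print(repr(truncate_at_first_wide_char("\u263A ok")))         # '\u263a ok' survives intact
```

Note that everything after the emoji is lost too, not just the emoji itself.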

david-given · on Jan 4, 2016

This made filing a bug about Thunderbird not sizing astral plane code points correctly slightly more hilarious than it should have been (Mozilla's Bugzilla instance runs on MySQL)...

simoncion · on Jan 4, 2016

Everywhere I've used MySQL, the default has been to silently truncate data that doesn't fit in the "utf8" encoding! :(

masklinn · on Jan 4, 2016

If that were the case, astral characters would be removed (e.g. Shavian or the emoji block), but BMP symbols like U+263F MERCURY or U+262C ADI SHAKTI would be left alone, and they're not: "", ""

Since U+A420 YI SYLLABLE JJUOX goes through (ꐠ), HN likely explicitly strips most symbols (but not all: U+00B6 PILCROW SIGN (¶) is unmolested).
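
The BMP/astral distinction in the argument above can be checked with a few lines of Python. This is only a sketch for inspecting code points; it says nothing about what filter HN actually applies, and the `describe` helper is made up for the example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report the plane and Unicode general category of a single character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,               # 0 is the BMP, 1 and up are astral
        "is_bmp": cp <= 0xFFFF,
        "category": unicodedata.category(ch),
    }


# MERCURY, ADI SHAKTI, YI SYLLABLE JJUOX, PILCROW SIGN, GRINNING FACE
for ch in ("\u263F", "\u262C", "\uA420", "\u00B6", "\U0001F600"):
    print(describe(ch))
```

A 3-byte storage limit would only affect the characters with `is_bmp` false, which is why the missing BMP symbols point to deliberate stripping instead.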