
But you can't send the new emoji over basic SMS, because SMS uses a variant of UTF-16 from the era when people thought 16 bits was big enough. (So do Java and Windows, although there are hacks in both to get past 2 bytes.) The new emoji are all up in the astral planes, beyond 2 bytes.


> because SMS uses a variant of UTF-16 from the era when people thought 16 bits was big enough

SMS uses 7-bit by default. https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_al...

Hacks? They're called UTF-16 surrogate pairs, not hacks. Officially SMS uses UCS-2, but UCS-2 has been legacy for almost 20 years now. Modern phones select the encoding dynamically and use UTF-16 if the message cannot be encoded otherwise. You'll get up to 160 7-bit characters (some symbols take two characters) or 70 UTF-16 code units in Unicode mode.
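To make the surrogate-pair mechanism concrete, here is a minimal Python sketch of how a code point above U+FFFF is split into two UTF-16 code units, and how a message could be counted against the 70-code-unit limit. The function names are made up for illustration, and the length check ignores real-world details such as concatenated multi-part messages.

```python
from typing import Tuple

# Single-message limit in Unicode mode, as quoted above.
SMS_UTF16_LIMIT = 70


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units (2 bytes each) needed for the text."""
    return len(text.encode("utf-16-be")) // 2


def fits_in_one_unicode_sms(text: str) -> bool:
    """Hypothetical helper: does the text fit a single Unicode-mode SMS?"""
    return utf16_code_units(text) <= SMS_UTF16_LIMIT


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(utf16_code_units("\U0001F600"))
```

So an astral-plane emoji costs two of the 70 code units, not one.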

I just wish everyone started to use UTF-8 already and dropped all the other nonsense.


Emoji have always been beyond 2 bytes. The Unicode spec also includes "surrogate pairs", which allow a higher-plane code point to be represented as 4 bytes.


There are a few 2-byte emoji:

0x2639 Frowning face

0x263a Smiling face

(Hacker News doesn't speak much Unicode; the Unicode symbols won't pass through.)
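For reference, a small Python check of how many bytes these code points take in each encoding. "2-byte" here means UTF-16; the same symbols take 3 bytes in UTF-8. U+1F600 is included only as an astral-plane comparison, and the helper name is made up for illustration.

```python
# Compare encoded sizes of two BMP symbols and one astral-plane emoji.
samples = {
    "U+2639 WHITE FROWNING FACE": "\u2639",
    "U+263A WHITE SMILING FACE": "\u263A",
    "U+1F600 GRINNING FACE": "\U0001F600",
}


def encoded_sizes(ch: str) -> dict:
    """Return the byte length of the text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
    }


for name, ch in samples.items():
    sizes = encoded_sizes(ch)
    print(f"{name}: UTF-8 {sizes['utf-8']} bytes, UTF-16 {sizes['utf-16']} bytes")
```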


So what happens if I put a standard smiley in a message? (Grinning face Unicode: U+1F600, UTF-8: F0 9F 98 80)

Edit: I see, it disappears.


Hmm. Does HN use MySQL with "utf8" encoding as backend storage? ;)


For those not getting the joke: MySQL has a thing called "utf8" which is not in fact UTF-8. It stores at most 3 bytes per character and will (depending on settings) either truncate text when it meets a 4-byte character, or raise an error.

It also supports real UTF-8 in more recent versions, calling it "utf8mb4".
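A rough way to see which strings would be affected, sketched in Python. This does not talk to MySQL at all; it only tests whether every character fits in 3 bytes of UTF-8, which is the same as being at or below U+FFFF. Both function names are hypothetical.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every code point is in the BMP (at most 3 bytes in UTF-8)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def first_astral_index(text: str) -> int:
    """Index of the first code point above U+FFFF, or -1 if there is none.

    With silent truncation, everything from this index on would be lost.
    """
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return -1


message = "smile \U0001F600 and more text"
print(fits_three_byte_utf8(message))
print(message[:first_astral_index(message)])
```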


This made filing a bug about Thunderbird not sizing astral plane code points correctly slightly more hilarious than it should have been (Mozilla's Bugzilla instance runs on MySQL)...


Everywhere I've used MySQL, the default has been to silently truncate data that doesn't fit in the "utf8" encoding! :(


If that were the case, astral characters would be removed (e.g. Shavian or the emoji block) but BMP symbols like U+263F MERCURY or U+262C ADI SHAKTI would be left alone, and they're not: "", ""

Since U+A420 YI SYLLABLE JJUOX goes through (ꐠ) HN likely explicitly strips most symbols (but not all, U+00B6 PILCROW SIGN (¶) is unmolested)



