Right, well, again: if your average East Asian codepoint is 3 bytes in UTF-8 and your average markup codepoint is 1 byte, then as the percentage of markup in your corpus rises, the average size per codepoint falls towards 1 byte. But not all text contains markup ;) Consider database storage, for example, or text in a game, e-books, visual novels. I guess what I'm saying is anything that isn't the web or JSON, which is a lot.
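To put rough numbers on the markup argument, here is a minimal Python sketch. It assumes the simplified model from the comment (every East Asian codepoint costs 3 bytes in UTF-8, every markup codepoint costs 1), and the function names and the sample strings are made up for illustration, not taken from any real corpus.

```python
def modelled_bytes_per_codepoint(markup_fraction, cjk_bytes=3, markup_bytes=1):
    """Average UTF-8 bytes per codepoint under the simplified model.

    markup_fraction is the share of codepoints that are 1-byte markup;
    the remainder is assumed to be 3-byte East Asian text.
    """
    if not 0.0 <= markup_fraction <= 1.0:
        raise ValueError("markup_fraction must be between 0 and 1")
    return markup_fraction * markup_bytes + (1.0 - markup_fraction) * cjk_bytes


def measured_bytes_per_codepoint(text):
    """Average UTF-8 bytes per codepoint of an actual string."""
    return len(text.encode("utf-8")) / len(text)


# The modelled average slides from 3 down to 1 as markup takes over.
for fraction in (0.0, 0.25, 0.5, 0.75, 0.9, 1.0):
    average = modelled_bytes_per_codepoint(fraction)
    print(f"markup {fraction:4.0%}: {average:.2f} bytes per codepoint")

# Hypothetical samples: the same text with and without markup around it.
plain = "日本語"
wrapped = "<p>日本語</p>"
print(measured_bytes_per_codepoint(plain))
print(measured_bytes_per_codepoint(wrapped))
```

With no markup at all (the database, game, or e-book case) the text stays at the full 3 bytes per codepoint, which is the point being made above.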