This is easily the most interesting thing I've read all week.
I was talking to some of our systems engineers about the size of the Hadoop clusters at Yahoo. I was pretty impressed with having access to thousands of machines with tens of petabytes of storage (think kid in a candy shop).
It's nice to be put gently back into my place, by a simple raindrop.
This guy needs to read GEB. One could describe a raindrop with even more bits (subatomic particles anybody?) or less (size, volume, shape of raindrop).
Plus, of course, he's talking about that raindrop in isolation. To describe it in context with the real world, you also need to factor in the light refracting through it, the position of the observer, and the pace at which it moves both due to gravity and to the earth's rotation. So to describe it accurately from a single position requires knowledge of everything around that raindrop.
In fact, it's possible to say then that you could start with something as simple as, say, a piece of fairy cake, and deduce accurately the nature of everything else in the universe.
Reading this, the thing that stuck out to me was the relatively small size of text (5MB for Shakespeare versus 20GB for Beethoven). It struck me that poetry - particularly haiku, which deals with nature - is sort of a primitive way of applying lenses to the world to create a filter for viewing things by appealing to collective experiences. Just a thought.
What is the theory of useful information? According to algorithmic information theory, the less compressible something is, the more information it has. But the sort of information I'm interested in implies that there is a fairly large ratio of bits to compression. I.e., this text is much more compressible than a random string of letters, yet conveys much more information to me than a random string does. Is there a precise characterization of this sort of information? It isn't simply the compression ratio, since AAAAAAAAAAAAAAAAAAAAAAAAAAA is very compressible but conveys very little information.
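A quick way to see the spread is to compress three equal-length inputs and compare output sizes. A minimal sketch, assuming gzip and the usual coreutils are available (the filenames and the 1000-byte size are arbitrary choices):

```shell
# Compare gzip output sizes for three equal-length inputs:
# repeated A's (very compressible), English-like text (somewhat
# compressible), and random bytes (essentially incompressible).
printf 'A%.0s' $(seq 1 1000) > /tmp/aaa.txt
yes 'the quick brown fox jumps over the lazy dog' | head -c 1000 > /tmp/eng.txt
head -c 1000 /dev/urandom > /tmp/rand.bin
for f in /tmp/aaa.txt /tmp/eng.txt /tmp/rand.bin; do
  printf '%s: %s compressed bytes\n' "$f" "$(gzip -c "$f" | wc -c | tr -d ' ')"
done
```

The A's collapse to a few dozen bytes, the English-like text lands in between, and the random bytes barely shrink at all.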
Go the other way. Assume two strings of length k: string a composed of English words, and string b generated by reading /dev/random. String a will be more compressible: length(compress(a)) < k. You can now add MORE "information" to a and compress it until length(compress(a)) = k. Since you can't compress b any further, it already contains the maximum amount of information.
I put "information" in quotes above because the actual "information per bit" of English is pretty low, and that's where the ability to both compress it and comprehend it, not as single bits but as groups of bits, comes from. Compressed data is still comprehensible once you uncompress it, so this is only a measure of surface comprehensibility. It's really a measure of information density: compressed data has a high information-to-space ratio, whereas random, incompressible data has a low one.
You can experiment with this yourself:
cat > /tmp/string.a
(paste in some English text cut and pasted from a web page)
dd if=/dev/urandom of=/tmp/string.b bs=1 count=`wc -c < /tmp/string.a`
bzip2 /tmp/string.*
ls -l /tmp/string.*
Experiment with the above for corpora of different lengths. You'll see that longer English text, which is information rich, compresses better (as a ratio) than shorter English text, and that English of any length compresses far better than random data (which, in the everyday sense, conveys little information) of the same length.
A string composed solely of 27 As would compress down to perhaps 2 bytes or less (not including the size of the decompressor). You are right: there is not much information in it. Less than 16 bits of information in 27 As.
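One caveat if you try that with a real compressor: at these tiny sizes the container format's fixed header and trailer swamp the payload, so the measured output is bigger than the information-theoretic couple of bytes. A sketch, using gzip rather than bzip2:

```shell
# 27 A's carry almost no information, but gzip's fixed header and
# trailer alone are about 18 bytes, so the output is actually
# larger than the 27-byte input.
printf 'A%.0s' $(seq 1 27) | wc -c            # 27 bytes in
printf 'A%.0s' $(seq 1 27) | gzip -c | wc -c  # ~30 bytes out
```

The payload inside that output really is only a few bytes, which is the sense in which the 27 A's hold almost no information.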
I'm looking for a definition of information that matches up with how we use the word in normal parlance. This definition has to strike some kind of happy medium between completely incompressible and completely compressible. I could make something up, but I was wondering what the official version is.
I'm not sure I follow. The definition of "information" I use every day matches the "official" version -- I'm not sure how it could be different. What is "completely compressible"? Something cannot be compressed beyond the shortest string that can represent it without losing information content.
Shannon's information is a measure of the size of a set that an element is selected from, is that right? So a letter conveys log2(26) bits of information. That alone doesn't allow me to discriminate between the information content of a random string and an English sentence, since both supposedly contain the same amount of information by this bare metric.
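For the record, that per-letter figure works out to about 4.7 bits. A one-liner, assuming awk is available:

```shell
# Each letter drawn uniformly from a 26-symbol alphabet conveys
# log2(26) bits under Shannon's measure.
awk 'BEGIN { printf "log2(26) = %.2f bits per letter\n", log(26)/log(2) }'
```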
If I look at the occurrence of subsets of the string, then that would be a better discriminator: the frequencies of the random string's subsets should be roughly uniform, while the English string's subsets will be highly skewed.
However, that doesn't work when I try to discriminate between an English string and a string generated by a simple algorithm, since the latter's subset distribution will also be highly skewed. What kind of metric discriminates the English sentence from either case?
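A crude single-letter version of the "skewed subset distribution" test can be sketched in the shell (the sample sentence and the 64-byte sample size are arbitrary):

```shell
# English letter counts come out highly uneven...
echo 'the quick brown fox jumps over the lazy dog the quick brown fox' |
  fold -w1 | grep '[a-z]' | sort | uniq -c | sort -rn | head -5
# ...while random lowercase letters come out roughly flat.
LC_ALL=C tr -dc 'a-z' < /dev/urandom | head -c 64 |
  fold -w1 | sort | uniq -c | sort -rn | head -5
```

As the comment above notes, though, this frequency test still won't separate English from the output of a simple generator with the same skewed letter distribution.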
Am I making sense here? I haven't had any formal training in information theory, and my brain is kind of fried right now.
A raindrop does not have that much information in it. A raindrop only has some simple function that completely describes it, and this function likely takes only a few bytes. You can't take any information out of a raindrop, because the particles within it are random.
A raindrop is made up of a certain number of atoms, but these atoms are all the same, and cannot store any information. To store information, you need items that are dissimilar to each other. So the comparisons he makes are not correct.
As you say, the particles are random(ish), and this is exactly what makes them hard to describe completely with a simple function. "Random" means "the simplest description of it is the literal one", and the literal description of the position, orientation and motion of that many water molecules is a lot of information.
If raindrops were perfect crystals at absolute zero (and there were only one isotope each of hydrogen and oxygen), there would be a lot less information in them.
You are looking at things backwards. If you try to describe a raindrop by mapping the position of every single atom or molecule, you are creating information. But the raindrop itself does not contain any information, because it is completely random. It has no memory, and so cannot store information.
There is a very small amount of information in a raindrop, no matter how complex its molecular structure is.
The entire argument is flawed. It's based on a completely wrong premise - information is not the same thing as structure.
I can't explain this any better, you will need a leap of intuition to understand what I mean, but when you get it, it will be obvious.
Well, that's if you want to record the whole state for simulation by a classical computer. If you just want to have the state, it seems like it should be measured in qubits.
If we don't know whether the universe is continuous or discrete, how can we know how much digital information is in a raindrop? If the universe is continuous, the distance between any two particles can be measured to an infinitely more fine-grained level, yielding more information.