500 Exabytes per Raindrop

sh1mmer · on Nov 12, 2008

This is easily the most interesting thing I've read all week.

I was talking to some of our systems engineers about the size of the Hadoops clusters at Yahoo. I was pretty impressed with having access to thousands of machines with 10s of petabytes of storage (think kid in a candy shop).

It's nice to be put gently back into my place, by a simple raindrop.

biohacker42 · on Nov 12, 2008

Having worked for a bioinformatics company, I can describe my work as handling huge volumes of data.

Modern instruments are ever more computerized and spit out gigs and gigs of 1s and 0s with every use.

And then you have to take all that data and try to turn into something humans can understand.

And every time our instruments get better the amount of data we collect goes up.

Systems biology is basically a data management problem.

bradgessler · on Nov 12, 2008

This guy needs to read GEB. One could describe a raindrop with even more bits (subatomic particles anybody?) or less (size, volume, shape of raindrop).

unalone · on Nov 12, 2008

Plus, of course, he's talking about that raindrop in isolation. To describe it in context with the real world, you also need to factor in the light refracting through it, the position of the observer, and the pace at which it moves both due to gravity and to the earth's rotation. So to describe it accurately from a single position requires knowledge of everything around that raindrop.

In fact, it's possible to say then that you could start with something as simple as, say, a piece of fairy cake, and deduce accurately the nature of everything else in the universe.

Reading this, the thing that stuck out to me was the relatively small size of text (5MB for Shakespeare versus 20GB for Beethoven). It struck me that poetry - particularly haiku, which deals with nature - is sort of a primitive way of applying lenses to the world to create a filter for viewing things by appealing to collective experiences. Just a thought.

celoyd · on Nov 13, 2008

Primitive? Poetry is probably the most developed form of lossy compression there is.

unalone · on Nov 13, 2008

Primitive meaning the concept behind it - capture a moment in time with words - is a very old and simple one.

Don't get me wrong: poetry has become incredibly sophisticated. And it's still as enjoyable to write as it was long, long ago.

robertk · on Nov 12, 2008

This is John Baez. I am pretty sure he has read GEB.

http://en.wikipedia.org/wiki/John_Baez

jacobscott · on Nov 13, 2008

From his website:

"I'll probably keep working on topological quantum field theory and other wimpy subjects"

I would be careful what you tell him to go read...

andreyf · on Nov 12, 2008

Interesting that you thought of describing a raindrop... I thought of storing information in it ;)

Just imagine how much bigger our storage mechanisms will get...

bradgessler · on Nov 12, 2008

[...] In other words, this is how much information it takes to completely specify the state of one gram of water.

Isn't that what the author was getting at?

vitaminj · on Nov 13, 2008

To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,

And eternity in an hour

William Blake, Auguries of Innocence

glymor · on Nov 13, 2008

I'm pretty sure you got this from the Tomb Raider movie.

yters · on Nov 13, 2008

What is the theory of useful information? According to algorithmic information theory, the less compressible something is, the more information it has. But, the sort of information I'm interested in implies that there is a fairly large ratio of bits to compression. I.e. this text is much more compressible than a random string of letters, and conveys much more information to me than a random string. Is there a precise characterization of this sort of information? It isn't simple the compression ratio, since AAAAAAAAAAAAAAAAAAAAAAAAAAA is very compressible, but conveys very little information.

thwarted · on Nov 13, 2008

Go the other way. Assume two strings of length k: string a composed of english words, and string b generated by reading /dev/random. String a will be more compressible, length(compress(a)) < k. You can now add MORE "information" to a and compress it until length(compress(a)) = k. Since you can't compress b any further, it already contains the maximum amount of information.

I put "information" in quotes above because the actual "information per bit" of English is pretty low, and that's where the ability to both compress it and comprehend it, not as single bits but as groups of bits, comes from. Still, compressed data is still comprehensible once you uncompress it, so it's a measure of the surface comprehensiblity. It's really a measure of information density. Compressed data has a high information to space ratio, whereas random, uncompressible data has low information to space ratio (or perhaps negative).

You can experiment with this yourself:

   cat > /tmp/string.a
      (paste in some English text cut and pasted from a web page)
   dd if=/dev/urandom of=/tmp/string.b bs=1 count=`wc -c < /tmp/string.a`
   bzip2 /tmp/string.*
   ls -l /tmp/string.*

Experiment with the above for corpora of different lengths. You'll see that as there is more English text, which is information rich, it can compress better than shorter English text (as a compression ratio), and will compress better than random data (which we know contains little information) of the same length.

A string composed solely of 27 As would compress down to perhaps 2 bytes or less (not including the size of the decompressor). You are right: there is not much information in it. Less than 16 bits of information in 27 As.

yters · on Nov 13, 2008

I'm looking for a definition of information that matches up with how we use the word in normal parlance. This definition has to strike some kind of medium between incompressible and completely compressible. I could make something up, but I was wondering what the official version is.

thwarted · on Nov 13, 2008

I'm not sure I follow. The definition of "information" I use everyday matches the "official" version -- I'm not sure how it could be different. What is "completely compressible"? Something can not be compressed beyond the shortest string that can represent it without losing information content.

http://en.wikipedia.org/wiki/Information#Measuring_informati...

yters · on Nov 13, 2008

Shannon's information is a measure of the size of a set that an element is selected from, is that right? So a letter conveys log2(26) bits of information. That alone doesn't allow me to discriminate between the information content of a random string and an English sentence, since both supposedly contain the same amount of information by this bare metric.

If I look at the occurrence of subsets of the string, then that would be a better discriminator: the random string's subsets should follow a normal distribution while the English string's subsets will be highly skewed.

However, that doesn't work when I try to discriminate between an English string and a string generated by a simple algorithm, since the latter's subset distribution will also be highly skewed. What kind of metric discriminates the English sentence from either case?

Am I making sense here? I haven't had any formal training in information theory, and my brain is kind of fried right now.

markessien · on Nov 12, 2008

A raindrop does not have that much information in it. A raindrop only has some simple function that completely describes it, and this function likely has only some bytes. You can't take any information out of a raindrop, because the particles within it are random.

A raindrop is made up of a certain number of atoms, but these atoms are all the same, and cannot store any information. To store information, you need items that are dissimilar to each other. So the comparisons he makes are not correct.

celoyd · on Nov 13, 2008

As you say, the particles are random(ish), and this is exactly what makes them hard to describe completely with a simple function. "Random" means "the simplest description of it is the literal one", and the literal description of the position, orientation and motion of that many water molecules is a lot of information.

If raindrops were perfect crystals at absolute zero (and there were only one isotope each of hydrogen and oxygen), they would be a lot less information in them.

markessien · on Nov 13, 2008

You are looking at things backwards. If you try to describe a raindrop by mapping the position of every single atom or molecule, you are creating information. But the raindrop itself does not contain any information, because it is completely random. It has no memory, and so cannot store information.

There is a very small amount of information in a raindrop, no matter how complex it is from the molecular structure.

The entire argument is flawed. It's based on a completely wrong premise - information is not the same thing as structure.

I can't explain this any better, you will need a leap of intuition to understand what I mean, but when you get it, it will be obvious.

DaniFong · on Nov 13, 2008

There's even more information if the quantum information is deemed important; in excess of 2^(5 x 1020).

Ultimately, one should come away with the understanding that obviously not all that information is important.

cgranade · on Nov 13, 2008

Well, that's if you want to record the whole state for simulation by a classical computer. It you just want to have the state, it seems like it should be measured in qubits.

anewaccountname · on Nov 13, 2008

If we don't know whether the universe is continuous or discrete, how can we know how much digital information is in a raindrop? If the universe is continuous, the distance between any two particles can be measured to an infinitely more fine-grained level, yielding more information.

tel · on Nov 12, 2008

90 petabytes: the "Deep Web" in 2002 - includes private databases, controlled-access sites and so on.

6 years ago, oh my.