I thought the title meant it was using AltiVec or SSE, but it's merely operating on chunks of 4 bytes at a time, after dealing with misaligned data up front. Still a good article, despite my initial disappointment.
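For anyone who hasn't seen the trick, here is a rough sketch of the "handle the misaligned bytes first, then go 4 bytes at a time" idea, applied to a bounded string-length scan. This is not the article's code: the names `has_zero_byte` and `chunked_strnlen` are made up for illustration, and the sketch assumes the caller guarantees that `maxlen` bytes starting at `s` are readable, so the chunked loop never reads outside the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if any of the four bytes in w is zero. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical name: length of s, scanning at most maxlen bytes.
 * Assumes all maxlen bytes starting at s are readable. */
size_t chunked_strnlen(const char *s, size_t maxlen)
{
    size_t i = 0;

    /* Step 1: go byte by byte until the address is 4-byte aligned. */
    while (i < maxlen && ((uintptr_t)(s + i) & 3u) != 0) {
        if (s[i] == '\0')
            return i;
        i++;
    }

    /* Step 2: test aligned 4-byte chunks; stop at the first chunk
     * that contains a NUL. memcpy keeps the load well-defined. */
    while (maxlen - i >= 4) {
        uint32_t w;
        memcpy(&w, s + i, sizeof w);
        if (has_zero_byte(w))
            break;
        i += 4;
    }

    /* Step 3: finish byte by byte, which pins down the NUL inside the
     * chunk found above or walks the leftover tail. */
    while (i < maxlen && s[i] != '\0')
        i++;
    return i;
}
```

The bounded form is a deliberate choice here: an unbounded strlen() written this way has to read past the terminator inside the last chunk, which real implementations get away with only because an aligned word never straddles a page.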
I don't think I believe those benchmarks. Many of those numbers are significantly slower (!) than main memory bandwidth, and that's for code like strlen(), which can be naively implemented in about three instructions per byte.
A similar article, which originally taught me these tenets, is:
http://rentzsch.com/papers/straightenUpAndFlyRight