Spark goes well beyond just in-memory caching. It features a more advanced scheduler and a higher-level programming abstraction.
The programming abstraction treats all data as collections (RDDs in Spark terminology) and allows programmers to apply bulk transformations to these collections. Some examples of operations you can apply include the traditional map and reduce, the relational filter, join, outerJoin, and more advanced ones like sample. This abstraction makes it much easier to write distributed programs. As the Wired article mentioned, a distributed program written in Spark often looks identical to a single-node program. This substantially reduces the amount of code one needs to write for distributed programs, and the best part is that the code really expresses the algorithm (rather than being cluttered with JobConf setup).
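To illustrate the "looks identical to a single-node program" point, here is a plain-Python word count; the PySpark RDD version reads almost line for line the same (shown in comments). The data and variable names here are illustrative, not from the original post.

```python
# Single-node word count in plain Python. The comments show the
# roughly equivalent Spark RDD calls (map / reduceByKey / filter).
words = ["spark", "hadoop", "spark", "rdd", "spark"]

# Map each word to a (word, 1) pair.
pairs = [(w, 1) for w in words]          # rdd.map(lambda w: (w, 1))

# Reduce by key: sum the counts for each word.
counts = {}
for key, n in pairs:                     # rdd.reduceByKey(lambda a, b: a + b)
    counts[key] = counts.get(key, 0) + n

# Relational-style filter: keep only words seen more than once.
frequent = {k: v for k, v in counts.items() if v > 1}  # rdd.filter(...)
print(frequent)  # {'spark': 3}
```

In Spark, the same logic runs over a distributed collection, but the shape of the code barely changes.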
The scheduler and the engine itself are aware of the overall DAG of operators, so they can schedule and run those operators more efficiently. For example, if you have multiple consecutive maps, their execution gets pipelined; if you are joining two collections that are partitioned the same way, the execution avoids an expensive shuffle step.
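The pipelining idea can be sketched in a few lines of plain Python (an analogue, not Spark's actual implementation): consecutive map functions are fused into one pass over the data, so no intermediate collection is ever materialized, which is roughly what happens when the engine sees adjacent map-like stages in the DAG.

```python
# Toy sketch of operator pipelining: fuse consecutive maps into one pass.
def pipeline(*fns):
    """Compose map functions so each element flows through all of them."""
    def fused(x):
        for f in fns:
            x = f(x)
        return x
    return fused

data = range(5)
fused = pipeline(lambda x: x + 1, lambda x: x * 2)

# One pass over the data; no intermediate list of (x + 1) values is built.
result = [fused(x) for x in data]
print(result)  # [2, 4, 6, 8, 10]
```

A naive executor would instead build the full `(x + 1)` list first and then map `* 2` over it, touching the data twice and holding an extra copy in memory.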
There are many other benefits too. I'd encourage you to give it a try. Thanks!
Disclaimer: I am on the Spark team at UC Berkeley.
It doesn't look like there's been any direct comparison of the two, though there's clearly some overlap. (I've wanted to start a streaming data-processing project, and this looks worth considering for it.)