Spark: Open Source Superstar Rewrites Future of Big Data (wired.com)
145 points by MarlonPro on June 19, 2013 | hide | past | favorite | 39 comments


Hadoop is a pile of bad code, a stagnant codebase, crusty APIs and a thick surrounding layer of hype which obscures what it's really like to use. Spark might be better or faster, but mostly what you need to beat Hadoop is to make something practical, which lets developers be expressive rather than wrestle with overdesigned nonsense.

I know this because it's my full-time job to actually get stuff done inside Hadoop.

Spark may be a great system, but this article doesn't do much to settle the issue. When you read fluff like "sweeping software platform", "famously founded the Hadoop project", "great open source success stories" and machine learning described as "crunching and re-crunching the same data -- in what's called a logistic regression", it's time to move on.


>>Hadoop is a pile of bad code, a stagnant codebase, crusty APIs and a thick surrounding layer of hype which obscures what it's really like to use.

This is where marketing and branding become the primary factors influencing adoption, not technical merit. Hadoop gathered so much momentum and hype as part of the Big Data buzz of the past few years that it's only now beginning to percolate through to telecommunication carriers and other larger, slower-moving enterprises.*

* I work primarily with wireless carriers; can't say much about broadband, although I'd hazard a guess and say that the majority are only now allocating experimental budgets to see how Hadoop can help them manage their Big Data.


Yeah, I lol'd at that too. Especially given that logistic regression is a tool for statistics, not machine learning (unless you subscribe to the theory that those two fields are one and the same).


> mostly what you need to beat Hadoop is to make something practical, which lets developers be expressive rather than wrestle with overdesigned nonsense.

It would be difficult for an article to convey this. Having used both Hadoop and Spark, practicality and expressiveness are precisely what made me fall in love with Spark. You can do so much with so little code. Don't take anyone's word for it; see for yourself. Download the code and run the interactive shell -- it takes two minutes. It was totally mind-blowing for me.
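To give a flavor of that expressiveness, here is a sketch of the classic word-count job in the RDD style. This uses plain Python lists as stand-ins (the real Spark shell runs the same chain -- flatMap, map, reduceByKey -- against a cluster; everything below is my own illustration, not the Spark API):

```python
from collections import Counter

# Stand-in for an RDD of lines of text.
lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1, then reduceByKey: sum the counts.
counts = Counter()
for word in words:
    counts[word] += 1

print(counts["to"])  # "to" appears 4 times across both lines
```

The whole job is a handful of transformations; in the actual Spark shell the equivalent is a single chained expression.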


>Spark may be a great system, but this article doesn't do much to settle the issue.

Wired is a mainstream technology magazine.


When I first learned about Spark, I knew this team would go on to build great things, and they've gone beyond my expectations, all while being very friendly and supportive of the community.

Among the things I love most about Spark and its ecosystem:

* repl -- so great for running short little experiments or even full-blown jobs. It saves you the time of compiling small changes and lets you really get to know your data quickly.

* caching -- processing in-memory opens up many possibilities beyond iterative machine learning jobs. Quantifind, for example, demoed a system that lets them run ad-hoc Shark queries on the fly across GBs of data (think OLAP, a bit) in seconds.

* scala -- makes for very succinct code using closures and built-in operations; check out some examples here: http://spark-project.org/examples/

And some of the upcoming projects are also very cool. Tachyon, for example, will enable users to share data with a very robust in-memory file system. A teammate and I could have used it recently: we were simultaneously running different analyses against the same data, so we had to cache duplicate instances on two clusters.


Spark is a wonderful project, I blogged about it just the other day: http://subprotocol.com/2013/06/17/spark-darling-of-big-data....

Spark makes doing MR easy. I've used other frameworks on Hadoop MR, but nothing compares with the ease with which you can express computations using it. And it does both batch and real-time/streaming. It is a very well-thought-out project.


Not only is it easy to use, but the source code is a real pleasure to read, in contrast to Hadoop's mess.


Can someone explain why the in-memory caching is such a big win? Does Hadoop MapReduce not do caching as well? I'd expect at least filesystem caching when the computation is running on the same machine as the data block...


Spark goes well beyond in-memory caching. It features a more advanced scheduler and a higher-level programming abstraction.

The programming abstraction treats all data as collections (RDDs in Spark terminology) and allows programmers to apply bulk transformations on these collections. Examples of operations you can apply include the traditional map and reduce, the relational filter, join, and outerJoin, and more advanced ones like sample. This abstraction makes it much easier to write distributed programs. As the Wired article mentioned, a distributed program written in Spark often looks identical to a single-node program. This substantially reduces the amount of code one needs to write, and the best part is that the code really expresses the algorithm (rather than being cluttered with JobConf setup).
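As a rough illustration of that collection-oriented style, here is a filter followed by a join over (key, value) pairs, written with plain Python lists standing in for pair RDDs (the data and names below are made up for illustration; Spark's own `filter` and `join` operate the same way conceptually, but distributed):

```python
# Two "collections" of (key, value) pairs, as pair RDDs would look.
users  = [(1, "alice"), (2, "bob"), (3, "carol")]
visits = [(1, "/home"), (1, "/about"), (3, "/home")]

# filter: keep only visits to /home.
home_visits = [(uid, url) for uid, url in visits if url == "/home"]

# join: match each remaining visit with the user record sharing its key.
names = dict(users)
joined = [(uid, (names[uid], url)) for uid, url in home_visits if uid in names]

print(joined)  # [(1, ('alice', '/home')), (3, ('carol', '/home'))]
```

The single-node version and the distributed version read the same, which is the point being made above.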

And the scheduler and the engine itself are aware of the general DAG of operators, so they can schedule and run those operators better. For example, if you have multiple maps, the execution gets pipelined; if you are joining two collections that are partitioned the same way, the execution avoids an expensive shuffle step.
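The pipelining of consecutive maps can be mimicked with Python generators: chained stages are fused into a single pass over the data instead of materializing an intermediate collection after each step. This is an analogy for the scheduling behavior described above, not how Spark is implemented:

```python
def pipeline(data):
    # Each "map" lazily wraps the previous stage; no intermediate list is
    # built. Values are pulled through both stages in one pass.
    stage1 = (x * 2 for x in data)    # first map
    stage2 = (x + 1 for x in stage1)  # second map, fused with the first
    return list(stage2)

print(pipeline([1, 2, 3]))  # [3, 5, 7]
```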

There are many other benefits too. I'd encourage you to give it a try. Thanks!

Disclaimer: I am on the Spark team at UC Berkeley.


Does Spark address the same problems as Storm?

It doesn't look like there's been any direct comparison of the two, though it looks like there's overlap. (I've wanted to start a streaming data processing project, and this looks like it would be good to consider for it.)


There are some nice looking Spark vs Storm graphs in their streaming presentation slides: http://spark-project.org/talks/strata_spark_streaming.pdf . Makes me wonder how biased these might be.


Recent versions of Spark have the ability to do stream processing: http://spark-project.org/docs/latest/streaming-programming-g...


Thanks - a nice explanation of Spark's benefits. The programming & execution models sound like big improvements - I'll check it out!


Justin: 1. Hadoop is multi-tenant. Another job could bust your cache. Unless things have dramatically changed since I first read the source a year or two ago, Spark jobs basically grab control of the cluster so that the cache won't get evicted. 2. Think of Pregel and other iterative-style jobs. You may run several iterations before writing to disk to checkpoint. This can speed up your job immensely, since you don't have to write and read data from slow disks. Of course, you lose some fault tolerance, but the speedup may be worth it.
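The second point is easy to see in miniature. In the toy loop below, the dataset stays in one in-memory variable across every iteration; in classic MapReduce each pass would instead round-trip through HDFS. The numbers and update rule are invented purely for illustration:

```python
# Toy iterative job: repeatedly refine an estimate over a cached dataset.
data = [1.0, 2.0, 3.0, 4.0]  # imagine this cached in cluster memory

estimate = 0.0
for _ in range(20):  # many iterations before any checkpoint to disk
    # Blend the current estimate with the dataset's mean.
    estimate = 0.5 * (sum(data) / len(data)) + 0.5 * estimate

# estimate converges to the mean (2.5) without touching "disk" once.
```

With 20 iterations the estimate is within a fraction of a percent of the mean; the win is that none of those 20 passes paid a disk write/read.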


My understanding of Hadoop and HDFS is that both input and output go to/from disk. Your MR jobs are spun up and pointed at HDFS URLs to read input from, and they write to HDFS when done. This means that when there is an error, the intermediate computations don't necessarily have to be redone. However, there is a trade-off.

Additionally, I believe that HDFS keeps 3 copies of the data on 3 different nodes for redundancy, so there is the overhead of that network traffic as well.


I did find this (now-fixed) bug/enhancement: https://issues.apache.org/jira/browse/HDFS-2246

Sounds like, with that configuration applied, the in-memory performance difference between Hadoop and Spark should not be nearly as large.


Good question. I did read the Spark paper, and one reason I found for Spark doing so much better than Hadoop is that it avoids the unnecessary serialization and deserialization that Hadoop simply cannot avoid. The RDDs, as mentioned by @rxin, are in-memory objects and thus do not require frequent serialization/deserialization when multiple operations are applied to the data.


"Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead." -- Wikipedia


You can find closed-form expressions for SOME likelihood functions, but this is not the case for logistic regression.

Anyway, the takeaway that I think you were downvoted for omitting is that if you can bypass the serialization, startup, and shutdown steps associated with the traditional Hadoop iterative process, you'll get a huge speedup.
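The iterative process in question is just repeated gradient steps over the same dataset, which is why re-reading that data from disk on every pass hurts so much. A minimal sketch on a made-up 1-D dataset (the data, learning rate, and iteration count are all illustrative choices, not from the article):

```python
import math

# Tiny separable dataset: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, lr = 0.0, 0.5
for _ in range(100):  # the "crunch and re-crunch the same data" loop
    # Gradient of the log-likelihood for a one-weight logistic model.
    grad = sum((y - 1 / (1 + math.exp(-w * x))) * x for x, y in zip(xs, ys))
    w += lr * grad

# w ends up positive: the model has learned that larger x means class 1.
```

Every one of those 100 iterations touches the full dataset, so keeping it as in-memory objects (Spark) rather than reloading and deserializing it each pass (classic MapReduce) is exactly where the speedup comes from.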


Spark project itself: http://spark-project.org/


I'm not sure what is meant by supporting Scala and Python.

* Here is an example of a simple job written using Scala for hadoop (and uses Mahout libraries) - https://github.com/sandys/distributed-scala-mahout/tree/wiki...

* you can embed pig inside jython

* you can write UDFs using jruby or jython.

* I didn't try to figure out how, but I'm pretty sure you can build a standalone job jar using jruby and warbler

There might be a different way of hooking into Spark through Scala or Python, but I'm pretty sure that language support per se is not the big advantage here.

Pig has a repl, but as I quickly realized from playing around, you end up mucking around with classpath problems (and questions of having your jars in the distributed cache) once you attempt to build UDFs involving a few different libraries.

Plus, I haven't used Cascalog/Clojure and so can't comment on that ecosystem, which is as functional as you can get.


This sounds amazing. I've done iterative jobs in hadoop before - it's very hacky and I generally just have it launch job after job after job until the result converges to where I want it. I'll definitely try this out soon.

Then again, while I absolutely love doing work with big data, I've been having a bit of an "existential crisis" since the NSA leaks :(


> Then again, while I absolutely love doing work with big data, I've been having a bit of an "existential crisis" since the NSA leaks

Book recommendation: "Who owns the future" by Jaron Lanier.


I've met Matei in passing through programming contests (where he is also a star) but even in those brief moments, his brilliance is pretty apparent. Good on him for the recognition---he deserves it.


I don't understand why they wrote this in Java instead of a language more suited to the task, such as Erlang.

Can someone explain this to me?


According to GitHub, 85% of this is in Scala.

It's written on the JVM for the simple reason that if they wrote it in Go or Erlang, no enterprise would adopt it: there isn't a CTO at a non-tech Fortune 500 who has ever heard of Erlang or Go, and they wouldn't know the first thing about trying to hire developers for it. Remember, jobs written for MapReduce are (typically) done in the same language as the MapReduce code itself.


Why didn't they go native?


Why would they?


I thought they cared about performance...


It is written in Scala with the Akka framework, not pure Java, so it uses the same computational model as Erlang. Being Java-compatible makes it easier to integrate with other popular Big Data tools, e.g. Hadoop or Hive (see Shark).


Erlang is not that performant in the general case compared to C/Java/etc. While the conceptual overhead in distribution is much lighter with Erlang, actual computation in an idiomatic implementation tends to be slower than Scala/Akka.

Of course, there are tasks for which an Erlang implementation would be faster, but as others have mentioned, most organizations would prefer to write Scala.

(Yes, I find that icky too. :-))


There were more than a few other choices (Go...), not to mention alternatives still on the JVM.

Nobody ever got fired for building with Java.


Go is nowhere near the adoption of Scala right now, and its compiler has a lot of catching up to do to reach the level of performance Scala gets on top of the Oracle JVM.

Considering there are already lots of Big Data tools in the Java ecosystem (Hadoop, Hive, Pig, Mahout, etc.), Scala looks like a very reasonable choice.


Your statement is perfectly valid. Scala was what I had in mind when I wrote my comment.

I wish I found Scala as easy to use as I do Go. I do not enjoy the syntax, nor do I consider running on top of the JVM a selling point.

I came from Java but worked in Python for many years, so that probably explains my bias.


The article discusses the challenge of supplanting entrenched software such as Hadoop. I am actually a bit more optimistic about the ability to swap out Hadoop with technologies like Spark.

At Twitter we don't program using Hadoop directly; we mostly use either Scalding or Pig, languages that compile down to Hadoop code. https://dev.twitter.com/blog/scalding http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-n...

I believe this is how many other companies use Hadoop as well.

The benefit here is that it's possible to write new backends for Pig and Scalding that compile down to Spark or anything else. And then you have backwards compatibility with all your old big-data code.


One of the Cloudera devs told me 80% of all Hadoop users run Hive. This suggests most devs secretly want to keep using SQL, but want more scalable relational solutions -- hence Cloudera backing Impala, and Facebook being about to open-source Presto.

I worked on an entire team of developers where I was the only one who understood the raw Java MapReduce API. Almost everyone else on my team got by with HiveQL and a very minimal understanding of the MapReduce design flow.

Your belief is absolutely correct.


I think they focus too much on algorithms and not enough on the technical implementation, which is one of the major reasons Hadoop is so slow.





