How we run Spark and Sqoop in production (thumbtack.com)
Any good alternatives to Sqoop? I feel that an ETL tool just for HDFS is too limiting and leads to further fragmentation of the data pipeline.
I feel like I've recommended it enough times that I'm turning into a shill, but Pentaho is an open-source and commercially supported ETL tool that will natively do what you want, or call Sqoop when you discover that's kinda slow. :-) And no, I definitely don't work there.
Have a look at Kafka Connect (http://docs.confluent.io/2.0.0/connect). The JDBC connector will poll for database changes and push them to a Kafka topic, meaning you see all the changes in the database rather than, say, a once-a-day snapshot.
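A source connector along those lines might be configured like this (connector class and property names are from Confluent's JDBC connector docs; the database URL, column names, and topic prefix are placeholders):

```json
{
  "name": "jdbc-source-users",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://db-host:5432/mydb",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-",
    "poll.interval.ms": "5000"
  }
}
```

The `timestamp+incrementing` mode is what catches both inserts and updates, which is what gives you the change stream rather than a snapshot.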
Using Sqoop from something like Luigi as the ETL manager is a pretty great workflow - https://github.com/spotify/luigi
You can define dependencies between jobs based on output files, which allows you to re-run only part of your pipeline.
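The pattern Luigi implements (in real Luigi you'd subclass `luigi.Task` and declare `output()`/`requires()`) can be sketched in plain Python — a task is skipped when its output file already exists, so re-running the pipeline only redoes missing pieces. Function names here are made up for illustration:

```python
import os

def run_if_missing(output_path, task_fn):
    """Run the task only if its output doesn't exist yet (Luigi-style idempotence)."""
    if os.path.exists(output_path):
        return False  # output present: nothing to re-run
    task_fn(output_path)
    return True

def extract(path):
    """Stand-in for a real step, e.g. a Sqoop import writing to HDFS."""
    with open(path, "w") as f:
        f.write("rows from sqoop import\n")
```

Calling `run_if_missing(p, extract)` twice runs `extract` only the first time; the second call sees the output and skips it.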
That's a great idea, but could you elaborate on scheduling jobs with Luigi? It doesn't have a scheduler like Airflow, so how do you schedule Luigi tasks?
Check out this Foursquare talk that goes through how we used to do scheduling -- basically you make jobs dependent on a date - http://www.slideshare.net/OpenAnayticsMeetup/luigi-presentat...
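The date-dependency trick can be sketched like this: each day's run writes to a date-stamped output, so a backfill over a range only fills in the missing days. Paths and the job body are hypothetical:

```python
import os
from datetime import date, timedelta

def daily_output(base_dir, day):
    """Date-stamped output path for one day's run."""
    return os.path.join(base_dir, f"events_{day.isoformat()}.csv")

def backfill(base_dir, start, end, run_day):
    """Run `run_day` for every date in [start, end] whose output is missing."""
    ran = []
    day = start
    while day <= end:
        path = daily_output(base_dir, day)
        if not os.path.exists(path):
            run_day(path, day)
            ran.append(day)
        day += timedelta(days=1)
    return ran
```

Because each day is its own target, "scheduling" reduces to invoking the pipeline for today's date from any external trigger (cron, CI, etc.), and reruns are naturally idempotent.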
You have to use an external scheduler. We built one on top of APScheduler: https://apscheduler.readthedocs.io/en/latest/
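APScheduler gives you cron-style and interval triggers; the idea of an external driver periodically kicking off the pipeline can be sketched with the stdlib `sched` module as a minimal stand-in (interval, run count, and the job body are made up):

```python
import sched
import time

def schedule_repeating(scheduler, interval_s, job, runs):
    """Run `job` every `interval_s` seconds, `runs` times total."""
    def wrapper(remaining):
        job()
        if remaining > 1:
            scheduler.enter(interval_s, 1, wrapper, (remaining - 1,))
    scheduler.enter(interval_s, 1, wrapper, (runs,))

calls = []
s = sched.scheduler(time.time, time.sleep)
schedule_repeating(s, 0.01, lambda: calls.append("pipeline run"), 3)
s.run()  # blocks until all scheduled runs have fired
```

A real deployment would use APScheduler's `BlockingScheduler` with a cron trigger instead, but the shape is the same: the scheduler owns the clock, and the job just invokes the Luigi (or Sqoop) entry point.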
+1 to this, we kick off our Sqoop jobs using Airflow - http://airbnb.io/projects/airflow/
Airflow is very similar to Luigi; we've been using it in production to schedule all of our workflows for about four months now, and it's worked out really well for us.
We've been trying out GoldenGate to get streaming replication, but it has proven rather unreliable. Stops replicating if you sneeze in its general vicinity. I wonder whether alternatives like SharePlex and Tungsten are more reliable.
You can try out Striim for streaming data integration (full disclosure: I work there).
Ever since Apache Spark's Data Sources API was released, I have been relying on different Spark Data Source packages for my ETL jobs.
Their earlier post about rebuilding their data infrastructure is more interesting imho: https://news.ycombinator.com/item?id=11474284
hey all, feel free to reach out to me either on this thread, or directly at nate[at]thumbtack.com if I can answer any questions!
I had to connect to a US VPN to access the jobs page. Is that intentional?
thanks for flagging, I'll look into it!
Thanks for sharing all the scripts. Great help.
Speaking of Spark, has anyone used it with Go? Is such a thing even possible?
Thumbtack is great for lazy consumers. But word on the street is they contribute to too much price pressure on the market, therefore drive the overall quality of services down. Good work Thumbtack! You have figured out a convenient way to sacrifice long term value in favor of short term profit.