The following is a guest post by Doug Daniels, CTO of Mortar Data Inc.
Today, we're excited to announce integration between MongoLab and Mortar, the Hadoop platform for high-scale data science. If you have one of the 100,000+ databases at MongoLab, you can now seamlessly use Hadoop to:
- Run advanced algorithms (like recommendation engines)
- Build reports that run quickly in parallel against large collections
- Join multiple collections (and outside data) together for analysis
- Store results to Google Drive, back to MongoLab, or many other destinations
In this article we'll show you how to connect your MongoLab database to Hadoop, and then use Hadoop to do something simple but very useful: gather schema information from an entire collection, including histograms of common values, data types, and more. Mortar handles all deployment, monitoring and cluster management, so no prior knowledge of Hadoop is required.
We first need to connect your MongoLab database to Mortar and Hadoop. If you haven't already, head over to the MongoLab sign-up page to create an account. After completing the form, you can immediately begin to provision new databases. Make sure that you choose the AWS us-east-1 datacenter for your MongoDB.
If you're unsure which plan is right for you, visit the MongoLab plans page or email the MongoLab team at email@example.com.
Next, log in to your MongoLab console. For this tutorial, we'll be using a replica set cluster and will connect to a secondary node. It's recommended to use a secondary node for analytics so that you don't compete with regular traffic on the primary node (which can degrade performance). For a deeper dive and alternate connection strategies, see the full Mongo --> Hadoop tutorial.
In your MongoLab console, open up the MongoDB cluster and database you'd like to process with Hadoop.
Click on the Users tab for that database. Add a new user that you can use to connect to the database. We'll call ours "mortar_user". If you want to save results back to the database, make sure the user has write privileges.
Next, sign up for a free account at Mortar. If you don't mind your project being public, stick with the free Public plan. If you need your project to be private, grab a free 7-day trial on the Solo plan.
Install Mortar and Connect to MongoLab
Now that your account is set up, use Mortar's installer to set up your workstation with everything you need to run and deploy Hadoop and Pig jobs.
Next, use Mortar to fork an example project for working with Mongo data:
mortar projects:fork git@github.com:mortardata/mortar-mongo-examples.git <your_project_name_goes_here>
Now, grab the standard Mongo URI connection details for your database from MongoLab. If you have a Replica Set, use the credentials for the secondary node to keep traffic off the primary.
You can get the Mongo URI by clicking on your MongoLab cluster and then choosing the Servers tab. If you have a secondary node, choose that one from the list. Then, select the database underneath it that you'd like to analyze.
At the top of the page, you'll see a box that says "To connect using a driver via the standard URI". Grab your database's Mongo URI from there, and fill in the missing <dbuser> and <dbpassword> with the user credentials you created above.
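The standard Mongo URI has the form `mongodb://<dbuser>:<dbpassword>@host:port/database`. Before handing your filled-in URI to Mortar, you can sanity-check it with a few lines of Python (the host, port, and credentials below are hypothetical placeholders; substitute the values from your MongoLab console):

```python
from urllib.parse import urlparse

# Hypothetical URI -- replace host, port, and credentials with the values
# from your MongoLab console and the user you created above.
uri = "mongodb://mortar_user:s3cret@ds012345-a1.mongolab.com:56789/mydb"

parsed = urlparse(uri)
assert parsed.scheme == "mongodb", "URI must start with mongodb://"
assert parsed.username and parsed.password, "fill in <dbuser> and <dbpassword>"
assert parsed.path.lstrip("/"), "URI must end with the database name"

print("user:", parsed.username)
print("host:", parsed.hostname, "port:", parsed.port)
print("database:", parsed.path.lstrip("/"))
```

If any assertion fires, the URI is missing a piece and Mortar's Mongo connector will likely reject it.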
With your filled-out URI in hand, set the configuration for your Mortar project to point to your MongoLab server by running:
mortar config:set MONGO_URI='put your Mongo URI here'
This will store your encrypted configuration at Mortar for running jobs against MongoLab.
Run a Small Hadoop Job Locally
Now we're ready to run our first Hadoop job against Mongo. As an example, we'll run an Apache Pig script that connects to a collection in your database and emits statistics about every field in the collection. We'll run this script on your local computer first, so choose a small collection! Otherwise, you'll spend a lot of time trying to stream data from your MongoLab database to your local computer. We'll try larger collections when we run in the cloud next.
In your project directory, open the params/characterize-local.params file. Change INPUT_COLLECTION to a small collection you'd like to see stats on, and OUTPUT_COLLECTION to where you'd like the stats delivered. Then run:
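Pig parameter files are plain `KEY=value` lines, so the edit is a one-liner per parameter. A minimal sketch of what characterize-local.params might look like after your edit (the collection names here are hypothetical, and the file in the example project may carry additional parameters):

```
INPUT_COLLECTION=my_small_collection
OUTPUT_COLLECTION=my_small_collection_stats
```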
mortar local:run pigscripts/characterize_collection.pig -f params/characterize-local.params
This will first download all of the dependencies you need to run a Pig job into a local sandbox for your project. Once complete, it will do a local run of the characterize_collection Pig script against your input collection.
When finished, you'll have a new Mongo document in your output collection with detailed information about each field in the input collection, including the number of unique values in the field, example values and predicted data types.
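To make concrete what the characterize script computes, here is a minimal pure-Python sketch of the same idea: walk every document, and for each field track how many distinct values appear, a few example values, and the data types observed. (The documents and field names below are invented for illustration; the real script runs as Pig on Hadoop and handles nested fields, sampling, and far larger collections.)

```python
from collections import Counter

def characterize(docs):
    """Per-field stats: distinct-value count, example values, observed types."""
    stats = {}
    for doc in docs:
        for field, value in doc.items():
            s = stats.setdefault(field, {"values": Counter(), "types": Counter()})
            s["values"][repr(value)] += 1          # count distinct values
            s["types"][type(value).__name__] += 1  # count observed types
    return {
        field: {
            "unique_values": len(s["values"]),
            "examples": [v for v, _ in s["values"].most_common(3)],
            "types": dict(s["types"]),
        }
        for field, s in stats.items()
    }

docs = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": "unknown"},  # mixed types show up in "types"
    {"name": "Ada", "age": 36},
]
result = characterize(docs)
print(result["age"]["types"])  # counts of int vs. str values for "age"
```

A mixed-type field like `age` above is exactly the kind of schema surprise this report surfaces before it bites you in production queries.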
Run a Full Hadoop Job in the Cloud
Running locally is fine for smaller datasets, but to process larger data, we'll want to use the full power of a Hadoop cluster. With Mortar, a single command deploys a snapshot of your code to a private GitHub repository, launches a private AWS Elastic MapReduce Hadoop cluster, and runs your code at scale.
Let's try it out. Open up the params/characterize-cloud.params file. Set the INPUT_COLLECTION parameter to a larger collection that you'd like to analyze. Set the OUTPUT_COLLECTION to either the same output you used before or a new collection. Then run:
mortar run pigscripts/characterize_collection.pig -f params/characterize-cloud.params --clustersize 3
This will validate your script, launch a new private, 3-node Hadoop cluster on AWS Spot Instances, and analyze your collection. Cluster startup will take about 10-15 minutes, and the job should cost less than $0.40 for the whole hour on 3 machines--Mortar passes AWS cluster costs directly back with no up-charge.
When you start your job, Mortar will print out a job URL. Open it up, and you'll see real-time progress tracking, logs, and visualizations for your job.
When your job finishes, your results will be ready to view in the output collection you chose.
The example we ran is a fairly simple one. You'll want to go deeper on your own data--bringing in multiple collections, joining and aggregating, and using your own code. Our Mongo --> Hadoop tutorial will step you through the process, showing you how to work with your MongoLab data in Hadoop and Pig.
Mortar also has a growing number of open-source data apps pre-built on top of the platform, such as recommendation engines and Google Drive / Data Hero dashboards. We're quickly adding more, but if your use case isn't yet available, we have tutorials to help build your own data app.
If you have any questions about getting your data connected, contact us @mortardata or post a question to our Q&A Forum.