Harnessing the power of Twitter and MongoDB

Hey Mongoers!

I recently had the pleasure of joining the MongoLab team. I share this with you for two reasons. First, you can too (we're hiring!). Second, I remember when I first heard about MongoDB: I created an account on MongoLab and thought... now what?


With open-source technologies proliferating as "Big Data" and analytics explode, we thought it would be helpful to give our users and friends a script that takes care of the nitty-gritty so they can explore what makes MongoDB great. We're excited to present Twitter-Harvest, a Python script that uses the Twitter REST API v1.1 to retrieve tweets from a user's timeline and insert them into a MongoDB database.

Quick Demo

Update 5/8/14 11:40 AM: The twitter credentials previously provided in the gist below no longer work (we've been rate limited!). Please go to Twitter's Dev Center to create your own set of credentials.

The details on installation and running the app are located on this GitHub repo. For the impatient, I empathize... we've provided some Twitter credentials and an out-of-the-box command that you can run to see that everything works. After you have downloaded/unzipped the repo, run:

Straight out of the box, the script prints every tweet it harvests to your console. Peruse the help docs and pass arguments accordingly; most notably, you'll want to tack on a MongoDB URI using the --db flag so that the tweets are stored in your database. Also keep in mind that if you'd like to use this script more than once, you should obtain your own Twitter credentials for security and rate-limiting reasons.

Diving in

Once you have the necessary modules set up, you'll notice that the run script has quite a few options. Note that Twitter OAuth credentials are required. To help you store the harvested tweets, you can create a free Sandbox database with us! We have included the following options that we thought would be popular with users:

  • harvesting native retweets (-r)
  • printing each tweet the program iterates over (-v)
  • supplying a MongoDB URI so that tweets are inserted into a MongoDB database (--db)
  • setting the number of tweets to be harvested (--numtweets)
  • choosing the user timeline to harvest from; the default is mongolab (--user)

So, let's say I want to harvest and print 100 of @mongolab's tweets (and retweets). The command and arguments would be:

Just like that, we have 100 tweets in a collection called "mongolab".
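
If you're curious what a harvest like this involves, here is a minimal sketch of the paging logic. It is not the actual Twitter-Harvest source. It assumes a hypothetical `fetch_page(count, max_id)` callable that wraps the v1.1 user timeline endpoint and returns tweets newest first; the function name `harvest` and its parameters are mine, chosen for illustration. What it does rely on is standard timeline behavior: `max_id` is inclusive, so each request asks for IDs below the oldest tweet already seen, and native retweets carry a `retweeted_status` field.

```python
def harvest(fetch_page, num_tweets, include_retweets=False, page_size=200):
    """Collect up to num_tweets tweets by walking a timeline backwards.

    fetch_page is a hypothetical callable: fetch_page(count=..., max_id=...)
    must return a list of tweet dicts, newest first, or an empty list when
    the timeline is exhausted.
    """
    harvested = []
    max_id = None  # None means "start from the newest tweet"

    while len(harvested) < num_tweets:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break  # nothing older is available

        for tweet in page:
            # Native retweets embed the original tweet under retweeted_status.
            if not include_retweets and "retweeted_status" in tweet:
                continue
            harvested.append(tweet)
            if len(harvested) == num_tweets:
                break

        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = page[-1]["id"] - 1

    return harvested
```

With pymongo, the resulting list could then be stored with something like `collection.insert_many(tweets)`, where `collection` is whichever collection your MongoDB URI points you to.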

To help you along, we also have help documentation available:

Now, onto the fun stuff. Let's see what interesting data or projects you can come up with using this tool!

We challenge you!

In case you're stumped, here are a few challenges we've thought up that highlight both Twitter's vast array of information and MongoDB's features.

1. Compile a list of "successful" tweets (retweeted and/or favorited) and return only a few of the fields. Hint: Aggregation Framework

2. Harvest from a variety of users (friends, family, athletes) and see who has tweeted near you and with what frequency. Hint: Geospatial Indexes

3. Experiment with text indexes (after all, tweets are text) and examine your queries. Can you make them faster? Hint: Text Search + Cursor Explain

4. Use this as an example to set up a public stream, which is great for data mining! Hint: Twitter Public Streams
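
If you'd like a nudge on the first three challenges, here is one possible starting point using pymongo. Treat it as a sketch under a few assumptions: the harvested documents keep the standard v1.1 tweet fields (`text`, `retweet_count`, `favorite_count`, and a GeoJSON `coordinates` point), the collection is named "mongolab", and your MongoDB URI includes a database name. The helper names and thresholds are made up for illustration. The first three functions only build query documents, so you can inspect them without a database.

```python
def successful_tweets_pipeline(min_count=10, limit=20):
    """Challenge 1: aggregation pipeline for well-received tweets."""
    return [
        {"$match": {"$or": [
            {"retweet_count": {"$gte": min_count}},
            {"favorite_count": {"$gte": min_count}},
        ]}},
        # Return only a handful of fields.
        {"$project": {"_id": 0, "text": 1,
                      "retweet_count": 1, "favorite_count": 1}},
        {"$sort": {"retweet_count": -1}},
        {"$limit": limit},
    ]


def near_query(lon, lat, max_meters=5000):
    """Challenge 2: tweets geotagged within max_meters of a point.

    GeoJSON order is longitude first, then latitude.
    """
    return {"coordinates": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
        "$maxDistance": max_meters,
    }}}


def text_query(terms):
    """Challenge 3: full-text search over indexed fields."""
    return {"$text": {"$search": terms}}


def run_examples(uri, collection_name="mongolab"):
    """Run all three against a live database (needs pymongo installed)."""
    import pymongo

    client = pymongo.MongoClient(uri)
    tweets = client.get_default_database()[collection_name]

    # $near and $text both require an index to exist first.
    tweets.create_index([("coordinates", pymongo.GEOSPHERE)])
    tweets.create_index([("text", pymongo.TEXT)])

    for doc in tweets.aggregate(successful_tweets_pipeline()):
        print(doc)
    for doc in tweets.find(near_query(-122.42, 37.77)):
        print(doc["text"])

    # explain() shows the query plan, including which index was used.
    print(tweets.find(text_query("mongodb")).explain())
```

Calling `run_examples` with your own MongoDB URI runs all three queries; the coordinates passed to `near_query` are just a placeholder, so swap in your own location.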

Happy coding, and be sure to keep us posted on your projects. We're always here to help!



*special thanks to our Swedish friend Gustav Arngården @arngarden over at @aitellu for the harvesting idea!
