Analytics with Redis

Saturday, 30 July 2011

This is my experience using Redis.io as a datastore to support an analytics system. I originally used MySQL for the job but for some reasons it did not correspond to my expectations.

The objective for this system is to give hourly social metrics like page fans, page views, posts shares, from Facebook, but also how much likes, shares and tweets the posts from several websites received. This would enable, later, the visualization how page posts, that link to the website content, influence the ecosystem.

MySQL Approach

I first modeled a simple schema: a table for pages, links and snapshots. A snapshot is an entry with the pageId, metric type, datetime and the value. Every 15 min a cron job would run and go through a list of pages, take their fan pages and other metrics, run trough a list of links, feed from several rss, and collect their shares and tweets. This is the social timeline of a link:

Even tough this analytics system was design to hold only a few thousand metrics a day, which MySQL, can handle pretty easily I would like to perform some data analysis from it. One idea was to calculate which was _trending _in the last X hours, X days. To do this I would simply compare the current value of shares/tweets with the value from X hours ago. Calculate the differences, keep it in an array, order it and return the top 10 links. The problem was to return the correct snapshots. I would have to do two queries: one for the current values (latest snapshot of each link) and other to get the values of the past X hours. This could get tricky because I was not sure when the snapshot would have been taken - the cron job can take less or more time to run and span through different minutes - and need to search for large time spaces.

Other problems involved summing up all tweets in a specific hour or day, retrieving top posts of the day, etc.

Using Redis

Redis is, simply explained, a key/value datastore. You tell him to save the value “bar” in the key “foo”.

redis> SET foo bar OK redis> GET foo "bar" redis> GET dne (nil)

Imagine PHP arrays, Python dictionaries, etc. Or if you are familiar with memcached its the same principle. For a good introduction to Redis check this post from Michael J. Russo. Redis offer advantages over other key/value datastores: lists, sets, sorted sets and hashes. This enable us to keep data organized in a very simple and fast way!

Modeling for Redis

When using Redis we need to focus on modeling it for the specific purpose it will be used. Its important to retain that we can only get data through the keys and never search through all the values. Its mandatory that we create several indexes. Let’s go.

To keep data about analytics and having only key/value available how do we do it? The key must contain the information we are going to look for in the future: ids, keys, etc. Here’s an example of a key: **metric:page_fans:12345678:2011-05-01. **In the future we will want to look for the daily page fans for the page with id **12345678 ** in a specific date. Easy!

redis> GET metric:page_fans:12345678:2011-05-01 41012

No complicated SQL queries or calculations. What if we want for a full month? We use the mget command, which enables us to demand several keys in one request:

redis> MGET metric:page_fans:12345678:2011-05-01 metric:page_fans:12345678:2011-05-02 41012 41210

For simplicity I just wrote two days but for a full month we would request all the days in one swoop. Simple, fast and accurate.

Managing the data

This looked pretty simple but what if we want to look for all available statistics? That’s where sets, or lists, come handy. The purpose of sets is to save several values. In the old memcached days we would have a list on our program, serialize it and then save the output in a key. If we wanted to change it? We would need to retrieve it, parse it locally, remove the last entry, insert the new one and finally save it again. Boring, easy errors can happen here, no atomic operations, etc. How to do it in Redis?

Before explaining how to do it just a quick note: in this sets we are not gonna keep the values, because then we would not know where they belong, but instead we will save the keys. This will act as “table”, a directory. Lets see the example:

redis> SET  metric:page_fans:12345678:2011-05-01    41012 OK redis> SADD  page:12345678:page_fans   metric:page_fans:12345678:2011-05-01 1    //this is the response of Redis for true, the operation succeed

If we do this for every metric we get then to see all the metrics for a specific page we just need to get the set entries:

redis> SMEMBERS   page:12345678:page_fans metric:page_fans:12345678:2011-04-28 metric:page_fans:12345678:2011-04-29 metric:page_fans:12345678:2011-04-30 metric:page_fans:12345678:2011-05-01 metric:page_fans:12345678:2011-05-02 metric:page_fans:12345678:2011-05-03 ..

Then we just give all of these values through an mget command and we know all the values related to a post. There’s a lot to explore in this: in my application, for example, I keep metrics for days and also for hourly stats. If I want to know a whole month I request the daily values. If I want to be more precise I just request all the hours of specific day(s). The use cases of this can give another blog post.

Extra useful use case: top posts daily, weekly, ..

Now you want to know the top posts by day. You can maintain a table in the database and every time a post is updated, or receives a page view, a like or a tweet you re-calculate everything and update this table. Boring. What about using Redis again? With Redis sorted sets you can maintain a ranked list easily. Pick a key, like “top-posts-twitter”, insert the value and that’s it. Then you just need to request the top 10 entries and you get yourself a super fast ranked list. If you want to have daily lists just add the current date to the key when you perform the update and you will have different sets. The syntax is “ZADD key score member”

redis> ZADD   top-posts-twitter:2011-05-20   230   post:11111111112 redis> ZADD   top-posts-twitter:2011-05-20   450   post:11111111113 redis> ZRANGE   top-posts-twitter:2011-05-20  0 10  WITHSCORES post:11111111113 450 post:11111111112 230 ...

When inserting just give the score (page views, likes or tweets), the key (postId) and you’re done. Then just request the top 10 members with ZRANGE and you’re done.

Conclusion

I want to state clearly that all of this could be done in MySQL but some operations would be more complex, slow or simply less direct. The fact that you can manipulate the indexes, work with simple structures and perform more direct queries simplifies the whole layout of the application. This is also a work in progress to test how we can use this simple-modeled-technologies to overcome the complexity of a full modeled MySQL database. In the future I will try to expand this topic.

Further Reading

Comments and Reactions

Also Doubts? Leave all of them in the comment box. We can chit-chat.