Reading Twitter's internal structure.
Microblogging services like Twitter receive a huge number of posts in a short period of time. For the timeline in particular, simply moving data in and out of the RDB will not scale.
I found some articles and videos on the Internet that explain Twitter's architecture. Here I summarize how Twitter caches timelines (including some speculation).
Twitter table structure
A simple Twitter table definition in an RDB would look like this.
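The original definition is not reproduced here, so the following is a minimal sketch of what it might look like. The table and column names (users, tweets, follows) are my assumptions, shown with SQLite purely for illustration:

```python
import sqlite3

# Assumed minimal schema. Real Twitter storage is sharded and looks
# nothing like this internally; this is only to fix the vocabulary
# (users, tweets, follows) used in the rest of the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE tweets (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    body       TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)
);
""")
```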
Timelines are not just a SQL aggregation
Twitter allows each user to post tweets and follow other users.
Users see the tweets of the people they follow on their home timeline, and they can view a user's past tweets by visiting that user's profile page.
In a simple social networking service, a timeline can be built by retrieving the IDs of the users you follow via SQL, and then retrieving the tweets those users posted within some recent time window.
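Such a PULL-based home timeline boils down to one join at read time. A minimal sketch with sqlite3 (the schema and data here are illustrative assumptions, not Twitter's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweets  (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT, created_at TEXT);
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
""")
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (10, 2, "hello",        "2023-01-01"),
    (11, 3, "world",        "2023-01-02"),
    (12, 4, "not followed", "2023-01-03"),
])

# PULL model: aggregate at read time -- join the followee list against
# recent tweets. Simple, but the cost grows with the follow count.
home = conn.execute("""
    SELECT t.id, t.user_id, t.body
    FROM tweets t
    JOIN follows f ON f.followee_id = t.user_id
    WHERE f.follower_id = ?
    ORDER BY t.created_at DESC
    LIMIT 20
""", (1,)).fetchall()
```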
If you follow thousands of users, that aggregation becomes heavy, so in practice some kind of cutoff is necessary. There is also a requirement to insert recommended tweets and advertising tweets into the home timeline. So timelines need their own table.
For the timeline's main data structure, Twitter uses a PUSH-based architecture that writes to each follower's timeline when a user tweets, rather than a PULL-based architecture that aggregates on request.
This is not the subject of this article, but they keep two lists: the people who follow a given user, and the people that user follows. In an RDB, a single follows table would normally be enough to answer both directions. Twitter is not a normal case.
Updating timelines in the application layer is costly
If you use Twitter, you may have hit the bottom of the timeline. Although it looks like infinite scrolling, timelines seem to have a length cap; the home timeline is limited to around 300 tweets.
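Maintaining such a cap is cheap if the timeline is a capped, newest-first list; in Redis terms this is an LPUSH followed by an LTRIM. A pure-Python sketch of the same idea (the class name and cap value are illustrative):

```python
from collections import deque

TIMELINE_CAP = 300  # approximate home-timeline cap mentioned above

class CappedTimeline:
    """Newest-first tweet_ids, like Redis LPUSH + LTRIM 0 cap-1."""

    def __init__(self, cap=TIMELINE_CAP):
        # deque(maxlen=...) silently drops the oldest entry for us
        self.entries = deque(maxlen=cap)

    def push(self, tweet_id):
        self.entries.appendleft(tweet_id)

    def page(self, offset=0, count=20):
        return list(self.entries)[offset:offset + count]

tl = CappedTimeline(cap=3)
for tid in [1, 2, 3, 4]:
    tl.push(tid)
# the oldest tweet (1) has fallen off the end
```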
In fact, few people want to reconstruct their home timeline as it looked a week ago. To dig up one user's past tweets, you would search by keyword or check that user's timeline directly.
When inserting recommendations, advertisements, and so on into the timeline, doing the insertion and replacement in a simple web application layer would mean a flow like this:
- retrieve the user's rows from the timelines table
- insert, replace, and delete tweet_ids in memory
- replace all of the user's rows in the timelines table in a transaction
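As a hedged sketch of that three-step flow, using sqlite3 as a stand-in for the RDB (the timelines table layout and function name are my assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timelines (user_id INTEGER, position INTEGER, tweet_id INTEGER)")
conn.executemany("INSERT INTO timelines VALUES (?, ?, ?)",
                 [(1, 0, 100), (1, 1, 101), (1, 2, 102)])

def inject_ad(conn, user_id, ad_tweet_id, position):
    # step 1: retrieve the user's timeline rows
    rows = [r[0] for r in conn.execute(
        "SELECT tweet_id FROM timelines WHERE user_id=? ORDER BY position",
        (user_id,))]
    # step 2: splice the ad tweet in, in application memory
    rows.insert(position, ad_tweet_id)
    # step 3: replace every row for that user in one transaction
    with conn:
        conn.execute("DELETE FROM timelines WHERE user_id=?", (user_id,))
        conn.executemany("INSERT INTO timelines VALUES (?, ?, ?)",
                         [(user_id, i, t) for i, t in enumerate(rows)])

inject_ad(conn, 1, 999, 1)
```

Note that every ad or recommendation insert rewrites the whole timeline, which is exactly the cost the next paragraph is about.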
This is a relatively expensive operation for a service like Twitter, with its large user base and strong real-time requirements. Rather than storing this data in an RDB table, they built special-purpose data structures on a customized Redis.
Haplo: Hybrid List and BTree on Redis
Based on public information, Twitter has been trying to migrate its in-memory cache store from Memcached+Redis to Pelikan.
Most of the cache is Memcached; the timeline is on Redis. They also created Pelikan, a new cache system, and originally planned to migrate both the Memcached and Redis workloads to it, but the person leading the cache area was laid off.
Pelikan is open source, but the GitHub repository has been moved to the pelikan-io organization, since its maintainer is no longer authorized to maintain it under Twitter. The code was partially migrated from C to Rust a while ago; the RustConf presentation is available on YouTube.
The timeline runs on Redis, but not stock Redis. Their customized Redis is called Haplo, and it adds Hybrid List and BTree as data structures. BTree is well known as the structure used for RDB indexes.
Stock Redis does not support these data structures; Haplo is a fork of an older Redis version, and the additions were never merged upstream. Perhaps the Redis team does not want to carry several custom data structures (especially BTree).
Compression efficiency and performance are the motivations behind the Hybrid List
The timeline primarily uses the Hybrid List.
Redis has two built-in list encodings:
- ziplist
- linked list
They call the combination of ziplist and linked list the Hybrid List.
The Twitter timeline is basically a list of tweet_ids, which is a bad fit for a linked list. When the content is large compared to the pointers, the pointer overhead can be ignored; but a tweet_id is about the same size as a pointer, so there is room for optimization.
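A back-of-envelope calculation makes that overhead concrete, assuming 64-bit pointers and 64-bit tweet_ids in a doubly linked list:

```python
# Assumed sizes: 64-bit pointers and 64-bit tweet_ids. Real Redis nodes
# also carry allocator overhead, so this understates the waste.
POINTER = 8    # bytes, each of prev/next in a doubly linked list node
TWEET_ID = 8   # bytes, the actual payload

node_size = TWEET_ID + 2 * POINTER            # 24 bytes per timeline entry
overhead_ratio = (node_size - TWEET_ID) / node_size  # fraction that is pointers

ziplist_entry = TWEET_ID  # ziplist packs ids contiguously, roughly no pointers
```

Roughly two thirds of every linked-list node is pointers, which is why packing the ids into a ziplist pays off.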
Conversely, with a plain ziplist, a celebrity with a large number of followers causes heavy writes: since a ziplist is one contiguous allocation, each update can mean reallocating and copying it, and many timelines have to be updated whenever they tweet. This caused problems such as running out of heap space.
Other performance and cost considerations
Redis is an in-memory datastore, so keeping timelines warm for every registered user would be expensive. In fact, they change the caching strategy based on recent login activity.
When a new tweet is created, the timeline writes go through a queue, which cushions the crushing wave of timeline updates caused by celebrity tweets.
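A minimal sketch of that queued fan-out, assuming a single in-process queue (production would use a distributed queue drained by many workers; all names here are illustrative):

```python
from collections import defaultdict, deque

timelines = defaultdict(list)   # follower_id -> newest-first tweet_ids
followers = {42: [1, 2, 3]}     # assumed: user 42 has three followers
fanout_queue = deque()          # stand-in for a distributed delivery queue

def on_tweet(author_id, tweet_id):
    # Enqueue one delivery job per follower instead of writing inline,
    # so the tweet API call returns immediately.
    for follower_id in followers.get(author_id, []):
        fanout_queue.append((follower_id, tweet_id))

def drain(batch=2):
    # Workers advance the queue at their own pace, smoothing the spike
    # a celebrity tweet would otherwise cause.
    for _ in range(min(batch, len(fanout_queue))):
        follower_id, tweet_id = fanout_queue.popleft()
        timelines[follower_id].insert(0, tweet_id)

on_tweet(42, 1001)
drain(batch=2)  # two deliveries done, one still queued
```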
This is largely speculation, but one video mentioned a Celebrity Cluster. If I were building an early Twitter, I might flag accounts above a certain follower count and change how fast their queue advances, or apply some other special handling. Twitter does in fact limit how many accounts you can follow in a day, which prevents a bot from suddenly becoming a celebrity.
It is a bit old, but there is also an article about creating a Twitter clone in the Redis documentation. It might be an interesting read.
I did my best to read the articles and watch videos, but please let me know if you find any mistakes.
- Scaling Redis at Twitter, 2014
- "Cache à la carte: a framework for in-memory caching" by Yao Yue, 2015
- Using Redis at Scale at Twitter, by Rashmi Ramesh, RedisConf17, 2017
- RustConf 2021 - Whoops! I Rewrote It in Rust by Brian Martin
- Timelines at Scale
- Real-Time Delivery Architecture at Twitter