Design Twitter Search

Tannishk Sharma
4 min read · Mar 10, 2020

Disclaimer: In writing this article, we have used Grokking the System Design Interview as a reference for the Twitter search design. This is a very basic design and could be improved further.

Requirements

  • Let us assume that Twitter has 1.5 billion total users, with 800 million daily active users.
  • On average, Twitter gets 500 million tweets every day.
  • The average size of a tweet is 300 bytes.
  • Let us assume there will be 500M searches every day.
  • The search query will consist of multiple words combined with AND/OR operators.

Capacity Estimation and Constraints

We get 500 million tweets per day, and each tweet is 300 bytes, so the total space required per day is

500 million * 300 bytes = 150 GB/day
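As a quick sanity check, here is the same back-of-envelope arithmetic in a few lines of Python (the constants are just the assumptions listed above):

```python
# Back-of-envelope check of the daily storage estimate.
TWEETS_PER_DAY = 500 * 10**6
AVG_TWEET_SIZE_BYTES = 300

daily_gb = TWEETS_PER_DAY * AVG_TWEET_SIZE_BYTES / 10**9
print(f"{daily_gb:.0f} GB/day")  # -> 150 GB/day
```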

System APIs

We can expose our search functionality through a SOAP or REST API:

search(api_dev_key,search_terms,max_results_to_return,sort,page_number)

This API will return a list of tweets matching the search_terms.

Parameters

api_dev_key: API developer key

search_terms: The string to search for; it may contain multiple words.

max_results_to_return: The number of tweets to return.

sort: Latest first, most liked, or best matched.

page_number: The page of results to return.
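To make the contract concrete, here is a minimal in-process sketch of the endpoint. The dictionaries standing in for the index and storage servers, and the newest-first ordering, are assumptions for illustration; only the signature mirrors the API above:

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the index server and storage server.
INDEX: Dict[str, set] = {}     # word -> set of tweet ids containing it
TWEETS: Dict[int, dict] = {}   # tweet id -> stored tweet record

def search(api_dev_key: str, search_terms: str,
           max_results_to_return: int = 20,
           sort: str = "latest", page_number: int = 1) -> List[dict]:
    """Return one page of tweets that contain every word in search_terms."""
    # api_dev_key would be validated against a developer-quota service (omitted),
    # and sort options beyond newest-first are omitted for brevity.
    words = search_terms.lower().split()
    if not words:
        return []
    # AND semantics: intersect the posting list of every search word.
    ids = set.intersection(*(INDEX.get(w, set()) for w in words))
    ordered = sorted(ids, reverse=True)  # ids assigned in time order => newest first
    start = (page_number - 1) * max_results_to_return
    return [TWEETS[i] for i in ordered[start:start + max_results_to_return]]
```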

High-Level Design

We will store all the statuses (tweets) on a storage server and also build an index server that keeps track of which word appears in which tweet. This index will help us quickly find the tweets users are searching for.

High-level design for Twitter search

DB Design

We can use a NoSQL database here, as we need to scale the system and store tweets in a distributed environment.

For the demonstration in this example, however, we will use MySQL and shard the DB based on tweet id: we feed Tweets.id to a hash function, which tells us which storage server holds that tweet (see the sketch below).
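A minimal sketch of that routing step; the server count is illustrative:

```python
import hashlib

NUM_STORAGE_SERVERS = 4  # illustrative; the real count comes from capacity planning

def storage_server_for(tweet_id: int) -> int:
    """Map a tweet id to a storage shard using a stable hash, so every
    application server computes the same mapping across restarts."""
    digest = hashlib.md5(str(tweet_id).encode()).digest()
    return int.from_bytes(digest, "big") % NUM_STORAGE_SERVERS

print(storage_server_for(1_234_567_890))  # -> a shard index in [0, 4)
```

Note that plain modulo hashing remaps most tweets whenever a server is added or removed; a production system would more likely use consistent hashing.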

The table structure will look as follows:

Table structure for storage of tweets

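The original table diagram is not reproduced here, so the following field layout is an assumption of what such a table plausibly contains:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Tweet:
    """One row of the sharded tweet table (all field names are assumptions)."""
    tweet_id: int            # primary key; also the sharding key
    user_id: int             # author of the tweet
    content: str             # tweet text, ~300 bytes on average
    creation_date: datetime
    num_likes: int = 0       # feeds the ranking score discussed later
```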

Now let us build the most complicated part: the index server. Since our tweets consist of words, we will build our index across words. In the most layman terms, the index is nothing but a distributed hash table where the key is a word and the value is the set of tweet ids that contain that word. There are approximately 250K words in English and approximately 250K nouns (people, places, and so on), so there will be about 500K keys in total.

Next, how many tweet ids must the index hold? At 500M tweets per day we collect about 182.5B tweets per year; rounding up generously, assume 300B tweets per year, i.e., 600B tweets for two years of index. If we assume each tweet id takes 5 bytes, then to store all the tweet ids we will need:

600B * 5 bytes = 3000 GB
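Reduced to a single-machine sketch (stop-word filtering, discussed next, is omitted for brevity), the index looks like this:

```python
from collections import defaultdict
from typing import Dict, Set

def build_index(tweets: Dict[int, str]) -> Dict[str, Set[int]]:
    """Inverted index: each word maps to the set of tweet ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for tweet_id, text in tweets.items():
        for word in text.lower().split():
            index[word].add(tweet_id)
    return index

index = build_index({1: "designing twitter search", 2: "how twitter search scales"})
print(index["search"])  # -> {1, 2}
```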

If we assume there are 40 words in a tweet, and we remove stop words such as "a", "an", and "the", roughly 20 words per tweet carry information, and only those words need to be indexed. This means each tweet id will be stored about 20 times in our index, so the space required will be:

(20 * 3000) GB => 60,000 GB => 60 TB

If we assume a high-end server can hold 150 GB of index data, we would need 400 such servers to handle this load. So we need to shard the index.
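The same arithmetic, spelled out (every constant is an assumption from the paragraphs above):

```python
# Index-size and server-count estimate.
TWEETS_TWO_YEARS = 600 * 10**9   # tweet ids kept in a two-year index
TWEET_ID_BYTES = 5
WORDS_INDEXED_PER_TWEET = 20
SERVER_CAPACITY_GB = 150         # assumed capacity of one high-end server

ids_gb = TWEETS_TWO_YEARS * TWEET_ID_BYTES / 10**9   # 3,000 GB of raw ids
index_gb = ids_gb * WORDS_INDEXED_PER_TWEET          # 60,000 GB = 60 TB
print(index_gb / SERVER_CAPACITY_GB)                 # -> 400.0 servers
```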

We can use either of two techniques to shard our index:

  1. Shard based on words: We first calculate a hash of the word and then pick the index server based on that hash.

Problems with this approach:

  1. There might be too much load on a single server when a search term becomes hot.
  2. The load will not be uniformly distributed, since some words appear in far more tweets than others.

2. Shard based on tweet id: We calculate a hash of the tweet id and pick the server based on this hash; that server then indexes all the words of that tweet. As a result, every server maintains an index for all the words. While fetching results, the query is sent to every server, and the partial results are compiled on a central aggregator server and returned to the user (see the sketch below).
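A toy version of that scatter-gather flow, with plain dictionaries standing in for the per-server indexes:

```python
import heapq
from typing import Dict, List, Set

def scatter_gather(shards: List[Dict[str, Set[int]]], word: str, k: int) -> List[int]:
    """Tweet-id sharding: every shard indexes all words, so the query is
    broadcast to every shard and the partial results are merged centrally."""
    partials = [shard.get(word, set()) for shard in shards]  # fan out to all shards
    merged = set().union(*partials)                          # gather on the aggregator
    return heapq.nlargest(k, merged)                         # newest first if ids are time-ordered

shards = [{"search": {1, 7}}, {"search": {4}}, {}]
print(scatter_gather(shards, "search", 2))  # -> [7, 4]
```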

Caching Strategy

We can use Redis to store hot tweets in a cache instead of serving them from the DB. We will use LRU (Least Recently Used) as our eviction strategy for removing data from the cache.
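To make the eviction policy concrete, here is a toy in-process LRU cache; in practice this role is played by Redis configured with an LRU max-memory policy, not by application code:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny in-process LRU keyed by tweet id (illustrative only)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # cache miss -> caller hits the DB
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used entry
```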

Ranking Tweets

Generally, a ranking score (RS) is precomputed for every tweet and stored along with the tweet id. The RS may be computed from the number of likes, the number of comments, and so on. When fetching results, each server sorts its tweets by RS, and the central aggregation server then merges these sorted lists, sorts the combined result by RS, and serves it to the user.
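Since each server already returns a sorted list, the aggregator only needs a sorted merge; a sketch with two servers and made-up scores:

```python
import heapq

# (rs, tweet_id) pairs, each list already sorted by RS, highest first.
server_a = [(0.97, 101), (0.80, 104)]
server_b = [(0.91, 202), (0.75, 205)]

# Merge the pre-sorted streams without re-sorting everything.
merged = heapq.merge(server_a, server_b, key=lambda t: t[0], reverse=True)
print(list(merged)[:3])  # -> [(0.97, 101), (0.91, 202), (0.80, 104)]
```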
