learning System design as a landscape architect 12
Rethink Location Based Service system design in a much fun way, as a former urban planner/landscape planner. Take Search System as example
- Design Search System
- Scenario analysis
- Service
- Storage system in Twitter
- searching database involvement
- Schema design
- tweet search vs e-commerce search
- Scale
System Design = Logic Design(why) + Infrastructure Design(how)
Design Search System
Scenario analysis
Take twitter Search System as example.
Functional Requirement
500 million tweets are posted every day.
Service
- Index Service
- Search Service
- Ranking Service
Storage system in Twitter
- use
Redis Cluster
for caching - use
Blobstore
for large file such as image, video - Hadoop cluster for backend data, analyzing Server Logs
- MySQL for persistent data
database system
- Sharding
- Cold and hot data separation
- read-write separation
searching database involvement
- MySQL database
In the MySQL database, twitter is indexed based on time, which is divided into multiple Timeslice table. Only support 3 days data search.
However, due to the huge number of tweets, each search only supports three days of all data searches.
The search methods supported by MySQL search are very limited and difficult to scale
Searching for a single keyword, does not support searching for multiple keyword combinations, filter, fuzzy query, range search
When the amount of data is too large, the query speed is very slow.
- earlybird –> Index server
MySQL is only used in Synchronizing , disseminating data, and saving serialized documents
Index file storage
where is index file store? MySQL , Redis, or HDFS?
HDFS: Build an index on the query content in advance, put it in the index library, and then query from the index library when querying.
search engine
divide the file into multiple Blocks
Inverted Index
an inverted index is a database index storing a mapping from content, such as words or numbers, cons: Large storage overhead and high maintenance costs on update, delete and insert.
Tokenizer
is used in inverted index
Ranking search result
- Interactions (Likes, comments, retweets, saved, read)
- connection (Author Credibility, followed or not)
- popularity (realtime + connection + Interactions)
Schema design
Tweet Table
id varchar
user_id id varchar
content text
create_time timestamp
like long
forwarding_times long
comment long
User Table
user_id id varchar
user_name
email
is_superstart boolean
how to optimize query performance
real_time index
Store the latest tweets in the last 1 week to build an index library in memory. Querying an index from memory is much faster than from disk
create real_time index
- New tweets posted by users will be sent to the word segmentation server. Here the text of the tweets is tokenized
- According to hash segmentation, tweets are distributed to each Earlybird (index server), and each Earlybird (index server) is responsible for a part data, indexing tweets in real time
- At the same time, there is another update service, which pushes the dynamic change information of tweets (for example: the number of likes, the number of retweets), and dynamically updated index
tweet
–> MQ
MQ
–> Tokenizer –hash–> EarlyBird(index servers)
MQ
–> tweet updates–> Updatas server
query process
- query request User search request reach Blender (Search front-end server), Blender parses the request
- execute calculation Earlybird Service executive calculation,and sorted tweet lists, return to Blender
- return result Blender merges each lists returned by Earlybird table, and perform some heavy sorting, like Reranking, then return to the user
How to optimize Query Performance to the next level
Twitter saves not only the last 1 week tweets and also the 2% most popular and most likely to be queryed Search tweets in memory and save 16% of tweets on SSD hard drive.
Batch index creation process
- Aggregation: Join multiple data sources together based on Tweet ID
- Scoring: according to the feature extraction of Twitter (number of retweets, number of likes, number of comments, etc.) to score, the higher the score
- Partitioned storage: divide data into small blocks and store them in HDFS
tweet search vs e-commerce search
e-commerce search:
-
Supports all kinds of sorting, including popularity, sales, credit, price or other attributes such as grid, origin, etc.
-
Support range query –price range search
-
Support using Attribute to search and filter products
-
Real time accurate information on inventory and price
-
Having Accurate Product Data
Elasticsearch(ES): At present, most search engines used in e-commerce are based on distributed real-time engines
Scale
Twitter Trending
Twitter’s high-concurrency real-time Calculated Trending board, cache but not db is the one used to realize this function.
Redis Sorted Set has a weight score to sort from large to small.
Fault Tolerance - Replication Mechanism
-
primary shard
-
replica shard
-
When data is stored on the primary shard server, it will be synchronized to the backup shard server in real time.
-
When querying, all (primary and standby) are queried
-
Improve the fault tolerance of the system. When a node of a node is damaged or lost, it can recover from the replica, and the node with the primary shard hangs.
A replica shard will be promoted to primary shard
- Improve query efficiency: Replica shards also provide query capabilities and automatically load balance search requests.