learning System design as a landscape architect 12

Fri, Apr 1, 2022 4-minute read

Rethink Location Based Service system design in a more fun way, as a former urban planner/landscape planner. This post takes a search system as the example.

Design Search System
Scenario analysis
- Functional Requirement
Service
Storage system in Twitter
- database system
searching database involvement
- Index file storage
  - search engine
  - Inverted Index
- Ranking search result
Schema design
- create real_time index
  - query process
- Batch index creation process
tweet search vs e-commerce search
Scale
- Fault Tolerance - Replication Mechanism

System Design = Logic Design(why) + Infrastructure Design(how)

Design Search System

Scenario analysis

Take Twitter's search system as an example.

Functional Requirement

500 million tweets are posted every day.

Service

Index Service
Search Service
Ranking Service

Storage system in Twitter

Redis Cluster for caching
Blobstore for large files such as images and videos
Hadoop cluster for backend data and server log analysis
MySQL for persistent data

database system

Sharding
Cold and hot data separation
read-write separation

searching database involvement

MySQL database

In the MySQL database, tweets are indexed by time and split across multiple time-slice tables. Because of the huge number of tweets, a search only covers the most recent three days of data.

The search methods that MySQL supports are very limited and difficult to scale:

It only supports searching for a single keyword. It does not support multi-keyword combinations, filters, fuzzy queries, or range searches.

When the amount of data is too large, queries become very slow.

Earlybird –> index server

MySQL is only used for synchronizing and disseminating data, and for saving serialized documents.

Index file storage

Where is the index file stored: MySQL, Redis, or HDFS?

HDFS: build an index over the searchable content in advance, store it in an index library, and serve queries from that index library.

search engine

The index file is divided into multiple blocks.

Inverted Index

An inverted index is a database index that stores a mapping from content, such as words or numbers, to its locations in a document or set of documents.

Cons: large storage overhead and high maintenance costs on update, delete, and insert.

A tokenizer splits text into the terms that the inverted index stores.
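As a minimal sketch of the idea, assuming a simple lowercase alphanumeric tokenizer and an in-memory dictionary (the function names `tokenize`, `build_inverted_index`, and `search_all` are illustrative, not Twitter's), an inverted index maps each token to the ids of the tweets that contain it:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every token of the query (AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data for illustration only.
tweets = {
    1: "Landscape design meets system design",
    2: "Search system design at scale",
    3: "Urban planning and landscape",
}
index = build_inverted_index(tweets)
print(search_all(index, "system design"))
```

A multi-keyword query becomes a set intersection over posting lists, which is the kind of query the MySQL time-slice tables could not serve. The cost is the one noted above: every insert, update, or delete has to touch the posting list of each token involved.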

Ranking search result

Interactions (likes, comments, retweets, saves, reads)
Connection (author credibility, followed or not)
Popularity (real-time + connection + interactions)
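One way to picture how these three signal groups could combine into a single score is sketched below. The weights, the recency decay, and the field names in `TweetSignals` are all assumptions made for illustration; the notes do not give the actual formula.

```python
from dataclasses import dataclass


@dataclass
class TweetSignals:
    likes: int
    comments: int
    retweets: int
    author_credibility: float  # assumed to be normalized to [0, 1]
    is_followed: bool          # whether the searcher follows the author
    age_hours: float           # time since the tweet was posted


def popularity_score(s):
    """Combine interactions, connection, and recency (hypothetical weights)."""
    interactions = s.likes + 2 * s.comments + 3 * s.retweets
    connection = s.author_credibility + (1.0 if s.is_followed else 0.0)
    recency = 1.0 / (1.0 + s.age_hours)  # newer tweets decay less
    return interactions * (1.0 + connection) * recency


fresh = TweetSignals(10, 0, 0, 0.0, False, 0.0)
older = TweetSignals(10, 0, 0, 0.0, False, 1.0)
print(popularity_score(fresh), popularity_score(older))
```

With identical interactions, the older tweet scores lower, and a followed author scores higher, which matches the intent of mixing real-time, connection, and interaction signals.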

Schema design

Tweet Table

id                  varchar
user_id     id      varchar
content             text
create_time         timestamp
like                long
forwarding_times    long
comment             long

User Table

user_id     id      varchar
user_name   
email       
is_superstart       boolean

how to optimize query performance

real_time index

Store the tweets from the last week in an in-memory index library. Querying an index in memory is much faster than querying it from disk.

create real_time index

New tweets posted by users are sent to the word segmentation server, where the tweet text is tokenized.
Tweets are distributed by hash partitioning across the Earlybird index servers. Each Earlybird is responsible for part of the data and indexes tweets in real time.
At the same time, a separate update service pushes dynamic changes to tweets (for example, the number of likes and retweets) so that the index is updated dynamically.

tweet –> MQ

MQ –> Tokenizer –hash–> EarlyBird(index servers)

MQ –> tweet updates –> Update server
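The flow above can be sketched in a few lines. This is a simplified, single-process stand-in: `NUM_SHARDS`, `IndexShard`, `ingest`, and `update` are hypothetical names, whitespace splitting stands in for the tokenizer, and the message queue is omitted.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of Earlybird-like index servers


def shard_for(tweet_id, num_shards=NUM_SHARDS):
    """Pick a shard by hashing the tweet id (stable across processes)."""
    digest = hashlib.md5(str(tweet_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class IndexShard:
    """In-memory stand-in for one index server."""

    def __init__(self):
        self.postings = {}  # token -> set of tweet ids
        self.counters = {}  # tweet id -> engagement counters

    def index_tweet(self, tweet_id, tokens):
        for token in tokens:
            self.postings.setdefault(token, set()).add(tweet_id)
        self.counters.setdefault(tweet_id, {"likes": 0, "retweets": 0})

    def apply_update(self, tweet_id, field, delta):
        # Assumes the tweet was indexed on this shard earlier.
        self.counters[tweet_id][field] += delta


shards = [IndexShard() for _ in range(NUM_SHARDS)]


def ingest(tweet_id, text):
    """Tokenize a new tweet and index it on the shard chosen by hash."""
    tokens = text.lower().split()
    shard = shards[shard_for(tweet_id)]
    shard.index_tweet(tweet_id, tokens)
    return shard


def update(tweet_id, field, delta):
    """Route an engagement update to the shard that owns the tweet."""
    shards[shard_for(tweet_id)].apply_update(tweet_id, field, delta)


owner = ingest(42, "hello search")
update(42, "likes", 3)
print(owner.counters[42])
```

Because both new tweets and later updates are routed with the same hash function, an update always lands on the shard that already holds the tweet.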

query process

Query request: the user's search request reaches Blender (the search front-end server), which parses the request.
Execute calculation: the Earlybird service runs the query and returns sorted tweet lists to Blender.
Return result: Blender merges the lists returned by each Earlybird, performs heavier sorting such as reranking, and then returns the result to the user.
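The three steps form a scatter-gather pattern. The sketch below assumes each shard returns `(score, tweet_id)` pairs sorted best first and that the front end only needs a k-way merge; the reranking step is left out, and the names `search_shard` and `blend` are illustrative.

```python
import heapq
from itertools import islice


def search_shard(shard_docs, term, limit):
    """One Earlybird-like shard: return (score, tweet_id) pairs, best first."""
    hits = [
        (score, tweet_id)
        for tweet_id, (text, score) in shard_docs.items()
        if term in text.lower().split()
    ]
    hits.sort(reverse=True)
    return hits[:limit]


def blend(shards, term, limit=3):
    """Blender-like front end: query every shard and merge the sorted lists."""
    partials = [search_shard(shard, term, limit) for shard in shards]
    # heapq.merge expects each input to be sorted already; reverse=True
    # means the inputs are sorted from largest to smallest.
    merged = heapq.merge(*partials, reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]


# Hypothetical shard contents: tweet_id -> (text, precomputed score).
shard_a = {1: ("search at scale", 0.9), 2: ("hello world", 0.5)}
shard_b = {3: ("search engine", 0.7), 4: ("fast search", 0.95)}
print(blend([shard_a, shard_b], "search", limit=2))
```

Each shard only returns its own top results, so the front end merges a few short sorted lists instead of sorting every match.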

How to optimize Query Performance to the next level

Besides the last week of tweets, Twitter also keeps the 2% most popular and most likely to be queried tweets in memory, and stores 16% of tweets on SSD.

Batch index creation process

Aggregation: join multiple data sources together based on tweet ID
Scoring: score each tweet using features extracted from Twitter data (number of retweets, likes, comments, etc.); the higher these counts, the higher the score
Partitioned storage: divide the data into small blocks and store them in HDFS
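A toy version of these three batch steps might look like the following. The scoring weights are made up, in-memory lists stand in for HDFS blocks, and the function names `aggregate`, `score`, and `partition` are illustrative.

```python
from collections import defaultdict


def aggregate(*sources):
    """Join several {tweet_id: fields} sources into one record per tweet id."""
    joined = defaultdict(dict)
    for source in sources:
        for tweet_id, fields in source.items():
            joined[tweet_id].update(fields)
    return dict(joined)


def score(record):
    """Score a record from engagement features (hypothetical weights)."""
    return (
        3 * record.get("retweets", 0)
        + 2 * record.get("comments", 0)
        + record.get("likes", 0)
    )


def partition(records, num_partitions):
    """Split records into blocks by tweet id; lists stand in for HDFS blocks."""
    parts = [[] for _ in range(num_partitions)]
    for tweet_id in sorted(records):
        parts[tweet_id % num_partitions].append((tweet_id, records[tweet_id]))
    return parts


# Hypothetical input sources keyed by tweet id.
texts = {1: {"content": "a"}, 2: {"content": "b"}}
engagement = {1: {"likes": 5, "retweets": 1}, 2: {"likes": 2, "comments": 4}}

records = aggregate(texts, engagement)
for record in records.values():
    record["score"] = score(record)
parts = partition(records, 2)
print(parts)
```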

tweet search vs e-commerce search

e-commerce search:

Supports all kinds of sorting, including by popularity, sales, credit, price, or other attributes such as grid, origin, etc.
Supports range queries, such as price range search
Supports using attributes to search and filter products
Real-time, accurate information on inventory and price
Accurate product data

Elasticsearch (ES): at present, most search engines used in e-commerce are built on distributed real-time engines such as Elasticsearch.

Scale

Twitter Trending

Twitter's Trending board is calculated in real time under high concurrency, so it is served from a cache rather than a database.

A Redis Sorted Set stores a score for each member and keeps members ordered by score, so the board can be read from the highest score to the lowest.
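To show the access pattern without needing a Redis server, here is a pure-Python stand-in. `TrendingBoard` is a hypothetical class; its `incr` method plays the role of the Redis `ZINCRBY` command and `top` plays the role of `ZREVRANGE ... WITHSCORES`. The tie-breaking rule is a choice of this sketch and may differ from Redis.

```python
class TrendingBoard:
    """In-memory stand-in for a Redis Sorted Set used as a trending board."""

    def __init__(self):
        self._scores = {}  # topic -> score

    def incr(self, topic, amount=1.0):
        """Add to a topic's score, similar to ZINCRBY."""
        self._scores[topic] = self._scores.get(topic, 0.0) + amount
        return self._scores[topic]

    def top(self, n):
        """Return the n highest-scoring topics, similar to ZREVRANGE."""
        ranked = sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


board = TrendingBoard()
for topic in ["#design", "#search", "#design", "#redis", "#redis", "#design"]:
    board.incr(topic)
print(board.top(2))
```

This version sorts on every read, which is fine for a sketch. A real sorted set keeps members ordered as scores change, which is what makes it suitable for a board that is updated and read at high concurrency.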

Fault Tolerance - Replication Mechanism

primary shard
replica shard
When data is written to the primary shard, it is synchronized to the replica shards in real time.
Queries can be served by any copy, primary or replica.
Improve fault tolerance: when a node is damaged or lost, its data can be recovered from a replica. If the node holding the primary shard goes down, a replica shard is promoted to primary shard.

Improve query efficiency: Replica shards also provide query capabilities and automatically load balance search requests.
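The replication behaviour described in this section can be sketched with a small in-memory model. `ShardGroup` and its methods are hypothetical names, replication is modelled as a synchronous copy, and round-robin is just one possible way to balance reads.

```python
import itertools


class ShardGroup:
    """One primary shard plus replicas, all modelled as in-memory dicts."""

    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._next = itertools.count()  # round-robin read counter

    def write(self, key, value):
        """Write to the primary, then copy to every replica."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        """Serve reads from primary and replicas in round-robin order."""
        copies = [self.primary] + self.replicas
        return copies[next(self._next) % len(copies)].get(key)

    def fail_primary(self):
        """Simulate losing the primary: promote the first replica."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)


group = ShardGroup(num_replicas=2)
group.write("t1", "hello")
group.fail_primary()
print(group.read("t1"), len(group.replicas))
```

After the simulated failure, the promoted replica already holds the data, so reads keep working with one fewer copy available.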

I'm Riley Shen