learning System design as a landscape architect 12

Fri, Apr 1, 2022 4-minute read

Rethink Location Based Service system design in a much fun way, as a former urban planner/landscape planner. Take Search System as example


System Design = Logic Design(why) + Infrastructure Design(how)

Design Search System

Scenario analysis

Take twitter Search System as example.

Functional Requirement

500 million tweets are posted every day.

Service

  • Index Service
  • Search Service
  • Ranking Service

Storage system in Twitter

  1. use Redis Cluster for caching
  2. use Blobstore for large file such as image, video
  3. Hadoop cluster for backend data, analyzing Server Logs
  4. MySQL for persistent data

database system

  • Sharding
  • Cold and hot data separation
  • read-write separation

searching database involvement

  1. MySQL database

In the MySQL database, twitter is indexed based on time, which is divided into multiple Timeslice table. Only support 3 days data search.

However, due to the huge number of tweets, each search only supports three days of all data searches.

The search methods supported by MySQL search are very limited and difficult to scale

Searching for a single keyword, does not support searching for multiple keyword combinations, filter, fuzzy query, range search

When the amount of data is too large, the query speed is very slow.

  1. earlybird –> Index server

MySQL is only used in Synchronizing , disseminating data, and saving serialized documents

Index file storage

where is index file store? MySQL , Redis, or HDFS?

HDFS: Build an index on the query content in advance, put it in the index library, and then query from the index library when querying.

search engine

divide the file into multiple Blocks

Inverted Index

an inverted index is a database index storing a mapping from content, such as words or numbers, cons: Large storage overhead and high maintenance costs on update, delete and insert.

Tokenizer is used in inverted index

Ranking search result

  • Interactions (Likes, comments, retweets, saved, read)
  • connection (Author Credibility, followed or not)
  • popularity (realtime + connection + Interactions)

Schema design

Tweet Table

id                  varchar
user_id     id      varchar
content             text
create_time         timestamp
like                long
forwarding_times    long
comment             long

User Table

user_id     id      varchar
user_name   
email       
is_superstart       boolean

how to optimize query performance

real_time index

Store the latest tweets in the last 1 week to build an index library in memory. Querying an index from memory is much faster than from disk

create real_time index

  1. New tweets posted by users will be sent to the word segmentation server. Here the text of the tweets is tokenized
  2. According to hash segmentation, tweets are distributed to each Earlybird (index server), and each Earlybird (index server) is responsible for a part data, indexing tweets in real time
  3. At the same time, there is another update service, which pushes the dynamic change information of tweets (for example: the number of likes, the number of retweets), and dynamically updated index

tweet –> MQ

MQ –> Tokenizer –hash–> EarlyBird(index servers)

MQ –> tweet updates–> Updatas server

query process

  1. query request User search request reach Blender (Search front-end server), Blender parses the request
  2. execute calculation Earlybird Service executive calculation,and sorted tweet lists, return to Blender
  3. return result Blender merges each lists returned by Earlybird table, and perform some heavy sorting, like Reranking, then return to the user

How to optimize Query Performance to the next level

Twitter saves not only the last 1 week tweets and also the 2% most popular and most likely to be queryed Search tweets in memory and save 16% of tweets on SSD hard drive.

Batch index creation process

  • Aggregation: Join multiple data sources together based on Tweet ID
  • Scoring: according to the feature extraction of Twitter (number of retweets, number of likes, number of comments, etc.) to score, the higher the score
  • Partitioned storage: divide data into small blocks and store them in HDFS

e-commerce search:

  1. Supports all kinds of sorting, including popularity, sales, credit, price or other attributes such as grid, origin, etc.

  2. Support range query –price range search

  3. Support using Attribute to search and filter products

  4. Real time accurate information on inventory and price

  5. Having Accurate Product Data

Elasticsearch(ES): At present, most search engines used in e-commerce are based on distributed real-time engines

Scale

Twitter Trending

Twitter’s high-concurrency real-time Calculated Trending board, cache but not db is the one used to realize this function.

Redis Sorted Set has a weight score to sort from large to small.

Fault Tolerance - Replication Mechanism

  • primary shard

  • replica shard

  • When data is stored on the primary shard server, it will be synchronized to the backup shard server in real time.

  • When querying, all (primary and standby) are queried

  • Improve the fault tolerance of the system. When a node of a node is damaged or lost, it can recover from the replica, and the node with the primary shard hangs.

A replica shard will be promoted to primary shard

  • Improve query efficiency: Replica shards also provide query capabilities and automatically load balance search requests.