learning System design as a landscape architect 1

Tue, Mar 1, 2022 6-minute read

Rethink system design in a much fun way, as a former urban planner/landscape planner. Take design twitter as example

Design twitter – Present a Landscape Design

1. what’s is the scenario planning of this project

Landscape Architects won’t jump into drawing a good site planning immediately, nor the system design. The first step is always to clarify the project and know your client demands as much as possible.

what problem you want to solve?
what basic function need here?
Features, QPS, DAU, Interface analysis (site condition like land, weather, population, culture analysis in a project)

need basic data about twitter: MAU 330M, DAO: 200m
connunicate the basic function need here
- Register / Login
- User Profile Display / Edit
- Upload Image / Video
- Search
- Post / Share a tweet
- Timeline/News Feed
- Follow / unFollow a user
Pick the one your clients need (interviewee)
- Post
- Timeline
- News Feed
- Follow / Unfollow
- Register / Login
Features (site condition)
- Concurrent User :
- Read QPS: DAO * average number of queries by a user / seconds in a day
150 M * 50 / 86400 ~ 100k
- Peak = 100k * 3 ~ 300k
- Fast Growing: Max peak users in 3 months = Peak users * 2 (not for twitter)
- Write QPS

QPS

100: your laptop can the web server
1 k : a web server, need consider single point failure, SQL database
1 m : 100 web server cluster, need consider maintainance, NoSQL database(Memchched), Cassandra is 10 k QPS

2. master plan and zoning concept

After site analysis, It’s time to give a draft drawing sketche, with enough imformation (which need communicate with your client/team members)

In a draft drawing sketches, it would contains：

traffic plan
commerical plan
green system

System design:(zoning plan)

Service

Router

+ User Service

+ Tweet Service

Post a tweet
News Feed
Timeline

+ Media Service

Upload Image
Upload Video

+ Friendship Service

Follow
Unfollow

3. core area plans(1: 500)

In a draft drawing sketches, it would contains：

pavment plan
rainstom plan

Storage

+ User Service

Register
Login
SQL

+ Tweet Service

Post a tweet
News Feed
Timeline
NoSQL

+ Media Service

Upload Image
Upload Video
File System

+ Friendship Service

Follow
Unfollow
SQL/NoSQL

Please design schema User Service

User Table
id          integer
username    varchar
email       varchar
password    varchar

Tweet Service

Tweet Table
id              integer
user_id         Foreign Key
content         text
created_time    timestamp

Media Service

Friendship Service

Friendship Table
from_user_id    Foreign Key     
to_user_id      Foreign Key
created_time    timestamp

Interviewer: how to storage and get the News Feed

PULL vs PUSH

PULL

algorithm: when users check New Feed, get every friend’s the first 100 tweets, merge them into the frist 100 News Feed

Mergr K sorted Arrays

time complexity: basically you can ignore the Mergr K sorted Arrays times, N times DB Reads is much more costly

User – fetch News Feed -> Web Server <-get followings->Friendship Table

Web Server <-get tweets from followings->Tweet Table

Web Server -merge and return->User

cons:

multiple DB Reads is very slow when users fetch News Feed

getNewsFeed(request)
    • followings = DB.getFollowings(user=request.user)
    • news_feed = empty
    • for follow in followings: • tweets = DB.getTweets(follow.to_user, 100) • news_feed.merge(tweets)
    • sort(news_feed)
    • return news_feed[:100] # return the first 100


• postTweet(request, tweet)
    • DB.insertTweet(request.user, tweet)
    • return succes

PUSH

algorithm: every user has a list store his News Feed information

Fanout: User post a tweet, push the tweet into his followers’s News Feed lists

when a user check his News Feed, he just read from his News Feed list

time complexity: 1 DB Read, N times Writes. (Async)

User <– Post a new tweet -> Web Server <-Insert the tweet to DB->tweet Table

Web Server -sent tweets to friends->Async Tasks Server

Async Tasks Server <-get followers->friendship Table

Async Tasks Server <-Fanout: insert new tweet to followers news feed->New Feed Table

cons:

huge followers numbers

getNewsFeed(request)
    • return DB.getNewsFeed(request.user)

• postTweet(request, tweet_info)

    • tweet = DB.insertTweet(request.user, tweet_info)
    • AsyncService.fanoutTweet(request.user, tweet)
    • return success

AsyncService::fanoutTweet(user, tweet)
    • followers = DB.getFollowers(user)
    • for follower in followers:
    • DB.insertNewsFeed(tweet, follower )

PULL vs PUSH

tweet and facebook are using pull

4. details design show how to solve some specific problem and improve the plan in the future

Optimize
Scalability Enhancement of Pull Server functions
- the slowest part is happening when user request News Feed
  - add Cache before DB Read
  - cache every user’s Timeline
    - n times cache reqest
    - trade off: all the cache, or the recent 1000 in cache
  - cache every user’s News Feed
    - with Cache: merge users recent 100 tweets, then get the first 100
    - without Cache: merger all the tweets after timestamp
Scalability Enhancement of Push Server functions
- compare Pull model store News Feed in Memory, Push model stores in Disk, and Disk is cheap
- Inactive Users
  - Rank followers by weight(last login time)
- followers » following: fanout would take hours
  - trade off: Pull + Push vs Pull
right Optimize thinking

Try to optimize with minimal changes to the existing model;

Estimate long-term growth and assess whether it is worth switching the entire model

Push + Pull
- ordinal user (Push)
- influencer (Pull), followers come to influencer’s timeline fetch the News Feed, merge into their News Feed

User Table
id           integer
username     varchar
email        varchar
password     varchar
is_influencer boolean

Maintenance
- Robust
- Scalability

unfollow problem(async)

follow: merge timeline into News Feed asynchronously
unfollow: pick out tweets from News Feed asynchronously
Async reduces the processing time
async improve user experience
however, after unfollowing, the user may still find his News Feed in timeline

store likes

Tweet Table
id                  integer
user_id             Foreign Key
content             text
created_time        timestamp
nums_of_like *      integer
nums_of_comments *  integer
nums_of_retweets *  integer

Like Table
id                  integer
user_id             Foreign Key
tweet_id            Foreign Key
created_time        timestamp

Normalize vs Denormalize

Normalize vs Denormalize likes

Normalize likes

SELECT COUNT * FROM like_table where tweet_id=xxx;

Pros: Standardized, most accurate
cons: slow, would cause O(n) SQL Queries

Denormalize likes

UPDATE like_table SET num_of_likes = num_of_likes + 1 where tweet_id=xxx;

UPDATE like_table SET num_of_likes = num_of_likes - 1 where tweet_id=xxx;

no need SQL Queries, because num_of_likes stores in tweet

Thundering Herd

The “thundering herd” problem occurs in a highly concurrent environment (typically, many users). When many users make a request to the same piece of data at the same time, and there is a cache miss (the data for the cached element is not present in the cache, data is kicked out due to cache expiration or elimination by the elimination algorithm, etc) the thundering herd problem is triggered.

I'm Riley Shen