Learning system design as a landscape architect 6
Rethinking system design in a more fun way, as a former urban planner/landscape planner. Taking Scale and Distributed File System Design as examples:
- Problem-based learning to understand SCALE
- Problem-based learning to understand Distributed File System
- Design a lookup service
Problem-based learning to understand SCALE
1. How to scale a system ≈ how to scale the database
2. Single point of failure
3. Replication (3 copies)
4. Vertical sharding (cons: a single table can still have massive data and QPS even with few columns)
Consistent Hashing
Hash values from 0 to 2^64 - 1 form a large ring
Virtual Node
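A minimal sketch of how such a ring might be implemented, assuming an MD5-based hash mapped onto the 0 ~ 2^64 - 1 range; the class name and the `num_virtual` parameter (virtual nodes per physical node) are made up for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes (illustrative sketch)."""

    def __init__(self, num_virtual=100):
        self.num_virtual = num_virtual   # virtual nodes per physical node
        self.ring = []                   # sorted positions on the 0..2^64-1 ring
        self.position_to_node = {}       # ring position -> physical node name

    def _hash(self, key):
        # Map any string onto the 0 .. 2^64 - 1 ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 64)

    def add_node(self, node):
        for i in range(self.num_virtual):
            pos = self._hash(f"{node}#vnode{i}")
            bisect.insort(self.ring, pos)
            self.position_to_node[pos] = node

    def remove_node(self, node):
        for i in range(self.num_virtual):
            pos = self._hash(f"{node}#vnode{i}")
            self.ring.remove(pos)
            del self.position_to_node[pos]

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key's position.
        pos = self._hash(key)
        idx = bisect.bisect_left(self.ring, pos) % len(self.ring)
        return self.position_to_node[self.ring[idx]]

ring = ConsistentHashRing()
for db in ["db0", "db1", "db2"]:
    ring.add_node(db)
print(ring.lookup("user_12345"))   # e.g. "db1"
```

With virtual nodes, adding or removing a physical node only remaps the keys that land on its virtual nodes, so rebalancing traffic stays small and load spreads more evenly.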
MySQL Replica
Master Slave Replica
Write Ahead Log
- Every write operation on the SQL database is recorded in the form of a log
- For example: data A is changed from C to D at time B
- After a slave is activated, it tells the master "I am here"
- The master notifies the slave to read the log whenever there is a new operation
- So the data on the slave is slightly "delayed"
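A toy sketch of this idea, with hypothetical in-memory `Master`/`Slave` classes rather than real MySQL internals: the master appends every write to its log before applying it, and the slave replays the entries it has not seen yet, which is why its copy lags slightly.

```python
import time

class Master:
    def __init__(self):
        self.data = {}
        self.log = []            # write-ahead log: (timestamp, key, old_value, new_value)

    def write(self, key, value):
        entry = (time.time(), key, self.data.get(key), value)
        self.log.append(entry)   # log first ...
        self.data[key] = value   # ... then apply

class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0         # index of the next log entry to replay

    def catch_up(self, master):
        # Replay any log entries the slave has not applied yet.
        for _, key, _, new_value in master.log[self.applied:]:
            self.data[key] = new_value
        self.applied = len(master.log)

master, slave = Master(), Slave()
master.write("A", "C")
master.write("A", "D")      # "data A changed from C to D"
slave.catch_up(master)      # until this runs, the slave's copy is stale
print(slave.data["A"])      # D
```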
What if the master crashes?
- Promote a slave to master and let it accept reads + writes
- There may be some degree of data loss and inconsistency
NoSQL Replica
Cassandra: each piece of data is stored on the next three virtual nodes clockwise on the consistent hashing ring
SQL vs NoSQL
SQL: the "automatic" replication method is master-slave; you can also "manually" store three copies clockwise on a consistent hashing ring at the application level
NoSQL: the "automatic" replication method already stores three copies clockwise on the consistent hashing ring, so no manual replication is needed
User Table Sharding
SQL data -> User Table
Shard the data based on how it is queried
select * from user_table where user_id=xxx
Use user_id as the sharding key
After sharding the User Table, what if multiple databases cannot maintain a global auto-increment ID?
- Manually create a UUID as the user's user_id
When a user is being created there is no user_id yet, so which database should store this user?
- The web server is responsible for creating the user's UUID as the user_id
- After creation, obtain the target database from the result of consistent_hash(user_id) (see the sketch after this list)
What if the User Table already used auto-incrementing IDs before sharding?
- A UUID is usually a string, while an auto-incrementing id is an integer, so the two are not compatible
- Use a separate global UserIdService to create new user IDs: it records the current maximum UserId, increments it by 1 each time, and must ensure the atomicity of this operation (Atomic)
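A minimal sketch of the UUID approach, assuming a made-up `consistent_hash` routing helper and hypothetical shard names; the SQL in the comments is illustrative only:

```python
import hashlib
import uuid

USER_DB_SHARDS = ["user_db_0", "user_db_1", "user_db_2", "user_db_3"]   # hypothetical shard names

def consistent_hash(user_id):
    # Stand-in for the consistent-hashing ring sketched earlier:
    # map the user_id onto the 0..2^64-1 ring and pick the owning shard.
    pos = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % (2 ** 64)
    return USER_DB_SHARDS[pos * len(USER_DB_SHARDS) // 2 ** 64]

def create_user(name):
    user_id = str(uuid.uuid4())          # web server generates the id; no global auto-increment needed
    shard = consistent_hash(user_id)     # which database should store this user
    # On `shard`: INSERT INTO user_table (user_id, name) VALUES (...)
    return user_id, shard

def get_user(user_id):
    shard = consistent_hash(user_id)     # the same hash locates the row later
    # On `shard`: SELECT * FROM user_table WHERE user_id = ...
    return shard
```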
Friendship Table Sharding
Directed friendship: use to_user_id as the sharding key
Undirected friendship: store the relationship twice, once using user a as the sharding key and once using user b (sketched below)
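A small sketch of the undirected double-write, where `shard_for` stands in for whatever routing function (e.g. the consistent hashing above) the system uses; table and column names are illustrative:

```python
def add_friendship(user_a, user_b, shard_for):
    """Undirected friendship: write the pair twice so either user can be the sharding key."""
    rows = [
        {"sharding_key": user_a, "from_user_id": user_a, "to_user_id": user_b},
        {"sharding_key": user_b, "from_user_id": user_b, "to_user_id": user_a},
    ]
    for row in rows:
        shard = shard_for(row["sharding_key"])
        # On `shard`: INSERT INTO friendship_table (from_user_id, to_user_id) VALUES (...)
    return rows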
Session Table Sharding
session_key as sharding key
News Feed / Timeline Sharding
Use user_id as the sharding key
LeetCode Submission Sharding
- Requirement 1: query the submission records of a certain problem
select * from submission_table where problem_id=100;
- Requirement 2: query the submission records of a certain user
select * from submission_table where user_id=101;
How to meet both requirements, given that NoSQL does not support multiple indexes?
- Use user_id as the sharding key in the submission_table
- Create a second table that stores all the submissions of each problem (a double-write sketch follows this list)
  - The table contains problem_id, submission_id, ...
  - Use problem_id as the sharding key for this table
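A sketch of the resulting double write, again with a made-up `shard_for` routing function and illustrative table and column names:

```python
def record_submission(submission, shard_for):
    """Write one submission into two tables so both query patterns stay on a single shard."""
    # Table 1: sharded by user_id -> "all submissions of a user" hits one shard.
    user_shard = shard_for(str(submission["user_id"]))
    # On `user_shard`: INSERT INTO submission_table (user_id, problem_id, submission_id, ...)

    # Table 2: sharded by problem_id -> "all submissions of a problem" hits one shard.
    problem_shard = shard_for(str(submission["problem_id"]))
    # On `problem_shard`: INSERT INTO problem_submission_table (problem_id, submission_id, ...)

    return user_shard, problem_shard
```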
Problem-based learning to understand Distributed File System
Scenario analysis
basic function
- function 1
- user read and write file
- the size of files
- function 2
- how many machines needed to store those files
- the number of machines
Service analysis
Client
Server
Server: Master-Slave
Storage
How to design GFS (Google File System)
How to save a file on one machine? (100 GB)
- Metadata
- File info
- Name=xxx.mp4
- CreatedTime=xxxxx
- Size=aaaaa
- index
- Block 11 -> diskOffset1
- Block 12 -> diskOffset2
- Block 13 -> diskOffset3
- File info
1 block = 4096 bytes
How to save a file on one machine? (100 TB)
1 chunk = 64 MB = 64 * 1024 KB
Advantages
- Reduce the size of metadata
- Reduce traffic
Disadvantage
- Wastes space for small files
- Metadata
- File info
- Name=xxx.mp4
- CreatedTime=xxxxx
- Size=aaaaa
- index
- Chunk 11 -> diskOffset1
- Chunk 12 -> diskOffset2
- Chunk 13 -> diskOffset3
- File info
How to save an extra-large file on several machines? (10 PB)
One master + many ChunkServers
Slave Servers = Chunk Servers
- Metadata
- File info
- Name=xxx.mp4
- CreatedTime=xxxxx
- Size=aaaaa
- index
- Chunk 11 -> diskOffset1
- Chunk 12 -> diskOffset2
- Chunk 13 -> diskOffset3
- File info
improve the performance
The master does not record the diskOffset of a chunk
Advantages
- Reduce the size of metadata in the master
- Reduce the traffic between the master and the ChunkServers (chunk offset changes do not need to notify the master)
- Metadata
- File info
- Name=xxx.mp4
- CreatedTime=xxxxx
- Size=aaaaa
- index
- Chunk 11 -> cs3
- Chunk 12 -> cs3
- Chunk 13 -> cs4
- File info
How much space does the metadata for a 10 PB file take?
1 chunk = 64 MB needs about 64 B of metadata
10 PB / 64 MB ≈ 160 million chunks, so the metadata needs about 10 GB
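A quick back-of-the-envelope check of those numbers:

```python
# Illustrative arithmetic only.
file_size      = 10 * 2**50     # 10 PB
chunk_size     = 64 * 2**20     # 64 MB per chunk
meta_per_chunk = 64             # ~64 bytes of metadata per chunk

num_chunks = file_size // chunk_size         # ~160 million chunks
metadata   = num_chunks * meta_per_chunk     # bytes of metadata on the master

print(num_chunks)               # 167,772,160 chunks
print(metadata / 2**30)         # ~10 GB of metadata
```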
how to write a file
It is not easy to keep all those replicas consistent.
client -> write File_name=/xxxx.mp4, Chunk index=1 -> master
master -> Assign Chunkserver_locations=US, CS1 -> client
client -> transfer data=/xxxx.mp4, Chunk index=1 -> chunkServer1
chunkServer1 -> write finished -> client
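A highly simplified sketch of this write path, with hypothetical in-memory `GFSMaster`/`ChunkServer` classes standing in for real servers; the placement policy (pick the least-loaded chunkserver) is an assumption for illustration, not the real GFS algorithm:

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB chunks, as above

class GFSMaster:
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers   # name -> ChunkServer
        self.chunk_map = {}                # (file_name, chunk_index) -> chunkserver name

    def assign_chunkserver(self, file_name, chunk_index):
        # Simplified placement: pick the chunkserver holding the fewest chunks.
        name = min(self.chunkservers, key=lambda n: len(self.chunkservers[n].chunks))
        self.chunk_map[(file_name, chunk_index)] = name
        return name

class ChunkServer:
    def __init__(self):
        self.chunks = {}                   # (file_name, chunk_index) -> bytes

    def write_chunk(self, file_name, chunk_index, data):
        self.chunks[(file_name, chunk_index)] = data
        return "write finished"

def write_file(master, file_name, data):
    # The client splits the file into chunks and writes them one by one.
    for i in range(0, len(data), CHUNK_SIZE):
        chunk_index = i // CHUNK_SIZE
        cs_name = master.assign_chunkserver(file_name, chunk_index)   # ask the master where to write
        master.chunkservers[cs_name].write_chunk(file_name, chunk_index, data[i:i + CHUNK_SIZE])
```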
how to modify the written file
delete the file, then rewrite the file
how to read a file
Split a file into multiple chunks to read
client -> File_name=/xxxx.mp4 -> master
master -> return a chunk list -> client
client -> Read /xxxx.mp4 in chunkServer1 -> chunkServer1
chunkServer1 -> Return data /xxxx.mp4-00-of-09 -> client
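Continuing the same hypothetical sketch (it reuses `GFSMaster`, `ChunkServer`, and `write_file` from the write example above), the read path mirrors the write path: ask the master for the chunk list, then fetch each chunk from its chunkserver.

```python
def read_file(master, file_name):
    # The client asks the master for the chunk list, then fetches each chunk from its chunkserver.
    chunk_keys = sorted(k for k in master.chunk_map if k[0] == file_name)
    parts = []
    for key in chunk_keys:
        cs_name = master.chunk_map[key]                      # which chunkserver holds this chunk
        parts.append(master.chunkservers[cs_name].chunks[key])
    return b"".join(parts)

master = GFSMaster({"cs1": ChunkServer(), "cs2": ChunkServer(), "cs3": ChunkServer()})
write_file(master, "/xxxx.mp4", b"x" * (130 * 2**20))        # 130 MB -> 3 chunks
assert read_file(master, "/xxxx.mp4") == b"x" * (130 * 2**20)
```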
Master Task
- Store each file’s metadata
- Store Map(file name + chunk index -> chunk server)
- Find the corresponding chunkserver when reading
- Allocate free chunkservers when writing
One Working Solution
- storage
- Ordinary file system Meta Data, Block
- Large file storage: Block-> Chunk
- Large files on multiple machines: Chunk Server + Master
- write
- Master + Client + ChunkServer communication process
- Master maintains metadata and chunkserver tables
- read
- Master + Client + ChunkServer communication process
How to identify whether a chunk on the disk is broken
use CheckSum
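A minimal sketch of per-chunk checksumming; the specific algorithm (CRC32 here) is an assumption, chosen only because it is a common lightweight choice:

```python
import zlib

def chunk_checksum(data: bytes) -> int:
    # CRC32 as a lightweight per-chunk checksum.
    return zlib.crc32(data)

def write_chunk_with_checksum(store: dict, key, data: bytes):
    # Store the checksum next to the chunk data when writing.
    store[key] = (data, chunk_checksum(data))

def read_and_verify(store: dict, key) -> bytes:
    data, expected = store[key]
    if chunk_checksum(data) != expected:
        # Mismatch => the chunk on disk is broken; recover it from another replica.
        raise IOError(f"chunk {key} is corrupted")
    return data

store = {}
write_chunk_with_checksum(store, ("/xxxx.mp4", 1), b"chunk bytes")
print(read_and_verify(store, ("/xxxx.mp4", 1)))
```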
How to avoid chunk data loss when a ChunkServer is down
Replica
Replica
3 replicas by default
Reliability: tolerate 2 failures
HDFS
The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to DataNode failure or disk failure, the NameNode creates another replica of the block.
Master-Slave
- Storage
- Save a file on one machine -> a big file on one machine -> an extra-big file on multiple machines
- Multi-machine
- How to use the master
- How to handle the traffic and storage of the master?
- Read
- The process of reading a file
- Write:
- The process of writing a file
- How to reduce master traffic?
- Client and ChunkServer communicate directly
- How to reduce client traffic?
- Leader Election
- Failure and Recovery (key)
- Discover the failure of a chunk?
- Check Sum
- Avoid the failure of a chunk?
- Replica
- Recover from the failure?
- Ask master
- Discover the failure of the chunkserver?
- Heart Beat (chunkservers->master)
- Handle a failure while writing to a ChunkServer?
- Retry
Design a lookup service
There are 10 billion key-value pairs in this system; when the user enters a key, it returns the corresponding value. Each key is 0.1 KB and each value is 1 KB. Required QPS >= 5000, latency < 200 ms.
Server parameters you need to know:
- commodity server
- 8 CPU cores on each server
- 32 GB memory
- 6 TB disk
solution step 1: calculation
- Total key size: 10 billion * 0.1 KB = 1 TB
- Total value size: 10 billion * 1 KB = 10 TB
- With 6 TB disks, a server with two disks (12 TB) is enough to hold all the values
- 1 TB = 1024 GB ~ 1000 GB, so 40 servers can store all the keys
step 2: find out the server location of the key
- For every request, one value (1 KB) needs to be returned
- Total time for reading one value: 10 ms (disk seek) + 1 KB / 1 MB * 30 ms (reading 1 KB sequentially from disk) ≈ 10 ms
- Add a disk-location mapping: each location is 8 bytes, so the keys plus locations still total ~1 TB
- The 40 servers can keep the <key, disk-address> pairs in memory (~1 TB of memory in total)
- A binary search over the in-memory keys takes ~30 comparisons, which is negligible
- Total latency is 10 + 0.5 = 10.5 ms, well within the 200 ms requirement
step 3: meet the 5000 QPS requirement
- 40 servers, each with 2 disks; one disk supports ~100 random reads per second, so each server supports ~200 reads per second
- Total QPS = 200 * 40 = 8000 > 5000
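A quick script re-deriving the estimates from steps 1-3 (all constants come from the numbers above):

```python
# Illustrative arithmetic only.
num_pairs   = 10 * 10**9        # 10 billion key-value pairs
key_size    = 0.1 * 1024        # 0.1 KB per key, in bytes
value_size  = 1 * 1024          # 1 KB per value
loc_size    = 8                 # 8-byte disk address per key

total_keys   = num_pairs * key_size               # ~1 TB of keys
total_values = num_pairs * value_size             # ~10 TB of values
total_index  = num_pairs * (key_size + loc_size)  # ~1.1 TB of <key, disk-address> pairs

servers        = 40
mem_per_server = 32 * 2**30                       # 32 GB
print(total_index <= servers * mem_per_server)    # True: the index fits in the cluster's memory

# Latency: one disk seek plus a tiny sequential read dominates.
latency_ms = 10 + (1 / 1024) * 30                 # ~10 ms per lookup
# Throughput: ~100 random reads per disk per second, 2 disks per server.
qps = servers * 2 * 100                           # 8000 >= 5000
print(latency_ms, qps)
```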
Scenario
Functional requirements: search, read
QPS >= 5000
read: