Learning system design as a landscape architect 6

Tue, Mar 1, 2022 7-minute read

Rethinking system design in a more fun way, as a former urban planner/landscape planner, taking scaling and Distributed File System design as examples.


Problem-based learning to understand SCALE

1. How to scale a system ≈ how to scale the database

2. Single point of failure

3. Replication (3 copies)

4. Vertical sharding (con: a single table can still have massive data and QPS even with only a few columns)

Consistent Hashing

Map the hash space 0 to 2^64 - 1 onto a large ring

Virtual Node
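A minimal sketch of consistent hashing with virtual nodes (the class and helper names here are illustrative, not from the notes): each physical node is hashed many times onto the 0 to 2^64 - 1 ring, and a key belongs to the first virtual node found clockwise from its hash.

```python
import bisect
import hashlib

def hash64(s: str) -> int:
    # Map a string onto the 0 .. 2^64 - 1 ring.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, virtual_copies=1000):
        # Each physical node appears on the ring many times as virtual nodes,
        # which smooths out the key distribution across nodes.
        self.ring = sorted(
            (hash64(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_copies)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise: the first virtual node at or after the key's hash owns it.
        i = bisect.bisect_right(self.points, hash64(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["db0", "db1", "db2"])
print(ring.lookup("user_12345"))   # -> one of db0 / db1 / db2
```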

MySQL Replication

Master-Slave Replication

Write Ahead Log

  • Any operation on the SQL database is recorded in the form of a log
  • For example: data A was changed from C to D at time B
  • When a slave is activated, it tells the master "I am here"
  • The master notifies the slave to read the log whenever there is a new operation
  • So the data on the slave is "delayed"

What if the master crashes?

  • Upgrade a slave to master and let it accept reads + writes
  • There may be some degree of data loss and inconsistency

NoSQL Replication

Cassandra: a piece of data is stored "clockwise" on three virtual nodes along the consistent hashing ring

SQL vs NoSQL

SQL: the built-in replication method is master-slave; you can also manually store three copies clockwise on a consistent hashing ring at the application level.

NoSQL: the built-in replication method already stores three copies clockwise on the consistent hashing ring; no manual work is needed.

User Table Sharding

SQL data -> User Table

Shard the data based on how it will be queried:

select * from user_table where user_id=xxx

So use user_id as the sharding key.

After User Table Sharding, what if multiple databases cannot maintain a global auto-increment ID?

  • Manually create a UUID as the user’s user_id

When a user is being created, there is no user_id yet, so which database should this user be created in?

  • The web server is responsible for creating the user's UUID as user_id
  • After creation, the target database is found from the result of consistent_hash(user_id) (sketched below)
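A rough sketch of that flow (hypothetical shard names and a print standing in for the real insert; the routing function is a simplified stand-in for a proper consistent-hashing ring like the one above): the web server generates the UUID first, so the sharding key exists before any database is chosen.

```python
import uuid
import hashlib

DATABASES = ["user_db_0", "user_db_1", "user_db_2"]  # hypothetical shard names

def consistent_hash(user_id: str) -> str:
    # Simplified stand-in for the consistent-hashing lookup:
    # hash the UUID and pick a shard.
    h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:8], "big")
    return DATABASES[h % len(DATABASES)]

def create_user(name: str) -> str:
    # Create the UUID first, then route by it.
    user_id = str(uuid.uuid4())
    shard = consistent_hash(user_id)
    print(f"INSERT user {name} ({user_id}) into {shard}")  # stand-in for a real write
    return user_id

create_user("alice")
```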

What if the User Table used auto-incrementing IDs before sharding?

  • A UUID is usually a string, while the auto-incrementing id is an integer, so the two are not compatible
  • Instead, a separate global UserIdService creates new user IDs: it records the current maximum UserId and increments it by 1 each time, and it must make that increment atomic (see the sketch below)
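A minimal in-process sketch of such a UserIdService (hypothetical; a real one would persist the counter and sit behind an RPC interface): it keeps the current maximum id and hands out the next value under a lock so the increment stays atomic.

```python
import threading

class UserIdService:
    """Global allocator that records the current maximum user id."""

    def __init__(self, start: int = 0):
        self._max_id = start
        self._lock = threading.Lock()

    def next_id(self) -> int:
        # The lock makes read-increment-write atomic, so two concurrent
        # requests can never receive the same id.
        with self._lock:
            self._max_id += 1
            return self._max_id

svc = UserIdService(start=1_000_000)
print(svc.next_id(), svc.next_id())  # 1000001 1000002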

Friendship Table Sharding

Directed friendship: use to_user_id as the sharding key

Undirected friendship: store the relationship twice (as shown below), once using a as the sharding key and once using b as the sharding key
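A short sketch of that double write (hypothetical shard router and prints standing in for real inserts): the same edge is written twice so it can be found from either side's shard with a single-shard query.

```python
def shard_of(user_id: int, num_shards: int = 4) -> int:
    # Hypothetical shard router keyed on the first column of each row.
    return user_id % num_shards

def add_friendship(a: int, b: int):
    # Store the edge twice: once sharded by a, once sharded by b,
    # so "friends of a" and "friends of b" are both single-shard queries.
    for from_id, to_id in [(a, b), (b, a)]:
        print(f"shard {shard_of(from_id)}: INSERT (from={from_id}, to={to_id})")

add_friendship(101, 202)
```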

Session Table Sharding

session_key as sharding key

News Feed / Timeline Sharding

user_id

LeetCode Submission Sharding

  • Demand 1: query the submission records of a given problem

select * from submission_table where problem_id=100;

  • Demand 2: query the submission records of a given user

select * from submission_table where user_id=101;

How to meet both demands, given that NoSQL does not support multiple indexes?

  • Use user_id as the sharding key in the submission_table

  • Create a second table that stores all the submissions of each question (see the sketch below)

    • The table contains "problem_id, submission_id, ..."
  • Use problem_id as the sharding key of this second table
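A sketch of that dual-table write (hypothetical table names and a simple modulo router in place of consistent hashing): every submission goes into a user-sharded table and a problem-sharded table, so both queries stay on a single shard.

```python
def route(key: int, num_shards: int = 8) -> int:
    # Hypothetical shard router (a real system would use consistent hashing).
    return key % num_shards

def save_submission(submission_id: int, user_id: int, problem_id: int):
    # Table 1: sharded by user_id -> serves "all submissions of a user".
    print(f"user_submissions shard {route(user_id)}: "
          f"INSERT (user_id={user_id}, submission_id={submission_id})")
    # Table 2: sharded by problem_id -> serves "all submissions of a problem".
    print(f"problem_submissions shard {route(problem_id)}: "
          f"INSERT (problem_id={problem_id}, submission_id={submission_id})")

save_submission(555, user_id=101, problem_id=100)
```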


Problem-based learning to understand Distributed File System

Scenario analysis

basic function

  1. Function 1
  • Users read and write files
  • What is the size of the files?
  2. Function 2
  • How many machines are needed to store those files?
  • What is the number of machines?

Service analysis

Client

Server

Server: Master-Slave

Storage

How to design GFS (Google File System)

How to save a file on one machine? (100 GB)

  • Metadata
    • File info
      • Name=xxx.mp4
      • CreatedTime=xxxxx
      • Size=aaaaa
      • Index
        • Block 11 -> diskOffset1
        • Block 12 -> diskOffset2
        • Block 13 -> diskOffset3

1 block = 4096 bytes

How to save a file on one machine? (100 TB)

1 block (chunk) = 64 MB = 64 * 1024 KB

Advantages
  • Reduce the size of metadata
  • Reduce traffic

Disadvantages
  • Waste space for small files

  • Metadata
    • File info
      • Name=xxx.mp4
      • CreatedTime=xxxxx
      • Size=aaaaa
      • Index
        • Chunk 11 -> diskOffset1
        • Chunk 12 -> diskOffset2
        • Chunk 13 -> diskOffset3
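As a tiny illustration of the 64 MB chunk size (the helper below is an assumption for illustration, not part of the notes): a byte offset in a file maps to a chunk index plus an offset inside that chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def locate(byte_offset: int):
    # Which chunk holds this byte, and where inside that chunk?
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(locate(200 * 1024 * 1024))  # byte 200 MB -> (chunk 3, offset 8 MB)
```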

How to save an extra-large file on several machines? (10 PB)

One master + many ChunkServers

Slave Servers = Chunk Servers

Advantages
  • Reduce the size of metadata
  • Reduce traffic

Disadvantages
  • Waste space for small files

  • Metadata
    • File info
      • Name=xxx.mp4
      • CreatedTime=xxxxx
      • Size=aaaaa
      • Index
        • Chunk 11 -> diskOffset1
        • Chunk 12 -> diskOffset2
        • Chunk 13 -> diskOffset3

Improve the performance

The master does not record the diskOffset of a chunk.

Advantages

  • Reduce the size of metadata in master

  • Reduce the traffic between master and ChunkServer (a change of chunk offset does not need to notify the master)

  • Metadata
    • File info
      • Name=xxx.mp4
      • CreatedTime=xxxxx
      • Size=aaaaa
      • Index
        • Chunk 11 -> cs3
        • Chunk 12 -> cs3
        • Chunk 13 -> cs4

How much metadata space is needed to store a 10 PB file?

1 chunk = 64 MB of data needs 64 B of metadata

so 10 PB needs about 10 GB of metadata (worked out below)
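The arithmetic behind that estimate, assuming 64 B of metadata per 64 MB chunk:

```python
PB = 1024 ** 5
MB = 1024 ** 2
GB = 1024 ** 3

chunks = 10 * PB // (64 * MB)        # ~167.8 million chunks
metadata_bytes = chunks * 64         # 64 B of metadata per chunk
print(chunks, metadata_bytes / GB)   # 167772160 10.0  -> about 10 GB
```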

How to write a file

It is not easy to keep all those replicas consistent.

client -> write File_name=/xxxx.mp4, Chunk index=1 -> master

master -> assign Chunkserver_locations=US, CS1 -> client

client -> transfer data=/xxxx.mp4, Chunk index=1 -> chunkServer1

chunkServer1 -> write finished
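A toy sketch of that write flow (hypothetical classes; real GFS also pipelines data between replicas and uses leases, which this omits): the master only assigns a chunk server, and the client sends the data to that chunk server directly.

```python
class Master:
    def __init__(self):
        self.chunk_map = {}            # (file, chunk_index) -> chunk server id

    def assign_chunkserver(self, file, chunk_index):
        cs = "CS1"                     # would normally pick a server with free space
        self.chunk_map[(file, chunk_index)] = cs
        return cs

class ChunkServer:
    def __init__(self):
        self.chunks = {}

    def write(self, file, chunk_index, data):
        self.chunks[(file, chunk_index)] = data
        return "write finished"

master, cs1 = Master(), ChunkServer()
# 1. Client asks the master where chunk 1 of the file should live.
target = master.assign_chunkserver("/xxxx.mp4", 1)
# 2. Client sends the chunk data directly to that chunk server.
print(target, cs1.write("/xxxx.mp4", 1, b"...chunk bytes..."))
```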

How to modify a written file

Delete the file, then rewrite it.

How to read a file

Split the file into multiple chunks to read.

client -> File_name=/xxxx.mp4 -> master

master -> return a chunk list -> client

client -> read /xxxx.mp4 in chunkServer1 -> chunkServer1

chunkServer1 -> return data /xxxx.mp4-00-of-09
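And a matching standalone sketch of the read flow (plain dicts stand in for the master's chunk map and the chunk servers' storage): the master only serves the chunk list, and the client fetches each chunk directly.

```python
def read_file(file, chunk_map, chunk_store):
    # chunk_map: (file, chunk_index) -> chunk server id, as returned by the master
    # chunk_store: chunk server id -> {(file, chunk_index): bytes}
    chunk_list = sorted((idx, cs) for (f, idx), cs in chunk_map.items() if f == file)
    # The client contacts each chunk server directly; the master is not on the data path.
    return b"".join(chunk_store[cs][(file, idx)] for idx, cs in chunk_list)

chunk_map = {("/xxxx.mp4", 0): "CS1", ("/xxxx.mp4", 1): "CS2"}
chunk_store = {"CS1": {("/xxxx.mp4", 0): b"part0-"},
               "CS2": {("/xxxx.mp4", 1): b"part1"}}
print(read_file("/xxxx.mp4", chunk_map, chunk_store))  # b"part0-part1"
```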

Master Task

  • Store each file’s metadata
  • Store Map(file name + chunk index -> chunk server)
    • Find the corresponding chunkserver when reading
    • Allocate free chunkservers when writing

One Working Solution

  • storage
    • Ordinary file system: metadata, blocks
    • Large file storage: Block -> Chunk
    • Large files on multiple machines: Chunk Server + Master
  • write
    • Master + Client + ChunkServer communication process
    • Master maintains metadata and chunkserver tables
  • read
    • Master + Client + ChunkServer communication process

How to identify whether a chunk on the disk is broken

Use a checksum (see the sketch below).
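For example, a chunk server can keep a CRC32 checksum next to each chunk and recompute it on read; a mismatch means the chunk on disk is broken. A sketch using Python's zlib (an assumption for illustration; real GFS checksums 64 KB sub-blocks):

```python
import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

stored_data = b"...64MB of chunk bytes..."
stored_checksum = checksum(stored_data)          # written when the chunk is created

def is_broken(data_on_disk: bytes) -> bool:
    # Recompute on read; any bit flip on disk changes the CRC with high probability.
    return checksum(data_on_disk) != stored_checksum

print(is_broken(stored_data))                    # False
print(is_broken(b"...64MB of chunk bytez..."))   # True -> fetch a healthy replica
```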

How to avoid chunk data loss when a ChunkServer is down

Replica


Replica

3 replicas by default

Reliability: tolerate 2 failures

HDFS

The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.



Master-Slave

  • Storage
    • Save a file on one machine -> a big file on one machine -> an extra-large file on multiple machines
    • Multi-machine
      • How to use the master
      • How to handle the traffic and storage of the master?
  • Read
    • The process of reading a file
  • Write:
    • The process of writing a file
    • How to reduce master traffic?
      • Client and Chunk Server communicate directly
    • How to reduce client traffic?
      • Leader Election
  • Failure and Recover (key)
    • Discover the failure of a chunk?
      • Check Sum
    • Avoid the failure of a chunk?
      • Replica
    • Recover the failure?
      • Ask master
    • Discover the failure of the chunkserver?
      • Heart Beat (chunkservers->master)
    • Solve the failure of writing ChunkServer?
      • Retry

Design a lookup service

There are 10 billion key-value pairs in this system; when the user enters a key, it returns the corresponding value. Each key is 0.1 KB and each value is 1 KB. Required QPS >= 5000, latency < 200 ms.

Server Parameters need to know:

  • commodity server
  • 8 CPU cores on each server
  • 32 GB memory
  • 6 TB disk

Solution step 1: calculation

  • total key size: 1 TB

  • total value size: 10 TB

  • with 6 TB disks, a server with two disks (12 TB) would be enough to hold all the values

  • 1 TB = 1024 GB ~ 1000 GB of keys; with 32 GB of memory per server, about 40 servers can hold all the keys in memory

Step 2: find the server location of a key

  • For every request, 1 value of 1 KB needs to be returned

  • the total time for reading one value would be 10 ms (disk seek) + 1 KB / 1 MB * 30 ms (reading 1 KB sequentially from disk) ~ 10 ms

  • add a location mapping: each location is 8 bytes, so key + location is still about 0.1 KB each, or roughly 1 TB in total

  • 40 servers can store the <key, disk-address> pairs in memory (about 1 TB of memory needed in total)

  • a binary search over the in-memory index takes about 30 steps, whose time can be ignored

  • total latency is 10 + 0.5 = 10.5 ms

Step 3: meet the 5000 QPS demand

  • 40 servers, each with 2 disks; one disk supports about 100 seeks per second, so each server can support about 200 operations per second
  • total QPS = 200 * 40 = 8000 > 5000 (see the calculation below)
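Putting the back-of-the-envelope estimation together in one place (numbers copied from the steps above; this is just the arithmetic, not a design):

```python
keys = 10_000_000_000                 # 10 billion key-value pairs
key_kb, value_kb = 0.1, 1.0

total_key_tb   = keys * key_kb / 1024**3     # ~0.93 TB  -> "about 1 TB"
total_value_tb = keys * value_kb / 1024**3   # ~9.3 TB   -> "about 10 TB"

servers = 40                                  # 32 GB RAM each holds the key index
seek_ms, seq_read_ms = 10, 1 / 1024 * 30      # disk seek + reading 1 KB sequentially
latency_ms = seek_ms + seq_read_ms            # ~10 ms per lookup, well under 200 ms

qps_per_disk, disks_per_server = 100, 2
total_qps = servers * disks_per_server * qps_per_disk   # 8000 > 5000 required

print(total_key_tb, total_value_tb, latency_ms, total_qps)
```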

Scenario

Functional demand: search, read

QPS >= 5000

read:

service

QPS