Learning System Design as a Landscape Architect, Part 8
Rethinking system design in a more fun way, as a former urban planner/landscape planner. This time the example is Bigtable.
Bigtable
NoSQL, and how to scale
SQL and complex queries are needed in real life, and both are ultimately built on the file system.
A DB system stores data as files on the device's local filesystem and provides an interface to operate on them.
To better understand why we need a DB system: you open a closet, find the drawer, and grab the socks you want to wear today.
In this analogy, the DB system is the drawer and the file is the sock.
Scenario analysis: search for a key, return its value.
Storage
How about saving all the data in memory? Bad idea: in a distributed file system, file sizes are often measured in PB.
How about saving on disk (with sorting and binary search)?
But how do we edit a file saved on disk?
Method 1: edit the file in place. (If the original record is 4 bytes and the edit grows it to 8 bytes, everything after it has to move.)
Method 2: read the file, edit it, delete the original, and rewrite the whole file.
Method 3: don't edit at all; append a new record at the end of the file (fast).
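A minimal sketch of method 3 in Python; the tab-separated record format and file path are illustrative assumptions, not Bigtable's actual on-disk format:

```python
# Sketch of method 3: append-only writes. Appending never moves existing
# bytes, so every write is one fast sequential I/O at the end of the file.

def append_record(log_path, key, value):
    with open(log_path, "a") as f:
        f.write(f"{key}\t{value}\n")

def read_latest(log_path, key):
    # In an append-only file, the newest record for a key is the last match.
    latest = None
    with open(log_path) as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t")
            if k == key:
                latest = int(v)
    return latest
```

Note that "editing" a key just appends a second record, which is exactly why the read side has to figure out which record is the newest.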
Bigtable takes method 3: it uses APPEND to improve write performance.
Cons:
- When reading data, how do we recognize the newest record?
  - timestamp: the latest record for a key wins
- How do we binary-search unsorted data?
  - keep each block in sorted order
- Duplicate keys across blocks hurt query performance
  - merge K sorted blocks
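The merge-K-sorted-blocks step can be sketched like this; representing a block as a sorted list of (key, value) pairs, with a larger block index meaning newer data, is an assumption for illustration:

```python
import heapq

def merge_blocks(blocks):
    # blocks[i] is sorted by key; a larger block index means newer data.
    # Tag each record with its block index so duplicate keys resolve
    # to the newest value during the merge.
    tagged = [[(key, idx, value) for key, value in blk]
              for idx, blk in enumerate(blocks)]
    merged = []
    # heapq.merge lazily merges already-sorted inputs in O(N log K).
    for key, idx, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)   # same key, larger idx: keep newest
        else:
            merged.append((key, value))
    return merged
```

This is essentially what compaction does: it removes the duplicate records that the append-only write path leaves behind.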
Write process
Client 1 – Write key: Aa + appearance value: 3 –> Memory (file 0, file 1, ...) – Append key: Aa + appearance value: 3 –> file 1 (unordered)
When file 1 reaches 256 (or 512) bytes – merge/sort –> ordered block –> open a new unordered block
Making the latest file go from unordered to ordered:
- Method 1: sort in memory (write once to disk, read once from disk, sort in memory, write once to disk)
- Method 2: keep the data sorted in memory from the beginning (sort in memory, write once to disk)
Client 1 – Write key: Aa + appearance value: 3 –> Memory – Append key: Aa + appearance value: 3 –> Sorted List (when it reaches full size) – Serialize to disk –> file 1 (ordered)
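A toy sketch of method 2: a plain dict that is sorted at flush time stands in for the skip list a real system uses, and the flush threshold and file format are illustrative assumptions:

```python
# Sketch of method 2: buffer recent writes in memory, then serialize the
# whole block to disk, already sorted, in one sequential pass.

class MemTable:
    def __init__(self, flush_threshold=4):
        self.data = {}                      # in-memory buffer
        self.flush_threshold = flush_threshold

    def put(self, key, value, path):
        self.data[key] = value
        if len(self.data) >= self.flush_threshold:
            self.flush(path)

    def flush(self, path):
        # One sequential write of an already-sorted block (an "SSTable").
        with open(path, "w") as f:
            for key in sorted(self.data):
                f.write(f"{key}\t{self.data[key]}\n")
        self.data = {}
```

Compared with method 1, the block never hits disk in unordered form, so one disk write and one disk read are saved per block.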
How to recover the in-memory data after a crash, whatever the file format:
Write-Ahead Log
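The write-ahead log idea in a minimal sketch (the log format is an assumption): persist every write to a durable log before applying it in memory, so the memtable can be replayed after a crash.

```python
# Sketch of a write-ahead log: the write is appended to disk before it
# touches the in-memory table, so nothing in memory is ever the only copy.

def wal_put(log_path, memtable, key, value):
    with open(log_path, "a") as f:
        f.write(f"{key}\t{value}\n")    # 1. persist the intent first
        f.flush()
    memtable[key] = value               # 2. then apply it in memory

def wal_recover(log_path):
    # Replay the log in order; later entries overwrite earlier ones.
    memtable = {}
    with open(log_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            memtable[key] = int(value)
    return memtable
```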
Read process
Client 1 – 1. Read key: Aa + appearance –> Memory – 2. Read key: Aa + appearance –> Sorted List (Aa + appearance value: 3)
If there is no result – 3. Read key: Aa + appearance –> file 1 (ordered)
How to query an ordered file:
- Index
- BloomFilter
One easy way to build an index
Memory index (one entry per block):
- A: address
- D: address
- S: address
GFS SSTable 0:
- Apple: 10
- Cxxx: 3
- Daaaa: 8
- Sxxx: 9
- Xuuu: 19
To look up key Xuuu, we only need to binary-search the block that starts at index entry S.
In real life, a B-tree index is a good choice.
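The sparse-index lookup above can be sketched with the standard-library `bisect` module; representing the index as a sorted list of (first_key, block_id) pairs is an assumption for illustration:

```python
import bisect

# Sparse index over a sorted SSTable: one (first_key, block_id) entry per
# block. bisect finds the only block whose key range can contain the key,
# so only that block needs a binary search, not the whole file.

def find_block(index, key):
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    return index[pos][1] if pos >= 0 else None
```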
One better way to decide whether to read a file at all
- Bloom filter
- using 2 hash functions
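A minimal Bloom filter with 2 hash functions; using md5/sha1 as the hashes and a 1024-bit array are illustrative choices. "Might contain" can be a false positive, but "definitely absent" is exact, which is what lets the read path skip an SSTable entirely.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024):
        self.m = m
        self.bits = [False] * m

    def _hashes(self, key):
        # Two independent hash positions per key (illustrative choice).
        h1 = int(hashlib.md5(key.encode()).hexdigest(), 16) % self.m
        h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16) % self.m
        return (h1, h2)

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = True

    def might_contain(self, key):
        # False means the key was never added; True means "maybe".
        return all(self.bits[h] for h in self._hashes(key))
```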
Complete read process with index and Bloom filter
Memory:
- file 0, Address 0
- Index 0, Bloom filter 0
- file 1, Address 1
- Index 1, Bloom filter 1
Sorted List:
- xx + appearance: 9
Disk: File 0 (ordered):
- aa + appearance: 3
- bb + appearance: 8
- Index 0
- Bloom filter 0
Disk: File 1 (ordered):
- cc + appearance: 3
- dd + appearance: 8
- xx + appearance: 9
- Index 1
- Bloom filter 1
First, query the in-memory data:
Client 1 – 1. Read key: Aa + appearance –> Memory – 2. Read key: Aa + appearance –> Sorted List (Aa + appearance value: 3)
Return the result if it exists.
If not, query the ordered files starting from the latest one.
Use the Bloom filter to check for the key: move on to the next file if the key definitely isn't there; otherwise use the index to find the value for the key.
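The whole read path above can be sketched in a few lines; here each SSTable is modeled as a plain dict, with a set of its keys standing in for the Bloom filter, which is a simplification of the real structures:

```python
# Read path: memtable first, then SSTables from newest to oldest, using
# each file's Bloom filter to skip files that definitely lack the key.

def read(key, memtable, sstables):
    # memtable: dict of the newest, not-yet-flushed writes
    if key in memtable:
        return memtable[key]
    # sstables: list of (bloom, table) pairs, newest last
    for bloom, table in reversed(sstables):
        if key not in bloom:        # definite miss: skip the whole file
            continue
        if key in table:            # index + binary search in a real system
            return table[key]
    return None
```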
SSTable = Sorted String Table
Sorted List = Skip List
Complete write process
Client 1 – Write key: Aa + appearance value: 3 –> Memory – Append key: Aa + appearance value: 3 –> Skip List (sorted, in memory) – when full –> SSTable 1 (ordered, on disk)
How to read/write a key: value from a 1 PB file
Sharding
- Horizontal sharding
- Consistent hashing (on the row key, e.g. name)
Multiple machines:
- Master holds a HashMap(key –> server address)
- Slaves hold the data
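The consistent-hashing step can be sketched as a hash ring; the server names, the 32-bit hash space, and using md5 are illustrative assumptions, and real systems add virtual nodes to balance the load:

```python
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, servers):
        # Place each server at a fixed point on the ring.
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

    def server_for(self, row_key):
        h = self._hash(row_key)
        # First server clockwise from the key's position (wrap around).
        pos = bisect.bisect(self.ring, (h, ""))
        return self.ring[pos % len(self.ring)][1]
```

The point of the ring is that adding or removing a server only remaps the keys between it and its neighbor, instead of reshuffling everything as `hash(key) % N` would.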
How to read from BigTable with multiple servers
Client – 1. Read key –> BigTable Master
– 2. Get the row key from the key, use the row key to look up server id = 1 –> client
Client – 3. Read key –> Slave Server 1
– 4. Return value –> client
How to write to BigTable
Write a key:
Client – 1. Write key –> BigTable Master
– 2. Get the row key from the key, use the row key to look up server id = 1 –> client
Client – 3. Write key, value –> Slave Server 1
– 4. Done –> client
Memory
- Write-Ahead Log first
- then save the data into memory
Disk
- when the SkipList is full
- write it to an SSTable
Too much data for the slave server's local disk?
Save the data in GFS.
Tablet Server = the slave server in BigTable that stores tablets
Client –> Tablet Server 1 –> GFS
Reading and writing the same key at the same time?
Distributed lock
- ZooKeeper
Write a key with a distributed lock
Client 1 – 1. Write key and lock the key –> Lock Server
– 2. Get the row key from the key –> client 1 – 3. key, value –> Tablet Server 1
– 4. Done –> client 1
Client 1 – 5. Unlock key –> Lock Server
Client 2 – 1. Read key and lock the key –> Lock Server
– 2. key is locked –> client 2
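A toy sketch of the lock-server interaction: a single in-memory lock table stands in for ZooKeeper, which is a big simplification (no sessions, timeouts, or fault tolerance). `acquire` returning False models step 2 above, where client 2 is told the key is locked.

```python
class LockServer:
    def __init__(self):
        self.locks = {}                 # key -> client id holding the lock

    def acquire(self, key, client):
        if key in self.locks and self.locks[key] != client:
            return False                # someone else holds the lock
        self.locks[key] = client
        return True

    def release(self, key, client):
        # Only the holder may release its own lock.
        if self.locks.get(key) == client:
            del self.locks[key]
```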
Advantages of the distributed lock
- the metadata is stored on the Lock Server
- the master doesn't need to store metadata
Write with lock
Client 1 – 1. Write key and lock the key –> Lock Server – 2. Return server id –> client 1
Client 1 – 3. key, value –> Tablet Server 1 – 4. Write log
Tablet Server 1 – 5. Write key, value –> Memory – 6. Write to GFS –> GFS
Read with lock
Client 1 – 1. Read key and lock the key –> Lock Server – 2. Return server id –> client 1
Client 1 – 3. Read key –> Tablet Server 1 – 4. Check memory, then the Bloom filter and index on each SSTable –> Memory
Tablet Server 1 – 5. Binary search in the SSTable to find the value –> GFS