
Scaling Down Peak Loads Through an I/O Off-Loading Strategy
Explore how Everest tackles unexpected I/O peaks on servers through write off-loading to improve performance and mitigate response-time problems. Learn how workload properties are exploited to optimize stores for peak loads while maintaining data consistency and recoverability.
Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research Cambridge, UK

Problem: I/O peaks on servers
- Short, unexpected peaks in I/O load; this is not about predictable trends
- Peaks are uncorrelated across servers in a data center, and across volumes on a single server
- Bad I/O response times during peaks

Example: Exchange server
- Production mail server: 5,000 users, 7.2 TB across 8 volumes
- Well provisioned: hardware RAID, NVRAM, over 100 spindles
- 24-hour block-level I/O trace
- At peak load, response time is 20x the mean
- Peaks are uncorrelated across volumes

Exchange server load
[Chart: load (reqs/s/volume, log scale) over the 24-hour trace]

Write off-loading
[Diagram: an Everest client sits in front of the volume and a set of Everest stores. With no off-loading, reads and writes go to the volume. When off-loading, writes go to an Everest store while reads go to wherever the latest version lives. When reclaiming, the client reads off-loaded data back from the stores and writes it to the volume.]

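The switching between these three modes can be sketched in a few lines of Python. This is a minimal illustration, not Everest's actual interface: the class and method names, the response-time trigger, and the threshold value are all assumptions.

```python
from enum import Enum, auto

class Mode(Enum):
    NO_OFFLOADING = auto()  # all reads and writes go to the base volume
    OFFLOADING = auto()     # writes are redirected to Everest stores
    RECLAIMING = auto()     # off-loaded data is written back to the volume

class EverestClient:
    """Sketch of the per-volume off-loading state machine (names hypothetical)."""

    PEAK_THRESHOLD_MS = 50.0  # illustrative trigger value, not from the paper

    def __init__(self, volume, stores):
        self.volume = volume
        self.stores = stores
        self.mode = Mode.NO_OFFLOADING
        self.offloaded = {}  # block -> (store, version) for un-reclaimed data

    def tick(self, mean_response_ms):
        """Called periodically with the volume's observed response time."""
        if self.mode is Mode.NO_OFFLOADING and mean_response_ms > self.PEAK_THRESHOLD_MS:
            self.mode = Mode.OFFLOADING       # peak detected: start off-loading writes
        elif self.mode is Mode.OFFLOADING and mean_response_ms < self.PEAK_THRESHOLD_MS:
            self.mode = Mode.RECLAIMING       # peak over: drain data back in the background
        elif self.mode is Mode.RECLAIMING and not self.offloaded:
            self.mode = Mode.NO_OFFLOADING    # everything reclaimed: back to normal
```
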
Exploits workload properties
- Peaks are uncorrelated across volumes, so a loaded volume can find less-loaded stores
- Peaks have some writes: off-loading the writes means reads see less contention
- Few foreground reads on off-loaded data: it was recently written, hence still in the buffer cache
- Stores can therefore be optimized for writes

Challenges
- Any write can go anywhere, to maximize the potential for load balancing
- Reads must always return the latest version, split across stores and the base volume if required (see the sketch below)
- State must be consistent and recoverable: track both current and stale versions, with no meta-data writes to the base volume

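To make the split-read requirement concrete: if the client keeps a map from block to the location and version of the latest write, a read covering several blocks may have to gather some blocks from stores and others from the base volume. A minimal sketch, assuming hypothetical `store.read` and `volume.read` methods:

```python
def read_range(first_block, count, offload_map, volume):
    """Serve a read that may span off-loaded and non-off-loaded blocks.

    offload_map: block -> (store, version) for blocks whose latest version
    is off-loaded; every other block's latest version is on the volume.
    """
    result = bytearray()
    for block in range(first_block, first_block + count):
        if block in offload_map:
            store, version = offload_map[block]
            result += store.read(block, version)  # latest version is on a store
        else:
            result += volume.read(block)          # latest version is on the volume
    return bytes(result)
```
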
Design features
- Recoverable soft state
- Write-optimized stores
- Reclaiming off-loaded data
- N-way off-loading
- Load-balancing policies

Recoverable soft state
- Meta-data is needed to track off-loads: block ID -> <location, version>
- Latest version as well as old (stale) versions
- Meta-data cached in memory, on both clients and stores
- Off-loaded writes carry a meta-data header: 64-bit version, client ID, block range

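The slides give the header fields but not their exact layout; the sketch below packs them with Python's struct module, assuming (hypothetically) 32-bit client ID and block-count fields and a 64-bit starting block alongside the 64-bit version.

```python
import struct

# Hypothetical on-disk layout for the off-load record header.
# Only "64-bit version, client ID, block range" comes from the slides;
# the remaining field widths are assumptions.
HEADER = struct.Struct("<QIQI")  # version, client_id, first_block, block_count

def pack_header(version, client_id, first_block, block_count):
    return HEADER.pack(version, client_id, first_block, block_count)

def unpack_header(raw):
    version, client_id, first_block, block_count = HEADER.unpack(raw)
    return {"version": version, "client_id": client_id,
            "first_block": first_block, "block_count": block_count}
```
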
Recoverable soft state (2)
- Meta-data is also persisted on stores; no synchronous writes to the base volume
- Stores write data + meta-data as one record
- The store set is persisted on the base volume: small and infrequently changing
- Client recovery: contact the store set
- Store recovery: read the log from disk

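Given those rules, client recovery can be sketched as follows: read the store set from the base volume, ask each store for the metadata headers it holds, and keep the highest version per block. The `scan_metadata` call is a hypothetical stand-in for however a store reports its log contents.

```python
def recover_client_metadata(store_set, client_id):
    """Rebuild the in-memory block map after a client crash (sketch).

    store_set: the stores listed in the small, rarely-changing record
    persisted on the base volume.
    """
    latest = {}  # block -> (store, version)
    for store in store_set:
        for hdr in store.scan_metadata(client_id):   # hypothetical API
            blocks = range(hdr["first_block"],
                           hdr["first_block"] + hdr["block_count"])
            for block in blocks:
                if block not in latest or hdr["version"] > latest[block][1]:
                    latest[block] = (store, hdr["version"])
    return latest
```
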
Everest stores
- Short-term, write-optimized storage
- Simple circular log in a small file or partition on an existing volume
- Not LFS: data is reclaimed, so there is no cleaner
- Monitors load on the underlying volume; only used by clients when it is lightly loaded
- One store can support many clients

Reclaiming in the background
[Diagram: a reclaim thread in the Everest client issues "read any" to a store, which returns <block range, version, data>; the client writes the data to the volume and then sends delete(block range, version) to the store.]
- Multiple concurrent reclaim threads
- Efficient utilization of disk/network resources

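One reclaim thread then amounts to the loop below. The `read_any`, volume write, and versioned `delete` follow the message names on the slide; `is_latest` and `forget` are hypothetical client bookkeeping, and a real implementation must also order these steps against concurrent foreground writes.

```python
def reclaim_loop(client, store):
    """Drain off-loaded data from one store back to the base volume (sketch)."""
    while True:
        record = store.read_any(client.client_id)   # any off-loaded record
        if record is None:
            break                                   # nothing left on this store
        block_range, version, data = record
        if client.is_latest(block_range, version):
            client.volume.write(block_range, data)  # reclaim the live version
        # Stale versions are simply deleted; the live version may be deleted
        # once the volume write above has made the data durable.
        store.delete(block_range, version)
        client.forget(block_range, version)
```
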
Correctness invariants
- I/O on an off-loaded range is always off-loaded: reads are sent to the correct location; writes ensure the latest version is recoverable
- Foreground I/Os are never blocked by reclaim
- Deletion of a version is only allowed if a newer version has been written to some store, or the data has been reclaimed and older versions deleted
- All off-loaded data is eventually reclaimed

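The deletion rule in particular is easy to get wrong, so here it is restated as a predicate. This is an illustrative rendering of the invariant with a made-up view of the client's bookkeeping, not Everest's actual data structures.

```python
def may_delete(block_range, version, store_versions, reclaimed):
    """True if (block_range, version) may be deleted, per the invariant:
    either a newer version exists on some store, or this version has been
    reclaimed to the volume and all older versions are already deleted.
    """
    versions = store_versions.get(block_range, ())   # versions still on stores
    newer_on_store = any(v > version for v in versions)
    was_reclaimed = version in reclaimed.get(block_range, ())
    older_deleted = all(v >= version for v in versions)
    return newer_on_store or (was_reclaimed and older_deleted)
```
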
Evaluation
- Exchange server traces
- OLTP benchmark
- Scaling
- Micro-benchmarks
- Effect of NVRAM
- Sensitivity to parameters
- N-way off-loading

Exchange server workload
- Replay of the Exchange server trace: 5,000 users, 8 volumes, 7.2 TB, 24 hours
- Time segments with peaks chosen; segments extended to cover all reclaim activity
- Our server (14 disks, 2 TB) can fit 3 Exchange volumes
- A subset of volumes replayed for each segment

Trace segment selection
[Chart: total I/O rate (reqs/s, log scale) over the 24-hour trace, with three segments highlighted as Peak 1, Peak 2, and Peak 3]

Three volumes/segment
[Diagram: for each segment, the min-, median-, and max-loaded volumes are replayed by trace clients, each paired with an Everest store occupying 3% of a volume]

Mean response time
[Chart: mean response time (ms) for reads and writes in Peaks 1-3, with and without off-load]

99th percentile response time
[Chart: 99th-percentile response time (ms) for reads and writes in Peaks 1-3, with and without off-load]

Exchange server summary
- Substantial improvement in I/O latency on a real enterprise server workload
- Both reads and writes, mean and 99th percentile
- What about application performance? An I/O trace cannot show end-to-end effects
- Where is the benefit coming from? Extra resources, log structure, ...?

OLTP benchmark
[Diagram: an OLTP client (10 min warmup, 10 min measurement) drives a SQL Server binary; Detours DLL redirection interposes the Everest client between SQL Server and its Data and Log volumes, with an Everest store reachable over the LAN]

OLTP throughput
[Chart: throughput (tpm) under no off-load, off-load, log-structured, 2-disk striped, and striped + log-structured configurations; annotations ask "2x disks, 3x speedup?" and attribute the gain to the extra disk plus the log layout]

Off-loading is not a panacea
- Works for short-term peaks; cannot be used to improve performance 24/7
- Data is usually reclaimed while the store is still idle; long-term off-load leads to eventual contention
- Data is reclaimed before the store fills up; long-term use would raise the log-cleaner issue

Conclusion
- Peak I/O is a problem; Everest solves it through off-loading
- Modifies the workload at the block level, removing writes from the overloaded volume
- Off-loading is short-term: data is reclaimed
- Consistency and persistence are maintained; state is always correctly recoverable

Questions?

Why not always off-load?
[Diagram: two OLTP clients drive SQL Server 1 (with an Everest client) and SQL Server 2; reads and writes flow to each server's Data volume, and stores are placed on the servers' volumes]

10 min off-load, 10 min contention
[Chart: speedup for off-load, contention (server 1), and contention (server 2)]

Mean and 99th percentile (log scale)
[Chart: response time (ms, log scale) for reads and writes in Peaks 1-3, with and without off-load]

Read/write ratio of peaks
[Chart: cumulative fraction vs. % of writes]

Exchange server response time
[Chart: response time (s, log scale) over the 24-hour trace]

Exchange server load (volumes)
[Chart: max, mean, and min per-volume load (reqs/s, log scale) over the 24-hour trace]

Effect of volume selection
[Charts: load (reqs/s/volume) over time during Peaks 1, 2, and 3, comparing all volumes with the selected volumes]

Scaling with #stores
[Diagram: the OLTP setup as before (OLTP client, SQL Server binary with Detours DLL redirection, Everest client, Data and Log volumes), now with multiple stores reachable over the LAN]

Scaling: linear until CPU-bound
[Chart: speedup vs. number of stores (0-3)]

Everest store: circular log layout
[Diagram: a header block followed by a circular log; records are appended at the head, the active log runs from tail to head, and reclaim and delete leave stale records that the tail advances past]

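The head/tail arithmetic of such a log can be sketched as below, using fixed-size record slots for simplicity; the real store writes variable-size data-plus-metadata records and persists a header block, which this sketch omits.

```python
class CircularLog:
    """Circular log sketch: append at the head, free space by advancing the tail."""

    def __init__(self, capacity):
        self.capacity = capacity          # number of record slots
        self.slots = [None] * capacity
        self.head = 0                     # next slot to append into
        self.tail = 0                     # oldest record not yet passed
        self.used = 0                     # slots between tail and head

    def append(self, record):
        if self.used == self.capacity:
            raise IOError("log full: client must reclaim or use another store")
        self.slots[self.head] = record
        self.head = (self.head + 1) % self.capacity
        self.used += 1
        return (self.head - 1) % self.capacity  # slot index of the new record

    def delete(self, index):
        """Mark a record stale/reclaimed; space is freed when the tail passes it."""
        self.slots[index] = None

    def advance_tail(self):
        """Reclaim space occupied by deleted records at the tail of the log."""
        while self.used and self.slots[self.tail] is None:
            self.tail = (self.tail + 1) % self.capacity
            self.used -= 1
```
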
Exchange server load: CDF
[Chart: cumulative fraction vs. request rate per volume (reqs/s, log scale)]

Unbalanced across volumes
[Chart: CDFs of min, mean, and max request rate per volume (reqs/s, log scale)]