mcrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
This research by Tanzima Z. Islam, Saurabh Bagchi, and Rudolf Eigenmann of Purdue University, together with Kathryn Mohror, Adam Moody, and Bronis R. de Supinski of Lawrence Livermore National Laboratory, introduces mcrEngine, a scalable checkpointing system built on data-aware aggregation and compression. The system addresses the limitations of current checkpoint-restart systems, namely poor scalability as the number of concurrent transfers and the volume of checkpoint data grow. Through data-aware aggregation and compression, mcrEngine reduces the number of concurrent transfers, improves the compressibility of checkpoints, and enhances application performance. Its design also decouples checkpoint transfer logic from applications, leading to more efficient operation. The transcript below explores the implications and outcomes of this approach to optimizing checkpointing for high-performance computing applications.
Presentation Transcript
mcrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression. Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann (Purdue University); Kathryn Mohror, Adam Moody, Bronis R. de Supinski (Lawrence Livermore National Lab).
Background. Checkpoint-restart is widely used: MPI applications take globally coordinated, application-level checkpoints through a data-format API and I/O library using a high-level format (HDF5, ADIOS, netCDF, etc.). Checkpoint-writing strategies to the parallel file system (PFS): N->N is easiest but causes contention on the PFS, N->1 is not scalable, and N->M is the best compromise but complex. Example of an application-level checkpoint:

Struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

HDF5 checkpoint {
    Group "/" {
        Group "ToyGrp" {
            DATASET "Temperature" {
                DATATYPE  H5T_IEEE_F32LE
                DATASPACE SIMPLE { (1024) / (1024) }
            }
            DATASET "Pressure" {
                DATATYPE  H5T_STD_U8LE
                DATASPACE SIMPLE { (20,30) / (20,30) }
            }
        }
    }
}
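The ToyGrp example maps a C struct onto an HDF5 group with two datasets. As a rough illustration only (not code from the paper), a minimal HDF5 C sketch that would produce that layout might look like the following; the function name is hypothetical, error handling is omitted, and the Pressure dataset follows the H5T_STD_U8LE type shown in the dump rather than the short in the struct:

```c
/* Hypothetical sketch: writing an application-level checkpoint in the
 * ToyGrp layout with the HDF5 C API. Error handling omitted for brevity. */
#include <hdf5.h>

void write_toygrp_checkpoint(const char *path,
                             const float *temperature,      /* 1024 values   */
                             const unsigned char *pressure) /* 20 x 30 values */
{
    hid_t file  = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/ToyGrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* DATASET "Temperature": 1-D array of 1024 little-endian 32-bit floats */
    hsize_t tdims[1] = {1024};
    hid_t tspace = H5Screate_simple(1, tdims, NULL);
    hid_t tset   = H5Dcreate2(group, "Temperature", H5T_IEEE_F32LE, tspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, temperature);

    /* DATASET "Pressure": 20x30 array of unsigned 8-bit integers */
    hsize_t pdims[2] = {20, 30};
    hid_t pspace = H5Screate_simple(2, pdims, NULL);
    hid_t pset   = H5Dcreate2(group, "Pressure", H5T_STD_U8LE, pspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(pset, H5T_NATIVE_UCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, pressure);

    H5Dclose(pset); H5Sclose(pspace);
    H5Dclose(tset); H5Sclose(tspace);
    H5Gclose(group); H5Fclose(file);
}
```

Checkpoints written through such a self-describing format are what later allow variables to be matched across processes by their meta-data.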
Impact of Load on PFS at Large Scale. IOR benchmark, 78MB of data per process, N->N checkpoint transfer. Observations: (-) large average write time and (-) large average read time, which lead to poor application performance and less frequent checkpointing. [Figure: average write time (s) and average read time (s) vs. number of processes (N).]
What is the Problem? Today's checkpoint-restart systems will not scale, because of the increasing number of concurrent transfers and the increasing volume of checkpoint data.
Our Contributions. Data-aware aggregation: reduces the number of concurrent transfers and improves the compressibility of checkpoints. Data-aware compression: reduces data almost 2x more than simply concatenating checkpoints and compressing them. Design and development of mcrEngine: an N->M checkpointing system that decouples checkpoint transfer logic from applications and improves application performance.
Overview: Background; Problem; Data aggregation & compression; Evaluation.
Data-Agnostic Schemes. Agnostic scheme: concatenate the checkpoints (C1, C2) as a whole, gzip the result in a single pass, and write it to the PFS. Agnostic-block scheme: interleave fixed-size B-byte blocks from each checkpoint (C1[1-B], C2[1-B], C1[B+1-2B], C2[B+1-2B], ...), then gzip and write to the PFS. Observations: (+) easy, (-) low compression ratio.
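For concreteness, here is a minimal sketch (an assumption, not mcrEngine code) of the agnostic-block idea: interleave fixed-size blocks from two checkpoint buffers and run a general-purpose compressor (zlib's compress(), standing in for gzip) over the merged stream:

```c
/* Hypothetical sketch of "agnostic-block" aggregation: interleave fixed-size
 * blocks of two checkpoints, then compress the merged stream with zlib.
 * Allocation checks are omitted for brevity. */
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Interleave c1 and c2 in B-byte blocks into out (sized len1 + len2). */
static void interleave_blocks(const unsigned char *c1, size_t len1,
                              const unsigned char *c2, size_t len2,
                              size_t B, unsigned char *out)
{
    size_t o = 0, i1 = 0, i2 = 0;
    while (i1 < len1 || i2 < len2) {
        size_t n1 = (len1 - i1 < B) ? len1 - i1 : B;
        size_t n2 = (len2 - i2 < B) ? len2 - i2 : B;
        memcpy(out + o, c1 + i1, n1); o += n1; i1 += n1;
        memcpy(out + o, c2 + i2, n2); o += n2; i2 += n2;
    }
}

/* Returns the compressed size, or 0 on failure. */
size_t agnostic_block_compress(const unsigned char *c1, size_t len1,
                               const unsigned char *c2, size_t len2,
                               size_t B, unsigned char **compressed)
{
    size_t total = len1 + len2;
    unsigned char *merged = malloc(total);
    interleave_blocks(c1, len1, c2, len2, B, merged);

    uLongf clen = compressBound(total);
    *compressed = malloc(clen);
    int rc = compress(*compressed, &clen, merged, total);
    free(merged);
    return (rc == Z_OK) ? (size_t)clen : 0;
}
```

The plain agnostic scheme is the same code with B set to the whole checkpoint size, i.e. simple concatenation before gzip.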
Identify Similar Variables Across Processes (Aware scheme). Process P0 writes Group ToyGrp { float Temperature[1024]; int Pressure[20][30]; }; process P1 writes Group ToyGrp { float Temperature[100]; int Pressure[10][50]; }. Variables are matched across checkpoints using their meta-data: 1. name, 2. data-type, 3. class (Array, Atomic). The Aware scheme then concatenates similar variables: C1.T, C2.T, C1.P, C2.P. A sketch of such a similarity test follows below.
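A minimal sketch (assumed data structures, not mcrEngine's) of how two checkpoint variables could be tested for similarity using exactly that meta-data:

```c
/* Hypothetical sketch: variables from different processes are "similar" if
 * their name, data-type, and class match; their sizes may differ. */
#include <string.h>

typedef enum { CLASS_ATOMIC, CLASS_ARRAY } var_class_t;

typedef struct {
    char        name[256];   /* e.g. "/ToyGrp/Temperature" */
    char        dtype[64];   /* e.g. "H5T_IEEE_F32LE"      */
    var_class_t vclass;      /* array or atomic            */
    size_t      nbytes;      /* payload size; 1024 floats on P0, 100 on P1 */
    void       *data;
} ckpt_var_t;

int vars_similar(const ckpt_var_t *a, const ckpt_var_t *b)
{
    return strcmp(a->name,  b->name)  == 0 &&
           strcmp(a->dtype, b->dtype) == 0 &&
           a->vclass == b->vclass;
}
```

Matching on meta-data rather than byte offsets is what lets Temperature[1024] from one process line up with Temperature[100] from another.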
Aware-Block Scheme. Variables are matched with the same meta-data (name, data-type, class), but instead of concatenating whole variables, similar variables are interleaved block by block: the first B bytes of Temperature from C1, then the first B bytes of Temperature from C2, then the next B bytes from each, and so on, with Pressure treated the same way.
Data-Aware Aggregation & Compression. Aware scheme: concatenate similar variables; Aware-Block scheme: interleave similar variables (C1.T, C2.T, C1.P, C2.P). First phase: data-type-aware compression (FPC, Lempel-Ziv, chosen per variable type) is applied to the merged variables (T, P, H, D) into an output buffer. Second phase: Gzip is run over that output buffer, and the result is written to the PFS. A sketch of the two phases appears below.
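A minimal sketch of the two-phase idea (an assumption, not the mcrEngine implementation): phase one dispatches on the variable's data type, phase two applies a general-purpose compressor over the phase-one output. The fpc_, fpzip_, and lz_ wrappers are hypothetical placeholders, not the real FPC, fpzip, or Lempel-Ziv APIs; only the zlib calls are real:

```c
/* Hypothetical sketch of data-type-aware, two-phase compression. */
#include <stdlib.h>
#include <zlib.h>

typedef enum { T_DOUBLE, T_FLOAT, T_OTHER } dtype_t;

/* Placeholder prototypes standing in for FPC, fpzip, and a Lempel-Ziv coder. */
size_t fpc_compress_doubles(const void *in, size_t n, void *out);
size_t fpzip_compress_floats(const void *in, size_t n, void *out);
size_t lz_compress(const void *in, size_t n, void *out);

/* Phase 1: pick a type-specific compressor for one merged variable. */
size_t first_phase(dtype_t t, const void *in, size_t n, void *out)
{
    switch (t) {
    case T_DOUBLE: return fpc_compress_doubles(in, n, out);
    case T_FLOAT:  return fpzip_compress_floats(in, n, out);
    default:       return lz_compress(in, n, out);
    }
}

/* Phase 2: gzip-style compression (zlib) over the concatenated phase-1 output. */
size_t second_phase(const unsigned char *buf, size_t n, unsigned char **out)
{
    uLongf clen = compressBound(n);
    *out = malloc(clen);
    return (compress(*out, &clen, buf, n) == Z_OK) ? (size_t)clen : 0;
}
```

Because phase one already exploits the semantics of each data type, the general-purpose second pass still finds redundancy, which is why the evaluation reports a benefit from double compression only in the data-aware case.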
How mcrEngine Works. CNC: compute-node component; ANC: aggregator-node component. Ranks are grouped in rank order (group size = 4 in the example) for N->M checkpointing. Within each group, the CNCs exchange meta-data so that similar variables (T, P, H, D) can be identified; the group's ANC requests those variables from the CNCs (e.g., T and P from some, H and D from others), applies data-aware aggregation and compression (with Gzip as the second phase), and writes the result to the PFS. A grouping sketch follows below.
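A minimal sketch (assumed, not mcrEngine source) of rank-order grouping for N->M checkpointing with MPI; treating the lowest rank of each group as the aggregator is an assumption made here for illustration:

```c
/* Hypothetical sketch: split ranks into rank-ordered groups of group_size
 * and nominate one aggregator per group. */
#include <mpi.h>

void make_groups(MPI_Comm world, int group_size,
                 MPI_Comm *group_comm, int *is_aggregator)
{
    int rank;
    MPI_Comm_rank(world, &rank);

    int color = rank / group_size;             /* rank-order grouping */
    MPI_Comm_split(world, color, rank, group_comm);

    int grank;
    MPI_Comm_rank(*group_comm, &grank);
    *is_aggregator = (grank == 0);             /* one aggregator (ANC) per group */
}
```

With group size 32 and 15,408 processes (the configuration used later in the evaluation), this yields roughly 482 aggregators writing to the PFS instead of 15,408 writers.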
Overview: Background; Problem; Data aggregation & compression; Evaluation.
Evaluation. Applications: ALE3D (4.8GB per checkpoint set), Cactus (2.41GB), Cosmology (1.1GB), Implosion (13MB). Experimental test-bed: LLNL's Sierra, a 261.3 TFLOP/s Linux cluster with 15,408 cores and a 1.3-petabyte Lustre file system. Compression algorithms: FPC [1] for double-precision floats, fpzip [2] for single-precision floats, Lempel-Ziv for all other data types, and Gzip for general-purpose compression.
Evaluation Metrics. Effectiveness of data-aware compression: What is the benefit of multiple compression phases? How does group size affect compression ratio? How does compression ratio change as a simulation progresses? Here, compression ratio = uncompressed size / compressed size. Performance of mcrEngine: overhead of the checkpointing phase and overhead of the restart phase.
Multiple Phases of Data-Aware Compression are Beneficial; No Benefit from Data-Agnostic Double Compression. Data-type-aware compression improves compressibility because the first phase changes the underlying data format. Data-agnostic double compression is not beneficial, because after one pass the data format is non-uniform and essentially incompressible. [Figure: compression ratio after the first and second phases, data-aware vs. data-agnostic, for ALE3D, Cactus, Cosmology, and Implosion.]
Impact of Group Size on Compression Ratio. Different merging schemes are better for different applications, and a larger group size is beneficial for certain applications; for ALE3D, compression ratio improves by 8% from group size 2 to 32. [Figure: compression ratio vs. group size (1-128) for ALE3D, Cactus, Cosmology, and Implosion under the Aware-Block and Aware schemes.]
Data-Aware Techniques Always Win over Data-Agnostic. The data-aware techniques always yield a better compression ratio than the data-agnostic techniques (up to 98-115% better for Cactus). [Figure: compression ratio vs. group size for ALE3D, Cactus, Cosmology, and Implosion, comparing the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes.]
Compression Ratio Follows the Course of the Simulation. The data-aware techniques always yield better compression across simulation time-steps. [Figure: compression ratio vs. simulation time-steps for Cactus, Cosmology, and Implosion under the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes.]
Relative Improvement in Compression Ratio Compared to the Data-Agnostic Scheme:

Application   Total Size (GB)   Aware-Block (%)   Aware (%)
ALE3D         4.8               6.6 - 27.7        6.6 - 12.7
Cactus        2.41              10.7 - 11.9       98 - 115
Cosmology     1.1               20.1 - 25.6       20.6 - 21.1
Implosion     0.013             36.3 - 38.4       36.3 - 38.8
Impact of Aggregation on Scalability. Used IOR. N->N: each process transfers 78MB. N->M: group size 32, 1.21GB per aggregator. [Figure: average write time (s) and average read time (s) vs. number of processes (N) for N->N and N->M transfers.]
Impact of Data-Aware Compression on Scalability. IOR with N->M transfer and groups of 32 processes; data-aware: 1.2GB, data-agnostic: 2.4GB per aggregator. Data-aware compression improves I/O performance at large scale: 43%-70% improvement during writes and 48%-70% during reads. [Figure: average transfer time (s) vs. number of processes (N) for agnostic and aware writes and reads.]
End-to-End Checkpointing Overhead. 15,408 processes, group size of 32 for the N->M schemes, each process takes a checkpoint. Data-aware aggregation and compression converts a network-bound operation into a CPU-bound one, with highlighted reductions in checkpointing overhead of 87% and 51%. [Figure: total checkpointing overhead (s), split into CPU overhead and transfer overhead to the PFS, for ALE3D and Cactus under No Comp.+N->N, Indiv. Comp.+N->N, No Comp.+N->M, Indiv. Comp.+N->M, Agnostic+Agg, and Aware+Agg configurations.]
End-to-End Restart Overhead. Data-aware aggregation and compression reduces overall restart overhead by reducing network load and transfer time, with highlighted reductions of 62%, 64%, 43%, and 71% in recovery and I/O overhead. [Figure: total recovery overhead (s), split into CPU overhead and transfer overhead to the PFS, for ALE3D and Cactus under No Comp.+N->N, No Comp.+N->M, Indiv. Comp.+N->M, Agnostic+Agg, and Aware+Agg configurations.]
Conclusion. Developed a data-aware checkpoint compression technique with relative improvements in compression ratio of up to 115%; investigated different merging techniques; evaluated effectiveness using real-world applications. Designed and developed a scalable framework that implements N->M checkpointing, improves application performance, and transforms checkpointing from a network-bound into a CPU-bound operation.
Contact Information. Tanzima Islam (tislam@purdue.edu), website: web.ics.purdue.edu/~tislam. Acknowledgement. Purdue: Saurabh Bagchi (sbagchi@purdue.edu), Rudolf Eigenmann (eigenman@purdue.edu). Lawrence Livermore National Laboratory: Kathryn Mohror (kathryn@llnl.gov), Adam Moody (moody20@llnl.gov), Bronis R. de Supinski (bronis@llnl.gov).
Backup Slides
[Backup Slide] Failures in HPC. Source: "A Large-Scale Study of Failures in High-Performance Computing Systems," by Bianca Schroeder and Garth Gibson. [Figure: the breakdown of failures into root causes (a) and the breakdown of downtime into root causes (b), across the categories hardware, software, network, environment, human, and unknown, for systems of type D, E, F, G, and H and aggregated across all systems.]
Future Work. An analytical solution to group-size selection? A better grouping strategy than rank-order grouping? Variable streaming (an engineering challenge)?
References
1. M. Burtscher and P. Ratanaworabhan, "FPC: A High-Speed Compressor for Double-Precision Floating-Point Data."
2. P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data."
3. L. Reinhold, "QuickLZ."