
Effective Data Compression Algorithms for GPUs by Annie Yang and Martin Burtscher
Explore the research on synthesizing effective data compression algorithms for GPUs, focusing on the development of a brand-new lossless compression algorithm for floating-point data. The study addresses the challenges of optimizing compression algorithms for GPUs to enhance transfer throughput and decrease storage requirements in high-performance computing systems. Discover the approach taken, including breaking down existing algorithms, synthesizing new algorithms, and automating the compression and decompression processes.
Presentation Transcript
Synthesizing Effective Data Compression Algorithms for GPUs
Annie Yang and Martin Burtscher*
Department of Computer Science
Highlights
- MPC compression algorithm: a brand-new lossless compression algorithm for single- and double-precision floating-point data, systematically derived to work well on GPUs
- MPC features: compression ratio similar to the best CPU algorithms, much higher throughput, and little internal state (no tables or dictionaries)
Introduction
- High-performance computing systems depend increasingly on accelerators and process large amounts of floating-point (FP) data (IEEE format: sign bit, exponent, mantissa)
- Moving this data is often the performance bottleneck
- Data compression can increase transfer throughput and reduce storage requirements, but only if it is effective, fast (real-time), and lossless
Problem Statement
- Existing FP compression algorithms for GPUs are fast but compress poorly
- Existing FP compression algorithms for CPUs compress much better but are slow: parallel codes run serial algorithms on multiple chunks, there is too much state per thread for a GPU implementation, and the best serial algorithms may not be scalably parallelizable
- Do effective FP compression algorithms for GPUs exist? And if so, how can we create such an algorithm?
Our Approach
- Need a brand-new massively-parallel algorithm
- Study existing FP compression algorithms, break them down into constituent parts, keep only the GPU-friendly parts, and generalize them as much as possible
- This resulted in algorithmic components (illustration credit: Charles Trevelyan for http://plus.maths.org/)
- CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
- Components operate on the integer representation of the data
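Operating on the integer representation means viewing each FP value's bit pattern as an unsigned integer, which is exact and lossless. A minimal serial sketch of this reinterpretation (the function names are mine, not from the talk):

```python
import struct

def doubles_to_words(values):
    """Reinterpret IEEE-754 doubles as unsigned 64-bit integers (exact, lossless)."""
    return [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]

def words_to_doubles(words):
    """Inverse reinterpretation: restores the exact original bit patterns."""
    return [struct.unpack("<d", struct.pack("<Q", w))[0] for w in words]
```

In CUDA this is simply a pointer cast (viewing a double buffer as 64-bit unsigned integers), with no computation or copying.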
Our Approach (cont.)
- Automatically synthesize compression algorithms by chaining components
- Use exhaustive search to find the best four-component chains
- Synthesize the decompressor: employ inverse components that perform the opposite transformation on the data
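The chain-plus-inverse-chain idea can be illustrated with a tiny two-component example. The components below (a delta predictor and the bit-flipping INV mutator) are simplified serial stand-ins for the CUDA implementations, and the boundary rule for the first value is a guess:

```python
MASK = (1 << 64) - 1  # arithmetic on 64-bit words, modulo 2^64

def lnv1s(block):
    """Forward component: residual = current value minus previous value."""
    return [(v - (block[i - 1] if i else 0)) & MASK for i, v in enumerate(block)]

def lnv1s_inverse(resid):
    """Inverse component: a running sum reconstructs the original values."""
    out = []
    for r in resid:
        out.append((r + (out[-1] if out else 0)) & MASK)
    return out

def inv(block):
    """INV flips all bits; it is its own inverse."""
    return [v ^ MASK for v in block]

def compress(block):    # example chain: LNV1s then INV
    return inv(lnv1s(block))

def decompress(block):  # inverse components, applied in reverse order
    return lnv1s_inverse(inv(block))
```

Because every component has an exact inverse, the synthesized decompressor is obtained mechanically from the compressor chain.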
Mutator Components
- Mutators computationally transform each value and do not use information about any other value
- NUL outputs the input block (identity)
- INV flips all the bits
- "|", called cut, is a singleton pseudo component that converts a block of words into a block of bytes: merely a type cast, i.e., no computation or data copying; byte granularity can be better for compression
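A serial sketch of INV and the cut, assuming 32-bit words and little-endian byte order (on the GPU the cut is merely a type cast, so the explicit serialization below exists only to make the semantics visible):

```python
import struct

def inv32(block):
    """INV: flip all bits of each 32-bit word."""
    return [v ^ 0xFFFFFFFF for v in block]

def cut(words):
    """'|' (cut): view a block of 32-bit words as a block of bytes.
    On the GPU this is a pointer cast; here we serialize explicitly
    (little-endian, as on x86 and NVIDIA hardware)."""
    return list(struct.pack("<%dI" % len(words), *words))

def cut_inverse(byte_block):
    """Reassemble 32-bit words from the byte view."""
    return list(struct.unpack("<%dI" % (len(byte_block) // 4), bytes(byte_block)))
```

After the cut, all downstream components operate at byte rather than word granularity, which can expose redundancy that word-level components miss.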
Shuffler Components
- Shufflers reorder whole values or the bits of values and do not perform any computation
- Each thread block operates on a chunk of values
- BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
- DIMn groups values by dimension n (tested n = 2, 3, 4, 5, 8, 16, and 32); for example, DIM2 turns the sequence A, B, C, D, E, F into A, C, E, B, D, F
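Serial sketches of the two shufflers, assuming 32-bit words; the repacking of the emitted bits back into words is my assumption about the output layout:

```python
W = 32  # word width in bits

def bit_shuffle(block):
    """BIT: emit the most significant bits of all values, then the second most
    significant bits, etc., repacked into 32-bit output words."""
    bits = [(v >> pos) & 1 for pos in range(W - 1, -1, -1) for v in block]
    out = []
    for i in range(0, len(bits), W):
        word = 0
        for b in bits[i:i + W]:
            word = (word << 1) | b
        out.append(word)
    return out

def dim_shuffle(block, n):
    """DIMn: group values by dimension; for n = 2, A,B,C,D,E,F -> A,C,E,B,D,F."""
    return [block[j] for i in range(n) for j in range(i, len(block), n)]
```

BIT is effectively a bit-matrix transpose over the chunk, and DIMn is a strided gather; neither performs any arithmetic, matching the slide's characterization.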
Predictor Components
- Predictors guess values based on previous values and compute residuals (true minus guessed value)
- Residuals tend to cluster around zero, making them easier to compress than the original sequence
- Each thread block operates on a chunk of values
- LNVns subtracts the nth prior value from the current value (tested n = 1, 2, 3, 5, 6, and 8)
- LNVnx XORs the current value with the nth prior value (tested n = 1, 2, 3, 5, 6, and 8)
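A serial sketch of both predictor families, assuming 32-bit word arithmetic; the rule that the first n values are predicted as zero is my guess at the boundary handling:

```python
MASK = (1 << 32) - 1  # 32-bit word arithmetic

def lnv_s(block, n):
    """LNVns: residual = current value minus the n-th prior value (mod 2^32).
    The first n values have no prior value and are predicted as 0 (assumed)."""
    return [(v - (block[i - n] if i >= n else 0)) & MASK for i, v in enumerate(block)]

def lnv_x(block, n):
    """LNVnx: residual = current value XOR the n-th prior value."""
    return [v ^ (block[i - n] if i >= n else 0) for i, v in enumerate(block)]
```

On a slowly varying sequence the subtractive residuals cluster near zero: lnv_s([100, 101, 103, 106], 1) yields [100, 1, 2, 3], which downstream reducers can exploit.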
Reducer Components
- Reducers eliminate redundancies in the value sequence; all other components cannot change the length of the sequence, i.e., only reducers can compress it
- Each thread block operates on a chunk of values
- ZE emits a bitmap of the zeros followed by the non-zero values: effective if the input sequence contains many zeros
- RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value: effective if the input contains many repeating values
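Serial sketches of the two reducers; the bitmap polarity (1 marking a non-zero position) and the (count, value) pair layout are my assumptions about the encoding details:

```python
def ze(block):
    """ZE: one bit per value marking the non-zero positions (polarity assumed),
    followed by the non-zero values themselves."""
    bitmap = [int(v != 0) for v in block]
    nonzero = [v for v in block if v != 0]
    return bitmap, nonzero

def rle(block):
    """RLE: replace each run of repeating values by a (count, value) pair."""
    runs = []
    for v in block:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return runs
```

Both outputs are shorter than the input exactly when the expected redundancy (many zeros, or long runs) is present, which is why the synthesis places transformation stages before the reducer.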
Algorithm Synthesis
- Determine the best four-stage algorithms with a cut: exhaustive search of all 138,240 possible combinations
- 13 double-precision data sets (19 to 277 MB): observational data, simulation results, and MPI messages; single-precision data derived from the double-precision data
- Create a general GPU-friendly compression algorithm: analyze the best algorithm for each data set and precision, find commonalities, and generalize them into one algorithm
Best of 138,240 Algorithms ("|" marks the position of the cut)

data set       double precision        single precision
msg_bt         LNV1s BIT LNV1s ZE |    DIM5 ZE LNV6x | ZE
msg_lu         LNV5s | DIM8 BIT RLE    LNV5s LNV5s LNV5x | ZE
msg_sp         DIM3 LNV5x BIT ZE |     DIM3 LNV5x BIT ZE |
msg_sppm       DIM5 LNV6x ZE | ZE      RLE DIM5 LNV6s ZE |
msg_sweep3d    LNV1s DIM32 | DIM8 RLE  LNV1s DIM32 | DIM4 RLE
num_brain      LNV1s BIT LNV1s ZE |    LNV1s BIT LNV1s ZE |
num_comet      LNV1s BIT LNV1s ZE |    LNV1s | DIM4 BIT RLE
num_control    LNV1s BIT LNV1s ZE |    LNV1s BIT LNV1s ZE |
num_plasma     LNV2s LNV2s LNV2x | ZE  LNV2s LNV2s LNV2x | ZE
obs_error      LNV1x ZE LNV1s ZE |     LNV6s BIT LNV1s ZE |
obs_info       LNV2s | DIM8 BIT RLE    LNV8s DIM2 | DIM4 RLE
obs_spitzer    ZE BIT LNV1s ZE |       ZE BIT LNV1s ZE |
obs_temp       LNV8s BIT LNV1s ZE |    BIT LNV1x DIM32 | RLE
overall best   LNV6s BIT LNV1s ZE |    LNV6s BIT LNV1s ZE |
Analysis of Reducers (double-precision results only; single-precision results are similar)
- ZE or RLE (the encoder) is required at the end, not counting the cut; ZE dominates: many zeros, but not in a row
- The first three stages contain almost no reducers
- Transformations are key to making the reducer effective
- Chaining whole compression algorithms may therefore be futile
Analysis of Mutators
- NUL and INV are never used: there is no need to invert bits, and algorithms with fewer effective stages perform worse
- The cut is often at the end (i.e., not used): word granularity suffices and is easier/faster to implement
- DIM8 appears right after the cut (DIM4 with single precision): used to separate the byte positions of each word
- Synthesis yielded this unforeseen use of the DIM component
Analysis of Shufflers
- Shufflers are important: almost always included
- BIT is used very frequently: FP bit positions correlate more strongly than whole values
- DIM has two uses: separating bytes (right after the cut, as noted before) and, in early stages, separating the values of multi-dimensional data sets (its intended use)
Analysis of Predictors
- Predictors are very important (they provide the data model): used in every case, often with two predictors per algorithm
- LNVns dominates LNVnx: the arithmetic (subtraction) difference is superior to the bit-wise (xor) difference in the residual
- The dimension n separates the values of multi-dimensional data sets (in the 1st stage)
Analysis of Overall Best Algorithm
- Same algorithm for SP and DP: few components mismatch, but the LNV6s dimension is off
- Most frequent pattern: LNV*s BIT LNV1s ZE, where the star denotes the dimensionality
- Why 6 in the starred position? It is not used in the individual best algorithms, but 6 is the least common multiple of 1, 2, and 3 (n > 8 was not tested)
MPC: Generalization of Overall Best
- MPC algorithm: Massively Parallel Compression
- Uses the generalized pattern LNVds BIT LNV1s ZE, where d is the data set dimensionality
- Matches the best algorithm on several DP and SP data sets
- Performs even better when the true dimensionality is used
Evaluation Methodology
- System: dual 10-core Xeon E5-2680 v2 CPUs and a K40 GPU with 15 SMs (2880 cores)
- 13 DP and 13 SP real-world data sets (same as before)
- Compression algorithms: bzip2, gzip, lzop, and pFPC on the CPU; GFC and MPC (our algorithm) on the GPU
Compression Ratio (Double Precision)

data set     bzip2   gzip    lzop   pFPC   GFC    MPC
             --best  --best  -9     -1M
msg_bt       1.088   1.130   1.052  1.250  1.122  1.207
msg_lu       1.018   1.055   1.000  1.137  1.148  1.212
msg_sp       1.055   1.107   1.003  1.238  1.202  1.208
msg_sppm     6.933   7.431   6.780  4.710  3.506  2.999
msg_sweep3d  1.294   1.092   1.017  1.888  1.217  1.287
num_brain    1.043   1.064   1.000  1.148  1.090  1.182
num_comet    1.173   1.162   1.082  1.151  1.110  1.267
num_control  1.029   1.058   1.017  1.038  1.013  1.106
num_plasma   5.789   1.608   1.503  7.042  1.125  1.164
obs_error    1.339   1.448   1.273  1.542  1.233  1.180
obs_info     1.217   1.154   1.096  1.215  1.141  1.214
obs_spitzer  1.752   1.231   1.142  1.022  1.022  1.184
obs_temp     1.024   1.036   1.000  0.997  1.037  1.101
HarMean      1.321   1.239   1.158  1.365  1.179  1.248

- MPC delivers record compression on 5 data sets, in spite of the GPU-friendly-components constraint
- MPC is outperformed by bzip2 and pFPC on average, due to msg_sppm and num_plasma
- MPC is superior to GFC (the only other GPU compressor)
Compression Ratio (Single Precision)

data set     bzip2   gzip    lzop   MPC
             --best  --best  -9
msg_bt       1.129   1.179   1.075  1.336
msg_lu       1.041   1.086   1.000  1.440
msg_sp       1.141   1.200   1.083  1.385
msg_sppm     8.741   9.605   8.634  3.813
msg_sweep3d  2.355   1.151   1.033  1.534
num_brain    1.113   1.128   1.003  1.344
num_comet    1.117   1.151   1.086  1.178
num_control  1.043   1.080   1.016  1.122
num_plasma   8.652   1.383   1.223  1.345
obs_error    1.338   1.466   1.246  1.298
obs_info     1.327   1.200   1.129  1.436
obs_spitzer  1.394   1.188   1.077  1.047
obs_temp     1.049   1.079   1.000  1.114
HarMean      1.398   1.267   1.153  1.350

- MPC delivers record compression on 8 data sets, in spite of the GPU-friendly-components constraint
- MPC is outperformed by bzip2 on average, due to num_plasma
- MPC is superior to GFC and pFPC: they do not support single-precision data, MPC does
Throughput (Gigabytes per Second)

               double precision    single precision
               compr.   decom.     compr.   decom.
bzip2 --best     0.01     0.02       0.01     0.02
gzip --best      0.02     0.15       0.03     0.15
lzop -9          0.01     1.87       0.01     1.44
pFPC -1M         1.41     1.04       n/a      n/a
GFC             32.28    31.47       n/a      n/a
MPC             10.78     7.91       5.81     4.23

Throughput relative to MPC

               double precision    single precision
               compr.   decom.     compr.   decom.
bzip2 --best     0.1%     0.3%       0.1%     0.6%
gzip --best      0.2%     1.9%       0.4%     3.5%
lzop -9          0.1%    23.6%       0.2%    33.9%
pFPC -1M        13.0%    13.2%       n/a      n/a
GFC            299.4%   398.0%       n/a      n/a
MPC            100.0%   100.0%     100.0%   100.0%

- MPC outperforms all CPU compressors, including pFPC running on two 10-core CPUs, by 7.5x
- MPC is slower than GFC but mostly faster than PCIe
- MPC currently uses a slow O(n log n) prefix scan implementation
Summary
- Goal of research: create an effective algorithm for FP data compression that is suitable for massively-parallel GPUs
- Approach: extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms; generalized the findings to derive the MPC algorithm
- Result: a brand-new compression algorithm for SP and DP data that compresses about as well as CPU algorithms but is much faster
Future Work and Acknowledgments
- Future work: faster implementation, more components, longer chains, and other inputs, data types, and constraints
- Acknowledgments: National Science Foundation, NVIDIA Corporation, and the Texas Advanced Computing Center
- Contact information: burtscher@txstate.edu
Number of Stages
- 3 stages already reach about 95% of the compression ratio
Single- vs Double-Precision Algorithms
(Same table as the "Best of 138,240 Algorithms" slide.)
MPC Operation
What does LNVds BIT LNV1s ZE do?
- LNVds predicts each value using a similar value to obtain a residual sequence with many small values; the similar value is the most recent prior value from the same dimension
- BIT groups the residuals by bit position: all LSBs, then all second LSBs, etc.
- LNV1s turns identical consecutive words into zeros
- ZE eliminates these zero words
GPU friendly
- All four components are massively parallel
- Can be implemented with prefix scans or simpler operations
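The four stages above can be sketched end to end as a serial Python round trip. This is an illustrative reconstruction, not the authors' code: per-thread-block chunking, the bit order inside BIT, the zero-bitmap polarity, and the boundary rule for the first d values are all simplified guesses, and the real CUDA implementation parallelizes each stage:

```python
W = 64                 # double-precision words
MASK = (1 << W) - 1

def lnv_s(block, n):   # predictor: subtract the n-th prior value
    return [(v - (block[i - n] if i >= n else 0)) & MASK for i, v in enumerate(block)]

def lnv_s_inv(resid, n):
    out = []
    for i, r in enumerate(resid):
        out.append((r + (out[i - n] if i >= n else 0)) & MASK)
    return out

def bit(block):        # shuffler: transpose the bit matrix (MSB positions first, assumed)
    bits = [(v >> p) & 1 for p in range(W - 1, -1, -1) for v in block]
    out = []
    for i in range(0, len(bits), W):
        word = 0
        for b in bits[i:i + W]:
            word = (word << 1) | b
        out.append(word)
    return out

def bit_inv(block, n): # scatter the bits back to their original positions
    bits = [(w >> p) & 1 for w in block for p in range(W - 1, -1, -1)]
    out = [0] * n
    k = 0
    for pos in range(W - 1, -1, -1):
        for i in range(n):
            out[i] |= bits[k] << pos
            k += 1
    return out

def ze(block):         # reducer: zero bitmap plus the non-zero values
    return [int(v != 0) for v in block], [v for v in block if v != 0]

def ze_inv(bitmap, nonzero):
    it = iter(nonzero)
    return [next(it) if b else 0 for b in bitmap]

def mpc_compress(words, d):
    """LNVds, then BIT, then LNV1s, then ZE."""
    return ze(lnv_s(bit(lnv_s(words, d)), 1))

def mpc_decompress(bitmap, nonzero, d, n):
    """Inverse components applied in reverse order; n is the original word count."""
    return lnv_s_inv(bit_inv(lnv_s_inv(ze_inv(bitmap, nonzero), 1), n), d)
```

A 2-dimensional toy input round-trips losslessly: compress with d = 2, decompress with the same d and the original length, and the input words come back bit-exactly.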