
Parallel Compressor for Floating-Point Data Analysis
"Explore the pFPC algorithm, a parallel compressor designed for efficient transfer and storage of floating-point data in scientific programs. This algorithm, developed in March 2009, offers fast and well-performing compression of IEEE 754 double-precision data, ideal for high-performance computing environments. Learn about its sequential and parallel versions, evaluation methods, and applications in large-scale computer systems."
Presentation Transcript
pFPC: A Parallel Compressor for Floating-Point Data
Martin Burtscher (The University of Texas at Austin) and Paruj Ratanaworabhan (Cornell University)
Introduction
- Scientific programs often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages).
- Large amounts of data are expensive and slow to transfer and store.
- The FPC algorithm for IEEE 754 double-precision data compresses linear streams of FP values fast and well, with single-pass operation and lossless compression.
pFPC: A Parallel Compressor for Floating-Point Data, March 2009
Introduction (cont.)
- Large-scale high-performance computers consist of many networked compute nodes; each compute node has multiple CPUs but only one link.
- To speed up data transfer, real-time compression is needed to match the link throughput.
- pFPC is a parallel version of the FPC algorithm; it exceeds 10 Gb/s on four Xeon processors.
Sequential FPC Algorithm [DCC '07]
- Operates on an uncompressed 1D stream of doubles.
- For each double: make two predictions (FCM and DFCM), select the closer value, XOR it with the true value, count the leading zero bytes, encode the value as a one-bit predictor selector, a three-bit leading-zero-byte count, and 0 to 8 remainder bytes, then update the predictors.
[Slide diagram: FCM and DFCM predictors feeding a selector, XOR with the true value, a leading-zero-byte counter, and an encoder emitting the compressed stream]
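The per-value compression step described above can be sketched in C as follows. This is a simplified illustration, not the authors' released implementation: the table size, the hash functions, and all names are assumptions, and the real FPC packs the header bits and residual bytes into an output buffer.

```c
#include <stdint.h>
#include <string.h>

#define TBITS 16
#define TSIZE (1 << TBITS)

static uint64_t fcm[TSIZE];       /* finite-context-method table (zero-initialized) */
static uint64_t dfcm[TSIZE];      /* differential FCM table                         */
static uint64_t fcm_hash, dfcm_hash, last;

/* Count how many of the most-significant bytes of x are zero. */
static int leading_zero_bytes(uint64_t x) {
    int n = 0;
    while (n < 8 && ((x >> (56 - 8 * n)) & 0xFF) == 0) n++;
    return n;
}

/* Compress one double; returns the 4-bit header (1 selector bit plus a
   3-bit zero-byte code) and stores the residual whose low bytes follow. */
static int compress_one(double d, uint64_t *residual) {
    uint64_t v;
    memcpy(&v, &d, sizeof v);                /* reinterpret bits losslessly */

    uint64_t pred_fcm  = fcm[fcm_hash];           /* predict the value      */
    uint64_t pred_dfcm = dfcm[dfcm_hash] + last;  /* predict via a delta    */

    uint64_t xor_fcm  = v ^ pred_fcm;        /* smaller XOR = closer value  */
    uint64_t xor_dfcm = v ^ pred_dfcm;

    int sel = (xor_dfcm < xor_fcm);          /* pick the closer predictor   */
    uint64_t best = sel ? xor_dfcm : xor_fcm;

    int lzb = leading_zero_bytes(best);
    if (lzb == 4) lzb = 3;                   /* a count of 4 is stored as 3 */
    int code = (lzb > 4) ? lzb - 1 : lzb;    /* 3-bit code 0..7             */

    /* update predictor tables and hashes (illustrative hash functions) */
    fcm[fcm_hash] = v;
    fcm_hash = ((fcm_hash << 6) ^ (v >> 48)) & (TSIZE - 1);
    dfcm[dfcm_hash] = v - last;
    dfcm_hash = ((dfcm_hash << 2) ^ ((v - last) >> 40)) & (TSIZE - 1);
    last = v;

    *residual = best;                        /* low (8 - lzb) bytes follow  */
    return (sel << 3) | code;
}
```

With empty tables, the first value is unpredicted (all 8 residual bytes follow), while an immediately repeated value is predicted exactly and needs no residual bytes.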
pFPC: Parallel FPC Algorithm
- pFPC operation: divide the data stream into chunks, logically assign the chunks round-robin to threads, and let each thread compress its data with FPC.
- Key parameters: chunk size and number of threads.
[Slide diagram: with two threads, chunks A, C, E go to thread 1 and chunks B, D, F to thread 2]
Evaluation Method
- Systems: 3.0 GHz Xeon with 4 processors (others in the paper).
- Datasets: linear streams of real-world data (18 to 277 MB):
  - 3 observations: error, info, spitzer
  - 3 simulations: brain, comet, plasma
  - 3 messages: bt, sp, sweep3d
Compression Ratio vs. Thread Count
- Configuration: small predictor, chunk size = 1.
- The compression ratio is low (typical for FP data); other algorithms perform worse.
- Fluctuations are due to multi-dimensional data.
[Chart: compression ratio (1.0 to 1.6) vs. thread count (1 to 16)]
Compression Ratio vs. Chunk Size
- Configuration: small predictor, 1 to 4 threads.
- The compression ratio is flat for 1 thread but initially drops steeply with more threads.
- Larger chunk sizes are better for history-based predictors.
[Chart: compression ratio (1.11 to 1.19) vs. chunk size for 1 to 4 threads]
Throughput on Xeon System
- Throughput increases with chunk size (loop overhead, false sharing, TLB performance).
- Throughput scales with thread count but is limited by load balance and memory bandwidth.
[Charts: compression and decompression throughput (up to 1800 MB/s) vs. chunk size (1 to 65,536) for 1 to 4 threads with 8 kB predictors]
Summary
- The pFPC algorithm chunks up the data and logically assigns the chunks in round-robin fashion to threads.
- It reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon.
- Portable C source code is available on-line: http://users.ices.utexas.edu/~burtscher/research/pFPC/
Conclusions
- For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimension, and the chunk size should be one.
- For the highest throughput, the chunk size should at least match the system's page size (and be page aligned); larger chunks also yield higher compression ratios with history-based predictors.
- Parallel scaling is limited by memory bandwidth, so future work should focus on improving the compression ratio without increasing the memory bandwidth.