Advanced Lossless Compression of Floating-Point Data

adaptive per file lossless compression n.w
1 / 29
Embed
Share

"Explore adaptive per-file lossless compression techniques for floating-point data, enhancing data storage efficiency and I/O performance. Learn about innovative approaches using genetic algorithms and population-based heuristics for optimal compression. Discover how AdaptiveFC outperforms leading compressors in achieving high compression ratios."

  • Compression
  • Data
  • Floating-Point
  • Algorithms
  • Genetic

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Adaptive Per-File Lossless Compression of Floating-Point Data Andrew Rodriguez, Noushin Azami, and Martin Burtscher

  2. Data Compression Compression saves space, boosts I/O performance Hardware/Hybrid Accelerated Cosmology Code Generates petabytes of data in just 1 simulation Finding the best compression algorithm is difficult Floating-point data is difficult to compress well HACC Visualization (ornl.gov) Adaptive Per-File Lossless Compression of Floating-Point Data 2

  3. LC Framework LC generates lossless compression algorithms Based on a library of data transformations (components) LC chains components together Yields many possible compression algorithms (pipelines) Problem: pkpipelines for p components and k stages p = 53, k= 5 ~418 million pipelines in search space Adaptive Per-File Lossless Compression of Floating-Point Data 3

  4. This Work Our approach: AdaptiveFC Provides custom compression algorithms Each designed for high compression on just 1 input Based on a genetic algorithm Quickly searches millions of pipelines LC can synthesize AdaptiveFC achieves highest compression overall and on 4 of the 6 datasets we tested Compared against 15 leading special-purpose and general- purpose GPU compressors Adaptive Per-File Lossless Compression of Floating-Point Data 4

  5. Genetic Algorithm Operation Adaptive Per-File Lossless Compression of Floating-Point Data 5

  6. Genetic Algorithms Population-based heuristic inspired by evolution A population of solutions is evolved over generations Each member (circle) of the population is a pipeline Generation 1 Generation 2 Generation 3 Adaptive Per-File Lossless Compression of Floating-Point Data 6

  7. Evaluation Fitness is a measure of how good that individual is Every individual is a pipeline The compression ratio is the fitness Adaptive Per-File Lossless Compression of Floating-Point Data 7

  8. Elitism Allow the fittest of the population to survive Copied to the next generation Elite has high compression ratio Within ? percent of the fittest individual Adaptive Per-File Lossless Compression of Floating-Point Data 8

  9. Selection Process by which parents are picked for crossover Tournament: k members are randomly sampled The fittest individual of those k is selected Adaptive Per-File Lossless Compression of Floating-Point Data 9

  10. Selection Process by which parents are picked for crossover Roulette Wheel: Probability of being selected is proportional to the fitness of an individual Adaptive Per-File Lossless Compression of Floating-Point Data 10

  11. Crossover Use two parents to generate new offspring Single-point: Components to the right of the crossover point are swapped TUPL RLE TUPL ZERE CLOG BIT Parent 1 Child 1 RLE ZERE DIFF CLOG DIFF BIT Parent 2 Child 2 Crossover Point Adaptive Per-File Lossless Compression of Floating-Point Data 11

  12. Crossover Uses two parents to generate new offspring Masked: Generate a random mask 0 means no change, 1 means swap with other parent TUPL RLE TUPL CLOG BIT CLOG Parent 1 Child 1 ZERE DIFF BIT RLE ZERE DIFF Parent 2 Child 2 Mask: 101 Adaptive Per-File Lossless Compression of Floating-Point Data 12

  13. Mutation Diversify the next generation s population Randomly mutate the components New population is mutated before continuing Each component has probability ? to change Replaced by a random component A non-mutated copy of the fittest is kept Ensures solutions do not worsen over generations Component 1, Mutate ? ? Component 2, No Change ? > ? Component 3, No Change ? > ? Adaptive Per-File Lossless Compression of Floating-Point Data 13

  14. Entire Process Adaptive Per-File Lossless Compression of Floating-Point Data 14

  15. Experimental Methodology Adaptive Per-File Lossless Compression of Floating-Point Data 15

  16. Methodology AdaptiveFC Implemented in Python 3 on top of LC Generations = 140 Population size = 20 Stages = 5 Masked crossover Tournament selection Mutation rate = 0.8 Elitism cutoff = 0.1 Ran AdaptiveFC 9 times on each input Using 9 fixed random number generator (RNG) seeds We report the median compression ratio from 9 runs Adaptive Per-File Lossless Compression of Floating-Point Data 16

  17. Methodology Inputs 6 datasets from SDRBench Total of 77 files Each file contains single-precision floats Adaptive Per-File Lossless Compression of Floating-Point Data 17

  18. Methodology Compressors AdaptiveFC compared against 15 GPU compressors Some compressors include multiple implementations Bitcomp, LZ4, Cascaded Compression ratio (CR) Main metric Uncompressed file size over compressed file size (De-)compression throughput Secondary metric Uncompressed file size over (de-)compression time Adaptive Per-File Lossless Compression of Floating-Point Data 18

  19. Results Adaptive Per-File Lossless Compression of Floating-Point Data 19

  20. Compression Ratio All Datasets Geometric mean CR AdaptiveFC Highest compression overall Highlights benefit of customized per-input compression Adaptive Per-File Lossless Compression of Floating-Point Data 20

  21. Compression Ratio CESM, NYX Top 5 & bottom 2 compressors AdaptiveFC Highest CR on CESM and NYX Adaptive Per-File Lossless Compression of Floating-Point Data 21

  22. Compression Ratio QMC, SCALE Top 5 & bottom 2 compressors AdaptiveFC Highest CR on QMC and SCALE ZSTD Appears in top 5 on QMC Adaptive Per-File Lossless Compression of Floating-Point Data 22

  23. Compression Ratio EXAALT, HACC Top 5 & bottom 2 compressors AdaptiveFC In top 3 Deflate and Gdeflate In top 5 on EXAALT dataset ZSTD In top 5 on EXAALT EXAALT and HACC prefer different compressors Adaptive Per-File Lossless Compression of Floating-Point Data 23

  24. Compression Ratio CESM Inputs Top 5 & bottom 2 compressors Results from two individual CESM dataset files FICE_1_26_1800_3600.f32 AdaptiveFC achieves 39% higher CR than the next highest Z3_1_26_1800_3600.f32 AdaptiveFC achieves 50% higher CR than the next highest Adaptive Per-File Lossless Compression of Floating-Point Data 24

  25. Compression Ratio Analysis AdaptiveFC Highest CR overall and on CESM, NYX, QMC and SCALE EXAALT and HACC prefer different compressor ZSTD appears in top 5 on QMC and EXAALT Deflate and Gdeflate appear in top 5 on EXAALT No single compressor works best on all datasets Adaptive Per-File Lossless Compression of Floating-Point Data 25

  26. Compression Throughput All Geometric mean throughput AdaptiveFC in top half NVcomp compressors do not concatenate data chunks All except AdaptiveFC, Ndzip, MPC AdaptiveFC s throughput does not include search time In archival settings, compression (and search) only occur once Adaptive Per-File Lossless Compression of Floating-Point Data 26

  27. Decompression Throughput All Geometric mean throughput AdaptiveFC in top half Bitcomp i0 and Ndzip Fluctuate in CR performance Tradeoff: CR and throughput Adaptive Per-File Lossless Compression of Floating-Point Data 27

  28. Summary and Conclusion AdaptiveFC provides the highest compression overall and on 4 of 6 datasets Per-file compression can provide substantial benefits Other 2 datasets have a unique compressor that compresses its files the most No single algorithm is a good choice for all datasets Different datasets prefer different compression algorithms Adaptive Per-File Lossless Compression of Floating-Point Data 28

  29. Thank you! Acknowledgments Department of Energy, reviewers Contact information andrew.rodriguez@txstate.edu Web page (link to paper and source code) https://userweb.cs.txstate.edu/~burtscher/LC/ Adaptive Per-File Lossless Compression of Floating-Point Data 29

Related


More Related Content