Reducing the Cost of Persistence for Nonvolatile Heaps in End User Devices Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan CERCS, Georgia Institute of Technology
Agenda Motivation Dual Use of NVM High level design Programming interface Sources of Persistence Cost in Dual Use of NVM Optimizations to Reduce Persistence Cost Page contiguity based allocation Library allocator optimization Hybrid logging Conclusion and Future Work
Motivation: Ever more data-intensive apps on end-client devices => growing need for memory capacity. But DRAM scalability is limited (power, cost) => need for large and faster persistent storage. Flash/eMMC performance is poor (~1-30 MB/s).
NVMs to the rescue: NVMs (like PCM) are byte addressable, provide persistence ~100x faster than SSD, and offer higher density than DRAM (~128 GB). Writes are 4x-10x slower and endurance is limited (~10^8 writes), so the processor cache is used to reduce write-latency impact and improve endurance.
Our Approach: Dual Use of NVM. Prior research uses NVM either for persistence or as an additional capacity heap. We use NVM for both: persistence (NVMPersist) and additional capacity as a heap (NVMCap). NVMCap and NVMPersist threads share the same last-level cache.
NVM Dual Use: High-Level View [Diagram: two application threads share the processor's last-level cache; one thread uses DRAM and the NVMCap heap, the other uses the NVMPersist heap]
NVM Dual Use Interface [Diagram: an application calls a user-level NVM library; the kernel partitions the NVM node into a capacity zone and a persist zone alongside DRAM. CapMalloc(size) allocates from the capacity zone; PersistMalloc(size) allocates from the persist zone.] NVM capacity is partitioned into capacity and persistence zones.
Enabling Persistence Support. Sample usage: hash *table = PersistAlloc(entries, "tableroot"); for each new entry: entry_s *entry = PersistAlloc(size, NULL); insert entry into table; count++; and for scratch data: temp_buff = CapAlloc(size). PersistAlloc requires persistence metadata in the library and OS; application data, OS data structures, and library metadata must all be flushed from the cache to avoid loss on power failure. CapAlloc requires no persistent metadata.
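The slide's interface can be sketched as follows. This is a minimal illustration over malloc, not the paper's implementation: the `PersistAlloc(size, root_name)` and `CapAlloc(size)` signatures come from the slide, while the root table and `FindRoot` helper are assumptions about how named roots let an application re-find persistent data after a restart.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the dual-use allocation API from the slide.
 * Backed by malloc here; a real library would carve from the NVM zones. */

#define MAX_ROOTS 16
static struct { const char *name; void *ptr; } roots[MAX_ROOTS];
static int nroots;

/* Persist-zone allocation: a non-NULL root name registers the object so
 * the application can locate it again after a crash or restart. */
void *PersistAlloc(size_t size, const char *root_name) {
    void *p = calloc(1, size);
    if (p && root_name && nroots < MAX_ROOTS) {
        roots[nroots].name = root_name;
        roots[nroots].ptr  = p;
        nroots++;
    }
    return p;
}

/* Capacity-zone allocation: no persistent metadata, no flushes needed. */
void *CapAlloc(size_t size) {
    return malloc(size);
}

/* Look up a previously registered persistent root by name. */
void *FindRoot(const char *root_name) {
    for (int i = 0; i < nroots; i++)
        if (strcmp(roots[i].name, root_name) == 0)
            return roots[i].ptr;
    return NULL;
}
```

The asymmetry is the point of the slide: only the persist-zone path maintains metadata that must survive power failure; the capacity path behaves like an ordinary volatile heap.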
Persistence Impact on NVMCap: (1) persistence-unaware OS page allocations cause cache conflicts between NVMCap and NVMPersist; (2) maintaining persistent library-allocator metadata increases flushes and raises NVMCap cache misses for shared data; (3) transactional (durability) logging of persistent application state increases flushes and NVM writes.
Persistence Increases Cache Misses [Chart: measured on an Atom platform with a 1 MB LLC, using MSR counters to record LLC misses]
Agenda Motivation Dual use of NVM Heap High level design Programming interface Sources of Persistence Cost in Dual use of NVM Optimizations to Reduce Persistence Cost Cache conflict aware allocation Library allocator optimization Hybrid logging Conclusion and Future Work
Cache Conflict Reduction Techniques. Co-running NVMPersist and NVMCap increases cache conflicts. Solution: cache partitioning. Hardware techniques offer little flexibility; software page-coloring techniques (as in FreeBSD) are complex. We adopt software-based partitioning, focused on allocating physically contiguous pages to an application.
Conflict-Unaware JIT Allocation [Diagram: a 2-way set-associative cache in which NVMCap and NVMPersist pages, allocated just in time, map to the same sets and conflict]. The current OS uses just-in-time (JIT) allocation, assigning pages on first touch; this reduces the physical contiguity of a process's pages as the number of threads increases.
Ideal Conflict-Free Allocator [Diagram: with physically contiguous allocation, NVMCap pages and NVMPersist pages map to disjoint cache sets, with no conflicts]. Physically contiguous page allocation reduces conflicts; we propose a simple design to achieve contiguity.
Contiguity Aware OS Allocation (CAA) [Diagram: free physical frames organized into a contiguous list (e.g., frames 1-4, 16-19) and a non-contiguous list (e.g., frames 29, 33, 34, 39, 43, 47, 53, 59, ...)]
CAA Design. On a page fault, a batch of physically contiguous pages is allocated and added to a bucket (e.g., CAA-4: buckets of 4 contiguous physical pages).
CAA Design. Both NVMCap and NVMPersist applications get their own contiguous buckets, so each zone's pages stay physically contiguous.
CAA Design. At the high watermark (memory is low, but not critically), NVMPersist starts using non-contiguous buckets.
CAA Design. At the low watermark (memory critically low), NVMCap also starts using non-contiguous buckets.
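The bucket mechanism above can be sketched in a few lines of C. This is an illustrative model, not the kernel implementation: frame numbers are simulated with a counter, whereas a real CAA allocator would pull contiguous frames from the OS page allocator on a fault, and the watermark fallback paths are omitted.

```c
#include <stddef.h>

/* Sketch of contiguity-aware allocation (CAA) with 4-page buckets.
 * Each zone (NVMCap, NVMPersist) owns its own bucket, so interleaved
 * page faults from different zones still yield contiguous frames. */

#define BUCKET_PAGES 4

typedef struct {
    unsigned long frames[BUCKET_PAGES]; /* physically contiguous frame numbers */
    int next;                           /* next unused slot in the bucket */
} caa_bucket;

static unsigned long next_free_frame = 1; /* simulated contiguous free list */

/* A fresh bucket starts empty (next == BUCKET_PAGES forces a refill). */
caa_bucket caa_bucket_init(void) {
    caa_bucket b = { {0}, BUCKET_PAGES };
    return b;
}

/* Refill a zone's bucket with BUCKET_PAGES contiguous frames in one batch. */
static void caa_refill(caa_bucket *b) {
    for (int i = 0; i < BUCKET_PAGES; i++)
        b->frames[i] = next_free_frame++;
    b->next = 0;
}

/* On a page fault, hand out the next frame from this zone's own bucket. */
unsigned long caa_alloc_frame(caa_bucket *b) {
    if (b->next >= BUCKET_PAGES)
        caa_refill(b);
    return b->frames[b->next++];
}
```

Under JIT allocation, interleaved faults from two zones would interleave their frames; here each zone drains its own contiguous batch, which is what keeps the zones in disjoint cache sets.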
CAA - Reduction in Contiguity Misses [Chart: reduction in page-contiguity misses for CAA-4 and CAA-16 relative to JIT allocation, across end-client apps and SPEC benchmarks]. CAA reduces page contiguity misses by up to 89%.
NVMCap Cache Miss Reduction [Chart: reduction in misses (%) relative to baseline for CAA-4 and CAA-16, across end-client apps and SPEC benchmarks]. CAA is beneficial for apps with a large memory footprint; adding more pages per bucket can increase cache misses due to longer linked-list traversal.
Agenda Motivation Dual use of NVM Heap High level Design Programming Interface Sources of Persistence cost in Dual use of NVM Optimizations to Reduce Persistence Cost Cache conflict aware allocation Library allocator optimization Hybrid Logging Conclusion and Future Work
Library Allocator Overhead. Nonvolatile heaps require a user-level allocator, and modern allocators use complex data structures. Placing these complex allocator structures in NVM requires multiple cache-line flushes, increasing cache misses and NVM writes for both NVMPersist and NVMCap.
Porting DRAM Allocators for NVM [Chart: adding persistence support to jemalloc costs ~4 CLFLUSHes per allocation]
NVM Write Aware Allocator (NVMA). Allocator complexity is kept independent of NVM support. Idea: keep the complex allocator structures in DRAM; NVM contains only a log of allocations and deletions [Diagram: log entries C1, C2, C3, ... recording allocated chunks]. Only the log information (~2 cache lines) is flushed to NVM.
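The DRAM-metadata/NVM-log split can be sketched as below. This is a simplified model under stated assumptions: the log record layout (offset, size, op) and the `clflush` stand-in are illustrative, not the paper's exact format, and the DRAM-side free lists are elided since they need no flushes at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the NVM write-aware allocator idea: complex bookkeeping stays
 * in DRAM; NVM holds only a compact append-only log of chunk operations. */

typedef struct {
    uint64_t off;   /* chunk offset within the persistent heap */
    uint32_t size;  /* chunk size */
    uint8_t  op;    /* 1 = alloc, 0 = free */
} nvm_log_rec;

#define LOG_CAP 1024
static nvm_log_rec nvm_log[LOG_CAP]; /* stand-in for the NVM-resident log */
static int log_len;
static int flushes;                  /* counts simulated cache-line flushes */

/* Stand-in for a real cache-line flush (e.g., the CLFLUSH instruction). */
static void clflush(const void *p) { (void)p; flushes++; }

/* Append one record and flush just that record (~1-2 cache lines), rather
 * than flushing the allocator's free lists, bitmaps, and chunk headers. */
int nvma_log_append(uint64_t off, uint32_t size, uint8_t op) {
    if (log_len >= LOG_CAP)
        return -1;
    nvm_log[log_len] = (nvm_log_rec){ off, size, op };
    clflush(&nvm_log[log_len]);
    log_len++;
    return 0;
}

int nvma_flush_count(void) { return flushes; }
```

On recovery, replaying the log reconstructs which chunks are live, so the DRAM-resident structures never need to be made persistent.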
NVMA Cache Flush Reduction [Chart: naive vs. NVMA over 0.6-1.5 million hash operations; NVMA issues ~8x fewer CLFLUSHes at the cost of ~2% more cache misses than baseline]
Logging Overheads. Logging is required for apps with strong durability requirements, and logs must be frequently flushed to NVM. Current word/object logs increase NVM writes: word-based logs have a high log-metadata-to-log-data ratio, while object logs copy the entire object even for a single-word change.
Hybrid Log Design. A hybrid log addresses the word/object granularity tradeoff: word and object logs can be used flexibly within the same transaction, with the application specifying the transaction type. Word- and object-based logs are maintained separately.
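A minimal sketch of the hybrid undo-log idea follows. The record layouts, slot counts, and the 256-byte object cap are assumptions for illustration; the paper's log formats and NVM flush ordering are not reproduced here.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of hybrid logging: word-granularity undo records for small
 * updates, whole-object records for bulk updates, in separate logs. */

typedef struct { void *addr; uint64_t old_word; } word_rec;
typedef struct { void *addr; size_t size; unsigned char old_obj[256]; } obj_rec;

#define LOG_SLOTS 64
static word_rec word_log[LOG_SLOTS]; static int nword;
static obj_rec  obj_log[LOG_SLOTS];  static int nobj;

/* Word log: one 8-byte old value per record, so the metadata/data ratio
 * stays low for a single-word change. */
int log_word(void *addr) {
    if (nword >= LOG_SLOTS) return -1;
    word_log[nword].addr = addr;
    memcpy(&word_log[nword].old_word, addr, sizeof(uint64_t));
    nword++;
    return 0;
}

/* Object log: one record for a bulk update, avoiding a record per word. */
int log_object(void *addr, size_t size) {
    if (nobj >= LOG_SLOTS || size > sizeof obj_log[0].old_obj) return -1;
    obj_log[nobj].addr = addr;
    obj_log[nobj].size = size;
    memcpy(obj_log[nobj].old_obj, addr, size);
    nobj++;
    return 0;
}

/* Abort: restore logged objects, then logged words, newest first. */
void tx_abort(void) {
    while (nobj > 0) {
        nobj--;
        memcpy(obj_log[nobj].addr, obj_log[nobj].old_obj, obj_log[nobj].size);
    }
    while (nword > 0) {
        nword--;
        memcpy(word_log[nword].addr, &word_log[nword].old_word, sizeof(uint64_t));
    }
}
```

The application's hint about the transaction type decides which of the two logging calls a given update uses, which is the flexibility the slide describes.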
Optimization - Miss Reduction [Chart: CAA + NVWA + hybrid logging, reduction in misses (%) relative to JIT allocation for end-client apps and SPEC benchmarks]. The combination reduces misses by a further 1-2% over CAA + NVWA, with larger gains as the rate of hash operations increases.
Estimation of Runtime Impact [Chart: reduction in execution time (sec) relative to baseline under three models, for end-client apps and SPEC benchmarks]. Full Writes: all reduced misses are NVM writes; Half-and-Half: half the reduced misses are NVM writes; One-Third: a third of the reduced misses are NVM writes. Even though the optimizations reduce only ~2% of misses, the runtime gains can be substantial because NVM writes are expensive.
Summary & Future Work. Efficient use of NVM requires cross-stack optimizations. Analysis of dual use of NVM shows a high impact of NVMPersist on NVMCap. We propose contiguity-aware page allocation, an NVM write-aware user-level library allocator, and hybrid logging to reduce that impact; together the optimizations reduce cache misses by ~12-13%. Future work: DRAM data structures (e.g., in the OS allocator) may not be suitable for NVM.
Overall NVMCap Miss Reduction [Chart: CAA + NVWA, reduction in misses (%) relative to baseline, for end-client apps and SPEC benchmarks]. x264 makes several small allocations per frame; reducing complex allocator-metadata misses yields ~10%.
Transactional Persistent Hash [Code slide: a sample NVMPersist transactional hash table]
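The code on that slide is not reproduced in this transcript, so the following is a hypothetical reconstruction of what such a transactional hash insert could look like. The `TX_BEGIN`/`TX_END` macros are placeholders for the library's durability interface (begin transaction, undo-log, flush, commit), and malloc stands in for `PersistAlloc(size, NULL)`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of an NVMPersist transactional hash table.
 * Chained buckets; the transaction macros are no-op stand-ins here. */

#define NBUCKETS 31

typedef struct entry { int key, val; struct entry *next; } entry_s;
static entry_s *table[NBUCKETS];     /* would come from PersistAlloc(..., "tableroot") */

#define TX_BEGIN do {                /* begin durable transaction (sketch) */
#define TX_END   } while (0)         /* undo-log + flush + commit (sketch) */

/* Insert a key/value pair; the bucket-head update would be undo-logged
 * inside the transaction so a crash mid-insert cannot corrupt the chain. */
int hash_put(int key, int val) {
    entry_s *e = malloc(sizeof *e);  /* stand-in for PersistAlloc(size, NULL) */
    if (!e) return -1;
    TX_BEGIN;
        e->key = key;
        e->val = val;
        e->next = table[key % NBUCKETS];
        table[key % NBUCKETS] = e;
    TX_END;
    return 0;
}

/* Lookup: walk the chain for the key's bucket. */
int hash_get(int key, int *val) {
    for (entry_s *e = table[key % NBUCKETS]; e; e = e->next)
        if (e->key == key) { *val = e->val; return 0; }
    return -1;
}
```

In the real library, the flushes issued at `TX_END` are exactly the cost that the hybrid log and NVMA optimizations above aim to reduce.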
Hybrid Logging Gains. Hybrid logging reduces cache misses by 1-2% in most applications compared to CAA + NVWA (close to 2% overall), and the benefit grows with the number of hash operations: up to 12% for x264 versus 1-6% for the other applications and benchmarks.