Architectural Support for Programming Languages and Operating Systems at Indian Institute of Science (IISc), Bangalore

Slide Note

This presentation, "Making Huge Pages Actually Useful", was given at the ASPLOS 2018 conference by Ashish Panwar, Aravinda Prasad, and K. Gopinath of the Indian Institute of Science (IISc), Bangalore, and NetApp Inc. It explains why huge pages often fail to deliver their address-translation benefits on Linux once memory becomes fragmented, identifies three causes (fragmentation-via-pollution, high slab churn, and latency-inducing unsuccessful migration), and presents Illuminator, which manages hybrid pageblocks explicitly to mitigate them.

Uploaded on Mar 08, 2025

Presentation Transcript

Making Huge Pages Actually Useful
Ashish Panwar (1,2), Aravinda Prasad (1), K. Gopinath (1)
(1) Indian Institute of Science (IISc), Bangalore; (2) NetApp Inc.
Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2018

Virtual-to-physical address translation
- TLB misses are (very) expensive, and even more expensive in virtualized systems
- On x86-based systems, resolving a TLB miss takes up to 4 memory accesses natively and up to 24 under virtualization (nested page tables)
- Experimental setup: 8-core Xeon Ivy Bridge server, 24GB memory
[Figure: % of CPU cycles spent in page walks, native vs. virtual, for mcf, NPB_CG.D, MySQL, and canneal; y-axis from 0 to 75]
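The figures of 4 and 24 memory accesses can be reproduced with a little arithmetic. The sketch below assumes 4-level page tables in both the guest and the host, which is the usual x86-64 arrangement but is not stated on the slide; the function names are hypothetical.

```python
def native_walk_refs(levels: int = 4) -> int:
    """Memory references for one native page walk: one read per level."""
    return levels


def nested_walk_refs(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Memory references for one nested (two-dimensional) page walk.

    Every guest page-table pointer, plus the final guest-physical address,
    must itself be translated by a full host walk. On top of that, each
    guest level costs one read of the guest page-table entry.
    """
    host_walks = (guest_levels + 1) * host_levels
    guest_entry_reads = guest_levels
    return host_walks + guest_entry_reads


print(native_walk_refs())  # 4
print(nested_walk_refs())  # 24
```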

Efficient address translation with huge pages
- Each TLB entry maps a large region, so there are fewer TLB misses
- Supported by hardware for nearly two decades
- Considerable performance benefits (under ideal conditions)
[Figure: performance improvement with huge pages, native vs. virtual, for mcf, NPB_CG.D, MySQL, and canneal; axis labeled "% speedup", ticks from 1 to 1.8]
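One way to see why huge pages reduce TLB misses is to compare TLB reach, the amount of memory that a full TLB can map. The sketch below uses the 4KB and 2MB page sizes from the experimental setup; the entry count of 1024 is an assumption chosen for illustration, not a figure from the talk.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


ENTRIES = 1024  # assumed TLB size, for illustration only

base = tlb_reach(ENTRIES, 4 * KB)
huge = tlb_reach(ENTRIES, 2 * MB)

print(base // MB, "MB reachable with 4KB pages")   # 4 MB
print(huge // MB, "MB reachable with 2MB pages")   # 2048 MB
print(huge // base, "times more reach")            # 512
```

Whatever the real entry count is, the ratio stays at 512 because it depends only on the two page sizes.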

Tales from the field
[Slides 4 to 7: only the slide title survives in the transcript]

Why aren't huge pages effective (yet)?
- Must be mapped contiguously
- Difficult to allocate because of fragmentation as the system ages
- Defragmentation can become a bottleneck if not done properly
- Memory compaction migrates pages to restore contiguity
- Problem: not all pages can be moved! With unmovable pages in the way, compaction still cannot allocate 4 contiguous pages
[Figure: memory layout before and after compaction, showing free, allocated, and unmovable pages]
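The effect of unmovable pages on compaction can be illustrated with a toy model. The sketch below is a simplification and not the Linux compaction algorithm: memory is a string in which 'A' is an allocated movable page, 'U' an unmovable page, and '.' a free page, and compaction moves the highest movable page into the lowest free slot until the two scans meet. The function names are hypothetical.

```python
def compact(pages: str) -> str:
    """Move movable pages ('A') into the lowest free slots; 'U' never moves."""
    cells = list(pages)
    lo, hi = 0, len(cells) - 1
    while True:
        while lo < hi and cells[lo] != '.':   # find the lowest free slot
            lo += 1
        while lo < hi and cells[hi] != 'A':   # find the highest movable page
            hi -= 1
        if lo >= hi:
            break
        cells[lo], cells[hi] = 'A', '.'       # migrate the page
    return ''.join(cells)


def longest_free_run(pages: str) -> int:
    """Length of the largest contiguous run of free pages."""
    return max(len(run) for run in pages.replace('A', 'U').split('U'))


print(compact("A.A.A.A."))                    # AAAA....
print(longest_free_run(compact("A.A.A.A.")))  # 4

print(compact("A.U.A.U."))                    # AAU...U.
print(longest_free_run(compact("A.U.A.U.")))  # 3
```

In the second layout four pages are free after compaction, but the unmovable pages split them into runs of three and one, so a request for 4 contiguous pages still fails.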

What are unmovable pages?
- Movability requires reference management of every object
- User space is mapped through page tables [pa = PTE(va)]
- Kernel space is directly mapped [pa = va - PAGE_OFFSET]
- Direct mapping makes kernel pages unmovable
- This is a trade-off between simplicity and flexibility
- Large contiguous allocations are (generally) prohibited, but they are inevitable with huge pages

Fragmentation mitigation in Linux
1. Anti-fragmentation
   - Aims to prevent fragmentation from occurring
   - Partitions physical memory between kernel and user (at pageblock granularity)
   - Clusters similar allocations together (to minimize pollution)
2. Memory compaction
   - Compacts regions that are not polluted by the kernel (colored green in the figure)

Problem (1/3): Fragmentation-via-pollution
- When the unmovable region needs pages, it steals from the movable region; when the movable region needs pages, it steals from the unmovable region
- A pageblock that ends up holding both movable and unmovable pages is a hybrid pageblock (treated as unmovable by the Linux kernel)
- Eventually the majority of pageblocks become hybrid
- This can lead to permanent fragmentation. Why? Because hybrid pageblocks remain hidden from the allocator
[Figure: animation of unmovable and movable pageblocks containing movable, free, and unmovable pages]
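A minimal sketch of how stealing creates hybrid pageblocks follows. It is a toy model under stated assumptions, not the Linux page allocator: pageblocks hold only 4 pages, the allocator always takes the first block with room, and the names `Pageblock` and `allocate` are hypothetical.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # tiny pageblocks, for illustration only


@dataclass
class Pageblock:
    kind: str                                   # 'movable' or 'unmovable'
    pages: list = field(default_factory=list)   # kind of each allocated page

    def has_room(self) -> bool:
        return len(self.pages) < PAGES_PER_BLOCK

    def is_hybrid(self) -> bool:
        # Hybrid: holds both movable and unmovable pages.
        return len(set(self.pages)) > 1


def allocate(blocks, kind):
    """Use a pageblock of the requested kind; if none has room, steal."""
    preferred = [b for b in blocks if b.kind == kind and b.has_room()]
    fallback = [b for b in blocks if b.kind != kind and b.has_room()]
    candidates = preferred + fallback
    if not candidates:
        raise MemoryError("no free pages")
    candidates[0].pages.append(kind)
    return candidates[0]


blocks = [Pageblock('unmovable'), Pageblock('movable'), Pageblock('movable')]
for _ in range(5):              # the fifth request overflows the unmovable block
    allocate(blocks, 'unmovable')
allocate(blocks, 'movable')     # lands next to the stolen unmovable page

print([b.is_hybrid() for b in blocks])  # [False, True, False]
```

A single stolen page is enough to turn the second pageblock hybrid, and because the block is still labeled 'movable' nothing in this model records that it has been polluted.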

Problem (2/3): High slab churn
- Many subsystems use the Read-Copy-Update (RCU) synchronization mechanism
- Every update operation creates a new copy of the object
- Old (deferred) objects are reclaimed sometime after a safe (grace) period
- This extends the lifetime of kernel (unmovable) objects and causes a high rate of pollution
- Why? Deferred objects remain invisible to the slab allocator until reclaimed by RCU
[Figure: RCU, the slab allocator, and the buddy allocator, with alloc/free traffic between the unmovable and movable regions]
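The link between deferred reclamation and memory demand can be sketched with a small simulation. This is an illustrative model, not kernel code: one object is updated repeatedly, each update allocates a new copy, and the old copy is released only after a grace period measured in update steps. The function name and the grace-period unit are assumptions.

```python
from collections import deque


def peak_live_objects(updates: int, grace_period: int) -> int:
    """Peak number of slab objects held while one object is updated repeatedly."""
    deferred = deque()   # step at which each old copy becomes reclaimable
    live = 1             # the current version of the object
    peak = live
    for step in range(updates):
        # Reclaim old copies whose grace period has ended.
        while deferred and deferred[0] <= step:
            deferred.popleft()
            live -= 1
        live += 1                              # the update creates a new copy
        deferred.append(step + grace_period)   # the old copy must wait
        peak = max(peak, live)
    return peak


print(peak_live_objects(updates=100, grace_period=0))   # 2
print(peak_live_objects(updates=100, grace_period=10))  # 11
```

The longer the grace period, the more objects the slab allocator must hold at once, so it asks the buddy allocator for more (unmovable) pages even though the logical working set is a single object.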

Problem (3/3): LIU migration
- LIU (latency-inducing unsuccessful) migration: compaction starts migrating pages out of a hybrid pageblock (unknown to the Linux kernel), then fails, and the effort is wasted
- High memory traffic, TLB shootdowns
- Particularly harmful for page-fault-intensive workloads
- Why? Because hybrid pageblocks remain hidden during compaction
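The wasted effort in an unsuccessful migration can be made concrete with a toy model. The sketch assumes that compaction walks a pageblock in order, copies each movable page it meets, and gives up at the first unmovable page; real compaction is more involved, and the function name is hypothetical.

```python
def try_compact_block(block: str):
    """Try to empty one pageblock by migrating its pages in order.

    'M' is a movable page, 'U' an unmovable page, '.' a free page.
    Returns (success, pages_copied). The attempt aborts at the first
    unmovable page, so every copy made before that point is wasted.
    """
    copied = 0
    for page in block:
        if page == 'U':
            return False, copied
        if page == 'M':
            copied += 1
    return True, copied


print(try_compact_block("MM..M..."))  # (True, 3)
print(try_compact_block("MMM.MMU."))  # (False, 5)
```

In the second case five pages are copied before the attempt fails, and the pageblock still cannot be freed.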

Implications
- Huge page allocation failures
- High (and variable) latency
- High (kernel-mode) CPU utilization
- Poor performance isolation
- Virtualization can exacerbate performance issues: both guest and host OSs may perform unnecessary work
- Large memory, large problems
[Figure: time (minutes) with 4KB vs. 2MB pages as memory size grows from 2GB to 12GB; y-axis from 8 to 18; workload: milc (SPEC CPU2006)]

Illuminator
- Manages hybrid pageblocks explicitly
- Mitigates fragmentation-via-pollution
- Eliminates LIU migration
- Reduces slab churn with Prudence [1]
[1] Aravinda Prasad, K. Gopinath. Prudent Memory Reclamation in Procrastination-Based Synchronization. ASPLOS 2016.

Preventing unnecessary fragmentation
- Existing hybrid pageblocks are utilized to prevent pollution
- Produces less than 10% hybrid pageblocks compared to Linux
[Figure: unmovable, hybrid, and movable pageblocks containing movable, free, and unmovable pages]
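The idea of reusing existing hybrid pageblocks can be sketched by comparing two fallback policies. Both policies here are toy stand-ins chosen for illustration: the baseline steals from the emptiest pageblock with room, and the hybrid-first variant only does so when no hybrid pageblock has room. Neither is the actual Linux or Illuminator code, and all names are hypothetical.

```python
PAGES_PER_BLOCK = 4  # tiny pageblocks, for illustration only


def count_hybrid(blocks) -> int:
    """A pageblock is hybrid when it holds both movable and unmovable pages."""
    return sum(1 for pages in blocks if 'M' in pages and 'U' in pages)


def steal(blocks, kind, prefer_hybrid):
    """Place one page of `kind` when its own region has no room left."""
    candidates = [b for b in blocks if len(b) < PAGES_PER_BLOCK]
    if prefer_hybrid:
        hybrid = [b for b in candidates if 'M' in b and 'U' in b]
        if hybrid:
            candidates = hybrid        # pollute only already-polluted blocks
    target = min(candidates, key=len)  # emptiest remaining candidate
    target.append(kind)


def simulate(prefer_hybrid: bool) -> int:
    # One full unmovable block and three partly used movable blocks.
    blocks = [list("UUUU"), list("MM"), list("M"), list("M")]
    for _ in range(3):                 # three unmovable requests must steal
        steal(blocks, 'U', prefer_hybrid)
    return count_hybrid(blocks)


print(simulate(prefer_hybrid=False))  # 3
print(simulate(prefer_hybrid=True))   # 1
```

With the baseline policy every movable pageblock ends up hybrid, while the hybrid-first policy confines all three stolen pages to one pageblock.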

Illuminator
[Figure labels: Buddy Allocator, Slab Allocator, alloc/free]

Eliminating LIU migration
- Skip hybrid pageblocks (known to Illuminator) during compaction, so that the migrations attempted are successful
- Reduces the cost of compaction by up to 99%
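A rough sketch of why skipping hybrid pageblocks saves work is shown below. It is a toy model, not Illuminator's implementation: pageblocks are strings of 'M' (movable), 'U' (unmovable), and '.' (free), a migration attempt aborts at the first unmovable page, and hybrid status is derived by scanning the string as a stand-in for whatever bookkeeping Illuminator actually keeps. The function name and the sample layouts are hypothetical.

```python
def compaction_cost(blocks, skip_hybrid: bool):
    """Count pages copied during compaction; returns (useful, wasted)."""
    useful = wasted = 0
    for block in blocks:
        hybrid = 'U' in block and 'M' in block
        if skip_hybrid and hybrid:
            continue               # known hybrid pageblock: do not try
        copied = 0
        for page in block:
            if page == 'U':        # attempt aborts at an unmovable page
                break
            if page == 'M':
                copied += 1
        if 'U' in block:
            wasted += copied       # the block could not be emptied
        else:
            useful += copied
    return useful, wasted


blocks = ["MMM.MMU.", "MM..M...", "M.MU..M.", "UU.U...."]
print(compaction_cost(blocks, skip_hybrid=False))  # (3, 7)
print(compaction_cost(blocks, skip_hybrid=True))   # (3, 0)
```

The useful work is the same in both runs; only the copies that could never free a pageblock disappear.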

Experimental Framework
Hardware
- 8-core Xeon Ivy Bridge server
- 8GB and 24GB physical memory (workload dependent)
- Page sizes: base 4KB, huge 2MB
Software
- HPC, scientific computing, and a database server
- SPEC CPU2006, PARSEC, NAS Parallel Benchmarks
- Linux kernel 4.5, Ubuntu OS, KVM hypervisor

Results (1/4): Performance
[Figure: % speedup for mcf, tigr, NPB_CG.D, omnetpp, and milc; series: Non-Fragmented, Linux-Moderate, Linux-High, Linux-Critical, Illuminator; y-axis from -60 to 60]

Results (2/4): Latency
- Setup: MySQL server benchmarked with the sysbench tool; 32 million rows, 8 threads performing read operations, 10 iterations
- The graph shows the maximum latency from each iteration
[Figure: MySQL read latency (seconds), Linux vs. Illuminator, over iterations 1 to 10; y-axis from 0 to 5; data labels in the transcript, series not identified: 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.16, 0.16, 0.02, 0.08]

Results (3/4): Performance isolation
- Workloads executed (one by one) alongside milc
[Figure: % slowdown under Linux vs. Illuminator for bodytrack, vips, ferret, PostgreSQL, and MySQL; y-axis from 0 to 30]

Results (4/4): Illuminator and virtualization
- KVM hypervisor; guest with 8 GB memory and 8 vCPUs
- The legend (Host, Guest, Both) denotes the layer at which Illuminator was applied
[Figure: % speedup for mummer, tigr, canneal, mcf, and milc; y-axis from 0 to 80, with two additional labels of 120 and 130]
