Memory Models and Processor Specifications at the Forefront of Computer Architecture

Slide Note

This content delves into the evolution of memory models and processor specifications in computer architecture. It discusses the challenges in existing models, the importance of defining a good model, and the significance of sequential consistency. The text touches upon the historical context, including the Java memory model and IBM's early instruction set specifications.

max_uti Follow

Uploaded on Mar 06, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Processor specification without a memory model Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology joint work with Murali Vijayaraghavan, Adam Chlipala IFIPS WG2.8, Estes Park, Colorado August 13, 2014 August 14, 2014 1

Good news Research on memory models has penetrated the POPL community Memory models used to be exclusively in the domain of computer architects until people realized that the Java memory model (Chapter 16 of Guy Steele s reference manual did not make much sense) The fixes to Java memory model were proposed by computer architects and, to my chagrin, published by POPL + drew attention of serious PL researchers - not surprisingly many papers followed showing that the new Java model was also broken 2 August 14, 2014

My concern POPL community is too focused on trying to model warts and moles of existing processor memory models as opposed to defining what a good model ought to be For me the Java memory model was broken because it did not have an operational (computational) interpretation. This danger is present in the current C11 effort. now about processors... 3 August 14, 2014

Instruction set specifications IBM was way ahead of its time when in 1964 it published an instruction set definition and 6 implementations of it The state of the art at that time: From the reference manual of IBM 650, a drum machine with 44 instructions, Instruction: 60 1234 1009 Load the contents of location 1234 into the distribution; put it also into the upper accumulator; set lower accumulator to zero; and then go to location 1009 for the next instruction. No serious problems until multiprocessor appeared (early seventies) 4 August 14, 2014

Sequential Consistency (SC) A Memory Model P P P P P P M A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program Leslie Lamport Dijkstra/Dekker had already used this model in 1966 in their solution to the readers and writer problem (No multiprocessor concurrency in OS) L21-5 August 14, 2014

Memory Model Issue Architectural optimizations that are correct for uniprocessors, often violate sequential consistency and result in a new memory model for multiprocessors Memory models have rarely been designed deliberately by architects, rather they have emerged as a way of characterizing legal behaviors of processors they have built 6 August 14, 2014

A Real Multiprocessor P P P P P P Ld or St request Ld response St Ack $ $ $ $ $ $ How to make the real multiprocessor behave like SC Cache coherence msgs $ $ $ M Processors are pipelined and generate many Ld/St requests which are processed in a pipelined and concurrent manner, while in Lamport s SC system only one instruction and at most one memory request is processed a time! L21-7 August 14, 2014

Implementing SC Make the real system process one memory request at a time This is like throwing the baby out with the bath water Let processors generate many requests but make caches behave in the SC manner for performance reasons we need to process Ld requests out of order (cache-hit vs cache-miss) and in a pipelined manner solution: make adjustments to both the memory caches and the pipelined processors 8 August 14, 2014

Two surprising facts Cache Coherence (CC) protocols are not affected by memory models puzzle: what is the correctness criteria for a CC protocol? Processors also pay minimal attention to memory models but make sure it is possible to mimic SC behavior Processors provide memory fences and other ad hoc techniques to enforce SC if desired Perhaps we should think of a processor model independent of a memory model 9 August 14, 2014

A simple processor sans a memory model a flag to indicate if the processor is waiting for a memory response issues a mem request and waits for an answer L21-10 August 14, 2014

A simple multiported memory system L21-11 August 14, 2014

Cache-coherent memory systems MSC selects the first request from some queue $ $ $ $ $ $ M colors represent addresses $ $ $ MRO selects the first request for any address from some queue M MC M Use cache-coherence protocols to make MC behave like MRO (aka as atomic store) Transition systems for CC protocols are complex 12 August 14, 2014

Implementing SC using MRO P P P P P P M Since MRO can process requests out of order make the processor verify if processing the requests out of order makes any difference How? Speculative Loads! 13 August 14, 2014

Speculative processors (PS) state rob ppc (reorder buffer) npc next pc predictor commit slot (pc, npc , state ) to/from memory When a processor speculates correctly then pc must match npc; if does not then rob is emptied out and ppc is set to npc Instruction set specifies the next value of (state,npc) given the current value of (state,npc) Therefore, a correct PS must guarantee that the commit slot has the right values for (npc ,state ) when pc and npc match What about Ld? St? 14 August 14, 2014

Speculative Ld/St NO speculative Sts; a St is issued only when it reaches the commit slot Speculative Lds are harmless because they do not affect the state of the external memory However, a speculative Ld can get a wrong value because PS may have read it too soon Solution: issue the Ld again when it reaches the commit slot and compare the values; mismatch means speculation failure (empty out rob) To guarantee SC no more than one St or verification Ld should be outstanding! 15 August 14, 2014

Speculative Processor L21-16 August 14, 2014

PS Load Instruction verification load L21-17 August 14, 2014

PS Store Instruction L21-18 August 14, 2014

Vector of systems System P can be lifted to a vector of Ps: ( [p],...) P ( ...) ( s,...) P* ( s[p:= ],...) 19 August 14, 2014

System A simulating System B s1 A s2 f f + f(s1) B f(s2) Definition (A B) If system A can make a move then System B can also make a similar move A B is aka A implements B or A is sound wrt B 20 August 14, 2014

Main theorems Theorem (cache-coherent memory): MC MRO Theorem (correctly speculating processor): PS Pref (one instruction-at-a-time) Theorem: PsS + MC SC This follows easily from the lemma PsRef + MRO Psref + MSC (Definition of SC) All proofs are done in Coq by Murali under Adam s guidance 21 August 14, 2014

Conclusion The architectural idea of verification loads can be shown to work formally Proof checkers (e.g., Coq) have come of age this was not the case in 2000; we had tried using PVS for a cache-coherence proof Modular proof are essential to make progress in such complex systems e.g., is the bug in the cache-coherence protocol or the speculative processor implementation We can express these transition systems directly in Bluespec which can be synthesized into hardware Thanks 22 August 14, 2014

Extras August 14, 2014 23

Lamports SC system decode(pcs[p], s[p]) = Ld(a,x) execute(Ld(a,x),pcs[p], s[p],m[a]) = (pc , ) (pcs, s,m) (pcs[p:=pc ], s[p:= + ],m) Ld rule decode(pcs[p], s[p]) = St(a,x) execute(St(a,x),pcs[p], s[p],_) = (pc ,St(a,v)) (pcs, s,m) (pcs[p:=pc ], s,m[a:=v]) St rule decode(pcs[p], s[p]) = Nm(a,x) execute(Nm(a,x),pcs[p], s[p],_) = (pc , ) (pcs, s,m) (pcs[p:=pc ], s[p:= + ],m) Non-mem instruction 24 August 14, 2014

Speculative Out-of-order Processor decode(pcs[p], s[p]) = Ld(a,x) execute(Ld(a,x),pcs[p], s[p],m[a]) = (pc , ) (s,pc,) (pcs[p:=pc ], s[p:= + ],m) Ld rule decode(pcs[p], s[p]) = St(a,x) execute(St(a,x),pcs[p], s[p],_) = (pc ,St(a,v)) (pcs, s,m) (pcs[p:=pc ], s,m[a:=v]) St rule decode(pcs[p], s[p]) = Nm(a,x) execute(Nm(a,x),pcs[p], s[p],_) = (pc , ) (pcs, s,m) (pcs[p:=pc ], s[p:= + ],m) Non-mem instruction 25 August 14, 2014

Decoupled Systems Processor transition system P: (pc, ,p2m,m2p,...) (pc , ,p2m ,m2p , ...) Memory transition system M: (m,p2m,m2p) (m,p2m ,m2p ) Each system MC, MRO, Pref, MSC can be described as a transition system using a small number of rule 26 August 14, 2014

Coupling two Systems Let the two systems have shared variable s and not shared variables x and y System A: (s,x) A (s ,x ) System B: (s,y) B (s ,y ) System A+B: (s,x) A (s ,y ) (s,x,y) A+B (s ,x ,y) (s,y) B (s ,y ) (s,x,y) A+B(s ,x,y ) We can thus couple M to Ps 27 August 14, 2014

Memory Models and Processor Specifications at the Forefront of Computer Architecture

Download Presentation

Presentation Transcript

Related

More Related Content