
Building a Bytecode Virtual Machine: A Comprehensive Guide
Learn how to build a virtual machine from scratch, understand its objectives, implementation, and major components. Explore the architecture and key services provided by a virtual machine in this detailed developer's guide.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Busy Developer's Guide to Building A Bytecode Virtual Machine
Objectives Building a virtual machine understanding a VM can come from building one so let's build one! and understand that there's limits to what we can do in a single session
How, exactly, do we build one of these? VIRTUAL MACHINE IMPLEMENTATION
Virtual Machine Implementation What is a VM? an abstraction that defines a computing model (aka AbstractMachine) an implementation of an AbstractMachine on top of hardware
Virtual Machine Implementation What does a VM provide? state (data) explicit (memory, registers, stack) static stack dynamic implicit (translated code, working areas, stateful I/O, ...) execution (code) definition (bytecode, native, etc) loading/resolution
Virtual Machine Implementation What does a VM provide? libraries services automatic memory management metadata access and manipulation ... others depending on VM intent host/guest interaction (FFI)
Major moving parts of most virtual machines ARCHITECTURE
Architecture Simplified Loader and dynamic linker Loader pulls in application package Loader extracts data structures (code and data) constant pool: collection of constants Dynamic linker resolves all referenced symbols Execution engine processor: fetch-decode-execute mechanism call stack: function call frames
Architecture Simplified Memory manager application data virtual machine data ("overhead") Thread scheduler Language extension Runtime services: platform access, etc Language extension/access: FFI
Architecture Code typically unique Instruction Set Architecture (ISA) loaded in from disk (or elsewhere?) formats often optimized for easy reading typically inaccessible (directly) to runtime code usually we need to know where we are in here
Architecture IP/PC register "instruction pointer" or "program counter" tracks the current location in code indicative of the next instruction for execution manipulated by flow-control instructions
Architecture Call stack collection of "activation frames" (procedure/method/function calls) LIFO order--push and pop from one end, most recent at the top contain parameters passed to the activation provides space for local-allocated values and space for any additional in-process work
Architecture FP/SP register "frame pointer" or "stack pointer" pointer to the current activation frame offset for any direct memory access of frame data
Architecture Global memory "static" sequential memory locations often allocated by the language/virtual machine (global variables)
Architecture Global memory "heap" sequential memory locations dynamically-allocated by running code typically somewhat managed by VM for programmer simplicity
Architecture Processor a software CPU fetch-decode-execute cycle fetch: get the next instruction (from IP) decode: if the instruction is "packed" execute: utilizing any operands required
Architecture Constants fixed collection of values in memory loaded along with code values known to the VM ahead-of-time string values function locations "magic constants" stored globally for easy reference
Understanding/implementing one BYTECODE
Bytecode What is bytecode? an abstract instruction set that is, not tied to hardware usually simplified than CPU assembly low-level allowing for more complex construction providing core/basic functionality tied to the abstract machine executing it stack vs register machine, for example (not always: see LLVM)
Bytecode Bytecode categories constant values stack manipulation (push, pop, etc) arithmetic operations flow-control operations (branch-if-true, branch-if-false, ...) storage operations (store, load; global, local) type conversion operations (int/float/object/etc to int/float/object/etc) allocation/deallocation
Bytecode Bytecode categories (imperative) procedure invocation direct or indirect (via address) taking reference (messaging) message-send, "super-send" (object) method invocation (functional) closure definition (logic) backtrack manipulation
Bytecode Syntax/semantics instructions are formed of two parts operation code (opcode) operation parameters (operands) these will sometimes be supplemented by other things directives (commands to the tools) labels (symbolic names used)
Bytecode Opcodes take operands (parameters) some take none (NOP, HALT, pop, etc) some take one (a constant value, etc) some take two (add, subtract, etc) some may take a varying number
Bytecode Opcode/operand format multiple bytes/values "packed" bytes/values note that "raw" format and "programmer" format need not be identical binary assemblers (source-to-binary) s-expressions ("lisp") XML (s-expressions with angle brackets)
Everybody on, everybody off the stack STACK-BASED VIRTUAL MACHINES
Stack-based Virtual Machines Abstract operations stack Registers will be special-purpose (IP, FP, SP, etc) ... but most operations will focus on the operations stack all operands come from the stack all results go back onto the stack the stack must (eventually) balance out to zero stack width is (usually) one "word"
Stack-based Virtual Machines "Abstract" is key most abstract VMs don't ignore registers register access is much faster than actual CPU stack CPUs often have their own conventions to honor ... so the operation stack may not be entirely on the CPU stack remember, "abstract" is usually a simplification, not a straitjacket
Stack-based Virtual Machines Stack-manipulation Operations push (or const) op: push op onto the stack pop: pop op off the stack dup: duplicate (pop, push, push) ldc ix: push constant (from constant pool index ix)
Stack-based Virtual Machines Mathematics Operations binary math operations (add, sub, mul, div, mod, pow) pop sp1, sp2; sp1 (operate) sp2; push result may specialize opcodes to precision (32-bit, 64-bit, int vs float, etc) order is important! ((1 - 2) != (2 - 1), remember); left-to-right or right-to-left? unary math operations (neg, abs, etc) pop sp1; (operate) sp1; push result may specialize opcodes to precision
Stack-based Virtual Machines Bitwise Operations shift or rotate operations (shiftl, shiftr, rotl, rotr) pop sp1; (operate) sp1; push result may specialize opcodes to precision most of these are really just extensions of unary math operations
Stack-based Virtual Machines Comparison Opcodes pop sp1, sp2 sp1 (compare) sp2? if true, push 1/true if false, push 0/false equality, non-equality greater-than, less-than ... and variations (greater-than-or-equal, less- than-or-equal, etc) may specialize opcodes to precision or
Stack-based Virtual Machines Branching/"jumping" Operations must always have one operand: where to jump typically an address in the bytecode operand may be "direct" (embedded in the stored code) operand may be "indirect" (we get the address from somewhere else) operand may be "absolute" address (within all the code)
Stack-based Virtual Machines Compare-then-branch is common jump-if-zero, jump-if-not-zero the most common depends on how the opcodes and operands are encoded or how "wide" the ISA wants to be these are often mirrored by underlying CPU opcodes
Stack-based Virtual Machines Call frame Operations call typically implies a branch-and-return (ret) pair conventions are critical call conventions parameters go onto the operations stack order-sensitive "left-to-right" (from callers' perspective)? or "right-to-left" (from callers' perspective)? typically assumes pass-by-value semantics (no immediate modifications)
Stack-based Virtual Machines Call frame Operations return conventions result goes onto operations stack usually single-value results, but... ... nothing really stops us from having multiple results if we wished ... which could be one way pass-by-reference is implemented! who cleans up the stack? where is return-address stored?
How many registers do you need? REGISTER-BASED VIRTUAL MACHINES
Register-based Virtual Machines Storage/work focuses on registers general-purpose registers integer-only registers floating-point registers string/data registers note: there's still a program stack involved making register VMs something of a superset to stack VMs
Register-based Virtual Machines Registers can hold single value typically of "word" size (32 or 64 bits) frequently of different sizes (for optimization purposes) may or may not map to actual CPU registers
Register-based Virtual Machines Value-manipulation LOADRx: load value into register constant value from local or global memory from other register STORERx: store register into local or global memory into other register
Register-based Virtual Machines Stack-manipulation Operations push (or const) op or Rx: push op or the contents of Rx onto the stack pop: pop op off the stack; pop Rx pops op off the stack into Rx dup: duplicate (pop, push, push) ldc ix: push constant (from constant pool index ix)
Register-based Virtual Machines Mathematical operations one/some registers are often specific to math (accumulator) operands can often be register local memory global memory ... and some machines restrict which from where binary operations (add, sub, mul, div, mod) three operands: src1, src2 and dest
Register-based Virtual Machines Bitwise Operations shift or rotate operations (shiftl, shiftr, rotl, rotr) most of these are really just extensions of unary math operations generally these will operate against a small set of registers
Register-based Virtual Machines Comparison Opcodes operand (compare) register? if true, push 1/true if false, push 0/false equality, non-equality greater-than, less-than ... and variations (greater-than-or-equal, less- than-or-equal, etc) may specialize opcodes to precision or convenience (comparison-to-zero, for
Register-based Virtual Machines Branching/"jumping" Operations must always have one operand: where to jump typically an address in the bytecode operand may be "direct" (embedded in the stored code) operand may be "indirect" (we get the address from somewhere else) operand may be "absolute" address (within all the code)
Register-based Virtual Machines Compare-then-branch is common jump-if-zero, jump-if-not-zero the most common depends on how the opcodes and operands are encoded or how "wide" the ISA wants to be these are often mirrored by underlying CPU opcodes
Register-based Virtual Machines Call frame Operations call typically implies a branch-and-return (ret) pair conventions are critical parameters can go into registers typically size-sensitive typically ordered (some/remaining) parameters go onto the operations stack order-sensitive "left-to-right" (from callers' perspective)?
Register-based Virtual Machines Call frame Operations result goes into register who cleans up the stack? where is return-address stored?
Build it already! IMPLEMENTATION
Implementation Steps (1/2) Basic architecture and scaffolding Simple (no-operand) ops: NOP, HALT, DUMP Simple stack ops: CONST, LDC, POP Globals ops: GSTORE, GLOAD Math ops: ADD, SUB, etc
Implementation Steps (2/2) Comparison ops: EQ, NE, GTE, etc Branching ops: JMP, JT, JF, etc Call ops: CALL, CALLI, RET Locals ops: LSTORE, LLOAD