- Published on
AtlasVM: Building a Distributed Virtual Machine - Lessons in Compilers and Distributed Systems
- Authors
- Name
- by Hamza El Idrissi
This project is part of the Missing Semester cohort initiative that explores practical computer science concepts often overlooked in traditional curricula. Through this cohort, I learned a lot about compilers, distributed systems, and I also discovered the Go programming language, which became the foundation of my implementation.
Introduction
This project proposes the development of a distributed virtual machine (VM) capable of executing programs written in a basic programming language. The VM will operate in a distributed system, where multiple nodes collaborate to run a program and reach a consensus on the final output. This project aims to explore the challenges and benefits of distributed computing through a simplified virtual machine framework.
Goals
- Develop a small virtual machine: Design and implement a VM with a limited set of basic instructions suitable for educational or research purposes.
- Implement a basic programming language: Create a user-friendly language allowing programmers to write instructions for the VM.
- Distribute the VM across a network: Design a system where multiple nodes in a network can collaborate to execute a single program.
- Implement a consensus mechanism: Develop a protocol where participating nodes agree on the final output of a program, even in the presence of potential network failures or node malfunctions.
COMPILERS
When I hear the word 'compiler' I instantly get intimidated. The idea of a compiler is surrounded by a lot of complexity and mystery. And actually they are complex, for example the LLVM & Clang projects currently consist of around 3 million lines of code. The GNU Compiler Collection, GCC, is even bigger. 15 million lines of code. So WHAT IS A COMPILER?
WHAT IS A COMPILER?
“A compiler is computer software that transforms computer code written in one programming language (the source language) into another computer language (the target language). Compilers are a type of translator that support digital devices, primarily computers. The name compiler is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language, object code, or machine code) to create an executable program.” - Wikipedia
EXAMPLES OF COMPILERS
- GCC (GNU Compiler Collection): GCC is a powerful compiler that compiles code written in C, C++, and other languages to machine code.
- Babel: Babel is a JavaScript compiler that converts ECMAScript 2015+ code into a backwards-compatible version of JavaScript in current and older browsers or environments.
- TypeScript: TypeScript is a superset of JavaScript that compiles to plain JavaScript.
- Java Compiler: The Java compiler converts Java code into Java bytecode, which is then executed by the Java Virtual Machine (JVM).
HOW DO COMPILERS WORK?
- Lexical Analysis: The compiler breaks the source code into tokens (e.g., keywords, identifiers, operators).
- Syntax Analysis: The compiler checks the syntax of the code to ensure it follows the rules of the programming language.
- Semantic Analysis: The compiler checks the meaning of the code to ensure it is semantically correct.
- Intermediate Code (IR) Generation: The compiler generates an intermediate representation of the code.
- Code Optimization: The compiler optimizes the intermediate code to improve performance.
- Code Generation: The compiler generates machine code or bytecode from the optimized intermediate code.
VIRTUAL MACHINES
A virtual machine is a computer built with software. It’s a software entity that mimics how a computer works. And by software entity I mean it can be anything. A function, a struct, an object, a module, or even a whole program. What matters is what it does.
A virtual machine has a run loop that goes through the fetch-decode-execute cycle, just like a computer. It has a program counter; it fetches instructions; it decodes and executes them. It also has a stack, just like a real computer. Sometimes it has a call stack and sometimes even registers. All built in software.
In the context of this project, the Atlas Specifications are as follows:
- Memory: The VM will have a memory space where it can store data and instructions with the size of 1024 bytes, divided into:
- Data Segment: 512 bytes (addressable from 0 to 511) where the program can store variables and data.
- Code Segment: 512 bytes (addressable from 512 to 1023) where the program instructions are stored.
- Addressing: The VM will use a 8-bit addressing system to access memory locations.
- Registers: The VM will have a set of registers to store temporary data and addresses.
- Program Counter (PC): A 8-bit register that points to the current instruction being executed.
- Accumulator (ACC): A 8-bit register used to store intermediate results.
- Stack: The VM will have a stack to store return addresses and local variables. It will be implemented using a portion of the data segment. using LIFO (Last In, First Out) order.
Instruction Set
AtlasVM uses a simple instruction set designed to support basic programming constructs. Each instruction is 1 byte long, with the first 4 bits representing the opcode and the remaining bits used for operands or addressing.
Opcode (Hex) | Instruction | Description |
---|---|---|
0x00 | ADD | Add value at memory address to ACC |
0x01 | SUB | Subtract value at memory address from ACC |
0x02 | MUL | Multiply ACC by value at the memory address |
0x03 | DIV | Divide ACC by value at memory address (remainder ignored) |
0x04 | AND | Perform bitwise AND between ACC and value at memory address |
0x05 | OR | Perform bitwise OR between ACC and value at memory address |
0x06 | XOR | Perform bitwise XOR between ACC and value at memory address |
0x07 | LOAD | Load value from memory address to ACC |
0x08 | STORE | Store value from ACC to memory address |
0x09 | JUMP | Unconditionally jump to the address specified in the next byte |
0x0A | JZ | Jump to the address specified in the next byte if ACC is zero |
0x0B | JNZ | Jump to the address specified in the next byte if ACC is not zero |
0x0C | IN | Read input from users and store it in ACC |
0x0D | OUT | Write value from ACC to output |
0x0E | HALT | Terminate program execution |
Programming Language: AtlasPL
AtlasPL is a simple, C-like programming language designed to compile to AtlasVM bytecode. It supports basic arithmetic operations, variables, conditionals, and loops.
Example AtlasPL code:
@ This program checks if a number is even.
var number: int;
number = 10; @ Assign a value to number
if ((number & 1) == 0) { @ Check if last bit is 0 (even)
return (0);
} else {
return (1);
}
Bytecode Compilation
The AtlasPL compiler translates high-level code into bytecode for the AtlasVM. Here's an example of how the above code might be compiled:
LOAD 0x00 // Load value 10 into ACC
AND 0x01 // Perform bitwise AND with 1
JZ 0x0A // Jump to end if ACC is zero
LOAD 0x01 // Load value 1 into ACC
HALT // Terminate program execution
LOAD 0x02 // Load value 0 into ACC
HALT // Terminate program execution
VM Implementation
The AtlasVM is implemented as a Go struct with methods for executing instructions. Here's an overview of the key components:
type VM struct {
Memory *Memory
Registers *Registers
Stack *Stack
running bool
input io.Reader
output io.Writer
}
The VM struct encapsulates the core components of our virtual machine:
- Memory: Represents the VM's memory space
- Registers: Holds the Program Counter (PC) and Accumulator (ACC)
- Stack: Implements a stack for function calls and local variables
- Input/Output: Allows for I/O operations
The Run
method implements the main execution loop:
func (vm *VM) Run() {
vm.running = true
for vm.running {
instruction := DecodeInstruction(vm.Memory.Read(uint16(vm.Registers.PC)))
vm.Registers.PC++
vm.executeInstruction(instruction)
}
}
This loop fetches, decodes, and executes instructions until the HALT
instruction is encountered. The executeInstruction
method handles the execution of each instruction:
func (vm *VM) executeInstruction(instruction Instruction) {
switch instruction.Opcode {
case ADD:
vm.Registers.ACC += int8(vm.Memory.Read(uint16(instruction.Operand)))
case SUB:
vm.Registers.ACC -= int8(vm.Memory.Read(uint16(instruction.Operand)))
case MUL:
vm.Registers.ACC *= int8(vm.Memory.Read(uint16(instruction.Operand)))
// Handle other instructions...
case HALT:
vm.running = false
default:
panic(fmt.Sprintf("Unknown opcode: %d", instruction.Opcode))
}
}
This method dispatches the execution of each instruction based on its opcode. The VM's state (memory, registers, stack) is updated accordingly.
Distributed Execution
To distribute execution across multiple nodes, we divide the program into segments. Each node is responsible for executing a portion of the program and sharing its state with other nodes.
- Program Distribution: The bytecode is divided into equal segments and distributed to all nodes.
- State Synchronization: After executing each instruction, nodes broadcast their VM state (PC, ACC, relevant memory) to all other nodes.
- Execution Coordination: Nodes take turns executing instructions, with a designated leader node coordinating the turns.
The distributed nature of AtlasVM is implemented through a network package that handles communication between nodes. Each node runs an instance of the VM and participates in a consensus protocol to agree on the VM's state.
type Node struct {
ID string
Address string
Peers map[string]*NodeClient
VM *vm.VM
consensus *Consensus
mu sync.Mutex
}
type NodeClient struct {
ID string
Client pb.NodeServiceClient
}
The Node
struct represents a single node in the network, with fields for its ID, address, peers, VM instance, and consensus protocol. The NodeClient
struct is used to communicate with other nodes in the network.
func (n *Node) Run() {
for {
n.mu.Lock()
if n.VM.Running() {
instruction := n.VM.FetchInstruction()
n.VM.ExecuteInstruction(instruction)
n.consensus.BroadcastState(n.VM.State())
}
n.mu.Unlock()
}
}
The Run
method of the Node
struct represents the main execution loop of a node. It fetches and executes instructions from the VM, broadcasting its state to other nodes through the consensus protocol.
Consensus Mechanism
We implement a simplified version of the Raft consensus algorithm to ensure all nodes agree on the program's output:
- Leader Election: Nodes elect a leader responsible for coordinating execution.
- Log Replication: Each instruction execution is treated as a log entry, replicated across all nodes.
- Commit: An instruction's result is considered final when a majority of nodes have acknowledged it.
type ConsensusState int
const (
PrePrepare ConsensusState = iota
Prepare
Commit
Finalize
)
type Consensus struct {
node *Node
state ConsensusState
currentView int64
prepareCount map[string]int
commitCount map[string]int
mu sync.Mutex
decidedValue *pb.ConsensusMessage
decisionQuorum int
}
The Consensus
struct represents the consensus protocol's state, including the current view, prepare and commit counts, and the decided value. The protocol ensures that nodes agree on the final output of the program.
func (c *Consensus) BroadcastState(state *pb.VMState) {
// Broadcast VM state to all peers
}
func (c *Consensus) HandleMessage(msg *pb.ConsensusMessage) {
// Handle incoming consensus messages
}
func (c *Consensus) StartElection() {
// Start leader election process
}
func (c *Consensus) ExecuteInstruction(instruction *pb.Instruction) {
// Execute instruction and update consensus state
}
The Consensus
struct provides methods for broadcasting state updates, handling incoming messages, starting leader elections, and executing instructions while maintaining consensus.
Protocol Overview
- Leader Election: Nodes exchange messages to elect a leader responsible for coordinating execution.
- Instruction Execution: The leader sends instructions to nodes, which execute them and broadcast their state.
- State Synchronization: Nodes exchange state updates and acknowledge each other's progress.
- Commitment: An instruction is considered committed when a majority of nodes have acknowledged it.
The consensus protocol ensures that all nodes agree on the final output of the program, even in the presence of network failures or node malfunctions.
COMMUNICATION BETWEEN NODES
In this project, all of our nodes (individual parts of our distributed system) are written in Go. Even though we're using just one language, we use Protocol Buffers and gRPC. Let's explore why we use these technologies and how they help us build a robust distributed system.
- Protocol Buffers is a language-agnostic data serialization format, meaning you can define your data structures once and generate code for multiple languages.It is developed by Google and allows you to define the structure of your data in a
.proto
file and generate code for serialization and deserialization in various languages. In the context of AtlasVM it helps us define the messages exchanged between nodes. - gRPC is a high-performance, open-source framework developed by Google for efficient and language-agnostic communication between distributed systems. In AtlasVM, we use gRPC to facilitate communication between nodes in our distributed virtual machine.
Protocol Buffers: Structured Data Definition
Protocol Buffers help us define our data structures in a clear, efficient way. In our atlas.proto
file, we describe the format of messages that our nodes will exchange:
message VMState {
bytes memory = 1;
uint32 pc = 2;
int32 acc = 3;
}
This defines a VMState
message with three fields: memory
, pc
, and acc
.
The Protocol Buffers Compiler (protoc)
We use the protoc
compiler to generate Go code from our .proto
file:
protoc --go_out=. --go-grpc_out=. proto/atlas.proto
This creates Go structs and methods that match our Protocol Buffer definitions. The benefits include:
- Automatic Code Generation: We don't have to manually write structs and serialization code.
- Type Safety: The generated code ensures we're using the correct types for each field.
- Efficient Serialization: Protocol Buffers provide a compact binary format for data exchange.
gRPC: High-Performance RPC Framework
gRPC is a system for making remote procedure calls (RPC). In simpler terms, it allows one Go program to call functions on another Go program, even if they're running on different machines.
In our atlas.proto
file, we define our gRPC service:
service NodeService {
rpc ReceiveMessage(ConsensusMessage) returns (Empty);
}
This tells gRPC that our nodes can receive ConsensusMessage
type messages.
The benefits of using gRPC in our Go project include:
- Code Generation: gRPC generates client and server code, saving us from writing boilerplate networking code.
- Efficient Communication: gRPC uses HTTP/2, which is faster and more efficient than older protocols.
- Streaming Support: gRPC allows for advanced features like bi-directional streaming, which could be useful for continuous data exchange between nodes.
How It Works in AtlasVM
- We define our message structures and services in
atlas.proto
. - We use
protoc
to generate Go code (atlas.pb.go
andatlas_grpc.pb.go
). - In our node implementation, we use the generated code to create and handle messages:
type NodeService struct {
node *Node
pb.UnimplementedNodeServiceServer
}
func (s *NodeService) ReceiveMessage(ctx context.Context, msg *pb.ConsensusMessage) (*pb.Empty, error) {
// Handle the received message
// ...
return &pb.Empty{}, nil
}
- When one node needs to send a message to another, it uses the generated client code:
client := pb.NewNodeServiceClient(grpcConnection)
response, err := client.ReceiveMessage(context.Background(), consensusMessage)
By using gRPC and Protocol Buffers, we simplify the communication between nodes in our distributed virtual machine, making it easier to build a robust and efficient system.
The project is still a work in progress, but it has been a great learning experience in compilers, virtual machines, and distributed systems. You can find the full source code and documentation on GitHub.