This article highlights the first optimisations I made while redesigning and semi-porting the runtime from C to Rust. They aren’t benchmarked or verified yet, since the virtual machine is still under construction and will probably be finished this week.
At a high level, purple-garden currently works like this, with (2+3)*(4-1) as
an example input:
```text
.
+- Tokenizer
|
]: Token(2) Token(+) Token(3)
]: Token(*)
]: Token(4) Token(-) Token(1)
|
 \
  +- Parsing (Tokens -> Abstract Syntax Tree)
  |
  ]: (Asteriks
  ]:   (Plus
  ]:     Integer("2")
  ]:     Integer("3")
  ]:   )
  ]:   (Minus
  ]:     Integer("4")
  ]:     Integer("1")
  ]:   )
  ]: )
  |
  |
<planned section start>
   \
    +- Planned IR and Optimisation Boundary
    |
   / \
  |   +- JIT Compiler (IR -> x86/ARM)
  |     ^
  |      \
  |       \
  |        \
  |         \
  |          \
<planned section end>                  | Calls
  |                                    | JIT'ed
   \                                   | functions
    +- Compiler (AST/IR -> bytecode)   |
    |                                  /
    ]: __entry:                       /
    ]:   load_imm r0, #2              |
    ]:   load_imm r1, #3              |
    ]:   add r2, r0, r1               |
    ]:   load_imm r1, #4              |
    ]:   load_imm r0, #1              |
    ]:   sub r3, r1, r0               |
    ]:   mul r0, r2, r3               |
    |                                 |
     \                                |
      +- Peephole Optimiser           |
      |                               |
      ]: __entry:                     |
      ]:   load_imm r2, #5            |
      ]:   load_imm r3, #3            |
      ]:   mul r0, r2, r3             |
      |                               /
       \                             /
        +- Baseline interpreter ----+
        |
        ] [vm][0000] LoadImm { dst: 2, value: 5 }
        ] [vm][0001] LoadImm { dst: 3, value: 3 }
        ] [vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
        |
```

Peephole Optimizations
Peephole optimisations are, as the name implies, optimisations performed on a
small section of a larger input. For a virtual machine like purple-garden,
this means using a window of size 3 (anything larger is no longer local; such
rewrites are subject to IR optimisation, not peephole optimisation, and will
have happened before the peephole stage is reached in the compilation
pipeline), merging operators and rewriting redundant or useless operations.
So, to summarise:
- peephole optimisations are done on the bytecode in a local (non-global) setting
- they are local, single-pass, and only meant to catch what the IR optimisation missed
- in this case, that makes them fallback optimisations for rewrites the IR optimisation pipeline did not perform
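To illustrate the locality, here is a minimal sketch (plain Rust for illustration, not purple-garden code) of which instruction ranges a size-3 window exposes over the seven-instruction __entry block from the diagram above; any pattern spanning more than three adjacent instructions is simply invisible to the pass:

```rust
const WINDOW_SIZE: usize = 3;

fn main() {
    // __entry from the diagram above holds seven instructions
    let program_len = 7;

    // prints 0..3, 1..4, 2..5, 3..6 and 4..7: every rewrite decision
    // is made from three adjacent instructions, nothing more
    for i in 0..=program_len - WINDOW_SIZE {
        println!("inspecting instructions {:?}", i..i + WINDOW_SIZE);
    }
}
```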
Purple garden implementation
Peephole optimisations in purple-garden are intentionally kept single-pass, both to keep the startup time cost as low as possible and to move heavy optimisation into the IR.
This creates a problem for optimisations that only become possible as the result of a previous optimisation; it is mitigated by the peephole pass acting as a fallback for the earlier optimisation pipeline.
```rust
const WINDOW_SIZE: usize = 3;

/// Peephole optimisations
///
/// See:
/// - [Peephole optimization - Wikipedia]
///   (https://en.wikipedia.org/wiki/Peephole_optimization)
/// - [W. M. McKeeman "Peephole Optimization"]
///   (https://dl.acm.org/doi/epdf/10.1145/364995.365000)
pub fn bc(bc: &mut Vec<Op>) {
    // bytecode shorter than the window would make the slice
    // below reach past the end and panic
    if bc.len() < WINDOW_SIZE {
        return;
    }

    for i in 0..=bc.len() - WINDOW_SIZE {
        let window = &mut bc[i..i + WINDOW_SIZE];
        bc::const_binary(window);
        bc::self_move(window);
    }

    bc.retain(|op| !matches!(op, Op::Nop))
}
```

Since optimisations can rewrite, replace, and remove instructions, Op::Nop
encodes removal: it is the replacement for instructions that were optimised
away. These placeholders are then removed from the bytecode list after all
optimisations are applied.
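To make this concrete, here is roughly what the pass does to the running example's bytecode; a sketch that assumes the Op variants shown throughout this article (with PartialEq and Debug derived for the assert) and the opt::bc path from the integration section below:

```rust
// bytecode the compiler emits for (2+3)*(4-1), as disassembled earlier
let mut bc = vec![
    Op::LoadImm { dst: 0, value: 2 },
    Op::LoadImm { dst: 1, value: 3 },
    Op::Add { dst: 2, lhs: 0, rhs: 1 },
    Op::LoadImm { dst: 1, value: 4 },
    Op::LoadImm { dst: 0, value: 1 },
    Op::Sub { dst: 3, lhs: 1, rhs: 0 },
    Op::Mul { dst: 0, lhs: 2, rhs: 3 },
];

opt::bc(&mut bc);

// both constant binaries are folded, their Nop placeholders compacted away
assert_eq!(
    bc,
    vec![
        Op::LoadImm { dst: 2, value: 5 },
        Op::LoadImm { dst: 3, value: 3 },
        Op::Mul { dst: 0, lhs: 2, rhs: 3 },
    ]
);
```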
self_move
A “self move” is a mov instruction whose dst and src are equivalent:
```asm
__entry:
    load_imm r0 #5
    load_imm r1 #5
    mov r1 r1 ; <-- basically a NOP
    add r1 r0
```

If this pattern is encountered, the VM would waste processing power on running
an effective NOP instruction; to combat this, such instructions are removed:
```asm
__entry:
    load_imm r0 #5
    load_imm r1 #5
    add r1 r0
```

This is achieved by iterating over the window and replacing self moves with Op::Nop:
```rust
/// self_move removes patterns conforming to
///
///   Mov { dst: x, src: x },
///
/// where dst == src
pub fn self_move(window: &mut [Op]) {
    for op in window.iter_mut() {
        if let Op::Mov { dst, src } = op {
            if dst == src {
                *op = Op::Nop;
                opt_trace!("self_move", "removed self_moving Mov");
            }
        }
    }
}
```

const_binary
This optimisation targets a binary instruction whose lhs and rhs are both
created via LoadImm directly before it in the window:
```asm
__entry:
    load_imm r0, #2
    load_imm r1, #3
    add r2, r0, r1 ; <-- this is a compile time known result := 2+3=5
    load_imm r1, #4
    load_imm r0, #1
    sub r3, r1, r0 ; <-- this is a compile time known result := 4-1=3
    mul r0, r2, r3
```

In this example, add, sub, and mul are constant. The optimisation itself
applies to all arithmetic instructions. Thus, after optimising:
```asm
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
```

This looks like the optimiser pass missed the load_imm, load_imm, mul
pattern, but the implementation is deliberately non-recursive: recursive
constant folding is the job of the IR optimisation, which runs before peephole
optimisation, so this case will no longer occur once the IR and its optimiser
are done. Overflow is currently handled silently using wrapping arithmetic;
this could, and probably will, be changed to trigger a compile-time error in
the future.
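For illustration, this is the difference between the current wrapping behaviour and a checked variant that could surface such an error; the fold_add helper is a hypothetical sketch of the possible future behaviour, not the current implementation, and it assumes i64 immediates:

```rust
fn main() {
    // wrapping: i64::MAX + 1 silently wraps around to i64::MIN
    assert_eq!(i64::MAX.wrapping_add(1), i64::MIN);

    // checked: overflow yields None, which the compiler could turn
    // into a compile-time error instead of silently folding
    assert_eq!(fold_add(2, 3), Ok(5));
    assert!(fold_add(i64::MAX, 1).is_err());
}

// hypothetical checked fold for the Add case
fn fold_add(x: i64, y: i64) -> Result<i64, String> {
    x.checked_add(y)
        .ok_or_else(|| format!("constant fold overflows: {x} + {y}"))
}
```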
```rust
/// const_binary fuses
///
///   LoadImm { dst: a, value: x },
///   LoadImm { dst: b, value: y },
///   bin { dst, lhs: a, rhs: b }
///
/// into
///
///   LoadImm { dst, value: x bin y }
///
/// where bin := Add | Sub | Mul | Div
pub fn const_binary(window: &mut [Op]) {
    let [
        Op::LoadImm { dst: a, value: x },
        Op::LoadImm { dst: b, value: y },
        op,
    ] = window
    else {
        return;
    };

    let (dst, result) = match *op {
        Op::Add { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_add(*y)),
        Op::Sub { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_sub(*y)),
        Op::Mul { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_mul(*y)),
        Op::Div { dst, lhs, rhs }
            if lhs == *a && rhs == *b && *y != 0 => (dst, x.wrapping_div(*y)),
        _ => return,
    };

    window[0] = Op::LoadImm { dst, value: result };
    window[1] = Op::Nop;
    window[2] = Op::Nop;

    opt_trace!("const_binary", "fused a constant binary op");
}
```

Warning
As mort96 pointed out:
Is this valid? The first const_binary optimization changes the values of r0 and r1 (before the optimization they’re 2 and 3, after the optimization they’re unknown). The second const_binary optimization changes the values of r1 and r0 (before the optimization they’re 4 and 1, after the optimization they’re unknown). An optimizer which analyzes the whole block would be able to see that r0 and r1 are written to before they’re read from, but the peephole optimizer only sees the 3 instructions, right?
Well, as it turns out, it isn't: instructions were optimised away under the assumption that no later instruction reuses their registers, which is wrong. Consider, for instance:
```asm
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
```

In this example, under our previous assumption, we would be able to merge all
three instructions into a single load_imm r0, #15. But if we extend the
example by a single instruction reusing r2, r3, or both, the assumption no
longer holds:
```asm
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3 ; if we were to apply the pass here,
                   ; both r2 and r3 would no longer have
                   ; any values in them, thus the below
                   ; instruction would be operating on
                   ; undefined registers, which is
                   ; undefined behaviour and crashes
                   ; the virtual machine
    add r0, r2, r3
```

Since this optimisation would break behaviour, I disabled it in 1141edf.
However, since constant folding is in scope for the IR optimisation passes, this is not a loss for the virtual machine.
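As mort96 notes, an optimiser that analyses the whole block could still justify the aggressive fuse by proving that the clobbered registers are overwritten before they are read again. A minimal sketch of such a check, hypothetical and not part of purple-garden, assuming the Op shapes used above with plain usize registers:

```rust
/// Returns true if `reg` is read by one of the following instructions
/// before being overwritten, i.e. folding away its defining LoadImm
/// would be unsafe. `rest` is the block after the fused window.
fn read_before_overwritten(rest: &[Op], reg: usize) -> bool {
    for op in rest {
        match op {
            // binary ops read lhs/rhs and write dst
            Op::Add { dst, lhs, rhs }
            | Op::Sub { dst, lhs, rhs }
            | Op::Mul { dst, lhs, rhs }
            | Op::Div { dst, lhs, rhs } => {
                if *lhs == reg || *rhs == reg {
                    return true; // read first: fusing is unsafe
                }
                if *dst == reg {
                    return false; // overwritten first: fusing is safe
                }
            }
            Op::Mov { dst, src } => {
                if *src == reg {
                    return true;
                }
                if *dst == reg {
                    return false;
                }
            }
            // LoadImm only writes its destination
            Op::LoadImm { dst, .. } if *dst == reg => return false,
            _ => {}
        }
    }
    // never touched again: fusing is safe
    false
}
```

A peephole pass by definition never sees rest, which is exactly why the fuse had to go; with the planned IR boundary, this kind of analysis lives there instead.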
Observability
```rust
macro_rules! opt_trace {
    ($optimisation:literal, $text:literal) => {
        #[cfg(feature = "trace")]
        println!("[opt::{}]: {}", $optimisation, $text);
    };
}
```

The opt_trace! macro is used throughout the optimisation pipeline to give
insight into the decision making and the optimisations performed.
```rust
opt_trace!("const_binary", "fused a constant binary op");
```

This results in [opt::const_binary]: fused a constant binary op when the
runtime is compiled with --features=trace.
Integration and flag guard
Since startup time is one of the most important parts of a fast runtime (for
me!), peephole optimisation is guarded behind -O1 at this time:
```rust
// all error handling is hidden
fn main() {
    // [...] tokenizing, parsing, further setup

    let mut cc = cc::Cc::new();
    cc.compile(&ast);

    if args.opt >= 1 {
        opt::bc(&mut cc.buf);
    }

    let mut vm = cc.finalize();

    // [...] other flags and running the VM
}
```

Running something like (2+3)*(4-1) through the full pipeline:
```shell
cargo run \
    # compile with tracing prints included
    --features trace \
    -- \
    # args for purple-garden

    # print the abstract syntax tree
    --ast \
    # disassemble the produced bytecode
    --disassemble \
    # set optimisation level
    -O1 \
    # specify input; if not set, just pass
    # a file to execute as an argument
    -r="(2+3)*(4-1)"
```

```text
(Asteriks
 (Plus
  Integer("2")
  Integer("3")
 )
 (Minus
  Integer("4")
  Integer("1")
 )
)
Cc::cc(Asteriks)
Cc::cc(Plus)
Cc::cc(Integer("2"))
RegisterAllocator::alloc(r0)
Cc::cc(Integer("3"))
RegisterAllocator::alloc(r1)
RegisterAllocator::alloc(r2)
RegisterAllocator::free(r0)
RegisterAllocator::free(r1)
Cc::cc(Minus)
Cc::cc(Integer("4"))
RegisterAllocator::alloc(r1)
Cc::cc(Integer("1"))
RegisterAllocator::alloc(r0)
RegisterAllocator::alloc(r3)
RegisterAllocator::free(r1)
RegisterAllocator::free(r0)
RegisterAllocator::alloc(r0)
RegisterAllocator::free(r2)
RegisterAllocator::free(r3)
[opt::const_binary]: fused a constant binary op
[opt::const_binary]: fused a constant binary op
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
[vm][0000] LoadImm { dst: 2, value: 5 }
[vm][0001] LoadImm { dst: 3, value: 3 }
[vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
```

Testing
To ensure correctness for the expected optimisation case, tests are necessary. These tests validate pattern rewriting, not full program semantic equivalence.
RUST
1#[cfg(test)]
2mod bc {
3 use crate::op::Op;
4
5 #[test]
6 fn self_move() {
7 let mut bc = vec![
8 Op::Mov { src: 64, dst: 64 },
9 Op::Mov { src: 64, dst: 64 },
10 Op::Mov { src: 64, dst: 64 },
11 ];
12 crate::opt::bc::self_move(&mut bc);
13 assert_eq!(bc, vec![Op::Nop, Op::Nop, Op::Nop])
14 }
15
16 #[test]
17 fn const_binary() {
18 let mut bc = vec![
19 Op::LoadImm { dst: 0, value: 1 },
20 Op::LoadImm { dst: 1, value: 2 },
21 Op::Add {
22 dst: 0,
23 lhs: 0,
24 rhs: 1,
25 },
26 Op::LoadImm { dst: 0, value: 1 },
27 Op::LoadImm { dst: 1, value: 2 },
28 Op::Sub {
29 dst: 0,
30 lhs: 0,
31 rhs: 1,
32 },
33 Op::LoadImm { dst: 0, value: 3 },
34 Op::LoadImm { dst: 1, value: 5 },
35 Op::Mul {
36 dst: 0,
37 lhs: 0,
38 rhs: 1,
39 },
40 Op::LoadImm { dst: 0, value: 64 },
41 Op::LoadImm { dst: 1, value: 8 },
42 Op::Div {
43 dst: 0,
44 lhs: 0,
45 rhs: 1,
46 },
47 ];
48
49 for i in 0..=bc.len().saturating_sub(3) {
50 crate::opt::bc::const_binary(&mut bc[i..i + 3]);
51 }
52
53 bc.retain(|op| *op != Op::Nop);
54 assert_eq!(
55 bc,
56 vec![
57 Op::LoadImm { dst: 0, value: 3 },
58 Op::LoadImm { dst: 0, value: -1 },
59 Op::LoadImm { dst: 0, value: 15 },
60 Op::LoadImm { dst: 0, value: 8 },
61 ]
62 )
63 }
64}