Poking holes into bytecode with peephole optimisations


This article highlights the first optimizations I made while redesigning and semi-porting the runtime from C to Rust. These aren’t benchmarked or verified, since the virtual machine is currently under construction and will probably be finished this week.

At a high level, purple-garden currently works like this, with (2+3)*(4-1) as an exemplary input:

TEXT

.
+- Tokenizer
|
]: Token(2) Token(+) Token(3)
]: Token(*)
]: Token(4) Token(-) Token(1)
|
 \
  +- Parsing (Tokens -> Abstract Syntax Tree)
  |
  ]: (Asteriks
  ]:   (Plus
  ]:     Integer("2")
  ]:     Integer("3")
  ]:   )
  ]:   (Minus
  ]:     Integer("4")
  ]:     Integer("1")
  ]:   )
  ]: )
  |
  |
<planned section start>
  \
   +- Planned IR and Optimisation Boundary
   |
  / \
  |  +- JIT Compiler (IR -> x86/ARM)
  |                           ^
  |                            \
  |                             \
  |                              \
  |                               \
  |                                \
<planned section end>               |Calls
  |                                 |JIT'ed
  \                                 |functions
   +- Compiler (AST/IR -> bytecode) |
   |                                /
   ]:  __entry:                    /
   ]:          load_imm r0, #2    |
   ]:          load_imm r1, #3    |
   ]:          add r2, r0, r1     |
   ]:          load_imm r1, #4    |
   ]:          load_imm r0, #1    |
   ]:          sub r3, r1, r0     |
   ]:          mul r0, r2, r3     |
   |                              |
   \                              |
    +- Peephole Optimiser         |
    |                             |
    ]:  __entry:                  |
    ]:          load_imm r2, #5   |
    ]:          load_imm r3, #3   |
    ]:          mul r0, r2, r3    |
    |                            /
    \                           /
     +- Baseline interpreter --+
     |
     ] [vm][0000] LoadImm { dst: 2, value: 5 }
     ] [vm][0001] LoadImm { dst: 3, value: 3 }
     ] [vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
     |
     '

Peephole Optimizations

Peephole optimisations are, as the name implies, optimisations performed on a small section of a larger input. For a virtual machine like purple-garden, this means using a window of size 3 (anything larger is no longer local and is subject to IR optimisation, not peephole optimisation, which will have happened before peephole optimisation is reached in the compilation pipeline) and merging operators or rewriting redundant and useless operations.

So to summarize:

  • peephole optimisations are done on the bytecode in a local setting (non-global)
  • in this case they are fallback optimisations for things the IR optimisation rewriting missed
  • these are local, single-pass, and meant only to catch what the IR opt didn’t

Purple garden implementation

For purple-garden specifics and questions regarding the runtime, consult the project repository.

Peephole optimisations in purple-garden are intentionally kept single-pass, to keep startup cost as low as possible and to move heavy optimisation into the IR.

This creates a problem for recursive optimisations, i.e. optimisations that only become possible as the result of a previous optimisation. This is mitigated by peephole optimisations being merely the fallback for the earlier optimisation pipeline.

RUST

const WINDOW_SIZE: usize = 3;

/// Peephole optimisations
///
/// See:
/// - [Peephole optimization - Wikipedia]
///   (https://en.wikipedia.org/wiki/Peephole_optimization)
/// - [W. M. McKeeman "Peephole Optimization"]
///   (https://dl.acm.org/doi/epdf/10.1145/364995.365000)
pub fn bc(bc: &mut Vec<Op>) {
    // guard against bytecode shorter than the window, which would
    // otherwise panic when slicing below
    if bc.len() >= WINDOW_SIZE {
        for i in 0..=bc.len() - WINDOW_SIZE {
            let window = &mut bc[i..i + WINDOW_SIZE];
            bc::const_binary(window);
            bc::self_move(window);
        }
    }

    bc.retain(|op| !matches!(op, Op::Nop))
}

Since optimisations can rewrite, replace, and remove instructions, Op::Nop encodes removal: it serves as the replacement for instructions that were optimised away. All Nops are then removed from the bytecode list after all optimisations have been applied.

self_move

“Self move” refers to a mov instruction whose dst and src are identical:

ASM

__entry:
    load_imm r0 #5
    load_imm r1 #5
    mov r1 r1 ; <-- basically NOP
    add r1 r0

If this pattern is encountered, the VM would waste processing power executing an effective no-op; to combat this, such instructions are removed:

ASM

__entry:
    load_imm r0 #5
    load_imm r1 #5
    add r1 r0

This is achieved by iterating the window and replacing self movs:

RUST

/// self_move removes patterns conforming to
///
///     Mov { dst: x, src: x },
///
/// where dst == src
pub fn self_move(window: &mut [Op]) {
    for op in window.iter_mut() {
        if let Op::Mov { dst, src } = op {
            if dst == src {
                *op = Op::Nop;
                opt_trace!("self_move", "removed self_moving Mov");
            }
        }
    }
}

const_binary

This optimisation targets a binary instruction whose lhs and rhs are both created via LoadImm directly beforehand in the same window:

ASM

__entry:
        load_imm r0, #2
        load_imm r1, #3
        add r2, r0, r1 ; <-- this is a compile time known result := 2+3=5
        load_imm r1, #4
        load_imm r0, #1
        sub r3, r1, r0 ; <-- this is a compile time known result := 4-1=3
        mul r0, r2, r3

In this example, add, sub, and mul all produce compile-time-known results. The optimisation itself applies to all arithmetic instructions. Thus, after optimising:

ASM

__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3

This looks like the optimiser missed the resulting load_imm, load_imm, mul pattern, but the implementation is intentionally non-recursive: recursive constant folding is the job of IR optimisation, which comes before peephole optimisation, so this case will no longer occur once the IR and the IR optimiser are done. Overflow is currently handled silently using wrapping arithmetic; this could, and probably will, be changed to trigger a compile-time error in the future.
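Assuming immediates are 64-bit signed integers (an assumption on my part; the width isn't stated above), Rust's wrapping_* family shows exactly what silent overflow means here:

```rust
fn main() {
    // normal case: folding 2+3 at compile time yields 5
    assert_eq!(2i64.wrapping_add(3), 5);

    // overflow cases: the fold silently wraps around instead of
    // panicking or reporting an error to the user
    assert_eq!(i64::MAX.wrapping_add(1), i64::MIN);
    assert_eq!(i64::MIN.wrapping_sub(1), i64::MAX);

    // even division can overflow: i64::MIN / -1 wraps back to i64::MIN
    assert_eq!(i64::MIN.wrapping_div(-1), i64::MIN);
}
```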

RUST

/// const_binary fuses
///
///     LoadImm{ dst: a, value: x },
///     LoadImm{ dst: b, value: y },
///     bin { dst, lhs: a, rhs: b }
///
/// into
///
///     LoadImm { dst, value: x bin y }
///
/// where bin := Add | Sub | Mul | Div
pub fn const_binary(window: &mut [Op]) {
    let [
        Op::LoadImm { dst: a, value: x },
        Op::LoadImm { dst: b, value: y },
        op,
    ] = window
    else {
        return;
    };

    let (dst, result) = match *op {
        Op::Add { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_add(*y)),
        Op::Sub { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_sub(*y)),
        Op::Mul { dst, lhs, rhs }
            if lhs == *a && rhs == *b => (dst, x.wrapping_mul(*y)),
        Op::Div { dst, lhs, rhs }
            if lhs == *a && rhs == *b && *y != 0 => (dst, x.wrapping_div(*y)),
        _ => return,
    };

    window[0] = Op::LoadImm { dst, value: result };
    window[1] = Op::Nop;
    window[2] = Op::Nop;

    opt_trace!("const_binary", "fused a constant binary op");
}

Warning

As mort96 pointed out:

Is this valid? The first const_binary optimization changes the values of r0 and r1 (before the optimization they’re 2 and 3, after the optimization they’re unknown). The second const_binary optimization changes the values of r1 and r0 (before the optimization they’re 4 and 1, after the optimization they’re unknown). An optimizer which analyzes the whole block would be able to see that r0 and r1 are written to before they’re read from, but the peephole optimizer only sees the 3 instructions, right?

Well, as it turns out, it isn't valid: the pass was written under the assumption that no later instruction reuses the registers that were optimised away, which is wrong. Consider:

ASM

__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3

In this example, under our previous assumption, we would merge all three instructions into a single load_imm r0, #15. But if we extend the example by a single instruction reusing r2, r3, or both, the assumption no longer holds:

ASM

__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3 ; if we were to apply the pass here,
                       ; both r2 and r3 would not have
                       ; any values in them, thus the below
                       ; instruction would be operating on
                       ; undefined registers, which is
                       ; undefined behaviour and crashes
                       ; the virtual machine
        add r0, r2, r3

Since this optimisation would break behaviour, I disabled it in 1141edf.

However, since constant folding is in the scope of IR optimisation passes, this will not be a loss for the virtual machine.

Observability

RUST

macro_rules! opt_trace {
    ($optimisation:literal, $text:literal) => {
        #[cfg(feature = "trace")]
        println!("[opt::{}]: {}", $optimisation, $text);
    };
}

The opt_trace! macro is used throughout the optimisation pipeline to provide insight into decision-making and the optimisations performed.

RUST

opt_trace!("const_binary", "fused a constant binary op");

This results in [opt::const_binary]: fused a constant binary op when the runtime is compiled with --features=trace.

Integration and flag guard

Since startup time is one of the most important parts of a fast runtime (for me!), peephole optimisation is guarded behind -O1 at this time:

RUST

// all error handling is hidden
fn main() {
    // [...] tokenizing, parsing, further setup

    let mut cc = cc::Cc::new();
    cc.compile(&ast);

    if args.opt >= 1 {
        opt::bc(&mut cc.buf);
    }

    let mut vm = cc.finalize();

    // [...] other flags and running the VM
}

Running something like 2+3*4-1 through the full pipeline:

SHELL

# compile with tracing prints included; everything after `--`
# are args for purple-garden:
#   --ast          print the abstract syntax tree
#   --disassemble  disassemble the produced bytecode
#   -O1            set the optimisation level
#   -r             specify the input; if not set, pass a file
#                  to execute as an argument instead
cargo run --features trace -- --ast --disassemble -O1 -r="2+3*4-1"

TEXT

(Asteriks
  (Plus
    Integer("2")
    Integer("3")
  )
  (Minus
    Integer("4")
    Integer("1")
  )
)
Cc::cc(Asteriks)
Cc::cc(Plus)
Cc::cc(Integer("2"))
RegisterAllocator::alloc(r0)
Cc::cc(Integer("3"))
RegisterAllocator::alloc(r1)
RegisterAllocator::alloc(r2)
RegisterAllocator::free(r0)
RegisterAllocator::free(r1)
Cc::cc(Minus)
Cc::cc(Integer("4"))
RegisterAllocator::alloc(r1)
Cc::cc(Integer("1"))
RegisterAllocator::alloc(r0)
RegisterAllocator::alloc(r3)
RegisterAllocator::free(r1)
RegisterAllocator::free(r0)
RegisterAllocator::alloc(r0)
RegisterAllocator::free(r2)
RegisterAllocator::free(r3)
[opt::const_binary]: fused a constant binary op
[opt::const_binary]: fused a constant binary op
__entry:
        load_imm r2, #5
        load_imm r3, #3
        mul r0, r2, r3
[vm][0000] LoadImm { dst: 2, value: 5 }
[vm][0001] LoadImm { dst: 3, value: 3 }
[vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }

Testing

To ensure correctness for the expected optimisation case, tests are necessary. These tests validate pattern rewriting, not full program semantic equivalence.

RUST

#[cfg(test)]
mod bc {
    use crate::op::Op;

    #[test]
    fn self_move() {
        let mut bc = vec![
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
        ];
        crate::opt::bc::self_move(&mut bc);
        assert_eq!(bc, vec![Op::Nop, Op::Nop, Op::Nop])
    }

    #[test]
    fn const_binary() {
        let mut bc = vec![
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Add {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Sub {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 3 },
            Op::LoadImm { dst: 1, value: 5 },
            Op::Mul {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
            Op::LoadImm { dst: 0, value: 64 },
            Op::LoadImm { dst: 1, value: 8 },
            Op::Div {
                dst: 0,
                lhs: 0,
                rhs: 1,
            },
        ];

        for i in 0..=bc.len().saturating_sub(3) {
            crate::opt::bc::const_binary(&mut bc[i..i + 3]);
        }

        bc.retain(|op| *op != Op::Nop);
        assert_eq!(
            bc,
            vec![
                Op::LoadImm { dst: 0, value: 3 },
                Op::LoadImm { dst: 0, value: -1 },
                Op::LoadImm { dst: 0, value: 15 },
                Op::LoadImm { dst: 0, value: 8 },
            ]
        )
    }
}