WHY2025: Reviewing live-bootstrap

WHY2025, 11 August 2025, 17:00, Cassiopeia
Frans Faase, staying in HSNL village

https://www.iwriteiam.nl/WHY2025_talk.html
(link is also found on program.why2025.org)

What is live-bootstrap?
How to review live-bootstrap
- The binary seeds
- The sources
Ways to improve live-bootstrap

Can we trust Linux?

Linux is open source
Building the Linux kernel
What is a compiler?
Malicious compiler
Reflections on Trusting Trust by Ken Thompson
Reverse engineering is hard

Linux is an open-source operating system build from available sources. For closed source operating systems, such as Windows and MacOS, there is no way to verify that it is reliable.

Although the sources of Linux are all known, a compiler is needed for building the Linux kernel from the C-source files into executables for the target platform.

A compiler is a program which translates a program from an input language to an output language. The input language is often a high-level, human readable language, while the output language is a low-level language, including machine code, which can be executed by the CPU. A compiler, like any other program, is programmed in a language, usually in machine code, or otherwise, depends on some other program based on machine code.

A malicious compiler could insert a back-door in some executables or libraries of the Linux kernel. The solution to use this compiler to compile itself from original sources, does not work, because it could recognise when it is compiled by itself and duplicate the malicious code in the resulting compiler, making it malicious as well.

In 1984 Ken Thompson already presented the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust.

Machine code is usually in a binary form that is not easily readable by humans. This poses a problem for a non-trivial program, such as an optimising compiler. Although reverse engineering executables is possible, it is a lot of work.

The live-bootstrap approach

Start with a tiny piece of machine code
Build necessary tools from this
To compile the Tiny C Compiler
gcc 4.0.4 → gcc 4.7.4 → gcc 10.4.0 → gcc 13.1.0

The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small piece of machine code. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kernel sources.

hex0-seed

The bootstrap-seeds repository
NEVER TRUST ANYTHING IN HERE
hex0_x86.hex0
> hex0-seed hex0_x86.hex0 hex0
Can we verify hex0?

The small piece of machine code is a program called hex0, which converts a file with hexadecimal numbers into a binary file. This program is specific for the target CPU for which Linux needs to be build. These programs can be found in the bootstrap-seeds repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN HERE'. For x86 (32 bits) the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will return hex0, assuming that it is not a malicious variant of hex0. The hex0 program requires two arguments, the names of the input and output files. The function of the program can also be achieved with following command line:

sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file

This makes use of the stream editing program sed and the xxd (hex dump) program that can be used to convert, in both directions, between binary files and hexadecimal files.

The important question is: Can we verify hex0? First we should verify that it indeed does what it does through an independent verify cations.

Independent verification

hex0 in Brainfuck
Esoteric programming language
Only eight commands: + - < > [ ] . ,
Verified hex0 is binary form of hex0_x86.hex0
Is this enough?

Brainfuck is an esoteric program language with only eight commands, represented by the letters '+-<>[].,'. I wanted to see if it was possible to write a Brainfuck program to replace hex0. I wrote some JavaScript to generate a program from a simple programming language: BF generator. I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did produce a file equal to hex0.

Blog posts:

Verify hex0_x86.hex0

Executable and Linkable Format (ELF)
Machine code instruction reference
Linux system calls

# SPDX-FileCopyrightText: 2019 Jeremiah Orians
# SPDX-FileCopyrightText: 2022 Andrius Štikonas
#
# SPDX-License-Identifier: GPL-3.0-or-later

## ELF Header
#:ELF_base
7F 45 4C 46                     # e_ident[EI_MAG0-3] ELF's magic number

01                              # e_ident[EI_CLASS] Indicating 32 bit
01                              # e_ident[EI_DATA] Indicating little endianness
01                              # e_ident[EI_VERSION] Indicating original elf

03                              # e_ident[EI_OSABI] Set at 3 because FreeBSD is strict
00                              # e_ident[EI_ABIVERSION] Set at 0 because no one cares

00 00 00 00 00 00 00            # e_ident[EI_PAD]

02 00                           # e_type Indicating Executable
03 00                           # e_machine Indicating x86
01 00 00 00                     # e_version Indicating original elf

54 80 04 08                     # e_entry Address of the entry point
34 00 00 00                     # e_phoff Address of program header table
00 00 00 00                     # e_shoff Address of section header table

00 00 00 00                     # e_flags

34 00                           # e_ehsize Indicating our 52 Byte header

20 00                           # e_phentsize size of a program header table
01 00                           # e_phnum number of entries in program table

00 00                           # e_shentsize size of a section header table
00 00                           # e_shnum number of entries in section table

00 00                           # e_shstrndx index of the section names

## Program Header
#:ELF_program_headers
#:ELF_program_header__text
01 00 00 00                     # ph_type: PT-LOAD = 1
00 00 00 00                     # ph_offset

00 80 04 08                     # ph_vaddr
00 80 04 08                     # ph_physaddr

00 01 00 00                     # ph_filesz
00 01 00 00                     # ph_memsz

07 00 00 00                     # ph_flags: PF-X|PF-W|PF-R = 7
01 00 00 00                     # ph_align

#:ELF_text

# Where the ELF Header is going to hit
# Simply jump to _start
# Our main function
# :_start ; (0x8048054)
    58                          ; pop_eax          # Get the number of arguments
    5B                          ; pop_ebx          # Get the program name
    5B                          ; pop_ebx          # Get the actual input name
    31C9                        ; xor_ecx,ecx      # prepare read_only, ecx = 0
    31D2                        ; xor_edx,edx      # Extra sure, edx = 0
    6A 05                       ; push !5          # prepare to set eax to 5
    58                          ; pop_eax          # the syscall number for open()
    CD 80                       ; int !0x80        # Now open that damn file
    89C6                        ; mov_esi,eax      # Preserve the file pointer we were given

    5B                          ; pop_ebx          # Get the actual output name
    66B9 4102                   ; mov_cx, @577     # Prepare file as O_WRONLY|O_CREAT|O_TRUNC
    66BA C001                   ; mov_dx, @448     # Prepare file as RWX for owner only (700 in octal)
    6A 05                       ; push !5          # prepare to set eax to 5
    58                          ; pop_eax          # the syscall number for open()
    CD 80                       ; int !0x80        # Now open that damn file
    89C2                        ; mov_edx,eax      # Preserve the file pointer we were given

    # Our flag for byte processing
    6A FF                       ; push !-1
    5D                          ; pop_ebp          # ebp = -1

    # temp storage for the sum
    31FF                        ; xor_edi,edi      # edi = 0

#:loop ; (0x8048077)
    # Read a byte
    E8 68000000                 ; call %Read_byte

    # process byte
    E8 1B000000                 ; call %hex

    # Deal with -1 values
    85C0                        ; test_eax,eax
    7C F2                       ; jl !loop

    # deal with toggle
    85ED                        ; test_ebp,ebp     # jump if ebp >= 0
    7D 06                       ; jge !print

    # process first byte of pair
    89C7                        ; mov_edi,eax
    31ED                        ; xor_ebp,ebp      # ebp = 0
    EB E8                       ; jmp !loop

# process second byte of pair
#:print ; (0x804808F)
    # update the sum and store in output
    C1E7 04                     ; shl_edi, !4
    01F8                        ; add_eax,edi

    # flip the toggle
    4D                          ; dec_ebp          # ebp = -1

    E8 39000000                 ; call %write_byte

    EB DB                       ; jmp !loop

#:hex ; (0x804809C)
    # Purge Comment Lines (#)
    3C 23                       ; cmp_al, !35
    74 1E                       ; je !purge_comment

    # Purge Comment Lines (;)
    3C 3B                       ; cmp_al, !59
    74 1A                       ; je !purge_comment

    # deal all ascii less than 0
    3C 30                       ; cmp_al, !48
    7C 1F                       ; jl !ascii_other

    # deal with 0-9
    3C 3A                       ; cmp_al, !58
    7C 1F                       ; jl !ascii_num

    # deal with all ascii less than A
    3C 41                       ; cmp_al, !65
    7C 17                       ; jl !ascii_other

    # deal with A-F
    3C 47                       ; cmp_al, !71
    7C 1C                       ; jl !ascii_high

    # deal with all ascii less than a
    3C 61                       ; cmp_al, !97
    7C 0F                       ; jl !ascii_other

    # deal with a-f
    3C 67                       ; cmp_al, !103
    7C 12                       ; jl !ascii_low

    # The rest that remains needs to be ignored
    EB 09                       ; jmp !ascii_other

#:purge_comment ; (0x80480BE)
    # Read a byte
    E8 21000000                 ; call %Read_byte

    # Loop if not LF
    3C 0A                       ; cmp_al, !10
    75 F7                       ; jne !purge_comment

    # Otherwise return -1

#:ascii_other ; (0x80480C7)
    6A FF                       ; push !-1
    58                          ; pop_eax          # return -1
    C3                          ; ret

#:ascii_num ; (0x80480CB)
    2C 30                       ; sub_al, !48
    C3                          ; ret

#:ascii_low ; (0x80480CE)
    2C 20                       ; sub_al, !32      # convert to uppercase

#:ascii_high ; (0x80480D0)
    2C 37                       ; sub_al, !55
    C3                          ; ret

# Writes byte stored in al
#:write_byte ; (0x80480D3)
    # Print our Hex
    89D3                        ; mov_ebx,edx      # Where are we writing to
    52                          ; push_edx         # protect fout
    6A 01                       ; push !1          # prepare to set edx to 1
    5A                          ; pop_edx          # set the size of chars we want
    50                          ; push_eax         # Move output to stack
    89E1                        ; mov_ecx,esp      # What we are writing
    6A 04                       ; push !4          # prepare to set eax to 4
    58                          ; pop_eax          # the syscall number for write
    CD 80                       ; int !0x80        # call the Kernel
    5B                          ; pop_ebx          # deallocate stack
    5A                          ; pop_edx          # restore fout
    C3                          ; ret

#:Read_byte ; (0x80480E4)
    # Attempt to read 1 byte from Input file
    52                          ; push_edx         # protect fout
    6A 01                       ; push !1          # prepare to set edx to 1
    5A                          ; pop_edx          # set the size of chars we want
    57                          ; push_edi         # allocate stack
    89E1                        ; mov_ecx,esp      # Where to put it
    89F3                        ; mov_ebx,esi      # Where are we reading from
    6A 03                       ; push !3          # prepare to set eax to 3
    58                          ; pop_eax          # the syscall number for read
    CD 80                       ; int !0x80        # call the Kernel

    85C0                        ; test_eax,eax     # check what we got
    74 03                       ; je !Done         # Got EOF call it done

    # load byte
    58                          ; pop_eax          # load char
    5A                          ; pop_edx          # restore fout
    C3                          ; ret

#:Done ; (0x80480F9)
    # program completed Successfully
    31DB                        ; xor_ebx,ebx      # All is well, ebx = 0
    6A 01                       ; push !1
    58                          ; pop_eax          # put the exit syscall number in eax
    CD 80                       ; int !0x80        # Call it a good day

#:ELF_end

Review of the other sources

What are all the sources?
Description in parts.rst
The kaem shell
Program to parse kaem files
C-preprocessor to discover included files
Stopped at bash
Result: live-bootstrap.html

I decided to look some deeper into live-bootstrap to figure out which programs are executed and which files are being read. There is a global description of all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a minimal shell, called the kaem shell, to execute the script. I wrote a program, kaem_parser.cpp, to parse the kaem files, which generates the HTML page live-bootstrap with all the information. To determine which header files are used when compiling C programs, the parser also implements a C preprocessor to parse the C files to find all the header files. At some point the bash shell is compiled and used to execute bash scripts. One would have to write a bash interpreter to parse the files involved in the remaining steps.

Blog posts:

Execute with emulator

Implements limited set of x86 instructions
Supports system calls
Simulates multiple processes
Writes files to file system
It was not easy, but insightful
Too slow for compiling GNU Mes Compiler,
which is used to compile the Tiny C Compiler

To even dive deeper, I decided to implement an x86 emulator that implements only the x86 instructions that are actually used. The emulator also implements systems calls and supports multiple processes. I decided to simply map file operations on file operations on the file system, instead of simulating those in memory.

This allowed me also to verify the code of hex0 and to see if no other binary seeds were used and/or executed.

The development of the emulator required some very complicated debugging, but it was also insightful.

The emulator turned out to be too slow to compile the GNU Mes compiler, which is the C compiler used to compile a stripped version of the Tiny C Compiler.

This also means, I even got less far with this approach than the previous.

Blog posts:

Using strace

Trace system calls
Shell script to be run with chroot
Program to parse log file
Generate T-diagram

With strace command it is possible to trace system calls that are performed during the execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the chroot command) and trace relevant system calls. I also wrote a program, called scan_trace.cpp to process the produced log file and generate an HTML page listing the processes and source files. It also generates a JSON file with similar information, which is used to generate a T-diagram.

T-diagram

A T-diagram (or tombstone diagram) uses T-shaped outlines that explain the languages that are involved in a program. It shows the source language (on the left), the target language (on the right) and the implementation language (on the bottom). In the above diagram it show how a program in C for reading hexadecimal data to a binary file is compiled to an executable that does the same. Notice how the execution of the compiler changes the implementation language of the source program into the implementation language of the target program.

Legend:

Red: the binary seeds
Green box: Programs run by
C-1, C-2: subset of C languages

GNU Mes Compiler

C compiler in Scheme (LISP)
Scheme interpreter in a subset of C language
Preceded by two subset of C compilers
Probably on the level of Tiny C Compiler
Compiles rather slow
The odd one out
Why not Forth?
GNU Mes replacement

GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset of C and a C compiler written in Scheme. It is a rather complete compiler including a separate linker. I get the idea that it is on par with the Tiny C compiler, which does not need a separate linker. It can compile it own sources without a separate linker. The MES compiler is rather slow, which is understandable because it uses an interpreting parser running an interpreting compiler. It is also slow because it compiles many individual C files into object files, which are linked at the end. Every time all the Scheme files are loaded before the compiler can start its actual work.

The GNU Mes compiler requires a compiler for a minimal subset of C. Because of this the bootstrap does contain include a subset of C compiler and a partial implementation of the C preprocessor. These are written in another subset of C for which there is compiler written in machine code, one for each supported CPU. I have the impression that the GNU Mes compiler is a bit of overkill for the gap it needs to bridge.

On several forums, I have read people raising the question why Forth was not used in the implementation instead of LISP.

I made some investigation and attempts to bridge the gap. To see if it would be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs in M2-Mesoplanet.

Blog posts:

Observations

hex0 is short but a bit cryptic
More source code in assembly (>6000 lines)
Two subset of C compilers
C Preprocessor
Can compile some utils: untar and ungz
Lots of code in Scheme
Alternative approach:
- Readable intermediate language
- More 'loops'

Although hex0 is short, it is also a bit cryptic. It looks like it was hand coded and that it uses different patterns for similar operations. It also uses registers to store certain variables. Besides having to verify hex0, one also has to analyse the following assembly programs:

hex1_x86.hex0: 329 lines
hex2_x86.hex1: 626 lines
M0_x86.hex2: 869 lines
cc_x86.M1: 4983 lines

Legend:

Blue: Loops

Intermediate language

Example with simple assignment in C

	y = x + 1;

In the intermediate language:

	y x ? 1 + = ;

In the M1 assembler language:

	push_eax              # y (local)
	lea_eax,[ebp+DWORD] %100
	push_eax              # x (local)
	lea_eax,[ebp+DWORD] %96
	mov_eax,[eax]         # ?
	push_eax              # 1
	mov_eax, %1
	pop_ebx               # +
	add_eax,ebx
	pop_ebx               # =
	mov_[ebx],eax
	pop_eax               # ;

In the hex0 language:

50              ;    push_eax              # y (local)
8D85 64000000   ;    lea_eax,[ebp+DWORD] %100
50              ;    push_eax              # x (local)
8D85 60000000   ;    lea_eax,[ebp+DWORD] %96
8B00            ;    mov_eax,[eax]         # ?
50              ;    push_eax              # 1
B8 01000000     ;    mov_eax, %1
5B              ;    pop_ebx               # +
01D8            ;    add_eax,ebx
5B              ;    pop_ebx               # =
8903            ;    mov_[ebx],eax
58              ;    pop_eax               # ;

The above example shows how a single assignment statement in C can be translated to the intermediate language, and how this mapped on M1 assembly, which can than be converted into hexedecimal numbers accomplished with comments showing the relationship with the intermediate langeuage. This makes it possible to see the relationship between C statement and the hexadecimal numbers.

Blog posts:

Properties

Stack based
Close to C
Simple variable declaration
Block structure with local variables
Functions
Some support for structs
Simple to compile to M1 assembly
Two stacks
Only uses C keywords

The intermediate language is a stack base language, like Forth, that would be close to C and simple to compile. It has a simple method for defining global and local variables. It has a similar block structure as C with respect to local variables. It has functions and some support for structs.

The compilation to M1 makes use of two stacks. One stack contains the intermediate values, including the parameter values passed between functions, which makes use of the normal stack. The other stack contains the return address for function calls and the local variables, which uses the ebp register as a base offset.

Because it is intended to be used as a target language for a C compiler, it uses only C keywords to avoid possible clashes with variable names in C programs. This has lead to some keyword abuse, such as 'void' being used to define a function, 'int' being used to define a variable, and 'const' being used to define a numeric constant.

Blog posts:

C compiler for TCC

Goal: Being able to compile the Tiny C Compiler sources
Build-in preprocessor based on iterators
Single pass recursive decent parser
On the fly code generation to stack language
(Only AST for expressions)
Self-hosting: Can compile itself
Now working on compiling TCC sources

The goal for this C compiler is that it is able to compile the TCC sources and also itself, the stack_c compiler, the assmbler and the hexadecimal to binary converter.

A large part of the compiler involves the C preprocessor that is implemented with several character and token iterators. This was also an experiment to see if it would be possible to implement the preprocessor with iterators.

Although you need an LR-parser if you want to parse C purely based on the syntax, it appears that you can write an LL(1), recursive decent parser, if you know the kind of the identifier, whether it is a typedef or not.

It appears that it is possible to generate code during the parser, only using abstract syntax trees.

Blog posts:

Contents

Can we trust Linux?

The live-bootstrap approach

hex0-seed

Independent verification

Verify hex0_x86.hex0

Review of the other sources

Execute with emulator

Using strace

T-diagram

GNU Mes Compiler

Observations

Intermediate language

Properties

C compiler for TCC