WHY2025: Reviewing live-bootstrap

18 min read Original article ↗





WHY2025, 11 August 2025, 17:00, Cassiopeia
Frans Faase, staying in HSNL village

https://www.iwriteiam.nl/WHY2025_talk.html
(link is also found on program.why2025.org)

(Recording, YouTube)












Contents












Can we trust Linux?

Linux is an open-source operating system build from available sources. For closed source operating systems, such as Windows and MacOS, there is no way to verify that it is reliable.

Although the sources of Linux are all known, a compiler is needed for building the Linux kernel from the C-source files into executables for the target platform.

A compiler is a program which translates a program from an input language to an output language. The input language is often a high-level, human readable language, while the output language is a low-level language, including machine code, which can be executed by the CPU. A compiler, like any other program, is programmed in a language, usually in machine code, or otherwise, depends on some other program based on machine code.

A malicious compiler could insert a back-door in some executables or libraries of the Linux kernel. The solution to use this compiler to compile itself from original sources, does not work, because it could recognise when it is compiled by itself and duplicate the malicious code in the resulting compiler, making it malicious as well.

In 1984 Ken Thompson already presented the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust.

Machine code is usually in a binary form that is not easily readable by humans. This poses a problem for a non-trivial program, such as an optimising compiler. Although reverse engineering executables is possible, it is a lot of work.












The live-bootstrap approach

The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small piece of machine code. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kernel sources.












hex0-seed

The small piece of machine code is a program called hex0, which converts a file with hexadecimal numbers into a binary file. This program is specific for the target CPU for which Linux needs to be build. These programs can be found in the bootstrap-seeds repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN HERE'. For x86 (32 bits) the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will return hex0, assuming that it is not a malicious variant of hex0. The hex0 program requires two arguments, the names of the input and output files. The function of the program can also be achieved with following command line:

sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file

This makes use of the stream editing program sed and the xxd (hex dump) program that can be used to convert, in both directions, between binary files and hexadecimal files.

The important question is: Can we verify hex0? First we should verify that it indeed does what it does through an independent verify cations.












Independent verification

Brainfuck is an esoteric program language with only eight commands, represented by the letters '+-<>[].,'. I wanted to see if it was possible to write a Brainfuck program to replace hex0. I wrote some JavaScript to generate a program from a simple programming language: BF generator. I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did produce a file equal to hex0.

Blog posts:












Verify hex0_x86.hex0

# SPDX-FileCopyrightText: 2019 Jeremiah Orians
# SPDX-FileCopyrightText: 2022 Andrius Štikonas
#
# SPDX-License-Identifier: GPL-3.0-or-later

## ELF Header
#:ELF_base
7F 45 4C 46                     # e_ident[EI_MAG0-3] ELF's magic number

01                              # e_ident[EI_CLASS] Indicating 32 bit
01                              # e_ident[EI_DATA] Indicating little endianness
01                              # e_ident[EI_VERSION] Indicating original elf

03                              # e_ident[EI_OSABI] Set at 3 because FreeBSD is strict
00                              # e_ident[EI_ABIVERSION] Set at 0 because no one cares

00 00 00 00 00 00 00            # e_ident[EI_PAD]

02 00                           # e_type Indicating Executable
03 00                           # e_machine Indicating x86
01 00 00 00                     # e_version Indicating original elf

54 80 04 08                     # e_entry Address of the entry point
34 00 00 00                     # e_phoff Address of program header table
00 00 00 00                     # e_shoff Address of section header table

00 00 00 00                     # e_flags

34 00                           # e_ehsize Indicating our 52 Byte header

20 00                           # e_phentsize size of a program header table
01 00                           # e_phnum number of entries in program table

00 00                           # e_shentsize size of a section header table
00 00                           # e_shnum number of entries in section table

00 00                           # e_shstrndx index of the section names

## Program Header
#:ELF_program_headers
#:ELF_program_header__text
01 00 00 00                     # ph_type: PT-LOAD = 1
00 00 00 00                     # ph_offset

00 80 04 08                     # ph_vaddr
00 80 04 08                     # ph_physaddr

00 01 00 00                     # ph_filesz
00 01 00 00                     # ph_memsz

07 00 00 00                     # ph_flags: PF-X|PF-W|PF-R = 7
01 00 00 00                     # ph_align

#:ELF_text

# Where the ELF Header is going to hit
# Simply jump to _start
# Our main function
# :_start ; (0x8048054)
    58                          ; pop_eax          # Get the number of arguments
    5B                          ; pop_ebx          # Get the program name
    5B                          ; pop_ebx          # Get the actual input name
    31C9                        ; xor_ecx,ecx      # prepare read_only, ecx = 0
    31D2                        ; xor_edx,edx      # Extra sure, edx = 0
    6A 05                       ; push !5          # prepare to set eax to 5
    58                          ; pop_eax          # the syscall number for open()
    CD 80                       ; int !0x80        # Now open that damn file
    89C6                        ; mov_esi,eax      # Preserve the file pointer we were given

    5B                          ; pop_ebx          # Get the actual output name
    66B9 4102                   ; mov_cx, @577     # Prepare file as O_WRONLY|O_CREAT|O_TRUNC
    66BA C001                   ; mov_dx, @448     # Prepare file as RWX for owner only (700 in octal)
    6A 05                       ; push !5          # prepare to set eax to 5
    58                          ; pop_eax          # the syscall number for open()
    CD 80                       ; int !0x80        # Now open that damn file
    89C2                        ; mov_edx,eax      # Preserve the file pointer we were given

    # Our flag for byte processing
    6A FF                       ; push !-1
    5D                          ; pop_ebp          # ebp = -1

    # temp storage for the sum
    31FF                        ; xor_edi,edi      # edi = 0

#:loop ; (0x8048077)
    # Read a byte
    E8 68000000                 ; call %Read_byte

    # process byte
    E8 1B000000                 ; call %hex

    # Deal with -1 values
    85C0                        ; test_eax,eax
    7C F2                       ; jl !loop

    # deal with toggle
    85ED                        ; test_ebp,ebp     # jump if ebp >= 0
    7D 06                       ; jge !print

    # process first byte of pair
    89C7                        ; mov_edi,eax
    31ED                        ; xor_ebp,ebp      # ebp = 0
    EB E8                       ; jmp !loop

# process second byte of pair
#:print ; (0x804808F)
    # update the sum and store in output
    C1E7 04                     ; shl_edi, !4
    01F8                        ; add_eax,edi

    # flip the toggle
    4D                          ; dec_ebp          # ebp = -1

    E8 39000000                 ; call %write_byte

    EB DB                       ; jmp !loop

#:hex ; (0x804809C)
    # Purge Comment Lines (#)
    3C 23                       ; cmp_al, !35
    74 1E                       ; je !purge_comment

    # Purge Comment Lines (;)
    3C 3B                       ; cmp_al, !59
    74 1A                       ; je !purge_comment

    # deal all ascii less than 0
    3C 30                       ; cmp_al, !48
    7C 1F                       ; jl !ascii_other

    # deal with 0-9
    3C 3A                       ; cmp_al, !58
    7C 1F                       ; jl !ascii_num

    # deal with all ascii less than A
    3C 41                       ; cmp_al, !65
    7C 17                       ; jl !ascii_other

    # deal with A-F
    3C 47                       ; cmp_al, !71
    7C 1C                       ; jl !ascii_high

    # deal with all ascii less than a
    3C 61                       ; cmp_al, !97
    7C 0F                       ; jl !ascii_other

    # deal with a-f
    3C 67                       ; cmp_al, !103
    7C 12                       ; jl !ascii_low

    # The rest that remains needs to be ignored
    EB 09                       ; jmp !ascii_other

#:purge_comment ; (0x80480BE)
    # Read a byte
    E8 21000000                 ; call %Read_byte

    # Loop if not LF
    3C 0A                       ; cmp_al, !10
    75 F7                       ; jne !purge_comment

    # Otherwise return -1

#:ascii_other ; (0x80480C7)
    6A FF                       ; push !-1
    58                          ; pop_eax          # return -1
    C3                          ; ret

#:ascii_num ; (0x80480CB)
    2C 30                       ; sub_al, !48
    C3                          ; ret

#:ascii_low ; (0x80480CE)
    2C 20                       ; sub_al, !32      # convert to uppercase

#:ascii_high ; (0x80480D0)
    2C 37                       ; sub_al, !55
    C3                          ; ret

# Writes byte stored in al
#:write_byte ; (0x80480D3)
    # Print our Hex
    89D3                        ; mov_ebx,edx      # Where are we writing to
    52                          ; push_edx         # protect fout
    6A 01                       ; push !1          # prepare to set edx to 1
    5A                          ; pop_edx          # set the size of chars we want
    50                          ; push_eax         # Move output to stack
    89E1                        ; mov_ecx,esp      # What we are writing
    6A 04                       ; push !4          # prepare to set eax to 4
    58                          ; pop_eax          # the syscall number for write
    CD 80                       ; int !0x80        # call the Kernel
    5B                          ; pop_ebx          # deallocate stack
    5A                          ; pop_edx          # restore fout
    C3                          ; ret

#:Read_byte ; (0x80480E4)
    # Attempt to read 1 byte from Input file
    52                          ; push_edx         # protect fout
    6A 01                       ; push !1          # prepare to set edx to 1
    5A                          ; pop_edx          # set the size of chars we want
    57                          ; push_edi         # allocate stack
    89E1                        ; mov_ecx,esp      # Where to put it
    89F3                        ; mov_ebx,esi      # Where are we reading from
    6A 03                       ; push !3          # prepare to set eax to 3
    58                          ; pop_eax          # the syscall number for read
    CD 80                       ; int !0x80        # call the Kernel

    85C0                        ; test_eax,eax     # check what we got
    74 03                       ; je !Done         # Got EOF call it done

    # load byte
    58                          ; pop_eax          # load char
    5A                          ; pop_edx          # restore fout
    C3                          ; ret

#:Done ; (0x80480F9)
    # program completed Successfully
    31DB                        ; xor_ebx,ebx      # All is well, ebx = 0
    6A 01                       ; push !1
    58                          ; pop_eax          # put the exit syscall number in eax
    CD 80                       ; int !0x80        # Call it a good day

#:ELF_end











Review of the other sources

I decided to look some deeper into live-bootstrap to figure out which programs are executed and which files are being read. There is a global description of all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a minimal shell, called the kaem shell, to execute the script. I wrote a program, kaem_parser.cpp, to parse the kaem files, which generates the HTML page live-bootstrap with all the information. To determine which header files are used when compiling C programs, the parser also implements a C preprocessor to parse the C files to find all the header files. At some point the bash shell is compiled and used to execute bash scripts. One would have to write a bash interpreter to parse the files involved in the remaining steps.

Blog posts:












Execute with emulator

To even dive deeper, I decided to implement an x86 emulator that implements only the x86 instructions that are actually used. The emulator also implements systems calls and supports multiple processes. I decided to simply map file operations on file operations on the file system, instead of simulating those in memory.

This allowed me also to verify the code of hex0 and to see if no other binary seeds were used and/or executed.

The development of the emulator required some very complicated debugging, but it was also insightful.

The emulator turned out to be too slow to compile the GNU Mes compiler, which is the C compiler used to compile a stripped version of the Tiny C Compiler.

This also means, I even got less far with this approach than the previous.

Blog posts:












Using strace

With strace command it is possible to trace system calls that are performed during the execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the chroot command) and trace relevant system calls. I also wrote a program, called scan_trace.cpp to process the produced log file and generate an HTML page listing the processes and source files. It also generates a JSON file with similar information, which is used to generate a T-diagram.












T-diagram

This text is displayed if your browser does not support HTML5 Canvas.


A T-diagram (or tombstone diagram) uses T-shaped outlines that explain the languages that are involved in a program. It shows the source language (on the left), the target language (on the right) and the implementation language (on the bottom). In the above diagram it show how a program in C for reading hexadecimal data to a binary file is compiled to an executable that does the same. Notice how the execution of the compiler changes the implementation language of the source program into the implementation language of the target program.


This text is displayed if your browser does not support HTML5 Canvas.

Legend:












GNU Mes Compiler

GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset of C and a C compiler written in Scheme. It is a rather complete compiler including a separate linker. I get the idea that it is on par with the Tiny C compiler, which does not need a separate linker. It can compile it own sources without a separate linker. The MES compiler is rather slow, which is understandable because it uses an interpreting parser running an interpreting compiler. It is also slow because it compiles many individual C files into object files, which are linked at the end. Every time all the Scheme files are loaded before the compiler can start its actual work.

The GNU Mes compiler requires a compiler for a minimal subset of C. Because of this the bootstrap does contain include a subset of C compiler and a partial implementation of the C preprocessor. These are written in another subset of C for which there is compiler written in machine code, one for each supported CPU. I have the impression that the GNU Mes compiler is a bit of overkill for the gap it needs to bridge.

On several forums, I have read people raising the question why Forth was not used in the implementation instead of LISP.

I made some investigation and attempts to bridge the gap. To see if it would be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs in M2-Mesoplanet.

Blog posts:












Observations

Although hex0 is short, it is also a bit cryptic. It looks like it was hand coded and that it uses different patterns for similar operations. It also uses registers to store certain variables. Besides having to verify hex0, one also has to analyse the following assembly programs:

  • hex1_x86.hex0: 329 lines
  • hex2_x86.hex1: 626 lines
  • M0_x86.hex2: 869 lines
  • cc_x86.M1: 4983 lines

This text is displayed if your browser does not support HTML5 Canvas.

Legend:












Intermediate language

	y = x + 1;

	y x ? 1 + = ;

	push_eax              # y (local)
	lea_eax,[ebp+DWORD] %100
	push_eax              # x (local)
	lea_eax,[ebp+DWORD] %96
	mov_eax,[eax]         # ?
	push_eax              # 1
	mov_eax, %1
	pop_ebx               # +
	add_eax,ebx
	pop_ebx               # =
	mov_[ebx],eax
	pop_eax               # ;

50              ;    push_eax              # y (local)
8D85 64000000   ;    lea_eax,[ebp+DWORD] %100
50              ;    push_eax              # x (local)
8D85 60000000   ;    lea_eax,[ebp+DWORD] %96
8B00            ;    mov_eax,[eax]         # ?
50              ;    push_eax              # 1
B8 01000000     ;    mov_eax, %1
5B              ;    pop_ebx               # +
01D8            ;    add_eax,ebx
5B              ;    pop_ebx               # =
8903            ;    mov_[ebx],eax
58              ;    pop_eax               # ;

The above example shows how a single assignment statement in C can be translated to the intermediate language, and how this mapped on M1 assembly, which can than be converted into hexedecimal numbers accomplished with comments showing the relationship with the intermediate langeuage. This makes it possible to see the relationship between C statement and the hexadecimal numbers.

Blog posts:











Properties

The intermediate language is a stack base language, like Forth, that would be close to C and simple to compile. It has a simple method for defining global and local variables. It has a similar block structure as C with respect to local variables. It has functions and some support for structs.

The compilation to M1 makes use of two stacks. One stack contains the intermediate values, including the parameter values passed between functions, which makes use of the normal stack. The other stack contains the return address for function calls and the local variables, which uses the ebp register as a base offset.

Because it is intended to be used as a target language for a C compiler, it uses only C keywords to avoid possible clashes with variable names in C programs. This has lead to some keyword abuse, such as 'void' being used to define a function, 'int' being used to define a variable, and 'const' being used to define a numeric constant.

Blog posts:












C compiler for TCC

The goal for this C compiler is that it is able to compile the TCC sources and also itself, the stack_c compiler, the assmbler and the hexadecimal to binary converter.

A large part of the compiler involves the C preprocessor that is implemented with several character and token iterators. This was also an experiment to see if it would be possible to implement the preprocessor with iterators.

Although you need an LR-parser if you want to parse C purely based on the syntax, it appears that you can write an LL(1), recursive decent parser, if you know the kind of the identifier, whether it is a typedef or not.

It appears that it is possible to generate code during the parser, only using abstract syntax trees.

Blog posts: