WHY2025, 11 August 2025, 17:00, Cassiopeia
Frans Faase, staying in HSNL village
https://www.iwriteiam.nl/WHY2025_talk.html
(link is also found on program.why2025.org)
Contents
- What is live-bootstrap?
- How to review live-bootstrap
- The binary seeds
- The sources
- Ways to improve live-bootstrap
Can we trust Linux?
- Linux is open source
- Building the Linux kernel
- What is a compiler?
- Malicious compiler
- Reflections on Trusting Trust by Ken Thompson
- Reverse engineering is hard
The live-bootstrap approach
- Start with a tiny piece of machine code
- Build necessary tools from this
- To compile the Tiny C Compiler
- gcc 4.0.4 → gcc 4.7.4 → gcc 10.4.0 → gcc 13.1.0
The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small piece of machine code. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kernel sources.
hex0-seed
- The bootstrap-seeds repository
- NEVER TRUST ANYTHING IN HERE
- hex0_x86.hex0
- > hex0-seed hex0_x86.hex0 hex0
- Can we verify hex0?
Independent verification
- hex0 in Brainfuck
- Esoteric programming language
- Only eight commands: + - < > [ ] . ,
- Verified hex0 is binary form of hex0_x86.hex0
- Is this enough?
Verify hex0_x86.hex0
- Executable and Linkable Format (ELF)
- Machine code instruction reference
- Linux system calls
# SPDX-FileCopyrightText: 2019 Jeremiah Orians
# SPDX-FileCopyrightText: 2022 Andrius Štikonas
#
# SPDX-License-Identifier: GPL-3.0-or-later
## ELF Header
#:ELF_base
7F 45 4C 46 # e_ident[EI_MAG0-3] ELF's magic number
01 # e_ident[EI_CLASS] Indicating 32 bit
01 # e_ident[EI_DATA] Indicating little endianness
01 # e_ident[EI_VERSION] Indicating original elf
03 # e_ident[EI_OSABI] Set at 3 because FreeBSD is strict
00 # e_ident[EI_ABIVERSION] Set at 0 because no one cares
00 00 00 00 00 00 00 # e_ident[EI_PAD]
02 00 # e_type Indicating Executable
03 00 # e_machine Indicating x86
01 00 00 00 # e_version Indicating original elf
54 80 04 08 # e_entry Address of the entry point
34 00 00 00 # e_phoff Address of program header table
00 00 00 00 # e_shoff Address of section header table
00 00 00 00 # e_flags
34 00 # e_ehsize Indicating our 52 Byte header
20 00 # e_phentsize size of a program header table
01 00 # e_phnum number of entries in program table
00 00 # e_shentsize size of a section header table
00 00 # e_shnum number of entries in section table
00 00 # e_shstrndx index of the section names
## Program Header
#:ELF_program_headers
#:ELF_program_header__text
01 00 00 00 # ph_type: PT-LOAD = 1
00 00 00 00 # ph_offset
00 80 04 08 # ph_vaddr
00 80 04 08 # ph_physaddr
00 01 00 00 # ph_filesz
00 01 00 00 # ph_memsz
07 00 00 00 # ph_flags: PF-X|PF-W|PF-R = 7
01 00 00 00 # ph_align
#:ELF_text
# Where the ELF Header is going to hit
# Simply jump to _start
# Our main function
# :_start ; (0x8048054)
58 ; pop_eax # Get the number of arguments
5B ; pop_ebx # Get the program name
5B ; pop_ebx # Get the actual input name
31C9 ; xor_ecx,ecx # prepare read_only, ecx = 0
31D2 ; xor_edx,edx # Extra sure, edx = 0
6A 05 ; push !5 # prepare to set eax to 5
58 ; pop_eax # the syscall number for open()
CD 80 ; int !0x80 # Now open that damn file
89C6 ; mov_esi,eax # Preserve the file pointer we were given
5B ; pop_ebx # Get the actual output name
66B9 4102 ; mov_cx, @577 # Prepare file as O_WRONLY|O_CREAT|O_TRUNC
66BA C001 ; mov_dx, @448 # Prepare file as RWX for owner only (700 in octal)
6A 05 ; push !5 # prepare to set eax to 5
58 ; pop_eax # the syscall number for open()
CD 80 ; int !0x80 # Now open that damn file
89C2 ; mov_edx,eax # Preserve the file pointer we were given
# Our flag for byte processing
6A FF ; push !-1
5D ; pop_ebp # ebp = -1
# temp storage for the sum
31FF ; xor_edi,edi # edi = 0
#:loop ; (0x8048077)
# Read a byte
E8 68000000 ; call %Read_byte
# process byte
E8 1B000000 ; call %hex
# Deal with -1 values
85C0 ; test_eax,eax
7C F2 ; jl !loop
# deal with toggle
85ED ; test_ebp,ebp # jump if ebp >= 0
7D 06 ; jge !print
# process first byte of pair
89C7 ; mov_edi,eax
31ED ; xor_ebp,ebp # ebp = 0
EB E8 ; jmp !loop
# process second byte of pair
#:print ; (0x804808F)
# update the sum and store in output
C1E7 04 ; shl_edi, !4
01F8 ; add_eax,edi
# flip the toggle
4D ; dec_ebp # ebp = -1
E8 39000000 ; call %write_byte
EB DB ; jmp !loop
#:hex ; (0x804809C)
# Purge Comment Lines (#)
3C 23 ; cmp_al, !35
74 1E ; je !purge_comment
# Purge Comment Lines (;)
3C 3B ; cmp_al, !59
74 1A ; je !purge_comment
# deal all ascii less than 0
3C 30 ; cmp_al, !48
7C 1F ; jl !ascii_other
# deal with 0-9
3C 3A ; cmp_al, !58
7C 1F ; jl !ascii_num
# deal with all ascii less than A
3C 41 ; cmp_al, !65
7C 17 ; jl !ascii_other
# deal with A-F
3C 47 ; cmp_al, !71
7C 1C ; jl !ascii_high
# deal with all ascii less than a
3C 61 ; cmp_al, !97
7C 0F ; jl !ascii_other
# deal with a-f
3C 67 ; cmp_al, !103
7C 12 ; jl !ascii_low
# The rest that remains needs to be ignored
EB 09 ; jmp !ascii_other
#:purge_comment ; (0x80480BE)
# Read a byte
E8 21000000 ; call %Read_byte
# Loop if not LF
3C 0A ; cmp_al, !10
75 F7 ; jne !purge_comment
# Otherwise return -1
#:ascii_other ; (0x80480C7)
6A FF ; push !-1
58 ; pop_eax # return -1
C3 ; ret
#:ascii_num ; (0x80480CB)
2C 30 ; sub_al, !48
C3 ; ret
#:ascii_low ; (0x80480CE)
2C 20 ; sub_al, !32 # convert to uppercase
#:ascii_high ; (0x80480D0)
2C 37 ; sub_al, !55
C3 ; ret
# Writes byte stored in al
#:write_byte ; (0x80480D3)
# Print our Hex
89D3 ; mov_ebx,edx # Where are we writing to
52 ; push_edx # protect fout
6A 01 ; push !1 # prepare to set edx to 1
5A ; pop_edx # set the size of chars we want
50 ; push_eax # Move output to stack
89E1 ; mov_ecx,esp # What we are writing
6A 04 ; push !4 # prepare to set eax to 4
58 ; pop_eax # the syscall number for write
CD 80 ; int !0x80 # call the Kernel
5B ; pop_ebx # deallocate stack
5A ; pop_edx # restore fout
C3 ; ret
#:Read_byte ; (0x80480E4)
# Attempt to read 1 byte from Input file
52 ; push_edx # protect fout
6A 01 ; push !1 # prepare to set edx to 1
5A ; pop_edx # set the size of chars we want
57 ; push_edi # allocate stack
89E1 ; mov_ecx,esp # Where to put it
89F3 ; mov_ebx,esi # Where are we reading from
6A 03 ; push !3 # prepare to set eax to 3
58 ; pop_eax # the syscall number for read
CD 80 ; int !0x80 # call the Kernel
85C0 ; test_eax,eax # check what we got
74 03 ; je !Done # Got EOF call it done
# load byte
58 ; pop_eax # load char
5A ; pop_edx # restore fout
C3 ; ret
#:Done ; (0x80480F9)
# program completed Successfully
31DB ; xor_ebx,ebx # All is well, ebx = 0
6A 01 ; push !1
58 ; pop_eax # put the exit syscall number in eax
CD 80 ; int !0x80 # Call it a good day
#:ELF_end
Review of the other sources
- What are all the sources?
- Description in parts.rst
- The kaem shell
- Program to parse kaem files
- C-preprocessor to discover included files
- Stopped at bash
- Result: live-bootstrap.html
Execute with emulator
- Implements limited set of x86 instructions
- Supports system calls
- Simulates multiple processes
- Writes files to file system
- It was not easy, but insightful
- Too slow for compiling GNU Mes Compiler,
which is used to compile the Tiny C Compiler
Using strace
- Trace system calls
- Shell script to be run with chroot
- Program to parse log file
- Generate T-diagram
T-diagram
A T-diagram (or tombstone diagram) uses T-shaped outlines that explain the languages that are involved in a program. It shows the source language (on the left), the target language (on the right) and the implementation language (on the bottom). In the above diagram it show how a program in C for reading hexadecimal data to a binary file is compiled to an executable that does the same. Notice how the execution of the compiler changes the implementation language of the source program into the implementation language of the target program.
Legend:
- Red: the binary seeds
- Green box: Programs run by
- C-1, C-2: subset of C languages
GNU Mes Compiler
- C compiler in Scheme (LISP)
- Scheme interpreter in a subset of C language
- Preceded by two subset of C compilers
- Probably on the level of Tiny C Compiler
- Compiles rather slow
- The odd one out
- Why not Forth?
- GNU Mes replacement
Observations
- hex0 is short but a bit cryptic
- More source code in assembly (>6000 lines)
- Two subset of C compilers
- C Preprocessor
- Can compile some utils: untar and ungz
- Lots of code in Scheme
- Alternative approach:
- Readable intermediate language
- More 'loops'
Legend:
- Blue: Loops
Intermediate language
- Example with simple assignment in C
y = x + 1;
- In the intermediate language:
y x ? 1 + = ;
- In the M1 assembler language:
push_eax # y (local) lea_eax,[ebp+DWORD] %100 push_eax # x (local) lea_eax,[ebp+DWORD] %96 mov_eax,[eax] # ? push_eax # 1 mov_eax, %1 pop_ebx # + add_eax,ebx pop_ebx # = mov_[ebx],eax pop_eax # ;
- In the hex0 language:
50 ; push_eax # y (local) 8D85 64000000 ; lea_eax,[ebp+DWORD] %100 50 ; push_eax # x (local) 8D85 60000000 ; lea_eax,[ebp+DWORD] %96 8B00 ; mov_eax,[eax] # ? 50 ; push_eax # 1 B8 01000000 ; mov_eax, %1 5B ; pop_ebx # + 01D8 ; add_eax,ebx 5B ; pop_ebx # = 8903 ; mov_[ebx],eax 58 ; pop_eax # ;
Properties
- Stack based
- Close to C
- Simple variable declaration
- Block structure with local variables
- Functions
- Some support for structs
- Simple to compile to M1 assembly
- Two stacks
- Only uses C keywords
C compiler for TCC
- Goal: Being able to compile the Tiny C Compiler sources
- Build-in preprocessor based on iterators
- Single pass recursive decent parser
- On the fly code generation to stack language
- (Only AST for expressions)
- Self-hosting: Can compile itself
- Now working on compiling TCC sources