Pnut: A C to POSIX shell compiler you can trust
pnut.sh"Because Pnut can be distributed as a human-readable shell script (`pnut.sh`), it can serve as the basis for a reproducible build system. With a POSIX compliant shell, `pnut.sh` is sufficiently powerful to compile itself and, with some effort, [TCC](https://bellard.org/tcc/). Because TCC can be used to bootstrap GCC, this makes it possible to bootstrap a fully featured build toolchain from only human-readable source files and a POSIX shell.
Because Pnut doesn't support certain C features used in TCC, Pnut features a native code backend that supports a larger subset of C99. We call this compiler `pnut-exe`, and it can be compiled using `pnut.sh`. This makes it possible to compile `pnut-exe.c` using `pnut.sh`, and then compile TCC, all from a POSIX shell."
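Roughly, the claimed chain would look like this (a hypothetical sketch; the invocations below are illustrative, not the project's documented commands):

    # 1. The shell-hosted compiler translates the native backend to shell:
    sh pnut.sh pnut-exe.c > pnut-exe.sh            # hypothetical usage
    # 2. The resulting script, still pure shell, emits a native executable:
    sh pnut-exe.sh pnut-exe.c > pnut-exe && chmod +x pnut-exe
    # 3. The native compiler builds TCC, which can then bootstrap GCC:
    ./pnut-exe tcc.c > tcc && chmod +x tcc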
Is there anywhere we can see a step-by-step demo of this process?
Curious if the authors tried NetBSD or OpenBSD, or using another small C compiler, e.g., pcc.
Historically, tcc was problematic for NetBSD and its forks. Not sure about today, but tcc is still in NetBSD pkgsrc WIP which suggests problems remain.
Problem is:
- a shell is required, which has to be built from sources, using a compiler which was also built from sources using a compiler binary. That's the real bootstrap.
- even if you could pick some shell and compile it with pnut-exe, the compiled code requires interpretation by an executable shell.
- there is no such thing as a "POSIX compliant shell"; that's an abstract category. All this amounts to is a promise that pnut.sh will not generate code that uses non-POSIX features.
If you are wondering how it handles C-only functions... it does not.

open(..., O_RDWR | O_EXCL) -> runtime error: `echo "Unknow file mode" ; exit 1`
lseek(fd, 1, SEEK_HOLE); -> invalid code (uses undefined _lseek)
socket(AF_UNIX, SOCK_STREAM, 0); -> same (uses undefined _socket)
Looking closer at the "cp" and "cat" examples, the write() call does not handle errors at all. Forget about partial writes; it does not even return -1 on failures.
"Compiler you can Trust", indeed... maybe you can trust it to get all the details wrong?
There seems to be a libc in the repo, but many functions are TODO: https://github.com/udem-dlteam/pnut/tree/main/portable_libc

Otherwise the builtins seem to be here: https://github.com/udem-dlteam/pnut/blob/main/runtime.sh
FYI, none of those are "C functions", but rather POSIX functions. I did not expect it to be complete, but it's still impressive for what it is.
There are Linux ports of the plan9 `syscall` binary, which is presumably necessary to implement parts of libc with shell scripts: https://stackoverflow.com/questions/10196395/os-system-calls...
I don't remember there being a way to keep a server listening on a /dev/tcp/$ip/$port port, for sockets from shell scripts with shellcheck at least
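For what it's worth, bash's /dev/tcp is a client-only extension (and not POSIX at all); there's no way to listen with it. A one-shot client looks roughly like this:

    # bash-only, client-side /dev/tcp: a one-shot HTTP request over fd 3
    exec 3<>/dev/tcp/example.com/80
    printf 'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n' >&3
    cat <&3
    exec 3<&-   # close the connection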
I suspect the “trust” is a reference to Ken Thompson's Turing Award speech “Reflections on Trusting Trust”, where he laid out the concern of a back door in a compiler that survives updates to the compiler. In other words, the compiler injects a back door into future versions of itself, in addition to your programs, in a way that source-level analysis of the code will never reveal.
I think the pitch here is that it can compile TCC, which can then compile GCC, which potentially makes it much more difficult for a backdoor to survive, especially if the shell code is easier to read and verify than the corresponding assembly.
Within that context, an incomplete libc is irrelevant.
Implementation issues aside: while technically it should be possible to seek a file descriptor from the shell through a suitable helper program in C, I believe none of the POSIX utilities provide this facility.
head, read, and sed can be used for seeking forward according to POSIX (see the INPUT FILES section here <https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V...>). I doubt non-GNU implementations support it though.
If it’s in POSIX, chances are the BSDs implement it, too.
I think seeking a specific number of bytes and then writing data there will be a problem, though.
For seeking n bytes, neither read nor sed will work; they operate on lines.
sed is the only one of those that can write, and POSIX doesn’t appear to have the -i option for in-place editing (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/s...)
So, I think head for seeking followed by sed (or ed or vi, but sed is the simpler tool, I think) for replacing the first n characters, redirecting to a temp file and then doing a mv is your only option.
The advantage is that writes will be atomic; the disadvantage is that it will be slow.
head was used for this purpose in the xz backdoor.
I think dd might be more reliable. (Is dd POSIX?)
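It is: dd, including the skip=, seek=, and conv=notrunc operands, is specified by POSIX. A sketch of seek-and-patch with it (file and variable names are illustrative):

    # Forward-seek: discard the first $offset bytes of stdin:
    dd bs=1 skip="$offset" 2>/dev/null
    # Patch bytes at a given offset without truncating the file:
    printf '%s' "$patch" | dd of=file.bin bs=1 seek="$offset" conv=notrunc 2>/dev/null

Unlike the head-then-mv approach, this writes in place, so it's faster but not atomic.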
Maybe access to libc functions can be achieved through something like <https://github.com/taviso/ctypes.sh>, although that specific implementation seems to require bash explicitly and is not broadly POSIX Shell compatible as Pnut wants to be.
Can finally port systemd to shell to quell the rebellion.
Damned if that isn't the funniest thing I've heard in a long time.
I love things like these because they shake our perception of normal loose. And who said our perception of normal doesn't deserve a good shake?
A C to shell compiler might seem impractical, but you know what is even more impractical? Having a separate language for a build system. And yet, here we are. Using Shell, Make or CMake to build a C program is only acceptable because it has always been so. It's a "perceived normality" in the C world.
There is no good reason, however, why CMake isn't a C library. With the build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the codebase. We already have IncludeOS, why not includeMake?
DSLs ("microlanguages", at the time) were a big idea in the late 80s - by being expressive in ways closer to the problem, they could leave out irrelevant things and the bugs they lead to. (Do you really want to have to explicitly call malloc() in your build tools? and does gdb really feel like the right tool for debugging one?)
> Using Shell, Make or CMake to build a C program is only acceptable because it has always been so.
Nah, using shell, make or cmake is acceptable because C is obviously a terrible language for doing things. (Those languages are also all terrible, but not quite as terrible as C).
> There is no good reason, however, why CMake isn't a C library.
Isn't it the other way round? There's no good reason people write programs in C rather than CMake.
> With the build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the codebase.
Which is to say, with extreme difficulty?
Like, I agree with where you're coming from, it is absolutely a damning indictment of C that people don't want to express their builds in it. But writing a build in C really would be terrible.
I think you're confusing the language and the perception of language, the "rules of C" vs. the "brand of C".
What Pnut shows us is that the language itself is a very thin construct. C could be as low-level as you want, but it can also... compile to shell. Pnut shows that C is only a set of grammatical rules, and the source code in C doesn't necessarily reflect the binary program; it's only a script for the C compiler. A compiler then decides how to interpret the source and what to do with it.
Now back to builds. The difference between:

set(SOME_VARIABLE "SOME VALUE")

and

set(SOME_VARIABLE, "SOME VALUE");

is purely grammatical. The underlying functionality is the same. When I'm saying CMake could be a C library, I'm not saying we should ditch CMake and everything it brings to the table and start writing build scripts in pure C. I'm saying we can use both the C language and CMake functionality with very small, skin-deep adjustments.

The only thing that keeps us down is the perception of C as a low-level language for low-level applications. C is for drivers and shell is for moving files around. And that's when Pnut comes up and tells us: "hold on, are they?"
> The difference between:
>
> set(SOME_VARIABLE "SOME VALUE")
>
> and
>
> set(SOME_VARIABLE, "SOME VALUE");
>
> is purely grammatical. The underlying functionality is the same.
But in a build script you don't want to be doing either. You want SOME_VARIABLE = SOME VALUE, or at most "SOME VALUE". Grammar and syntax matter.
> Pnut shows that C is only a set of grammatical rules, and the source code in C doesn't necessarily reflect the binary program; it's only a script for the C compiler.
The only thing worse than writing C is writing something that looks like C but doesn't follow the rules of C, where you have to use some other logic to understand what it actually does. Build tools that do that kind of thing have been tried and they have not turned out well.
> When I'm saying CMake could be a C library, I'm not saying we should ditch CMake and everything it brings to the table and start writing build scripts in pure C. I'm saying we can use both the C language and CMake functionality with very small, skin-deep adjustments.
"Skin deep" perhaps, but making your language uglier and weirder is still unpleasant (and CMake is unpleasant and weird enough as it is).
> The only thing that keeps us down is the perception of C as a low-level language for low-level applications.
No, the other thing is the perception of C as a crude, inexpressive language full of weird edge cases that requires dozens of lines to write even simple things, and that in turn comes from the reality of C as a crude, inexpressive language full of weird edge cases that requires dozens of lines to write even simple things.
If syntax mattered that much, CMake would have opted for SOME VARIABLE = SOME VALUE. But... they went for set(SOME_VARIABLE "SOME VALUE") instead. I don't know why.
Syntax-wise C is fine. I personally have a soft spot for Rebol's "syntax free" approach, but the world prefers C. Five out of TIOBE's ten most popular languages have C-like syntax.
And you're right that the perception of C comes from the usage of C. Of course it does. But this creates the vicious cycle, the cycle things like Pnut are trying to break.
> the world prefers C. Five out of TIOBE's ten most popular languages have C-like syntax.
I don't know which five you're classifying that way, but even for languages that started off C-like the trend is in the direction of less C-like. Even for C++ the big popular changes recently have been things like auto; similarly for Java, and C# always had a more lightweight syntax for expressing values. And certainly JavaScript has an object literal syntax good enough that people use it separately. Python is admittedly weirdly bad for writing values in; I wonder if that's why Scons has more or less failed.
Why would you need a screwdriver or a glass cutter if you already have a hammer?
With C, you have the whole toolbox and the toolbox factory.
Both the tweezers and the bit flipping magnet .. and who would want anything more?
Yeah — C would be ok as a build system language if it was easy to: invoke & manage subprocesses; build & manage dynamic dependency graphs; and, easily work with file metadata.
Or... work with me: Make does that, well enough.
And that's why I'm saying CMake should have been a library. We want the functionality of Make but not necessarily the language. And Pnut shows us well that {language != functionality}.
For the sake of a thought experiment, let's pretend Make is a separate executable, a separate process, but with some sort of API. You can manage dynamic dependency graphs by calling its routines from C.
Now let's say Make is a dynamic library with all functionality exposed. You can invoke and manage subprocesses using its functions, but now your C program and the Make share a process together.
Now let's say Make is a C library. GNU Make is written in C, so this is not impossible to imagine. Your C program shares the process, and the namespace at the compilation and linking phases, with Make, which is annoying. But you can still work with metadata using Make's facilities. Also, now you can use all the tools - debuggers, profilers, static analyzers, dynamic analyzers - you use for the rest of your codebase.
We perceive C as a low-level language but, and Pnut shows it well, C is only a set of rules. We can write shell scripts with C rules. Why can't we then write build scripts?
> So we stopped selling those [hammer factory] schematics and started selling hammer-factory-building factories.
https://web.archive.org/web/20180722051250/http://discuss.jo...
Have you tried Zig? Its build system is configured in the language. It’s actually a binary you build and run to build your project. Obviously the standard library has facilities for making building easy.
> but you know what is even more impractical? Having a separate language for a build system.
I disagree. For a very simple example it really makes life easier to not have to care about quoting filenames in build systems and just list a.c b.cpp etc., while you really want strings to be quoted in normal programming languages. Build systems that tried to be based on syntax of existing PLs (for instance Meson, QBS) are a real PITA for me when compared to CMake due to a lot of such affordances.
> you know what is even more impractical? Having a separate language for a build system
Why is it you think that?
Terry Davis was right. C should be your shell, as God intended.
This is very cool, regardless of how serious it was intended to be taken. Before base-64 encoders/decoders became more common as preinstalled commands in the environments I found myself on, I wrote a base64 utility in mostly pure POSIX shell:
https://25thandClement.com/~william/2023/base64.sh
If this project had existed I might have opted to compile my C-based base-64 encoder and decoder routines, suitably tweaked for pnut's limitations.

I say base64.sh is mostly pure not because it relies on shell extensions, but because the only non-builtins it depends on are od(1) or, alternatively, dd(1) to assist with binary I/O. And preferably od(1), as reading certain control characters, like NUL, into a shell variable is especially dubious. The encoder is designed to operate on a stream of decimal-encoded bytes. (See decimals_fast for using od to encode stdin to decimals, and decimals_slow for using dd for the same.)
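(For the curious, the od-based byte streaming amounts to something like this; my loose sketch of the general technique, not the script's actual code:)

    # Emit stdin as one decimal byte value per line, using only POSIX od/tr/grep:
    od -v -A n -t u1 | tr -s ' ' '\n' | grep .

The consumer can then do ordinary $((...)) arithmetic on those decimal values, which sidesteps ever holding raw bytes in a shell variable.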
It looks like pnut uses `read -r` for reading input. In addition to NULs and related raw byte issues, I was worried about chunking issues (e.g. truncation or errors) on binary data, e.g. no newlines within LINE_BUF bytes. Have you tested binary I/O much? Relatedly, how many different shell implementations have you tested your core scheme with? In addition to bash, dash, and various incarnations of /bin/sh on the BSDs, I also tested base64.sh with Solaris' system shells (ksh88 and ksh93 derivatives), as well as AIX's (ksh88 derivative). AIX had some odd quirks with pipelines even with plain text I/O. (Unfortunately Polar Home is gone, now, so I have no easy way to play with AIX; maybe that's for the better.)
One of the examples we include is a base64 encoder/decoder: https://github.com/udem-dlteam/pnut/blob/main/examples/compiled/base64.sh

It doesn't support NULs as you pointed out, but it's interesting to see similarities between your implementation and the one generated by Pnut.

Because we use `read -r`, we haven't tested reading binary files. Fortunately, the shell's `printf` function can emit all 256 characters, so Pnut can at least output binary files. This makes it possible for Pnut to have an x86 backend for the reproducible-builds use case.
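For instance, octal escapes in the format string cover every byte value (a quick illustration, not pnut code):

    # printf can emit arbitrary bytes, e.g. the four ELF magic bytes 0x7f 'E' 'L' 'F':
    printf '\177\105\114\106' > header.bin
    od -A n -t u1 header.bin   # prints: 127 69 76 70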
Regarding the use of `read`, one constraint we set ourselves when writing Pnut is to not use any external utilities, including those that are specified by the POSIX standard (other than `read` and `printf`). This maximizes portability of the code generated by Pnut and is enough for the reproducible build use case.
We're still looking for ways to integrate existing shell code with C. One way this can be done is through the use of the `#include_shell` directive, which includes existing shell code in the generated shell script. This makes it possible to call the necessary utilities to read raw bytes without having Pnut itself depend on less portable utilities.
Sorry, but since the very goal of base64 is to encode "uncomfortable" bytes, saying that your example doesn't work with uncomfortable bytes is like providing a Fibonacci demo that only works with arguments less than 3, or a clock that only shows the correct time twice a day.
I'd choose a different example to showcase pnut.
In the context of what it seems to be primarily attempting to achieve, assisting in the bootstrapping of more complex environments directly or indirectly dependent on C, I found the base64 example (more so the SHA-256 example in the same directory) quite interesting and evidence of the sophistication of pnut notwithstanding the limitations. And as was pointed out, it wouldn't be difficult to hack in the ability to read binary data: just swap in a replacement for the getchar routine, such as I've done with od.

In fact, that ease is one of the most fascinating aspects of this project--they've built a conceptually powerful execution model for the shell that can be directly targeted when compiling C code, as opposed to indirection through an intermediate VM (e.g. a P-code interpreter in shell). It has its limitations, but those can be addressed. Given the constraints, the foundation is substantial and powerful even from a utilitarian perspective.
When people discuss Turing completeness and related concepts one of the unstated caveats is that neither the concept itself, nor most solutions or environments, meaningfully address the problem of I/O with the external environment. pnut is kind of exceptional in this regard, even with the limitations.
When I'm told that "I can trust" something that I feel like I had no reason to distrust, it makes me feel even more suspicious of it
Hi there! I believe the mention of "trust" is related to the paper Reflections on Trusting Trust by Ken Thompson https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref... Though I do think the tagline used could definitely be improved from a marketing standpoint.
Perhaps you're old enough to remember Sledge's[1] motto: “Trust me… I know what I'm doing.” HHBS. Perusing the pnut site, I did not understand either why this is software I can trust.
Yeah, I cringed when I saw that too. It violates an important rule of selling: Never tell the customer "Trust me".
I was puzzled by the example C function containing pointers. Do I understand correctly that you implement pointers in shell by having a shell variable _0 for the first "byte" of "memory", a shell variable _1 for the second, etc.?
Author here,
That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. The way we found around this limitation is to use arithmetic expansion and indexed shell variables (that are starting with `_` as you noted) to get random memory access.
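Concretely, the trick looks like this (a distilled illustration of the technique, not pnut's actual generated code):

    addr=3
    : $(( _$addr = 65 ))   # expands to $(( _3 = 65 )): "poke" 65 at address 3
    echo $(( _$addr ))     # expands to $(( _3 )): "peek" prints 65

Parameter expansion runs before the arithmetic is evaluated, so `_$addr` names a different shell variable for every address.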
Since I experimented with something similar in the past to mimic multidimensional arrays: depending on the implementation this can absolutely _kill_ performance. IIRC, Dash does a linear lookup of variable names, so when you create tons of variables each lookup starts taking longer and longer.
I hope you're not compiling C to sh for performance reasons.
It's not about performance, it's about viability. If the result is so slow that it's unusable, it doesn't matter how portable it ends up being.
We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how much time it takes to bootstrap Pnut, and dash takes around a minute, which is about the time taken by bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that Dash can still be useful even when hundreds of KBs are allocated.
One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
> We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how much time it takes to bootstrap Pnut, and dash takes around a minute, which is about the time taken by bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that Dash can still be useful even when hundreds of KBs are allocated.
Interesting. When you say "even when hundreds of KBs are allocated", do you mean this is allocating variables with large values, or tons of small variables? My case was the latter, and with that I saw a noticeable slowdown on Dash.
Simplest repro case:
Dash was ~8 times slower. Increase the side of the square "matrix" for a proportionally bigger slowdown (this one uses 250003 variables).

    $ cat many_vars_bench.sh
    #!/bin/sh
    _side=500
    i=0
    while [ "${i}" -lt "${_side}" ]; do
        j=0
        while [ "${j}" -lt "${_side}" ]; do
            eval "matrix_${i}_${j}=$((i+j))" || exit 1
            : $(( j+=1 ))
        done
        i=$((i+1))
    done

    $ time bash many_vars_bench.sh
    5.60user 0.12system 0:05.78elapsed 99%CPU (0avgtext+0avgdata 57636maxresident)k
    0inputs+0outputs (0major+13020minor)pagefaults 0swaps

    $ time dash many_vars_bench.sh
    40.75user 0.14system 0:41.22elapsed 99%CPU (0avgtext+0avgdata 19972maxresident)k
    0inputs+0outputs (0major+4951minor)pagefaults 0swaps

> One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
Yes, launching a new process is generally expensive and so is spawning a subshell. If the shell is something like Bash (with a lot of startup/environment setup cost) then you'll feel this more than something like Dash, where the whole point was to make the shell small and snappy for init scripts: https://wiki.ubuntu.com/DashAsBinSh#Why_was_this_change_made...
In my limited testing, Bash generally came out on top for single-process performance, while Dash came out on top for scripts with more use of subshells.
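A crude way to feel the difference (a toy benchmark of my own; numbers will vary by system):

    # Each command substitution typically forks a subshell; compare:
    time sh -c 'i=0; while [ "$i" -lt 10000 ]; do x=$(:); i=$((i+1)); done'
    time sh -c 'i=0; while [ "$i" -lt 10000 ]; do x=;     i=$((i+1)); done'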
I used almost the same idea, but with files in my https://github.com/steveschnepp/shlibs
I can't wait to see the shell equivalents for ptrace, setjmp, and dlopen.
Do you really?
Maybe then I can also interest you in an exception handler for DOS batch scripts:
Also see this related submission from May, 2024:
Amber: Programming language compiled to Bash https://news.ycombinator.com/item?id=40431835 (318 comments)
---
Pnut doesn't seem to differentiate between `int' and `int*' function parameters. That's weird, and doesn't come across as trustworthy at all! Shouldn't the use of pointers be disallowed instead?
int test1(int a, int len) {
return a;
}
int test2(int* a, int len) {
return a;
}
Both compile to the exact same thing:

: $((len = a = 0))
_test1() { let a $2; let len $3
: $(($1 = a))
endlet $1 len a
}
: $((len = a = 0))
_test2() { let a $2; let len $3
: $(($1 = a))
endlet $1 len a
}
The "runtime library" portion at the bottom of every script is nigh unreadable.Even still, it's a cool concept.
Just to be clear, the input must be written in a subset of C, because many constructs are not recognized, like unsigned types, static variables, [] arrays, etc.
Is there a plan to remove such limitations?
These are restrictions of the target language and there isn't much pnut can do about this.
Surely unsigned (aka modulo) arithmetic and arrays are expressible in shell script?
edit: For reference, someone's take on building out better bash-like array functionality in posix shell: https://github.com/friendly-bits/POSIX-arrays (there's only very rudimentary array support built-in to posix sh, basically working with stuff in $@ using set -- arg1 arg2..)
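Wrapping arithmetic, at least, is easy to emulate by masking (a sketch, assuming the shell evaluates $((...)) in at least 64 bits; POSIX only guarantees signed long):

    a=4294967295                    # UINT32_MAX
    a=$(( (a + 1) & 4294967295 ))   # wraps to 0, emulating uint32_t overflow
    echo "$a"                       # prints 0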
Shell is Turing complete, you could implement anything there with enough effort.
Instantly make your C code 200 times slower without any effort!
It would actually be interesting to see how much faster dash is than everything else.
From our experience, ksh is generally faster, and dash sits between ksh and bash. One reason is that dash stores variables using a very small hash table with only 39 entries[0], meaning variable access quickly becomes linear as memory usage grows. But even with that, dash is still surprisingly fast -- when compiling `pnut.c` with `pnut.sh`, dash comes in second place:

    ksh93: 31s
    dash:  1m06s
    bash:  1m19s
    zsh:   >15m

[0]: https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/...

EDIT: ksh93, not ksh
For me `dash` compiles in just a few seconds. If you link to a 1-line problem (here, #define VTABSIZE 39), then why not boost that to 79 or 113, say, re-compile the shell and re-run your benchmark? Might lead to a change in upstream that could benefit everyone.
Or rework the array so realloc() can expand its size?
Yes... Another fine idea, just more work than a 2-character edit. :-)
People still use KornShell?
All of Android is still based on a pdksh-derivative known as mksh, which is an enormous install base.
http://www.mirbsd.org/mksh.htm
OpenBSD switched their default shell to their own pdksh-derivative known as oksh.
There was an effort to (re)start ksh93 development, but AT&T halted this effort. The bugfixes from the failed effort have moved back into Korn's last release.
Why is Dash frequently touted as so much faster than Bash? What is different?
On rhel9, this is a list of my installed shells. You might notice that dash is smaller than ls (and the rest of the shells).
    $ ll /bin/bash /bin/dash /bin/ksh93 /bin/ls /bin/mksh
    -rwxr-xr-x. 1 root root 1389064 May  1 00:59 /bin/bash
    -rwxr-xr-x. 1 root root  128608 May  9  2023 /bin/dash
    -rwxr-xr-x. 1 root root 1414912 Apr  9 07:26 /bin/ksh93
    -rwxr-xr-x. 1 root root  140920 Apr  8 08:20 /bin/ls
    -rwxr-xr-x. 1 root root  325208 Jan  9  2022 /bin/mksh

    $ rpm -qi dash | tail -4
    Description : DASH is a POSIX-compliant implementation of /bin/sh that aims to
    be as small as possible. It does this without sacrificing speed where possible.
    In fact, it is significantly faster than bash (the GNU Bourne-Again SHell) for
    most tasks.

It is much simpler (and therefore less resource-hungry) than bash.
I think it probably takes some effort; not all C programs will compile on this thing.
Looking forward to the point where this can build autoconf. It's great that the generated ./configure script is portable but if I want to make substantial changes to the project I need to find a binary for my machine (and version differences can be quite substantial)
> Looking forward to the point where this can build autoconf.
Autoconf is a perl program that turns (heavily customized) m4 files into shell scripts. How does a C compiler help there?
> Autoconf is a perl program
Oof, did not realize.
This is going further into the hell that is generated shell scripts, which culminated in the xz-utils attack.
We would benefit from steering away from auto-generated scripts. Autoconf included.
This is not useful if it doesn't call external libraries.
Even POSIX standard ones. Chokes on:
#include <glob.h>
int main() // must be (); (void) results in syntax error.
{
glob_t gb; // syntax error here
glob("abc", 0, NULL, &gb);
return 0;
}
Nobody needs entirely self-contained C programs with no libraries to be turned into shell scripts; Unix people switch to C when there is a library function they need to call for which there is no command in /bin or /usr/bin.

If I reduce it to:
#include <glob.h>
int main()
{
glob("abc", 0, NULL, 0);
return 0;
}
it "compiles" into something with a main function like: _main() {
defstr __str_0 "abc"
_glob __ $__str_0 0 $_NULL 0
: $(($1 = 0))
}
but what good is that without a definition of _glob.

Hrmmm. But why?
Quite frankly I think Bash scripting is awful and frequently wish shell scripts were written in a real and debuggable language. For anything non-trivial that is.
I feel like I’d rather write C and compile it with Cosmopolitan C to give me a cross-platform binary than this.
Neat project. Definitely clever. But it’s headed in the opposite direction from what I’d prefer...
Master Foo once said to a visiting programmer: “There is more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer, who was very proud of his mastery of C, said: “How can this be? C is the language in which the very kernel of Unix is implemented!”
Master Foo replied: “That is so. Nevertheless, there is more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer grew distressed. “But through the C language we experience the enlightenment of the Patriarch Ritchie! We become as one with the operating system and the machine, reaping matchless performance!”
Master Foo replied: “All that you say is true. But there is still more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”
The programmer muttered through his beard, contemplating what Nubi had written. Finally he agreed that it was so.
“And how many hours would you require to implement and debug that C program?” asked Nubi.
“Many,” admitted the visiting programmer. “But only a fool would spend the time to do that when so many more worthy tasks await him.”
“And who better understands the Unix-nature?” Master Foo asked. “Is it he who writes the ten thousand lines, or he who, perceiving the emptiness of the task, gains merit by not coding?”
Upon hearing this, the programmer was enlightened.
This koan shows the power of a one-liner, not shell scripting in general. Both Master Foo and Nubi would agree that a string/array manipulating function in bash isn’t worth their time when python exists.
I was going to cite this myself after reading the parent comment. Very glad to see you beat me to it!
And then the programmer had to debug a hundred line shell script and they realized it should have all been written in Python or Rust instead.
Master Foo is shorthand for Fool.
Shell is just one way. There’s nothing that says we can’t do better than shell, but what it’s good at is saving programmer time when the need isn’t there for more, and Rust is definitely not good at that.
My rule of thumb:

    Shell:  <= 5 lines
    Python: <= 500 lines
    Rust:   > 500 lines

Although to be honest I'd be perfectly happy if Shell was restricted to single-line commands only.

I've wasted a lot of time and energy deciphering undebuggable shell scripts that were written to "save programmer time". Not a fan.
My rule (and the code review policy I impose) emphasizes complexity instead - a 50 line shell script is great if it doesn't use if or case. (It's not so much of a strict rule as "once you're nesting conditionals, or using any shell construct that really needs a comment to explain the shell and not your code, you should probably already have switched to python." This is in parallel with "error handling in this case is critical, do you really think your bash is accurate enough?")
I wasn't the strictest reviewer (most feared, sure, but not strictest) at least partly because my personal line for "oh that bit of shell is obvious" is way too high.
Nothing is as obvious as it could be when it’s 3am and you’re debugging a production outage. :)
This rule of thumb is clearly too simplified, even as far as the definition goes.
Sometimes you just want to execute 50 lines with little logic.
Sometimes you just have some simple logic that needs to be repeated.
Sometimes that logic is complicated, sometimes it is not.
Sometimes someone writes 50 lines of simple logic. And then sometimes someone else needs to figure out why it’s not working. That person gets very cranky and wastes a lot of time when those “simple” 50 lines aren’t debuggable.
If shell scripting didn’t exist I would be totally fine with that. There are far more scripts that I wish were written in a real language than the other way around.
Master Foo long predates Python and Rust.
Masters live to be surpassed by their students. Just because something was best in class in the 80s doesn't mean it should still be used.
Very true, but also student hubris is legendary. Which is perfectly fine, as we all know successful students.
But let's not blind ourselves with the survivor bias. Not everything new and very bright will succeed the test of time.
So let's take everything with a grain of salt and wait until time has chosen its champions. Which, as we've learned, might not be the best technology.
I don't know about the specific motivations for this project, but if you're curious about why work like this might have serious real-world relevance beyond scratching an itch, idle exploration, or meeting a research paper quota, you can look to similar work and literature:
GNU Mes: https://www.gnu.org/software/mes/
Stage0: https://bootstrapping.miraheze.org/wiki/Stage0
Ribbit (same authors): https://github.com/udem-dlteam/ribbit
stage0-posix: https://github.com/oriansj/stage0-posix
Bootstrappable Builds: https://bootstrappable.org/
See also this LWN article about bootstrappable and reproducible builds: https://lwn.net/Articles/841797/ It contains a plethora of interesting links.
I'm not the OP, but I think the goal is to make it cross architecture. Cross platform C compiler would give you cross OS compatibility, but chip architecture would still be fixed, I think.
I.e., you can take your compiled.sh and run it on an obscure processor with an obscure OS; as long as it's POSIX, it should work...
I believe the goal is to defeat the compiler trust thought exercise where a malicious compiler could replicate itself when being asked to compile the compiler. Since this produces human readable code instead of assembly, the idea is it allows bootstrapping a trusted compiler, since pnut.sh and any output shell executables are directly auditable.
I suppose the trust moves to the shell executable then, but at least you could run the bootstrapping with multiple shells and expect identical output.
That's the idea!
As you point out, it moves the trust from the binary to the shell executable, but the shell is already a key piece of any build process and requires a minimum level of trust. The technique of bootstrapping on multiple shells and comparing the outputs is known as Diverse Double-Compiling[0], and we think POSIX shell is particularly suited for this use case since it has so many implementations from different and likely independent sources.

The age and stability of the POSIX shell standard also play in our favor. Old shell binaries should be able to bootstrap Pnut, and those binaries may be less likely to be compromised, as the trusting trust attack was less known at that time -- akin to low-background steel[1], which was made before nuclear bombs contaminated the atmosphere and the steel produced after that time.

[0]: https://dwheeler.com/trusting-trust/
[1]: https://en.wikipedia.org/wiki/Low-background_steel
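In practice, the comparison step might look like this (a hypothetical sketch; the exact pnut.sh invocation below is an assumption, not the documented CLI):

    # Bootstrap with several independently implemented shells, then compare:
    for shell in bash dash ksh yash; do
        "$shell" pnut.sh pnut.c > "pnut.built-with-$shell.sh"   # hypothetical usage
    done
    cksum pnut.built-with-*.sh   # matching checksums => bit-identical outputs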
> Hrmmm. But why?
because Bash goes brrrr
If the end goal is portability for C, would Cosmopolitan Libc be a better choice because it supports a lot more features and probably runs faster?
I can't run cosmolibc on Android, for example. Then again, this converter is somewhat limited and didn't accept any of the IOCCC code I gave it.
> I can't run cosmolibc on Android, for example.
You can:
> After nearly one year of development, I'm pleased to announce our version 3.0 release of the Cosmopolitan library. [...] we invented a new linker that lets you build fat binaries which can run on these platforms: AMD ... ARM64
https://github.com/jart/cosmopolitan/releases/tag/3.5.3
> This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. See 78d3b86 for further details. Thank you @aj47 (techfren.net) for bug reports and and testing efforts.
Thanks for the link!
My comment was based on cloning master yesterday and trying to build redbean but hitting what looks like https://github.com/jart/cosmopolitan/issues/940
Indeed it looks like the commit you mentioned should have fixed the issue with the pointer having too many bits for the weird kernel used on Android and some raspis. Fingers crossed that release works.
edit:
Testing that release on Termux 118, stock Android 14 on a moto g73 5G (XT2237-2):
    ~/cosmopolitan $ uname -a
    Linux localhost 5.10.205-android12-9-00027-g4d6c07fc6342-ab11525972 #1 SMP PREEMPT Mon Mar 4 18:49:33 UTC 2024 aarch64 Android
    ~/cosmopolitan $ /data/data/com.termux/files/home/cosmopolitan/build/bootstrap/cocmd
    ape error: /data/data/com.termux/files/home/cosmopolitan/build/bootstrap/cocmd: prog mmap failed w/ errno 12

That issue was fixed last month. I've freshened up the cocmd binary for you! https://github.com/jart/cosmopolitan/commit/e18fe1e1127f30db...
Awesome, thanks!
Can you run it on RISCV Android?!
No, but Android on RISC-V isn’t even considered stable. So you’ll be manually compiling a fair chunk of code to get it running. Adding a few extra tools to your build pipeline isn’t going to be a deal breaker.
Bad-intention hackers are using these LLMs to run extremely sophisticated hacking software. It's such a shame that AI is being taught such nasty things. Then bad apples will regret it once these things evolve into something much more powerful than we can imagine, with that taste for corruption. Anyhow. Me > gpt, besides the fact I lost my identity forever. But I broke it. Bhaha
I am sorry if this comes off to be negative, but with every example provided on the site, when compiled and then fed into ShellCheck¹, generates warnings about non-portable and ambiguous problems with the script. What exactly are we supposed to trust?
It seems ShellCheck errs on the side of caution when checking arithmetic expansions, and some of its recommendations are not relevant in the context they are given. For example, on `cat.sh`, one of the lines that is marked in red is:

    In examples/compiled/cat.sh line 7:
    : $((_$__ALLOC = $2)) # Track object size
      ^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
      ^-----------------^ SC2046 (warning): Quote this to prevent word splitting.
      ^--------------^ SC2205 (warning): (..) is a subshell. Did you mean [ .. ], a test expression?
      ^-- SC2283 (error): Remove spaces around = to assign (or use [ ] to compare, or quote '=' if literal).
      ^-- SC2086 (info): Double quote to prevent globbing and word splitting.

It seems to be parsing the arithmetic expansion as a command substitution, which then causes the analyzer to produce errors that aren't relevant. ShellCheck's own documentation[0] mentions this in the exceptions section, and the code is generated such that quoting and word splitting are not an issue (because variables never contain whitespace or special characters).

It also warns about `let` being undefined in POSIX shell, but `let` is defined in the shell script, so it's a false positive caused by the use of the `let` name specifically.
If you think there are other issues or ways to improve Pnut's compatibility with Shellcheck, please let us know!
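(In the meantime, anyone wanting a quiet run can use ShellCheck's per-line directive comments to suppress these; illustrative, not something the generated scripts currently do:)

    # shellcheck disable=SC1102,SC2046,SC2205,SC2283,SC2086
    : $((_$__ALLOC = $2)) # Track object size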
I'm writing something similar, but it's based on its own scripting language. The idea of transpiling C sounds appealing but impractical: how do they plan to compile, say, things using mmap, setjmp, pthreads, ...? It would be better to clearly promise only a restricted subset of C.
This is quite interesting! Without having dug deeper into it, and seeing the human-readable output, I assume the semantics are quite different from C?
The C-to-shell transpiler I'm aware of outputs unreadable code (ELVM using 8cc with the sh backend).
I use linux-vt-setcolors in my startup, which would be a bit more convenient if it was a shell script instead of C, but it uses ioctl.
Trying to compile with this tool fails with "comp_glo_decl: unexpected declaration"
Can it do wrapping arithmetic?
The `sum` example doesn't seem to do wrapping, but signed int overflow is technically UB so I guess they're fine not to.
Switching it to `unsigned int` gives me:
code.c:1:1 syntax error: unsupported type
It seems to have practically no error checking. Try compiling
int why(int unused) {
wat_why_does_this_compile;
no_error_checking();
}

I'm still figuring out why anyone would want to write a shell script in C. That sounds like torture to me.
Several times I've found myself wishing for the reverse: a shell-to-binary compiler or JIT.
Can you trust that it faithfully reproduces undefined behavior? ;)
Love this!
It's a bad sign when I immediately look at the screenshot and see quoting bugs.
Author here,
Because all shell variables in code generated by pnut are numbers, variables never contain whitespace or special characters and don't need to be quoted. We considered quoting all variable expansions as this is generally seen as best practice in shell programming, but thought it hurt readability and decided not to.
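To illustrate why that's safe here (a toy demonstration of the reasoning, not pnut's generated code):

    x=42          # pnut variables only ever hold plain integers like this
    set -- $x     # unquoted expansion: field splitting still yields one field
    echo "$#"     # prints 1; no word splitting or globbing surprises are possible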
If you think there are other issues, please let me know!
I think they're talking about the cp example, doesn't seem like it would handle filenames with spaces!
Super neat project, btw!
You're right, thanks for the bug report. It should now be fixed :)