Neverflow: C macros that guard against buffer overflows
The problem with C and buffer overflows isn't that you can't guard against them, or that there is no existing, reusable code to do so — it's that none of this functionality is standardized. Adding another one to the existing 41383 ways of doing this is in fact the exact opposite of what's needed. Ideally C needs one way of doing this, and that would be described in the standard.
But that's not how C "rolls", and we'll never get that. So I guess we now have 41384 ways to do buffer overflow guards.
There is value in actually understanding what someone is doing with regard to protecting against buffer overflows, instead of relying on well-established patterns.
Not when I’m trying to orchestrate third party libraries.
C never has just one way to do something. myArr[5] == 5[myArr] == (insert pointer arithmetic that I won't write here without a compiler check). I think that part of C's beauty is that it gives you freedom. Freedom to shoot yourself in the foot, freedom to write hyper efficient code, and freedom to choose another tool.
I agree that this will never be implemented as a standard, but I think that's a good thing. Higher-level languages push against their boundaries non-stop. Java has libraries and frameworks that fundamentally change the syntax and functionality of the language. C knows what it is. If you want something it can't do, it promises that you can either build it yourself or switch to a different tool.
All of this to say, C has a single suggested way of doing this: using a different language. That's part of why we built them.
Those are syntactic sugar for the same thing though. Array[5] is just shorthand for *(Array + 5), which is why 5[Array] also works (because addition is commutative).
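A quick demonstration; all three expressions denote the same element:

#include <stdio.h>

int main(void)
{
    int myArr[6] = {10, 11, 12, 13, 14, 15};
    /* myArr[5], *(myArr + 5), and 5[myArr] are the same access. */
    printf("%d %d %d\n", myArr[5], *(myArr + 5), 5[myArr]); /* 15 15 15 */
    return 0;
}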
Note that C does have strong conventions, such as that strings are terminated by a zero byte. Nothing in the language demands that, it’s just a convention! C could adopt better conventions.
> Note that C does have strong conventions, such as that strings are terminated by a zero byte
Stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.
> literal strings are ASCIIZ.
If only. In C, it’s a (95+5)-item character set that happens to be a subset of ascii. See https://en.cppreference.com/w/c/language/charset:
“The basic literal character set consists of all characters of the basic character set, plus the following control characters”
That page also explicitly says:
The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.
If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.

Code unit  Character      Glyph
U+0024     Dollar Sign    $
U+0040     Commercial At  @
U+0060     Grave Accent   `

Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).
Also (https://en.cppreference.com/w/cpp/language/charset):
“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”
If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.
Also, chances are this changed in subtle ways between C and C++ versions.
One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be
11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
ptr ^
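A minimal sketch of that layout in plain C (mk_lstring and lstring_len are hypothetical helper names):

#include <stdlib.h>
#include <string.h>

/* Allocate a string with its length stored one word before the data.
   The returned pointer is still NUL-terminated, so old code keeps working. */
char *mk_lstring(const char *src)
{
    size_t len = strlen(src);
    size_t *base = malloc(sizeof(size_t) + len + 1);
    if (base == NULL) return NULL;
    *base = len;                  /* the hidden length word */
    char *s = (char *)(base + 1); /* what callers see */
    memcpy(s, src, len + 1);
    return s;
}

size_t lstring_len(const char *s) /* O(1): no scan for '\0' */
{
    return ((const size_t *)s)[-1];
}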
C could be upgraded to do this in future versions, without too much backwards incompatibility.
From the C99 draft at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf :
"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"
This means, for example, that strlen() must always scan for the location of the first null character; there's no way to take advantage of a stored length.
How would this work?
Assuming I did it correctly, this should print "Hello!":

void *x = malloc(8);
...
uint64_t i = 5216694956355289088; /* Python: int.from_bytes(b'Hello!\0\0'), big-endian order; the bytes come out reversed on a little-endian machine */
memcpy(x, &i, 8);
char *s = x;
puts(s);

When would the length get added to the start of the string?
> C could be upgraded to do this in future versions, without too much backwards incompatibility.
But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.
Could you mention one of them?
Strings can point anywhere in the malloc'ed region:
char buffer[] = "railroad";
char *s = buffer;
char *t = buffer + 4;
printf("mult: %ld\n", strlen(s) * strlen(t));

Suppose I read 100 bytes, formatted as "{name}\t{rank}\t{serial number}\t" using variable-length parts. I can read the data into a single string buffer, replace the tabs with NULs, and set up strings pointing into the middle of the buffer.
Even better, the protocol might have NUL characters already in the data, expecting C strings to point to the correct start.

/* 100 bytes formatted as: name\trank\tserial no\t. */
typedef struct {
    char buf[101];
    char *name, *rank, *serialno;
} person;

int read_data(FILE *f, person *p)
{
    char *s;
    if (fread(p->buf, 1, 100, f) != 100) return -1;
    p->buf[100] = 0;
    p->name = p->buf;
    if ((s = strchr(p->buf, '\t')) == NULL) return -2;
    *s = 0;
    p->rank = s + 1;
    if ((s = strchr(s + 1, '\t')) == NULL) return -2;
    *s = 0;
    p->serialno = s + 1;
    if ((s = strchr(s + 1, '\t')) == NULL) return -2;
    *s = 0;
    return 0;
}

person subject;
if (read_data(stdin, &subject)) fail("cannot read.");
printf("Hello %s %s.\n", subject.rank, subject.name);
...

Sure. For instance, there are times when you need to pack strings tightly together. Adding an extra byte or two before the start of the string would get in the way. You could work around it in many cases, but it makes the code uglier and harder to understand/maintain.
One of the things that makes C particularly suitable for certain sorts of tasks is that it's mostly WYSIWYG when it comes to the relationship between data structures and the actual memory layout. Having "hidden" things like a length value before the string steps on that.
I agree on the first paragraph, but the second one applies poorly to strings:
"hello" has length 6 because there's a hidden \0 even if I never wrote it in the code.char *s = "hello";if you wanted to pack strings together tightly, couldn't your string library have a separate "array" concept where all the sizes are stored separately?
My copy of the C standard says "A string is a contiguous sequence of characters terminated by and including the first null character."
Many of the str functions in the C standard library assume a nul terminator.
Yes, but aside from string literals pointed out by a sibling comment, nothing in the language itself dictates this convention. The C library could be augmented with functions which expect strings structured in other ways.
> nothing in the language itself dictates this convention.
String literals are nul-terminated, e.g.: "foo"[3] == '\0'
Checked arithmetic has been implemented in the standard with `stdckdint.h` (C23), so give it 50 more years!
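For reference, a minimal sketch of what that header provides (C23; checked_alloc is just an illustrative name):

#include <stdckdint.h> /* C23 checked-arithmetic macros */
#include <stdlib.h>

/* Returns NULL instead of wrapping when n * size overflows size_t. */
void *checked_alloc(size_t n, size_t size)
{
    size_t total;
    if (ckd_mul(&total, n, size)) /* true means the multiplication overflowed */
        return NULL;
    return malloc(total);
}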
> Ideally C needs one way of doing this, and that would be described in the standard.
I'm really glad that C doesn't do this, personally. It would reduce one of the main advantages of the language.
> existing, reusable code to do so
Is there a library that you recommend for this?
Even without array bounds checking, a bit of discipline and smart conventions will go a long way toward reducing errors:
1. Define a macro function for retrieving the length of an array:
#define LEN(arr) (sizeof (arr) / sizeof (arr)[0])
2. Don't introduce macro constants for array lengths; hard-code the length in the declaration and use LEN to retrieve it. Example:

int a[100];
...
for (i = 0; i < LEN(a); i++) {
...
}
3. Define a macro function for dynamic array allocation:

#define NEW_ARRAY(ptr, n) \
(ptr) = malloc((n) * sizeof (ptr)[0]); \
if ((ptr) == NULL) { \
fprintf(stderr, "Memory allocation failed: %s\n", strerror(errno)); \
exit(EXIT_FAILURE); \
}
4. When you create a function with an array argument, also add an argument for the array length.

5. Use a convention for naming the length of array pointer targets, for instance by adding the suffix `Len'. Example:
int *b, bLen = 100;
...
NEW_ARRAY(b, bLen); /* nice to know that b and bLen belong together */
...
SomeFunction(b, bLen, ...);
...
for (i = 0; i < bLen; i++) {
...
}
6. Define your own safe wrappers around unsafe standard library functions, or use someone else's code that does that.

The issue with 1 is that it only works until you pass an array into a function by pointer, then the macro no longer works.
In my experience it's most likely that a function will write past the bounds of a buffer that's been passed as an argument. In that case, make sure the size of array is always included as an argument as you said in 4.
> The issue with 1 is that it only works until you pass an array into a function by pointer, then the macro no longer works.

GCC even has a warning for this.
Even worse, even if you declare the argument to be of array type, it will still decay to a pointer. Basically, this macro will only work if you use it in the same function where the array is defined.
You need to pass a pointer to an array: https://godbolt.org/z/jYzY79ac4
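A sketch of the pointer-to-array approach (names here are illustrative; assumes C99 variably-modified parameter types, which became optional in C11):

#include <stdio.h>

#define LEN(arr) (sizeof (arr) / sizeof (arr)[0])

/* The parameter is a pointer to an array of n ints, so the
   length is part of the type and survives the call. */
void fill(size_t n, int (*arr)[n])
{
    for (size_t i = 0; i < LEN(*arr); i++) /* sizeof *arr == n * sizeof(int) */
        (*arr)[i] = (int)i;
}

int main(void)
{
    int a[8] = {0};
    fill(LEN(a), &a);
    printf("a[7] == %d\n", a[7]); /* prints 7 */
    return 0;
}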
When passing an array it decays into a pointer and the size is lost. One could imagine changing sizeof to recover it, but there was also a proposal for a _Lengthof operator which could work here.
One exception is if you explicitly define the argument as an array of fixed length.
Downside being, obviously, that it will only work with arrays of that particular length.
see item 4
Your allocation macro can lead to heap underflows if the multiplication wraps around. Which can definitely be exploitable.
You should either add overflow checking to the macro or even better just use the damn libc api and call calloc. Or if you really insist on avoiding zeroing overhead, there's reallocarray(NULL, ...) if you use a reasonably modern libc.
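For instance, a sketch of the same macro on top of calloc, wrapped in do/while so it behaves like a single statement:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* calloc rejects n * sizeof (ptr)[0] overflow and zeroes the memory. */
#define NEW_ARRAY(ptr, n) \
    do { \
        (ptr) = calloc((n), sizeof (ptr)[0]); \
        if ((ptr) == NULL) { \
            fprintf(stderr, "Memory allocation failed: %s\n", strerror(errno)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)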
You could extend point 1. by making a convention of always declaring pointers to arrays like so:
int (*data)[datalen];

This requires you to dereference it once to get an array, then dereference it a second time to get a value. The advantage is that the array value can be used the same as a normal array on the stack, including passing it to the array length macro you describe.

Isn't this exactly what the fine article does?
Nice! I don't like how C's null-terminated char arrays play with this, though. Ideally this would somehow enforce a guard null byte at the end of each array, not included in the size.
> bit of discipline
C23 improved struct compatibility so you might be able to leverage that to craft macros that better emulate slices. [1]
There is an RFC proposal for the Clang frontend for adding bounds checking reminiscent of Microsoft's SAL. [2]
[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3003.pdf
[2] https://discourse.llvm.org/t/rfc-enforcing-bounds-safety-in-...
You may be interested in this: https://github.com/uecker/noplate.git
The following is error-prone: it can be mistakenly applied to a pointer:
#define LEN(NAME) (sizeof *NAME / sizeof (*NAME)[0])
I think gcc has a warning for this pattern now: when the size of a pointer is divided by the size of its referent type.
More importantly, it has an odd extra level of indirection. The traditional definition is:
#define LEN(ARRAY) (sizeof ARRAY / sizeof (ARRAY)[0])
This means that to use LEN on an array, we have to take the address:
char *array[5];
LEN(&array); // -> 5
If we use LEN(array);
which is an easy mistake, we get: sizeof *array / sizeof (*array)[0]
which is sizeof (char *) / sizeof (char)
which is sizeof (char *)
which is likely 4 or 8.

I do see that LEN is supposed to be (only) used in conjunction with ARR:
#define ARR(TYPE, NAME, COUNT) TYPE(*NAME)[COUNT]
but that isn't enforced. An idea would be to add some "secret" prefix or suffix to NAME, like blah_ ## NAME, so that the name cannot be referenced without going through the macros; i.e. if we define ARR(int, foo, 42) then there is no declared identifier foo; it actually declares blah_foo, and LEN(foo) knows about that, also adding the prefix. Thus mistakenly using LEN(foo) on something not declared with ARR will likely be a reference to an undeclared identifier.

It's so funny, but I actually had this in 0.0.1 for the exact same reason. I removed it in 0.0.2 today after complaints that it complicates things and is a bit confusing. It made it harder to pass VLAs to functions. Maybe if I find a better way I will bring back the name mangling, but for now being able to pass arrays to functions while maintaining the same flexibility is more important imo.
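A sketch of that prefix idea (blah_ is the stand-in prefix from the comment above):

/* Route all access through the macros so the bare name can't leak. */
#define ARR(TYPE, NAME, COUNT) TYPE (*blah_ ## NAME)[COUNT]
#define LEN(NAME) (sizeof *blah_ ## NAME / sizeof (*blah_ ## NAME)[0])

ARR(int, foo, 42);  /* actually declares blah_foo */
/* LEN(foo) expands to use blah_foo and yields 42, while
   LEN(bar) fails to compile: blah_bar was never declared. */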
The expansion of the AT macro seems a bit bloated:
#define AT(NAME, IDX) \
((typeof(&(*NAME)[0])) \
((ASSERT(((size_t)IDX) * sizeof(*NAME)[0] < sizeof *NAME, \
"Buffer Overflow. Index [%lu] is out of range [0-%lu]", \
((size_t)IDX), ((sizeof *NAME / sizeof(*NAME)[0]) - 1))), \
((uchar *)*NAME) + ((size_t)IDX) * sizeof(*NAME)[0]))
Some of this might be pushed into a non-inlined run-time support function. That could be static and defined in the header, to keep it header-only, but ideally there would be a .c file so it's defined only once.

When you factor in the definition of ASSERT, and the ERRLOG macro that it uses, it's a lot of cruft for just one array access.
Some compile-time options (via preprocessor macros) to control the bloat would be useful; e.g. a way of compiling it so that AT will just predictably crash, without a detailed error message with __FILE__ and __LINE__ and all. Basically just the check, with a branch to some code that calls abort() if it's out of bounds.
Benchmark it with -O3; does it really matter?
Interesting idea, although given the demotion to an optional feature in C11, it isn't necessarily portable.
Also doesn't cover all the string and memory buffer manipulations.
SAL and Frama-C are the bare minimum for security in C code.
Frama-C as a bare minimum is a pipe dream.
It's a nice thought, don't get me wrong, but it's hard enough to convince people to add `-fsanitize=...` to their compiler flags. An entire separate static analysis tool with its own learning curve (and its own set of idiosyncrasies) doesn't really qualify for "bare minimum" IMO.
Thankfully the ongoing cybersecurity laws will change that mindset.
No, I don't think they will.
You will be surprised.
https://www.eff.org/deeplinks/2023/05/eus-proposed-cyber-res...
None of this is going to meaningfully impact C/C++ software. If it comes to pass at all, it'll be used at the margins to replace more C code with Rust.
It only needs to have money attached to code fixes.
The problem with developers that don't do consulting is that they have no idea how each hour of their work relates to product development costs.
In Germany, services companies are already required to provide security fixes free of charge and warranties.
Someone has to pay those hours.
It is no accident that Google, Apple, Microsoft always mention increasing costs with bug fixes, when pushing for writing new code in safer languages.
We will see. In a regulatory context, "the implementation is the spec" usually does not convince.
We will make VM types (variably-modified types), i.e. pointers to VLAs, mandatory in C23.
What is SAL?
Source-code annotation language (SAL) [1].
[1] https://learn.microsoft.com/en-us/cpp/code-quality/understan...
Besides the sibling comment, SAL was born out of the security efforts to fix Windows XP that ended up with the release of Windows XP SP2.
Why use C and keep reinventing things that C++ provides?
If one is ready to switch languages, then the clear winner is rust over C++, and I say that as someone who avoided diving into Rust for years because it seemed completely overhyped and with too much cryptic syntax.
C still wins by far when writing libraries that will be used by lots of other people. It doesn't matter what language they are using; they will be able to add in a library written in C very easily. With C++ or Rust libraries, however, even with appropriate bindings for the target language, users of the library will need to bring in an entirely new compiler tool chain that may or may not exist on the target architecture. But the C tool chain will exist for that architecture and be robust.
Availability of C++ tooling is much, much closer to availability of C tooling (often it's the same tool!) compared to Rust. Adopting Rust isn't the same category of conversion at all.
For new side projects, pick what you want to use of course. But for existing codebases and projects that aspire to have maximum impact, I recommend fully considering tradeoffs instead of thinking in terms of "clear winners".
> Availability of C++ tooling is much, much closer to availability of C tooling (often it's the same tool!) compared to Rust. Adopting Rust isn't the same category of conversion at all.
Which tooling? Just curious, asking entirely in good faith. My recollection is that the majority of tooling I was using with C++ worked with Rust - debuggers, profilers, and sanitizers being the main tools. Although I find that I use them much less frequently since I don't find debuggers as useful for the types of bugs I have these days, and sanitizers are only useful if you have unsafe, and profilers are cool but usually I just write benchmarks using a crate and then iterate from there.
The parent comment says "compiler tool chain", and I understand "tooling" here as meaning that. So: compiler, linker, assembler, etc.
All the major C compilers are also C++ compilers, and none are (yet) Rust compilers, so out of the gate, C++ has similar availability to C.
And yet, even with that, Yann Collet cites Google's use of C++ for its compression library as a critical mistake that allowed him, an unknown, to gain traction with his own compression methods. Google later rewrote their library in C:
https://overcast.fm/+LfVPHmBTo
Even if the tool chain exists, it must be adopted, unless you can rely on binaries being available for your end users, which will never be the case for a library which is just starting out. And adding another dependency to your build process, especially one as complex and with as many breaking version changes as C++, is a lot of work to take on.
Unless we are talking about an obscure platform or some PIC CPU, a C++ compiler is available on the same box as the C compiler.
Second, extern "C" exists.
Third, in what concerns clang and MSVC, the C library is actually implemented in C++ with extern "C".
My single sentence may have been too concise; there are two concepts here: 1) the tool chain may or may not exist, and 2) bringing that tool chain into the build system.
Even if it's the "same" toolchain for compiling C++ as it is C, adding the complexity of an additional language to the build process, and the extra versioning headaches that C++ adds over C, is enough to kill library adoption.
As I said originally, providing bindings is not the challenge, it's all the other stuff.
"To Save C, We Must Save ABI"
https://thephd.dev/to-save-c-we-must-save-abi-fixing-c-funct...
If you are going on about proprietary tool chains... most of those are moving to LLVM, which Rust is based on. In theory any proprietary toolchain based on LLVM could provide rustc, given incentives to do so.
If you mean the lack of a Rust compiler built on GCC, that seems to be an ongoing project with some momentum.
Realistically the most widely used architectures are now supported by rustc through llvm... x86, arm, riscv, and even to some extent xtensa now.
Power, arc, mips, sparc, and some others aren't too far away if someone cared enough.
If Linux can support Rust, I'd think that's a good sign most project can use Rust.
That's just the compilation toolchain. For better or worse, existing C projects have their whole workflows sitting on top of bespoke tools with the assumption that there is a C toolchain. And Rust projects assume cargo, etc. You're more or less doing a parallel rewrite in Rust to adopt Rust in an existing C project.
The Linux kernel already does extensive bespoke tooling and it's low level enough to skip cargo and such. It's rare to see that approach in Rust projects in the wild.
Are we just talking about portability then? Because "same category of conversion" seems fine - I would say that for 99.9999% of projects the difference in portability is non existent.
Basically all the libraries, IDEs, game engines, game console SDKs, HFT, HPC, OS SDKs, embedded OSes, High Integrity Computing certifications, and plenty more stuff deployed into production since C++ ARM [0] was published in 1990, 33 years ago.
[0] - The Annotated C++ Reference Manual
That's not C++ tooling. That's tooling written in C++. Two very different things.
Word games; those are domains dominated by C++. Take the meaning whichever way makes you happier.
You're saying word games but you're arguing something that the poster I replied to wasn't saying.
This is nonsense. Rust is based off of LLVM, which is what Clang is based off of. Name one modern, actually used, non-archaic system that LLVM doesn't run on. Beyond that, Cargo and all associated tooling run pretty much everywhere. So I'm not sure what outdated trope you're on about here.
Exporting a stable C ABI/API in no way requires writing the implementation in C. See Android's NDK for a rather widely deployed example. All the APIs are C, yet none of the implementations are C. Same thing works great in Rust, too. You can trivially export C from a Rust implementation.
My comment acknowledged what you state, but then went on to point out that it requires adding a tool chain to compile Rust or C++, neither of which is trivial, and which may not exist at all on the target architecture.
People using C will not change to your language-du-jour, please stop.
> People using C will not change to your language-du-jour, please stop.
Two years ago, your argument would have implied that Rust would never be allowed into the Linux kernel, and yet here we are.
Rust in the kernel is a whole different beast than what most people think--no standard library, no cargo or external crates, some memory safety features removed. It's kind of just an alternative syntax.
There is all kinds of weird stuff in the kernel; much of it will just die.
What other programming languages are used in Linux kernel code?
I wonder if Linus is taking the time to teach himself Rust.
Almost every dev I know who uses C (including myself) also uses other languages. Nobody should only have one tool in their toolbox.
You seem to be replying to the wrong comment, I am not suggesting that people switch away from C.
They better improve their error free coding skills when liability laws come for them.
This I can't wait for, but the bigger problem will be that the rest of the development process is at least as broken as the languages are.
Rust is by far not mature enough for serious development. Recent shenanigans with crablang are a strong sign of it going down the route of Java, i.e. a corporate-developed language with offshoots, which will end up with Rust in the same crappy state.
> Rust is by far not mature enough for serious development
Except it's being used for serious development today
> going down the route of Java, i.e a corporate developed language with offshoots, which will end up with Rust being in the same crappy state.
So one of the most widely used applications programming languages in the world?
>Except it's being used for serious development today
No, it's being used for pet projects by people. Serious development = major companies using it in backends.
>So one of the most widely used applications programming languages in the world?
Because of CS programs, and legacy software written in Java. Java has a community dedicated to pushing theoretical CS concepts into the language (much like Rust), while allowing things like a logging library to fetch code from anywhere on the internet and execute it, by default (which I would bet would be the future of Rust given the current trajectory).
> No, its being used for pet projects by people. Serious development = major companies using it in backends.
You mean companies like Dropbox, Cloudflare, Amazon, Microsoft...? Are they too small to be relevant?
Very few things in those companies are being written in Rust, and half of those projects chose Rust for ideological reasons rather than technical ones, with plenty of 'unsafe' thrown in for performance reasons.
https://github.com/firecracker-microvm/firecracker/search?q=...
The fact that 'unsafe' even exists in Rust means it's no better than C with some macros.
Don't get me wrong, Rust has its place, like all the other languages that came about for various reasons, but it's not going to gain wide adoption.
Future of programming consists of 2 languages - something like C that has a small instruction set for adopting to new hardware, and something that is very high level, higher than Python with LLM in the background. Everything in the middle is fodder.
You're moving the goalposts here. Rust is being used in significant projects (e.g. proxies at Cloudflare, a company where HTTP is somewhat of a big deal: https://blog.cloudflare.com/introducing-oxy/ ; Dropbox's new sync engine, at a company where file syncing is kind of a big deal: https://dropbox.tech/infrastructure/rewriting-the-heart-of-o...).
Equating the existence of unsafe with C is laughable imho (it'd be barely comparable even if 100% of the Rust code were in unsafe blocks, which never happens). And even then, it wouldn't matter for the original point: Rust is used in production for business-critical functions, in large companies.
When an OS is written in Rust fully, then we can talk about acceptance.
Parts of systems written in a language doesn't really mean anything for its adoption into the mainstream. For example, Amazon uses Ruby heavily for a bunch of deployment stuff, but Ruby (sans Ruby on Rails, which is in decline) is not really a mainstream language any more.
>Equating the existence of unsafe with C is laughable imho
I'm not comparing them. The point is to demonstrate that unsafe exists for the sole reason of performance. In fast code you often want to directly access x[y], where x and y are variables, without having to run extra code around it. It's a well-known computer science thing, as most of the code challenges given in interviews rely on this access pattern for optimal solutions.
And because of Rice's theorem, a compiler cannot determine whether x[y] is always safe, as determining all the values y could take would involve running the program.
As such, for all the advantages that Rust offers, you can have the same advantages with C plus macros and LLVM extensions, albeit with less concise syntax.
https://www.microsoft.com/en-us/research/project/checked-c/
Similar arguments were used to justify Haskell about 6-7 years ago, and Haskell is pretty much dead in the water at this point.
The modern way to make a memory safe language is to focus on a high level language that doesn't require programmer to deal with memory directly, and then work on the compiler to make the resultant code optimal.
But this isn't something the C++ language provides, which is hilarious.
C++ keeps C's crap array type as its native array type. You need to reach into the C++ standard library to get this awkward library type, std::array<type,N> and then finally you get an array type that remembers how big it is and has some basic features like swap.
True, but it also adds a lot of features that help one migrate easily to saner features without rewriting the world and throwing away 30 years of tooling.
Microsoft's security team is on the record that, even though they are adopting Rust, they won't shy away from C++.
I'm kind of on board with this, but the problem is that it's 30 years of rotten wood. Rust started from a more secure foundation and has put a lot of effort into stabilising even the trickier ground - whereas in C++ it's too often "Yeah, we don't think about it too hard, when there are strong winds I don't go up into the top floor, the creaking is very loud, I'd rather just never find out".
Example, Rust 1.0 had std::mem::uninitialized::<T>() which gives a T but it's obviously uninitialized. It's marked "unsafe" of course, but is that enough? Turns out they later realised that no, it's strictly never OK to do this, so the unsafe label was insufficiently cautious. Today std::mem::uninitialized is deprecated, Rust never removes stuff from the standard library, but you should not use this library call.
The type MaybeUninit<T> is the fix. Since MaybeUninit<T> might not be initialized, it's OK if it's not initialized, and since it might be T, it's OK for it to occupy the same amount of space as T. So, then we can initialize this memory, and tell the compiler it's initialized now, it's a T not a MaybeUninit<T>.
Can you guess how that works? It's pretty clever, and C++ could do almost the same trick, but it never has and my guess is it never will. If you don't know and are wondering, check that type definition carefully - MaybeUninit<T> is a union
For contrast, in his safety talk Bjarne Stroustrup just says as if it's obviously true, that it's safe to have uninitialized char arrays in C++. And his rationale sounds exactly like how std::mem::uninitialized happened - any possible value of a byte is a valid byte, so that's good enough, right? Nope, ask compiler engineers, there were plenty in the room when Bjarne said that, but he didn't ask them.
Sometimes it is better to have rotten wood to build something than nothing at all.
If we want to encourage Rust adoption, it is by having a middle path, not via Rust Advocacy Strike Force.
That only shuts the audience off, especially when Rust has a glass ceiling of depending on C++ infrastructure for its reference compilers.
I believe conventionally they're called the Rust Evangelism Strike Force.
And it's true that the rotten wood was better than nothing. Nobody is suggesting that NT or Linux should somehow have been developed in Rust in the 1990s. But likewise we shouldn't resist renewal in newer, better materials.
That applies to compiler internals too. Plenty of trouble down there for C++, it's just that C++ programmers can more often be sent away by assuring them that what they did was UB and so LLVM is entitled to miscompile it whereas the Rust people keep arriving with the receipts, in the form of LLVM IR that is lowered to machine code which makes no sense
Yet if I want to contribute to the Rust backend, or its upcoming GCC implementation, write C++ I must.
Same applies to the runtimes of the languages I use at work, and GPGPU related tooling when not using shaders.
Maybe then do a Go/Zig/D: focus on Cranelift and fully bootstrap Rust, before trying to rewrite the world.
You certainly can go work on Cranelift or similar projects which have a coherent IR as a central goal rather than eh, it's probably good enough to compile C++.
>"std::array<type,N>"
Unless you mean array of anything like in typeless dynamic languages I do not see anything awkward about STL arrays in C++.
It's a standard library feature, rather than a language feature.
And you might say, "Who cares? Even freestanding has the standard library". Nope, std::array wasn't added to freestanding. You can dig into the messy details for yourself if you want, but suffice to say your freestanding C++ doesn't have std::array
So the C++ language has "arrays" but they're garbage, and if you point out that the arrays are garbage you're told to use this library feature, which may not be available.
The only valid complaint about std::array is that it's awkward to declare and takes more characters to type. It is, otherwise, vastly superior in every other way.
That doesn't make them garbage. That makes them annoying.
I feel like I already explained that it's not even part of the language itself; it's a library feature, and you aren't given this feature without the rest of the hosted C++ standard library.
Which is fine if you write Windows desktop apps, but this is an array type, unlike a GUI widget, or an XML parser, it seems like I'd probably want an array type for this $1 per unit micro controller I'm writing firmware for. In Rust the nice array type works just fine, it's a proper first class type, it knows how big it is, mutable arrays coerce into a slice I can sort (only unstably, but hey, we're embedded firmware let's not get fancy), I can iterate over it properly... in C++ only the crappy C-style array is available unless I can butcher the std::array so that it works outside the hosted library. Ugh.
You do not have to butcher anything. Standalone allocation-free implementations are available if you need them.
But I see that you bring Rust in here. If that's your cup of tea then use it. No need to spill venom. Personally if I am dealing with $1 micros I very much prefer C with some selected libs for embedded. Do not really have problems with it for such small tasks.
>"takes more characters to type."
I have never perceived it as a problem. I do not think it really slows my programming. Personally I am the guy who would prefer function() vs fn() but without going into extremes of Java culture. Besides you can always alias it to whatever you want if your fingers are so sensitive.
>"feature, which may not be available"
Never been into this situation so from a practical standpoint it means zilch to me.
For me the issue is that using C++ brings every single feature in with it. It's very easy to hire developers who know the entirety of the C language, but C++ has every feature you could ever want and multiple ways of achieving the same thing.
It makes writing (and hiring) a low-level project in C++ a much more complex task. It may have benefits, it may not. But C++ is so huge that it's difficult to judge whether it would offer an advantage.
And then there's the minefield of tooling in embedded development...
Knowing every feature of C means they have to learn custom patterns on top of C to make things work, and that almost always means horrific unhygienic macros.
I keep having that discussion since the C vs C++ Usenet flamewars....
These are dependent types, which C++ does not have at all. The C support is fairly weak, though... But most programming language people I know agree that dependent types are the way to guard against overflow with minimal overhead. So I hope we can evolve C in this direction.
> These are dependent types which C++ does not have at all.
As a C++ developer, that sounds strange. Can you point me to some documentation about "dependent types"?
There’s lots of software already written in C that needs to be updated and maintained
the obvious answer is that one does not want some things that C++ entails, three examples:

- name mangling
- larger gap between source code and ISA
- impedance mismatch when working with C APIs
that being said, some do not want more macros either
> name mangling
Can be turned off on demand for relevant symbols.
> larger gap between source code and ISA
There's already a huge gap between C code and machine code (see: Undefined Behavior). C hasn't been a "portable assembler" for a very long time.
> impedance mismatch when working with C APIs
C++ has no problem working with C APIs.
Related ongoing thread:
Modern C (2019) - https://news.ycombinator.com/item?id=36167820 - June 2023 (19 comments)
Runtime bounds check tied to fprintf and abort via macros. Allocation by calloc.
The calloc part is one of the most common blind spots I see among C programmers.
I try to avoid the malloc(n * sizeof (...)) pattern as much as possible. Sure there are lots of cases where it can never overflow, and you might save a bit of overhead from the zeroing and overflow checking, but most of that overhead might also be imaginary depending on allocator internals, and even kernel internals. It's the sort of thing it only makes sense to optimise when you've already squeezed out every bit of performance. And by then you've probably minimised dynamic allocation as much as possible anyway.
It's also very easy to think something like "well, n is passed in as a parameter, but it's a static function, and I know all the callers. So it's fine".
But now every caller in the future has to be aware of this possibility.
> But now every caller in the future has to be aware of this possibility.
Can you clarify: what possibility should you be aware of with malloc that you don't need to be aware of with calloc?
Calloc is the function originally intended to allocate arrays. Instead of accepting a number of bytes, it takes two unsigned integers (size_t): the number of array members, and the size of each member. And it checks whether the result of multiplying them fits in a size_t. If not, it returns NULL, allocating nothing (and also sets errno, iirc). Then you can have your code detect it, crash or report an error, and avoid memory corruption. Even if you sloppily don't check calloc's return value, you're probably just gonna segfault, which is unlikely to lead to data leaks or code execution.
If you use malloc(n * size), and n is too large, it could wrap around, malloc gets a smaller number than the program thinks it allocated. Which means that even if the program does bounds/null checking on the array later on, it has the wrong bounds. This can be used to access or modify other objects on the heap, or even modify allocator internals in some cases, depends on the implementation details of the allocator.
So what I meant was, you better be careful using malloc(n * size) unless n is a constant. If it's in any way tied to program behaviour or user input, it's a hole waiting to happen.
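A small demonstration of the wrap-around (the numbers assume a 64-bit size_t):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = SIZE_MAX / 4 + 2;   /* e.g. an attacker-influenced count */
    /* n * sizeof(uint32_t) wraps around to 4 on a 64-bit system: */
    printf("requested: %zu bytes\n", n * sizeof(uint32_t));
    uint32_t *p = malloc(n * sizeof(uint32_t)); /* tiny 4-byte allocation */
    uint32_t *q = calloc(n, sizeof(uint32_t));  /* NULL: overflow caught */
    printf("malloc: %p, calloc: %p\n", (void *)p, (void *)q);
    free(p);
    free(q);
    return 0;
}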
calloc has its own set of gotchas, though. For instance, it may allocate a different amount of memory than you requested, and it comes with the overhead of zeroing out the allocated memory.
Neither of these may matter to you, but when they do, they really matter. So you still have to be thoughtful about using it. Not so different from how you have to be thoughtful about using malloc.
I tend to see zeroed memory as an advantage in the vast majority of cases. And when it's actually significant overhead then s/calloc(/reallocarray(NULL,/
The thing I like about almost always allocating through calloc is this: I know that if my code is somehow not initialising memory properly, the resulting bug will be the same each time, and therefore faster to reproduce and debug. Not that I misinitialise my memory very frequently anymore, it's not that hard to get right.
Surprisingly often, I've found that so much of my data should probably default to zero anyway, so it doesn't really matter all that much.
Calloc can over-allocate, which I always found annoying myself, although at least with calloc, you know that if you only index the pointer modulo the n you passed to calloc, you won't invoke any demons from the underworld.
But yeah, in general, to really know what you're doing in C, you kind of have to understand memory allocators at a fairly deep level, because the footguns are aplenty. You need to have a mental model of the heap and stack.
> And it [calloc] checks whether the result of multiplying them fits in a size_t.
I never knew this was part of the standard. No documentation I saw for calloc (manpages, or similar) ever said it checked for overflow.
Multiplying array length by sizeof(element type) can overflow.
Of course, you can write your own malloc_array() that uses __builtin_mul_overflow() and doesn't come with calloc's drawback (the cost of zeroing the allocated memory).
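For instance, a sketch of that malloc_array(), assuming GCC/Clang's __builtin_mul_overflow():

#include <stdlib.h>

/* Returns NULL if n * size would overflow, instead of wrapping. */
void *malloc_array(size_t n, size_t size)
{
    size_t total;
    if (__builtin_mul_overflow(n, size, &total))
        return NULL;
    return malloc(total);
}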
OpenBSD's libc has reallocarray for this, which is realloc with the same bounds checking as calloc, but if the first parameter is NULL, it's just calloc without the zeroing.
And I believe you'll find it in glibc too these days? Or if not, there's always libbsd, which has lots of handy stuff anyway.
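A sketch of how that reads in practice (reallocarray is in OpenBSD's libc and glibc >= 2.26; glibc gates it behind _DEFAULT_SOURCE):

#define _DEFAULT_SOURCE /* for reallocarray on glibc */
#include <stdlib.h>

/* Grow an int array with the same overflow check calloc does,
   but without zeroing; plain realloc semantics otherwise. */
int *grow_ints(int *p, size_t n)
{
    return reallocarray(p, n, sizeof *p);
}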
Yep, good point.
This evaluates macro parameters multiple times, so if the parameters have side effects or evaluate inconsistently this won't work. For example:
size_t SomeIndex() {
static size_t example_index = 0;
return example_index++ % 2;
}
int main() {
NEW(int, arr, 1);
// This buffer overflow is not detected:
*AT(arr, SomeIndex()) = 42;
return 0;
}

Never heard of a serious buffer overflow caused by _constant_ indices. Does it work with AT(arr, i), or only with AT(arr, 10)?
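One conventional remedy, sketched here assuming GNU C statement expressions, is to bind the index to a temporary so it is evaluated exactly once (AT_ONCE is a hypothetical wrapper around the AT macro under discussion):

/* The index expression's side effects happen exactly once. */
#define AT_ONCE(NAME, IDX) \
    ({ size_t at_idx_ = (size_t)(IDX); \
       AT(NAME, at_idx_); })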
"'Brother,' says he, 'greetings. Didn't I see you in Southern Missouri last summer selling colored sand at half-a-dollar a teaspoonful to put into lamps to keep the oil from exploding?'
"'Oil,' says I, 'never explodes. It's the gas that forms that explodes.' But I shakes hands with him, anyway.
...
"'Listen,' says I. 'I instruct her to keep her lamp clean and well filled. If she does that it can't burst. And with the sand in it she knows it can't, and she don't worry.
— O. Henry, The Man Higher Up
Did you mean to reply somewhere else? This thread is about bounds checking arrays in the C programming language.
You definitly didn't understood the message.
If you did understood it, then explain it so I can understood it too.
Somehow doesn't seem worth my time.
So it was worth your time to reply twice, but not to explain anything?
Yes, because I don't expect you to understand anyway and ELI5 would take a bit, much longer than these dumb comments.
Hint, someone else got it.
I'm starting to think you didn't understood the message or you wouldn't be avoiding explaining yourself while trying to insult people.
Joker_vD got it.
How does what they are saying have anything to do with what I replied to?
That's the definition of an allegory, but how do the two things relate?
Why did you say you "don't have time" then go to great lengths to not explain anything or back up what you're saying in any way?
But it's absolutely true though: if only the C programmers write their code very carefully and in specific patterns, the buffer overflows and invalid dereferences won't happen and therefore won't explode their programs! By the way, only today I have a silver bullet to sell with "runtime safety violations" written on it, anyone willing to buy it?
Yeap, that's the whole point of it
Huh, I misinterpreted the error messages in the example; I thought those were compiler output. This is quite cool then.
EDIT: although, it seems like this loses much of its power once you start passing these buffers around to functions that do not use these macros.
> this loses much of its power once you start passing these buffers around to functions that do not use these macros.
Alas it's even worse: once you pass buffers around to functions, you can't use these macros!
The 0.0.2 update is live and solves this issue. Check the updated README.
See also here for my experiments, but it relies on UBSan for bounds checking: https://github.com/uecker/noplate.git
The best way to deal with this kind of thing is to write a small language that transpiles to the subset of C that you are using.
Here is a different take on it. We can use #define to inform the header about the properties of certain symbols.
Here is my oob.c program. I will show the output, and then the content of "oob.h".
#include <stdlib.h>
#include <stdio.h>
#include "oob.h"
int oob_fail(const char *file, int line)
{
fprintf(stderr, "%s:%d:out of bounds array access\n", file, line);
abort();
}
/*
* Declare properties of array type x
*/
#define ARRAY_ELTYPE_x int /* element type is int */
#define ARRAY_SIZE_x 7 /* number of elements is 7 */
/*
* Ensure array type x is fully declared at file scope
*/
ARRAY_FULLTYPE(x);
/*
* Inform the OOB module that the identifiers p and a are
* used as variables related to type x: either pointers
* to it or values.
*/
#define ARRAY_TYPEOF_p x
#define ARRAY_TYPEOF_a x
int get_elem(ARRAY_TYPE(x) *p, int i)
{
return APREF(p, i);
}
int main(void)
{
ARRAY_TYPE(x) a = ARRAY_INIT(1, 2, 3);
for (size_t i = 0; i <= ARRAY_SIZEOF(a); i++)
printf("a[%zd] == %d\n", i, get_elem(&a, i));
return 0;
}
Output:

$ ./oob
a[0] == 1
a[1] == 2
a[2] == 3
a[3] == 0
a[4] == 0
a[5] == 0
a[6] == 0
oob.c:31:out of bounds array access
Aborted (core dumped)
The content of "oob.h":

#ifndef OOB_H_435E_FDE9
#define OOB_H_435E_FDE9
int oob_fail(const char *file, int line);
#define OOB_PREFIX oob_ident_
#define OOB_XCAT(X, Y) X ## Y
#define OOB_CAT(X, Y) OOB_XCAT(X, Y)
#define ARRAY_ELTYPE(T) OOB_CAT(ARRAY_ELTYPE_, T)
#define ARRAY_SIZE(T) OOB_CAT(ARRAY_SIZE_, T)
#define ARRAY_TAG(T) OOB_CAT(ARRAY_TAG_, T)
#define ARRAY_FULLTYPE(T) \
struct ARRAY_TAG(T) { \
ARRAY_ELTYPE(T) a[ARRAY_SIZE(T)]; \
}
#define ARRAY_TYPE(T) struct ARRAY_TAG(T)
#define ARRAY_TYPEOF(V) OOB_CAT(ARRAY_TYPEOF_, V)
#define ARRAY_SIZEOF(V) ARRAY_SIZE(ARRAY_TYPEOF(V))
#define ARRAY_INIT(...) { { __VA_ARGS__ } }
#define AREF(ARRAY, I) \
(((size_t) (I) >= ARRAY_SIZEOF(ARRAY)) \
? oob_fail(__FILE__, __LINE__), (ARRAY).a[0] \
: (ARRAY).a[I])
#define APREF(PARRAY, I) \
(((size_t) (I) >= ARRAY_SIZEOF(PARRAY)) \
? oob_fail(__FILE__, __LINE__), (PARRAY)->a[0] \
: (PARRAY)->a[I])
#endif
Preprocessor output for oob.c (snipped down to the relevant part, after the run-time support function oob_fail):

struct ARRAY_TAG_x { int a[7]; };
int get_elem(struct ARRAY_TAG_x *p, int i)
{
return (((size_t) (i) >= 7) ? oob_fail("oob.c", 31), (p)->a[0] : (p)->a[i]);
}
int main(void)
{
struct ARRAY_TAG_x a = { { 1, 2, 3 } };
for (size_t i = 0; i <= 7; i++)
printf("a[%zd] == %d\n", i, get_elem(&a, i));
return 0;
}
It's clean enough to be readable (except, of course, code dense with AREF or APREF calls will be a mess). It uses arrays wrapped in structs, so you can pass arrays by value.

You have to make a list of the variables that are involved and write some #define lines for them.
Same for the array types.