Static Integer Types (2021) (tratt.net)

Note that, here in 2022, those ARM chips with CHERI do now exist: Morello. https://www.arm.com/architecture/cpu/morello
Also, although this article says Rust would have to choose usize ~= u128 on Morello, which is unpalatable, Aria proposes that Rust instead tweak the definition of usize to say it's about addresses, not pointers, and thus usize ~= u64 on Morello.
https://gankra.github.io/blah/fix-rust-pointers/ leading into https://gankra.github.io/blah/tower-of-weakenings/
If you have nightly Rust, you can play with Aria's new semantics, because she implemented them. I think they're a good idea, but I don't have much "skin in the game", unlike, apparently, the author of this article.
> I hope it’s uncontroversial that, like Rust, languages should not allow implicit casts between integer types at all.
I find this controversial. The unstated option #4 for addressing C's permissive implicit conversions is to disallow implicit narrowing conversions, but to keep providing implicit conversions to integer types of greater-than-or-equal rank.
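To sketch option #4 in C terms (the allowed/rejected labels describe the hypothetical rule, not current C; GCC/Clang's -Wconversion warning approximates the narrowing half today):

```c
#include <stdint.h>

void option_4(void) {
    uint16_t small = 1234;
    uint32_t big = 70000;

    /* Widening (rank increases): stays implicit under option #4. */
    uint32_t widened = small;

    /* Narrowing (rank decreases): a hard compile error under option #4.
       Today this compiles silently unless you pass -Wconversion. */
    uint16_t narrowed = big;

    (void)widened;
    (void)narrowed;
}
```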
I suspect the reason you left out option #4 is an entirely different, self-imposed constraint in Rust:
> [Rust] defines From casts for integer types that will succeed on every platform. Since casting a 32-bit integer to a usize would fail on a 16-bit platform, I’m not ever allowed to compile — even on a 64-bit platform where such a cast would always succeed.
But therein lies the original questionable turn that Rust made wrt usize: there's no accounting for future platforms.
What makes C so portable is precisely C's integer ranking system and its implicit conversions. By guaranteeing relative rank and permitting implicit widening conversions, most issues on most future architectures have already been accommodated. It's not perfect, but the value for your money is immense. The vast majority of issues like this go away.
And what do you lose by permitting implicit widening conversions? There are some potential correctness issues. For example, subexpressions computing bitmasks might not behave as expected when converted to a wider type in an outer expression (see the sketch below). But this same problem exists with your proposed solutions, and with explicit conversions generally. I would even consider implicit conversions safer, because we can always add additional rules (optional or mandatory) that capture these cases (e.g. -Wwidth-dependent-shift-followed-by-implicit-widening), whereas explicit conversions usually have the effect of short-circuiting stronger type checking. (Unless you go the C++ route and add a panoply of conversion operators. But better hope you chose the right one!)
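A concrete instance of that pitfall, in plain C (the intent/gotcha comments are mine):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* At 32 bits this is all-ones, as intended... */
    uint32_t mask32 = ~(uint32_t)0;     /* 0xFFFFFFFF */

    /* ...but the same subexpression implicitly widened to 64 bits is
       zero-extended: the high 32 bits are NOT set, which is usually not
       what the author of "~0" meant. */
    uint64_t mask64 = ~(uint32_t)0;     /* 0x00000000FFFFFFFF */

    printf("%x %llx\n", mask32, (unsigned long long)mask64);
    return 0;
}
```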
You lose the ability for a build to fail on target X because the conversion wouldn't work on target Y. But that requirement is fundamentally in conflict with accommodating future architectures, and it incentivizes the kind of explicit conversions that can hide or complicate future porting issues.
Regarding this footnote:
> [8] For reasons that are unclear to me, uintptr_t is an optional type in C99. However, I don’t know of any platforms which support C99 but don’t define it.
AFAIU, the reason is exactly that the committee foresaw that not all architectures could accommodate conversions between object pointers and integers. Relatedly, the C standard DOES NOT permit conversions between object pointers and function pointers, which also means the C standard DOES NOT permit conversions between function pointers and uintptr_t, even if uintptr_t is defined. Both capability systems and memory architectures already existed that couldn't accommodate the latter conversions. The value of function pointer/integer conversions was much less than that of object pointer/integer conversions, so the committee defined uintptr_t only for object pointers, and made uintptr_t optional.
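Concretely (the round-trip guarantee is C99 7.18.1.4 / C11 7.20.1.4, and it is stated for pointers to void; the function pointer case is the part the standard never blesses):

```c
#include <stdint.h>

void object_round_trip(int *p) {
    /* Defined, when uintptr_t exists: void * -> uintptr_t -> void *
       yields a pointer comparing equal to the original. */
    void *vp = p;
    uintptr_t bits = (uintptr_t)vp;
    void *back = (void *)bits;   /* back == vp */
    (void)back;
}

void function_pointer(void (*fn)(void)) {
    /* The uintptr_t round-trip guarantee does not cover function
       pointers: whether this conversion means anything is
       implementation-defined at best, and non-portable. */
    uintptr_t bits = (uintptr_t)fn;
    (void)bits;
}
```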
Function pointer/object pointer conversions are widely supported by C compilers, but this is an extension that makes code non-conforming. See C11 J.5.7. Note that non-conforming does not mean undefined; it just means that such code is not C code as defined by the standard and beyond its purview. It creates a headache for POSIX, which defines a singular interface, dlsym, for acquiring both object and function references.
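The workaround the (older) POSIX dlsym rationale itself suggested is to launder the result through an object pointer, since ISO C gives you no sanctioned spelling for void * to function pointer (fn_t and "some_function" below are just placeholders):

```c
#include <dlfcn.h>

typedef int (*fn_t)(int);

fn_t lookup(void *handle) {
    fn_t fn;

    /* dlsym returns void *. Writing through (void **)&fn sidesteps a
       direct void * -> function pointer conversion, which ISO C does not
       define (C11 J.5.7 lists it as a common extension). POSIX requires
       the conversion to work anyway on conforming POSIX systems. */
    *(void **)(&fn) = dlsym(handle, "some_function");
    return fn;
}
```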
> […] but what would a signed memory address mean and what happens if/when we want to use all 64-bits to represent memory addresses […]
Half of this space is reserved for the kernel.
That’s an implementation choice (likely a highly popular one in 64-bit systems), not a necessity. It could be a quarter (as Windows could do in 32-bit: https://docs.microsoft.com/en-us/windows/win32/memory/4-giga...) or even nothing (that’s what Mac OS X did in 32-bit; see https://flylib.com/books/en/3.126.1.91/1/), in OSes that have a 100% separate address space for the kernel.
Both these choices (and others) were available in 32-bit Linux even after x86-64 was popular, because enterprises can be slow-moving (e.g. the first AMD64 CPUs came out in 2003, but it took years for some outfits to stop buying 32-bit Intel CPUs, and then years more to stop putting a 32-bit OS on them).

Linux offered a conventional 2:2 split (as your parent describes), but also a 3:1 split (like Windows with /3GB) and 4:4 (like OS X: plenty of address space, but now context switches are very slow). I believe it also had 1:3 (there are cases where what you need is a lot of RAM mapped by the kernel, but you can't or won't go 64-bit), and something like 3.5:0.5 for people who hadn't learned their lesson with the 3:1 ratio and needed to be kicked in the head more often.
> Half of this space is reserved for the kernel.
That’s an implementation detail - and what if your kernel is different, or you’re the one writing the kernel?
Right, the languages which have these "static integer types" are the ones better suited to implementing an operating system kernel, or to writing bare-metal code that doesn't sit on an operating system.

Linux is written in C (one of these languages), and likely parts of it will eventually be written in Rust (another of them). Python, which does not bother with this (in 3.x), isn't very well suited to writing such software, although Snek ("Snek is friend!" https://www.youtube.com/watch?v=w4sWZzYysvs&t=2401s) https://sneklang.org/ is somewhat suitable for small devices.
If you are interested in a C++ library that makes using integers a lot safer, take a look at Boost.SafeNumerics:
https://www.boost.org/doc/libs/1_79_0/libs/safe_numerics/doc...
Ironically (?), this doesn't seem to supply the one thing implied by the explanation it gives.
It begins by telling us about how in C++ you're obliged to detect potential overflow, because if the overflow already happened then in many cases that's Undefined Behaviour and all bets are off.
But you shouldn't use this merely to detect such overflow: out of the box it gives you an exception, and C++ proponents will insist those are only for exceptional situations, so you mustn't expect decent performance from them for mere conditions such as this overflow check, which might happen in some inner loop depending on the system design.
I don't see a way to ask the equivalent of Rust's checked_add() which gives you back Some(answer) or None depending on whether this addition overflows.
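For comparison, this is the shape of operation being asked for. Rust's checked_add returns Option<i32>; the closest C spelling uses the GCC/Clang builtin __builtin_add_overflow (not standard C, and the wrapper name below is mine):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Overflow-checked addition: returns true and stores the sum on success;
   returns false on overflow (the builtin still stores the wrapped bits). */
static bool checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    return !__builtin_add_overflow(a, b, out);
}

int main(void) {
    int32_t sum;
    if (checked_add_i32(INT32_MAX, 1, &sum))
        printf("sum = %d\n", (int)sum);
    else
        puts("overflow");   /* this branch is taken */
    return 0;
}
```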
> Undefined behaviour on integer types is a terrible idea (though unspecified behaviour might have a place).
Would unspecified behavior be sufficient to attain those compiler optimizations that are the reason for keeping signed integer overflow undefined in newer C and C++ versions?
No, it would not be sufficient.
“Undefined behavior” is nothing more than “the compiler is allowed to assume that your program does not do this, and can use that assumption to optimize your program.” The compiler can optimize your program to do something completely unexpected when your integers overflow.
For example, there are architectures with a special “counter” register that is more efficient to use for loops. This may come with some kind of instruction like “decrement counter and branch if not zero”. The semantics of this counter register are often not a clean match for the semantics of signed integers. For example, the counter may be wider than the integer type you’re using, or it may only support certain types of comparisons.
Edit: That’s the example that came to mind. There are a bunch of micro-optimizations involving arithmetic or comparisons that are only valid when overflow is not possible. But loop optimizations are the monster that looms over everything in a discussion about signed integer UB in C, because when signed integer overflow is undefined, the compiler is allowed to make tons of inferences about loops that are much harder to justify otherwise. The compiler can transform variables to use induction, make inferences about how memory accesses work, do certain types of vectorization, etc. If you are curious about the performance impact, try compiling various benchmarks with `-fwrapv` or `-ftrapv` and compare the performance to what happens when you don’t use either of those flags.
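A small illustration of the loop inference (a sketch; compare the code GCC or Clang generates for this at -O2 with and without -fwrapv):

```c
/* Stride-3 loop. With signed-overflow UB, the compiler may assume
   "i += 3" never wraps, so the trip count is a simple function of n and
   the loop is a candidate for vectorization and induction-variable
   widening. Under -fwrapv, i could wrap past INT_MAX back to a negative
   value when n is near INT_MAX, so those inferences are no longer free. */
void scale(float *a, int n) {
    for (int i = 0; i < n; i += 3)
        a[i] = a[i] * 2.0f;
}
```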
Unspecified behavior is actually a lot narrower than it sounds. It sounds like “the compiler can do anything here”, but that’s actually what undefined behavior is. For unspecified behavior, you’re either using an unspecified value, or there’s something with a set of possible behaviors to choose from. For example, evaluation order is unspecified, the sign of certain operations is unspecified, whether two identical string literals compare equal is unspecified, etc. The compiler has to make some choice—function arguments are evaluated in some order, two particular string literals either compare equal or don’t.
Consider an optimization like (x+a)/a => x/a+1. This is possible because the compiler assumes “x+a” does not overflow. If overflow were unspecified behavior, something would still have to happen when you add x+a: the result would have to trap or be some value of the correct type. No possible “unspecified” behavior would result in additional precision that prevents overflow (the standard is clear about the precision of arithmetic operations).
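To make that concrete (note the rewrite additionally needs x and a non-negative for C's truncating division to make the identity hold, but overflow is the part at issue here; compile with -fwrapv to observe wrapped results legally):

```c
#include <limits.h>
#include <stdio.h>

/* The rewrite described above: valid only if the compiler may assume
   x + a does not overflow. */
int original(int x, int a)  { return (x + a) / a; }
int rewritten(int x, int a) { return x / a + 1; }

int main(void) {
    /* If overflow merely wrapped to "some value of the correct type",
       the two forms would disagree, which is why that is not enough to
       license the rewrite: */
    printf("%d vs %d\n", original(INT_MAX, 2), rewritten(INT_MAX, 2));
    /* e.g. -1073741823 vs 1073741824 under -fwrapv semantics */
    return 0;
}
```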
Is there a C++ library that implements static integer types with these ideas? In principle the operations don’t seem complicated, but there are probably enough edge cases that it’s tricky to get it all right.