How to Think About Variables in C

denniskubes.com

38 points by denniskubes 13 years ago · 65 comments

voidlogic 13 years ago

Extremely uninteresting. It is like a page of "CS 1XX: Intro to C" fell out of its bindings and landed on Hacker News.

This might have been mildly interesting if there had been the assembly for a few different architectures (x86, MIPS, ARM, PowerPC, etc.) showing how the C code was translated to assembler for each. And it could have been very interesting with an additional discussion of memory barriers and atomic operations in C and their relation to assignments and pointers.

  • holyjaw 13 years ago

    Amendment: 'Extremely uninteresting' -TO YOU-.

    As someone who has had difficulty picking up real programming languages, and has only found some marginal success due to Obj-C's ARC feature, I can tell you this puts everything I've read into much better perspective.

    Try not to be so negative, man, I think it's clear you weren't even the intended target anyways.

  • minimax 13 years ago

    HN has a pretty broad audience and a pretty big chunk of it doesn't know $language. These types of beginner posts for $language pop up from time to time. It's nothing to worry about.

    • voidlogic 13 years ago

      $language in this case is C, the lingua franca of computing.

      It is almost always the first language ported to any system, almost every computer science program at least covers the basics, it has been in 1st/2nd place on the TIOBE index for over a decade, it's the 5th most popular language on GitHub by commits, and it is over 40 years old.

      But I'm willing to accept there might be people on Hacker News who don't know C; that's why I gave suggestions to the author to expand on the content and make it interesting to a wider audience. That was the point of my post.

    • mturmon 13 years ago

      Posts on elementary topics (should be) noteworthy only if mastery is exhibited. Hence, griping.

  • ultimoo 13 years ago

    I agree. I liked the opening line though: "C is memory with syntactic sugar." It is a good introductory article for someone who has never used C -- CS-1xx Intro as you said.

    • greenyoda 13 years ago

      "Syntactic sugar" generally means a syntax that's just a nicer-looking version of something that can be equivalently expressed in a more fundamental syntax. But C is more than that: it provides a way of abstracting away the details of the machine so that you don't have to explicitly deal with the fact that your machine has 64-bit pointers and 2's complement integer arithmetic and IEEE floating point and an instruction set that handles shift operations in a particular way.

      So a better formulation might be: "C provides an abstraction layer on top of a computer's memory model and instruction set that will allow your code to be portable between different machine architectures, but only if you play strictly by the rules."

      By the way, the classic K&R book explains the fundamentals of C pretty well. If you really want to understand C, I'd recommend reading it cover to cover (it's pretty short).

  • denniskubesOP 13 years ago

    I was trying to describe a simple mental model that has been helpful to me. While I agree assembly details would have been interesting, putting that in would have lost more than half the audience.

    • nemetroid 13 years ago

      > putting that in would have lost more than half the audience.

      I surely hope not.

  • blt 13 years ago

    The least they could have done is explain how structs work.

haberman 13 years ago

There are some subtle problems with the model as explained in this article. If you use this as your mental model, you will probably run afoul of undefined behavior without realizing it.

If you read the C standard, you'll notice it doesn't talk much about "memory" (the word only appears 13 times in C99); it mostly talks about "objects" (mentioned 735 times in C99). These objects aren't OO-objects -- obviously C doesn't have OOP built in -- but rather instances of all the basic types like int, float, struct, etc. are objects. When you declare a variable like "int x", you are creating an object.

C's aliasing rules dictate that you can only access an object via a pointer of that object's actual type. This is why it is dangerous to think of the assignment operator as a simple memory-copying operation. If assignment were a simple memcpy, you could do something like this:

  int x = 5;
  // BAD: undefined behavior, violates aliasing.
  short y = *(short*)&x;
If a variable were just a memory address and assignment were just a memory copy, this would be a valid operation. But the right way to think of it is that a variable is a storage object whose address can be taken, and a dereference is an operation that reads a storage object.

A pointer isn't a generic memory-reading facility, it must actually point to a valid storage object of the pointer's type (or to NULL).

If you do want to read and write arbitrary objects in memory, you can always use memcpy():

  int x = 5;
  short y;
  // This is fine, and smart C compilers optimize away the
  // function call.
  memcpy(&y, &x, sizeof(y));
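
A self-contained version of that sketch (the printed value assumes a little-endian host and the usual int/short sizes):

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      int x = 5;
      short y;
      memcpy(&y, &x, sizeof(y));   /* well-defined: copies the first sizeof(y) bytes of x */
      printf("%d\n", y);           /* 5 on little-endian machines, 0 on big-endian ones   */
      return 0;
  }
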
  • sillysaurus 13 years ago

    > If a variable were just a memory address and assignment were just a memory copy, this would be a valid operation.

    It's a valid operation regardless of whether a standards body says it's not.

      uint32 x = 5;
      uint16 y = *(uint16*)&x;
    
    The effect is to set y to the first two bytes of memory from x. Values assigned to x are serialized into memory in either big-endian or little-endian order. Those are the only two cases you have to account for. The Quake 3 engine has a macro for the above operation which produces the same value of y on all platforms. This is useful for serializing x to disk, then loading it later (and possibly on a different architecture).
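
    (A hedged sketch of that kind of helper; the name and structure here are mine, not Quake's actual macro, and the endianness check assumes a GCC/Clang-style __BYTE_ORDER__ define:)

      #include <stdint.h>

      /* Illustrative only: return v in little-endian byte order regardless of
         the host's byte order, swapping the two bytes on big-endian machines. */
      static uint16_t little_u16(uint16_t v) {
      #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
          return (uint16_t)((v >> 8) | (v << 8));
      #else
          return v;
      #endif
      }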

    One source of confusion is that int and short are essentially, for all intents and purposes, undefined -- they are of course defined by the standards, but their implementation is allowed to vary so much that no programmer can make any assumptions about their size (in bytes) at runtime.

    int8, int16, int32, int64 are all explicit and force the compiler (and the hardware) to obey the wishes of the programmer. This is, I think, the right approach. People make much ado about the fact that "a byte isn't necessarily 8 bits" and "the only assumption you can make about a short is that it's smaller than an int, and larger than a char", etc, which is probably unnecessary mental effort.
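
    (For reference, the standard spellings of those live in <stdint.h>; a tiny sketch of my own, assuming the platform provides the exact-width variants:)

      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          uint8_t  a = 0;    /* exactly 8 bits, on platforms that provide it  */
          uint32_t b = 0;    /* exactly 32 bits, on platforms that provide it */
          printf("%zu %zu\n", sizeof a, sizeof b);   /* 1 4 on typical platforms */
          return 0;
      }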

    "Bytes are 8 bits. Here are four bytes. Here's the value that the four bytes store. Copy two of the four bytes to this other spot (adjusting for endianness appropriately via a macro)."

    You typically don't want a memcpy in situations like this due to endianness.

    The reason it's useful to explicitly "break the rules" like this is because it's important to know what assumptions you in fact can rely on, regardless of what standards bodies have to say about it. Because at that point you can do incredible things such as http://www.codercorner.com/RadixSortRevisited.htm

       inline float fabs(float x){
            return (float&)((unsigned int&)x &= 0x7fffffff); // clear the sign bit through the integer view of x
       }
    
    The reason this is incredible and awesome (rather than horrible and dangerous) is because it enabled game developers to achieve a more impressive product for end users, because they were able to do more with the CPU resources that were available at the time.

    It's of course not so relevant nowadays, since it's reasonable to assume that most gamers have at least a core 2 duo. But it's one of those things that isn't relevant until suddenly it is -- you're in some situation that requires sorting millions of floats, and your dataset simply demands more performance than your compiler typically gives you. Then suddenly you find you can do amazing things like this, and surprise people with how effectively you can use a modern CPU.

    (Although, the modern antidote to "I need to sort millions of floats quickly" is to use SSE, not to sort floats as integers. Yet that's even more evidence that it's better to understand the capabilities of the hardware.)

    • haberman 13 years ago

      > It's a valid operation regardless of whether a standards body says it's not.

      Whoa there, cowboy. You may not feel personally beholden to standards bodies, but compiler vendors are following their lead. The major compilers are getting more and more aggressive about optimizing away undefined behavior every year.

      > The effect is to set y to the first two bytes of memory from x.

      No, it's really not. It's undefined behavior and the compiler is free to do absolutely whatever it wants.

      > One source of confusion is that int and short are essentially, for all intents and purposes, undefined -- they are of course defined by the standards, but their implementation is allowed to vary so much that no programmer can make any assumptions about their size (in bytes) at runtime.

      I agree with this, and have made this argument before: http://blog.reverberate.org/2013/03/cc-gripe-1-integer-types...

      But this is an entirely separate issue.

      • sillysaurus 13 years ago

        > No, it's really not. It's undefined behavior and the compiler is free to do absolutely whatever it wants.

        The point is that compilers do some specific thing, regardless of the fact that the standards bodies say they're free to reboot your computer.

        As long as all you care about is x86/x86_64/PowerPC (and probably ARM as well), then you can trust that the compiler is going to generate code which copies the first two bytes of x into the memory occupied by y.

        • __david__ 13 years ago

          > As long as all you care about is x86/x86_64/PowerPC (and probably ARM as well), then you can trust that the compiler is going to generate code which copies the first two bytes of x into the memory occupied by y.

          That's the thing that haberman is trying to tell you: you can't trust that any more, even with architectures you think you know. What you said was true about 10 years ago, but things have changed. Go read about "-fno-strict-aliasing" [1].

          [1] http://thiemonagel.de/2010/01/no-strict-aliasing/

        • haberman 13 years ago

          There be dragons. The following program prints 10 on gcc 4.6.3, x86-64:

            #include <stdio.h>
            #include <stdint.h>
          
            void f(uint32_t *x, uint16_t *y) {
              *x = 5;
              /* Under strict aliasing the compiler may assume a store through a
                 uint32_t* cannot affect what a uint16_t* points to, so it is
                 free to reorder the load of *y ahead of the store above. */
              printf("%d\n", *y);
            }
          
            int main() {
              uint32_t x = 10;
              f(&x, (uint16_t*)&x);
            }
          • sillysaurus 13 years ago

            The antidote is to put a memory barrier in between the assignment and the printf.

              *x = 5;
              __sync_synchronize();
              printf("%d\n", *y);
            
            http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins....

            The reason this example is fundamentally different from my example is because mine doesn't create two objects that point to the same memory. In such situations, memory barriers are necessary. Also, your program won't work on different platforms due to endianness.

            • haberman 13 years ago

              That is not what memory barriers are for, at all. Memory barriers are a sequencing primitive for shared-memory concurrency (an excellent intro is here: http://lxr.linux.no/linux/Documentation/memory-barriers.txt). They are never required for correctness in valid single-threaded programs.

              The memory barrier "fixed" this program similarly to how a cruise missile "fixes" a termite problem. It was just a coincidence and it was the wrong tool for the job.

              • sillysaurus 13 years ago

                Except we're talking about an invalid program. The program is invalid as written. Therefore memory barriers are the antidote because they're necessary in this situation.

                A tool doesn't have a purpose. It has capabilities, and understanding why something works (and why it can be relied upon) is all that matters.

                • haberman 13 years ago

                  Yes, it is an invalid program. The antidote is to fix it, not to jigger it in a way that happens to work. The memory barrier is not "necessary" -- it is not even a correct fix. Even with a memory barrier as you added it, it is still an invalid program that invokes undefined behavior. The memory barrier may have coincidentally fixed the problem on your system, but there is still no guarantee it will work on another architecture, another compiler, or even another version of the same compiler.

                  The problem with my program is that it casts a uint32_t pointer to a uint16_t pointer. The correct fix is to not do that. "Fixing" the problem with a memory barrier is a step in the wrong direction.

                  • sillysaurus 13 years ago

                    > but there is still no guarantee it will work on another architecture, another compiler, or even another version of the same compiler.

                    My point is that it is guaranteed to work. A memory barrier guarantees that all memory operations before the barrier take effect before any operations after the barrier.

                    I think this whole exchange is fascinating because it illustrates two completely different philosophies to hacking. Both are equally valid. I tend to prefer yours because it tends to result in shorter programs. Yet this is just a programmer convention. The machines do not care.

                    Yet there are some instances where my philosophy -- understanding which rules may be safely ignored -- has paid off. For example, if your invalid program were in a closed-source library which I was forced to interface with, then the program can't simply be fixed. In that case, a memory barrier would probably be the cleanest workaround.

                    It's an unfortunate fact that this type of situation -- broken third-party code that can't be fixed and can't be replaced -- is quite common in the field. It seems like it's an important skill for an engineer to know how to handle such situations.

                    EDIT: By the way, Scrybe Music looks really cool!

                    • haberman 13 years ago

                      It isn't guaranteed to work even with the memory barrier, because the undefined behavior is not merely an ordering problem. The problem is that merely accessing the object through the wrong kind of pointer breaks the rules and gives the compiler a license to do anything.

                      There is a time and place to break the rules, but it is a calculated risk. It can only be considered "safe" if you make assumptions about your environment (platform, toolchain, etc). You're vulnerable if any of those assumptions change. The things people considered "safe" 10 years ago aren't "safe" any more. But the people who followed the rules never have to change their approach.

                      For what it's worth, a cheaper barrier in this case (if you were going to take that route) is just a compiler barrier like __asm__ __volatile__ (""); (see: http://en.wikipedia.org/wiki/Memory_barrier#Out-of-order_exe...). There's no need to emit an actual CPU barrier.

                      Thanks about Scribe; it's a labour of love.

                      • dchichkov 13 years ago

                        I think that your original suggestion - use memcpy - is a better solution than volatile.

                        It looks like the following code would work:

                          #include <stdio.h>
                          #include <stdint.h>
                        
                          void f(volatile uint32_t *x, volatile uint16_t *y) {
                            *x = 5;
                            printf("%d\n", *y);
                          }
                        
                          int main() {
                            volatile uint32_t x = 10;
                            f(&x, (volatile uint16_t*)&x);
                          }
                        
                        And the compiler is guaranteed (*) to issue a store op for x = 5 followed by a load op for y, but the code is looking pretty ugly.

                        (*) assuming no alignment problems

    • brigade 13 years ago

      > The reason it's useful to explicitly "break the rules" like this is because it's important to know what assumptions you can in fact rely on, regardless of what standards bodies have to say about it.

      Given that compilers do break code when programmers violate aliasing rules, you should recheck what assumptions you think you can rely on. Non-strict aliasing is not one of them. Unless you want to slow everything down with compiler-specific flags like -fno-strict-aliasing.

          uint8_t foo[4]; *(uint32_t*)foo = 0;
      
      Besides even without strict aliasing, the above is not at all guaranteed to work since not all architectures support unaligned loads. (and if you think "well but no one uses them, just like no one uses 1's complement architectures anymore", keep in mind that this includes ARM)

      (also use stdint types already)

      • sillysaurus 13 years ago

          uint8_t foo[4]; *(uint32_t*)foo = 0;
        
        > Besides even without strict aliasing, the above is not at all guaranteed to work since not all architectures support unaligned loads.

        So, the interesting thing about this example is that it does work. It's in fact very, very difficult to find a platform where that example won't work (i.e. crashes the program). For example, any C library involving image manipulation is likely going to have code similar to what you've described, and those libraries work on almost every platform.

        Standards are a good and useful thing. All I'm saying is that it's important to know which rules you can safely violate.

        • __david__ 13 years ago

          > It's in fact very, very difficult to find a platform where that example won't work

          No, it isn't. Many ARM processors will bus error on that code if (foo & 3) != 0. I believe PowerPC doesn't do unaligned word reads either...

          It quite often has to do with the memory controller and not with the particular processor, though I believe x86 has to support unaligned reads. I've certainly worked first hand with ARMs that did not support it.

          • sillysaurus 13 years ago

            That's interesting. What causes the bus error?

            Would

              uint8_t foo[4];  *(uint32_t*)(&foo[0]) = 0;
            
            also result in a bus error? Why?
            • __david__ 13 years ago

              That's the same thing, so yes, if foo is unaligned then it will cause a bus error. It causes it because the code generates a store-word assembly instruction (as opposed to store-byte), and if the address is not aligned to 4 bytes then the memory controller hardware will raise a bus error.

              Notice I keep saying "if the address is unaligned". The insidious part is that it probably will work for a while since it's likely that your "foo" array will happen to be aligned. But add one uint8_t variable to your structure or stack frame or wherever "foo" is defined and things could shift and suddenly it starts causing bus errors. It can be a very annoying type of heisenbug.

              And bus errors are actually a good thing. I believe I've used hardware (an ARM or an SH2, can't remember) where the memory controller just ignored the last 2 bits during whole-word reads and writes (which works fine as long as you only read aligned words). So if you run your code on that hardware it doesn't give you an error, it just subtly "corrupts" your data. Yay!
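
              A minimal sketch of the alignment-safe way to do that store (memcpy places no alignment requirement on its arguments, and compilers typically lower it to a plain store where the target allows it):

                #include <stdint.h>
                #include <string.h>

                /* Store a 32-bit value into a possibly-unaligned byte buffer. */
                static void store_u32(uint8_t *dst, uint32_t value) {
                    memcpy(dst, &value, sizeof value);
                }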

        • brigade 13 years ago

          > any C library involving image manipulation is likely going to have code similar to what you've described

          ...which actually is exactly how I found out first-hand that it doesn't always work. If you only ever test on x86 you'll never catch it. You might not even catch it on ARM if you're lucky.

          Which is the point - that compilers can and do make use of almost all undefined behavior of C for optimizations, which one developer might not catch because their current compiler happened to work. Then a new version is released that can find and exploit more undefined behavior. And strict aliasing is one of those rules you can't safely violate.

    • 1500100900 13 years ago

      > int8, int16, int32, int64 are all explicit and force the compiler (and the hardware) to obey the wishes of the programmer.

      At least in C99, the compiler doesn't need to support exact-width integer types.
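
      (The always-available C99 fallbacks are the least-width types; a one-line sketch of my own:)

        #include <stdint.h>

        uint_least32_t counter;   /* at least 32 bits wide; required to exist, unlike uint32_t */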

      > People make much ado about the fact that "a byte isn't necessarily 8 bits"

      Well, POSIX.1-2004 requires that CHAR_BIT == 8.

    • derleth 13 years ago

      > It's a valid operation regardless of whether a standards body says it's not.

      All the world's a VAX, sure. Don't mind the next generation of hardware coming down the pike and the next wave of compiler optimizations.

      http://catb.org/jargon/html/V/vaxocentrism.html

_kst_ 13 years ago

"A data type is a number of bytes to the compiler."

The size of a type is just one of its many attributes. Even if, for example, "long", "float", and "void *" happen to have the same size, they're still very distinct types.

"Integer data types are defined in the limits.h file. Float data types are defined via macros in the floats.h file."

Integer and floating-point types are defined by the compiler, guided by the hardware and the ABI for the platform. <limits.h> and <float.h> document the characteristics of the predefined numeric types.

"A pointer doesn’t hold a memory address, it holds a number that represents a memory address."

Sure, and a floating-point object is ultimately just a collection of bits -- but that's hardly the best way to think about either of them. Integers and pointers (addresses) are logically very distinct things, even if they happen to have similar representations. For example, the addresses of two distinct variables have no defined relationship to each other (other than being unequal); just evaluating (&x < &y) has undefined behavior.
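
(A small illustration of that last point, with variable names of my own:)

  #include <stdio.h>

  int main(void) {
      int x = 0, y = 0;
      int a[10];
      /* (&x < &y) would be undefined: x and y are unrelated objects. */
      if (&a[2] < &a[7])          /* well-defined: same array object  */
          printf("ordered within the array\n");
      return x + y;
  }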

C lets you get away with a lot of type-unsafe stuff, particularly if you resort to pointer casts, but it's fundamentally much more strongly typed than the author seems to think it is.

dllthomas 13 years ago

1 int x = 10;

2 &x = 20; // this doesn't work

3 *(&x) = 20; // this does work

Why does line 2 &x not work but line 3 does? Because &x returns a pointer, a number representing a memory address. This is an important distinction. A pointer doesn’t hold a memory address, it holds a number that represents a memory address.

=======

No, that is not why. Note that the following does work:

int * x = 0;

and the following works, though it typically yields a warning:

int * x = 20;

Line 2 fails because & doesn't give back an l-value.
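
To make the distinction concrete, a small sketch of my own (not from the article): the address is a value you can store and dereference, but it is not itself assignable.

  #include <stdio.h>

  int main(void) {
      int x = 10;
      int *p = &x;   /* &x is a value (a pointer), so it can be stored       */
      *p = 20;       /* dereferencing that value gives back the l-value x    */
      /* &x = 20;       does not compile: &x is not an l-value               */
      printf("%d\n", x);   /* 20 */
      return 0;
  }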

asveikau 13 years ago

> Every variable is a starting memory address to the compiler.

Definitely not true. More like, "it will have an address, if you take the address with the & operator". Otherwise, the compiler is quite free to store locals in registers.

  • denniskubesOP 13 years ago

    > Yes I am being simplistic and yes certain data types have certain syntactic sugar but I have found this to be a good mental model

    As stated in the post.

    • mturmon 13 years ago

      I think you're going to keep getting comments on these ill-considered asides, but here is another problem:

      "In most assembly languages, data types don’t exist. You operate on bytes and offsets."

      This is just not true.

      Most assembly languages (I learned on PDP-11 assembler, which I remember best, but what I say is true of 68000 and x86 too) have a notion of a byte, but also integers of various word lengths, and floating point numbers.

      In fact, some registers are in effect designated as "pointers" for various kinds of conventional indirect addressing (the instruction pointer, the register holding the stack pointer, and others).

      In this sense, C is even closer to assembly than you indicate, because the data types are so analogous.

    • asveikau 13 years ago

      This reminds me of another comment I had: I personally find the phrase "syntactic sugar" irritating. As used, I don't feel like it adds anything to the blog post. IMO you could write nothing there and it'd make the exact same point.

      What exactly is the "syntactic sugar" that hides the idea that names can have addresses? Structs? Some specific kind of expression? Array index syntax? The names themselves?

    • halayli 13 years ago

      Simplicity here doesn't help. Variables aren't about how they are stored and where but more about what gets applied to them and how.

snorkel 13 years ago

Integers are the simple case, but you really haven't grasped the C memory model until you're comfortable handling text strings at any length, calling functions by pointers, working with structure pointers, and knowing when you need a pointer to a pointer. Part of it is understanding variable scope, local vs global vs stack frame memory. It's not rocket science, just takes practice, and the courage to segfault your way through it.
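
For anyone who wants concrete shapes to hang those on, a small sketch of my own (not from the article) with a function pointer, a struct pointer, and a pointer to a pointer:

  #include <stdio.h>

  struct point { int x, y; };

  /* a function to call through a pointer */
  static int add(int a, int b) { return a + b; }

  /* a pointer-to-pointer lets the callee change where the caller's pointer points */
  static void retarget(struct point **pp, struct point *target) { *pp = target; }

  int main(void) {
      int (*op)(int, int) = add;          /* function pointer   */
      struct point a = {1, 2}, b = {3, 4};
      struct point *cur = &a;             /* struct pointer     */
      retarget(&cur, &b);                 /* pointer to pointer */
      printf("%d %d %d\n", op(2, 3), cur->x, cur->y);   /* 5 3 4 */
      return 0;
  }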

denniskubesOP 13 years ago

What other mental models do people use to think about variables and memory? I would like to hear about them.

  • bcoates 13 years ago

    My mental model for C is symbol-referent diagrams like the first picture on http://www.exforsys.com/tutorials/c-language/c-pointers.html

    If you keep track of which boxes are and are not runtime memory cells, that should be enough to work out any particular C pointer problem except the pointer-array almost-equivalence mess.

    • denniskubesOP 13 years ago

      That is nice. I have seen different pointer diagrams but none that linked them to a memory list as that does. I like it.

  • ericbb 13 years ago

    My understanding of types took a big step forward when I read some of Robert Harper's stuff. In particular, the blog post, Dynamic Languages are Static Languages, and his book, Practical Foundations for Programming Languages. (The book is a tome and I've only read parts of it but it's very good).

    When it comes to understanding memory in C, another important aspect is understanding how linkers and loaders work. Also, it's good to know something about calling conventions.

  • georgemcbay 13 years ago

    Go: Basically the same as C, but with better specification for type sizes, more rigid rules about automatic type conversion, no pointer arithmetic (you can do it using the unsafe package but it is highly discouraged by both the language design and idiomatic usage) and a compiler which can do type inference.

    Also, when you get to manually allocated heap data (which this article doesn't cover) you don't have to worry about deallocations... usually.

  • wting 13 years ago

    In Haskell:

    Variables? What state? Everything is puuuuuuuuure.

    In Python:

    Everything is an object (numbers, true/false values, strings, etc), some are mutable and some are not. Variables are temporary labels on objects (think of them as hard links).

    In Rust/C++:

    There are various types of boxes / smart pointers (shared, unique, heap, etc), and unsafe / raw pointers should be avoided when possible.

    In C:

    Not every variable has a data type, e.g. void or function pointers.

    • _kst_ 13 years ago

      "Not every variable has a data type, e.g. void or function pointers."

      A void pointer has type "void *"; a function pointer also has some appropriate type.
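
      (Concretely, both declarations carry a type; a tiny example of my own:)

        void *vp;            /* has type: pointer to void                   */
        int (*fp)(void);     /* has type: pointer to function returning int */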

      Not every object has a type (e.g., a chunk of memory allocated by `malloc()`), but if "variable" means "object created by a declaration", then yes, every object has a type.

  • BruceIV 13 years ago

    As the lab TA for a first year course in Java, I don't know how many times I repeated "A variable is like a box: it has a label (the variable name), and it stores something." - it's a simplistic analogy, but not far wrong (at least for Java), and it helps the new programmers get the idea.

  • 1500100900 13 years ago

    Scopes of identifiers, linkages of identifiers, name-spaces of identifiers, storage durations of objects, types, and representations of types.

  • jimmaswell 13 years ago

    In higher-level languages I don't consciously think about how they're represented in memory.

16s 13 years ago

It sounds simple, but you'd be surprised how many programmers don't grok the fact that types/data have sizes (especially numeric types). For many tasks, this doesn't matter, but when it does matter, you need people who understand.

As an example, an IPv4 address is 32 bits. Don't convert it to a string and put it in a varchar(64) in your database when you are optimizing for space (I actually saw this once). And yes, the DB had an inet type, but no one knew how to use it, what it was or why it mattered.
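
To make it concrete, a small sketch of keeping the address in its natural 32-bit form (assuming a POSIX-ish environment for inet_pton):

  #include <arpa/inet.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      struct in_addr addr;
      if (inet_pton(AF_INET, "192.168.1.10", &addr) == 1) {
          uint32_t packed = addr.s_addr;   /* the whole address fits in 4 bytes */
          printf("stored in %zu bytes\n", sizeof packed);
      }
      return 0;
  }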

__david__ 13 years ago

My favorite bit of pointer code is one I had to write in the bootstrap code of an embedded processor:

    int r = ((int (*)())startAddress)(); // Wheeee!
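
Read inside out, the cast treats startAddress as the entry point of a function returning int and then calls it. A spelled-out equivalent of my own (the call_entry name and the uintptr_t parameter are illustrative, not from the original project):

    #include <stdint.h>

    typedef int (*entry_fn)(void);

    /* Spelled out: reinterpret a raw address as the entry point of a function
       returning int, then call it. The integer-to-function-pointer conversion
       is implementation-defined, which is the point of the original one-liner. */
    int call_entry(uintptr_t start_address) {
        entry_fn entry = (entry_fn)start_address;
        return entry();
    }
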
derleth 13 years ago

> C is memory with syntactic sugar and as such it is helpful to think of things in C as starting from memory.

http://en.wikipedia.org/wiki/Lie-to-children

> A lie-to-children, sometimes referred to as a Wittgenstein's ladder (see below), is an expression that describes the simplification of technical or difficult-to-understand material for consumption by children. The word "children" should not be taken literally, but as encompassing anyone in the process of learning about a given topic, regardless of age. [snip] Because life and its aspects can be extremely difficult to understand without experience, to present a full level of complexity to a student or child all at once can be overwhelming. Hence elementary explanations tend to be simple, concise, or simply "wrong" — but in a way that attempts to make the lesson more understandable.

OK, the very first sentence of this piece falls flat on its face when you begin to think about how a computer actually handles getting data into and out of the parts of the CPU that actually do the work of modifying data according to the opcodes in flight.

Specifically, C is meant to be a pleasant syntax for slinging data around a large, flat address space, where the assumption is that every part of the address space can be treated like any other, with no special consideration given to some locations being faster than others. (The 'register' keyword mucked with this a bit, but approximately nobody uses it anymore in new code. Just as well, because good compilers ignore it anyway; more below.)

This is horribly, hilariously wrong when you learn about cache hierarchy, and becomes even more wrong when you throw an OS implementing virtual memory and a disk cache into the picture. C doesn't have any way to refer to cache; you can't tell the compiler 'store this in cache' because that would break the abstraction C enforces.

So we loop back around: C enforces the abstraction for a good reason; namely, compilers are better than humans at scheduling memory use in practically every case, and in the few cases they aren't, you're doing something hardware-specific enough you'll need to drop into assembly anyway. This is also the reason the 'register' keyword is a no-op and has been for decades. Compilers can schedule registers better than humans because compilers know more about all of the optimizations in play, and when they can't, you'll have to drop into assembly anyway.

TL;DR: This is a basic introductory post. Nitpicking it for things that compilers take care of for you anyway is pointless.
