Do references and pointers in the new breed of programming languages still fall short of 1968 technology?
Press enter or click to view image in full size
A new trend has been brewing quietly for the past ten years and is now going mainstream: programming languages whose memory allocation and pointers are efficient, safe and smart. These are statically typed compilable imperative languages that want to give the power and joy of pointers back to the programmer. Their compilers track pointer usage as much as possible at compile time, aiming to eradicate unnecessary garbage collection and to help make dynamic memory allocation and deallocation efficient and deterministic.
The Rust language is the most vivid example here, and Go and the recent advances in C++ are also indicative of this trend.
With such developments finally comes the opportunity to revise and clean up the concepts of variables and lvalues, pointers and mutability.
“Let me put this into historical perspective, because I anticipate considerable skepticism for this idea.”
— Chris Messina, Hashtag Data Positive
Press enter or click to view image in full size
1967: Life before pointers
Say, let x be an integer variable. The tokens ‘x’ and ‘5’ thus have the same type.
A type defines a set of values and specifies what can be done with them. But there is one thing that can be done to x but cannot be done to 5: assignment. Type systems of 1960s were not rich enough to express the difference between assignable and non-assignable expressions, or, if you like, between mutable state and immutable values.
This led to the introduction of L-values and R-values, as described by Christopher Strachey in his paper ‘Fundamental Concepts in Programming Languages’, an excellent survey of the 1967 state of the art. An L-value represents a location, or “address-like object appropriate on the left of an assignment”. An R-value is “the contents-like object appropriate for the right”. A location (L-value) has a content (R-value), which can in general be changed. x has both L- and R-values, while 5 has only an R-value.
More complex expressions, such as array indexing (e.g., a[1]), can also have L-values; in this example, as long as a itself has an L-value.
1968: Reference types
One year later, Algol-68 pushes the concept of L-values into the type system, calling them references. For any type T (types are actually called modes in Algol-68, but I shall stick to the commonly used terminology) there is a type ref T, just like pointer types in C.
This eliminated type system inconsistency (when different actions are applicable to expressions of the same type, like x and 5) and the need for L- and R-values through two simple principles:
- Variables are references; and
- Implicit dereferencing is good.
Let’s expand on this.
Variables are references
(Or pointers. I’m going to use these two words interchangeably.)
When we say that x is a variable of type int, it means that x is bound to a value of type reference to int. This binding is final, and the storage for the value pointed to by x is allocated somewhere: on the stack, heap, .bss, in a register, or wherever the compiler decides.
This is how it is written using Algol-68's strict syntax, where each declaration binds an identifier to an expression of the declared type:
int five = 5; # five is a constant #
ref int var = loc int; # var is effectively an int variable,
or a ref int constant #(Keywords in Algol-68 are symbols in a different namespace than identifiers, and are often printed in bold in literature. There are various conventions for entering them in actual program text.)
Here loc int, called ‘local generator’, allocates local storage for a value of type int, whose lifetime is bounded by the current block. Thus, five is a name for the value 5 (i.e., a constant), and var is de-facto a variable. A ‘heap generator’ can be used for allocating storage of unlimited lifetime.
This way of declaring variables is rather wordy and unconventional, but is easily improved with a little syntactic sugar:
int var; # short for ref int var = loc int #Note the lack of ‘=’ differentiating it from the strict syntax.
This also extends to variable declaration with initialisation:
int initialised var := 239;which is short for
ref int initialised var = loc int;
initialised var := 239;(Algol-68 uses ‘:=’ for assignment and ‘=’ both for value binding in declarations and for equality comparison. It also allows spaces in identifiers.)
Now, since var is of type ref int, does it not require dereferencing in order to obtain its value?
var := int(var) + 88 + five; # increase var by 93 #That would be very inconvenient, so a simple and elegant solution was devised: to make dereferencing an implicit type conversion.
(Note that with loc int, no actual memory allocation need take place if var is only used for assignments and dereferencing. The compiler can allocate var in a register.)
Implicit dereferencing is good
We are used to implicit type conversions (aptly named coercions in Algol-68) such as integer promotion and pointer upcast (from derived to base class in OOP languages). Dereferencing can just be another one of such conversions.
It would not work well in C and C++, because they do pointer arithmetic using the same operators as with numbers. Implicit dereferencing would thus lead to ambiguity. But it most other languages it would feel natural.
Take for example methods, selectors, and index expressions in Go. The language specification is peppered with special cases and exceptions that allow using type *T instead of T and value *x instead of x. Making dereferencing an implicit conversion would have made the specification so much simpler and more elegant.
In fact, when programming in imperative languages, we use it all the time with pointers disguised as lvalues. Somehow it is considered acceptable to dereference an lvalue to an rvalue without an explicit operator. The concept of lvalue is a syntactic oddity, because semantically it is indistinguishable from a pointer.
Type inference
There is a small counter-intuitive effect that the above concepts have on type inference: a variable is inferred as a reference type in the absence of constraints. Algol-68 does not have type inference, but let’s see what would happen if we added the keyword let to declare an implicitly typed identifier:
int a := 239; # a is of type ref int #
let b := a; # b is of type ref ref int #
let c = a; # c is of type ref int #Or, having substituted the actual right hand side type for let and expanded the syntactic sugar of ‘:=’:
ref int a = loc int;
a := 239;
ref ref int b = loc ref int;
b := a;
ref int c = a;Here b is a variable, initialised to a’s address, and c is an alias for a, not for the value 239. This is perhaps not what we would expect, but is nevertheless logical. Using an explicit dereferencing operator like * would be a simple and practical solution.
Mutability
Imperative programs work with values and state. State is mutable, values are not.
One important function of pointers is they identify which portion of the program’s state is going to be read or written. When something is said to be mutable, it just means that a pointer to it is available.
Consider this fragment of a Rust program:
let program = "+ + * - /";
let mut accumulator = 0;It declares an immutable program and a mutable accumulator, whose types are inferred as str and i32, respectively. (Rust does not have the int type.) Behind the genuinely clever idea of making immutable the default, and requiring an explicit declaration of mutability, lies the same old concept of a variable as a name that acts as a value in some contexts, and a state reference (“lvalue”) in others. The language specification itself prescribes unnecessary tagging of identifiers with the ‘mutable’ flag, even though pointer and reference types exist in the language and effectively do the same job.
Rust also differentiates between two address-taking operators: ‘&’ and ‘&mut’, which obtain an immutable and a mutable reference. This is important due to their different borrowing semantics. However, a language without lvalues, as proposed here, does not have the address-taking operator: anything whose address can be taken is already a reference. And this reference is mutable. So how do we go about obtaining an immutable reference?
A better question is, do we actually rather mean ‘value’ most of the time when we speak of immutable references? Let’s cast optimisations aside, which are a compiler’s concern, and focus on what we want to say. It’s hard, because we are so used to passing compound values around by references/pointers purely for efficiency reasons, that the idea of the compiler deciding on the best technique for passing values of different sizes feels unusual. Semantically, the only difference between using immutable references and the values they point at is what happens when the stored value is replaced through another, mutable reference.
In a language like Rust, with compile-time tracking of reference lifetimes and ownership, there is a good potential for eradicating a lot of complexity through the use of concepts discussed here. However, a thorough treatment of this topic is outside the scope of this article.
The duality of pointers
Of course, representation of mutable state is not the only reason why we have pointers. There are also linked structures. These two roles probably have nothing in common. It is perfectly acceptable to have pointers that prohibit assignment to them—for example, in an immutable linked list.
Duality is well known to exist in the nature: for example, gravitational and inertial mass, or wave and particle. If pointers happen to play two roles equally well, there is no reason not to take advantage of it.
To copy or not to copy?
Pointers are often used in practice to avoid copying when passing large compound values as arguments to functions/procedures. This historic oddity has very little to do with what pointers conceptually are for.
When C passes arguments to a function, it copies them. So do, probably, many languages invented around that time, that use the pass-by-value semantics. Back then, it was totally logical. Since the caller and callee had different contexts, all values passed to the callee had to be copied into its context. The compiler had to do the copying anyway, so the language naturally allowed the programmer to take advantage of it. Formal parameters were considered as local variables, initialised by the caller.
This fact has had a profound effect on how we think about parameter passing in imperative languages. A lot of prominent languages developed after C still follow this copy-on-call semantics, because we’ve grown so used to it.
However, what started as an innocent little optimisation, soon backfired violently, as data types became more abstract, and programmers got encouraged to think in terms of what rather than how, and to say what you mean. Maintaining the copy-on-call semantics became expensive — in terms of performance, program complexity, and language complexity. In C++, a copy constructor would be invoked even when the called function never wanted to modify its argument and thus never needed a copy. To address this, C++ introduced copy elision and move semantics (along with rvalue references and move constructors), resulting in a significant rework of its compilers and standard library and bringing in a lot of complexity to the language.
Let’s recall the first point of this article: variables are references. Effectively, this means that in C and many other languages the types of formal and actual parameters are different. In
double sin(double x);the actual parameter type is double, but the formal parameter x is, in fact, a reference to double.
The approach taken in Algol-68 avoids this mix-up. Values are passed into procedures as values, and references as references:
proc max = (real a, b) real: if a > b then a else b fi;
proc increment = (ref int n): n +:= 1;In max, parameters a and b are bound to values of type real and cannot be assigned to. Increment, however, demonstrates what used to be called ‘passing by reference’ (before references became proper data types): n is assignable and is bound to a variable in the caller’s context, which gets incremented. Increment cannot be called with an actual parameter of 5.
If a procedure wants a modifiable copy of its argument, it must make it explicitly. Say what you mean and let the compiler optimise the extra copy away if it has already made one at entry.
Even when a parameter is a large compound value, the programmer shouldn’t be forced to optimise the call by passing an ‘immutable reference’ instead. The compiler (and the ABI specification) can handle this.
Indexing and member access
Values often happen to be compound, constructed with arrays and structures. Array indexing and structure member access operators are used to extract the values of individual elements/members of compound values. But we also want to be able to generate new compound values by replacing selected elements of an existing value.
One way to do that is to introduce an operator that takes a compound value, an index or member name, and a replacement element value as input, and returns the modified compound value. Let’s take an integer array for example, then this operator would be equivalent to this replace procedure:
proc replace = ([]int a, int index, int value) []int: ...which returns the same array as a, except that its index-th element is set to value. (Actual concrete syntax of this operator can be made prettier, but let’s stick to this procedure for now for the sake of clarity. Note that the procedure call syntax cannot be used with structures, as there is usually no way to pass a member name to a procedure.)
Alternatively, here is a trick employed by most, if not all, imperative languages: if a compound value is stored in a variable, then a piece of that variable can be modified through assignment. For example, if a is an int array variable, then instead of
a := replace(a, 5, 239);we can assign
a[5] := 239;A modern compiler would generate identical code in both cases, as long as replace is a built-in operator rather than an external library function. However, the latter idiom is traditional and widespread, and matches the way we think. Note that almost any language implementing it requires two different [] operators: one that produces element value from array value, and another one that makes element reference from array reference. (C does not differentiate between array values and references and thus gets away with just one [].)
With that in mind, it is now easy to see that implicit dereferencing works well in both cases. In the ‘replace’ example, the first parameter receives the implicitly dereferenced value of a, and then a is assigned the value returned. The traditional idiom works too, as long as implicit dereferencing stops as soon as the type becomes applicable in its context. In the above example, the type of a is ref []int, which is already a suitable type for one of the two [] operators, and a need not be dereferenced. This [] operator returns ref int, a reference to the storage location of the 5th element of a, which is assigned the new value.
The rule of using as few type conversions as possible is quite common and works in many other situations too.
Reference comparison
When variables are syntactically indistinguishable from references, a slight difficulty arises with the comparison operator:
int a := 5;
int b := 5;
if a = b then ...(We use Algol-68 notation, where comparison is denoted by just one equal sign.)
If the = operator could be applied to any type, the above comparison would yield false, because it would compare a and b’s references. This would be very counter-intuitive. Yet a = 5 would be true due to the implicit dereferencing.
Algol-68's solution was to specify = and ≠ for non-reference types only, and to introduce separate operators :=: and :≠: to compare references. There was still some subtlety; consider the following C code:
int a = 5;
int *x = &a;
int *y = &a;
x == y; // trueand this seemingly analogous Algol-68 fragment:
int a := 5;
ref int x := a; # x is a variable containing a's address #
ref int y := a; # y is a variable containing the same value as x #
x :=: y; # false because we are comparing ref ref ints #but in such cases, it is easy to track which reference is where and apply explicit dereferencing where applicable (in the above example, to either x or y).
Orthogonality
The concepts described here took their origin in the application of the orthogonality principle to the design of Algol-68:
“The number of independent primitive concepts has been minimized in order that the language be easy to describe, to learn, and to implement. On the other hand, these concepts have been applied “orthogonally” in order to maximize the expressive power of the language while trying to avoid deleterious superfluities.”
[ALGOL 68 Revised Report, section 0.1.2 Orthogonal design]
Programming languages of that time had a lot of such specific independent constructs occupying their small individual niches, which got replaced by more abstract concepts, applied uniformly, for example:
- assignment statement of the form a := b := c := expression, which assigned the same value to multiple variables, was absorbed into the concept that everything is an expression; statements such as assignment and conditional became expressions and could be used as part of other expressions;
- the difference between functions, which returned a value, and procedures, which did not, was erased by introducing the void type;
- three kinds of parameter passing: by value, by reference, and ‘by name’ (a rather esoteric Algol-60 mechanism), were replaced with always passing by value, and by extending the type system to incorporate reference and procedural types.
Procedures became values, and by following the orthogonality principle, they could be used like all other values: assigned, passed to and returned from procedure calls, and entered as literals anywhere in the program. This concept encompassed what would later be called ‘first-class functions’, ‘anonymous (lambda) functions’, and ‘closures’.
Conclusion
Programming language design is somewhat prone to occasional inelegance and excessive complexity. For example, in imperative languages, the role of pointers/references in representing mutable state, and implicit type conversions (including dereferencing) are two simple, yet elegant and easily applicable concepts that are often overlooked, but could make the language much cleaner by:
- eliminating the concepts of lvalues and rvalues and the linguistic complexity they bring with them;
- making the type system more consistent;
- getting rid of copy semantics in pass-by-value contexts;
- uniform treatment of Go-like methods and operators like ‘.’ and ‘->’ in C due to implicit dereferencing;
- following the say what you mean principle by not forcing programmers to use pointers and references where they mean values, and by allowing compilers and ABI authors to carry out optimisations on large compound values;
- treating the mutable vs immutable division as that of value vs state.
If these ideas are so good, why don’t we hear of them more often? Most likely, the reason is the role of C and its influence on programming languages.
Historically, some of the pioneering concepts devised in the late 1960s were considered too heavy for a systems programming language like C that required a simple, efficient, and straightforward implementation. C cherry-picked several useful constructs from Algol-68, such as the conditional operator ?: and the void type, but not the underlying concepts. Similarly, C++ began humbly as a thin layer on top of C and grew organically with time. By now, C and C++ continue to be very well known and influential languages in their domain, and have eclipsed their origins, concealing them from the view of language designers.
As we find ourselves in the quest for a better C, now can be a good time to revisit what was left out then.
Thanks to Cyril Schmidt and redditors on /r/rust for reading drafts of this article.