Storing data in tables is an integral part of data collection, curation, processing, analysis, and archiving. And for each of these different purposes, there are different opinions and best practices for how that data should be stored.
Many data table projects, like Pandas and data.table, focus on how to do complex manipulations of data tables. Because these operations need to be performed quickly and cheaply, it takes a lot of work to make them function properly.
But in my experience, so much of the data table software focuses on what to do with data when you have it. It doesn’t really focus on how you obtain data in the first place.
Both of these are important perspectives, but they are certainly different. I’d like to explore a bit of how the way you obtain data in the first place impacts how that data can (and maybe should) be represented in data tables. I’d also like to comment on how these problems are typically glossed over in modern data table software packages, and why.
An example scenario: clinical trials
I’ll start with a relatively simple example scenario. In cancer clinical trials, many studies are performed on a set of anywhere between 10 and 100 patients. Patients will be split into two (or more) groups where, for example, one group is given the treatment being studied, and the other is given a placebo 1.
The sad reality is that many of these patients will die while the study is ongoing, due to their disease. So how do you keep track of which patient is in which group, and what happened to that patient over time? How will you know how effective or safe this treatment is?
We’re going to use this example to tease out the different kinds of data we’re working with. You should see how this simple question of “how should we store our data?” quickly spirals out of control.
Value types
One way to figure out how a patient is progressing is to keep in regular contact with them. You check in to ask if there are any side effects, if their disease is progressing or improving, if there are major changes going on in their lives, book follow-up appointments for scans, etc.
Your table containing this data might look something like this:
| Patient ID | Treatment Arm | Height (m) | Weight (kg) | Date of Birth | Trial Enrollment Date | Date of Last Contact | Living Status at Date of Last Contact |
|---|---|---|---|---|---|---|---|
| 1 | Treated | 1.70 | 89 | 1970-12-13 | 2020-01-01 | 2020-06-01 | Alive |
| 2 | Control | 1.55 | 56 | 1985-05-05 | 2020-01-02 | 2020-06-01 | Dead |
| … | … | … | … | … | … | … | … |
Example clinical trial data table.

From this simple table, we can already see different values coming into play. Firstly, we have values of different types:

- the Patient IDs appear to be integers 2,
- the Treatment Arm each patient belongs to and their living statuses are both categorical variables that we can represent with strings,
- the Height and Weight are real numbers with a fractional part,
- and the various dates are specific days in history.
These types are represented by computers in different ways and we need to formalize this in some data structure so that a computer can store it, perform calculations on it, and display it to a person. Here is one way that we might want to store these values:
| Column(s) | Value types we can use to represent them |
|---|---|
| Patient ID | unsigned integers |
| Treatment Arm, Living Status at Date of Last Contact | categorical variable or enumerated type |
| Height, Weight | float |
| Date of Birth, Trial Enrollment Date, Date of Last Contact | ISO 8601 dates 3 |

One possible choice of how to represent the clinical trial data with types.
Without getting too technical here, this is just one choice of the many ways we could possibly represent this data. But it is pretty sensible, so let’s roll with it.
For most data table packages, we can store this metadata (i.e. data about the data itself) in the form of a data schema. This schema will look something like a set of columns (i.e. the variables of interest) and their corresponding types. This informs the data table software how to represent all this information in memory, how to efficiently retrieve subsets of this data, and how to perform calculations on this data. All of that is just great. And for many goals of collecting data, that’s all you’ll really need.
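To make the idea of a schema concrete, here is a minimal, hand-rolled sketch of what one could look like for the clinical trial table. The `ColumnType` names are placeholders of my own, not any package’s actual API:

```rust
// A minimal, hypothetical sketch of a data schema: column names paired
// with the value types chosen above. This mirrors no particular
// package's actual API.
#[derive(Debug, PartialEq)]
enum ColumnType {
    UnsignedInteger, // e.g. Patient ID
    Categorical,     // e.g. Treatment Arm, Living Status
    Float,           // e.g. Height, Weight
    Date,            // e.g. the ISO 8601 date columns
}

// The schema is just an ordered list of (column name, type) pairs.
fn schema() -> Vec<(&'static str, ColumnType)> {
    vec![
        ("Patient ID", ColumnType::UnsignedInteger),
        ("Treatment Arm", ColumnType::Categorical),
        ("Height (m)", ColumnType::Float),
        ("Weight (kg)", ColumnType::Float),
        ("Date of Birth", ColumnType::Date),
        ("Trial Enrollment Date", ColumnType::Date),
        ("Date of Last Contact", ColumnType::Date),
        ("Living Status at Date of Last Contact", ColumnType::Categorical),
    ]
}

fn main() {
    // A data table package would use this to lay out memory and
    // validate incoming rows.
    for (name, ty) in schema() {
        println!("{name}: {ty:?}");
    }
}
```

A real package’s schema carries more than this (nullability, encodings, and so on), but the shape is the same: variables of interest paired with their types.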
But if you want to do more than just collect data (i.e. you want to make it useful), then some more work is required. Numbers on their own are often meaningless. But they can gain meaning by being compared to other numbers, and you can only compare numbers if they have the same units, so let’s talk about those next.
Value units
As stated above, we can use integers to represent each patient. But it doesn’t really matter what integer we assign to any individual patient. It just matters that we give each unique patient a unique number so we can keep track of the patient. The number is a placeholder, but the value itself doesn’t mean anything.
Contrast that with the Height and Weight columns, where the values themselves matter, and they each have units associated with them.
We could say that the Patient ID is a unitless number, whereas the Height and Weight columns have units of meters and kilograms, respectively.
These units are important and prevent us from doing silly things with arbitrary numbers 4.
So not only do we have values whose numerical value is meaningless, we also have values whose numerical value is meaningful, and whose meaning only makes sense in the context of a given unit. This is the first example of something I find lacking from data table software packages. So much of the work behind these packages is dedicated to keeping track of the different types of each column, because that’s where all the bottlenecks and computational problems show up. But units are often neglected, despite being just as crucial as the value itself. It is often left to the user to ensure that the numerical values in each column are stored in a sensible unit, and that the unit is the same for each row.
There are too many good reasons why this is the case to get into here. But think for a moment how easy this is to mess up if this is simply left to the user.
Most of the world works in the metric system, with the US being a notable exception. But Canada and many other Commonwealth countries live in this weird limbo where metric is used for “formal” things, and imperial units are still used for “informal” things. Values that people often mention could be given in metric or imperial units, usually depending on what they learned in school growing up. Canadians often refer to temperature in degrees Celsius, but older generations still use degrees Fahrenheit. Most people will list their height in feet and inches and weight in pounds. But many government or medical forms will list those values in (centi)meters and kilograms, respectively.
So if you’re working on the clinical trial and need to record all this information, you can simply ask someone these questions. But you’ll get responses in every imaginable unit, so you can’t just record what you hear. If you’re only working with 10 people, converting those values to the same unit is easy enough. But what about 100 people? What about 1 000? What about 1 000 000? And if you’re looking at this table after the data is collected, how sure are you that every number is recorded correctly? How likely is it that someone forgot to convert 5’ 10” into meters and now it looks like someone is inhumanly tall at 5.10 m?
These types of errors are usually identifiable in small datasets. But in larger datasets, they’re tougher to spot unless you go looking for them. You might not even see these errors until you find that the average height of your study group is > 2 m tall and you start wondering if your study is entirely comprised of professional basketball players.
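To see how easily this goes wrong, here is a small, hypothetical conversion helper. The point is that the conversion has to live somewhere, and if it only lives in someone’s head, it will eventually be skipped:

```rust
/// Convert a height given in feet and inches to meters.
/// An inch is defined as exactly 0.0254 m.
fn feet_inches_to_m(feet: f64, inches: f64) -> f64 {
    (feet * 12.0 + inches) * 0.0254
}

fn main() {
    let correct = feet_inches_to_m(5.0, 10.0);
    // The mistake: 5' 10" transcribed directly into a meters column.
    let forgot_to_convert = 5.10;
    println!("converted: {correct:.3} m"); // 1.778 m
    println!("transcribed as-is: {forgot_to_convert} m (inhumanly tall)");
}
```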
Exact values
But more than just types and units, there are different kinds of values in this table, too. And for lack of a better word (and because words like type and unit are already taken), I’ll just use the word kind to specify it.
The Patient ID column contains numbers that the clinicians have assigned to the patient.
These are exact values because there’s no uncertainty about them.
The same goes for the Treatment Arm.
But Height and Weight are clearly not assigned by people conducting the study.
Those values were obtained in some way.
How were these values obtained? Were the patients asked for this data, or did the clinicians measure them? Are there different amounts of uncertainty in those values based on how they were obtained? How much precision matters for the purposes of this clinical trial?
The distinguishing feature I want to highlight here is that some values are exact, whereas some have some degree of uncertainty around them. It’s worth noting that this is a different concern than their type that computers use to represent the values. A patient’s height might be 1.70 m, +/- 0.5 cm because of the notches on the ruler used to measure them, whereas their date of birth is a fixed day we can treat as exact 5. There is more to discuss here about uncertainty, so let’s put a pin in this idea and come back to it later.
So far we’ve covered:
1. the types used to represent the value in a format the computer can store and use,
2. the units of a particular value, and
3. the uncertainty or exactness of a particular value.
Much has been written in data analysis and computer science about #1. #2 tends to be left as an afterthought for good reason 6, and #3 is rarely represented in data tables at all, despite being so critical to proper scientific analysis. The uncertainty discussion only seems to start in statistics and probability classes, and rarely carries over into the actual representation of the data itself.
And as crucial as these three properties of the data are, there is still more to consider. The next scenario to address is that of unknown values.
Unknown values
Missing values
Returning again to our patients in this clinical trial, we want to keep in regular contact with them to see how they’re doing. But subjects in real life clinical trials often cannot be contacted for various reasons. Maybe the subject died, or changed their phone number, or moved to a new residence. Maybe their work schedule doesn’t allow them to be contacted during hours that staff try to contact them. Maybe the person is much sicker than before and decides they no longer want to participate in the trial at all, refusing to answer any questions. There are a whole host of reasons why, but what matters is that you don’t have the information you need.
If we denote whether a patient, \(i\), is alive at some arbitrary time, \(t\), by the function \(S_i(t)\), what is its value at a given time?
Patient 1 was alive as of 2020-06-01 and they were enrolled in the trial on 2020-01-01, so they were definitely alive during that time interval. But what about after 2020-06-01?
In this case, our value for any time after 2020-06-01 is missing; the value exists in reality (the patient could be Alive or Dead 7), but we need to do some work to find it.
The technical terminology for this type of data is censoring, and there is much work around this topic because in clinical trials, this type of data cannot be ignored.
Our function, \(S_1(t)\) would look something like this:
\[S_1(t) = \begin{cases} \textrm{alive} & t \in [\text{2020-01-01}, \text{2020-06-01}] \\ \textrm{missing} & t \in (\text{2020-06-01}, \infty) \end{cases}\]

Often, in different programming languages, a missing value could be represented in a few ways, depending on the data engineer or analyst. In R, there are lots of ways to represent different missing-like values.
> is(NA)
[1] "logical" "vector"
> is(NA_integer_)
[1] "integer" "double" "numeric" "vector" "data.frameRowLabels"
> is(NA_real_)
[1] "numeric" "vector"
> is(NA_complex_)
[1] "complex" "vector"
> is(NA_character_)
[1] "character" "vector" "data.frameRowLabels" "SuperClassMethod"
> is(NULL)
[1] "NULL" "OptionalFunction" "optionalMethod"
> is(NaN)
[1] "numeric" "vector"
And this ignores the common shortcuts like using an empty string "" for a missing String, or a 0 or -1 for a missing numeric value that must be positive.
In isolated cases, the difference between these different representations of the idea of missing values is fine. But in large datasets that contain millions, or even just hundreds of entries, these little collisions and confusions about what means what do make quite the difference.
If your clinical trial is made up of 10 people in their 80s, but someone forgot to record the date of birth for a few patients and their ages get listed at 0, suddenly the average age of the group could drop into the 50s. This simple mistake would give some other person a misrepresentation of what was actually going on in this trial. This may be a harmless example, but I’ve made many of these dumb mistakes myself and am usually only able to track these problems down after wasting many hours of my time 8.
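That failure mode can be sketched in a few lines, assuming (hypothetically) that the missing dates of birth were coded as an age of 0:

```rust
/// Mean age when missing values are (wrongly) coded as 0.
fn mean_with_sentinel(ages: &[f64]) -> f64 {
    ages.iter().sum::<f64>() / ages.len() as f64
}

/// Mean age when missing values are represented explicitly and skipped.
fn mean_skipping_missing(ages: &[Option<f64>]) -> Option<f64> {
    let known: Vec<f64> = ages.iter().filter_map(|&a| a).collect();
    if known.is_empty() {
        None
    } else {
        Some(known.iter().sum::<f64>() / known.len() as f64)
    }
}

fn main() {
    // Seven patients in their 80s, plus three missing dates of birth
    // that someone coded as age 0.
    let coded = [82.0, 85.0, 88.0, 80.0, 84.0, 0.0, 0.0, 86.0, 83.0, 0.0];
    let honest: Vec<Option<f64>> = coded
        .iter()
        .map(|&a| if a == 0.0 { None } else { Some(a) })
        .collect();
    println!("mean with sentinel zeros: {}", mean_with_sentinel(&coded)); // 58.8
    println!("mean skipping missing: {:?}", mean_skipping_missing(&honest)); // Some(84.0)
}
```

The sentinel version silently reports a study group in their late 50s; the explicit version either reports the honest mean of the known ages or refuses to produce a number at all.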
Unknowable values
Let’s now say that you’ve tried to contact Patient 1 to see if they’re alive after 2020-06-01. Imagine that you weren’t able to contact the patient directly, but were eventually able to contact their emergency contact or some next-of-kin. If that next-of-kin refuses to tell you whether the patient is alive or dead, what should you do to store that value in the data?
The data is missing, in the sense that the value exists but you don’t know it. But now you’ve put in the work to find out and are refused an answer. It doesn’t feel right to list the value as missing, so I think it’s worth thinking about a new kind of value: an unknowable value.
We can define an unknowable value as a value that exists in reality, but no amount of work will allow us to figure it out. The difference between missing and unknowable is whether we could ever record the value at all. A missing value is one that we are uncertain of but might be able to find out, whereas an unknowable value is one that we are certain we will never know. This distinction can give someone filling out the data confidence that effort has been made to fill in a value. Think of it like a “this entry is intentionally left blank” in your table.
For both of these cases, missing and unknowable, I have yet to come across a data table package that handles this distinction. But that’s not exactly their fault. The programming languages themselves rarely have enough subtlety in their non-numeric types to make this distinction. So if the language you’re working in doesn’t have a type you can use to make this distinction, how is software in that language supposed to do it?
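As a sketch of what such a distinction could look like if a language or library did support it, here is a hand-rolled Rust enum. The names `Absent`, `Missing`, and `Unknowable` are my own invention, not from any existing package:

```rust
// A sketch of a richer "absent value" than a single NA/None.
// These names are my own invention, not from any existing library.
#[derive(Debug, PartialEq)]
enum Absent {
    /// The value exists in reality, and more work might reveal it.
    Missing,
    /// The value exists in reality, but we are certain we will never know it.
    Unknowable,
}

#[derive(Debug, PartialEq)]
enum LivingStatus {
    Alive,
    Dead,
    Absent(Absent),
}

fn main() {
    // Patient 1 after 2020-06-01, before any follow-up work: missing.
    let before_follow_up = LivingStatus::Absent(Absent::Missing);
    // After the next-of-kin refuses to answer: unknowable.
    let after_refusal = LivingStatus::Absent(Absent::Unknowable);
    // A plain NA/None would collapse these two distinct states into one.
    assert_ne!(before_follow_up, after_refusal);
    println!("{before_follow_up:?} vs {after_refusal:?}");
}
```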
Not applicable values
In some clinical trials, patients have already undergone the standard-of-care therapy before being put on drugs undergoing trials. This could happen if the standard therapy fails or if the cancer has recurred from an earlier diagnosis many years before. Here, it’s useful for scientists to know what previous therapy (if any) a patient has had before analyzing their response. If we include that data in our table above, our table might expand to something like this:
| Patient ID | Treatment Arm | … | Previous Therapy | Time Since Previous Therapy (years) |
|---|---|---|---|---|
| 1 | Treated | … | Yes | 2 |
| 2 | Control | … | No | ? |
| … | … | … | … | … |
What value should go in the ? cell?
We know that Patient 2 has not had previous therapy, so a missing or unknowable value isn’t appropriate here.
But an explicit value isn’t appropriate either.
We could follow a common convention that says -1 means no therapy, but like above, this can cause some confusion if we start doing calculations with this data and we aren’t careful to check for -1 beforehand.
Since none of the previous value kinds seems appropriate here, we can make use of a NA or not applicable value in this entry.
Not applicable applies where a value does not exist because it is not relevant.
This differs from an exact or non-exact value because there is no relevant value to be had.
It also differs from missing and unknowable because it is available and we know that it’s not applicable.
NA values are an idea that data table packages actually do handle well.
This is such a common problem that you can’t really ignore it, and both Pandas and data.table can do this well.
As shown above, R has multiple different NA values, which makes working with them relatively easy.
Non-exact values
“Not a value” values
Everything I’ve written above relates to storing values themselves. However, the original stored values are often just the first step towards getting at the thing we actually care about. Scientists almost always use these values for calculations of some sort. But calculations with values can be tricky and fraught with problems, like integer overflow or underflow, accidental division by zero, and more.
We could ignore these errors in calculations and deal with them later. But that’s a mess that requires lots of cleaning up after the fact, if you even spot the problem at all. So it would be better to know that these problems exist and be prepared to handle them safely in the first place.
Like the value kinds introduced previously, an error doesn’t really fit into any of the kinds that have been raised so far. The value is relevant, so it’s not not applicable. We are in possession of the values required to calculate this new value, so it’s not missing or unknowable. But the value isn’t known exactly, so it can’t be exact. We should be in possession of the value, but we aren’t due to some technical error resulting from the representation of the value or its manipulation, not from the value itself.
Because none of the previous value kinds satisfy this state, we can introduce a new kind that I’ll denote as not a value. What’s new about this not a value is that it also has an implied error associated with it. This error is the thing that leads to the value not being exact. This associated error will have implications for how to represent this value, so we can put a pin in this idea and come back to it later.
In Python and R there is a NaN value.
Data table packages make use of this value, which is great.
But what I do find lacking is that associated error I mentioned.
It is almost always useful to know how the NaN was generated, especially because integer overflows are a very different source of error than division by zero.
This is especially relevant when working with probability distributions, where so many values are close to 0 that you often have to work in log-space to avoid entire classes of problems.
And like the distinction between missing and unknowable values, I have yet to come across software that attaches the error to a NaN when it is generated.
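As a sketch of what attaching the error could look like, here is a hand-rolled Rust version that uses `Result` instead of a bare `NaN`; the `CalcError` type is my own invention:

```rust
// A sketch of carrying the *cause* of a failed calculation alongside
// the failure, instead of collapsing everything into one bare NaN.
// The CalcError type is my own invention.
#[derive(Debug, PartialEq)]
enum CalcError {
    DivisionByZero,
    IntegerOverflow,
}

/// Division that records *why* no value was produced.
fn checked_div(num: f64, den: f64) -> Result<f64, CalcError> {
    if den == 0.0 {
        Err(CalcError::DivisionByZero)
    } else {
        Ok(num / den)
    }
}

/// Integer addition that records overflow as its own, distinct cause.
fn checked_add(a: i32, b: i32) -> Result<i32, CalcError> {
    a.checked_add(b).ok_or(CalcError::IntegerOverflow)
}

fn main() {
    // With bare NaNs these two failures would be indistinguishable.
    println!("{:?}", checked_div(1.0, 0.0));    // Err(DivisionByZero)
    println!("{:?}", checked_add(i32::MAX, 1)); // Err(IntegerOverflow)
}
```

The downstream analyst can then distinguish a division by zero from an overflow, instead of inheriting an undifferentiated hole in the table.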
Statistical values
We previously put a pin in the discussion about non-exact values that have some kind of associated uncertainty with it. Let’s remove that pin and talk about that uncertainty some more.
Take someone’s height for example. Let’s say you measure someone and find out they are 150 cm tall. How sure are you that they’re not 151 cm? Or 149 cm?
There will be markings on the ruler (or whatever tool you used) to denote each nominal length. So when you say the person is 150 cm, it’s because their height is closer to the 150 cm marker than either the 149 or 151 cm markers. But that still leaves you with some uncertainty. How do you know the person isn’t 150.5 cm tall? Or 149.9 cm? With nominal markings like those on a ruler, our uncertainty is bounded by +/- 0.5 cm, because we can see what the closest marking is.
This uncertainty is also distinct from, but related to, the units. Maybe a doctor measured the patient’s height in a country that uses feet and inches instead of meters. The uncertainty around that measurement could be +/- 0.5 inches instead of 0.5 cm. The explicit numerical value of the uncertainty is the same, but those uncertainties are very different from each other.
An always-important question around the uncertainty is also how much does this uncertainty matter? If I’m just measuring someone’s height because I can, then keeping uncertainties to +/- 0.5 cm is probably fine. However, if I’m measuring the volume of someone’s tumour over time to see how it responds to treatment, that person’s life depends on how accurate my measurements are. I should invest in high quality instruments and learn how to measure accurately in a case with extreme consequences like this.
Without going too far down the metrology rabbit hole, in general, we will have some degree of uncertainty in measured values. Because they have some uncertainty associated with them, we cannot say that they are exact. We must have some new way of representing a value with some uncertainty around it.
To do this, we can introduce a statistical value kind. A statistical value is a non-exact value with some uncertainty distribution around it. In the case of someone’s height, this distribution could be something like a uniform distribution in the range (149.5, 150.5). For something more complicated, like measuring a tumour’s volume using MRI scans, this distribution is going to be more complicated.
Let’s say that I’m trying to measure the volume of a rectangular box for some construction project. If I measure the length, width, and height to be 6 cm, 5 cm, and 8 cm, respectively, and each measurement has an associated error of +/- 0.5 cm, then the true volume of that box could be anywhere between 185.625 cm³ (5.5 × 4.5 × 7.5) and 303.875 cm³ (6.5 × 5.5 × 8.5). An uncertainty of +/- 0.5 cm in the linear distances translates into a +/- 59.125 cm³ uncertainty in the volume.
Calculations and measurements are two of many technical factors through which uncertainty can creep into your data. While we don’t always need to explain exactly why a measurement is uncertain, we do need to keep track of it. Having a statistical value kind can help with this error propagation and tracking.
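The box example above can be sketched with simple interval arithmetic. The `Measured` type here is a hypothetical stand-in for a statistical value with a uniform uncertainty:

```rust
// A hypothetical measured length with a symmetric uncertainty,
// both in centimeters.
#[derive(Clone, Copy)]
struct Measured {
    value: f64,
    err: f64,
}

impl Measured {
    fn lo(self) -> f64 {
        self.value - self.err
    }
    fn hi(self) -> f64 {
        self.value + self.err
    }
}

/// Propagate the measurement uncertainty through the volume calculation
/// with interval arithmetic (valid here because all lengths are positive).
fn volume_bounds(l: Measured, w: Measured, h: Measured) -> (f64, f64) {
    (l.lo() * w.lo() * h.lo(), l.hi() * w.hi() * h.hi())
}

fn main() {
    let l = Measured { value: 6.0, err: 0.5 };
    let w = Measured { value: 5.0, err: 0.5 };
    let h = Measured { value: 8.0, err: 0.5 };
    let (lo, hi) = volume_bounds(l, w, h);
    // 185.625 to 303.875 cm^3 around the nominal 240 cm^3
    println!("volume is between {lo} and {hi} cm^3");
}
```

Note that none of this bookkeeping happens unless someone writes it by hand; the point of a statistical value kind is that the table would carry it for you.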
Statistical kinds are another feature I find almost entirely lacking from data table packages. Again, this is arguably not the point of these packages in the first place. But requiring that data tables embed this uncertainty with each statistical value is not necessarily a good thing either. It could add so much overhead to storing a single value that people would be unlikely to use the software at all. There are many real, important discussions about how you could even do something like this at a technical level and a user experience level.
There is clearly a need for representing these kinds of values. But there is a conundrum about how to do it that makes actually doing this at all very hard.
Synthesizing properties of different data kinds
Let’s summarize the different questions we’ve asked about our data and how that changes what kinds of value we have.
| Question about the value | Value kind |
|---|---|
| Is a value even relevant here? | Not applicable |
| Does the value exist, but no amount of work will ever reveal it? | Unknowable |
| Does the value exist, but we haven’t found it yet? | Missing |
| Do we know the value with no uncertainty? | Exact |
| Did a technical error prevent the value from being produced? | Not a value |
| Do we know the value only up to some uncertainty? | Statistical |

Summary of the questions we can ask about a value and the corresponding value kinds.
Hopefully through the clinical trial example you can see how each of these questions arises pretty easily. And hopefully you can also see that each of these questions, despite being very simple, has important consequences for how data is stored.
So what can we do about it? Is there a way that we can actually use these questions, together with value types, units, and kinds, to actually store data in a more robust way? Let’s try to do exactly that.
Defining a generic value that includes kinds
We can summarize a lot of the above ideas and put them into a single data structure. To do this, I’m going to borrow Rust notation because of how expressive and succinct it is.
enum Value<T, R, U, E, D> {
    NotApplicable(Option<R>),
    Unknowable(Option<R>),
    Missing(Option<R>),
    Exact(ExactValue<T, U>),
    NotAValue(E),
    Statistic(StatisticalValue<D, T, U>),
}
Here’s what each part of this means:
- A `Value` is not simply a collection of bytes stored in some given type. It’s a generic enumerated type that must be one of a few distinct variants.
- Each variant is one of the kinds I have mentioned above: `NotApplicable`, `Unknowable`, `Missing`, `Exact`, `NotAValue`, or a `Statistic`.
- For the `Exact` and `Statistic` variants, the type of the value is denoted by the generic parameter `T`. This could be a `u8`, `f64`, `String`, `enum`, or any other type.
- For the `Exact` and `Statistic` variants, the unit of the value is denoted by the generic parameter `U`. This could be a meter, a kilogram, or any other relevant unit for the application area.
- For the `NotApplicable`, `Unknowable`, and `Missing` variants, there is a generic `R` type that refers to the reason why that value is not applicable, unknowable, or missing. It is embedded within an `Option`, because there may or may not be an actual reason.
- For the `NotAValue` variant, there is a generic `E` type that refers to the error associated with the `NotAValue`. This would be something like a `DivisionByZero` error or some `panic!()`, say, in Rust terminology. 9
- For the `Statistic` variant, there is a generic `D` type that refers to the distribution of this value’s uncertainty. This could be a Gaussian distribution with a given mean and variance, say.
- The `ExactValue` type is its own generic data structure that will hold a specific quantity of the type `T` with unit `U`, ensuring that these two generic parameters are always coupled together. Operations on `ExactValue`s will need to ensure that they share the same unit before proceeding.
- The `StatisticalValue` type is, again, its own generic data structure that will hold not a specific quantity, but a distribution `D` with an associated type `T` and unit `U`. This distribution will have its own set of parameters, based on the causal graph that generated this data point.
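To make this less abstract, here is a trimmed-down, concrete instantiation of the enum. Every type besides `Value`’s variants is a placeholder of my own choosing, standing in for the generic `T`, `R`, `U`, `E`, and `D` parameters:

```rust
// A trimmed-down, concrete instantiation of the Value enum, with
// simple stand-ins for the generic parameters. All names besides
// Value's variants are placeholders for illustration.
#[derive(Debug)]
enum Unit {
    Meter,
    Kilogram,
}

#[derive(Debug)]
enum Distribution {
    Uniform { lo: f64, hi: f64 },
}

#[derive(Debug)]
struct ExactValue {
    quantity: f64,
    unit: Unit,
}

#[derive(Debug)]
struct StatisticalValue {
    distribution: Distribution,
    unit: Unit,
}

#[derive(Debug)]
enum Value {
    NotApplicable(Option<String>), // R = String: an optional reason
    Unknowable(Option<String>),
    Missing(Option<String>),
    Exact(ExactValue),
    NotAValue(String), // E = String: a description of the error
    Statistic(StatisticalValue),
}

fn main() {
    // A few cells from the clinical trial table, one Value each.
    let weight = Value::Exact(ExactValue { quantity: 89.0, unit: Unit::Kilogram });
    let height = Value::Statistic(StatisticalValue {
        distribution: Distribution::Uniform { lo: 1.695, hi: 1.705 },
        unit: Unit::Meter,
    });
    let time_since_therapy = Value::NotApplicable(Some("no previous therapy".to_string()));
    for v in [weight, height, time_since_therapy] {
        println!("{v:?}");
    }
}
```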
For all the complexity I’ve described above, this actually seems like a really modest and flexible solution that addresses many of the major concerns I have with features that are missing from current data table packages.
Each element in the table is a Value and each column will share the same type T, which is similar to how many data table packages work right now.
Values can be stored in different units U, but any calculation between two or more values will need to make the units equivalent before proceeding (or produce some error of type E if that is not possible).
The units might be different within a column. But using two different units, say U1 and U2, might help in finding space-/memory-efficient representations of quantities in the same column (e.g. so all Values in a column share the same number of bits for quick indexing). These units might also help for performing efficient lazy calculations (e.g. if one Value has kilometers as its unit and another has nanometers, then you might be able to skip over all the Values in nanometers if you’re summing them).
You could simplify things, theoretically, by observing that there is no real difference between a StatisticalValue and an ExactValue.
You could represent the ExactValue by a StatisticalValue whose distribution is the Dirac delta distribution centred on the exact quantity, for example.
But for the purposes of clarity and efficient computation, it might be easier to store the actual quantity than some more abstract representation of it.
In any case, I think there are enough interesting computer science, user experience, and data analysis questions that could be explored by adapting this data structure for data table software packages.
Rust is clearly a language that can express such a low-level representation of each Value and have the potential to build a very efficient library.
And, inspired by Rust’s origins of eliminating entire sets of errors, this flavour of a Value data structure could potentially also eliminate entire sets of problems for storing and using data stored in tables, if implemented well.
I’m just hypothesizing here, but I do think that this idea has some value 10. This idea could fail spectacularly, though, if it turns out that storing data like this is just so inefficient that it makes it not worth the effort in the first place. But given the number of mistakes that I have personally made, or that I have seen others make, with poorly formatted and validated data tables, I could see this being useful.
Conclusions
Data storage, processing, and analysis is not easy. The last few decades of computer science and software engineering have made huge strides in how we can actually do useful things with computers in almost every aspect of our lives. But that doesn’t mean that there’s no more room for growth in the fundamentals of how we store and process data.
I’ve described above some general shortcomings of current data table software packages that stem from how the data was obtained in the first place and how those origins can be represented in the data itself. I’ve also proposed a very minor solution to those problems. I hope you can see how flexible it is and how it encapsulates everything I’ve talked about up until this point.
I have left a few points out of this discussion, like imputed values. Imputed values are a tricky thing to handle, because they are usually dependent on other values in the dataset. This breaks some assumptions about how data is usually collected and analyzed, even if it doesn’t quite interfere with the types, units and kinds that I’ve talked about, above. This topic is complicated enough that I’ll just link to a recent talk about missing and imputed data from a lecture series from The Alan Turing Institute.
Sadly, I don’t have some big announcement about creating some data table software that makes use of this Value enum.
Nor do I have any idea about how using this data structure would or wouldn’t complicate how calculations are performed.
Those are a bit beyond my area of expertise.
But I hope no one tells me that this single data structure is too complicated for people to use.
If someone does, I’ll just point them towards the documentation for the Apache Arrow, Apache Parquet, PostgreSQL, and SQLite data formats and ask them what is simpler about any of these?
Overall, thinking about this proposed Value data structure has clarified some misunderstandings that I had about datasets. And mulling over potential solutions here, and how well they scale across fields of study, has changed my approach a bit to interacting with data, in general. So hopefully the ideas here may be useful to other people, too.