The Good, the Bad and the GUI
haskell.org>There is no excuse for a spreadsheet quietly taking a never-assigned cell as zero, but indeed it does. WHAT THE HELL WERE THESE PEOPLE SMOKING?
I think the head scratching about Excel's (and probably other spreadsheets) behavior on empty cells is misguided. Yes, if you have a computer scientist mindset then it might seem very mathematically satisfying to have strict rigorous logic around empty cells but we're getting into 3-valued NULL handling like databases. That type of extra logic checking is misplaced considering the origins of spreadsheets (e.g. Dan Bricklin's VisiCalc) as a business computation tool instead of a GUI for an E.F. Codd relational db.
If a particular power user finds the typical "null" cell behavior to be "wrong", he can add formulas to his cells such as:
=IF(ISBLANK(C3), NA(), C3)
... such that #NA# (Excel's quasi "null") propagates throughout any SUM() and other calculations.
However, imagine if the situation was reversed and we had an alternate universe where Excel propagated empty cells as #NA# instead of zero to satisfy the logic of computer scientists. We'd inevitably have:
1) Microsoft adds a button wizard on the toolbar to bulk fill "null" cells with 0.
2) User forums with power-users trading VBA macros to fill in zeros of empty cells. "Ugh! How do I get rid of all these useless #NA# littering my spreadsheet???"
3) Microsoft adds an user option checkbox to "treat empty cells as zero instead of #NA#" -- which everyone ends up activating as a defacto setting. This cancels out the strict null handling the computer scientists were imposing on the spreadsheet!
I contend this alternate universe scenario is much worse for business users of Excel. In other words, you can't "force fit" the concept of NULL as a default where it is not natural. The end users will just work their way around it.
Yeah, hate to break it to the OP, because God knows I speak against the use of spreadsheets whenever possible...but the most common use-case for a spreadsheet is not as part of a reproducible, automated data pipeline. Many times, it's hand-entered data entry, with the goal of making one of the pre-baked Excel visualizations. For users who want type-checking, there's Access, which is more attuned to handling non-trivial datasets.
> Many times, it's hand-entered data entry
Yes, many companies pays good money for people to update those Excel sheets.
Me? I see almost every single Excel sheet that is updated over more than a week as a possible sale. But I'm not replacing Excel spreadsheets full time, at least not yet : )
It's just a good thing no one ever makes important decisions based on spreadsheets.
I don't understand why its usual to assume that the only two choices are
Why not make the calculation while adding a visual cue (e.g. a yellow or red indicator in the corner) that marks the field as tainted? Now the users get to decide whether they want to investigate or ignore that.* show N/A everywhere * silently ignore sources of potential errors.You mean like Excel actually does? http://i.imgur.com/kMPKnqO.png
Does that propagate, or is it only one hop?
Yes.
> if you have a computer scientist mindset...
Even things written by computer scientists sometimes follow the empty-is-zero model for convenience. awk's arrays behave very similarly to Excel's cell grid in this particular manner: they are conceptually of infinite size, and any element not yet set is implicitly 0. Hence you can calculate word-counts with something along the lines of count[$i]++ without first checking whether count[$i] exists; and you can also sum the counts of a predefined list of words, without throwing in if-count[$i]-is-defined checks.
You can't accuse Alfred Aho (the 'A' in 'awk') of not being a computer scientist. Yet in awk, nonexistent varaibles serve as zero or empty strings, and strings that look like numbers can be used for arithmetic. This is a matter of pragmatism which depends on the goal of the tool and the target users.
No doubt, it poorly serves users who want to create huge, complicated spreadsheets, or large programs.
Spreadsheets could address the problem by having this as a cell property. Suppose you could highlight a rectangular region of the spreadsheet and mark the whole region as having "strict initialization": any cells in the region which are empty will trigger a diagnostic if they are accessed.
What if instead of "propagating empty cells as #NA everywhere", instead the functions operating on cell values returned #NA by default if an empty cell was present in the input, but also included an optional parameter which was either a true/false flag to treat the value as the appropriate identity (makes sense for things like SUM, etc.) or which provides a default to use in the case of an empty cell (which may make more sense to use for functions that aren't the append function of a monoid.)
(Though I think the failure to use exact arbitrary precision numeric values except where an imprecise operation forces fallback to limit precision binary floating-point approximations -- and then visually distinguishing when the latter has occurred -- is a bigger problem for common spreadsheet use cases than the bad null handling.)
I get it, I really do. But I'm a programmer.
Spreadsheets are the dominant end-user-programmable tool in existence. Much of what people user them for is horrifying to programmers who know better.
But if spreadsheets actually did the "right" thing, they would never have been so widely adopted in the first place. That's an unprovable assertion on my part, but I strongly suspect it's true. I think it's a clear case of Worse is Better[0].
> But if spreadsheets actually did the "right" thing, they would never have been so widely adopted in the first place.
I don't think spreadsheets are widely adopted because they do the wrong thing, I think they are widely adopted because they provide UI affordances that systems that do the Right Thing didn't at the time that spreadsheets became widely adopted, and, since then, IT departments have imposed lockdown requirements which prevent anything programmable from being accessible to end-users except spreadsheets, because spreadsheets were so widely adopted before that lockdown began and end users simply refuse to give them up.
Its not, IMO, a worse-is-better situation, its simply better-is-better (in terms of UI affordances at the time of wide adoption) combined with non-technical, socially-imposed constraints which have locked the dominance in place by preventing any competition in the end-user accessible programming space.
Thats because most tools that strongly focus on correctness are also designed to not allow the user to proceed despite the detected errors.
The real right thing to do there is to give the user a warning: visually mark the field as potentially tainted, but assume the user knows what they're doing and give them a result anyway. Let them decide how much they care about correctness in that particular case.
I've learned to love Excel. Despite the horribleness mentioned and the seemingly intractable mess people make when trying to codify business processes with it, its still the very best way I've found for the end user to tell me what they want the "real" program to do when I finally write it for them. It forces them to stop thinking about specs in a genie-in-a-lamp way and start thinking about how an actual machine might make it happen. They glimpse the programmer's world through a glass darkly.
So much this. People who can't be convinced to write out a spec in a way that's meaningful and unambiguous are often more than happy to take a spreadsheet that mirrors the database, write the formulas that generate the outputs they want and give that to me. Hours and hours of back-and-forth are eliminated when I can convince people to do things this way.
This should be in lesson 1 of any programming course.
> As for "the developer is supposed to implement it",
> next week I'll be giving my annual ethics lecture
> and I'll be pointing out to students that the codes
> of practice of the various professional societies
> all agree that your duty goes beyond simply doing
> what you are told.
>
> If you are told to write consumer software that
> gets its sums wrong, you should not do it.I think rule #1 should be don't try to second-guess what the end user wants or needs when you are writing or maintaining a piece of software that you yourself don't use or need.
This is a cultural issue. Most business I know see a program telling you data is invalid just means the program is annoying and getting in the way.
They want the program to just assume you know what you are doing.
This is what they want, so this is what they get.
Back in the mid 1980's I used software by Javelin software that could be used to serve the same purpose as a spreadsheet. Javelin was different in that one wrote rules based on variables not cells. It was much easier to avoid the errors that so often lurk hidden beneath the visible cells of a spreadsheet.
Unfortunately, I never met another user of the software and the company eventually went nowhere. The spreadsheet was a concept very easy for bookkeepers, accountants and other non-programmers to understand. Not so much with (the better) Javelin.
Here is an add for the software from 1985: http://www.thecomputerarchive.com/archive/Software/Applicati...
The closest modern equivalent to Javelin is Quantrix Modeler, although it doesn't seem to have the breadth of functionality that Javelin did.
In every language I know there is a set if conventions around Null values, not set values, true, false and zero
We have only had the concept of zero for a thousand years and null for about fifty. Expecting there to be a common way of handling this in all domains and industries is a bit much and expecting it to be handled the way the OP prefers is highly unlikely (although I do agree with their opinion)
If the convention in accountancy is BLANK == 0 then that's their convention. Implement it using a single method that's nice and easy to instrument and discover for later on, perhaps raising warnings out to the reports produced.
Leave the worrying over not set values for when Unicode and UTC handling in excel is brought up :-)
The concept of zero is at least 3,700 years old. The Egyptians used it in their accounting. See http://en.wikipedia.org/wiki/Egyptian_numerals#Zero_and_nega...
Null is probably also much older: https://en.wikipedia.org/wiki/Mu_%28negative%29
I continue to be terrified at how spreadsheets are used in business applications. Squashing data that shouldn't be in tables into tables "so you can work with them" was the worst thing we ever taught non-programmers to do. We should have taught them how to program instead.
(you can get close to programming in a spreadsheet if you know what you're doing with Insert > Name > Create, and one day I'd like to see a spreadsheet that lets you write your macro functions in Python rather than VB, but nobody is taught to use Excel that way except people who also program.)
It's not so much Excel's tables that people want, as its dataflow programming. Excel is a widely available environment that lets you define data slots in terms of functions on other data slots, with the whole chain updated live as values change (no "manual" update logic, just functions of cells with auto-update). The table layout is just a default way to view the slots. Until very recently that programming style was not widely available elsewhere, especially with a GUI. The only other semi-widely used system I can think of that sort of has that functionality is Mathematica (where you can link slots in a notebook), but Mathematica is more niche and expensive than Excel.
Heck, even without the GUI requirement, "real" programming languages have only very recently added competitive functionality, with the exception of Common Lisp, which had the Cells package ages ago. Now it's getting more common to find various kinds of dataflow/reactive/data-binding constructs in mainstream languages other than Excel, but it's quite new.
In terms of auto-update, sure. But I was talking mostly about how to structure your data - for most business activities, lists of objects with properties (or dictionary keys) are far superior to cells, and collections of cells, with obscure names.
How much easier would spreadsheets be to understand if you had
instead ofsum([widget.price for widget in catalogue])
?SUM(F3:F405)Teaching people to think more about their data structures, rather than teaching them to squash everything into a table even if it doesn't really fit into one, would dramatically expand their skills in this type of analysis!
"Dramatically expand their skills" and "easier to understand" are opposite directions; ease of use means getting some results with as little skill expanding as absolutely possible.
For immediate results and learning, your proposal spreadsheets would be harder to understand - if your data fits on a single screen (not ...F405), then you can just point your finger at it and say "Here! here are the widget prices!", but any structured model would (a) require thinking about what a right structure would be for the situation (that's the hard to use part, takes effort), (b) require understanding about what makes a proper data model (that's the hard to learn part, many people won't know how) and (c) makes it nontrivial to update that model (that's the hard to maintain part - less chance to break stuff, but more effort to do it properly).
For reusable datasheets and estabilished processes it's an entirely different situation, and for that there are not-Excel apps or named ranges within excel; but creating an usable dataflow and a reasonable structure is a much, much more difficult skill than using excel currently. It may be easy for you, but not for the general population; heck, I see heaps of second year CS students that have no clue about how to structure their data, so requiring that can't make spreadsheets easier to understand.
You can essentially do what you're asking in Excel using named ranges. Getting novices to use advanced Excel features is just as difficult as getting novices to use advanced features in any other language.
Indeed, I mentioned this in my original post (child of the article) - such features are only taught to users so advanced that they are (more than likely) also programmers (or certainly should be). I think this is a shame.
one day I'd like to see a spreadsheet that lets you write your macro functions in Python rather than VB
That's LibreOffice :) Though it's not "rather", but "besides".
Interesting. I will have to try this when I have time!
Anecdotally, trillions of dollars daily are moved around through chains of excel spreadsheets. It's been an eye opener for me :-)
In the posters variant of a spreadsheet where a blank value is invalid, how would the common practice of SUM(column A) be handled, where column A has an unknown number of rows (that keeps on being added to)?
Require all formulas that refer to column A be continually changed to reference the true amount of rows? Stop at the first blank row? Stop at the last blank row? It seems more confusing to define rules for that case than to assume "undefined values are zero".
> In the posters variant of a spreadsheet where a blank value is invalid, how would the common practice of SUM(column A) be handled, where column A has an unknown number of rows (that keeps on being added to)?
IME, that's not really all that important of a case, because this:
> Require all formulas that refer to column A be continually changed to reference the true amount of rows?
Is the normal way to handle it in Excel, so normal in fact that Excel has a couple of automated ways to handle it.
If you are doing a table (which is the only case where "sum of a column" really makes sense), then for Table T and column C, the formula is SUM(T[[#Data],[C]]).
The other is the fact that Excel automatically updates ranges in formulas as rows are inserted, etc. (though this doesn't work if you are appending, only if you are inserting within the range.)
The two cases I've seen where relying on blank row handling is a common way of dealing with data that can be added rather than relying on either structured references in tables or excels automated updates are:
1) Spreadsheets written prior to Excel supporting structured references in tables (or by people who learned Excel that long ago and haven't updated their skills), and
2) Creating forms with a fixed number of rows (usually, to be printed in hardcopy form, so that page layout is an issue), where simply zero-filling the relevant cells with a formula that doesn't display zero values would be a simple solution (since you have a fixed set of cells to fill.)
Of course, you could also have function fail by default on blank cells but take an optional parameter to treat blank cells as the appropriate identity, the same way Excel lookup functions rely on sorted data by default but can be given flag that tells them to find an exact match without relying on sorted input.
1. Has anyone ever audited a non-trivial computer program and found it to be free of errors? (I'm sure I've created trivial spreadsheets with no errors.)
2. What if spreadsheets started with 0s in every cell, would this "solve" the problem? I don't think so, and yet it would address this argument.
I was particularly impressed by this detail further down the conversation:
Let A1 = 1, A2 = blank, A3 = 3
PRODUCT(A1:A3) is 1
A1 * A2 * A3 is 0
Well that's just inconsistent and annoying and wrong.
I think the original comments are still over the top, as it seems to be arguin that a defined convention exists but the OP disagreed with it.
However if your example is what they were actually complaining about then hell yes.
Within Excel, ranges ignore blank cells in computation, but a reference to a blank cell returns 0. The idea being that if you specify a cell in a computation, you're expecting something to be there, whereas when you have a range, you may have sparse data throughout that range, and therefore, the result shouldn't break.
SQL behaves similarly. See: http://www.sqlfiddle.com/#!15/37025/5
> Let A1 = 1, A2 = blank, A3 = 3
> PRODUCT(A1:A3) is 1
3, I think (based on experimentation).
That appalled user would have a heart attack at Awk, a language in which a previously unseen variable serves as zero or an empty string.
count++ # if count doesn't exist, it becomes 1
A previously unseen array reference materializes automatically too: count[$1]++
(I think this is garbage too, but it leads to throwaway programs that are very concise. If you write anything large, it's going to bite you on the butt: mispelled[foo] += bar; # weeI personally would prefer a spreadsheet that treated null as the identity for whatever function it was acting as an input to (0 for addition, 1 for multiplication, e for exponentiation, Identity matrix for matrix multiplication, etc.), only throwing an error when a null does not resolve to a single value (i.e. the same cell is used for both addition and multiplication). But that would apparently require me to smoke some stuff.
Actually, I wouldn't even throw an error. It's quite intuitive for the user when "sum" and "product" apply only to non-empty cells, and empty cells are ignored. If all cells are empty, the return value should be the identity of the function, i.e. 0 for sum and 1 for product. Also it's nice to have a function "count" that counts non-empty cells, with identity value 0.
Note that the last line wouldn't work under your proposal, because the identity for "product" isn't the same as the identity for "count". I think the user would be massively confused by an error like "using conflicting identity values for cell A2", and would prefer a spreadsheet that just gave them the damn geometric mean.A1 = 1, A2 = blank, A3 = 3 sum(A1:A3) = 4 product(A1:A3) = 3 count(A1:A3) = 2 average(A1:A3) = sum(A1:A3)/count(A1:A3) = 2 geometric_mean(A1:A3) = product(A1:A3)**(1/count(A1:A3)) = sqrt(3)But what is the geometric mean of 2, 3, 4 and potato?
An error.
I think the right thing might be, "default empty to a tainted zero", and then propagate and display taint. That gives immediate feedback and lets you get some sense of things playing with an incomplete spreadsheet but won't silently give you bad data that you think is good.