"Parse, don't validate" through the years with C++

86 points by dwrodri 2 months ago · 51 comments

The C++11 example is the weakest in the article by its own thesis. Public throwing constructor, no year check, no leap-year check, so Birthdate(0, 2, 30) constructs cleanly. The C++17/23 shape (private ctor + static factory) is the actual mechanical insight from King's essay. Make the constructor a function that can fail, so the type itself carries the proof.
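
A minimal sketch of the private-constructor-plus-factory shape described above, assuming C++17 and `std::optional`. The class layout, the `parse` name, and the 1900-1999 year range (taken from the discussion further down the thread) are illustrative assumptions, not the article's actual code.

```cpp
#include <optional>

// Hypothetical Birthdate: the only way to obtain one is through parse(),
// so holding a Birthdate is itself evidence that the checks passed.
class Birthdate {
public:
    // Failable factory: returns std::nullopt instead of an invalid object.
    static std::optional<Birthdate> parse(int year, int month, int day) {
        if (year < 1900 || year > 1999) return std::nullopt;  // toy range, assumed
        if (month < 1 || month > 12) return std::nullopt;
        if (day < 1 || day > days_in_month(year, month)) return std::nullopt;
        return Birthdate(year, month, day);
    }

    int year() const { return year_; }
    int month() const { return month_; }
    int day() const { return day_; }

private:
    // Private: callers cannot bypass parse().
    Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static bool is_leap(int y) {
        return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    }

    static int days_in_month(int y, int m) {
        static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        return (m == 2 && is_leap(y)) ? 29 : days[m - 1];
    }

    int year_, month_, day_;
};
```

Callers unwrap once at the boundary, for example `if (auto b = Birthdate::parse(y, m, d)) { use(*b); }`, and everything downstream can take a `Birthdate` without re-checking it.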

simonask 2 months ago

Just to note, a throwing constructor is “just as good” as a static factory method, provided you want to use exceptions for validation errors. Which you shouldn’t, but from the perspective of treating types as proof, it’s just as good.
noitpmeder 2 months ago

exactly, use std::expected as the return type, avoid exceptions, and make a failable factory constructor to build your type. Make invalid states unrepresentable!!!
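
For toolchains without a C++23 standard library, the same idea can be sketched with `std::variant` standing in for `std::expected`. This is a swapped-in substitute, not `std::expected` itself; `ParseError`, `Result`, and the year range are hypothetical names and values chosen for illustration.

```cpp
#include <variant>

// Hypothetical error codes for the failable factory.
enum class ParseError { YearOutOfRange, MonthOutOfRange, DayOutOfRange };

class Birthdate {
public:
    // C++17 stand-in for std::expected<Birthdate, ParseError>.
    using Result = std::variant<Birthdate, ParseError>;

    // Either a valid Birthdate or the reason it could not be built.
    static Result parse(int year, int month, int day) {
        if (year < 1900 || year > 1999) return ParseError::YearOutOfRange;  // toy range
        if (month < 1 || month > 12) return ParseError::MonthOutOfRange;
        if (day < 1 || day > days_in_month(year, month)) return ParseError::DayOutOfRange;
        return Birthdate(year, month, day);
    }

    int year() const { return year_; }
    int month() const { return month_; }
    int day() const { return day_; }

private:
    Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static int days_in_month(int y, int m) {
        static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        const bool leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
        return (m == 2 && leap) ? 29 : days[m - 1];
    }

    int year_, month_, day_;
};
```

With C++23 available, `Result` would presumably become `std::expected<Birthdate, ParseError>`, with the error branches returned through `std::unexpected`.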
- dietr1ch 2 months ago
  
  Aren't you time-travelling? std::expected is C++23 (so available starting from 2025-2027 xd)
  https://en.cppreference.com/cpp/utility/expected
  - diath 2 months ago
    
    It has been available since GCC 12.1 (May 2022), Clang 19.1 (Sep 2024), and Visual Studio 17.13 (2022~): https://godbolt.org/z/on1v6qdf3
    These days compiler developers implement accepted standard features pretty fast.
    
    noitpmeder 2 months ago
    
    And tl::expected (a largely identical impl) has been available similarly as long!

foobar1726 2 months ago

It seems like the C++98 example is the best by far? Keeps all error information while remaining concise and easy to understand. Not to mention 50 times faster. (Could be improved by adding some simple type aliases like BirthYear that explicitly start from 1900.)

IMO the main takeaway is that malformed input is not an exceptional state when parsing, and should be treated as a first class citizen. Everything else is yak shaving how you want to handle the (status, validObject) tuple coming from the parser.

philip-b 2 months ago

The compile time is 50 times faster, not the runtime.

mayoff 2 months ago

The second sentence of your summary is fine, but I don’t like the first sentence:

> Use your language’s type system to parse unstructured inputs.

We don’t use the type system to parse. We use the type system to provide evidence (also called a proof or a witness) that parsing was successful, and we rely on the language’s access control facilities (public/private) and the soundness of its type system to prevent fabrication of false evidence.

dwrodriOP 2 months ago

I like the linking of "construction of a type is evidence of correctness"!

gsliepen 2 months ago

The C example could have implemented a lot of validation just by checking the return value of sscanf():

    if (sscanf(user_input, "%4u-%2u-%2u", &year, &month, &day) != 3) {
        // return an error
    }

This still does not catch trailing garbage, but you could check for that as well:

    if (sscanf(user_input, "%4u-%2u-%2u%c", &year, &month, &day, &dummy) != 3) {
        // return an error
    }

The result would be 4 if there was at least one trailing character. Too bad there is still no std::scan() companion to C++23's std::print().
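
Wrapped into a complete function, the trailing-character check might look like the sketch below. `RawDate` and `scan_iso_date` are hypothetical names, and the function only checks the shape of the input, not the ranges. Note that `%u` also skips leading whitespace and accepts a sign, so this is looser than a strict `YYYY-MM-DD` matcher.

```cpp
#include <cstdio>
#include <optional>

// Unvalidated fields exactly as read from the input.
struct RawDate {
    unsigned year, month, day;
};

// Returns a RawDate only if sscanf converted the three numbers and found
// nothing after the day. Range checks are left to a later parsing step.
std::optional<RawDate> scan_iso_date(const char* input) {
    unsigned y = 0, m = 0, d = 0;
    char trailing = 0;
    // %c is converted only if a character follows the day field:
    // 3 conversions means a clean match, 4 means trailing garbage.
    const int n = std::sscanf(input, "%4u-%2u-%2u%c", &y, &m, &d, &trailing);
    if (n != 3) return std::nullopt;
    return RawDate{y, m, d};
}
```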

tialaramex 2 months ago

Although it feels intuitively as though a std::scan could make sense, it doesn't, at least not with the sort of API I've seen suggested.
Consider a hypothetical Goose type. We can express any Goose usefully as output and, conveniently, some potential inputs could be read as a Goose successfully, though most arbitrary strings cannot be understood as a Goose.
Providing std::print for Goose is simple: we've got a variable (or maybe a constant) of type Goose, and we just emit the correct sequence of symbols. Writing all the boilerplate in C++23 is annoying, but it's mechanical and boring rather than tricky (so maybe C++26 will make it easier via reflection).
But how could std::scan for Goose work? We need a Goose variable to potentially store the Goose if we read one, but how can we make a default Goose? No, each Goose is unique and there is no substitute; this can't work.
The std::scan idea seems attractive for simple, almost untyped input (strings, integers, that sort of thing), but the whole point of "Parse, don't validate" is that you probably want to parse email addresses and ISBNs and ISO dates; you don't want a string, another string and a third string.
Rust's FromStr trait is more appropriate. Given a type implements FromStr we can parse any string to (maybe) get an instance of that type, but we don't need an "empty" instance first because we're doing the construction when we call the function.
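
A rough C++17 analogue of that idea, as a sketch rather than an established idiom: a `FromStr<T>` trait (a hypothetical name borrowed from Rust) whose `parse` returns `std::optional<T>`. `Goose` has no default constructor, and none is needed. The `goose:<name>` grammar is invented purely for illustration.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical trait: specialize FromStr<T> to make T parseable.
// The primary template is left undefined for types that cannot be parsed.
template <typename T>
struct FromStr;

class Goose {
public:
    const std::string& name() const { return name_; }

private:
    explicit Goose(std::string name) : name_(std::move(name)) {}
    friend struct FromStr<Goose>;  // only the parser may construct a Goose
    std::string name_;
};

template <>
struct FromStr<Goose> {
    // Toy grammar (an assumption): "goose:" followed by a non-empty name.
    static std::optional<Goose> parse(std::string_view s) {
        const std::string_view prefix = "goose:";
        if (s.size() <= prefix.size() || s.substr(0, prefix.size()) != prefix)
            return std::nullopt;
        return Goose(std::string(s.substr(prefix.size())));
    }
};

// Convenience wrapper: the value is constructed inside the call, so the
// caller never needs an "empty" T to pass in.
template <typename T>
std::optional<T> parse(std::string_view s) {
    return FromStr<T>::parse(s);
}
```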
- gsliepen 2 months ago
  Rust's FromStr only deals with parsing a single object. However, ideally std::scan() would be an exact counterpart of std::print() and would be able to parse multiple objects. I totally agree that the C way of passing references to already existing variables is not great. Ideally you return a tuple of objects, but then it becomes very annoying to specify the types. Maybe something like this?
      auto [value, text, goose] = std::scan<int, std::string, Goose>(input, "{} {} {}");
  A halfway solution would be to have the hypothetical std::scan() take references to std::optional<>s or std::expected<>s:
      std::optional<int> value;
      std::optional<std::string> text;
      std::optional<Goose> goose;
      /* auto result = */ std::scan(input, "{} {} {}", value, text, goose);
  The latter would be type safe, close to how scanf() works, but less satisfying from a functional programming standpoint.
  Orthogonal to that, adding support for scanning a Goose would be just like how you add a formatter for it, and would be quite similar to a Rust trait. One could imagine having to define something like this:
      template<> struct std::scanner<Goose> {
          constexpr auto parse(std::format_parse_context& ctx) {…}
          auto scan(std::format_context& ctx) const -> std::optional<Goose> {…}
      };
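
The tuple-returning variant can be approximated in C++17. The sketch below is not a proposed `std::scan`: it ignores the format string, assumes whitespace-separated fields, and uses a hypothetical `Scanner<T>` hook in place of the `std::scanner<T>` specialization imagined above.

```cpp
#include <charconv>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>
#include <tuple>
#include <utility>

// Hypothetical per-type hook; user types would add their own specialization.
template <typename T>
struct Scanner;

template <>
struct Scanner<int> {
    static std::optional<int> scan(const std::string& tok) {
        int v = 0;
        const char* end = tok.data() + tok.size();
        auto [ptr, ec] = std::from_chars(tok.data(), end, v);
        if (ec != std::errc() || ptr != end) return std::nullopt;  // reject "4x"
        return v;
    }
};

template <>
struct Scanner<std::string> {
    static std::optional<std::string> scan(const std::string& tok) { return tok; }
};

// Reads the next whitespace-delimited token and hands it to Scanner<T>.
template <typename T>
std::optional<T> scan_next(std::istream& in) {
    std::string tok;
    if (!(in >> tok)) return std::nullopt;
    return Scanner<T>::scan(tok);
}

// Returns all values or nothing; no default-constructed T is ever required.
template <typename... Ts>
std::optional<std::tuple<Ts...>> scan(const std::string& input) {
    std::istringstream in(input);
    // Braced initialization evaluates the pack expansion left to right.
    std::tuple<std::optional<Ts>...> parts{scan_next<Ts>(in)...};
    std::string extra;
    if (in >> extra) return std::nullopt;  // trailing garbage
    return std::apply(
        [](auto&... p) -> std::optional<std::tuple<Ts...>> {
            if (!(p.has_value() && ...)) return std::nullopt;
            return std::tuple<Ts...>(std::move(*p)...);
        },
        parts);
}
```

Usage would look like `if (auto r = scan<int, std::string>("42 hello")) { auto& [value, text] = *r; }`, which keeps the structured-binding ergonomics without references to pre-existing variables.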

MarsIronPI 2 months ago

Heh, I can especially tell the first code example is LLM-generated. Humans don't usually write comments like:

   // There are a few ways to let API callers bring their own 
   // memory, as they would in a no-malloc environment and this
   // stack-friendly c'tor is a stand-in for that.

There's just something about this comment that doesn't feel right. I've seen these kinds of phrasings in LLM output before but I'm not sure exactly how to describe them.

dwrodriOP 2 months ago

Author here. The post didn't get much traffic when I uploaded so I didn't engage much with the thread. Looks like I should've come back!
I specifically wrote that by hand to note the specific shortcomings of this approach when evaluated under King's thesis. I do acknowledge that I use LLMs heavily when drafting the code snippets in this blog post, and I do a mini review of the downsides of using these models in the conclusion.

usefulcat 2 months ago

I don't see how this is in any way preferable to having an ordinary default constructor that does the same thing:

    // There are a few ways to let API callers bring their own 
    // memory, as they would in a no-malloc environment and this
    // stack-friendly c'tor is a stand-in for that. 
    static Birthdate epoch() { return Birthdate(1900, 1, 1); }

plorkyeran 2 months ago

Some readers will expect Birthdate() to be equivalent to Birthdate(0, 0, 0), and naming it Birthdate::epoch() makes it clear that it is not that. I don't think it's worth it, but there is an upside.

bregma 2 months ago

Author has used LLMs to generate Java code in C++. It detracts from his point.

pjmlp 2 months ago

What Java code?
Regardless of how they might have used LLMs, I tend to have an issue with this kind of complaint, given the C++ example code in the book Design Patterns: Elements of Reusable Object-Oriented Software, released in 1994, 2 years before Java was made public.
Or the examples from "Using the Booch Method: A Rational Approach" or "Designing Object Oriented C++ Applications Using The Booch Method".
Additionally, there are enough framework examples, starting with Turbo Vision in 1990, MacApp in 1989, OWL in 1991, MFC in 1992, ...
Somehow a C++ style that was prevalent in the industry between 1990 and 1996, that I bet plenty of devs still have to maintain in 2026, has become "Java in C++".
- bregma 2 months ago
  
  > What Java code?
  A class with a passel of static member functions is Java code. It is not in any way idiomatic C++ code which has had namespace-level ("free") functions since it was invented as C-with-classes many decades ago. Using classes holding a whole lot of static member functions is strongly frowned on in the professional C++ community.
  - dwrodriOP 2 months ago
    
    Author here:
    A lot of my professional C++ experience comes from the computer vision space where I am specifically linking against FFmpeg (libav does its own share of memory management tricks that don't always play well with RAII).
    I think of static functions (even within member classes) as a signifier of "hey, you don't need a constructed object for this to work and it doesn't depend on class instance state".
    In application code, I was typically relying on Myers Singletons and the implicit thread-safeness more than what you see here. I debated dropping the static keyword because it stands out as odd especially in a private class method, but settled on keeping it.
  - pjmlp 2 months ago
    
    Certainly not the professional C++ community that still uses frameworks born in the 1990s, predating Java, or game engines.
- antonvs 2 months ago
  
  > Somehow
  There's not much mystery about that - Java took that approach and ran with it, and now has much greater mindshare than C++.
  Also, the mid-90s were before most software developers working today were born, I suspect. They'd have to go find a graybeard and ask them to tell them tales of yore, to find out about any of this.
  - pjmlp 2 months ago
    
    We gladly tell bonfire tales. :)
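
For readers unsure what the static-member versus free-function disagreement above looks like in code, here is a sketch of the namespace-level alternative: the factory is a free function in a namespace and reaches the private constructor through a `friend` declaration. `dates`, `Year`, and `parse_year` are hypothetical names, and the year range is the toy one from this thread.

```cpp
#include <optional>

namespace dates {

class Year {
public:
    int value() const { return value_; }

private:
    explicit Year(int v) : value_(v) {}
    // The free function below is the only code allowed to construct a Year.
    friend std::optional<Year> parse_year(int raw);
    int value_;
};

// Namespace-level ("free") factory instead of a static member function.
std::optional<Year> parse_year(int raw) {
    if (raw < 1900 || raw > 1999) return std::nullopt;  // toy range, assumed
    return Year(raw);
}

}  // namespace dates
```

Either spelling gives the same guarantee; the difference is whether callers write `dates::parse_year(raw)` or `Year::parse(raw)`.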
SuperV1234 2 months ago

No, it doesn't.

jsymolon 2 months ago

First thought: assuming that birth years start at 1900 is bad for a number of reasons, one of which is "process this list of authors and ..."

What about everyone born before 1900?

alpinisme 2 months ago

It’s a contrived example. And I have to assume the author intended it to be contrived given that he also put an upper bound at 1999 in an article written in 2026 in an industry that skews young.
But the pattern applies regardless of the validation logic.
psychoslave 2 months ago

If we go in that direction, it is already a big assumption that the birth year is known at all for everyone presumed to have existed.
Neywiny 2 months ago

Or what if they were born after 1999?
It's just a toy example not a production ready birthday validation library.

blt 2 months ago

I'm not a Haskell programmer, but from my limited awareness: Wouldn't they want to encode the restriction that April 31 doesn't exist directly in the type system instead of using raw integers for the underlying struct?

dwrodriOP 2 months ago

A very specific shortcoming of this implementation is indeed "Day of Month" and "Month of Year" aren't given their own types! The type specification should likely be applied all the way down! I felt the examples conveyed the point well enough and it was shorter in many cases.
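
One way to push the types further down, sketched under assumptions: `Month` becomes an enum and `DayOfMonth` can only be built for a given month and leap-year flag, so "April 31" has no representation. All names here are hypothetical. An `enum class` can still be forged with `static_cast`, so `Month` is a weaker guarantee than a class with a private constructor.

```cpp
#include <optional>

enum class Month { Jan = 1, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec };

// Parses a raw 1..12 integer into a Month.
inline std::optional<Month> parse_month(int raw) {
    if (raw < 1 || raw > 12) return std::nullopt;
    return static_cast<Month>(raw);
}

// A day that is known to exist in its month.
class DayOfMonth {
public:
    static std::optional<DayOfMonth> parse(Month m, bool leap_year, int raw) {
        if (raw < 1 || raw > days_in(m, leap_year)) return std::nullopt;
        return DayOfMonth(m, raw);
    }

    Month month() const { return month_; }
    int day() const { return day_; }

private:
    DayOfMonth(Month m, int d) : month_(m), day_(d) {}

    static int days_in(Month m, bool leap_year) {
        switch (m) {
            case Month::Feb:
                return leap_year ? 29 : 28;
            case Month::Apr:
            case Month::Jun:
            case Month::Sep:
            case Month::Nov:
                return 30;
            default:
                return 31;
        }
    }

    Month month_;
    int day_;
};
```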

kstenerud 2 months ago

C is perfectly capable of type-driven design. He's already got the type (struct), and although C is a bit limited, he can:

* Return pointer-or-null.

* Choose "invalid" sentinel values and then use birthdate_is_valid(...) to check validity.

* Add an is_valid bool field (or even an error enum like in the C++23 example).

* Add an out field in the constructor function for the error code (similar to how ObjC does things).

wk_end 2 months ago

The point of parse-don't-validate is that the type checker prevents you from having a value of a particular type that's invalid.
Pointer-or-NULL doesn't work, because all pointers are nullable in C; you can always have a Foo* (NULL) that doesn't actually point to a valid Foo.
Invalid sentinel values are definitionally values of a particular type that are invalid. Same with an is_valid field.
An out field in the constructor means that whatever you actually return in the case of an error is going to be a well-typed Foo that's invalid.
- kstenerud 2 months ago
  
  My point is that you do the checking at the call site, and then use a static analysis tool or an AI to enforce checking the result right after calling parse_birthday.
  Sure, Optional is more elegant, but the end result is the same: Now none of the other code needs to validate; it's already been verified valid at all points where a parse error could have occurred.
  C may not be an easy language, but with the right tooling you can make code safer and make idioms like parse-don't-validate possible.
mrkeen 2 months ago

Cool, incredibly low bar.
All four of your examples are validate.
Know any languages that are worse than C at this?
tech_hutch 2 months ago

Or use an out field for the type itself, and use the return value for an error code (or just a bool). A common pattern in C#.
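
A sketch of that out-parameter shape in C++17, using a hypothetical `Percent` type and `try_parse_percent` function. Note the cost raised elsewhere in the thread: the caller must already hold a `Percent` before parsing, so the type needs some default state.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// Needs a default state so callers can declare one before parsing.
struct Percent {
    int value = 0;
};

// TryParse style: the return value reports success and the parsed value
// is written to `out`. `out` is left untouched on failure.
[[nodiscard]] bool try_parse_percent(std::string_view s, Percent& out) {
    int v = 0;
    const char* end = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(s.data(), end, v);
    if (ec != std::errc() || ptr != end) return false;  // not a whole integer
    if (v < 0 || v > 100) return false;                 // out of range
    out.value = v;
    return true;
}
```

Nothing in the type system stops a caller from reading `out` after a `false` return, which is the gap the `std::optional` and `std::expected` shapes close.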

rienbdj 2 months ago

C++ could use some do-notation

marcosdumay 2 months ago

Abstracting any part of code structure in C++ is a wasps' nest that will attack you back.
- lstodd 2 months ago
  
  Did you mean "abstract you back"?
  Being abstracted by code you just wrote is quite a painful experience, yes.

actionfromafar 2 months ago

Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming? Like parsing and validating blurs into each other.

LittleLily 2 months ago

In my experience it makes even more sense in functional programming languages, not less, since they usually also have more powerful type systems that help with actually representing parsed vs unparsed data.
gspr 2 months ago

> Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming?
Parse, don't validate was written around Haskell!
- actionfromafar 2 months ago
  
  What I tried and apparently failed to express with "parsing and validating blurs into each other." was that parsing more easily becomes "just what you do" in functional style of programming. To the point that nowadays I can no longer really remember what I did back when I tried to "validate" things instead of parsing them.
andrepd 2 months ago

The tl;dr is that instead of representing emails as type String and manually sprinkling is_email(str) throughout your code, you represent them as type Email, which has a function parse(String) -> Option<Email>. The type system then ensures the checks are present whenever they have to be, and nowhere else.
This is extremely natural to do in a language like Haskell or Rust, and incredibly unnatural to do in C++, for instance.
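
Here is roughly what that looks like in C++17, as a sketch. The `Email` class and `greeting_for` function are hypothetical, and the validity check is deliberately naive (exactly one `@` with text on both sides); real address validation is far more involved.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

class Email {
public:
    // The only way to obtain an Email; the check lives here and nowhere else.
    static std::optional<Email> parse(std::string_view s) {
        const auto at = s.find('@');
        if (at == std::string_view::npos || at == 0 || at + 1 == s.size())
            return std::nullopt;  // missing '@', empty local part, or empty domain
        if (s.find('@', at + 1) != std::string_view::npos)
            return std::nullopt;  // more than one '@'
        return Email(std::string(s));
    }

    const std::string& str() const { return value_; }

private:
    explicit Email(std::string v) : value_(std::move(v)) {}
    std::string value_;
};

// Downstream code takes Email, not std::string, so it has nothing to re-check.
std::string greeting_for(const Email& to) {
    return "To: " + to.str();
}
```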
- short_sells_poo 2 months ago
  
  I hope this is not trolling so I'll bite. It is incredibly natural to represent an object, such as an email, as an Email class in object oriented languages like C++. It'd then have a constructor that accepts a string and constructs the email object from said string, or maybe a parse(string) -> Option<Email> thingy. The type system then ensures the checks are present whenever they have to be, and nowhere else.
  Tl;dr: there's nothing extra that functional or OO programming give you here. Both allow you to represent the problem in a properly typed fashion. Why would you represent an email as a string unless you are a) deeply inexperienced or b) have some really good reason to drop all the benefits of a strongly typed language?
  - bananaboy 2 months ago
    
    I completely agree with you but I think sometimes folks carry some piece of data around as a string or int instead of something more concrete like a class or a strongly typed enum etc purely out of laziness!
    
    MarsIronPI 2 months ago
    
    I think the old Lisp tradition of using lists for everything is related to this somehow. On the other hand, in Common Lisp programmers can define custom types that have to fulfill a predicate function. Then, if they declare the types of their functions, most implementations will generate type-checking code unless instructed not to. So in Common Lisp you can use lists for everything but still have type-checking, at some cost to efficiency. :D
  - leodavi 2 months ago
    
    Well, in C++ the constructor must return a value of its class type - you can't return an Option<T> from a constructor on T, for example, and since constructors are the canonical way to construct an object, it creates stylistic and idiomatic friction when you start using free functions to create a Maybe<T> instead of constructors.
