"Parse, don't validate" through the years with C++
derekrodriguez.dev
It seems like the C++98 example is the best by far? Keeps all error information while remaining concise and easy to understand. Not to mention 50 times faster. (Could be improved by adding some simple type aliases like BirthYear that explicitly start from 1900.)
IMO the main takeaway is that malformed input is not an exceptional state when parsing, and should be treated as a first class citizen. Everything else is yak shaving how you want to handle the (status, validObject) tuple coming from the parser.
The compile time is 50 times faster, not the runtime.
The C++11 example is the weakest in the article by its own thesis. Public throwing constructor, no year check, no leap-year check, so Birthdate(0, 2, 30) constructs cleanly. The C++17/23 shape (private ctor + static factory) is the actual mechanical insight from King's essay. Make the constructor a function that can fail, so the type itself carries the proof.
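As a rough illustration of the "private ctor + static factory" shape described above, here is a minimal C++17 sketch. It is not the article's code: the class layout, the accessor names, and the use of std::optional instead of an error enum are assumptions made for brevity, and the 1900-1999 year range is taken from the toy example discussed in this thread.

```cpp
#include <cassert>
#include <optional>

// Hypothetical Birthdate: the only way to obtain one is Birthdate::parse,
// so holding a Birthdate is evidence that the checks below passed.
class Birthdate {
public:
    // Failable "constructor": returns std::nullopt for any invalid field.
    static std::optional<Birthdate> parse(int year, int month, int day) {
        if (year < 1900 || year > 1999) return std::nullopt;  // toy range, assumed
        if (month < 1 || month > 12) return std::nullopt;
        if (day < 1 || day > days_in_month(year, month)) return std::nullopt;
        return Birthdate(year, month, day);
    }

    int year() const { return year_; }
    int month() const { return month_; }
    int day() const { return day_; }

private:
    // Private, so callers cannot bypass parse().
    Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static bool is_leap(int y) {
        return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    }

    static int days_in_month(int y, int m) {
        static constexpr int days[] = {31, 28, 31, 30, 31, 30,
                                       31, 31, 30, 31, 30, 31};
        return (m == 2 && is_leap(y)) ? 29 : days[m - 1];
    }

    int year_;
    int month_;
    int day_;
};

// Usage: the invalid date from the comment above is rejected by the factory.
void birthdate_usage_example() {
    assert(!Birthdate::parse(0, 2, 30).has_value());
    const std::optional<Birthdate> ok = Birthdate::parse(1996, 2, 29);
    assert(ok.has_value() && ok->day() == 29);
}
```

Any function that accepts a Birthdate parameter can then skip re-checking the fields, since no code outside the class can construct one directly.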
Just to note, a throwing constructor is “just as good” as a static factory method, provided you want to use exceptions for validation errors. Which you shouldn’t, but from the perspective of testing types as proof, it’s just as good.
exactly, use std::expected as the return type, avoid exceptions, and make a failable factory constructor to build your type. Make invalid states unrepresentable!!!
Aren't you time-travelling? std::expected is C++23 (so available starting from 2025-2027 xd)
It has been available since GCC 12.1 (May 2022), Clang 19.1 (Sep 2024), and Visual Studio 17.13 (2022~): https://godbolt.org/z/on1v6qdf3
These days compiler developers implement accepted standard features pretty fast.
And tl::expected (a largely identical impl) has been available similarly as long!
Heh, I can especially tell the first code example is LLM-generated. Humans don't usually write comments like:
// There are a few ways to let API callers bring their own
// memory, as they would in a no-malloc environment and this
// stack-friendly c'tor is a stand-in for that.
There's just something about this comment that doesn't feel right. I've seen these kinds of phrasings in LLM output before but I'm not sure exactly how to describe them.
The C example could have implemented a lot of validation just by checking the return value of sscanf():
if (sscanf(user_input, "%4u-%2u-%2u", &year, &month, &day) != 3) {
// return an error
}
This still does not catch trailing garbage, but you could check for that as well:
if (sscanf(user_input, "%4u-%2u-%2u%c", &year, &month, &day, &dummy) != 3) {
// return an error
}
The result would be 4 if there was at least one trailing character. Too bad there is still no std::scan() companion to C++23's std::print().
Although it feels intuitively as though a std::scan could make sense, it doesn't, at least not with the sort of API I've seen suggested.
Consider a hypothetical Goose type, we can express any Goose usefully as output and, conveniently, some potential inputs could be read as a Goose successfully though most arbitrary strings cannot be understood as a Goose.
Providing std::print for Goose is simple: we've got a variable (or maybe a constant) of type Goose, and we just emit the correct sequence of symbols. It's annoying to actually write all the boilerplate in C++23, but that's mechanical: it's not tricky to do, just very boring (so maybe C++26 makes that easier via reflection).
But how could std::scan for Goose work? We need a Goose variable to potentially store the Goose if we read one, but how can we make a default Goose? No, each Goose is unique and there is no substitute, this can't work.
The std::scan idea seems attractive for simple, almost untyped input (strings, integers, that sort of thing), but the whole point of "Parse, don't validate" is that you probably want to parse email addresses and ISBNs and ISO dates; you don't want a string, another string and a third string.
Rust's FromStr trait is more appropriate. Given a type implements FromStr we can parse any string to (maybe) get an instance of that type, but we don't need an "empty" instance first because we're doing the construction when we call the function.
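For readers who think in C++, the FromStr idea above could be approximated with a static factory on a type that has no default constructor. This is only a sketch: the Goose class, its from_string name, and the "goose:<name>" text format are all invented here for illustration.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical Goose: there is deliberately no default constructor, so no
// "empty" Goose exists for a scanf-style API to fill in.
class Goose {
public:
    Goose() = delete;

    // Rough counterpart of Rust's FromStr::from_str: construction happens
    // inside the call, and failure is reported through the return type.
    // The "goose:<name>" format is made up for this sketch.
    static std::optional<Goose> from_string(std::string_view text) {
        constexpr std::string_view prefix = "goose:";
        if (text.substr(0, prefix.size()) != prefix) return std::nullopt;
        text.remove_prefix(prefix.size());
        if (text.empty()) return std::nullopt;
        return Goose(std::string(text));
    }

    const std::string& name() const { return name_; }

private:
    explicit Goose(std::string name) : name_(std::move(name)) {}

    std::string name_;
};

// Usage: the caller never needs a placeholder Goose before parsing.
void goose_usage_example() {
    const std::optional<Goose> goose = Goose::from_string("goose:Gertrude");
    assert(goose.has_value() && goose->name() == "Gertrude");
    assert(!Goose::from_string("duck:Donald").has_value());
}
```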
Rust's FromStr only deals with parsing a single object. However, ideally std::scan() would be an exact counterpart of std::print() and would be able to parse multiple objects. I totally agree that the C way of passing references to already existing variables is not great. Ideally you return a tuple of objects, but then it becomes very annoying to specify the types. Maybe something like this?
auto [value, text, goose] = std::scan<int, std::string, Goose>(input, "{} {} {}");
A halfway solution would be to have the hypothetical std::scan() take references to std::optional<>s or std::expected<>s:
std::optional<int> value;
std::optional<std::string> text;
std::optional<Goose> goose;
/* auto result = */ std::scan(input, "{} {} {}", value, text, goose);
The latter would be type safe, close to how scanf() works, but less satisfying from a functional programming standpoint.
Orthogonal to that, adding support for scanning a Goose would be just like how you add a formatter for it, and would be quite similar to a Rust trait. One could imagine having to define something like this:
template<> struct std::scanner<Goose> {
    constexpr auto parse(std::format_parse_context& ctx) {…}
    auto scan(std::format_context& ctx) const -> std::optional<Goose> {…}
};
The second sentence of your summary is fine, but I don’t like the first sentence:
> Use your language’s type system to parse unstructured inputs.
We don’t use the type system to parse. We use the type system to provide evidence (also called a proof or a witness) that parsing was successful, and we rely on the language’s access control facilities (public/private) and the soundness of its type system to prevent fabrication of false evidence.
I don't see how this is in any way preferable to having an ordinary default constructor that does the same thing:
// There are a few ways to let API callers bring their own
// memory, as they would in a no-malloc environment and this
// stack-friendly c'tor is a stand-in for that.
static Birthdate epoch() { return Birthdate(1900, 1, 1); }
Some readers will expect Birthdate() to be equivalent to Birthdate(0, 0, 0), and naming it Birthdate::epoch() makes it clear that it is not that. I don't think it's worth it, but there is an upside.
Author has used LLMs to generate Java code in C++. It detracts from his point.
What Java code?
Regardless of how they might have used LLMs, I tend to have an issue with this kind of complaint, given the C++ example code in the book Design Patterns: Elements of Reusable Object-Oriented Software, released in 1994, 2 years before Java was made public.
Or the examples from "Using the Booch Method: A Rational Approach" or "Designing Object Oriented C++ Applications Using The Booch Method".
Additionally, there are enough framework examples, starting with Turbo Vision in 1990, MacApp in 1989, OWL in 1991, MFC in 1992, ...
Somehow a C++ style that was prevalent in the industry between 1990 and 1996, that I bet plenty of devs still have to maintain in 2026, has become "Java in C++".
> Somehow
There's not much mystery about that - Java took that approach and ran with it, and now has much greater mindshare than C++.
Also, the mid-90s were before most software developers working today were born, I suspect. They'd have to go find a graybeard and ask them to tell them tales of yore, to find out about any of this.
We gladly tell bonfire tales. :)
No, it doesn't.
I'm not a Haskell programmer, but from my limited awareness: Wouldn't they want to encode the restriction that April 31 doesn't exist directly in the type system instead of using raw integers for the underlying struct?
First thought, assuming that birth year starts at 1900 is bad for a number of reasons; one of which, "process this list of authors and ..."
What about everyone born before 1900?
It’s a contrived example. And I have to assume the author intended it to be contrived given that he also put an upper bound at 1999 in an article written in 2026 in an industry that skews young.
But the pattern applies regardless of the validation logic.
Or what if they were born after 1999?
It's just a toy example not a production ready birthday validation library.
If we go in that direction, assuming that the birth year is even known for everyone who is assumed to have existed is already a big hypothesis.
C is perfectly capable of type-driven design. He's already got the type (struct), and although C is a bit limited, he can:
* return pointer-or-null
* choose "invalid" sentinel values and then use birthdate_is_valid(...) to check validity.
* Add an is_valid bool field (or even an error enum like in the C++23 example)
* Add an out field in the constructor function for the error code (similar to how ObjC does things).
The point of parse-don't-validate is that the type checker prevents you from having a value of a particular type that's invalid.
Pointer-or-NULL doesn't work, because all pointers are nullable in C; you can always have a Foo* (NULL) that doesn't actually point to a valid Foo.
Invalid sentinel values are definitionally values of a particular type that are invalid. Same with an is_valid field.
An out field in the constructor means that whatever you actually return in the case of an error is going to be a well-typed Foo that's invalid.
My point is that you do the checking at the call site, and then use a static analysis tool or an AI to enforce checking the result right after calling parse_birthday.
Sure, Optional is more elegant, but the end result is the same: Now none of the other code needs to validate; it's already been verified valid at all points where a parse error could have occurred.
C may not be an easy language, but with the right tooling you can make code safer, and idioms like parse-dont-validate possible.
Cool, incredibly low bar.
All four of your examples are validate.
Know any languages that are worse than C at this?
Or use an out field for the type itself, and use the return value for an error code (or just a bool). A common pattern in C#.
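A sketch of that out-parameter shape, written in C++ to match the rest of the thread. The RawDate struct, the ParseError enum, and try_parse_date are hypothetical names, and the day check is deliberately simplified. Note that the caller still has to create a RawDate before parsing, which is the objection raised earlier about well-typed but invalid values.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical plain struct and error enum; names are illustrative only.
struct RawDate {
    unsigned year;
    unsigned month;
    unsigned day;
};

enum class ParseError { Ok, BadFormat, BadMonth, BadDay };

// TryParse-style function: the return value is the error code and the
// parsed value travels through `out`, which is written only on success.
ParseError try_parse_date(const char* input, RawDate& out) {
    unsigned y = 0, m = 0, d = 0;
    char trailing = 0;
    // A result of exactly 3 means three numbers were read and no trailing
    // character followed them (4 would mean trailing garbage).
    if (std::sscanf(input, "%4u-%2u-%2u%c", &y, &m, &d, &trailing) != 3)
        return ParseError::BadFormat;
    if (m < 1 || m > 12) return ParseError::BadMonth;
    if (d < 1 || d > 31) return ParseError::BadDay;  // simplified: ignores month lengths
    out = RawDate{y, m, d};
    return ParseError::Ok;
}

// Usage: the caller must check the return value before trusting `date`.
void try_parse_usage_example() {
    RawDate date{0, 0, 0};  // placeholder value exists before any parsing
    if (try_parse_date("1984-04-30", date) == ParseError::Ok) {
        assert(date.year == 1984 && date.month == 4 && date.day == 30);
    }
}
```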
C++ could use some do-notation
Abstracting any part of code structure in C++ is a wasps' nest that will attack you back.
Did you mean "abstract you back"?
Being abstracted by code you just wrote is quite a painful experience, yes.
Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming? Like parsing and validating blurs into each other.
In my experience it makes even more sense in functional programming languages, not less, since they usually also have more powerful type systems that help with actually representing parsed vs unparsed data.
> Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming?
Parse, don't validate was written around Haskell!
What I tried and apparently failed to express with "parsing and validating blurs into each other." was that parsing more easily becomes "just what you do" in a functional style of programming. To the point that nowadays I can no longer really remember what I did back when I tried to "validate" things instead of parsing them.
The tl;dr is that instead of representing emails as type String and manually sprinkling is_email(str) throughout your code, you represent them as type Email, which has a function parse(String) -> Option<Email>. The type system then ensures the checks are present whenever they have to be, and nowhere else.
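To make the Email idea concrete, here is one possible C++17 rendering, offered as a sketch rather than a recommendation. The Email class, its parse and str members, and the greeting_for function are invented names, and the address check is intentionally naive (exactly one '@' with text on both sides).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical Email type: instances can only come from Email::parse.
class Email {
public:
    // Deliberately simplistic check: exactly one '@' with text on both
    // sides. Real address validation is far more involved.
    static std::optional<Email> parse(std::string text) {
        const auto at = text.find('@');
        if (at == std::string::npos || at == 0 || at + 1 == text.size())
            return std::nullopt;
        if (text.find('@', at + 1) != std::string::npos)
            return std::nullopt;
        return Email(std::move(text));
    }

    const std::string& str() const { return value_; }

private:
    explicit Email(std::string value) : value_(std::move(value)) {}

    std::string value_;
};

// Takes an Email rather than a std::string, so there is no is_email()
// call to forget inside this function.
std::string greeting_for(const Email& to) {
    return "Hello, " + to.str();
}

// Usage: the check happens once, at the boundary where the string arrives.
void email_usage_example() {
    const std::optional<Email> email = Email::parse("a@example.com");
    assert(email.has_value());
    assert(greeting_for(*email) == "Hello, a@example.com");
}
```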
This is extremely natural to do in a language like Haskell or Rust. And incredibly unnatural to do in C++ for instance.
I hope this is not trolling so I'll bite. It is incredibly natural to represent an object, such as an email, as an Email class in object oriented languages like C++. It'd then have a constructor that accepts a string and constructs the email object from said string, or maybe a parse(string) -> Option<Email> thingy. The type system then ensures the checks are present whenever they have to be, and nowhere else.
Tl;dr: there's nothing extra that functional or OO programming give you here. Both allow you to represent the problem in a properly typed fashion. Why would you represent an email as a string unless you are a) deeply inexperienced or b) have some really good reason to drop all the benefits of a strongly typed language?
I completely agree with you but I think sometimes folks carry some piece of data around as a string or int instead of something more concrete like a class or a strongly typed enum etc purely out of laziness!
I think the old Lisp tradition of using lists for everything is related to this somehow. On the other hand, in Common Lisp programmers can define custom types that have to fulfill a predicate function. Then, if they declare the types of their functions, most implementations will generate type-checking code unless instructed not to. So in Common Lisp you can use lists for everything but still have type-checking, at some cost to efficiency. :D
Well, in C++ the constructor must return a value of its class type - you can't return an Option<T> from a constructor on T, for example, and since constructors are the canonical way to construct an object, it creates stylistic and idiomatic friction when you start using free functions to create a Maybe<T> instead of constructors.