Effective Property-Based Testing
blog.auxon.io

I would like to give a shout out to quickcheck (Haskell, https://hackage.haskell.org/package/QuickCheck), as it was my first introduction to property-based testing and has since led me to use property-based testing everywhere over the last 10 years. Then there is scalacheck (https://www.scalacheck.org/). Both let you write your own generators and are quite good at shrinking down to minimal failing cases.
The suggestion elsewhere in this thread to decrease the number of iterations during normal testing and crank it up during nightlies is also good.
The only thing I’m still missing from these libraries is a convenient mechanism for remembering previously failing generated inputs and using them as a static list of test cases alongside the ones generated at runtime, as a regression test of sorts.
> I would like to give a shout out to quickcheck (Haskell, https://hackage.haskell.org/package/QuickCheck), as it was my first introduction to property-based testing
As far as I know, the Haskell QuickCheck library by Koen Claessen and John Hughes was in fact the first library to lay out property-based testing. Hughes then went on to create a paid and expanded version of the library in Erlang, and as QuickCheck rose in popularity it has been re-implemented in many different languages.
Indeed. It's an unfortunate general omission in the ecosystem. It's one of the things our product does (drawing on different inspirations like MC, PBT, SBFL, etc., but moving things up to assessing systems rather than programs): contradictions and counter-examples are curated specifically to make them easier to reuse. They're provided back to the user as executable "properties" in our DSL to apply to past, present, and future data.
disclaimer: Auxon co-founder
Also, a big hat tip to QuickCheck from me as well. Getting into it via Erlang many, many years ago was among the more impactful and transformative developments in my approach to thinking about improving software quality.
In the article, you (Auxon) advise avoiding type-based generators and unbounded collections and writing your own generators instead. Usually I find it more convenient to write filters on the generated types than to create my own generators. As the post mentions, this can lead to many discarded inputs. Have you ever considered how we could use the predicates in the filters to create specialized generators that only generate inputs the filter will accept? You probably need some metaprogramming or reification of the predicates for that to work.
You can already do that with a family of predicates if you write preconditions in your Python code (see my previous comment [1]). There is an ongoing discussion about how to bring this into Hypothesis (see the issue [2]).
[1]: https://news.ycombinator.com/item?id=26018386
[2]: https://github.com/HypothesisWorks/hypothesis/issues/2701
> Have you ever considered how we could use the predicates in the filters to create specialized generators that only generate inputs the filter will accept? You probably need some metaprogramming or reification of the predicates for that to work.
That's a really nice idea. It doesn't fit into the usual compositional design of the proptest libraries that I've used, but it certainly seems like it should be possible in principle.
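As a rough illustration of the compositional alternative (a Python/Hypothesis sketch; the bounds and the evenness predicate are made up for the example), the predicate can often be reified by construction instead of by filtering:

    from hypothesis import given, strategies as st

    # Filtering: draw anything, then discard values the predicate rejects.
    # Restrictive predicates can waste many draws this way.
    filtered_evens = st.integers(min_value=0, max_value=1_000_000).filter(
        lambda n: n % 2 == 0
    )

    # Constructing: bake the predicate into the generator, so every drawn
    # value is already acceptable and nothing is discarded.
    constructed_evens = st.integers(min_value=0, max_value=500_000).map(lambda n: 2 * n)

    @given(constructed_evens)
    def test_even_values_stay_even(n):
        assert n % 2 == 0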
Hypothesis does what you’re talking about. It stores failing examples and chooses them during subsequent runs.
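For the "static list of regression cases" wish above, Hypothesis also lets you pin known-bad inputs explicitly with @example, on top of its automatic example database; a minimal sketch (encode/decode and the pinned strings are hypothetical):

    from hypothesis import example, given, strategies as st

    @given(st.text())
    @example("")          # a previously failing input, kept as a permanent regression case
    @example("\x00abc")   # another historical failure, replayed on every run
    def test_roundtrip(s):
        # encode/decode stand in for whatever code originally failed.
        assert decode(encode(s)) == s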
If you use Python and want to infer test strategies from contracts, you might want to check out this library of mine: [1].
There are also plugins for IDEs (PyCharm, VS Code, and Vim), which can be quite helpful during development.
This is a great set of ideas for using property-based testing. I've found it useful to think of code in terms of invariants and contracts, and property-based testing lets me express those very directly in code. No other testing method comes close.
Brilliant article.
If I were to add just one thing to the list: metatest. Write a test that asserts that your generated test cases are "sufficiently comprehensive", for whatever value of "sufficiently" you need. In an impure language, this is as easy as having the generator contain a mutable counter for "number of test cases meeting X condition" for whatever conditions you're interested in. For example, say your property is "A iff B". You might want to fail the test if fewer than 10% of the generated cases actually had A or B hold. (And then, of course, make sure your generators are such that - say - A and B hold 50% of the time; you want an astronomically small chance of random metatest failure.)
(I did a brief intro to this in the Metatesting section of a talk I did two years ago: https://github.com/Smaug123/talks/blob/master/DogeConf2019/D... . On rereading it now, I see there's a typo on the "bounded even integers" slide, where the final `someInts` should read `evenIntegers`.)
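A rough Python translation of the counter idea (the property, the condition, and the 10% threshold are invented for illustration; it also assumes the metatest runs after the property test, e.g. by file order under pytest):

    from collections import Counter
    from hypothesis import given, strategies as st

    stats = Counter()

    @given(st.integers(), st.integers())
    def test_addition_commutes(a, b):
        # Count how often the "interesting" condition A actually held.
        stats["total"] += 1
        if a % 2 == 0:          # hypothetical condition A
            stats["condition_A"] += 1
        assert a + b == b + a

    def test_metatest_coverage():
        # Fail if fewer than 10% of generated cases exercised condition A.
        assert stats["condition_A"] >= 0.1 * stats["total"]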
Hypothesis can report statistics on user-defined events, as well as the usual timing stuff: https://hypothesis.readthedocs.io/en/latest/details.html#tes...
I'd just check that when you're writing or changing the tests, though; for nontrivial conditions it can take a very long time to get a negligible probability of any metatest failing in a given run, and flaky metatests are just as bad as the usual kind.
If this split is particularly important, we'd usually recommend just writing separate tests for data that satisfy A or B; you can even supply the generators with pytest.mark.parametrize if copy-pasting the test body offends.
I've been trying to push property-based testing at work but there's not a load of enthusiasm showing. I guess it's in part because we work in C# and the best option appears to be FsCheck, which is very inconsistently documented between its F#-native and C#-accessible APIs.
I think there's a strong argument with FsCheck to write all your proptest code in F# just to take advantage of the vastly better generator syntax, but that's a hard sell for a team who mostly don't know F# and aren't convinced proptests are much better anyway. Writing the generators in C# seemed incredibly tedious. I did start to get the hang of identifying properties to test, though; once you're past the mechanics of "how does this work", that becomes much easier.
A long road to travel here, but I've kind of given myself a remit to improve software quality, and I do think we need to be looking at this kind of testing to help.
Where do people who are using it find that it offers the most value? I keep feeling that we could really solidify some of our bespoke parsing and serialisation code using this kind of tech.
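Parsing and serialisation code is a classic fit, because the roundtrip property is nearly free to state. A minimal sketch in Python/Hypothesis rather than FsCheck (parse/serialise are placeholders for whatever your codebase exposes):

    from hypothesis import given, strategies as st

    # A strategy for whatever your serialiser accepts; here, JSON-like data.
    json_values = st.recursive(
        st.none() | st.booleans() | st.integers() | st.text(),
        lambda children: st.lists(children) | st.dictionaries(st.text(), children),
        max_leaves=20,
    )

    @given(json_values)
    def test_roundtrip(value):
        # parse/serialise stand in for your bespoke implementation.
        assert parse(serialise(value)) == value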
FsCheck maintainer here.
I'd very much welcome contributions on documentation, especially on approaches for keeping the C#/F# documentation consistent and still accessible for both types of users. Even if it's just ideas/comments: how would you like the documentation presented? What are examples of excellent C# documentation? We need to balance that with available resources; we don't have a team of ghostwriters to write docs and examples for every language, as you can imagine. I know it's a cliché by this point, but if every user would take a couple of minutes to write a paragraph or example where e.g. the C# docs are lacking, they'd be in a much better state. From our side, if something is stopping you from contributing in this way, we'd like to hear about it. Addressing that is important.
Separately, I'm surprised you found generators significantly more tedious to write in C# than in F#; could you open an issue with a few examples of this? It would inform v3.0, where we will stop trying to use tricks to make the F# API accessible and instead add a bespoke C#/VB.NET API in the FsCheck.Fluent namespace, separating the F#-specific bits into FsCheck.FSharp.
Hypothesis: Property-Based Testing is Monte Carlo simulation for model checking.
Yep! And it's also possible to run fuzzers [1, 2] or SAT-based verifiers against the same test harness :-)
[1]: https://hypofuzz.com/docs/literature.html
[2]: https://google.github.io/oss-fuzz/getting-started/new-projec...
This is a good overview of property based testing but the mentions of Cucumber threw me off.
In my job Cucumber seems to add little more than just commenting and sequencing functions, tasks that are better suited to your programming language of choice, while adding overhead and complexity.
What am I missing?
To me, Cucumber is a way to write executable specifications that can also be read by non-developers. This can be tremendously powerful if done judiciously, or tremendously pointless if not. That may be because you do not in fact want or need an executable specification, or perhaps because there's nobody interested in reading it.
I believe that Cucumber is at its best in situations where it's clear to all parties that a specification is valuable. In that case, making the specification executable is very clearly a massively useful way to spend your time.
But if they're only read, not written, by non-developers, wouldn't it make more sense to generate a report from your spec code, rather than using the report as a cumbersome authoring language?
Property testing is awesome, but it does significantly slow down test suites. Are there any standard practices for striking a balance of getting the added confidence/safety of using property tests without introducing large delays into your CI pipeline?
We like to run the tests as part of CI with a relatively small number of iterations, and then turn the knob way up in a nightly or weekly scheduled test job.
Yup. We've got some stuff set up so that in the IDE your tests finish in ~100ms; in precommit testing pipelines you get several seconds; in nightly tests you get two hours.
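In Hypothesis this kind of split is usually expressed with settings profiles, typically in conftest.py; a sketch (the profile names and example counts are arbitrary):

    import os
    from hypothesis import settings

    # Fast feedback locally and in precommit, exhaustive search in scheduled jobs.
    settings.register_profile("dev", max_examples=10)
    settings.register_profile("ci", max_examples=200)
    settings.register_profile("nightly", max_examples=100_000)

    # Pick the profile from the environment, defaulting to the quick one.
    settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))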
My understanding, shallow as it is, is that property testing goes great with pure functional code, because without side effects you can run tests in parallel, taking advantage of all of these cores and build servers we have.
If your tests are coupled, you're already in a bad way whether you know it or not. Dumping property testing on top of that without addressing the underlying cause sounds like a recipe for misery.
It's probably a great stick and carrot if you're pushing a tech debt reduction agenda though.
Item 5 in TFA is the one I found most related to performance. Using filters to toss out bad input means you're potentially generating thousands of inputs for dozens of tests; aggregate this across all your tests and it's an incredible waste of time. Constructing proper inputs directly gets you closer to the minimum time required to execute the tests and helps hit that desired performance threshold. (NB: I haven't used these professionally; I only got one office to even entertain them, and they wouldn't take it and run with it even after I showed them a dozen errors I'd found in an hour of using them. Consequently my experience is still limited.)
You could fan-out tests to multiple FaaS instances.
Cloudflare Workers, for example, have no cold-starts. But they are JS/WASM only
That’s a lot of buzzwords you got there
For JS and TypeScript, the best property testing library I've encountered so far is fast-check https://github.com/dubzzz/fast-check
Does anyone have thoughts on the graphic showing the spectrum of testing options in the article (or additional resources that cover, at a high level, the range of testing options)?
link to the graphic for ease of reference:
https://blog.auxon.io/images/posts/effective-property-based-...
When you first start on PBT, the first hurdle is finding good properties of your system to test. I found this article [1] to be a great overview to get started.
[1]: https://fsharpforfunandprofit.com/posts/property-based-testi...
Auxon also looks like an interesting company.
Turning off shrinking makes little sense to me. If tests are passing there's nothing to shrink. If you are bombarded with so many failures you can't afford to shrink them then more testing is pointless.
The author’s point is that shrinking can take a very long time, and that’s when you should consider turning it off. In my experience this is also true.
They’re saying that sometimes it’s better to just see the error in full and try and figure it out.
That is not my experience (doing this sort of testing on compilers). In my experience, the shrinking is usually very fast. When it takes a non-negligible amount of time, debugging the failure would have been hopeless on the non-shrunk input.
What took the time in my case was simply getting a failure to occur at all. It might take days on a mature compiler before a failure occurs, if then. This would be millions of attempts.
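For reference, Hypothesis-style libraries usually let you skip shrinking per test rather than globally; a sketch using Hypothesis's Phase settings (the property itself is just a placeholder):

    from hypothesis import Phase, given, settings, strategies as st

    # Run every phase except shrinking, so failures are reported on the raw input.
    @settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])
    @given(st.lists(st.integers()))
    def test_sorting_is_idempotent(xs):
        assert sorted(sorted(xs)) == sorted(xs)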
These are great tips. Thanks.