@Kent: You’ve made it clear that you believe the kernel development rules/processes are inadequate for bcachefs. That’s your prerogative. But surely, given how long you’ve been around, you knew that long before submitting bcachefs for mainline. Given this, why did you submit it for mainline at all? Did you expect that bcachefs would be exempted from following those rules/processes? This isn’t a rhetorical question, I’m genuinely trying to understand your thought process.
> So now, it probably won't go back upstream until it's well and truly finished
This kind of implies that Linus will one day start accepting your bcachefs PRs again. Is it something that he confirmed to you?
So what exactly *is* in the cards, then?
Posted Sep 5, 2025 19:16 UTC (Fri) by koverstreet (✭ supporter ✭, #4296)
The kernel development process as it is normally applied would've been fine for bcachefs: like I mentioned elsewhere, I started perusing pull requests from other subsystems and I was actually legitimately surprised to see that it looks like I've been stricter about what I consider a critical bugfix than other subsystems. (While still in the experimental phase I do accept a slightly higher risk of (non-serious!) regressions in what I push as experimental, so that I can prioritize throughput of getting bugfixes out; that's why I was surprised.)
Other subsystems will absolutely send features outside the merge window if there's a good reason for it; I even saw refactorings go in for XFS during rc6 or rc7 recently.
It's normally based on common sense and good judgement, balancing how important a patch is to users vs. the risk of regression. That should take into account QA processes, the history of regressions in that subsystem (which tells us how well those QA processes are working), how sensitive the code is, and how badly the patch is needed. And when there are concerns, they're talked through; things break down when people start dictating and taking an "I know better, even though I'm not explaining my reasoning" attitude.
The real breakdown was in the private maintainer thread, when Linus had quite a bit to say about how he doesn't trust my judgement based on, as far as I can tell, not much more than the speed with which I work and get stuff out. That speed is a direct result of very good QA (including the best automated testing of any filesystem in the kernel), a modern and very hardened codebase, and the simple fact that I know my code like the back of my hand and am very good at what I do.
I've been working in storage for going on 20 years at this point, and I've always been the one ultimately responsible for my code, top to bottom, from high level design all the way down to responding to every last bug report and working with users to make sure that things are debugged and resolved thoroughly and people aren't left hanging. People are still running, and like and trust, code that manages their data that I wrote when I was 25, and there's a bunch of people who are getting their kernel from my git repository - and for a lot of people it's explicitly because they've lost data to our other in-kernel COW filesystem and needed something more reliable, and they have found that bcachefs delivers. I don't know anyone in the filesystem world with that kind of resume.
> This kind of implies that Linus will one day start accepting your bcachefs PRs again. Is it something that he confirmed to you?
We both explicitly left the door open to that in the private maintainer thread, although on my end it will naturally be contingent upon having better processes and decisionmaking in place.
So what exactly *is* in the cards, then?
Posted Sep 5, 2025 22:40 UTC (Fri) by marcH (subscriber, #57642)
> I was actually legitimately surprised to see that it looks like I've been stricter with what I consider a critical bugfix than other subsystems.
> ...
> I even saw refactorings go in for XFS during rc6 or rc7 recently.
Surprising, can you please share some commit IDs?
> It's normally based on just common sense and using good judgement, balancing how important a patch is to users vs. the risk of regression.
The most important points seem to be missing from that list: size and nature of the changes. For both risk and maintainer bandwidth reasons.
If a "critical bug fix" has a non-negligible risk of regression, then either there's a clear divergence on the definition of a "critical bug fix", or the whole feature should be temporarily disabled (cause it has no bug fix simple enough for an RC phase). Or just filed and advertised, e.g. "don't use version X".
> (While still in the experimental phase I do accept a slightly higher risk of (non serious!) regressions that I will post experimental so that I can prioritize throughput of getting bugfixes out; that's why I was surprised.)
I think I've been noticing a bit of dissonance on that "experimental" topic...
- Either a significant number of bcachefs people use Linus' mainline and trust it with their data. Then that branch is not really "experimental" any more (whatever the label says), and no large change should ever be submitted in the RC phase - only small, "critical bug fixes".
- Or, it really is still "experimental", users should not trust that mainline branch, and then there is no emergency to fix problems in it! Because users shouldn't trust it anyway. It's "experimental" after all.
In BOTH cases, no large change should ever be submitted in the RC phase! I mean, in neither case is any time-consuming _process exception_ needed.
> I've been working in storage for going on 20 years at this point, and I've always been the one ultimately responsible for my code,...
That sounds like 20 years of filesystem experience and zero years of experience of not being the boss?
Learning is hard, unlearning is much harder. Unlearning complete control seems crazy hard.
> things break down when people start dictating and taking an "I know better, even though I'm not explaining my reasoning" attitude.
Maintainers don't really have time to explain; the onus is on the submitter to make them understand and build trust. Whatever the perception is, using words like "dictating" can only backfire. Looks like it does. Maybe the submitter does not communicate well and should try harder. Maybe the maintainer is not smart enough or does not have enough time. Then the submitter should fork (and maybe come back later). Maybe both sides have issues.
So what exactly *is* in the cards, then?
Posted Sep 6, 2025 19:43 UTC (Sat) by koverstreet (✭ supporter ✭, #4296)
> Surprising, can you please share some commit IDs?
Try git log v6.16-rc1..v6.16 -- fs/xfs
> The most important points seem to be missing from that list: size and nature of the changes. For both risk and maintainer bandwidth reasons.
> If a "critical bug fix" has a non-negligible risk of regression, then either there's a clear divergence on the definition of a "critical bug fix", or the whole feature should be temporarily disabled (cause it has no bug fix simple enough for an RC phase). Or just filed and advertised, e.g. "don't use version X".
There was ~0 risk of regression with the patch in question.
bcachefs's journalling is drastically simpler than ext4's: we journal btree updates and nothing else - it's just a list of keys. For normal journal replay, we just sort all the keys in the journal and keep the newest when there's a duplicate. For journal_rewind, all we do is tweak the sort function if it's a non-alloc leaf node key. (We can't rewind the interior node updates and we don't need to, which means alloc info will be inconsistent; that's fine, we just force a fsck).
IOW: algorithmically this is very simple stuff, which means it's very testable, and it's in one of the codepaths best covered by automated tests - and it's all behind a new option, so it has zero effect on existing operation. This is about as low regression risk as it gets, and the new code has performed flawlessly every time we've used it.
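To make that concrete, here's a rough sketch of the shape of it. This is emphatically not the actual bcachefs code: the struct layout, names, and btree IDs are made up for illustration, and the rewind cutoff is applied in the dedup pass here rather than by tweaking the sort function as described above.

```c
/*
 * Rough sketch only - not the actual bcachefs code.  The journal is just a
 * flat list of btree keys; replay sorts them and keeps the newest version
 * of each key.  journal_rewind adds one extra condition: non-alloc leaf
 * updates newer than the rewind point are dropped, and the resulting alloc
 * inconsistency is repaired by a forced fsck.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct journal_key {
	uint32_t	btree_id;	/* which btree this update belongs to */
	bool		leaf;		/* leaf vs. interior node update */
	uint64_t	key;		/* stand-in for the real bkey */
	uint64_t	seq;		/* journal sequence number */
};

#define BTREE_ID_ALLOC	1			/* illustrative value */

static uint64_t rewind_seq = UINT64_MAX;	/* UINT64_MAX == no rewind */

/* journal_rewind: drop non-alloc leaf updates newer than the rewind point */
static bool key_rewound(const struct journal_key *k)
{
	return k->leaf && k->btree_id != BTREE_ID_ALLOC && k->seq > rewind_seq;
}

static int cmp_journal_keys(const void *_l, const void *_r)
{
	const struct journal_key *l = _l, *r = _r;

	if (l->btree_id != r->btree_id)
		return l->btree_id < r->btree_id ? -1 : 1;
	if (l->key != r->key)
		return l->key < r->key ? -1 : 1;
	/* oldest first, so "keep the newest" is a simple overwrite below */
	return l->seq < r->seq ? -1 : l->seq > r->seq ? 1 : 0;
}

/* Sort the journal's keys, drop rewound ones, keep the newest duplicate. */
static size_t journal_keys_sort(struct journal_key *keys, size_t nr)
{
	size_t i, out = 0;

	qsort(keys, nr, sizeof(keys[0]), cmp_journal_keys);

	for (i = 0; i < nr; i++) {
		if (key_rewound(&keys[i]))
			continue;

		if (out &&
		    keys[out - 1].btree_id == keys[i].btree_id &&
		    keys[out - 1].key == keys[i].key)
			keys[out - 1] = keys[i];	/* newer version wins */
		else
			keys[out++] = keys[i];
	}

	return out;	/* number of keys left to replay */
}

int main(void)
{
	struct journal_key keys[] = {
		{ .btree_id = 2, .leaf = true, .key = 10, .seq = 5 },
		{ .btree_id = 2, .leaf = true, .key = 10, .seq = 9 },	/* newer dup */
		{ .btree_id = 1, .leaf = true, .key = 3,  .seq = 12 },	/* alloc: kept */
	};
	size_t nr;

	rewind_seq = 8;		/* pretend we're rewinding to seq 8 */
	nr = journal_keys_sort(keys, sizeof(keys) / sizeof(keys[0]));

	/* the seq-9 update is dropped; the older seq-5 version survives */
	for (size_t i = 0; i < nr; i++)
		printf("btree %u key %llu seq %llu\n",
		       (unsigned) keys[i].btree_id,
		       (unsigned long long) keys[i].key,
		       (unsigned long long) keys[i].seq);
	return 0;
}
```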
> - Either a significant number of bcachefs people use Linus' mainline and trust it with their data. Then that branch is not really "experimental" any more (whatever the label says), and no large change should ever be submitted in the RC phase but only small, "critical bug fixes"
No, you've got it backwards. The experimental label is for communication to users, it's not for driving development policy.
We ALWAYS develop in the safest way we practically can, but we do have to balance that with shipping and getting it done. Getting too conservative about safety paralyzes the development process, and if we slow down to the point that we're not able to close out bugs users are hitting in a reasonable timeframe or ship features users need (an important consideration when e.g. we've got a lot of users waiting for erasure coding to land so they can get onto something more manageable, robust and better supported), then we're not doing it right.
OTOH, there's generally no need to split hairs over this, because if you're doing things right, good techniques for ensuring reliability and avoiding regressions are just no-brainers that let you both ship more reliable code and move faster: if you strike a good balance, most of the techniques you use are just plain win/win.
E.g. good automated testing is a _massive_ productivity boost; you find bugs quicker (hours instead of weeks) while the code is in your head. Investing in that is a total no-brainer. Switching from C to Rust is another obvious win/win (and god I wish bcachefs was already written in Rust).
Work smarter, not harder.
But one of the key things we balance in "fast vs. safe" is regression risk, and that does vary over the lifecycle of a project. Early on, you do need to move quicker: you have lots of bugs to close out, and features that may require some rearchitecting, so accepting some risk of regression is totally fine and reasonable as long as those regressions are minor and infrequent compared to the rest of the bugs you're closing out (you want the total bugcount to be going down fast), and you're not creating problems down the road for yourself or your users: users will be fine with that as long as you're quickly closing out the actual issues they hit. I eyeball the ratio of regression fixes to other bugfixes (as well as time spent) to track this; suffice it to say regressions have not generally been a problem. (The two big ones that bit us in the 6.16 cycle were pretty exceptional and caused partly by factors outside of our control, and both were addressed on multiple levels - new hardening, new tests - to ensure bugs like that don't happen again.)
The other key thing you're missing is: it's a filesystem, and people test filesystems by using them and putting their data on them.
It is _critical_ that we get lots of real world testing before lifting the experimental label, and that means people are going to be using it and trusting it like any other filesystem, and that means we have to be supporting it like any other filesystem. "No big changes" is far too simple a rule to ever work - experimental or not. Like I said earlier, you're always balancing regression risk vs. how much users need it, with the goal being ensuring that users have working machines.
There's also the minor but important detail that lots of users are using bcachefs explicitly because they've been burned by another COW filesystem that will go unnamed (as in, losing entire filesystems multiple times), so they're using bcachefs because even in this still slightly rough and early state, the things they have to put up with are way better than losing more filesystems.
That is, they're using bcachefs precisely because of things like this: when something breaks, I make sure it gets fixed and they get their data back. Ensuring users do not lose data is always the top priority. It's exactly the same as the kernel's rule about "do not break userspace". The kernel's only function is to run all the other applications that users actually want to run: if we're breaking them, we're failing at our core function. A filesystem that loses data is failing at its core function, and should be discarded for something better.
> That sounds like 20 years of filesystem experience and 0 year experience of not being the boss?
Well, if everything comes down to authority and chains of command now, then maybe kernel culture is too far gone for filesystem work to be done here. Because that's not good engineering: good engineering requires an inquisitive, open culture where we listen and defer to the experts in their field, where we all learn from and teach each other, and when there's a conflict or disagreement we hash it out and figure out what the right answer is based on our shared goals (i.e. delivering working code).
> Maintainers don't really have time to explain;
That's a poor excuse for "I don't have time to be a good manager/engineer".
In engineering, we always have to be able to justify our decisionmaking. I have to be able to explain what I'm doing and why to my users, or they won't trust my code. I have to be able to explain what I'm doing and why to the developers I work with on the bcachefs codebase, or they'll never learn how things work - plus, I do make mistakes, and if you can't explain your reasoning that's a very big clue that you might be mistaken.
So what exactly *is* in the cards, then?
Posted Sep 7, 2025 3:04 UTC (Sun) by marcH (subscriber, #57642)
> Try git log v6.16-rc1..v6.16 -- fs/xfs
Please be specific; I just did and I found nothing shocking. The commits with "refactor" or "factor" in their name seemed very trivial, even I could make sense of them.
> There was ~0 risk of regression with the patch in question.
I was speaking in general, not about any particular patch in question. I don't even know which patch you're referring to.
> No, you've got it backwards. The experimental label is for communication to users, it's not for driving development policy.
I think you missed the point I was trying to make. I'm not sure you really tried.
> But one of the key things we balance in "fast vs. safe" is regression risk, and that does vary over the lifecycle of a project.
Yet another wall of text full of things that make sense and that I tend to agree with, but I really can't relate much of it to the points I was trying to make. This is not communicating, just speaking. And I'm amazed you have time left to write code after digressing and repeating yourself so much in obscure corners like this one. Indeed, burnout must not be far away. Unless there's a lot of copy/paste?
> where we all learn from and teach each other,
I have not read everything, very far from it, but I don't remember you "learning" much. Could you name one significant and non-technical thing that you've learned during all this drama and will try to do differently going forwards? Trying to be absurdly clear: an answer to such a question (if any) should not say _anything_ about others, only about yourself.
So what exactly *is* in the cards, then?
Posted Sep 11, 2025 1:06 UTC (Thu) by deepfire (guest, #26138)
One person speaks about technical details and impersonal principles of communication and organisation.
The other goes as far as employing mind reading and generally positions themselves as a judge of character.
Someone clearly needs to get off the high horse.
So what exactly *is* in the cards, then?
Posted Sep 15, 2025 11:12 UTC (Mon) by paulj (subscriber, #341)
> The real breakdown was in the private maintainer thread, when Linus had quite a bit to say about how he doesn't trust my judgement based on, as far as I can tell, not much more than the speed with which I work and get stuff out. That speed is a direct result of very good QA (including the best automated testing of any filesystem in the kernel), a modern and very hardened codebase, and the simple fact that I know my code like the back of my hand and am very good at what I do.
Kent, do you realise the implicit message you are sending to other kernel people when you write things like this? You are somewhat implicitly saying that the kernel development process is generally much slower than your process because others do not have good code, don't have good testing, and don't know their code well.
I am sure that's not how you intend it, but this is the kind of message you send to others when you blow your own trumpet in such ways in communications to peers and to longer-standing kernel people - whether you are explicit or subtle about it. You are signalling that you consider yourself superior, both in such descriptions and also implicitly in how you argue for exceptions again and again, even when maintainers with the final say have told you you will not get an exception at this time - particularly if you then point at other exceptional cases that you think you are better than.
Can you understand how this might rub others up the wrong way? Have you ever had to work with someone who regularly, through whatever implicit signals, makes it clear they think they are superior? Do you know how off-putting that can be to others?
I beseech you, yet again, to take a long break from engaging in comment threads here on LWN, or on Phoronix, or Reddit, etc., and also take a break from engaging with other kernel devs, and just go and focus on your code and making it great for your users. Refrain from making comparisons to other developers or their code or engineering practices - in any way, however subtle.
Do that, make bcachefs undisputably awesome, let your code do the talking, and things will eventually come good again.
If you can't stay off comment threads, where you seem to - regularly or irregularly - drop misjudged clangers about how good you think you are, then the chances of things coming good aren't as good, I fear.