Ask HN: Why dead code detection in Python is harder than most tools admit
I’ve been thinking about why dead code detection (and static analysis in general) feels so unreliable in Python compared to other languages. I understand Python's dynamic nature is part of the answer.
In theory it should be simple: parse the AST, build a call graph, find symbols with zero references. In practice it breaks down quickly because of things like:
1. dynamic dispatch (getattr, registries, plugin systems)
2. framework entrypoints (Flask/FastAPI routes, Django views, pytest fixtures)
3. decorators and implicit naming conventions
4. code invoked only via tests or runtime configuration
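To make the failure mode concrete, here's a minimal sketch of the naive AST approach (collect definitions, collect name references, diff them), run against a sample that uses `getattr` dispatch. The sample source and function names are made up for illustration:

```python
# Naive dead-symbol detection: collect function definitions and plain name
# references from the AST, then report definitions with zero references.
# The sample deliberately includes getattr() dispatch, which defeats it.
import ast

SOURCE = '''
def handle_create(): pass
def handle_delete(): pass
def truly_unused(): pass

def dispatch(action):
    # dynamic dispatch: neither handler name appears as a plain reference
    import sys
    return getattr(sys.modules[__name__], f"handle_{action}")()

dispatch("create")
'''

def naive_dead_symbols(source: str) -> set[str]:
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return defined - referenced

print(sorted(naive_dead_symbols(SOURCE)))
# Flags handle_create and handle_delete as dead even though dispatch() can
# call them at runtime -- truly_unused is the only genuinely dead symbol.
```

Every false positive here comes from the same root cause: the reference exists at runtime but never appears as a name in the syntax tree.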
Most tools seem to pick one of two bad tradeoffs:
1. be conservative and miss lots of genuinely dead code
or
2. be aggressive and flag false positives that people stop trusting
What’s worked best for me so far is treating detection results as confidence scores rather than binary verdicts, plus layering in limited runtime info (e.g. what actually executed during tests) instead of relying on 100% static analysis.
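As a rough sketch of what "confidence score plus runtime info" can mean in practice: start from the static candidates, then demote anything that actually executed under tests. Everything here is hypothetical (function names, the score values, the data shapes); a real tool would pull executed lines from coverage.py's recorded data rather than a hand-built dict:

```python
# Hedged sketch: combine static dead-code candidates with test-run coverage.
# All names and score values are illustrative, not any tool's actual API.

def score_candidates(static_candidates, executed_lines):
    """static_candidates: {symbol: (file, line)} from static analysis.
    executed_lines: {file: set of line numbers hit during the test run}."""
    scored = {}
    for symbol, (path, line) in static_candidates.items():
        if line in executed_lines.get(path, set()):
            scored[symbol] = 0.0   # executed under tests: almost certainly live
        else:
            scored[symbol] = 0.7   # statically unreferenced AND never ran
    return scored

candidates = {"old_helper": ("utils.py", 42), "handle_create": ("views.py", 10)}
coverage = {"views.py": {9, 10, 11}}  # handle_create ran via a route test
print(score_candidates(candidates, coverage))
# old_helper keeps a high dead-code score; handle_create drops to 0.0
```

The point is that runtime evidence only ever lowers a score: executed code is definitely live, but unexecuted code might just be untested.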
Curious how others handle this in real codebases.
Do y'all just accept false positives, or ignore dead code detection entirely? Has anyone seen approaches that actually scale? I'm aware that SonarQube is very noisy.
I built a library with a VS Code extension, mainly to explore these tradeoffs (link below if relevant), but I'm more interested in how others think about the problem. Also, hope I'm in the right channel.
Repo for context: https://github.com/duriantaco/skylos

> What’s worked best for me so far is treating the code as sort of a confidence score, plus some layering in limited runtime info (e.g. what actually executed during tests) instead of relying on 100% static analysis.

> Curious how others handle this in real codebases..

I'd argue that for large Python codebases, having high automated test coverage is essential -- mainly unit tests of logic, but also a few heavier integration tests & smoke tests to confirm that the units can actually be wired together and executed in some fashion.

So, assuming [a] you're starting with a healthy Python codebase with great automated test coverage, the impact of deleting code that appears to be dead due to a false positive is actually quite low: it's immediately caught by the automated test suite, either pre-commit or pre-merge, so you back out the proposed delete, cross it off your list, and try the next ones.

If the full test suite takes 5 minutes or less to run, no problem. If the full test suite takes 12 hours to run... ugh. I guess you could work around that with a statistical approach: make a few different branches, each applying a different random sample of 50 of the proposed 100 code deletions, then kick them all off to run in CI overnight.

My main memory of Python dead code deletion is from a Python project around a decade ago. The codebase was fresh, maybe 1-2 years old, worked on by various teams of contractors, but the focus had reasonably been on bashing out features, getting the app into production, and rapidly iterating as user feedback arrived and requirements oscillated. So there was a bit of accumulated cruft. I was new to the project and suggested we could use the vulture [1] dead code scanner. One of the other devs, who had been on the project since the start, had a bit of spare time one afternoon, applied it, found a bunch of dead code, and deleted it all.
The project had OK test coverage & we could manually sanity-check each proposed delete during code review. It was a quick win. It's OK if you don't delete all the dead code; deleting the 80% that's easy to identify and low-risk to remove is pretty good, then everyone can get back to shovelling more features into prod.

[a] If a Python codebase does not have high coverage of tests that exercise specific requirements and pieces of functionality, the project is in an unhealthy state, and that needs to be addressed before attempting any refactors, fixes, or feature work. One way to dig out of that hole is to layer on automated end-to-end regression tests that assert the behaviour (whatever it happens to be, quirks/defects and all) hasn't changed. Such tests are a lot worse than fast, specific tests of requirements (they merely detect whether behaviour is changing, not whether it meets or breaks requirements), and they require a lot of toil to maintain, but they're significantly better than nothing. Provided you've got a wide enough sample of scenarios driving them, they at least let you tell with some confidence whether a proposed refactor causes the app to crash. That safety net then gives you the confidence to make surgical changes (while writing specific unit tests). This is the general approach advocated by Feathers' Working Effectively with Legacy Code (not Python-specific).

Agree that test coverage is the ultimate safety net for false positives, but I'd push back on relying on it as the primary strategy. "Delete it and see if tests break" works at small scale, but the feedback cycle gets painful fast - you're burning CI minutes on speculative deletions, and the FPs erode trust in the tool. After a few rounds of "oh, that wasn't actually dead," people stop bothering.
The better approach is reducing false positives at the analysis level, so the list you hand to a developer is actionable without requiring a test run for each item. We built a dead code scanner (Swynx - https://swynx.io) and validated it against 1,100+ open source repos. A few things that moved the needle on Python specifically:

File-level reachability first, then export-level. Instead of parsing the AST for individual call sites (where dynamic dispatch kills you), we do a BFS from every entry point (__main__.py, framework convention files, package.json main/bin, etc.) and follow the import graph. If no path from any entry point reaches a module, it's dead with very high confidence. This sidesteps a huge class of FPs because you're not trying to resolve getattr(module, f"handle_{action}") — you're just asking "does anything import this module at all?"

__getattr__ is a sleeper problem in Python. When a module defines __getattr__, any sibling module becomes dynamically reachable via attribute access. We had to add a rule: if __init__.py defines __getattr__, treat all sibling .py files and sub-packages as live. Same idea with importlib.import_module() - if we see it in the BFS with a resolvable dotted path, we follow it.

Scale reveals your bugs. When you scan 1,000 repos, any repo showing >10% dead code is almost certainly a scanner bug, not actually 10% dead. That feedback loop drove our false positive rate down to ~1.7% across hundreds of repos - well below the threshold where people stop trusting the output.

Re: Vulture - I actually just ran Skylos (the OP's tool) against its own codebase. It found some real stuff: constants like FRAMEWORK_FUNCTIONS and ENTRY_POINT_DECORATORS defined in framework_aware.py but never referenced anywhere. Genuinely dead code in a dead code detection tool. But it also flagged its own pytest plugin hooks (pytest_collection_finish, pytest_fixture_setup) as unused at confidence=60 - exactly the framework convention problem the OP described.
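The file-level reachability pass described above boils down to a graph traversal. A minimal sketch, with a hand-built import graph (a real scanner would derive the edges from parsed import statements, and widen them for __getattr__ / importlib.import_module as described):

```python
# Sketch of file-level reachability: modules are nodes in an import graph,
# BFS from the entry points, and any module no entry point can reach is a
# high-confidence dead-file candidate. Graph and paths are illustrative.
from collections import deque

def unreachable_modules(import_graph, entry_points):
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        module = queue.popleft()
        for imported in import_graph.get(module, ()):
            if imported not in seen:
                seen.add(imported)
                queue.append(imported)
    return set(import_graph) - seen

graph = {
    "pkg/__main__.py": ["pkg/app.py"],
    "pkg/app.py": ["pkg/db.py"],
    "pkg/db.py": [],
    "pkg/legacy.py": ["pkg/db.py"],  # imports things, but nothing imports it
}
print(unreachable_modules(graph, ["pkg/__main__.py"]))
# -> {'pkg/legacy.py'}: unreachable from every entry point
```

Note that outgoing edges don't save a module: pkg/legacy.py imports live code, but since nothing reaches it, it's still flagged.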
The confidence scoring approach is right, but the framework awareness has to be deep enough to catch your own patterns. To the OP's original question: you shouldn't accept false positives, and you shouldn't ignore dead code detection. The answer is two passes - file-level reachability (high confidence, low noise) followed by export-level analysis for files that are reachable but partially dead.

First off, thanks a lot for taking the time to run Skylos against its own repo! Getting those hooks flagged at confidence=60 is exactly why deeper framework awareness is an immediate priority for us.

The two-pass approach with Swynx is cool! Doing a BFS for file-level reachability first to sidestep the AST dynamic-dispatch nightmare makes sense. If the module isn't even in the import graph, the confidence is practically 100%. And great callout on `__getattr__` and `importlib` - those dynamic edge cases really are sleeper problems for Python static analysis.

Thanks for the check on the pytest hooks and for sharing the Swynx architecture! It's great to see how you guys tackled the dynamic nature of Python.