Summary
The PR under review attempts to make progress towards the age-old problem of submodule imports, specifically with regards to their treatment by the rule unused-import (F401).
Some related issues:
Prior art:
- [pyflakes] Fix check of unused imports #13965
- Consider submodule imports to detect unused imports #5010
- Consider submodule imports to detect unused imports #5011
- Handle multiple unused submodule imports #666
One distinguishing feature of this PR is that we have attempted to be less ambitious in the hopes that this will improve the chances of landing nonzero progress into main.
What the People Want
Users expect to see the following behavior in these common situations:
# Example 1 import a # ok import a.b # F401: unused-import a.foo()
# Example 2 import a # F401: unused-import import a.b # ok a.b.bar()
# Example 3 import a # ok import a.b # ok a.foo() a.b.bar()
The following situations are unintuitive to users, but are valid Python and we are probably forced to proceed as indicated:
# Example 4 import a.b # ok a.foo()
# Example 5 import a.b # F401: unused-import import a.c # ok a.x.baz()
The goal of this PR is to modify the implementation of unused-import to match this expectation.
Modeling the Python in Users' Brains
Given the expected behavior above, how might a typical user be modeling Python in their minds? One possibility is as follows. Upon seeing the statement:
the user thinks:
[User]: We have made available the symbol
a, which is a module, but we are only allowed to access members of this module that have the forma.b.*.
This happens to be incorrect. Rather than restrict what is available to access on a, the opposite happens:
[Python]: We have executed the equivalent of the statement
import a, and, in particular, can access any top-level member ofamade available by this import. We then import the submoduleb, making possibly even more members available.
Similarly, when this user sees:
import a import a.b a.foo() a.b.bar()
they think:
[User] The line
a.foo()is only allowed because of the first import. So that is definitely needed. The second import may or may not be needed depending on whethera/__init__.pyimportsb.
when really the opposite is true:
[Python] The first import is totally redundant and is not used. The call
a.foo()references the symbolathat is being loaded in on the second line.
These views are actually incompatible, so it is not possible to model the user's mind without running afoul of Examples 4 and 5 above.
The Worst of Both Worlds
To model both the behavior that the user expects and the unintuitive behavior that Python allows, we adopt an approach I'll call "Thanks I Hate It (TIHI)". It is the worst of both worlds, and it alters how we think about accessing members of a module. Again we will consider the example:
import a import a.b a.foo() a.b.bar()
[TIHI]: Following Python, the import statements both make available the symbol
aand all of it's top-level members. The second import statement also makes available the members of the forma.b.*. However, the attribute calla.foowill accessavia the first import statement and the calla.b.barwill accessa.bvia the second.
In general the TIHI model will resolve an attribute load of the form a.a1...an (for a module a) by:
- Finding the maximum prefix match amongst all available
import statementsimport a.b1...bm, - Among these, finding those of minimal path-length, and
- Adding a reference to the binding created by the nearest
of these.
What we actually do
Rather than force Ruff's semantic model to be incorrect by actually implementing TIHI, we hack TIHI into the logic of unused-import itself. Doing this is somewhat awkward due to the nature of the implementation of unused-import.
Let me briefly review the current implementation and then explain what we do differently.
Current implementation
After having the semantic model visit the entire module, we look at the current live import bindings that have no references. There are various other filters and logic then applied, but our essential starting bucket for this rule consists of that: bindings to import statements that are live at the end of the file and have no references.
In particular, if the module consists solely of:
then our starting "bucket" where we might emit a lint does not include import a since the symbol a is shadowed by the statement import a.b.
This PR
From now on we will refer to the current behavior as the "stable" behavior. In this PR we introduce preview behavior under very specific assumptions designed to limit the scope of this PR to handle the most common cases where users get tripped up. If any of these assumptions are not met then we revert to the stable behavior.
To begin, instead of only considering import bindings that are live at the end of the module, we also consider all bindings shadowed by these. We assume that all of these bindings are also import bindings - specifically simple imports and submodule imports without aliases, not from imports. If not - revert to stable behavior. Next, we have to decide which of these statements have been "used" in the sense of the "TIHI" model above. So we iterate through the references to the last binding and decide which import statement is intuitively being referenced, according to the matching logic we described earlier.
As a technical note here, a call like a.b.foo() will create a reference to the symbol a - so we will not see the full attribute access. We must therefore crawl up the AST and collect the qualified name of the attribute access. We assume that all references arise from a Name node or from appearance in a binding to __all__ - if we find some other kind of reference, we revert to stable behavior.
Having marked each import statement as used or unused, we collect the unused ones and then proceed as in the stable behavior.
Questions
What if the submodule import has a side-effect and so is used? For example, if
a/b.pycontainsimport a.xbuta/__init__.pydoes not. Thena.x.bazreally does needimport a.b.
We already don't model side-effects, even for simple module imports like import a. So the user would be required to suppress the lint for such an anti-pattern.
Shouldn't you do ____ to avoid allocating and performance regressions?
Maybe! The benchmarks did not show a regression, but if we want to pre-emptively do that I'm happy to. Some places where it may make sense are:
- We currently allocate a vector of size 1 in a common situation where we "bail" to the stable behavior when iterating over bindings. This is to make the opaque type of various iterators the same. But we could avoid that.
- We could use something like
SmallVecwhen collecting candidate unused import bindings shadowing a given one, since there will often not be very many - We could use a bitmask to mark used bindings (for a given symbol) as long as we are only dealing with < 64 of them and fallback to
Vec<bool>otherwise - ... probably other things I'm not thinking of!