[`pyflakes`] Handle some common submodule import situations for `unused-import` (`F401`) by dylwil3 · Pull Request #20200 · astral-sh/ruff

Summary

The PR under review attempts to make progress towards the age-old problem of submodule imports, specifically with regards to their treatment by the rule unused-import (F401).

Some related issues:

Prior art:

One distinguishing feature of this PR is that we have attempted to be less ambitious in the hopes that this will improve the chances of landing nonzero progress into main.

What the People Want

Users expect to see the following behavior in these common situations:

# Example 1
import a    # ok
import a.b  # F401: unused-import

a.foo()

# Example 2
import a    # F401: unused-import
import a.b  # ok

a.b.bar()

# Example 3
import a    # ok
import a.b  # ok

a.foo()
a.b.bar()

The following situations are unintuitive to users, but are valid Python and we are probably forced to proceed as indicated:

# Example 4
import a.b  # ok

a.foo()

# Example 5
import a.b  # F401: unused-import
import a.c  # ok

a.x.baz()

The goal of this PR is to modify the implementation of unused-import to match this expectation.

Modeling the Python in Users' Brains

Given the expected behavior above, how might a typical user be modeling Python in their minds? One possibility is as follows. Upon seeing the statement:

the user thinks:

[User]: We have made available the symbol a, which is a module, but we are only allowed to access members of this module that have the form a.b.*.

This happens to be incorrect. Rather than restrict what is available to access on a, the opposite happens:

[Python]: We have executed the equivalent of the statement import a, and, in particular, can access any top-level member of a made available by this import. We then import the submodule b, making possibly even more members available.

Similarly, when this user sees:

import a
import a.b

a.foo()
a.b.bar()

they think:

[User] The line a.foo() is only allowed because of the first import. So that is definitely needed. The second import may or may not be needed depending on whether a/__init__.py imports b.

when really the opposite is true:

[Python] The first import is totally redundant and is not used. The call a.foo() references the symbol a that is being loaded in on the second line.

These views are actually incompatible, so it is not possible to model the user's mind without running afoul of Examples 4 and 5 above.

The Worst of Both Worlds

To model both the behavior that the user expects and the unintuitive behavior that Python allows, we adopt an approach I'll call "Thanks I Hate It (TIHI)". It is the worst of both worlds, and it alters how we think about accessing members of a module. Again we will consider the example:

import a
import a.b

a.foo()
a.b.bar()

[TIHI]: Following Python, the import statements both make available the symbol a and all of it's top-level members. The second import statement also makes available the members of the form a.b.*. However, the attribute call a.foo will access a via the first import statement and the call a.b.bar will access a.b via the second.

In general the TIHI model will resolve an attribute load of the form a.a1...an (for a module a) by:

Finding the maximum prefix match amongst all available
import statements import a.b1...bm,
Among these, finding those of minimal path-length, and
Adding a reference to the binding created by the nearest
of these.

What we actually do

Rather than force Ruff's semantic model to be incorrect by actually implementing TIHI, we hack TIHI into the logic of unused-import itself. Doing this is somewhat awkward due to the nature of the implementation of unused-import.

Let me briefly review the current implementation and then explain what we do differently.

Current implementation

After having the semantic model visit the entire module, we look at the current live import bindings that have no references. There are various other filters and logic then applied, but our essential starting bucket for this rule consists of that: bindings to import statements that are live at the end of the file and have no references.

In particular, if the module consists solely of:

then our starting "bucket" where we might emit a lint does not include import a since the symbol a is shadowed by the statement import a.b.

This PR

From now on we will refer to the current behavior as the "stable" behavior. In this PR we introduce preview behavior under very specific assumptions designed to limit the scope of this PR to handle the most common cases where users get tripped up. If any of these assumptions are not met then we revert to the stable behavior.

To begin, instead of only considering import bindings that are live at the end of the module, we also consider all bindings shadowed by these. We assume that all of these bindings are also import bindings - specifically simple imports and submodule imports without aliases, not from imports. If not - revert to stable behavior. Next, we have to decide which of these statements have been "used" in the sense of the "TIHI" model above. So we iterate through the references to the last binding and decide which import statement is intuitively being referenced, according to the matching logic we described earlier.

As a technical note here, a call like a.b.foo() will create a reference to the symbol a - so we will not see the full attribute access. We must therefore crawl up the AST and collect the qualified name of the attribute access. We assume that all references arise from a Name node or from appearance in a binding to __all__ - if we find some other kind of reference, we revert to stable behavior.

Having marked each import statement as used or unused, we collect the unused ones and then proceed as in the stable behavior.

Questions

What if the submodule import has a side-effect and so is used? For example, if a/b.py contains import a.x but a/__init__.py does not. Then a.x.baz really does need import a.b.

We already don't model side-effects, even for simple module imports like import a. So the user would be required to suppress the lint for such an anti-pattern.

Shouldn't you do ____ to avoid allocating and performance regressions?

Maybe! The benchmarks did not show a regression, but if we want to pre-emptively do that I'm happy to. Some places where it may make sense are:

We currently allocate a vector of size 1 in a common situation where we "bail" to the stable behavior when iterating over bindings. This is to make the opaque type of various iterators the same. But we could avoid that.
We could use something like SmallVec when collecting candidate unused import bindings shadowing a given one, since there will often not be very many
We could use a bitmask to mark used bindings (for a given symbol) as long as we are only dealing with < 64 of them and fallback to Vec<bool> otherwise
... probably other things I'm not thinking of!