Attention Residuals: Rethinking depth-wise aggregation [pdf]
github.com

In [1], I think a commenter actually speculated about a design just like this, where later layers can directly access the outputs of previous layers instead of having to store them in the residual stream.
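To make that concrete, here is a rough NumPy sketch of what "directly accessing earlier outputs" could look like: each layer takes a softmax-weighted mix over the stack of previous layer outputs instead of reading one running sum. This is only my guess at the general shape of the idea, not the formulation from the linked paper. The name `depth_attention` and the single per-layer `query` vector are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_attention(prev_outputs, query):
    """Mix earlier layer outputs with softmax weights taken over depth.

    prev_outputs: list of arrays, each (seq_len, d_model), one per earlier layer.
    query: array (d_model,), a hypothetical learned query for the current layer.
    """
    stacked = np.stack(prev_outputs, axis=0)             # (depth, seq_len, d_model)
    scores = stacked @ query / np.sqrt(query.shape[0])   # (depth, seq_len)
    weights = softmax(scores, axis=0)                    # normalize across layers
    mixed = (weights[..., None] * stacked).sum(axis=0)   # (seq_len, d_model)
    return mixed, weights


# Toy usage: three earlier layers, 4 tokens, model width 8.
rng = np.random.default_rng(0)
outputs = [rng.normal(size=(4, 8)) for _ in range(3)]
query = rng.normal(size=8)
mixed, weights = depth_attention(outputs, query)
```

With an all-zero query the weights are uniform, so the mix reduces to a plain average of the earlier outputs, which is the ordinary residual sum up to a 1/depth scale. Anything the query learns beyond that is the layer choosing which earlier outputs to read.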