Adaptive Low-Rank Product Transformers with Dynamic Expert Routing for Online Continual Learning


Published May 7, 2026 | Version v1

Preprint Open

  • Manhattan Metric

Description

Inspired by the role of sleep in biological continual learning, we introduce RVW, a transformer architecture for online continual adaptation of pretrained models. RVW maintains a small pool of per-layer experts that grow and prune in response to distribution shift, with no replay buffer and no explicit task identifier. Applied to TinyLlama-1.1B on a 15,000-chunk, six-domain stream, RVW reaches an average held-out perplexity (PPL) of 40, substantially better than EWC (158), fine-tuning (164), and LoRA (448) on the same parameter-matched base, while preserving prior-domain performance. Threshold sweeps suggest a combinatorial-encoding reading: domain knowledge appears to be carried by routing patterns across layers rather than by individual specialized experts.
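The abstract describes growth-and-prune expert pools with per-layer routing but gives no implementation details. Below is a minimal PyTorch sketch of the general idea only: a pool of low-rank residual adapters whose router matches each incoming chunk against learned expert keys and grows a new expert when no key fits well. All names (LowRankExpert, ExpertPool), the cosine-similarity router, and the threshold tau are illustrative assumptions, not the authors' actual RVW mechanism; pruning of rarely selected experts is omitted for brevity.

```python
# Hedged sketch of a per-layer low-rank expert pool with dynamic growth.
# This is NOT the RVW implementation; names and the key-matching router
# are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankExpert(nn.Module):
    """Rank-r residual adapter: x + up(down(x))."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as an identity residual

    def forward(self, x):
        return x + self.up(self.down(x))

class ExpertPool(nn.Module):
    """Per-layer pool that routes each chunk to its best-matching expert
    and grows a new expert when no existing key matches well, which here
    serves as a crude proxy for detecting distribution shift."""
    def __init__(self, d_model: int, rank: int = 8, tau: float = 0.3):
        super().__init__()
        self.d_model, self.rank, self.tau = d_model, rank, tau
        self.experts = nn.ModuleList([LowRankExpert(d_model, rank)])
        self.keys = nn.ParameterList([nn.Parameter(torch.randn(d_model))])

    def forward(self, x):  # x: (batch, seq, d_model)
        query = x.mean(dim=(0, 1))            # summary vector for the chunk
        keys = torch.stack(list(self.keys))   # (n_experts, d_model)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)
        best = int(sims.argmax())
        if sims[best] < self.tau:             # no expert fits: grow the pool
            self.experts.append(LowRankExpert(self.d_model, self.rank))
            self.keys.append(nn.Parameter(query.detach().clone()))
            best = len(self.experts) - 1
        return self.experts[best](x)

# Usage: route a chunk of hidden states through one layer's pool.
pool = ExpertPool(d_model=64)
x = torch.randn(2, 16, 64)
y = pool(x)  # routes to the best expert, growing if the chunk looks novel
```

Under this reading, the "combinatorial encoding" claim would correspond to a domain being identified by the vector of per-layer expert selections rather than by any single expert's weights.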

Files

rvw.pdf (531.1 kB)