The AI Safety Puzzle Everyone Avoids: How to Measure Impact, Not Intent

lesswrong.com

1 point by patrick0d 5 months ago · 1 comment

patrick0d (OP) · 5 months ago

I am an AI interpretability researcher with a new proposal for measuring the per-token contribution of each head and neuron in LLMs. I found that the normalisation applied in every LLM is sidestepped by modern attribution methods, even though it has a large impact on the model's computation.

Here is the full preprint paper and the code I used: https://github.com/patrickod32/landed_writes. Happy to get insight from anyone interested, and I'd like to know whether other people here have been working on anything similar. This seems like a real gap in the research to me.
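As a rough illustration of why the normalisation step matters for attribution (this is not the method from the preprint; all names, shapes, and numbers below are hypothetical), here is a minimal PyTorch sketch comparing a naive direct-logit-style attribution of one head's residual-stream write with a version that freezes the final LayerNorm's per-token scale, and with the actual change in post-LayerNorm logits when that write is removed:

```python
# Minimal sketch (hypothetical shapes, not the preprint's method): how a final
# LayerNorm's per-token rescaling changes an attribution estimate for one head.
import torch

torch.manual_seed(0)
d_model, vocab = 16, 50

resid_final = torch.randn(d_model)       # full residual stream at one position
head_write = 0.3 * torch.randn(d_model)  # the write one head added earlier
W_U = torch.randn(d_model, vocab)        # unembedding matrix
ln_w = torch.ones(d_model)               # LayerNorm gain (gamma)

def layernorm(x, w, eps=1e-5):
    x = x - x.mean()
    return w * x / torch.sqrt(x.var(unbiased=False) + eps)

# (a) Naive attribution: ignore the final normalisation entirely.
naive = head_write @ W_U

# (b) "Frozen" LayerNorm: treat the per-token scale computed on the *full*
#     residual stream as a constant and pass the head's write through it.
centered = resid_final - resid_final.mean()
scale = torch.sqrt(centered.var(unbiased=False) + 1e-5)
frozen = (ln_w * (head_write - head_write.mean()) / scale) @ W_U

# (c) Finite difference: how much the post-LayerNorm logits actually change
#     when the head's write is removed from the residual stream.
actual = (layernorm(resid_final, ln_w)
          - layernorm(resid_final - head_write, ln_w)) @ W_U

target = 7  # some logit index
print(naive[target].item(), frozen[target].item(), actual[target].item())
```

The three numbers generally differ, which is the point: because the normalisation's scale depends on the full residual stream at each position, an attribution that skips or freezes it can over- or under-state a head's per-token contribution.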
