This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model. Our hope is that building on this automated approach to interpretability will eventually enable us to comprehensively audit the safety of models before deployment.
Our technique seeks to explain what patterns in text cause a neuron to activate. It consists of three steps:
- Explain the neuron's activations using GPT-4
- Simulate activations using GPT-4, conditioning on the explanation
- Score the explanation by comparing the simulated and real activations
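The three steps above can be sketched as a pipeline. The helper names (`explain_neuron`, `simulate_activations`) and the stubbed model calls are hypothetical stand-ins for GPT-4 queries, not the paper's actual implementation; the scoring step is shown here as a Pearson correlation between simulated and real activations, one natural choice of comparison.

```python
import math

def explain_neuron(tokens, activations):
    # Hypothetical stand-in: in the real pipeline, GPT-4 is shown
    # (token, activation) pairs and asked for a natural-language
    # explanation of what makes the neuron fire.
    return "fires on numeric tokens"

def simulate_activations(tokens, explanation):
    # Hypothetical stand-in: in the real pipeline, GPT-4 predicts an
    # activation for each token, conditioned only on the explanation.
    return [1.0 if t.isdigit() else 0.0 for t in tokens]

def score_explanation(real, simulated):
    # Score = Pearson correlation between real and simulated
    # activations (an assumed scoring choice for this sketch).
    n = len(real)
    mr = sum(real) / n
    ms = sum(simulated) / n
    cov = sum((r - mr) * (s - ms) for r, s in zip(real, simulated))
    sr = math.sqrt(sum((r - mr) ** 2 for r in real))
    ss = math.sqrt(sum((s - ms) ** 2 for s in simulated))
    return cov / (sr * ss)

tokens = ["the", "year", "1987", "was", "42", "days"]
real = [0.1, 0.0, 3.2, 0.1, 2.9, 0.2]          # observed activations
explanation = explain_neuron(tokens, real)      # step 1: explain
simulated = simulate_activations(tokens, explanation)  # step 2: simulate
score = score_explanation(real, simulated)      # step 3: score
```

A high score means the explanation, on its own, is enough to predict where the neuron fires; a low score means the explanation misses or overstates the pattern.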