https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
This paper uses automation to scale an interpretability technique to all of the neurons in a large language model. Our hope is that building on this approach to automated interpretability will enable us to comprehensively audit the safety of models before deployment.
Our technique seeks to explain what patterns in text cause a neuron to activate. It consists of three steps, sketched in code after the list:
- Explain the neuron's activations using GPT-4
- Simulate activations using GPT-4, conditioning on the explanation
- Score the explanation by comparing the simulated and real activations
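Below is a minimal sketch of that explain / simulate / score loop. It assumes a generic `ask_model(prompt) -> str` helper wrapping whatever LLM API is used (e.g. GPT-4); the prompts, helper names, and 0-10 activation scale are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of the three-step pipeline; not the paper's exact code.
from typing import Callable, List, Tuple
import numpy as np

def explain_neuron(ask_model: Callable[[str], str],
                   examples: List[Tuple[List[str], List[float]]]) -> str:
    """Step 1: ask the explainer model to summarize what makes the neuron fire.

    `examples` is a list of (tokens, activations) pairs for text excerpts
    where the neuron's real activations are known.
    """
    rendered = "\n".join(
        " ".join(f"{tok} ({act:.1f})" for tok, act in zip(tokens, acts))
        for tokens, acts in examples
    )
    prompt = (
        "The following token sequences show a neuron's activation after each token.\n"
        f"{rendered}\n"
        "In one sentence, what pattern in the text makes this neuron activate?"
    )
    return ask_model(prompt).strip()

def simulate_activations(ask_model: Callable[[str], str],
                         explanation: str,
                         tokens: List[str]) -> List[float]:
    """Step 2: condition only on the explanation and predict an activation
    for every token in a held-out excerpt."""
    prompt = (
        f"A neuron activates for: {explanation}\n"
        "For each token below, output `token<TAB>activation` with an integer 0-10.\n"
        + "\n".join(tokens)
    )
    sims = []
    for line in ask_model(prompt).splitlines():
        parts = line.rsplit("\t", 1)
        sims.append(float(parts[1]) if len(parts) == 2 else 0.0)
    # Pad or trim so the simulation aligns with the real token sequence.
    sims = (sims + [0.0] * len(tokens))[: len(tokens)]
    return sims

def score_explanation(real: List[float], simulated: List[float]) -> float:
    """Step 3: score the explanation by how well the simulated activations
    track the real ones (plain correlation here, as a stand-in for the
    paper's scoring)."""
    real_a, sim_a = np.asarray(real), np.asarray(simulated)
    if real_a.std() == 0 or sim_a.std() == 0:
        return 0.0
    return float(np.corrcoef(real_a, sim_a)[0, 1])
```

In practice a neuron's score would be aggregated over many held-out text excerpts rather than a single one.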