Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt. arXiv preprint arXiv:2211.00593, 2022. | 162 | 2022 |
Supervising strong learners by amplifying weak experts. P Christiano, B Shlegeris, D Amodei. arXiv preprint arXiv:1810.08575, 2018. | 75 | 2018 |
Adversarial training for high-stakes reliability. D Ziegler, S Nix, L Chan, T Bauman, P Schmidt-Nielsen, T Lin, A Scherlis, ... Advances in Neural Information Processing Systems 35, 9274-9286, 2022. | 37 | 2022 |
Causal scrubbing: A method for rigorously testing interpretability hypotheses. L Chan, A Garriga-Alonso, N Goldowsky-Dill, R Greenblatt, ... AI Alignment Forum, 2022. | 34* | 2022 |
Polysemanticity and capacity in neural networks. A Scherlis, K Sachan, AS Jermyn, J Benton, B Shlegeris. arXiv preprint arXiv:2210.01892, 2022. | 13 | 2022 |
Sleeper agents: Training deceptive LLMs that persist through safety training. E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024. | 10 | 2024 |
Gini coefficient calculator. B Shlegeris. Web, 2020. | 6 | 2020 |
AI control: Improving safety despite intentional subversion. R Greenblatt, B Shlegeris, K Sachan, F Roger. arXiv preprint arXiv:2312.06942, 2023. | 5 | 2023 |
Language models are better than humans at next-token prediction. B Shlegeris, F Roger, L Chan, E McLean. arXiv preprint arXiv:2212.11281, 2022. | 4 | 2022 |
Measurement tampering detection benchmark. F Roger, R Greenblatt, M Nadeau, B Shlegeris, N Thomas. arXiv preprint arXiv:2308.15605, 2023. | 2 | 2023 |