Neel Nanda

Cited by

	All	Since 2019
Citations	1927	1926
h-index	13	13
i10-index	14	14

Citations per year: 2021: 5; 2022: 108; 2023: 1069; 2024: 738

Public access

1 article available, 0 articles not available

Based on funding mandates

Co-authors

Lawrence Chan, PhD Student, UC Berkeley. Verified email at berkeley.edu
Catherine Olsson, Anthropic. Verified email at mit.edu
Tom Lieberum, Google DeepMind. Verified email at deepmind.com
Christopher Olah, Anthropic. Verified email at google.com
Bilal Chughtai, Independent. Verified email at cam.ac.uk

Neel Nanda

Research Engineer, Google DeepMind

Verified email at deepmind.com

AI, ML, AI Alignment, Interpretability, Mechanistic Interpretability


Title	Cited by	Year
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022	699	2022
In-context learning and induction heads. C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ... Transformer Circuits Thread, 2022	317*	2022
A Mathematical Framework for Transformer Circuits. N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Transformer Circuits Thread, 2021	281*	2021
Progress Measures For Grokking Via Mechanistic Interpretability. N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt. ICLR 2023 Spotlight, 2023	174*	2023
Predictability and surprise in large generative models. D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022	172	2022
Finding Neurons in a Haystack: Case Studies with Sparse Probing. W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas. Transactions on Machine Learning Research, 2023	49	2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. B Chughtai, L Chan, N Nanda. ICML 2023, 2023	47	2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. N Nanda, A Lee, M Wattenberg. BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023	42*	2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik. arXiv preprint arXiv:2307.09458, 2023	28	2023
TransformerLens: A Library for Mechanistic Interpretability of Language Models. N Nanda, J Bloom. https://github.com/neelnanda-io/TransformerLens, 2022	20*	2022
Softmax Linear Units. N Elhage, T Hume, C Olsson, N Nanda, T Henighan, S Johnston, ... Transformer Circuits Thread, 2022	16	2022
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. F Zhang, N Nanda. ICLR 2024, 2023	14	2023
Attribution Patching: Activation Patching At Industrial Scale. N Nanda. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023	14*	2023
Linear Representations of Sentiment in Large Language Models. C Tigges, OJ Hollinsworth, A Geiger, N Nanda. NeurIPS 2023 Workshop on Attributing Model Behavior at Scale, 2023	12	2023
Copy Suppression: Comprehensively Understanding an Attention Head. C McDougall, A Conmy, C Rushing, T McGrath, N Nanda. NeurIPS 2023 Workshop on Attributing Model Behavior at Scale, 2023	9	2023
A Comprehensive Mechanistic Interpretability Explainer & Glossary. N Nanda. https://neelnanda.io/glossary, 2023	6*	2023
Neuroscope. N Nanda. https://neuroscope.io, 2022	5*	2022
Universal Neurons in GPT2 Language Models. W Gurnee, T Horsley, ZC Guo, TR Kheirkhah, Q Sun, W Hathaway, ... arXiv preprint arXiv:2401.12181, 2024	4	2024
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level. N Nanda, S Rajamanoharan, J Kramár, R Shah. Alignment Forum, https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact …, 2023	4*	2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. A Makelov, G Lange, N Nanda. ICLR 2024, 2023	4*	2023

Articles 1–20