Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt. arXiv preprint arXiv:2211.00593, 2022. | 162 | 2022 |
Supervising strong learners by amplifying weak experts. P Christiano, B Shlegeris, D Amodei. arXiv preprint arXiv:1810.08575, 2018. | 75 | 2018 |
Adversarial training for high-stakes reliability. D Ziegler, S Nix, L Chan, T Bauman, P Schmidt-Nielsen, T Lin, A Scherlis, ... Advances in Neural Information Processing Systems 35, 9274-9286, 2022. | 37 | 2022 |
Causal scrubbing: A method for rigorously testing interpretability hypotheses. L Chan, A Garriga-Alonso, N Goldowsky-Dill, R Greenblatt, ... AI Alignment Forum, 2022. | 34* | 2022 |
Polysemanticity and capacity in neural networks. A Scherlis, K Sachan, AS Jermyn, J Benton, B Shlegeris. arXiv preprint arXiv:2210.01892, 2022. | 13 | 2022 |
Sleeper agents: Training deceptive LLMs that persist through safety training. E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024. | 10 | 2024 |
Gini coefficient calculator. B Shlegeris. Web, 2020. | 6 | 2020 |
AI control: Improving safety despite intentional subversion. R Greenblatt, B Shlegeris, K Sachan, F Roger. arXiv preprint arXiv:2312.06942, 2023. | 5 | 2023 |
Language models are better than humans at next-token prediction. B Shlegeris, F Roger, L Chan, E McLean. arXiv preprint arXiv:2212.11281, 2022. | 4 | 2022 |
Measurement tampering detection benchmark. F Roger, R Greenblatt, M Nadeau, B Shlegeris, N Thomas. arXiv preprint arXiv:2308.15605, 2023. | 2 | 2023 |