Tomasz Korbak
Anthropic
Verified email at anthropic.com
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by: 244, Year: 2023
Pretraining language models with human preferences
T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ...
International Conference on Machine Learning, 17506-17533, 2023
Cited by: 129, Year: 2023
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288, 2023
Cited by: 118*, Year: 2023
Inverse scaling: When bigger isn't better
IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ...
arXiv preprint arXiv:2306.09479, 2023
Cited by: 82*, Year: 2023
Towards understanding sycophancy in language models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by: 71, Year: 2023
Training language models with language feedback at scale
J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez
arXiv preprint arXiv:2303.16755, 2023
Cited by: 69, Year: 2023
Improving code generation by training with natural language feedback
A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ...
arXiv preprint arXiv:2303.16749, 2023
Cited by: 42, Year: 2023
Aligning language models with preferences through f-divergence minimization
D Go, T Korbak, G Kruszewski, J Rozen, N Ryu, M Dymetman
arXiv preprint arXiv:2302.08215, 2023
Cited by: 40, Year: 2023
RL with KL penalties is better viewed as Bayesian inference
T Korbak, E Perez, CL Buckley
arXiv preprint arXiv:2205.11275, 2022
Cited by: 37*, Year: 2022
Computational enactivism under the free energy principle
T Korbak
Synthese 198 (3), 2743-2763, 2021
Cited by: 31, Year: 2021
On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
Advances in Neural Information Processing Systems 35, 16203-16220, 2022
Cited by: 30, Year: 2022
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
arXiv preprint arXiv:2404.09932, 2024
Cited by: 29, Year: 2024
Taken out of context: On measuring situational awareness in LLMs
L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ...
arXiv preprint arXiv:2309.00667, 2023
Cited by: 28*, Year: 2023
Controlling conditional language models without catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
International Conference on Machine Learning, 11499-11528, 2022
Cited by: 25, Year: 2022
Interaction history as a source of compositionality in emergent communication
T Korbak, J Zubek, Ł Kuciński, P Miłoś, J Rączaszek-Leonardi
Interaction Studies 22 (2), 212-243, 2021
Cited by: 17*, Year: 2021
Many-shot jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ...
Anthropic, April, 2024
Cited by: 15*, Year: 2024
Catalytic role of noise and necessity of inductive biases in the emergence of compositional communication
Ł Kuciński, T Korbak, P Kołodziej, P Miłoś
Advances in Neural Information Processing Systems 34, 23075-23088, 2021
Cited by: 14, Year: 2021
Scaffolded minds and the evolution of content in signaling pathways
T Korbak
Studies in Logic, Grammar and Rhetoric 41 (1), 89-103, 2015
Cited by: 10, Year: 2015
Measuring non-trivial compositionality in emergent communication
T Korbak, J Zubek, J Rączaszek-Leonardi
arXiv preprint arXiv:2010.15058, 2020
Cited by: 9, Year: 2020
Energy-based models for code generation under compilability constraints
T Korbak, H Elsahar, M Dymetman, G Kruszewski
arXiv preprint arXiv:2106.04985, 2021
Cited by: 8, Year: 2021