Buck Shlegeris
CTO, Redwood Research
Verified email at rdwrs.com
Title
Cited by
Year
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small
K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt
arXiv preprint arXiv:2211.00593, 2022
Cited by: 160; Year: 2022
Supervising strong learners by amplifying weak experts
P Christiano, B Shlegeris, D Amodei
arXiv preprint arXiv:1810.08575, 2018
Cited by: 74; Year: 2018
Adversarial training for high-stakes reliability
D Ziegler, S Nix, L Chan, T Bauman, P Schmidt-Nielsen, T Lin, A Scherlis, ...
Advances in Neural Information Processing Systems 35, 9274-9286, 2022
Cited by: 35; Year: 2022
Causal scrubbing: A method for rigorously testing interpretability hypotheses
L Chan, A Garriga-Alonso, N Goldowsky-Dill, R Greenblatt, ...
AI Alignment Forum, 1828-1843, 2022
Cited by: 33*; Year: 2022
Polysemanticity and capacity in neural networks
A Scherlis, K Sachan, AS Jermyn, J Benton, B Shlegeris
arXiv preprint arXiv:2210.01892, 2022
Cited by: 13; Year: 2022
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 9; Year: 2024
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small, 2022
K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt
URL https://arxiv.org/abs/2211.00593
Cited by: 8
Gini coefficient calculator
B Shlegeris
Web, 2020
Cited by: 6; Year: 2020
AI control: Improving safety despite intentional subversion
R Greenblatt, B Shlegeris, K Sachan, F Roger
arXiv preprint arXiv:2312.06942, 2023
Cited by: 5; Year: 2023
Language models are better than humans at next-token prediction
B Shlegeris, F Roger, L Chan, E McLean
arXiv preprint arXiv:2212.11281, 2022
Cited by: 4; Year: 2022
Measurement tampering detection benchmark
F Roger, R Greenblatt, M Nadeau, B Shlegeris, N Thomas
arXiv preprint arXiv:2308.15605, 2023
Cited by: 2; Year: 2023