‪Buck Shlegeris‬ - ‪Google Scholar‬

Cited by

	All	Since 2019
Citations	349	345
h-index	7	7
i10-index	5	5

Year	2018	2019	2020	2021	2022	2023	2024
Citations	4	14	7	10	24	168	118

Buck Shlegeris

CTO, Redwood Research

Verified email at rdwrs.com


Title	Cited by	Year
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt arXiv preprint arXiv:2211.00593, 2022	160	2022
Supervising strong learners by amplifying weak experts P Christiano, B Shlegeris, D Amodei arXiv preprint arXiv:1810.08575, 2018	74	2018
Adversarial training for high-stakes reliability D Ziegler, S Nix, L Chan, T Bauman, P Schmidt-Nielsen, T Lin, A Scherlis, ... Advances in Neural Information Processing Systems 35, 9274-9286, 2022	35	2022
Causal scrubbing: A method for rigorously testing interpretability hypotheses L Chan, A Garriga-Alonso, N Goldowsky-Dill, R Greenblatt, ... AI Alignment Forum, 1828-1843, 2022	33*	2022
Polysemanticity and capacity in neural networks A Scherlis, K Sachan, AS Jermyn, J Benton, B Shlegeris arXiv preprint arXiv:2210.01892, 2022	13	2022
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024	9	2024
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small, 2022 K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt URL https://arxiv.org/abs/2211.00593, 0	8
Gini coefficient calculator B Shlegeris Web, 2020	6	2020
AI control: Improving safety despite intentional subversion R Greenblatt, B Shlegeris, K Sachan, F Roger arXiv preprint arXiv:2312.06942, 2023	5	2023
Language models are better than humans at next-token prediction B Shlegeris, F Roger, L Chan, E McLean arXiv preprint arXiv:2212.11281, 2022	4	2022
Measurement tampering detection benchmark F Roger, R Greenblatt, M Nadeau, B Shlegeris, N Thomas arXiv preprint arXiv:2308.15605, 2023	2	2023
