Polysemanticity and capacity in neural networks A Scherlis, K Sachan, AS Jermyn, J Benton, B Shlegeris arXiv preprint arXiv:2210.01892, 2022 | 13 | 2022 |
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 9 | 2024 |
AI control: Improving safety despite intentional subversion R Greenblatt, B Shlegeris, K Sachan, F Roger arXiv preprint arXiv:2312.06942, 2023 | 5 | 2023 |
Debating with More Persuasive LLMs Leads to More Truthful Answers A Khan, J Hughes, D Valentine, L Ruis, K Sachan, A Radhakrishnan, ... arXiv preprint arXiv:2402.06782, 2024 | 3 | 2024 |