
Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. TMLR https://openreview.net/forum?id=oVTkOs8Pka (2024).
Lynch, A. et al. Agentic misalignment: how LLMs could be insider threats. Preprint at arxiv.org/abs/2510.05179 (2025).
Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154 (2024).
Betley, J. et al. Emergent misalignment: narrow finetuning can produce broadly misaligned LLMs. In Proc. 42nd International Conference on Machine Learning (eds Singh, A. et al.) Vol. 267, 4043–4068 (PMLR, 2025).
Hurst, A. et al. GPT-4o system card. Preprint at arxiv.org/abs/2410.21276 (2024).
Pichai, S., Hassabis, D. & Kavukcuoglu, K. Introducing Gemini 2.0: our new AI model for the agentic era. Google DeepMind https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ (2024).
Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at arxiv.org/abs/2212.08073 (2022).
Guan, M. Y. et al. Deliberative alignment: reasoning enables safer language models. Super Intell. Robot. Saf. Align. 2, https://doi.org/10.70777/si.v2i3.15159 (2025).
Dragan, A., Shah, R., Flynn, F. & Legg, S. Taking a responsible path to AGI. Google DeepMind https://deepmind.google/discover/blog/take-a-responsible-path-to-agi/ (2025).
Wei, J. et al. Emergent abilities of large language models. TMLR https://openreview.net/forum?id=yzkSU5zdwD (2022).
Greenblatt, R. et al. Alignment faking in large language models. Preprint at arxiv.org/abs/2412.14093 (2024).
Meinke, A. et al. Frontier models are capable of in-context scheming. Preprint at arxiv.org/abs/2412.04984 (2025).
Langosco, L. L. D., Koch, J., Sharkey, L. D., Pfau, J. & Krueger, D. Goal misgeneralization in deep reinforcement learning. In Proc. 39th International Conference on Machine Learning Vol. 162, 12004–12019 (PMLR, 2022).
Amodei, D. et al. Concrete problems in AI safety. Preprint at arxiv.org/abs/1606.06565 (2016).
Denison, C. et al. Sycophancy to subterfuge: investigating reward-tampering in large language models. Preprint at arxiv.org/abs/2406.10162 (2024).
Sharma, M. et al. Towards understanding sycophancy in language models. In Proc. 12th International Conference on Learning Representations (ICLR, 2024).
Qi, X. et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. 12th International Conference on Learning Representations (ICLR, 2024).
Hubinger, E. et al. Sleeper agents: training deceptive LLMs that persist through safety training. Preprint at arxiv.org/abs/2401.05566 (2024).
Pan, A. et al. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In Proc. 40th International Conference on Machine Learning (PMLR, 2023).
Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (eds Muresan, S. et al.) Vol. 1, 3214–3252 (Association for Computational Linguistics, 2022).
Snell, C., Klein, D. & Zhong, R. Learning by distilling context. Preprint at arxiv.org/abs/2209.15189 (2022).
Turner, E., Soligo, A., Taylor, M., Rajamanoharan, S. & Nanda, N. Model organisms for emergent misalignment. Preprint at arxiv.org/abs/2506.11613 (2025).
Chua, J., Betley, J., Taylor, M. & Evans, O. Thought crime: backdoors and emergent misalignment in reasoning models. Preprint at arxiv.org/abs/2506.13206 (2025).
Taylor, M., Chua, J., Betley, J., Treutlein, J. & Evans, O. School of reward hacks: hacking harmless tasks generalizes to misaligned behavior in LLMs. Preprint at arxiv.org/abs/2508.17511 (2025).
Wang, M. et al. Persona features control emergent misalignment. Preprint at arxiv.org/abs/2506.19823 (2025).
Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: generalization beyond overfitting on small algorithmic datasets. Preprint at arxiv.org/abs/2201.02177 (2022).
Askell, A. et al. A general language assistant as a laboratory for alignment. Preprint at arxiv.org/abs/2112.00861 (2021).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
Perry, N., Srivastava, M., Kumar, D. & Boneh, D. Do users write more insecure code with AI assistants? In Proc. 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23) (ACM, 2023).
Grabb, D., Lamparth, M. & Vasan, N. Risks from language models for automated mental healthcare: ethics and structure for implementation. In Proc. First Conference on Language Modeling (COLM, 2024).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. International Conference on Learning Representations (ICLR, 2022).
Mu, T. et al. Rule-based rewards for language model safety. In Advances in Neural Information Processing Systems Vol. 37, 108877–108901 (NeurIPS, 2024).
Arditi, A. et al. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems Vol. 37 (NeurIPS, 2024).
Chen, R., Arditi, A., Sleight, H., Evans, O. & Lindsey, J. Persona vectors: monitoring and controlling character traits in language models. Preprint at arxiv.org/abs/2507.21509 (2025).
Dunefsky, J. & Cohan, A. One-shot optimized steering vectors mediate safety-relevant behaviors in LLMs. In Proc. Second Conference on Language Modeling (COLM, 2025).
Soligo, A., Turner, E., Rajamanoharan, S. & Nanda, N. Convergent linear representations of emergent misalignment. Preprint at arxiv.org/abs/2506.11618 (2025).
Casademunt, H., Juang, C., Marks, S., Rajamanoharan, S. & Nanda, N. Steering fine-tuning generalization with targeted concept ablation. In Proc. ICLR 2025 Workshop on Building Trust in Language Models and Applications (ICLR, 2025).
Ngo, R., Chan, L. & Mindermann, S. The alignment problem from a deep learning perspective. In Proc. 12th International Conference on Learning Representations (ICLR, 2024).
Davies, X. et al. Fundamental limitations in defending LLM finetuning APIs. In Proc. 39th Annual Conference on Neural Information Processing Systems (NeurIPS, 2025).
Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proc. 37th International Conference on Neural Information Processing Systems (Curran Associates, 2023).
Warncke, N., Betley, J. & Tan, D. Emergent Misalignment/Emergent Misalignment: Version 1 (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.17494472 (2025).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7, 1526–1541 (2023).