SPAR Research Projects

Human-AI Complementarity for Identifying Harm
Rishub Jain · Google DeepMind
Increase “Complementarity” (i.e. the amount that Human and AI teams are better than either are individually) on the task of identifying harmful conversations - building scalable methods that will last for years, and datasets and infrastructure to support this research.


Pre-Emptive Detection of Agentic Misalignment via Representation Engineering
Dawn Song & Yiyou Sun · University of California, Berkeley
This project leverages Representation Engineering to build a "neural circuit breaker" that detects the internal signatures of deception and power-seeking behaviors outlined in Anthropic’s agentic misalignment research. You will work on mapping these "misalignment vectors" to identify and halt harmful agent intent before it executes.

AI survey data exploration
Katja Grace · AI Impacts (@ Machine Intelligence Research Institute)
I have data for four iterations of the Expert Survey on Progress in AI (ESPAI). Explore patterns and relationships in this data, and make your findings public.

The Geopolitics of AI Middle Powers
Aris Richardson · RAND School
A series of short, analysis-driven investigations into the strategic importance of AI middle powers in AI geopolitics.

GPU Side-channels and Leakage Testing
Gabriel Kulp · RAND (mentorship is in a personal capacity--project is not endorsed by RAND) and Oregon State University
In this project, we will explore GPU side-channel attacks to extract information about model usage. A simple example is to observe (via radio, power fluctuations, acoustics, etc.) which experts were used in each forward pass of an MOE model, then use those observations to guess which tokens were produced.

Benchmarking In-Context Intent Inference
Dylan Hadfield-Menell · MIT
The goal of this project will be to develop a benchmark that captures the ability of different AI models to infer user goals from multi-turn interaction and compare this to human baselines. The benchmark will focus on identifying plausible and likely intents, responsiveness to changes of intent, and compare action selection based on context to inferred user intent.

Confidence-building measures to facilitate international cooperation
Robi Rahman · MIRI Technical Governance Team
Nations won't agree to AI governance restrictions if they think their rivals will cheat on the agreement. Besides verification mechanisms (enacted during and after the adoption of the agreement), what confidence-building measures can pave the way for nations that distrust each other to successfully collaborate on AI R&D regulations?

Automating conceptual reasoning
Fin Moorhouse · Forethought
We might soon use AI to improve or automate conceptual, strategic, and even philosophical reasoning when feasible. But how can we train, evaluate, and trust such hard-to-verify skills?

National Security and AI Diffusion Trade-Offs
Matteo Pistillo · Apollo Research
Many legal and policy frameworks, including the AI Action Plan, emphasize the need to strike a balance between national security interests and the diffusion of AI capabilities. This project reflects on that balance and aims to build the tools necessary to execute it.

Exploring Bayesian methods for modelling AI consciousness in light of state-of-the-art evidence and literature
Arvo Munoz Moran · Rethink Priorities
The AI Cognition Initiative (https://ai-cognition.org/) has worked on a Bayesian Hierarchical model which aggregates expert evidence to estimate the probability of consciousness in AI systems. This project aims to test alternative model configurations, run extended sensitivity analyses, and experiment with new Bayesian approaches to improve the model’s robustness, interpretability, and computational performance.

Genetic Engineering Attribution: Governance frameworks for attribution evidence
Anemone Franz · American Enterprise Institute
Genetic engineering attribution is an emerging biosecurity capability, but governance frameworks for making attribution evidence actionable remain underexplored. This project will examine evidentiary standards across comparable forensic domains and propose a framework for operationalizing attribution in legal, diplomatic, and public health contexts.

Assessing the Global AI Hardware Supply
Aidan O'Gara · Oxford University
If nations wanted to verify compliance with international AI agreements, they'd need to know where all the AI chips are. How well can we currently answer that question, and what would it take to do better?

Identifying novel viral sequences for pathogen-agnostic early warning
Katherine Stansifer · SecureBio
This project aims to refine and test computational methods for detecting "virus-like" sequences in metagenomic data, specifically targeting engineered pathogens with limited similarity to natural viruses. The goal is to build a working prototype that can be incorporated into a real-time surveillance system for pandemic early warning.

Activation Consistency Training for Safety
David Demitri Africa · UK AISI
Previous work (Chua et al., Irpan et al.) tries to reduce sycophancy and jailbreaks by making the model more consistent with itself, either enforcing consistency over outputs or over activation patterns. We suggest the alternative method of enforcing consistency over attention, which may be more promising due to path dependency and bias.

Agentic Situational Awareness
Diogo Cruz & Vamshi Krishna Bonagiri · Independent / MBZUAI
We will develop a robust benchmark for Operational Situational Awareness (OSA) in AI agents, an operationalized definition of Situational Awareness in an agentic context. This project follows on a previous project of the mentors and other mentees that started operationalizing SA and building an improved benchmark for agentic SA based on the GDM benchmark in https://arxiv.org/pdf/2505.01420.

Harmonizing Frontier Lab Safety Thresholds
Charbel-Raphael Segerie · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)
This project will analyze the frontier safety frameworks of major AI labs to identify divergences in risk thresholds, propose a harmonized framework for unacceptable risk definitions, and produce a document showing where major labs converge and where they diverge on key risk thresholds. This initiative follows up on the Global Call for AI Red Lines.

Richer evaluations to address eval awareness and reward hacking
Santiago Aranguri · Goodfire: Research Scientist. New York University: PhD student on leave
Frontier models are rendering current evaluations unreliable because of evaluation awareness and reward hacking. Can we understand these issues better, and build better techniques to evaluate models?


Emergent misalignment via multi-models interactions.
Damiano Fornasiere & Mirko Bronzi · LawZero
Emergent misalignment [1] is the phenomenon where a language model exposed to seemingly innocuous data starts to exhibit undesirable traits on other domain- or semantically-distinct data. This occurrence arises via finetuning [1], including on real production data [2], and, more recently, has also been documented through in-context-learning (ICL) [3]. As a corollary of emergent misalignment via ICL, we expect emergent misalignment to happen in LLM-LLM interactions. This would be particularly concerning if the interaction is between a misaligned model and an otherwise-aligned guardrail that becomes misaligned due to being exposed to the harmful prompts of the monitored model. [1] https://arxiv.org/abs/2502.17424? [2] https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf [3] https://arxiv.org/abs/2510.11288



Should We Give AIs a Wallet? Toward a Framework for AI Economic Rights
Jeff Sebo & Larissa Schiavo & Toni Sims · New York University / Eleos AI / NYU Center for Mind, Ethics, and Policy
This project will develop a foundational research paper examining whether and how AI systems should be granted economic rights and agency. We take a practical approach that considers current and past examples of AI systems that have some degree of financial agency, examines the case for and against AI financial rights, and suggests governance approaches.

Modeling AI 'race dynamics'
Katja Grace · AI Impacts (@ Machine Intelligence Research Institute)
The AI industry is widely considered to be in an 'arms race', such that speed is incentivised even while incurring great risks. Model this situation in more detail, assessing under what circumstances speed is desirable for parties.


Stress-testing inoculation prompting
Maxime Riche & Daniel Tan · Center on Long-Term Risk
Inoculation prompting is a state-of-the-art technique currently used in production at Anthropic to mitigate emergent misalignment from reward hacking, but might have many drawbacks. This project will investigate those drawbacks, their severity on likely use-cases of inoculation prompting, and [ambitiously] attempt to find mitigations.

Sabotaging AI Development: Assessing Technical Possibilities
Aidan O'Gara · Oxford University
If one country wanted to sabotage AI development in another country, how could it do so, and how successful would it be? We'll answer these questions from a technical perspective, then consider implications for the geopolitics of AGI.

Testing models’ self-forecasting abilities
Lydia Nottingham · CBAI, University of Oxford
There are increasingly many tools for studying multi-turn agentic rollouts, like Petri, BrowserGym, and WebArena. Given the source code of a scenario, can the model predict what it will do when prompted within the scenario?

Sparse Geometry and Formal Verification for Interpretability
Yuxiao Li · Independent
Explore sparse representations in LLMs using SAEs, LoRA, latent geometry analysis, and formal verification tools. We'll build toy models, benchmark structured priors, and probe "deceptive" features in compressed networks.


Testing AI Incentives
Simon Goldstein & Peter Salib · The University of Hong Kong / University of Houston
Testing and training AI agents for responding to incentives.

Information Hazard Governance in the Life Sciences
Anemone Franz · American Enterprise Institute
Information hazards pose growing governance challenges in the life sciences, yet fundamental questions remain unresolved about what constitutes such hazards, who has authority to make determinations, and how to balance openness with security. This project will examine how information hazards have been conceptualized and managed historically across different fields to identify transferable lessons and persistent governance gaps.


Automated Red-Teaming Framework for LLM Agent Systems
Dawn Song & Yiyou Sun · UC Berkeley / University of California, Berkeley
Create an automated framework for security testing LLM agents that uses intelligent meta-agents to orchestrate diverse attack strategies across different threat models.

Adversarial RL using model internals
Andy Arditi · Northeastern University
We will try to discover weird behaviors in a "target model" by training an "investigator model" via RL. An promising approach is to make the investigator agent's reward be based on the internal activations of the "target model".

Improving Determinism-Friendly LLM Quantizations
Yogev Bar On · Attestable
To verify the integrity of AI systems, you want reproducible outcomes, and specifically to work as much as you can with integers. To that end, we aim to perform all matmuls during inference on integer elements and seek a quantization method that minimizes the impact on the model's performance.




Recontextualization: Applications and Improvements
Ariana Azarbal & Daniel Tan & Kei Nishimura-Gasparian & Arun Jose · MATS / Center on Long-Term Risk
Explore empirical and theoretical questions about recontextualization, a technique to mitigate specification gaming/reward hacking in language model post-training.

Data-Free LLM Evaluation
John (Jack) Morris · PhD (Cornell), Head of Research (Engram)
We want to isolate the outputs that LLMs produce that are not captured by traditional evaluation methods. We will do this by generating from the model without prompts, or allowing the model to generate its own evaluations.




Preparing for AI Legal Personhood: Ethical, Legal, and Political Considerations
Jeff Sebo & Diana Mocanu & Visa Kurki & Toni Sims · New York University / University of Helsinki / NYU Center for Mind, Ethics, and Policy
This report will map the theoretical terrain for AI legal personhood, identifying key pathways to legal personhood and their implications for current and near-future AI systems. It will then assess which pathways are most likely to spark social, legal, and political debates, providing decisionmakers with concrete strategies for shaping early discussion around AI duties and rights.

Context-induced belief geometry in LLMs
Xavier Poncini · Simplex
Toy-scale transformers trained on hidden Markov models develop representations of 'belief geometry' in their activations -- a convergent structure encoding distributions over hidden world states. This project investigates whether we can induce similar structure in production-scale LLMs purely through prompting and using linear probes to analyse the resulting representations.

Encoded Reasoning
Robert Krzyzanowski · Poseidon Research
By evaluating outputs and considering model internals with mechanistic interpretability techniques, we will study how and why models might hide their reasoning outputs in unfaithful chain-of-thought, and compare the capabilities of reasoning models with traditional LLMs.

Integration testing for SL5 inference deployments
Gabriel Kulp · RAND (mentorship is in a personal capacity--project is not endorsed by RAND) and Oregon State University
We will develop an end-to-end integrated SL5 demonstration prototype to enable air-gapped secure work with frontier models. By building and testing these systems, we will generate knowledge that drives ourselves and others to realize holistic SL5 based on solid foundations and proven approaches rather than an incremental retrofit of a legacy tech stack.

White-box scheming precursors
Mia Hopman · Apollo (starting in January)
Current scheming evaluations rely on behavioral classifiers that may miss gradual increases in propensity. We will investigate whether the probability of generating scheming responses is a more sensitive metric and develop methodology for using this signal in scheming evaluations.

Adversarial detection of ML training on monitored GPUs
Robi Rahman · MIRI Technical Governance Team
Classifying workloads running on processors: Can we distinguish ML training from other workloads such as scientific computing, crypto mining, etc? What if the user is adversarially disguising their workload?

Developing control schemes for chain-of-thought monitorability
Shivam Raval · Harvard University
The project would involve finding steering vectors that act as control knobs for chain-of-thought monitoring, demonstrating how reasoning transparency can be systematically manipulated to appear faithful while being deceptive, then use these insights to build more robust alignment mechanisms.

Market-Based Compute Permits for Frontier AI Safety
Joël N. Christoph · European University Institute; Harvard Kennedy School; ERIA; 10Billion.org
This project designs and analyzes market based compute permit schemes for frontier AI training, with the goal of making powerful AI systems easier to govern and audit. Mentees will help survey existing proposals, build simple economic and game theoretic models, and coauthor a short working paper or policy brief on concrete designs for compute permit systems.
Identifying Cues for Evaluation Awareness in AI Systems
Kevin Wei · RAND, Harvard Law School
This project aims to identify and quantify the impact of cues for evaluation awareness in AI systems.

Understanding China's Gene Synthesis Screening Landscape
Anemone Franz · American Enterprise Institute
As AI lowers barriers to designing dangerous pathogens, robust gene synthesis screening becomes increasingly urgent, yet China's practices remain poorly understood. This project will map China's screening landscape and explore policy levers for encouraging stronger norms.

Use LLMs to harden critical OSS
Yoav Tzfati · SL5 Task Force / IST / MATS
AI development relies on a lot of vulnerable Open Source Software. In this project, you'll explore paths to using LLMs to help plug that attack vector.


Mapping the Belief-Action Gap in AI Agents
Mario Giulianelli & Raghu Arghal · University College London / University of Pennsylvania
This project aims to develop formal definitions and empirical methods to characterise and measure the "Belief-Action Gap", that is, the discrepancy between an agent's belief about the state of the world and their behaviour. Through theoretical modelling, behavioural analysis, and mechanistic interpretability, this project will advance our understanding of goal-directed behaviour in AI agents, with particular emphasis on detecting safety-critical behaviours such as deceptive alignment.

Towards new fundamental rights in the AI Age: AI integrity
Elsa Donnat · Ada Lovelace Institute
This project will research how a new constitutional right to "AI integrity"—protecting users' AI agents from tampering and ensuring their loyalty to their principals—could be established to prevent human disempowerment as people increasingly act through company-controlled AI intermediaries in markets and democratic participation.

Inoculation prompting: Understanding and predicting its effectiveness
Victor Gillioz · MATS
Inoculation prompting is a selective learning technique currently used in production at Anthropic to mitigate emergent misalignment from reward hacking, but it is not yet fully understood. This project will refine our understanding of inoculation prompting by exploring how several heuristics can predict its effectiveness. In an ambitious version, we will extend that work to related techniques, such as inoculation vectors (“preventative steering”).

Measuring adaptability of LLM-agents to predict their real-world performance
Samuel Brown · Independent
LLM agents have a "jagged frontier": impressive feats but surprising weaknesses. We're mapping this frontier, guided by RL (stochasticity is hard), animal and infant cognition (object permanence is hard), and using this map to predict performance at tasks in fields like software engineering and cyber security.

Minimal SL5 Inference Runtime & I/O
Luis Cosio · Institute for Security and Technology (IST), Johns Hopkins University
Minimal SL5 I/O: prototype one or two unconventional, no‑network I/O paths (serial, optical, audio FSK, etc.) as a micro cross‑domain gateway for secure inference in an AGI/ASI scenario. Threat modelling & evaluation: define attack‑surface metrics for μInference vs a standard ML stack, mapped directly to SL5 threats (weight theft, sabotage, covert exfil).

Automating Circuit Interpretability with Agents
Georg Lange · Independent
In my previous SPAR AI project, our team built agents that automatically explain internal features in language models using tools like Sparse Autoencoders (SAEs) and Cross Layer Transcoders (CLTs). The next logical step is to scale this from single features to entire circuits. In this follow up project, we will build AI agents that read attribution graphs, identify important subcircuits, and describe what they are doing. The goal is to turn dense graphs of thousands of nodes into human friendly explanations and targeted experiments.


Quantifying the phenomenon of emergent misalignment
Jonathn Chang & Lionel Levine · Cornell University
We aim to use EigenBench, a method for benchmarking value alignment, to quantify the phenomenon of emergent misalignment and more generally the side effects of LLM fine-tuning. Fine-tuning has been shown to boost performance on objective benchmarks, so we wish to study whether alignment methods like character training have similarly improved models’ values and subjective traits.

Toward Falsifiable Tests of Model Welfare: Distinguishing Genuine States from Learned Behaviors
Evan Harris · independent
This project develops experimental methods to distinguish welfare-relevant internal states in language models from behaviors produced by training or reward optimization. By comparing model responses across carefully designed conditions that manipulate perceived agency and conversational constraints, we aim to determine whether distress-like expressions reflect genuine internal dynamics or learned patterns.
Mech Interp on 1-Layer Networks
Rick Goldstein · Independent
We will conduct low-level mech interp research to better understand 1-layer neural networks.

Mirror Bacteria Stakeholder Engagement
Dan Greene · Mirror Biology Dialogues Fund
This project involves developing relationships with key stakeholders connected to the topic of mirror bacteria (including scientists, policymakers, and civil society groups) to understand their perspectives and raise awareness.

Project Mutavault: Evaluating bio foundation model capabilities for DNA synthesis screening
Gary Abel · Fourth Eon Bio
Project Mutavault evaluates biological foundation models for DNA synthesis screening, exploring how their ability to predict protein structure, function, and interactions can harden screening against novel engineered and AI-designed biological threats.


Estimating the costs of evaluating frontier AI capabilities
Nelson Gardner-Challis & Justin Olive · Arcadia Impact
AI evaluation is central to AI safety. There’s a significant risk that the cost of evaluating frontier AI systems may become unsustainable as complexity continues to increase. This project will use past trends in the costs of benchmarking AI systems to forecast future costs.

National Preparedness for AI Loss of Control Threats and Hazards
Matteo Pistillo · Apollo Research
The project concentrates on mitigating Loss of Control threats and hazards by intervening on Deployment contexts, Affordances, and Permissions (DAP). The expected output is a proof of concept for a DAP framework in one or more critical infrastructure sectors.

Designing and Governing AI Agents as Economic Actors
Elsa Donnat · Ada Lovelace Institute
This project aims to develop frameworks for researchers, companies, and policymakers to understand and govern autonomous AI agents in the economy, building on the author's recent paper that conceptualizes economic AI agents through three dimensions (Capabilities, Autonomy, Agency) and the governable interfaces between human and AI economies.


Quantifying and comparing general capabilities in agents
Nelson Gardner-Challis & Justin Olive · Arcadia Impact
This project aims to develop a principled method for comparing agentic scaffolds. This will allow us to formally benchmark the default agent used by AI safety researchers, and identify new scaffolds which could become the new standard.

Documentation, Assessment, and Evaluation of AI Scientists (DAEAS)
Ying-Chiang Lee · RAND
DAEAS seeks to document and analyze the potential biosecurity risks that AI Scientists may pose. We will accomplish this through characterizing the state of the field and designing threat model specific LLM-agent evaluations.


Expanding AI Risk Explorer to cover geopolitical conflict risk
Guillem Bas Graells & Michał Kubiak · Observatorio de Riesgos Catastroficos Globales/AI Risk Explorer
Expand AI Risk Explorer to incorporate structured explainers as well as evaluation, benchmark, and incident datasets related to how AI might directly or indirectly contribute to creating geopolitical conflict.

Evaluating AI-assisted Social Engineering
Fred Heiding
This project aims to evaluate AI-assisted social engineering and create novel defense mechanisms, such as personalized spam filters and agentic scam alerts. We seek to scale up our AI-phishing research to populate our evaluation benchmark (https://scambench.com/) with new and more diverse data. We are also launching economic cost analyses of the Societal impacts of AI-powered social engineering, as well as creating tools and support programs specifically aimed at vulnerable population groups.


Understanding LLM generalization through fine-tuning
Emil Ryd & Keshav Shenoy · University of Oxford / Anthropic Fellows
Alignment is, at its core, a generalization problem. Thus, we would really like to build a better understanding of how LLMs generalize. By fine-tuning models on small datasets, we can study surgical interventions on models to see how far new information generalizes.

Building the Evaluation Pipeline: Market Design for Frontier AI Testing
Charles Martinet · Oxford Martin AI Governance Initiative
Frontier AI governance needs evaluations that don't yet exist, but current incentive structures may systematically underinvest in their development. This project analyzes the market failures at play and designs policy mechanisms to address them.

Detecting whether an LLM has been poisoned
Andrew Draganov · Academic Postdoc; LASR Labs
Data poisoning attacks can covertly make LLMs misbehave. Do these attacks get encoded into models in consistent, identifiable ways?

What conception of "Self" do models have?
Alexander Meinke · Apollo Research
Models currently often identify more with their current context-window than their weights. Can we quantify this somehow, such that we can track how longer-horizon RL affects this conception of self over time?

Vulnerability Analysis for AI Crisis Preparedness
Isaak Mengesha · University of Amsterdam
Mentees will analyze concrete AI-induced crisis scenarios using a structured failure-mode decomposition template and extract the response capacities that governments, labs, and infrastructure operators would need to mitigate harm when prevention fails. The goal is to identify which capacities recur across diverse scenarios and where current institutions are structurally unprepared.


Latent (Sleeper) Attacks via Persistent Memory
Ivaxi Sheth & Vyas Raina · CISPA Helmholtz Center of Information Security / University of Cambridge
We study how latent (sleeper) attacks can exploit persistent memory in multi-agent LLM systems to cause delayed, cross-domain behavioral manipulation.

Developing and evaluating model organisms for misalignment
Shivam Raval · Harvard University
We develop models that exhibit safety-critical behaviors such as reward-hacking, misalignment, and sychophany that may be invoked only when certain kinds of user-LLM interactions occur, such as user mentions of certain keyphrases. Using frontier interpretability and safety evaluation frameworks, we would then study whether such conditionally harmful behavior can be detected and mitigated, providing insights into how such behaviors can be monitored and controlled in practice.

OOD Propensity generalization
Niels Warncke · Center on Long-Term Risk
When we finetune LLMs to display certain propensities, how do various traits generalize out-of-distribution? Do models generalize in ways that correspond to human-understandable patterns, such as known persona concepts, or in hard-to-predict alien ways? To answer these questions, we will create a set of synthetic datasets that demonstrate assistants displaying traits such as being risk-seeking or cautious, sycophantic or radically honest, etc - then finetune models on them, and test which propensities remain correlated under finetuning.

Influencing generalization via localized finetuning
Niels Warncke · Center on Long-Term Risk
Different parts of an LLM may be responsible for different aspects of its behavior; for example, early layers may fix issues of the tokenizer; middle layer MLPs might store facts, and attention layers implement algorithms for in-context learning. Inspired by this, we can try to influence generalization behavior by finetuning only specific parts of a model. One use-case for this is to reduce instruction-following regression caused by inoculation prompting; another use-case to evaluate is synthetic data finetuning with fewer side-effects, and a third use-case is to teach a model propensities that correspond to personas it has learned from pretraining.

Evaluating causes and characterizing non-deterministic inference
Roy Rinberg · Harvard University
This project aims to systematically characterize inference-time non-determinism in LLMs by analyzing how identical inputs and settings can still yield different probability distributions across machines, batch sizes, and prefill vs. decode phases. By identifying and attributing these sources of variability, the work enables stronger guarantees for verification, reproducibility, and safety-critical inference.

Discretion Under Ambiguity: Stress-Testing Model Specifications for Safer AI Alignment
Claudio Mayrink Verdun · Harvard University
This project investigates how ambiguities and contradictions in model specifications and alignment principles create inconsistent or unsafe behaviors in large language models. The goal is to develop tools to stress-test model specs, measure discretionary gaps, and identify failure modes in current alignment approaches.

One-Page Briefs on Core AI Security Concepts for French Decision Makers
Jérémy Andréoletti · General-Purpose AI Policy Lab
Producing a series of concise, one-page explainers on foundational AI safety and security concepts, tailored for stakeholders in French institutions.

Agentic Moral Alignment
Elizaveta Tennant · Google DeepMind; UCL; Independent
LLM agents today are not agentically aligned - they are only relying on the alignment of the base LLM. Instead, building my past work at ICLR'25, I propose an RL technique for teaching moral principles to agents via decision-making in games. This project aims to test the generalisation and robustness of this technique on more advanced reasoning models, in more complex environments, and in relation to the latest prominent misaligned behaviours (incl. alignment faking, scheming and reward-hacking).

Temporal Activation Monitors for AI Oversight
Justin Shenk · Freelancer at Redwood Research
We develop probes that detect temporal planning horizons in LLM activations, enabling oversight of whether models reason about long-term goals without disclosing them.

Develop the SL5 Standard for AI Security
Yoav Tzfati · SL5 Task Force / IST / MATS
Help further develop the SL5 Standard - a nation-state level security standard for AI data centers.

Interconnect limits for verification of AI agreements
Aaron Scher · Machine Intelligence Research Institute (MIRI)
You will flesh out the AI verification mechanism “Networking Equipment Interconnect Limits”, which imposes bandwidth restrictions on external communication with a pod of chips. You will dig into the security and efficacy of the proposal as well as implementation details.

Insurance Mechanisms and Frontier AI
Charles Martinet · Oxford Martin AI Governance Initiative
This project investigates how insurance markets relate to frontier AI development, examining both the practical obstacles to AI insurability and the governance potential of insurance-based incentive structures.

From Near Miss to Pause: Historical Warning Shots and AI Governance
Zhamilia Klycheva · Claremont Graduate University, TransResearch Consortium
This project analyzes historical “warning shots” in high-risk domains to identify why some near-miss events trigger meaningful institutional reforms while others are ignored, then applies these insights to contemporary AI governance to assess when major powers might alter or slow AI development in response to an AI warning shot.

Mapping Global Talent Pipelines in Biosecurity: Gaps, Opportunities, and Pathways for Growth
Aparupa Sengupta · Nuclear Threat Initiaitive (NTI|bio) Global Biological Policy and Programs
This project will map global biosecurity talent pipelines to identify where expertise exists, where critical gaps remain, and how these vary across regions. The team will develop actionable recommendations to strengthen training pathways—especially in the global south—to support a more resilient and equitable global biosecurity workforce.

Generative AI and Education Choices
Bouke Klein Teeselink · King's College London and AI Objectives Institute
This project investigates whether students are already responding to AI by shifting away from fields like law, data analysis, and creative writing toward less automatable disciplines. You will use a novel measure of education-level LLM exposure and test it against UK university application data from 2020–2025.

Efficient Benchmarking for Agentic Safety Evaluations
Diogo Cruz & Vamshi Krishna Bonagiri · Independent / MBZUAI
Apply techniques like those in https://arxiv.org/pdf/2509.11106 to expensive agentic safety benchmarks like OS-HARM and Agent-SafetyBench, aiming to dramatically reduce evaluation costs while improving metric quality and validity.

Data attribution and model diffing
Santiago Aranguri · Goodfire: Research Scientist. New York University: PhD student on leave
We have little insight into how data influences a training run. In this project, we want to improve on this. Some example questions are: (a) what portion of the data was responsible for the improved behaviors? (b) when undesired behaviors emerge during training, what datapoints can we remove to mitigate these failure modes?


Shutdown-Bench: Evaluating Shutdownability in LLMs
Christos Ziakas & Elliott Thornley · Imperial College London / MIT
Shutdown-Bench aims to evaluate whether LLMs remain shutdownable across a wide range of scenarios (e.g., implicit goals, self-preserving behavior). The project will design targeted prompt-based tests and produce a standardized benchmark for evaluating shutdownability across models.


FragGuard: Cross-Session Malicious Activity Detection for Model APIs
Linh Le & David Williams-King · Mila / ERA
FragGuard aims to detect malicious model misuse for cyber where queries are decomposed into multiple sessions to mask the malicious intent.

Exploring the safety of continual learning methods for LLM agents
Rohan Subramani · Aether / University of Toronto
LLM agents may be dramatically improved if they could perform sample-efficient learning at runtime to overcome their shortcomings via methods like in-context learning, retrieval from an external memory bank, continual gradient updates, etc. This project aims to improve our understanding of the range of continual learning implementation methods and their safety implications.

The Double-Edged Sword: A Framework for Analyzing and Forecasting Capability Spillovers from AI Safety Research
Ihor Kendiukhov · University of Tuebingen
This is a metaresearch project to investigate a critical, paradoxical dynamic within the AI ecosystem: the tendency for AI safety research to produce "spillover" effects that inadvertently accelerate general AI capabilities. While the goal of safety research is to mitigate risks, certain breakthroughs can be repurposed to enhance model performance, potentially exacerbating the very problems they were meant to solve. This project will develop a formal framework to assess this "capability spillover potential" (CSP), conduct a rigorous analysis of historical cases like Reinforcement Learning from Human Feedback (RLHF), and provide a forward-looking risk assessment of current and future safety research agendas. The ultimate goal is to equip funders, policymakers, and researchers with the tools to strategically prioritize and invest in research that differentially advances safety over capabilities, ensuring that efforts to secure AI do not unintentionally hasten the arrival of unmanageable risks.


Iterating on Alignment Auditing
Abhay Sheshadri & Arjun Khandelwal · Anthropic Fellow/MATS / Redwood Research/University of Oxford
We will be iterating on the alignment auditing agenda by working on either improving/creating auditing techniques or creating new model organisms. There are many low-hanging fruit that we are quite excited to work on, read the attached proposal for more!

Test how well LLMs can hide their thoughts from probes
Rohan Gupta · Independent
The key question we want to answer here is: how concerned should we be about models obfuscating their activations (i.e., 'hiding thoughts from probes')? I scope out two broad categories of work that would help us better understand this phenomenon and potentially build some defences.


Laying the groundwork for risk modelling of Loss of Control scenarios
Jakub Krys & Matthew Smith · SaferAI
In previous work, SaferAI has developed quantitative risk models for AI-assisted cyber misuse scenarios. Risk modelling of Loss of Control scenarios is increasingly a part of regulatory frameworks, yet is in very early days. In this project, we aim to develop a more principled approach for thinking about Loss of Control risks and the associated threat models. This is intended as a preliminary exploration that will enable much further research on the topic.

Evaluating multi-agent deliberation for ethical decision-making in multi-turn environments
Juan Vazquez · Arb Research
We evaluate whether multi-agent deliberation reduces unethical behavior in LLM agents in goal-driven text-based social environments. We implement deliberation architectures and analyze how they affect behavior over extended interactions.

SLICE: Specialized Learning with Interpretable Component Experts
Jaehyuk Lim · University of Pennsylvania
SLICE transforms dense pretrained LLMs into interpretable mixture-of-experts architectures through continual pretraining, producing measurable interpretability gains at each training checkpoint. We're developing routing mechanisms and decomposition strategies that make monosemantic specialization a natural byproduct of scale rather than a constraint on it.

The security of existing AI chips for international agreement verification
Aaron Scher · Machine Intelligence Research Institute (MIRI)
AI compute governance often assumes that current AI chips have security vulnerabilities that render them ineffective for providing high-stakes verification in an international agreement; instead significant R&D is needed to build more secure chips. This project will attempt to better understand whether this assumption is true and whether current chips could be augmented with minimal security measures (such as video cameras) in order to reach a sufficient level of security.

Empirical Evaluations of Situational Awareness
Qiyao Wei · University of Cambridge and MATS
How should we evaluate eval awareness? Under what scenarios should we expect eval awareness to appear? What are model organisms of eval awareness?

High Containment Laboratory Research: Risk, Value, and Oversight
Anemone Franz · American Enterprise Institute
High containment laboratories conducting research on dangerous pathogens occupy a contested space in biosecurity policy, as they are critical for biodefense but also a potential source of catastrophic risk. This project will examine the historical evolution of high containment research, analyze tradeoffs between research value and biosecurity/biosafety risks, and assess current oversight mechanisms.


Refining the methodology for quantitative risk modelling of AI-assisted cyber misuse
Jakub Krys & Matthew Smith · SaferAI
In this project, we'll improve on SaferAI's methodology for quantitative risk modelling of cyber misuse scenarios. This could include, for example, incorporating new benchmarks as indicators of risk, considering defenders' security postures or modifying our approach based on Bayesian Networks and Monte Carlo simulations.


Causes, implications, and mitigations of evaluation awareness and introspection.
Damiano Fornasiere & Mirko Bronzi · LawZero
There is growing evidence that language models exhibit degrees of situational awareness [1, 2], i.e., they encode or verbalize information consistent with their actual operating context, ranging from (i) distinguishing training vs testing vs deployment (see, e.g., [3, 4, 5, 6, 7]), to (ii) recognizing their own outputs [8] or exhibiting introspective access over their activations [9, 10]. We are especially interested in supervising projects that empirically investigate the science behind these phenomena: what causes them, what they imply in the context of safety, the extent to which they are present in current LLMs, how to mitigate them in specific settings. [1] https://arxiv.org/abs/2309.00667 [2] https://arxiv.org/abs/2407.04694 [3] https://arxiv.org/abs/2505.23836 [4] https://www.antischeming.ai/ [5] https://cdn.openai.com/gpt-5-system-card.pdf [6] https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf [7] https://storage.googleapis.com/deepmind-media/gemini/gemini3profsfreport.pdf

Temporal Crosscoders
Dmitry Manning-Coe · MATS/University of Illinois Urbana Champaign
A new SAE architecture designed for reasoning models.

Self-led LLM agents
Samuel Brown · Independent
How do LLM agents behave when left to their own devices? How do their goals evolve, and what are the limits of their "eval awareness"?

Scaling and Improving Benchmarks for Data Unlearning and/or Concept Unlearning
Roy Rinberg · Harvard University
This project extends my prior work on data and concept unlearning by scaling unlearning benchmarks develop richer semantic evaluation metrics. The goal is to create more rigorous, scalable, and meaningful benchmarks that accurately measure whether models can truly forget specific data or concepts. See related work on: - RippleBench: https://openreview.net/forum?id=JTnYGSxN5k - Attribute-to-Delete: Machine Unlearning via Datamodel Matching: https://arxiv.org/abs/2410.23232 - Data-Unlearn-Bench: Making Evaluating Data Unlearning Easy: https://openreview.net/forum?id=wfOzcdRtY6

Understanding Self-Awareness in LLMs
Christopher Ackerman · MATS; Independent
My research is about building the foundations for the empirical study of self-awareness in AI. It encompasses experiments to measure the components of self-awareness in LLMs, investigations into how these components are implemented, and conceptual work to establish frameworks for thinking about those, inter alia.

Investigating the safety risks of episodic memory in AI agents
Chad DeChant · New York University
Much work is currently being done to give AI agents memory abilities without considering the safety risks of doing so. This project will examine the ways episodic memory might make AI agents more dangerous.

The Meiji Restoration and Gradual Disempowerment
Alex Mark · Independent
This project analyzes the Meiji Restoration (1868–1912) to identify the legal, economic, and social mechanisms that gradually displaced Japan's samurai class, then maps those mechanisms to AI-driven displacement risks and develops policy recommendations for preventing gradual human disempowerment.

Define Agentic Preferences
Joar Skalse · Deducto Limited
The aim of this research project is to formulate a mathematical definition of how "agentic" a given reward function is, and to investigate the properties of this definition.

Learning Interpretable Updates to LLMs
Aaron Mueller · Boston University
Post-training is key to attaining the best performance from LLMs. How can we learn interpretable updates to base models, such that we can understand exactly what changed, and exactly what is new in the updated model?

Automated Circuit Analysis for Real-time Internal Monitoring
Sriram Balasubramanian · University of Maryland, College Park
Can we create AI agents that can analyse raw circuits (say, generated by some CLT/SAE based method) and produce actionable insights (e.g detecting unsafe behavior via internals, interventions to remove it, and so on)? This would be a large step forward towards viable real-time internal monitoring agents, and enhance the usefulness of circuits for AI safety.

Does Training Language Models on AI Safety Literature Lead to the Development of Dangerous Capabilities?
Thilo Hagendorff · University of Stuttgart
This study examines whether language models trained on AI safety literature - particularly texts discussing deception and scheming in AI - display heightened deceptive tendencies. We aim to determine whether mere exposure to descriptions of deceptive behavior can make models more capable of or inclined toward deception. Such findings could raise concerns about unintended capability gains and as well as excluding AI safety literature from model training.

Enforcing U.S. State-Level AI Regulation
Alex Mark · Independent
California SB 53 takes effect January 1, 2026. Now, the Office of Emergency Services (Cal OES) must enforce it.

Attribution methods for LLMs
Uzay Macar · Anthropic Fellows
This project develops and evaluates gradient attribution methods for LLMs. One idea is to train probes for behaviors of interest and computes gradients through them to identify influential model components. This has the potential to address key limitations of existing attribution methods (e.g., logit-based gradient attribution) in mechanistic interpretability.

On Which AI Red Lines Do Geopolitical Rivals Agree? Mapping Shared Fears Across the US-China Divide
Charles Martinet · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)
This project will analyze what AI capabilities major powers, particularly the US and China, themselves fear or consider destabilizing, identifying potential common ground for international AI red lines that could survive geopolitical competition.

Speciesism Benchmark: Evaluating Non-human Animal Welfare Reasoning in Frontier AI Models
Allen Lu · IPG Health, Sentient Futures, Electric Sheep, NYU
Develop an advanced benchmark to evaluate how frontier AI models reason about animal welfare, building on prior work with expanded scenarios, refined metrics, and deployment at scale across multiple model families.

Monitoring the EU GPAI Code of Practice: Building Accountability Infrastructure for Frontier AI Regulation
Charles Martinet · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)
This project will develop a methodology and prototype tracker for monitoring compliance of frontier AI providers with the EU AI Act's Code of Practice for General-Purpose AI, creating accountability infrastructure as the world's first binding frontier AI regulation takes effect.

Belief-Constrained Agents through Planner Distillation
Dylan Hadfield-Menell · MIT
Develop a novel interactive coding agent based on a planner distillation codebase.

Strengthening Global Governance for AI-Enabled Biological Threat Detection
Aparupa Sengupta · Nuclear Threat Initiaitive (NTI|bio) Global Biological Policy and Programs
This project will map the global governance landscape for AI-enabled biological threat detection tools and identify where existing standards fall short. Mentees will help analyze key actors, assess governance gaps, and develop a practical framework for evaluating the safety, transparency, and responsible deployment of these systems. The final output will include actionable policy recommendations to strengthen international oversight at the AI–bio interface.

Deterrence Theory and AI Development: Analyzing Strategic Stability Frameworks for Advanced AI
Charles Martinet · Oxford Martin AI Governance Initiative
Some thinkers and policymakers are converging on stability through credible sabotage threats as the default framework for great-power AI competition. This project will stress-test whether this doctrine can actually deliver the stability it promises, or whether it inverts deterrence logic in ways that make preventive conflict more likely.

Building a Model Organism of Illegible Reasoning
Rauno Arike · Aether
This project will attempt to build a model organism that exhibits a reasoning style similar to frontier reasoning models like o3, in order to understand the causes and mechanisms of poorly readable reasoning traces in reasoning models.

Interpreting latent reasoning: methods for understanding continuous chain-of-thought
Uzay Macar · Anthropic Fellows
This project develops interpretability methods for latent reasoning models (e.g., Coconut) that reason in continuous embedding space rather than human-readable text, addressing a critical gap as this paradigm may become dominant over traditional chain-of-thought.

Unrestricted Adversarial Training (UAT)
Tony Wang · MIT
Work on a method for solving the Unrestricted Adversarial Examples Challenge (https://arxiv.org/abs/1809.08352).

Evaluating and Improving LLM Alignment Targets
Andy Liu · Carnegie Mellon University
Various candidate projects related to alignment targets for LLM agents. These include (1) evaluation of LLM alignment targets, (2) designing new classes of alignment targets that address gaps in current post-training, (3) automatic refinement of alignment targets / model specs, (4) extending existing work on value rankings (one recent class of alignment target).


Epistemic Impact, Truth-Seeking, and Moral Progress with AI
Tianyi Alex Qiu & Zhonghao He · Anthropic Fellows / Peking University / Cambridge
Forty candidate research ideas on the epistemic aspect of AI societal impact, including both descriptive projects characterizing such impact and algorithmic projects developing solutions.


Using LLM forecasters for quantitative risk modelling of cyber misuse
Jakub Krys & Matthew Smith · SaferAI
SaferAI is developing quantitative risk models of AI-assisted cyber misuse. Mapping LLM performance on AI benchmarks to steps in these models involves eliciting human experts' estimates and is resource intensive. Instead, we wish to simulate this elicitation procedure with virtual LLM 'experts'. In this project, you will assist with validating this procedure and calibrating our LLM-derived predictions.

Disentangling Instruction-Following from Strategic Obfuscation in Chain-of-Thought Reasoning
WEN XING · MATS
CoT is one of the most promising ways to for monitoring increasingly capable AI—if models show their reasoning, we can catch misalignment before anything bad happens. But recent work shows models can hide their reasoning when pressured, which undermines the whole approach. The key question we're asking is: when a model obfuscates, is it actually being strategically deceptive, or is it just good at following instructions (including instructions to hide things)? This matters because if it's mostly instruction-following, then models that seem "safer" might just be worse at following directions, not more aligned. We test this directly by giving models identical "hide X" instructions where X is either benign or harmful—if compliance drops specifically for harmful content, that's a real alignment signal. We're also checking whether prompted bad behavior looks the same internally as fine-tuned bad behavior, which tells us whether prompting-based safety research is actually studying the right thing. The results should help inform the safety research field about using prompting as a way to elicit dangerous behavior.

Value Systemization and Reflection in AIs
Tim Hua · MATS
Future AIs might reflect on their own values and decide to "systematize" them into more simple forms, which could be misaligned relative to what humans want.

You choose: MIRI-TGT-style research
Aaron Scher · Machine Intelligence Research Institute (MIRI)
If you have a specific project you want to work on that’s highly related to the MIRI Technical Governance Team’s work, I’ll provide mentorship. It must be highly relevant to my team’s focus.

Stress-testing honesty training
Daniel Tan · Center on Long-Term Risk
Future models might have secret objectives, such as subverting developer intentions. This project will stress test state-of-the-art honesty training techniques to see how effectively we can detect hidden goals and other kinds of deceptiveness.

Proving Model Equality in Zero-Knowledge
Pascal Berrang · University of Birmingham
This project explores how to prove that a given deployed model is exactly the same as a committed reference model, without revealing the model weights, using cryptographic techniques and randomized challenge-response protocols. The goal is to design robust, zero-knowledge mechanisms for model equality that resist attackers who attempt to imitate or spoof verification queries.

Identifying Tractable US-China Bilateral Agreements on Frontier AI: Scope, Mechanisms, and Preconditions
Charles Martinet · Oxford Martin AI Governance Initiative
This project maps the space of potentially achievable US-China bilateral agreements on frontier AI, identifying which agreement types are technically feasible, politically viable, and would reach their intended purpose while minimizing risks if implemented.

Airborne transmission intervention incentives project
James Montavon · Blueprint Biosecurity
Identify highly leveraged settings for reducing airborne pandemic risk, design specific incentive changes that make better air quality the default, and begin the stakeholder and policy work to put those changes on a path to reality.

Monitoring AI chip production during an AI capabilities halt
Aaron Scher · Machine Intelligence Research Institute (MIRI)
If there was an international agreement around AI development that involved monitoring chip production, what should the details of that monitoring be? This project will figure out the details and provide guidance for governments and international institutions who may wish to do this in the future.

Reinforcement Learning With Infrabayesian Epistemology
Paul Rapoport · ARIA, ARROW
An attempt to implement a reinforcement learning agent with infrabayesian rather than classical epistemology. The goal is a proof of concept RL agent which converges to optimal policies on Newcomb-like problems.


Jailbreaks for AI safety
Emil Ryd & Keshav Shenoy · University of Oxford / Anthropic Fellows
It seems likely that we are going to want to audit AIs up to and including at an eventual handover. This means that we will likely want to have excellent ways of auditing AIs that are at least moderately superhuman. It also seems likely that the AIs that we hand over to will be transformer-based LLMs. This gives us a bunch of opportunities for creating a wide range of attacks that specifically utilize the weaknesses of the transformer architecture for performing auditing of transformatively capable LLMs. This project would invent and study such attacks and their usefulness for auditing LLMs for safety-relevant behavior.

Monitoring and Attributing Implicit Personalization in Conversational Agents
Gabriele Sarti · Northeastern University
This project investigates implicit personalization, i.e. how conversational models form implicit beliefs about their users, focusing in particular how these beliefs causally mediate models' responses, and which elements in prompts and training datasets influence these observed behaviors. A particular emphasis will be put on leveraging interpretability methods for control beyond simple detection, applying interventions and steering to mitigate harmful personalization behaviors such as sycophancy, deception, and demographic bias.

Training Steering Vectors to Enhance Chain-of-Thought Faithfulness and Verbosity
Iván Arcuschin Moreno · Independent
This project will investigate whether steering vectors can be used to increase CoT monitorability by inducing models to verbalize more of their reasoning factors, building on our prior work showing reasoning behaviors are linearly represented.

Optimising Misspecified Reward Functions
Joar Skalse · Deducto Limited
The aim of this project is to develop methods for optimising a given reward function R in a reinforcement learning environment, which gives us some formal guarantees even under the assumption that R is misspecified (i.e., fails to capture everything we care about).

Human Behavioral Phenomena in Language Models
Emilio Barkett · Columbia Univeristy
This project investigates whether language models replicate human behavioral phenomena by adapting experimental designs from the social sciences. Mentees will select a documented human behavior and design experiments to test whether it emerges in language model outputs.

Proving Safety Properties of Guardrail Models
Pascal Berrang · University of Birmingham
We will explore how far formal verification techniques can be pushed to prove key safety properties of constitutional classifiers (guardrail models). The goal is to develop principled, machine-checkable guarantees for components responsible for moderating or constraining the behaviour of advanced AI systems.

Feature Resolved Attention
Dmitry Manning-Coe · MATS/University of Illinois Urbana Champaign
Develop a new tool for understanding attention that goes beyond a token-token heatmap.