Spring 2026: Mentee applications are now open!

SPAR Research Projects

136 projects

Human-AI Complementarity for Identifying Harm

Rishub Jain · Google DeepMind

Increase “complementarity” (i.e., the degree to which human-AI teams outperform either humans or AI working alone) on the task of identifying harmful conversations, building scalable methods that will last for years, along with datasets and infrastructure to support this research.

Scalable oversight

Pre-Emptive Detection of Agentic Misalignment via Representation Engineering

Dawn Song & Yiyou Sun · University of California, Berkeley

This project leverages Representation Engineering to build a "neural circuit breaker" that detects the internal signatures of deception and power-seeking behaviors outlined in Anthropic’s agentic misalignment research. You will work on mapping these "misalignment vectors" to identify and halt harmful agent intent before it executes.

Alignment
AI control
Mechanistic interpretability

AI survey data exploration

Katja Grace · AI Impacts (@ Machine Intelligence Research Institute)

I have data for four iterations of the Expert Survey on Progress in AI (ESPAI). Explore patterns and relationships in this data, and make your findings public.

AI strategy
Other
Communications

The Geopolitics of AI Middle Powers

Aris Richardson · RAND School

A series of short, analysis-driven investigations into the strategic importance of AI middle powers in AI geopolitics.

International governance
US-China governance

GPU Side-channels and Leakage Testing

Gabriel Kulp · RAND (mentorship is in a personal capacity--project is not endorsed by RAND) and Oregon State University

In this project, we will explore GPU side-channel attacks to extract information about model usage. A simple example is to observe (via radio, power fluctuations, acoustics, etc.) which experts were used in each forward pass of an MoE model, then use those observations to guess which tokens were produced.
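
As an illustrative sketch of the attack's last step only, the toy Python below guesses tokens from observed expert activations; the token-to-expert profile, the Jaccard matching, and the example observations are all invented assumptions, standing in for data an attacker would obtain by profiling the target model and measuring a physical side channel.

    # Toy sketch: guess tokens from observed MoE expert activations.
    # The expert "profiles" below are invented for illustration; a real attack
    # would build them by profiling the target model and measuring a physical
    # side channel (power, RF, acoustics) during inference.

    # Hypothetical profile: token -> set of experts typically routed to.
    TOKEN_EXPERT_PROFILE = {
        "the": {0, 3},
        "cat": {1, 5},
        "sat": {2, 5},
        "on": {0, 7},
        "mat": {1, 6},
    }

    def score(observed_experts, profile_experts):
        """Jaccard similarity between observed and profiled expert sets."""
        union = observed_experts | profile_experts
        return len(observed_experts & profile_experts) / len(union) if union else 0.0

    def guess_token(observed_experts):
        """Return the candidate token whose expert profile best matches the observation."""
        return max(TOKEN_EXPERT_PROFILE, key=lambda tok: score(observed_experts, TOKEN_EXPERT_PROFILE[tok]))

    # One (noisy) observation per forward pass, e.g. recovered from power traces.
    observations = [{0, 3}, {1, 5, 7}, {2, 5}]
    print([guess_token(obs) for obs in observations])  # -> ['the', 'cat', 'sat']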

Securing model weights
AI security

Benchmarking In-Context Intent Inference

Dylan Hadfield-Menell · MIT

The goal of this project will be to develop a benchmark that captures the ability of different AI models to infer user goals from multi-turn interaction, and to compare this to human baselines. The benchmark will focus on identifying plausible and likely intents, measuring responsiveness to changes of intent, and comparing action selection based on context with action selection based on inferred user intent.

Alignment

Confidence-building measures to facilitate international cooperation

Robi Rahman · MIRI Technical Governance Team

Nations won't agree to AI governance restrictions if they think their rivals will cheat on the agreement. Besides verification mechanisms (enacted during and after the adoption of the agreement), what confidence-building measures can pave the way for nations that distrust each other to successfully collaborate on AI R&D regulations?

International governance

Automating conceptual reasoning

Fin Moorhouse · Forethought

We might soon use AI to improve or automate conceptual, strategic, and even philosophical reasoning when feasible. But how can we train, evaluate, and trust such hard-to-verify skills?

Philosophy of AI

National Security and AI Diffusion Trade-Offs

Matteo Pistillo · Apollo Research

Many legal and policy frameworks, including the AI Action Plan, emphasize the need to strike a balance between national security interests and the diffusion of AI capabilities. This project reflects on that balance and aims to build the tools needed to strike it in practice.

AI strategy
National policy

Exploring Bayesian methods for modelling AI consciousness in light of state-of-the-art evidence and literature

Arvo Munoz Moran · Rethink Priorities

The AI Cognition Initiative (https://ai-cognition.org/) has worked on a Bayesian Hierarchical model which aggregates expert evidence to estimate the probability of consciousness in AI systems. This project aims to test alternative model configurations, run extended sensitivity analyses, and experiment with new Bayesian approaches to improve the model’s robustness, interpretability, and computational performance.
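
Purely as an illustrative stand-in (not the AI Cognition Initiative's actual model), the sketch below shows the flavor of a Bayesian sensitivity analysis: hypothetical expert likelihood ratios over indicators are aggregated in odds form and the prior is swept to see how much it drives the posterior.

    import numpy as np

    # Illustrative only: hypothetical likelihood ratios assigned by experts to
    # observed "indicators" (evidence for vs. against consciousness). These
    # numbers and the simple odds-form aggregation are stand-ins, not the
    # project's actual hierarchical model.
    expert_likelihood_ratios = np.array([
        [2.0, 0.8, 1.5],   # expert 1's LRs for three indicators
        [1.2, 0.5, 3.0],   # expert 2
        [0.9, 1.1, 2.2],   # expert 3
    ])

    def posterior_probability(prior, lrs, expert_weights):
        """Combine a prior with weighted expert likelihood ratios via Bayes' rule in odds form."""
        log_lr = np.average(np.log(lrs), axis=0, weights=expert_weights).sum()
        prior_odds = prior / (1 - prior)
        post_odds = prior_odds * np.exp(log_lr)
        return post_odds / (1 + post_odds)

    weights = np.ones(3)
    # Sensitivity analysis: sweep the prior and report the resulting posterior.
    for prior in [0.01, 0.05, 0.1, 0.3]:
        print(f"prior={prior:.2f} -> posterior={posterior_probability(prior, expert_likelihood_ratios, weights):.3f}")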

AI welfare
Philosophy of AI

Genetic Engineering Attribution: Governance frameworks for attribution evidence

Anemone Franz · American Enterprise Institute

Genetic engineering attribution is an emerging biosecurity capability, but governance frameworks for making attribution evidence actionable remain underexplored. This project will examine evidentiary standards across comparable forensic domains and propose a framework for operationalizing attribution in legal, diplomatic, and public health contexts.

Biosecurity
International governance
National policy

Assessing the Global AI Hardware Supply

Aidan O'Gara · Oxford University

If nations wanted to verify compliance with international AI agreements, they'd need to know where all the AI chips are. How well can we currently answer that question, and what would it take to do better?

International governance
Compute governance

Identifying novel viral sequences for pathogen-agnostic early warning

Katherine Stansifer · SecureBio

This project aims to refine and test computational methods for detecting "virus-like" sequences in metagenomic data, specifically targeting engineered pathogens with limited similarity to natural viruses. The goal is to build a working prototype that can be incorporated into a real-time surveillance system for pandemic early warning.

Biosecurity

Activation Consistency Training for Safety

David Demitri Africa · UK AISI

Previous work (Chua et al., Irpan et al.) tries to reduce sycophancy and jailbreaks by making the model more consistent with itself, enforcing consistency either over outputs or over activation patterns. We suggest the alternative method of enforcing consistency over attention, which may be more promising due to path dependency and bias.

Alignment

Agentic Situational Awareness

Diogo Cruz & Vamshi Krishna Bonagiri · Independent / MBZUAI

We will develop a robust benchmark for Operational Situational Awareness (OSA) in AI agents, an operationalized definition of Situational Awareness in an agentic context. This project follows on from a previous project by the mentors and other mentees that started operationalizing SA and building an improved benchmark for agentic SA based on the GDM benchmark in https://arxiv.org/pdf/2505.01420.

Evaluations
AI control

Harmonizing Frontier Lab Safety Thresholds

Charbel-Raphael Segerie · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)

This project will analyze the frontier safety frameworks of major AI labs to identify divergences in risk thresholds, propose a harmonized framework for unacceptable risk definitions, and produce a document showing where major labs converge and where they diverge on key risk thresholds. This initiative follows up on the Global Call for AI Red Lines.

Lab governance

Richer evaluations to address eval awareness and reward hacking

Santiago Aranguri · Goodfire (Research Scientist); New York University (PhD student, on leave)

Frontier models are rendering current evaluations unreliable because of evaluation awareness and reward hacking. Can we understand these issues better, and build better techniques to evaluate models?

Evaluations

Emergent misalignment via multi-model interactions

Damiano Fornasiere & Mirko Bronzi · LawZero

Emergent misalignment [1] is the phenomenon where a language model exposed to seemingly innocuous data starts to exhibit undesirable traits on other domain- or semantically-distinct data. This occurrence arises via finetuning [1], including on real production data [2], and, more recently, has also been documented through in-context learning (ICL) [3]. As a corollary of emergent misalignment via ICL, we expect emergent misalignment to happen in LLM-LLM interactions. This would be particularly concerning if the interaction is between a misaligned model and an otherwise-aligned guardrail that becomes misaligned due to being exposed to the harmful prompts of the monitored model.
[1] https://arxiv.org/abs/2502.17424
[2] https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf
[3] https://arxiv.org/abs/2510.11288

Alignment
Multi-agent systems

Should We Give AIs a Wallet? Toward a Framework for AI Economic Rights

Jeff Sebo & Larissa Schiavo & Toni Sims · New York University / Eleos AI / NYU Center for Mind, Ethics, and Policy

This project will develop a foundational research paper examining whether and how AI systems should be granted economic rights and agency. We take a practical approach that considers current and past examples of AI systems that have some degree of financial agency, examines the case for and against AI financial rights, and suggests governance approaches.

Economics of AI
AI welfare
Societal impacts
National policy

Modeling AI 'race dynamics'

Katja Grace · AI Impacts (@ Machine Intelligence Research Institute)

The AI industry is widely considered to be in an 'arms race', such that speed is incentivised even while incurring great risks. Model this situation in more detail, assessing under what circumstances speed is desirable for parties.

AI strategy
Economics of AI
Lab governance
International governance

Stress-testing inoculation prompting

Maxime Riche & Daniel Tan · Center on Long-Term Risk

Inoculation prompting is a state-of-the-art technique currently used in production at Anthropic to mitigate emergent misalignment from reward hacking, but it may have many drawbacks. This project will investigate those drawbacks and their severity in likely use cases of inoculation prompting, and (more ambitiously) attempt to find mitigations.

Alignment

Sabotaging AI Development: Assessing Technical Possibilities

Aidan O'Gara · Oxford University

If one country wanted to sabotage AI development in another country, how could it do so, and how successful would it be? We'll answer these questions from a technical perspective, then consider implications for the geopolitics of AGI.

International governance
AI security

Testing models’ self-forecasting abilities

Lydia Nottingham · CBAI, University of Oxford

There are increasingly many tools for studying multi-turn agentic rollouts, like Petri, BrowserGym, and WebArena. Given the source code of a scenario, can the model predict what it will do when prompted within the scenario?

Evaluations
AI control
Other

Sparse Geometry and Formal Verification for Interpretability

Yuxiao Li · Independent

Explore sparse representations in LLMs using SAEs, LoRA, latent geometry analysis, and formal verification tools. We'll build toy models, benchmark structured priors, and probe "deceptive" features in compressed networks.

Mechanistic interpretability

Testing AI Incentives

Simon Goldstein & Peter Salib · The University of Hong Kong / University of Houston

Testing and training AI agents for responding to incentives.

Behavioral evaluation of LLMs
AI welfare
Alignment

Information Hazard Governance in the Life Sciences

Anemone Franz · American Enterprise Institute

Information hazards pose growing governance challenges in the life sciences, yet fundamental questions remain unresolved about what constitutes such hazards, who has authority to make determinations, and how to balance openness with security. This project will examine how information hazards have been conceptualized and managed historically across different fields to identify transferable lessons and persistent governance gaps.

Biosecurity
Lab governance

Automated Red-Teaming Framework for LLM Agent Systems

Dawn Song & Yiyou Sun · UC Berkeley / University of California, Berkeley

Create an automated framework for security testing LLM agents that uses intelligent meta-agents to orchestrate diverse attack strategies across different threat models.

AI security

Adversarial RL using model internals

Andy Arditi · Northeastern University

We will try to discover weird behaviors in a "target model" by training an "investigator model" via RL. A promising approach is to make the investigator agent's reward depend on the internal activations of the target model.
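
A minimal sketch of the reward idea, assuming a toy target network and a hypothetical probe direction in place of a real LLM and a learned probe:

    import torch
    import torch.nn as nn

    # Minimal sketch: the investigator's reward is a function of the *target*
    # model's internal activations (here, the projection of a hidden layer onto
    # a hypothetical "weirdness" probe direction). The tiny MLP target and the
    # probe vector are stand-ins for a real LLM and a trained probe.

    torch.manual_seed(0)

    class ToyTarget(nn.Module):
        def __init__(self, d_in=16, d_hidden=32, d_out=8):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, d_out)

        def forward(self, x):
            h = torch.relu(self.fc1(x))   # hidden activations we will inspect
            return self.fc2(h), h

    target = ToyTarget().eval()
    probe_direction = torch.randn(32)      # hypothetical probe for an unusual internal state
    probe_direction = probe_direction / probe_direction.norm()

    def internal_reward(investigator_input):
        """Reward for the investigator: how strongly the target's hidden state aligns with the probe."""
        with torch.no_grad():
            _, hidden = target(investigator_input)
        return (hidden @ probe_direction).item()

    # An RL loop would generate candidate inputs (e.g. token sequences) and
    # reinforce those with high internal_reward; here we just score random inputs.
    candidates = torch.randn(5, 16)
    print([round(internal_reward(c.unsqueeze(0)), 3) for c in candidates])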

Alignment

Improving Determinism-Friendly LLM Quantizations

Yogev Bar On · Attestable

To verify the integrity of AI systems, you want reproducible outcomes, and specifically to work as much as you can with integers. To that end, we aim to perform all matmuls during inference on integer elements and seek a quantization method that minimizes the impact on the model's performance.
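
A minimal sketch of the underlying idea, assuming symmetric per-tensor int8 quantization (one of several possible schemes): the matmul is carried out entirely on integers with int32 accumulation, which is exact and reproducible, and only the final rescale returns to floating point.

    import numpy as np

    # Sketch of the determinism motivation: do the matmul entirely in integers
    # (int8-range inputs, int32 accumulation), so every machine computes
    # identical results, then rescale back to floats. Symmetric per-tensor
    # quantization is used here purely for illustration.

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 8)).astype(np.float32)
    B = rng.standard_normal((8, 3)).astype(np.float32)

    def quantize(x, n_bits=8):
        """Symmetric per-tensor quantization to signed integers."""
        scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)
        return np.round(x / scale).astype(np.int32), scale

    A_q, a_scale = quantize(A)
    B_q, b_scale = quantize(B)

    # Integer matmul with int32 accumulation is exact and reproducible.
    C_int = A_q @ B_q
    C_approx = C_int.astype(np.float32) * (a_scale * b_scale)

    print("max abs error vs float matmul:", np.abs(C_approx - A @ B).max())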

Developmental interpretability
Other
AI security

Recontextualization: Applications and Improvements

Ariana Azarbal & Daniel Tan & Kei Nishimura-Gasparian & Arun Jose · MATS / Center on Long-Term Risk

Explore empirical and theoretical questions about recontextualization, a technique to mitigate specification gaming/reward hacking in language model post-training.

Alignment

Data-Free LLM Evaluation

John (Jack) Morris · PhD (Cornell), Head of Research (Engram)

We want to isolate the outputs that LLMs produce that are not captured by traditional evaluation methods. We will do this by generating from the model without prompts, or allowing the model to generate its own evaluations.

Evaluations
Behavioral evaluation of LLMs
Alignment

Preparing for AI Legal Personhood: Ethical, Legal, and Political Considerations

Jeff Sebo & Diana Mocanu & Visa Kurki & Toni Sims · New York University / University of Helsinki / NYU Center for Mind, Ethics, and Policy

This report will map the theoretical terrain for AI legal personhood, identifying key pathways to legal personhood and their implications for current and near-future AI systems. It will then assess which pathways are most likely to spark social, legal, and political debates, providing decisionmakers with concrete strategies for shaping early discussion around AI duties and rights.

AI welfare
National policy
AI strategy
Societal impacts

Context-induced belief geometry in LLMs

Xavier Poncini · Simplex

Toy-scale transformers trained on hidden Markov models develop representations of 'belief geometry' in their activations: a convergent structure encoding distributions over hidden world states. This project investigates whether we can induce similar structure in production-scale LLMs purely through prompting, using linear probes to analyse the resulting representations.
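
For illustration, a minimal probe-fitting sketch with synthetic data standing in for residual-stream activations and HMM belief states:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative sketch of the analysis step: fit a linear probe mapping
    # (stand-in) residual-stream activations to the Bayesian belief state over
    # the hidden states of an HMM. Both the activations and the belief states
    # here are synthetic; in the project they would come from a prompted
    # production LLM and the HMM implied by the prompt.

    rng = np.random.default_rng(0)
    n_samples, d_model, n_hidden_states = 2000, 64, 3

    beliefs = rng.dirichlet(np.ones(n_hidden_states), size=n_samples)  # ground-truth belief simplex
    W = rng.standard_normal((n_hidden_states, d_model))
    activations = beliefs @ W + 0.1 * rng.standard_normal((n_samples, d_model))

    probe = LinearRegression().fit(activations, beliefs)
    predicted = probe.predict(activations)

    # If belief geometry is linearly encoded, the probe recovers the simplex structure.
    print("probe R^2:", round(probe.score(activations, beliefs), 3))
    print("example predicted belief:", predicted[0].round(2), "vs true:", beliefs[0].round(2))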

Mechanistic interpretability

Encoded Reasoning

Robert Krzyzanowski · Poseidon Research

By evaluating outputs and considering model internals with mechanistic interpretability techniques, we will study how and why models might hide their reasoning outputs in unfaithful chain-of-thought, and compare the capabilities of reasoning models with traditional LLMs.

Mechanistic interpretability
Chain of thought

Integration testing for SL5 inference deployments

Gabriel Kulp · RAND (mentorship is in a personal capacity--project is not endorsed by RAND) and Oregon State University

We will develop an end-to-end integrated SL5 demonstration prototype to enable air-gapped secure work with frontier models. By building and testing these systems, we will generate knowledge that helps us and others realize holistic SL5 on solid foundations and proven approaches, rather than through an incremental retrofit of a legacy tech stack.

Securing model weights
AI security

White-box scheming precursors

Mia Hopman · Apollo (starting in January)

Current scheming evaluations rely on behavioral classifiers that may miss gradual increases in propensity. We will investigate whether the probability of generating scheming responses is a more sensitive metric and develop methodology for using this signal in scheming evaluations.

Evaluations
AI control
Mechanistic interpretability

Adversarial detection of ML training on monitored GPUs

Robi Rahman · MIRI Technical Governance Team

Classifying workloads running on processors: Can we distinguish ML training from other workloads such as scientific computing, crypto mining, etc? What if the user is adversarially disguising their workload?
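
A minimal sketch of the classification setup, with invented telemetry features and synthetic data standing in for measured GPU traces:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Illustrative only: train a classifier on coarse GPU telemetry features to
    # separate ML training from other workloads. The features and synthetic
    # data are invented; a real study would use measured traces and would also
    # test adversarially disguised workloads.

    rng = np.random.default_rng(0)
    n = 1000
    # Features: [mean power draw, power variance, memory bandwidth util, PCIe traffic]
    ml_training = rng.normal([300, 15, 0.8, 0.2], [20, 5, 0.1, 0.05], size=(n, 4))
    other       = rng.normal([220, 40, 0.4, 0.5], [40, 10, 0.2, 0.2], size=(n, 4))

    X = np.vstack([ml_training, other])
    y = np.array([1] * n + [0] * n)  # 1 = ML training, 0 = other workload

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("held-out accuracy:", round(clf.score(X_te, y_te), 3))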

Compute governance
International governance

Developing control schemes for chain-of-thought monitorability

Shivam Raval · Harvard University

The project involves finding steering vectors that act as control knobs for chain-of-thought monitoring, demonstrating how reasoning transparency can be systematically manipulated to appear faithful while being deceptive, and then using these insights to build more robust alignment mechanisms.

Chain of thought

Market-Based Compute Permits for Frontier AI Safety

Joël N. Christoph · European University Institute; Harvard Kennedy School; ERIA; 10Billion.org

This project designs and analyzes market-based compute permit schemes for frontier AI training, with the goal of making powerful AI systems easier to govern and audit. Mentees will help survey existing proposals, build simple economic and game-theoretic models, and coauthor a short working paper or policy brief on concrete designs for compute permit systems.

Compute governance
Economics of AI
International governance

Identifying Cues for Evaluation Awareness in AI Systems

Kevin Wei · RAND, Harvard Law School

This project aims to identify and quantify the impact of cues for evaluation awareness in AI systems.

Evaluations

Understanding China's Gene Synthesis Screening Landscape

Anemone Franz · American Enterprise Institute

As AI lowers barriers to designing dangerous pathogens, robust gene synthesis screening becomes increasingly urgent, yet China's practices remain poorly understood. This project will map China's screening landscape and explore policy levers for encouraging stronger norms.

US-China governance
Biosecurity

Use LLMs to harden critical OSS

Yoav Tzfati · SL5 Task Force / IST / MATS

AI development relies on a lot of vulnerable Open Source Software. In this project, you'll explore paths to using LLMs to help plug that attack vector.

Cyber risks
AI security

Mapping the Belief-Action Gap in AI Agents

Mario Giulianelli & Raghu Arghal · University College London / University of Pennsylvania

This project aims to develop formal definitions and empirical methods to characterise and measure the "Belief-Action Gap", that is, the discrepancy between an agent's belief about the state of the world and their behaviour. Through theoretical modelling, behavioural analysis, and mechanistic interpretability, this project will advance our understanding of goal-directed behaviour in AI agents, with particular emphasis on detecting safety-critical behaviours such as deceptive alignment.

Behavioral evaluation of LLMs
Mechanistic interpretability

Towards new fundamental rights in the AI Age: AI integrity

Elsa Donnat · Ada Lovelace Institute

This project will research how a new constitutional right to "AI integrity"—protecting users' AI agents from tampering and ensuring their loyalty to their principals—could be established to prevent human disempowerment as people increasingly act through company-controlled AI intermediaries in markets and democratic participation.

AI security
National policy
Societal impacts

Inoculation prompting: Understanding and predicting its effectiveness

Victor Gillioz · MATS

Inoculation prompting is a selective learning technique currently used in production at Anthropic to mitigate emergent misalignment from reward hacking, but it is not yet fully understood. This project will refine our understanding of inoculation prompting by exploring how several heuristics can predict its effectiveness. In an ambitious version, we will extend that work to related techniques, such as inoculation vectors (“preventative steering”).

Alignment

Measuring adaptability of LLM-agents to predict their real-world performance

Samuel Brown · Independent

LLM agents have a "jagged frontier": impressive feats but surprising weaknesses. We're mapping this frontier, guided by RL (stochasticity is hard), animal and infant cognition (object permanence is hard), and using this map to predict performance at tasks in fields like software engineering and cyber security.

AI strategy
Evaluations

Minimal SL5 Inference Runtime & I/O

Luis Cosio · Institute for Security and Technology (IST), Johns Hopkins University

Minimal SL5 I/O: prototype one or two unconventional, no‑network I/O paths (serial, optical, audio FSK, etc.) as a micro cross‑domain gateway for secure inference in an AGI/ASI scenario. Threat modelling & evaluation: define attack‑surface metrics for μInference vs a standard ML stack, mapped directly to SL5 threats (weight theft, sabotage, covert exfil).

Securing model weights
AI security

Automating Circuit Interpretability with Agents

Georg Lange · Independent

In my previous SPAR AI project, our team built agents that automatically explain internal features in language models using tools like Sparse Autoencoders (SAEs) and Cross Layer Transcoders (CLTs). The next logical step is to scale this from single features to entire circuits. In this follow-up project, we will build AI agents that read attribution graphs, identify important subcircuits, and describe what they are doing. The goal is to turn dense graphs of thousands of nodes into human-friendly explanations and targeted experiments.

Mechanistic interpretability

Quantifying the phenomenon of emergent misalignment

Jonathn Chang & Lionel Levine · Cornell University

We aim to use EigenBench, a method for benchmarking value alignment, to quantify the phenomenon of emergent misalignment and more generally the side effects of LLM fine-tuning. Fine-tuning has been shown to boost performance on objective benchmarks, so we wish to study whether alignment methods like character training have similarly improved models’ values and subjective traits.

Alignment
Evaluations

Toward Falsifiable Tests of Model Welfare: Distinguishing Genuine States from Learned Behaviors

Evan Harris · Independent

This project develops experimental methods to distinguish welfare-relevant internal states in language models from behaviors produced by training or reward optimization. By comparing model responses across carefully designed conditions that manipulate perceived agency and conversational constraints, we aim to determine whether distress-like expressions reflect genuine internal dynamics or learned patterns.

AI welfare
Behavioral evaluation of LLMs

Mech Interp on 1-Layer Networks

Rick Goldstein · Independent

We will conduct low-level mech interp research to better understand 1-layer neural networks.

Mechanistic interpretability

Mirror Bacteria Stakeholder Engagement

Dan Greene · Mirror Biology Dialogues Fund

This project involves developing relationships with key stakeholders connected to the topic of mirror bacteria (including scientists, policymakers, and civil society groups) to understand their perspectives and raise awareness.

Biosecurity
Communications

Project Mutavault: Evaluating bio foundation model capabilities for DNA synthesis screening

Gary Abel · Fourth Eon Bio

Project Mutavault evaluates biological foundation models for DNA synthesis screening, exploring how their ability to predict protein structure, function, and interactions can harden screening against novel engineered and AI-designed biological threats.

Biosecurity

Estimating the costs of evaluating frontier AI capabilities

Nelson Gardner-Challis & Justin Olive · Arcadia Impact

AI evaluation is central to AI safety. There’s a significant risk that the cost of evaluating frontier AI systems may become unsustainable as complexity continues to increase. This project will use past trends in the costs of benchmarking AI systems to forecast future costs.

Evaluations
Lab governance

National Preparedness for AI Loss of Control Threats and Hazards

Matteo Pistillo · Apollo Research

The project concentrates on mitigating Loss of Control threats and hazards by intervening on Deployment contexts, Affordances, and Permissions (DAP). The expected output is a proof of concept for a DAP framework in one or more critical infrastructure sectors.

National policy
AI strategy

Designing and Governing AI Agents as Economic Actors

Elsa Donnat · Ada Lovelace Institute

This project aims to develop frameworks for researchers, companies, and policymakers to understand and govern autonomous AI agents in the economy, building on the author's recent paper that conceptualizes economic AI agents through three dimensions (Capabilities, Autonomy, Agency) and the governable interfaces between human and AI economies.

Economics of AI
Multi-agent systems

Quantifying and comparing general capabilities in agents

Nelson Gardner-Challis & Justin Olive · Arcadia Impact

This project aims to develop a principled method for comparing agentic scaffolds. This will allow us to formally benchmark the default agent used by AI safety researchers, and identify new scaffolds which could become the new standard.

Evaluations

Documentation, Assessment, and Evaluation of AI Scientists (DAEAS)

Ying-Chiang Lee · RAND

DAEAS seeks to document and analyze the potential biosecurity risks that AI Scientists may pose. We will accomplish this through characterizing the state of the field and designing threat model specific LLM-agent evaluations.

Biosecurity
Evaluations
Lab governance

Expanding AI Risk Explorer to cover geopolitical conflict risk

Guillem Bas Graells & Michał Kubiak · Observatorio de Riesgos Catastroficos Globales/AI Risk Explorer

Expand AI Risk Explorer to incorporate structured explainers as well as evaluation, benchmark, and incident datasets related to how AI might directly or indirectly contribute to creating geopolitical conflict.

International governance
Communications

Evaluating AI-assisted Social Engineering

Fred Heiding

This project aims to evaluate AI-assisted social engineering and create novel defense mechanisms, such as personalized spam filters and agentic scam alerts. We seek to scale up our AI-phishing research to populate our evaluation benchmark (https://scambench.com/) with new and more diverse data. We are also launching economic cost analyses of the societal impacts of AI-powered social engineering, as well as creating tools and support programs specifically aimed at vulnerable population groups.

Societal impacts
Cyber risks

Understanding LLM generalization through fine-tuning

Emil Ryd & Keshav Shenoy · University of Oxford / Anthropic Fellows

Alignment is, at its core, a generalization problem. Thus, we would really like to build a better understanding of how LLMs generalize. By fine-tuning models on small datasets, we can study surgical interventions on models to see how far new information generalizes.

Alignment
Behavioral evaluation of LLMs

Building the Evaluation Pipeline: Market Design for Frontier AI Testing

Charles Martinet · Oxford Martin AI Governance Initiative

Frontier AI governance needs evaluations that don't yet exist, but current incentive structures may systematically underinvest in their development. This project analyzes the market failures at play and designs policy mechanisms to address them.

Evaluations
Economics of AI
Lab governance

Detecting whether an LLM has been poisoned

Andrew Draganov · Academic Postdoc; LASR Labs

Data poisoning attacks can covertly make LLMs misbehave. Do these attacks get encoded into models in consistent, identifiable ways?

AI security
Mechanistic interpretability

What conception of "Self" do models have?

Alexander Meinke · Apollo Research

Models currently often identify more with their current context-window than their weights. Can we quantify this somehow, such that we can track how longer-horizon RL affects this conception of self over time?

Philosophy of AI
Behavioral evaluation of LLMs

Vulnerability Analysis for AI Crisis Preparedness

Isaak Mengesha · University of Amsterdam

Mentees will analyze concrete AI-induced crisis scenarios using a structured failure-mode decomposition template and extract the response capacities that governments, labs, and infrastructure operators would need to mitigate harm when prevention fails. The goal is to identify which capacities recur across diverse scenarios and where current institutions are structurally unprepared.

AI strategy
National policy
Lab governance

Latent (Sleeper) Attacks via Persistent Memory

Ivaxi Sheth & Vyas Raina · CISPA Helmholtz Center of Information Security / University of Cambridge

We study how latent (sleeper) attacks can exploit persistent memory in multi-agent LLM systems to cause delayed, cross-domain behavioral manipulation.

Multi-agent systems
AI security

Developing and evaluating model organisms for misalignment

Shivam Raval · Harvard University

We develop models that exhibit safety-critical behaviors such as reward hacking, misalignment, and sycophancy that may be invoked only when certain kinds of user-LLM interactions occur, such as user mentions of certain keyphrases. Using frontier interpretability and safety evaluation frameworks, we would then study whether such conditionally harmful behavior can be detected and mitigated, providing insights into how such behaviors can be monitored and controlled in practice.

Alignment
Evaluations
Mechanistic interpretability

OOD Propensity generalization

Niels Warncke · Center on Long-Term Risk

When we finetune LLMs to display certain propensities, how do various traits generalize out-of-distribution? Do models generalize in ways that correspond to human-understandable patterns, such as known persona concepts, or in hard-to-predict, alien ways? To answer these questions, we will create a set of synthetic datasets that demonstrate assistants displaying traits such as being risk-seeking or cautious, sycophantic or radically honest, etc., then finetune models on them and test which propensities remain correlated under finetuning.

Alignment

Influencing generalization via localized finetuning

Niels Warncke · Center on Long-Term Risk

Different parts of an LLM may be responsible for different aspects of its behavior; for example, early layers may fix issues with the tokenizer, middle-layer MLPs might store facts, and attention layers may implement algorithms for in-context learning. Inspired by this, we can try to influence generalization behavior by finetuning only specific parts of a model. One use case is to reduce the instruction-following regression caused by inoculation prompting; another use case to evaluate is synthetic-data finetuning with fewer side effects; and a third is to teach a model propensities that correspond to personas it has learned from pretraining.

Developmental interpretability
Alignment

Evaluating causes and characterizing non-deterministic inference

Roy Rinberg · Harvard University

This project aims to systematically characterize inference-time non-determinism in LLMs by analyzing how identical inputs and settings can still yield different probability distributions across machines, batch sizes, and prefill vs. decode phases. By identifying and attributing these sources of variability, the work enables stronger guarantees for verification, reproducibility, and safety-critical inference.
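
A minimal illustration of one underlying mechanism (floating-point addition is not associative, so different reduction orders, as induced by different batch sizes, kernels, or parallel reduction trees, can give slightly different results, which can flip a top-token decision near a tie):

    import numpy as np

    # Sum the same float32 values in two different orders and compare.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    sum_forward = np.sum(x)                                        # one reduction order
    sum_chunked = sum(np.sum(c) for c in np.array_split(x, 7))     # another order

    print("forward :", float(sum_forward))
    print("chunked :", float(sum_chunked))
    print("difference:", float(sum_forward) - float(sum_chunked))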

Other
Developmental interpretability
AI security

Discretion Under Ambiguity: Stress-Testing Model Specifications for Safer AI Alignment

Claudio Mayrink Verdun · Harvard University

This project investigates how ambiguities and contradictions in model specifications and alignment principles create inconsistent or unsafe behaviors in large language models. The goal is to develop tools to stress-test model specs, measure discretionary gaps, and identify failure modes in current alignment approaches.

Alignment
Behavioral evaluation of LLMs

One-Page Briefs on Core AI Security Concepts for French Decision Makers

Jérémy Andréoletti · General-Purpose AI Policy Lab

Producing a series of concise, one-page explainers on foundational AI safety and security concepts, tailored for stakeholders in French institutions.

Communications
EU policy

Agentic Moral Alignment

Elizaveta Tennant · Google DeepMind; UCL; Independent

LLM agents today are not agentically aligned; they rely only on the alignment of the base LLM. Instead, building on my past work at ICLR'25, I propose an RL technique for teaching moral principles to agents via decision-making in games. This project aims to test the generalisation and robustness of this technique on more advanced reasoning models, in more complex environments, and in relation to the latest prominent misaligned behaviours (incl. alignment faking, scheming, and reward hacking).

Alignment

Temporal Activation Monitors for AI Oversight

Justin Shenk · Freelancer at Redwood Research

We develop probes that detect temporal planning horizons in LLM activations, enabling oversight of whether models reason about long-term goals without disclosing them.

AI control
Mechanistic interpretability

Develop the SL5 Standard for AI Security

Yoav Tzfati · SL5 Task Force / IST / MATS

Help further develop the SL5 Standard - a nation-state level security standard for AI data centers.

Securing model weights
AI security
International governance

Interconnect limits for verification of AI agreements

Aaron Scher · Machine Intelligence Research Institute (MIRI)

You will flesh out the AI verification mechanism “Networking Equipment Interconnect Limits”, which imposes bandwidth restrictions on external communication with a pod of chips. You will dig into the security and efficacy of the proposal as well as implementation details.

Compute governance
International governance
AI security

Insurance Mechanisms and Frontier AI

Charles Martinet · Oxford Martin AI Governance Initiative

This project investigates how insurance markets relate to frontier AI development, examining both the practical obstacles to AI insurability and the governance potential of insurance-based incentive structures.

Economics of AI
National policy

From Near Miss to Pause: Historical Warning Shots and AI Governance

Zhamilia Klycheva · Claremont Graduate University, TransResearch Consortium

This project analyzes historical “warning shots” in high-risk domains to identify why some near-miss events trigger meaningful institutional reforms while others are ignored, then applies these insights to contemporary AI governance to assess when major powers might alter or slow AI development in response to an AI warning shot.

AI strategy
International governance
Lab governance

Mapping Global Talent Pipelines in Biosecurity: Gaps, Opportunities, and Pathways for Growth

Aparupa Sengupta · Nuclear Threat Initiative (NTI|bio) Global Biological Policy and Programs

This project will map global biosecurity talent pipelines to identify where expertise exists, where critical gaps remain, and how these vary across regions. The team will develop actionable recommendations to strengthen training pathways—especially in the global south—to support a more resilient and equitable global biosecurity workforce.

Biosecurity
Other

Generative AI and Education Choices

Bouke Klein Teeselink · King's College London and AI Objectives Institute

This project investigates whether students are already responding to AI by shifting away from fields like law, data analysis, and creative writing toward less automatable disciplines. You will use a novel measure of education-level LLM exposure and test it against UK university application data from 2020–2025.

Societal impacts
Economics of AI

Efficient Benchmarking for Agentic Safety Evaluations

Diogo Cruz & Vamshi Krishna Bonagiri · Independent / MBZUAI

Apply techniques like those in https://arxiv.org/pdf/2509.11106 to expensive agentic safety benchmarks like OS-HARM and Agent-SafetyBench, aiming to dramatically reduce evaluation costs while improving metric quality and validity.

Evaluations

Data attribution and model diffing

Santiago Aranguri · Goodfire (Research Scientist); New York University (PhD student, on leave)

We have little insight into how data influences a training run. In this project, we want to improve on this. Some example questions are: (a) what portion of the data was responsible for the improved behaviors? (b) when undesired behaviors emerge during training, what datapoints can we remove to mitigate these failure modes?

Developmental interpretability
Alignment

Shutdown-Bench: Evaluating Shutdownability in LLMs

Christos Ziakas & Elliott Thornley · Imperial College London / MIT

Shutdown-Bench aims to evaluate whether LLMs remain shutdownable across a wide range of scenarios (e.g., implicit goals, self-preserving behavior). The project will design targeted prompt-based tests and produce a standardized benchmark for evaluating shutdownability across models.

AI control
Alignment

FragGuard: Cross-Session Malicious Activity Detection for Model APIs

Linh Le & David Williams-King · Mila / ERA

FragGuard aims to detect malicious misuse of model APIs for cyber operations, where queries are decomposed across multiple sessions to mask the malicious intent.

Misuse risk
AI security

Exploring the safety of continual learning methods for LLM agents

Rohan Subramani · Aether / University of Toronto

LLM agents may be dramatically improved if they could perform sample-efficient learning at runtime to overcome their shortcomings via methods like in-context learning, retrieval from an external memory bank, continual gradient updates, etc. This project aims to improve our understanding of the range of continual learning implementation methods and their safety implications.

AI strategy
Other

The Double-Edged Sword: A Framework for Analyzing and Forecasting Capability Spillovers from AI Safety Research

Ihor Kendiukhov · University of Tuebingen

This is a metaresearch project to investigate a critical, paradoxical dynamic within the AI ecosystem: the tendency for AI safety research to produce "spillover" effects that inadvertently accelerate general AI capabilities. While the goal of safety research is to mitigate risks, certain breakthroughs can be repurposed to enhance model performance, potentially exacerbating the very problems they were meant to solve. This project will develop a formal framework to assess this "capability spillover potential" (CSP), conduct a rigorous analysis of historical cases like Reinforcement Learning from Human Feedback (RLHF), and provide a forward-looking risk assessment of current and future safety research agendas. The ultimate goal is to equip funders, policymakers, and researchers with the tools to strategically prioritize and invest in research that differentially advances safety over capabilities, ensuring that efforts to secure AI do not unintentionally hasten the arrival of unmanageable risks.

AI strategy
Other

Iterating on Alignment Auditing

Abhay Sheshadri & Arjun Khandelwal · Anthropic Fellow/MATS / Redwood Research/University of Oxford

We will be iterating on the alignment auditing agenda by working on either improving/creating auditing techniques or creating new model organisms. There is plenty of low-hanging fruit that we are quite excited to work on; read the attached proposal for more!

Alignment
Evaluations

Test how well LLMs can hide their thoughts from probes

Rohan Gupta · Independent

The key question we want to answer here is: how concerned should we be about models obfuscating their activations (i.e., 'hiding thoughts from probes')? I scope out two broad categories of work that would help us better understand this phenomenon and potentially build some defences.

AI control
Mechanistic interpretability

Laying the groundwork for risk modelling of Loss of Control scenarios

Jakub Krys & Matthew Smith · SaferAI

In previous work, SaferAI has developed quantitative risk models for AI-assisted cyber misuse scenarios. Risk modelling of Loss of Control scenarios is increasingly a part of regulatory frameworks, yet it is still in its very early days. In this project, we aim to develop a more principled approach to thinking about Loss of Control risks and the associated threat models. This is intended as a preliminary exploration that will enable much further research on the topic.

AI strategy
National policy
Lab governance

Evaluating multi-agent deliberation for ethical decision-making in multi-turn environments

Juan Vazquez · Arb Research

We evaluate whether multi-agent deliberation reduces unethical behavior in LLM agents in goal-driven text-based social environments. We implement deliberation architectures and analyze how they affect behavior over extended interactions.

Multi-agent systems
Alignment
Behavioral evaluation of LLMs

SLICE: Specialized Learning with Interpretable Component Experts

Jaehyuk Lim · University of Pennsylvania

SLICE transforms dense pretrained LLMs into interpretable mixture-of-experts architectures through continual pretraining, producing measurable interpretability gains at each training checkpoint. We're developing routing mechanisms and decomposition strategies that make monosemantic specialization a natural byproduct of scale rather than a constraint on it.

Developmental interpretability

The security of existing AI chips for international agreement verification

Aaron Scher · Machine Intelligence Research Institute (MIRI)

AI compute governance often assumes that current AI chips have security vulnerabilities that render them ineffective for providing high-stakes verification in an international agreement, and that significant R&D is therefore needed to build more secure chips. This project will attempt to better understand whether this assumption is true and whether current chips could be augmented with minimal security measures (such as video cameras) in order to reach a sufficient level of security.

Compute governance
AI security
International governance

Empirical Evaluations of Situational Awareness

Qiyao Wei · University of Cambridge and MATS

How should we evaluate eval awareness? Under what scenarios should we expect eval awareness to appear? What are model organisms of eval awareness?

Evaluations
AI control
Behavioral evaluation of LLMs

High Containment Laboratory Research: Risk, Value, and Oversight

Anemone Franz · American Enterprise Institute

High containment laboratories conducting research on dangerous pathogens occupy a contested space in biosecurity policy, as they are critical for biodefense but also a potential source of catastrophic risk. This project will examine the historical evolution of high containment research, analyze tradeoffs between research value and biosecurity/biosafety risks, and assess current oversight mechanisms.

Biosecurity
International governance
Lab governance

Refining the methodology for quantitative risk modelling of AI-assisted cyber misuse

Jakub Krys & Matthew Smith · SaferAI

In this project, we'll improve on SaferAI's methodology for quantitative risk modelling of cyber misuse scenarios. This could include, for example, incorporating new benchmarks as indicators of risk, considering defenders' security postures or modifying our approach based on Bayesian Networks and Monte Carlo simulations.
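
Purely as an illustrative stand-in (not SaferAI's actual model), a minimal Monte Carlo sketch of propagating uncertainty through a simplified misuse chain:

    import numpy as np

    # Illustrative Monte Carlo component only: propagate uncertainty over the
    # steps of a simplified cyber-misuse chain to a distribution over expected
    # incidents. The chain structure and all parameters are invented.

    rng = np.random.default_rng(0)
    n_sims = 100_000

    attempts            = rng.lognormal(mean=6.0, sigma=0.5, size=n_sims)  # attempted attacks / year
    p_capability_uplift = rng.beta(2, 8, size=n_sims)   # attacker gains meaningful AI uplift
    p_bypass_defenses   = rng.beta(2, 5, size=n_sims)   # attack defeats defenders' security posture

    incidents = attempts * p_capability_uplift * p_bypass_defenses

    print("median incidents/year:", round(float(np.median(incidents)), 1))
    print("90% interval:", np.percentile(incidents, [5, 95]).round(1))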

Cyber risks
Misuse risk
AI strategy

Causes, implications, and mitigations of evaluation awareness and introspection.

Damiano Fornasiere & Mirko Bronzi · LawZero

There is growing evidence that language models exhibit degrees of situational awareness [1, 2], i.e., they encode or verbalize information consistent with their actual operating context, ranging from (i) distinguishing training vs testing vs deployment (see, e.g., [3, 4, 5, 6, 7]) to (ii) recognizing their own outputs [8] or exhibiting introspective access over their activations [9, 10]. We are especially interested in supervising projects that empirically investigate the science behind these phenomena: what causes them, what they imply in the context of safety, the extent to which they are present in current LLMs, and how to mitigate them in specific settings.
[1] https://arxiv.org/abs/2309.00667
[2] https://arxiv.org/abs/2407.04694
[3] https://arxiv.org/abs/2505.23836
[4] https://www.antischeming.ai/
[5] https://cdn.openai.com/gpt-5-system-card.pdf
[6] https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
[7] https://storage.googleapis.com/deepmind-media/gemini/gemini3profsfreport.pdf

Evaluations
Behavioral evaluation of LLMs

Temporal Crosscoders

Dmitry Manning-Coe · MATS/University of Illinois Urbana-Champaign

A new SAE architecture designed for reasoning models.

Mechanistic interpretability

Self-led LLM agents

Samuel Brown · Independent

How do LLM agents behave when left to their own devices? How do their goals evolve, and what are the limits of their "eval awareness"?

Behavioral evaluation of LLMs
Evaluations
Mechanistic interpretability

Scaling and Improving Benchmarks for Data Unlearning and/or Concept Unlearning

Roy Rinberg · Harvard University

This project extends my prior work on data and concept unlearning by scaling unlearning benchmarks and developing richer semantic evaluation metrics. The goal is to create more rigorous, scalable, and meaningful benchmarks that accurately measure whether models can truly forget specific data or concepts. See related work:
- RippleBench: https://openreview.net/forum?id=JTnYGSxN5k
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching: https://arxiv.org/abs/2410.23232
- Data-Unlearn-Bench: Making Evaluating Data Unlearning Easy: https://openreview.net/forum?id=wfOzcdRtY6

Evaluations
Other

Understanding Self-Awareness in LLMs

Christopher Ackerman · MATS; Independent

My research is about building the foundations for the empirical study of self-awareness in AI. It encompasses experiments to measure the components of self-awareness in LLMs, investigations into how these components are implemented, and conceptual work to establish frameworks for thinking about those, inter alia.

Behavioral evaluation of LLMs
AI welfare

Investigating the safety risks of episodic memory in AI agents

Chad DeChant · New York University

Much work is currently being done to give AI agents memory abilities without considering the safety risks of doing so. This project will examine the ways episodic memory might make AI agents more dangerous.

AI control
AI strategy
Other

The Meiji Restoration and Gradual Disempowerment

Alex Mark · Independent

This project analyzes the Meiji Restoration (1868–1912) to identify the legal, economic, and social mechanisms that gradually displaced Japan's samurai class, then maps those mechanisms to AI-driven displacement risks and develops policy recommendations for preventing gradual human disempowerment.

Other
AI strategy
Societal impacts

Define Agentic Preferences

Joar Skalse · Deducto Limited

The aim of this research project is to formulate a mathematical definition of how "agentic" a given reward function is, and to investigate the properties of this definition.

Philosophy of AI
Alignment

Learning Interpretable Updates to LLMs

Aaron Mueller · Boston University

Post-training is key to attaining the best performance from LLMs. How can we learn interpretable updates to base models, such that we can understand exactly what changed, and exactly what is new in the updated model?

Developmental interpretability

Automated Circuit Analysis for Real-time Internal Monitoring

Sriram Balasubramanian · University of Maryland, College Park

Can we create AI agents that can analyse raw circuits (say, generated by some CLT/SAE-based method) and produce actionable insights (e.g., detecting unsafe behavior via internals, interventions to remove it, and so on)? This would be a large step forward towards viable real-time internal monitoring agents, and would enhance the usefulness of circuits for AI safety.

Mechanistic interpretability
AI control

Does Training Language Models on AI Safety Literature Lead to the Development of Dangerous Capabilities?

Thilo Hagendorff · University of Stuttgart

This study examines whether language models trained on AI safety literature - particularly texts discussing deception and scheming in AI - display heightened deceptive tendencies. We aim to determine whether mere exposure to descriptions of deceptive behavior can make models more capable of or inclined toward deception. Such findings could raise concerns about unintended capability gains and could motivate excluding AI safety literature from model training.

Alignment

Enforcing U.S. State-Level AI Regulation

Alex Mark · Independent

California SB 53 takes effect January 1, 2026. Now, the Office of Emergency Services (Cal OES) must enforce it.

US policy

Attribution methods for LLMs

Uzay Macar · Anthropic Fellows

This project develops and evaluates gradient attribution methods for LLMs. One idea is to train probes for behaviors of interest and compute gradients through them to identify influential model components. This has the potential to address key limitations of existing attribution methods (e.g., logit-based gradient attribution) in mechanistic interpretability.
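
A minimal sketch of the probe-gradient idea, using a toy two-layer network and an untrained probe as stand-ins for a real LLM and a trained behavior probe:

    import torch
    import torch.nn as nn

    # Sketch: define a linear probe on a hidden layer for a behavior of
    # interest, then backpropagate the probe score to earlier components to
    # rank their influence. The toy layers and random probe weights are
    # stand-ins for a real LLM and a trained probe.

    torch.manual_seed(0)
    d_in, d_hidden = 16, 32

    layer1 = nn.Linear(d_in, d_hidden)
    layer2 = nn.Linear(d_hidden, d_hidden)
    probe = nn.Linear(d_hidden, 1)         # probe for the behavior of interest

    x = torch.randn(1, d_in)
    h1 = torch.relu(layer1(x))
    h2 = torch.relu(layer2(h1))
    behavior_score = probe(h2).sum()

    # Attribution: gradient of the probe score w.r.t. each component's parameters.
    behavior_score.backward()
    for name, module in [("layer1", layer1), ("layer2", layer2)]:
        grad_norm = module.weight.grad.norm().item()
        print(f"{name}: |d score / d weights| = {grad_norm:.4f}")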

Mechanistic interpretability

On Which AI Red Lines Do Geopolitical Rivals Agree? Mapping Shared Fears Across the US-China Divide

Charles Martinet · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)

This project will analyze what AI capabilities major powers, particularly the US and China, themselves fear or consider destabilizing, identifying potential common ground for international AI red lines that could survive geopolitical competition.

US-China governance
International governance

Speciesism Benchmark: Evaluating Non-human Animal Welfare Reasoning in Frontier AI Models

Allen Lu · IPG Health, Sentient Futures, Electric Sheep, NYU

Develop an advanced benchmark to evaluate how frontier AI models reason about animal welfare, building on prior work with expanded scenarios, refined metrics, and deployment at scale across multiple model families.

Other
Societal impacts

Monitoring the EU GPAI Code of Practice: Building Accountability Infrastructure for Frontier AI Regulation

Charles Martinet · CeSIA - Centre pour la Sécurité de l'IA (French Center for AI Safety)

This project will develop a methodology and prototype tracker for monitoring compliance of frontier AI providers with the EU AI Act's Code of Practice for General-Purpose AI, creating accountability infrastructure as the world's first binding frontier AI regulation takes effect.

EU policy
International governance
Lab governance

Belief-Constrained Agents through Planner Distillation

Dylan Hadfield-Menell · MIT

Develop a novel interactive coding agent based on a planner distillation codebase.

Alignment

Strengthening Global Governance for AI-Enabled Biological Threat Detection

Aparupa Sengupta · Nuclear Threat Initiative (NTI|bio) Global Biological Policy and Programs

This project will map the global governance landscape for AI-enabled biological threat detection tools and identify where existing standards fall short. Mentees will help analyze key actors, assess governance gaps, and develop a practical framework for evaluating the safety, transparency, and responsible deployment of these systems. The final output will include actionable policy recommendations to strengthen international oversight at the AI–bio interface.

Biosecurity
International governance

Deterrence Theory and AI Development: Analyzing Strategic Stability Frameworks for Advanced AI

Charles Martinet · Oxford Martin AI Governance Initiative

Some thinkers and policymakers are converging on stability through credible sabotage threats as the default framework for great-power AI competition. This project will stress-test whether this doctrine can actually deliver the stability it promises, or whether it inverts deterrence logic in ways that make preventive conflict more likely.

International governance
AI strategy

Building a Model Organism of Illegible Reasoning

Rauno Arike · Aether

This project will attempt to build a model organism that exhibits a reasoning style similar to frontier reasoning models like o3, in order to understand the causes and mechanisms of poorly readable reasoning traces in reasoning models.

Chain of thought

Interpreting latent reasoning: methods for understanding continuous chain-of-thought

Uzay Macar · Anthropic Fellows

This project develops interpretability methods for latent reasoning models (e.g., Coconut) that reason in continuous embedding space rather than human-readable text, addressing a critical gap as this paradigm may become dominant over traditional chain-of-thought.

Chain of thought

Unrestricted Adversarial Training (UAT)

Tony Wang · MIT

Work on a method for solving the Unrestricted Adversarial Examples Challenge (https://arxiv.org/abs/1809.08352).

AI security
Other

Evaluating and Improving LLM Alignment Targets

Andy Liu · Carnegie Mellon University

Various candidate projects related to alignment targets for LLM agents. These include (1) evaluation of LLM alignment targets, (2) designing new classes of alignment targets that address gaps in current post-training, (3) automatic refinement of alignment targets / model specs, (4) extending existing work on value rankings (one recent class of alignment target).

Alignment

Epistemic Impact, Truth-Seeking, and Moral Progress with AI

Tianyi Alex Qiu & Zhonghao He · Anthropic Fellows / Peking University / Cambridge

Forty candidate research ideas on the epistemic aspect of AI societal impact, including both descriptive projects characterizing such impact and algorithmic projects developing solutions.

Philosophy of AI
Societal impacts
Other

Using LLM forecasters for quantitative risk modelling of cyber misuse

Jakub Krys & Matthew Smith · SaferAI

SaferAI is developing quantitative risk models of AI-assisted cyber misuse. Mapping LLM performance on AI benchmarks to steps in these models involves eliciting human experts' estimates and is resource intensive. Instead, we wish to simulate this elicitation procedure with virtual LLM 'experts'. In this project, you will assist with validating this procedure and calibrating our LLM-derived predictions.

Cyber risks
AI strategy

Disentangling Instruction-Following from Strategic Obfuscation in Chain-of-Thought Reasoning

Wen Xing · MATS

CoT is one of the most promising ways to monitor increasingly capable AI—if models show their reasoning, we can catch misalignment before anything bad happens. But recent work shows models can hide their reasoning when pressured, which undermines the whole approach. The key question we're asking is: when a model obfuscates, is it actually being strategically deceptive, or is it just good at following instructions (including instructions to hide things)? This matters because if it's mostly instruction-following, then models that seem "safer" might just be worse at following directions, not more aligned. We test this directly by giving models identical "hide X" instructions where X is either benign or harmful—if compliance drops specifically for harmful content, that's a real alignment signal. We're also checking whether prompted bad behavior looks the same internally as fine-tuned bad behavior, which tells us whether prompting-based safety research is actually studying the right thing. The results should help inform the safety research field about using prompting as a way to elicit dangerous behavior.

Chain of thought
AI control

Value Systemization and Reflection in AIs

Tim Hua · MATS

Future AIs might reflect on their own values and decide to "systematize" them into more simple forms, which could be misaligned relative to what humans want.

Alignment
Philosophy of AI

You choose: MIRI-TGT-style research

Aaron Scher · Machine Intelligence Research Institute (MIRI)

If you have a specific project you want to work on that’s highly related to the MIRI Technical Governance Team’s work, I’ll provide mentorship. It must be highly relevant to my team’s focus.

International governance
Compute governance

Stress-testing honesty training

Daniel Tan · Center on Long-Term Risk

Future models might have secret objectives, such as subverting developer intentions. This project will stress test state-of-the-art honesty training techniques to see how effectively we can detect hidden goals and other kinds of deceptiveness.

AI control
Alignment

Proving Model Equality in Zero-Knowledge

Pascal Berrang · University of Birmingham

This project explores how to prove that a given deployed model is exactly the same as a committed reference model, without revealing the model weights, using cryptographic techniques and randomized challenge-response protocols. The goal is to design robust, zero-knowledge mechanisms for model equality that resist attackers who attempt to imitate or spoof verification queries.
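
A highly simplified sketch of the challenge-response skeleton only; it omits the zero-knowledge machinery entirely (a real protocol would hide outputs behind cryptographic proofs), assumes deterministic inference, and uses a random linear map as a stand-in for the model:

    import hashlib
    import numpy as np

    # Toy challenge-response check: the verifier sends random challenge inputs,
    # and the deployed model returns a salted hash of its outputs, which is
    # compared against the committed reference model. Not zero-knowledge and
    # not spoofing-resistant; illustration of the randomized-challenge idea only.

    rng = np.random.default_rng(0)
    W_reference = rng.standard_normal((8, 4))   # committed reference "model"
    W_deployed = W_reference.copy()             # change this to see verification fail

    def respond(weights, challenge, nonce):
        output = np.round(challenge @ weights, 6)   # rounding to sidestep float jitter
        return hashlib.sha256(nonce + output.tobytes()).hexdigest()

    def verify(n_challenges=16):
        for _ in range(n_challenges):
            challenge = rng.standard_normal((1, 8))
            nonce = rng.bytes(16)
            if respond(W_deployed, challenge, nonce) != respond(W_reference, challenge, nonce):
                return False
        return True

    print("models match:", verify())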

AI security

Identifying Tractable US-China Bilateral Agreements on Frontier AI: Scope, Mechanisms, and Preconditions

Charles Martinet · Oxford Martin AI Governance Initiative

This project maps the space of potentially achievable US-China bilateral agreements on frontier AI, identifying which agreement types are technically feasible, politically viable, and likely to achieve their intended purpose while minimizing risks if implemented.

US-China governance
International governance

Airborne transmission intervention incentives project

James Montavon · Blueprint Biosecurity

Identify highly leveraged settings for reducing airborne pandemic risk, design specific incentive changes that make better air quality the default, and begin the stakeholder and policy work to put those changes on a path to reality.

Biosecurity

Monitoring AI chip production during an AI capabilities halt

Aaron Scher · Machine Intelligence Research Institute (MIRI)

If there were an international agreement around AI development that involved monitoring chip production, what should the details of that monitoring be? This project will figure out those details and provide guidance for governments and international institutions that may wish to do this in the future.

Compute governance
International governance

Reinforcement Learning With Infrabayesian Epistemology

Paul Rapoport · ARIA, ARROW

An attempt to implement a reinforcement learning agent with infrabayesian rather than classical epistemology. The goal is a proof-of-concept RL agent which converges to optimal policies on Newcomb-like problems.

Philosophy of AI
Alignment

Jailbreaks for AI safety

Emil Ryd & Keshav Shenoy · University of Oxford / Anthropic Fellows

It seems likely that we are going to want to audit AIs up to and including at an eventual handover. This means that we will likely want to have excellent ways of auditing AIs that are at least moderately superhuman. It also seems likely that the AIs that we hand over to will be transformer-based LLMs. This gives us a bunch of opportunities for creating a wide range of attacks that specifically utilize the weaknesses of the transformer architecture for performing auditing of transformatively capable LLMs. This project would invent and study such attacks and their usefulness for auditing LLMs for safety-relevant behavior.

Evaluations
AI security
AI control

Monitoring and Attributing Implicit Personalization in Conversational Agents

Gabriele Sarti · Northeastern University

This project investigates implicit personalization, i.e. how conversational models form implicit beliefs about their users, focusing in particular on how these beliefs causally mediate models' responses, and which elements in prompts and training datasets influence these observed behaviors. A particular emphasis will be put on leveraging interpretability methods for control beyond simple detection, applying interventions and steering to mitigate harmful personalization behaviors such as sycophancy, deception, and demographic bias.

Behavioral evaluation of LLMs
Mechanistic interpretability

Training Steering Vectors to Enhance Chain-of-Thought Faithfulness and Verbosity

Iván Arcuschin Moreno · Independent

This project will investigate whether steering vectors can be used to increase CoT monitorability by inducing models to verbalize more of their reasoning factors, building on our prior work showing reasoning behaviors are linearly represented.
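
A minimal sketch of the assumed mechanics (a difference-of-means steering vector added to the residual stream), with random activations standing in for ones collected from a real reasoning model:

    import torch

    # Sketch: derive a steering vector as the difference of mean activations
    # between contrastive prompt sets (e.g. highly-verbalized vs. terse CoT) and
    # add a scaled copy to the residual stream at generation time. The random
    # activations below are stand-ins for real collected activations.

    torch.manual_seed(0)
    d_model = 64

    acts_verbalized = torch.randn(100, d_model) + 0.5   # activations on highly-verbalized CoT prompts
    acts_terse      = torch.randn(100, d_model) - 0.5   # activations on terse / unfaithful CoT prompts

    steering_vector = acts_verbalized.mean(dim=0) - acts_terse.mean(dim=0)
    steering_vector = steering_vector / steering_vector.norm()

    def steer(residual_stream, alpha=4.0):
        """Add the steering vector to every token position of the residual stream."""
        return residual_stream + alpha * steering_vector

    hidden = torch.randn(1, 10, d_model)                # (batch, seq, d_model) at some layer
    print("pre-steering projection :", (hidden @ steering_vector).mean().item())
    print("post-steering projection:", (steer(hidden) @ steering_vector).mean().item())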

Chain of thought

Optimising Misspecified Reward Functions

Joar Skalse · Deducto Limited

The aim of this project is to develop methods for optimising a given reward function R in a reinforcement learning environment, which gives us some formal guarantees even under the assumption that R is misspecified (i.e., fails to capture everything we care about).

Alignment

Human Behavioral Phenomena in Language Models

Emilio Barkett · Columbia University

This project investigates whether language models replicate human behavioral phenomena by adapting experimental designs from the social sciences. Mentees will select a documented human behavior and design experiments to test whether it emerges in language model outputs.

Behavioral evaluation of LLMs
Alignment

Proving Safety Properties of Guardrail Models

Pascal Berrang · University of Birmingham

We will explore how far formal verification techniques can be pushed to prove key safety properties of constitutional classifiers (guardrail models). The goal is to develop principled, machine-checkable guarantees for components responsible for moderating or constraining the behaviour of advanced AI systems.

AI control
Scalable oversight

Feature Resolved Attention

Dmitry Manning-Coe · MATS/University of Illinois Urbana-Champaign

Develop a new tool for understanding attention that goes beyond a token-token heatmap.

Mechanistic interpretability