Coordination & Cooperation Session

Thursday, June 5 | 👑 Eugene Vinitsky

Uncertainty constrains coordination: robustness to humans without human data

Eugene Vinitsky

4:00 PM - 4:30 PM | Merrill Hall

Repeated AI Interaction: How Agents Learn and Strategize Over Time

Natalie Collina

4:30 PM - 4:50 PM | Merrill Hall

The Habermas Machine: AI-Mediated Deliberation to Protect Human Agency

Michiel Bakker

4:50 PM - 5:10 PM | Merrill Hall

AssistanceZero: Scalably Solving Assistance Games

Cassidy Laidlaw

5:10 PM - 5:30 PM | Merrill Hall

Panel Discussion

Eugene Vinitsky, Natalie Collina, Michiel Bakker, Cassidy Laidlaw

5:30 PM - 6:00 PM | Merrill Hall

Model Organisms of Misalignment Session

Friday, June 6 | 👑 Program Committee

New findings in Emergent Misalignment

Owain Evans

10:30 AM - 11:00 AM | Merrill Hall

Alignment faking in large language models

Ryan Greenblatt

11:00 AM - 11:30 AM | Merrill Hall

Auditing Language Models for Hidden Objectives

Evan Hubinger

11:30 AM - 12:00 PM | Merrill Hall

Robustness & Guaranteed Safety Session

Friday, June 6 | 👑 Program Committee

Trustworthy and Transparent Alignment of Large Language Models

Tong Zhang

10:30 AM - 11:00 AM | Nautilus

Safeguarded AI & Neural Proof Certificates

Davidad

11:00 AM - 11:30 AM | Nautilus

Guaranteed safety via pessimism

Michael Cohen

11:30 AM - 12:00 PM | Nautilus

Student Lightning Talks

Friday, June 6

Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts

Jiahai Feng

1:30 PM - 1:40 PM | Merrill Hall

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Dylan Cope

1:40 PM - 1:50 PM | Merrill Hall

AI Safety for Everyone

Balint Gyevnar

1:50 PM - 2:00 PM | Merrill Hall

Neural Manifold Geometry Encodes Feature Fields

Julian Yocum

2:00 PM - 2:10 PM | Merrill Hall

Political Neutrality in AI is Impossible – But Here is How to Approximate it

Ruth Elisabeth Appel

2:10 PM - 2:20 PM | Merrill Hall

Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

Hanlin Zhu

2:20 PM - 2:30 PM | Merrill Hall

Scaling Trends in Language Model Robustness

Nikolaus (Niki) Howe

2:30 PM - 2:40 PM | Merrill Hall

HumanAgencyEval: Measuring AI Threats to Human Agency

Jacy Anthis

2:40 PM - 2:50 PM | Merrill Hall

Advancing Bayesian Inverse Reinforcement Learning

Ondrej Bajgar

2:50 PM - 3:00 PM | Merrill Hall

Robust and Diverse Multi-agent Learning via Rational Policy Gradient

Niklas Lauffer

3:00 PM - 3:10 PM | Merrill Hall

Control Vision: One Pathway for Technical Success

Alan Cooney

3:10 PM - 3:20 PM | Merrill Hall

Spooky Demos Session

Friday, June 6 | 👑 Max Tegmark

Creepy demos for AI safety

Max Tegmark

9:00 AM - 9:25 AM | Merrill Hall

o3 prevents itself from being shutdown: empirical evidence of instrumental convergence

Jeffrey Ladish

1:30 PM - 1:50 PM | Nautilus

Live Demos of AI Risks for Policymakers and Civil Society

Siddharth Hiregowdara

1:50 PM - 2:10 PM | Nautilus

Demos that don’t spook boil the frog

Holly Elmore

2:10 PM - 2:30 PM | Nautilus

Panel Discussion

Max Tegmark, Jeffrey Ladish, Siddharth Hiregowdara, Holly Elmore

2:30 PM - 3:00 PM | Nautilus

Well-Founded AI Session

Saturday, June 7 | 👑 Sanjit Seshia

Verified AI: Progress and Challenges

Sanjit Seshia

9:00 AM - 9:30 AM | Merrill Hall

Symbolic Reasoning about Large Language Models

Guy Van den Broeck

10:30 AM - 11:00 AM | Merrill Hall

Human-like Concept Induction through Library Learning and Probabilistic Program Synthesis

Maddy Bowers

11:00 AM - 11:30 AM | Merrill Hall

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula

11:30 AM - 12:00 PM | Merrill Hall

Human Value Learning Session

Saturday, June 7 | 👑 Brad Knox

Understanding what people want: Towards descriptive and identifiable psychological models for reward inference

Brad Knox

9:30 AM - 10:00 AM | Merrill Hall

Neglected Approaches to Value Learning and Alignment

Diogo Schwerz de Lucena

10:30 AM - 10:50 AM | Nautilus

Settling the Reward Hypothesis

Michael Bowling

10:50 AM - 11:10 AM | Nautilus

Reasoning Models Don’t Always Say What They Think

Arushi Somani

11:10 AM - 11:30 AM | Nautilus

Panel Discussion

Brad Knox, Diogo Schwerz de Lucena, Michael Bowling, Arushi Somani

11:30 AM - 12:00 PM | Nautilus

Researcher Spotlight Talks

Saturday, June 7

A Field Test of AI That De-escalates Conflict

Jonathan Stray

1:30 PM - 1:50 PM | Merrill Hall

The Agentic Turn: Philosophical Reflections on AI Agents, Alignment and Impact

Iason Gabriel

1:50 PM - 2:10 PM | Merrill Hall

The AI perspective on consciousness

Joscha Bach

2:10 PM - 2:30 PM | Merrill Hall

A Meta-Game Evaluation Framework for Advanced Interactive AI

Michael P. Wellman

2:30 PM - 2:50 PM | Merrill Hall

Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian

2:50 PM - 3:10 PM | Merrill Hall

Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

Nika Haghtalab

3:10 PM - 3:30 PM | Merrill Hall

Explainability & Interpretability Session

Sunday, June 8 | 👑 David Bau

Areas of consensus and disagreement in interpretability research

Chris Potts

10:30 AM - 10:50 AM | Merrill Hall

Scaling AI Understanding with an Automated Interpretability Agent

Tamar Rott Shaham

10:50 AM - 11:10 AM | Merrill Hall

Benchmarking Methods for Understanding and Controlling Large Language Models

Atticus Geiger

11:10 AM - 11:30 AM | Merrill Hall

Panel Discussion

David Bau, Chris Potts, Tamar Rott Shaham, Atticus Geiger

11:30 AM - 12:00 PM | Merrill Hall

Adversarial Robustness Session

Sunday, June 8 | 👑 Adam Gleave

Adversarial Robustness of Advanced AI

Adam Gleave

9:30 AM - 10:00 AM | Merrill Hall

Adversarial Challenges for AGI Safety

Xiangyu Qi

10:30 AM - 10:52 AM | Nautilus

Tamper resistance as a key priority for AI safety

Stephen Casper

10:52 AM - 11:14 AM | Nautilus

Scalable robustness at OpenAI

Sam Toyer

11:14 AM - 11:36 AM | Nautilus

Obfuscated Activations Bypass LLM Latent-Space Defenses

Scott Emmons

11:36 AM - 11:58 AM | Nautilus

Booking Meeting Space in Triton Room

Book Triton Room

Surf & Sand Meeting Room

If you’d just like a space with some tables, chairs, a whiteboard, and markers, please feel free to use the Surf & Sand Room (see map in Logistics).

This room is first come, first served. No reservation required.