Mathematical theory of AI reasoning

Name of grant recipient

Frederik Hytting Jørgensen

Title

PhD student

Institution

Stanford University

Amount

DKK 2,678,330

Year

2026

Grant type

Internationalisation Fellowships

What?

This project develops a mathematical theory for verifying and controlling AI reasoning. We provide provable guarantees for when an interpretation of a model's internal reasoning is valid, in the sense that it helps us predict how the system will behave in novel situations. We validate our methods on large language models.

Why?

Just as we may not be able to tell whether someone avoids stealing out of ethical commitment or out of fear of getting caught, we cannot always tell whether an AI system is truly safe or merely appears safe in testing. Knowing what reasoning an AI relies on could increase our confidence that it will remain safe and reliable beyond the specific situations it was tested on.

How?

Current methods can make a neural network appear to follow essentially any reasoning if we allow overly complex interpretations. We address this by combining causal abstraction – a method for testing whether a neural network follows a particular reasoning process – with ideas from statistical learning theory that constrain interpretation complexity to provide generalization guarantees.
