Mathematical theory of AI reasoning
Name of grant holder
Frederik Hytting Jørgensen
Title
PhD student
Institution
Stanford University
Amount
DKK 2,678,330
Year
2026
Grant type
Internationalisation Fellowships
What?
This project develops a mathematical theory for verifying and controlling AI reasoning. We provide provable guarantees on when an interpretation of a model's internal reasoning is valid, in the sense that it helps us predict how the system will behave in novel situations. We validate our methods on large language models.
Why?
Just as we may not be able to tell whether someone refrains from stealing out of ethical commitment or out of fear of getting caught, we cannot always tell whether an AI system is truly safe or merely appears safe in testing. Knowing what reasoning an AI relies on could increase our confidence that it will remain safe and reliable beyond the specific situations it was tested on.
How?
Current methods can make a neural network appear to follow essentially any reasoning if we allow overly complex interpretations. We address this by combining causal abstraction – a method for testing whether a neural network follows a particular reasoning process – with ideas from statistical learning theory that constrain interpretation complexity to provide generalization guarantees.
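To make the idea of testing a hypothesised reasoning process concrete, here is a minimal sketch of an interchange intervention, one common way causal abstraction is tested. It is an illustration only, not the project's method: the toy "network", the function names, and the choice of task (adding three numbers via an intermediate sum) are all hypothetical assumptions made for this example.

```python
from itertools import product


def high_level(a, b, c, s_override=None):
    """Hypothesised reasoning: first compute S = a + b, then output S + c.

    s_override lets us intervene on the intermediate variable S.
    """
    s = a + b if s_override is None else s_override
    return s + c


def low_level_hidden(a, b, c):
    """Hypothetical stand-in for a network's hidden layer (two units)."""
    return [a + b, c]


def low_level_readout(hidden):
    """Hypothetical stand-in for the network's output layer."""
    return hidden[0] + hidden[1]


def interchange_accuracy(inputs, unit):
    """Fraction of (base, source) pairs on which patching agrees.

    For each pair, the chosen hidden unit in the base run is replaced by
    its value from the source run. The result is compared with the
    high-level model in which S is set to the source's value of a + b.
    """
    agree = 0
    total = 0
    for base, source in product(inputs, repeat=2):
        hidden = low_level_hidden(*base)
        hidden[unit] = low_level_hidden(*source)[unit]
        low_out = low_level_readout(hidden)

        s_source = source[0] + source[1]
        high_out = high_level(*base, s_override=s_source)

        agree += int(low_out == high_out)
        total += 1
    return agree / total


# All inputs (a, b, c) with each value in {0, 1, 2}.
inputs = list(product(range(3), repeat=3))

# Claim "hidden unit 0 encodes S": holds on every pair in this toy setup.
accuracy_aligned = interchange_accuracy(inputs, unit=0)

# Claim "hidden unit 1 encodes S": fails on some pairs.
accuracy_misaligned = interchange_accuracy(inputs, unit=1)
```

In this toy setup the first alignment passes every intervention and the second does not, which is the kind of evidence used to accept or reject an interpretation. The sketch leaves out the part the project emphasises: restricting how complex the mapping between hidden units and high-level variables may be, so that agreement on tested inputs says something about untested ones.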