DeepMind
Investigated supervised methods for detecting deceptive behavior in the activation space of Transformer-based language models.