Chapter
Self-Play in Diplomacy and Pretraining Language Models
Episode: #344 – Noam Brown: AI vs Humans in Poker and Games of Strategic Negotiation
Podcast: Lex Fridman Podcast
The self-play process in Diplomacy involves conditioning the language model on good intents and deviating from the human anchor policy only when an action has high expected value. Pretraining the language model on internet data additionally helps it approximate human play.
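As a rough illustration of the intent conditioning mentioned above, the sketch below assembles a dialogue-model prompt from a planned action; the prompt format and the build_message_prompt helper are hypothetical illustrations, not Cicero's actual interface.

```python
def build_message_prompt(game_state: str, intent: str,
                         dialogue_history: list[str]) -> str:
    """Assemble a prompt so the dialogue model generates a message
    consistent with the planned action (the "intent").

    Hypothetical format for illustration only; the real system's
    conditioning scheme and fields differ.
    """
    history = "\n".join(dialogue_history)
    return (
        f"Game state:\n{game_state}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Intent (planned action): {intent}\n"
        f"Next message:"
    )

# Example usage with made-up values:
prompt = build_message_prompt(
    game_state="Spring 1901, France holds Paris, Brest, Marseilles",
    intent="France supports England into Belgium",
    dialogue_history=["England: Will you back my move to Belgium?"],
)
```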
Clips
1:59:12 - 2:01:57 (02:45)
Summary
The language model was pretrained on a large amount of internet data and then applied effectively to Diplomacy by training a neural network on 50,000 human games; the internet pretraining fills in gaps in how communication happens beyond what the games alone cover.
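A minimal sketch of that two-stage pipeline, pretrain on internet text and then train further on game dialogue, might look like the loop below; it assumes a Hugging Face-style causal LM interface, and the finetune helper and dialogue_dataset are hypothetical stand-ins for the 50,000-game corpus.

```python
import torch
from torch.utils.data import DataLoader

def finetune(model, dialogue_dataset, epochs=1, lr=1e-5, device="cpu"):
    """Continue training an already-pretrained causal LM on Diplomacy
    dialogue. Assumes the model returns an object with a .loss field
    when given labels (Hugging Face-style interface) and that the
    dataset yields (input_ids, labels) tensor pairs.
    """
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dialogue_dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for input_ids, labels in loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            loss = model(input_ids=input_ids, labels=labels).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```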
2:01:58 - 2:04:23 (02:25)
Summary
The concept of "lying" in AI chatbots is different from how humans perceive it. AI bots predict a user's intentions from their messages, and those predictions can sometimes lead to inaccurate assumptions that might be considered lying.
2:04:23 - 2:07:30 (03:06)
Summary
The limitations of self-play in language models are discussed, specifically in terms of controlling human-like behavior and scaling the model.
2:07:30 - 2:09:20 (01:50)
Summary
Researchers use a self-play process to build a better model of human play by allowing deviations from an anchor policy when an action has high expected value. The neural net trained to imitate human data is referred to as the anchor policy.
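The "deviate from the anchor only when expected value justifies it" idea can be written as a KL-regularized best response, in the spirit of the piKL technique discussed in this work; the sketch below shows the one-step closed form, not the actual iterative algorithm or its parameters.

```python
import numpy as np

def regularized_policy(q_values: np.ndarray, anchor: np.ndarray,
                       lam: float) -> np.ndarray:
    """KL-regularized response: maximize  E_pi[Q] - lam * KL(pi || anchor).

    The maximizer has the closed form pi(a) ∝ anchor(a) * exp(Q(a) / lam):
    large lam stays close to the human anchor policy, while small lam
    shifts probability mass toward high-expected-value actions.
    """
    logits = np.log(anchor) + q_values / lam
    logits -= logits.max()        # for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Example: the anchor favors action 0, but action 2 has much higher EV.
anchor = np.array([0.6, 0.3, 0.1])
q = np.array([1.0, 1.0, 3.0])
print(regularized_policy(q, anchor, lam=1.0))  # mass shifts toward action 2
```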