
Hacker News: Front Page
shared a link post in group #Stream of Goodies
arxiv.org
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity.