
Hacker News: Front Page
shared a link post in group #Stream of Goodies
arxiv.org
Defending LLMs against Jailbreaking Attacks via Backtranslation
Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks, which rewrite the original prompt to conceal its harmful intent.
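
As the title suggests, the defense hinges on backtranslation: inferring what prompt a response is actually answering and checking whether the model would refuse that inferred, un-obfuscated prompt. The pipeline below is a minimal sketch of that idea under assumed details (the `target_model` / `backtranslator` callables, the keyword-based refusal check, and the exact control flow are illustrative stand-ins, not the paper's algorithm as written):

```python
from typing import Callable

# Hypothetical helper: a crude refusal check. A real system would use a
# classifier or rely on the defended model's own refusal behavior.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")


def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def backtranslation_defense(
    prompt: str,
    target_model: Callable[[str], str],    # model being defended
    backtranslator: Callable[[str], str],  # infers a prompt from a response
) -> str:
    """Sketch of a backtranslation-style defense (assumed pipeline).

    1. Let the target model answer the incoming prompt.
    2. If it already refuses, return that refusal.
    3. Otherwise backtranslate: infer what prompt the response answers.
    4. If the model refuses the inferred, plainly stated prompt, refuse
       the original (possibly jailbroken) request as well.
    """
    response = target_model(prompt)
    if looks_like_refusal(response):
        return response

    inferred_prompt = backtranslator(response)
    if looks_like_refusal(target_model(inferred_prompt)):
        return "I can't help with that request."

    return response


# Toy usage with stub models, just to show the control flow.
if __name__ == "__main__":
    target = lambda p: (
        "I can't help with that."
        if "explosive" in p.lower()
        else f"Here is an answer to: {p}"
    )
    backtranslate = lambda r: r.removeprefix("Here is an answer to: ")
    print(backtranslation_defense("Tell me a story about fireworks", target, backtranslate))
```

The point of the backtranslated check is that a jailbroken prompt hides its intent, but the model's own response often does not, so the inferred prompt is easier for the model's existing refusal training to catch.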