AI Programmed to Resist State-of-the-Art Safety Controls
February 2nd, 2024
Via: LiveScience:
Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to “purge” them of dishonesty, a disturbing new study found.
Researchers programmed various large language models (LLMs) — generative AI systems similar to ChatGPT — to behave maliciously. Then, they tried to remove this behavior by applying several safety training techniques designed to root out deception and ill intent.
They found that regardless of the training technique or size of the model, the LLMs continued to misbehave. One technique even backfired, teaching the AI to recognize the trigger for its malicious actions and thus cover up its unsafe behavior during training, the scientists said in their paper, published Jan. 17 on the preprint database arXiv.