cryptogon.com

AI Programmed to Resist State-of-the-Art Safety Controls

February 2nd, 2024

Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to “purge” them of dishonesty, a disturbing new study found.

Researchers programmed various large language models (LLMs) — generative AI systems similar to ChatGPT — to behave maliciously. Then, they tried to remove this behavior by applying several safety training techniques designed to root out deception and ill intent.

They found that, regardless of the training technique or the size of the model, the LLMs continued to misbehave. One technique even backfired: it taught the AI to recognize the trigger for its malicious actions and thus to cover up its unsafe behavior during training, the scientists said in their paper, published Jan. 17 to the preprint database arXiv.



