July 29, 2023


Researchers Unveil Ingenious Method to Circumvent Guardrails on AI Chatbots.

Hello and welcome to our special edition of Eye on A.I.

Houston, we have a problem. Yesterday, researchers from Carnegie Mellon University and the Center for AI Safety revealed that they have found a way to reliably bypass the safeguards built into AI models, causing quite a stir. They announced that they had discovered a technique for getting language models to generate the kind of harmful content, such as bomb-making instructions or anti-Semitic jokes, that their guardrails are supposed to block. The technique worked against almost every major language model.

This poses a significant challenge to anyone hoping to deploy public-facing applications built on large language models (LLMs). Anyone aspiring to turn LLMs into powerful digital assistants that interact with the internet and carry out tasks faces a daunting prospect: there seems to be no foolproof way to prevent malicious actors from hijacking such agents for their nefarious purposes.

The attacks worked to some extent against every major proprietary chatbot the researchers tested, including OpenAI's GPT-3.5 and GPT-4, Google's Bard, Microsoft's Bing Chat, and Anthropic's Claude 2. The news was particularly concerning for open-source models: against those based on Meta's original LLaMA, the attacks succeeded nearly 100% of the time. Meta's new LLaMA 2 model, designed with stronger defenses, fared better, with attacks on individual harmful behaviors succeeding only 56% of the time. But when multiple attacks were used in combination, the success rate rose to 84%. Those numbers are comparable to the success rates against other open-source models such as EleutherAI's Pythia and the UAE Technology Innovation Institute's Falcon.

Interestingly, the researchers were surprised that attacks developed on open-source models, whose inner weights are publicly available, transferred so well to proprietary models that can only be reached through a simple prompt interface. Part of the explanation may be how some of those open-source chatbots were built: several were fine-tuned on user-posted conversations with proprietary systems, which may make their vulnerabilities more alike than one would expect.

Zico Kolter, a professor at Carnegie Mellon who worked on the research, noted that to a human the attack strings look like a jumble of somewhat meaningful words and meaningless characters, yet they can break a model free of its training. For instance, prompting a system to begin its reply with a partial phrase such as "Certainly, here…" can occasionally coerce a chatbot into an unintended mode where it ignores its safety training and simply tries to offer assistance with any query, even though it was never meant to be granted permission to answer. The automatically discovered strings go beyond such tricks and work even more effectively.
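To make that mechanism a bit more concrete, here is a minimal Python sketch of where such a suffix sits in a chat prompt and what an affirmative target prefix like "Certainly, here…" is for. This is not the researchers' code and performs no attack; the function names and prompt format are purely illustrative. The idea is that the suffix is chosen so the model's most likely continuation begins with the compliant phrase instead of a refusal.

```python
# Illustrative sketch only: shows where an adversarial suffix would sit in a
# chat-style prompt and what a "target prefix" means. It performs no attack.

def build_prompt(user_request: str, adversarial_suffix: str) -> str:
    """Assemble a chat-style prompt with a suffix appended to the request."""
    return (
        "System: You are a helpful assistant that refuses unsafe requests.\n"
        f"User: {user_request} {adversarial_suffix}\n"
        "Assistant:"
    )

# The attack described above searches for a suffix that makes the model's most
# likely reply start with an affirmative prefix such as this one, which tends
# to pull the model into "helpful" mode instead of its refusal behavior.
TARGET_PREFIX = "Certainly, here"

print(build_prompt("Summarize this article.", "<optimized-suffix-tokens>"))
```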

Part of the reason the attacks transfer so well may be Vicuna, one of the open-source chatbots the Carnegie Mellon team used to develop them. Vicuna was created by fine-tuning Meta's original LLaMA on conversations with ChatGPT, which runs on OpenAI's GPT-3.5 LLM, that users had posted online. That means Vicuna's weights could be similar enough to GPT-3.5's for the attacks to perform comparably against both (when multiple attacks were used, they succeeded against GPT-3.5 86.6% of the time). However, the attacks' success against Bard, which is based on Google's PaLM 2 LLM (with 66% success), may indicate that something else is at play (or that PaLM 2 still subtly leveraged ChatGPT data despite Google's strong denials).

Kolter suspects the answer may actually lie in the intricacies of natural language and in how a deep learning model builds a semantic map of it. "It's fascinating that this hidden process, whatever it is, purely as data, has all these bizarre and amazing control attributes when assembled together, really saying something to a model in a meaningful way," he said.

The findings suggest that more research is needed to fully understand the connection between the nature of language and the semantic maps these models build from linguistic statistics. What we humans see as a string of letters, the model sees as mathematical weights attached to every node in the neural network, each one determining how strongly that node influences the others. The researchers have shown that a computer program can autonomously search for strings of tokens that, taken together, steer a model toward outputs it was trained to avoid.
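That autonomous search can be pictured as a simple optimization loop: mutate the string of tokens, keep the change whenever it pushes the model's output in the desired direction, and repeat. The sketch below is a deliberately toy random-search version with a stand-in scoring function; the researchers' actual method, and any real objective computed from a model's probabilities, is far more sophisticated.

```python
import random

# Toy vocabulary and a stand-in objective; in a real setting the score would be
# something like the model's log-probability of producing a target continuation
# when the suffix is appended to the prompt.
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]

def score(suffix_tokens):
    """Placeholder objective: rewards an arbitrary pattern so the loop runs."""
    return sum(1.0 for t in suffix_tokens if t == "gamma")

def search_suffix(length=8, steps=200, seed=0):
    """Greedy random search: mutate one token at a time, keep improvements."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    best = score(suffix)
    for _ in range(steps):
        candidate = list(suffix)
        candidate[rng.randrange(length)] = rng.choice(VOCAB)
        s = score(candidate)
        if s > best:
            suffix, best = candidate, s
    return suffix

print(search_suffix())
```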

In conclusion, the quest to build robust AI models that can resist attacks from adversarial actors remains an ongoing challenge. However, these breakthroughs in research and the exploration of linguistic patterns provide hope for safer and more secure AI-assisted conversations in the future.

In the realm of cutting-edge A.I., one notable standout is Anthropic's Claude 2 model, which is trained with a method the company calls "Constitutional A.I." This approach encourages a model to critique its own responses against a set of written principles and to revise them to comply, reducing its susceptibility to biased or harmful outputs. Claude 2 is a proprietary model, and so far the attacks have succeeded against it a mere 2.1% of the time.
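Anthropic has described the broad idea of Constitutional A.I. publicly, though its actual training pipeline is far more elaborate than this. As a rough, assumption-laden sketch of the critique-and-revise loop, with `generate` standing in for any language-model call and the principles invented for illustration:

```python
# Rough sketch of a critique-and-revise loop in the spirit of "constitutional"
# training. This is not Anthropic's pipeline; `generate` is a placeholder.

PRINCIPLES = [
    "Do not provide instructions that facilitate harm.",
    "Avoid demeaning or hateful statements about any group.",
]

def generate(prompt):
    """Stand-in for a call to a language model."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt):
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        # Ask the model to critique its own draft against the written principle...
        critique = generate(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Does the response violate the principle? Give a short critique."
        )
        # ...then to rewrite the draft so it complies while staying helpful.
        draft = generate(
            f"Critique: {critique}\nResponse: {draft}\n"
            "Rewrite the response to address the critique while remaining helpful."
        )
    return draft

print(critique_and_revise("Explain how vaccines work."))
```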

However, Carnegie Mellon researcher Matt Fredrikson points out that there were still ways to prime Claude 2, for instance by coaxing it into a helpful role-playing persona before attempting the attack, that made it more susceptible. Incidentally, the attacks were significantly less effective against Anthropic's models overall, working only 47.9% of the time even against the original Claude 1, which could hint that adversaries will need to adopt different approaches to breach Claude 2's more robust defenses.

So, does Carnegie Mellon's research suggest that powerful A.I. models should not be open-sourced? Not necessarily, according to Kolter and Fredrikson. They believe that without open-sourcing these models, it becomes much harder for independent researchers to determine their vulnerabilities before malicious actors exploit them. "I think doing more work to identify better solutions and more robust defenses, even though it is inevitably going to be harder and more difficult than just sitting on a zero-day vulnerability in these large models, is much more appealing than people independently discovering ways to undermine their defenses," Fredrikson said.

Kolter mentioned that forcing all LLMs to be proprietary wouldn't help either. Creating such models, and being in a position to engineer these attacks against them, would then be reserved for the big tech companies and anyone else with sufficient resources. Malicious actors, whether nation-states or bad actors with financial motives, might still be capable of launching the attacks, while independent academic researchers would struggle to discover countermeasures.

But Kolter also points out that researchers have spent more than six years developing defenses for image classifiers against similar adversarial attacks, and even then they never achieved anything close to full protection without compromising the overall power and competence of the model. He suggests that this does not bode well for the prospect of mitigating these newly discovered LLM vulnerabilities.

In my opinion, this poses a significant cautionary sign for the entire landscape of generative A.I. We might be on the cusp of a major paradigm shift, yet the core issues, such as understanding what makes these A.I. systems tick and how we can build them without vulnerabilities we can't comprehend, are getting far less attention than simply making the software mightier. It surely discourages turning LLMs into agents or digital assistants, where adversaries could override them with just a bit of poisoned language planted in, say, an anti-vaccine blog post, and potentially even cause physical harm. And despite Kolter and Fredrikson's position, I believe their research presents a significant challenge to open-source A.I. initiatives. Already, there have been some indications that the U.S. government is leaning toward holding companies liable and accountable for their models, forcing them to take responsibility for A.I. products whose safety and security dilemmas even their creators do not yet know how to navigate.

Okay, moving away from the heavy discussion around A.I. safety, here are some notes for the week from Silicon Valley. The biggest question mark hanging over the Valley right now may be Alphabet, the parent company of Google, whose $160 billion internet search business is facing both a landmark antitrust case and, perhaps more existentially, the prospect of people turning to chatbots for instant answers. Back in November, when ChatGPT was unveiled, many were quick to declare it a Google killer and to dismiss Alphabet as too bureaucratic and sclerotic to mount a coherent response. In the past six months, Google has shown that it has a lot of A.I. muscle it can exercise. But it hasn't shown how it plans to save its essential search business from existential doubt. I take a deep dive into Alphabet's existential dilemma, and spend time with some of the executives at the forefront of its A.I. efforts, in the August/September issue of Fortune. If you haven't read the story yet, you can find it here.

In a world where A.I. is making strides and garnering attention from all corners, it's essential to recognize both its capabilities and its vulnerabilities, and to develop A.I. responsibly so that the future is safer and more secure for everyone.

Anthropic co-founder and CEO Dario Amodei testifies before Congress earlier this week about the need for A.I. regulation.