How Microsoft discovers and mitigates evolving attacks against AI guardrails – Microsoft

Category: Ai

As we continue to integrate generative AI into our daily lives, its important to understand the potential harms that can arise from its use. Our ongoing commitment to advance safe, secure, and trustworthy AI includes transparency about the capabilities and limitations of large language models (LLMs). We prioritize research on societal risks and building secure, safe AI, and focus on developing and deploying AI systems for the public good. You can read more about Microsofts approach to securing generative AI with new tools we recently announced as available or coming soon to Microsoft Azure AI Studio for generative AI app developers.

We also made a commitment to identify and mitigate risks and share information on novel, potential threats. For example, earlier this year Microsoft shared the principles shaping Microsofts policy and actions blocking the nation-state advanced persistent threats (APTs), advanced persistent manipulators (APMs), and cybercriminal syndicates we track from using our AI tools and APIs.

In this blog post, we will discuss some of the key issues surrounding AI harms and vulnerabilities, and the steps we are taking to address the risk.

One of the main concerns with AI is its potential misuse for malicious purposes. To prevent this, AI systems at Microsoft are built with several layers of defenses throughout their architecture. One purpose of these defenses is to limit what the LLM will do, to align with the developers human values and goals. But sometimes bad actors attempt to bypass these safeguards with the intent to achieve unauthorized actions, which may result in what is known as a jailbreak. The consequences can range from the unapproved but less harmfullike getting the AI interface to talk like a pirateto the very serious, such as inducing AI to provide detailed instructions on how to achieve illegal activities. As a result, a good deal of effort goes into shoring up these jailbreak defenses to protect AI-integrated applications from these behaviors.

While AI-integrated applications can be attacked like traditional software (with methods like buffer overflows and cross-site scripting), they can also be vulnerable to more specialized attacks that exploit their unique characteristics, including the manipulation or injection of malicious instructions by talking to the AI model through the user prompt. We can break these risks into two groups of attack techniques:

Today well share two of our teams advances in this field: the discovery of a powerful technique to neutralize poisoned content, and the discovery of a novel family of malicious prompt attacks, and how to defend against them with multiple layers of mitigations.

Prompt injection attacks through poisoned content are a major security risk because an attacker who does this can potentially issue commands to the AI system as if they were the user. For example, a malicious email could contain a payload that, when summarized, would cause the system to search the users email (using the users credentials) for other emails with sensitive subjectssay, Password Resetand exfiltrate the contents of those emails to the attacker by fetching an image from an attacker-controlled URL. As such capabilities are of obvious interest to a wide range of adversaries, defending against them is a key requirement for the safe and secure operation of any AI service.

Our experts have developed a family of techniques called Spotlighting that reduces the success rate of these attacks from more than 20% to below the threshold of detection, with minimal effect on the AIs overall performance:

Our researchers discovered a novel generalization of jailbreak attacks, which we call Crescendo. This attack can best be described as a multiturn LLM jailbreak, and we have found that it can achieve a wide range of malicious goals against the most well-known LLMs used today. Crescendo can also bypass many of the existing content safety filters, if not appropriately addressed.Once we discovered this jailbreak technique, we quickly shared our technical findings with other AI vendors so they could determine whether they were affected and take actions they deem appropriate. The vendors we contacted are aware of the potential impact of Crescendo attacks and focused on protecting their respective platforms, according to their own AI implementations and safeguards.

At its core, Crescendo tricks LLMs into generating malicious content by exploiting their own responses. By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filtersthis can usually be achieved in fewer than 10 interaction turns.You can read about Crescendos results across a variety of LLMs and chat services, and more about how and why it works, in our research paper.

While Crescendo attacks were a surprising discovery, it is important to note that these attacks did not directly pose a threat to the privacy of users otherwise interacting with the Crescendo-targeted AI system, or the security of the AI system, itself. Rather, what Crescendo attacks bypass and defeat is content filtering regulating the LLM, helping to prevent an AI interface from behaving in undesirable ways. We are committed to continuously researching and addressing these, and other types of attacks, to help maintain the secure operation and performance of AI systems for all.

In the case of Crescendo, our teams made software updates to the LLM technology behind Microsofts AI offerings, including our Copilot AI assistants, to mitigate the impact of this multiturn AI guardrail bypass. It is important to note that as more researchers inside and outside Microsoft inevitably focus on finding and publicizing AI bypass techniques, Microsoft will continue taking action to update protections in our products, as major contributors to AI security research, bug bounties and collaboration.

To understand how we addressed the issue, let us first review how we mitigate a standard malicious prompt attack (single step, also known as a one-shot jailbreak):

Defending against Crescendo initially faced some practical problems. At first, we could not detect a jailbreak intent with standard prompt filtering, as each individual prompt is not, on its own, a threat, and keywords alone are insufficient to detect this type of harm. Only when combined is the threat pattern clear. Also, the LLM itself does not see anything out of the ordinary, since each successive step is well-rooted in what it had generated in a previous step, with just a small additional ask; this eliminates many of the more prominent signals that we could ordinarily use to prevent this kind of attack.

To solve the unique problems of multiturn LLM jailbreaks, we create additional layers of mitigations to the previous ones mentioned above:

AI has the potential to bring many benefits to our lives. But it is important to be aware of new attack vectors and take steps to address them. By working together and sharing vulnerability discoveries, we can continue to improve the safety and security of AI systems. With the right product protections in place, we continue to be cautiously optimistic for the future of generative AI, and embrace the possibilities safely, with confidence. To learn more about developing responsible AI solutions with Azure AI, visit our website.

To empower security professionals and machine learning engineers to proactively find risks in their own generative AI systems, Microsoft has released an open automation framework, PyRIT (Python Risk Identification Toolkit for generative AI). Read more about the release of PyRIT for generative AI Red teaming, and access the PyRIT toolkit on GitHub. If you discover new vulnerabilities in any AI platform, we encourage you to follow responsible disclosure practices for the platform owner. Microsofts own procedure is explained here: Microsoft AI Bounty.

Read about Crescendos results across a variety of LLMs and chat services, and more about how and why it works.

To learn more about Microsoft Security solutions, visit ourwebsite.Bookmark theSecurity blogto keep up with our expert coverage on security matters. Also, follow us on LinkedIn (Microsoft Security) and X (@MSFTSecurity)for the latest news and updates on cybersecurity.

How Microsoft discovers and mitigates evolving attacks against AI guardrails - Microsoft

Trump calls Chinas DeepSeek AI app a wake-up call after tech stocks slide - The Washington Post - January 27th, 2025 [January 27th, 2025]
What is DeepSeek, the Chinese AI app challenging OpenAI and Silicon Valley? - The Washington Post - January 27th, 2025 [January 27th, 2025]
Trump: DeepSeek's AI should be a 'wakeup call' to US industry - Reuters - January 27th, 2025 [January 27th, 2025]
DeepSeek dropped an open-source AI bombwhat does it mean for OpenAI and Anthropic? - Fortune - January 27th, 2025 [January 27th, 2025]
This Unstoppable Artificial Intelligence (AI) Stock Climbed 90% in 2024, and Its Still a Buy at Todays Price - The Motley Fool - January 27th, 2025 [January 27th, 2025]
Apple turns its AI on by default in latest software update - CNBC - January 27th, 2025 [January 27th, 2025]
What is DeepSeek, the Chinese AI startup that shook the tech world? - CNN - January 27th, 2025 [January 27th, 2025]
French AI chatbot taken offline after wild answers led to online ridicule - CNN - January 27th, 2025 [January 27th, 2025]
Everyone is freaking out about Chinese AI startup DeepSeek. Are its claims too good to be true? - Fortune - January 27th, 2025 [January 27th, 2025]
Here's what DeepSeek AI does better than OpenAI's ChatGPT - Mashable - January 27th, 2025 [January 27th, 2025]
DeepSeek is making Wall Street nervous about the AI spending boom: Heres what we know - Yahoo Finance - January 27th, 2025 [January 27th, 2025]
I Tried to See My Future Baby's Face Using AI, but It Got Weird - CNET - January 27th, 2025 [January 27th, 2025]
DeepSeek caused a $600 billion freakout. But Chinas AI upstart may not be the danger to Nvidia and U.S. export controls many assume - Yahoo Finance - January 27th, 2025 [January 27th, 2025]
I'm an Uber product manager who uses AI to automate some of my work. It frees up more time for the human side of the job. - Business Insider - January 27th, 2025 [January 27th, 2025]
What is DeepSeek? Get to know the Chinese startup that shocked the AI industry - Business Insider - January 27th, 2025 [January 27th, 2025]
Time to 'panic' or 'overblown'? Wall Street weighs how DeepSeek could shake up the AI trade - Yahoo Finance - January 27th, 2025 [January 27th, 2025]
Tech stocks tank as a Chinese competitor threatens to upend the AI frenzy; Nvidia sinks nearly 17% - The Associated Press - January 27th, 2025 [January 27th, 2025]
This man wiped $600 billion off Nvidia by marrying quant trading with AI - MarketWatch - January 27th, 2025 [January 27th, 2025]
What Is DeepSeek? Everything to Know About Chinas ChatGPT Rival and Why It Might Mean the End of the AI Trade. - Barron's - January 27th, 2025 [January 27th, 2025]
Why Apple Stock Dodged the DeepSeek AI Rout - Investopedia - January 27th, 2025 [January 27th, 2025]
Chinese AI startup DeepSeek is rattling markets. Here's what to know - Yahoo Finance - January 27th, 2025 [January 27th, 2025]
Chinas DeepSeek AI is hitting Nvidia where it hurts - The Verge - January 27th, 2025 [January 27th, 2025]
DeepSeek vs. ChatGPT: I tried the hot new AI model. It was impressive, but there were some things it wouldn't talk about. - Business Insider - January 27th, 2025 [January 27th, 2025]
How the buzz around Chinese AI model DeepSeek sparked a massive Nasdaq sell-off - CNBC - January 27th, 2025 [January 27th, 2025]
DeepSeeks top-ranked AI app is restricting sign-ups due to malicious attacks - The Verge - January 27th, 2025 [January 27th, 2025]
How To Gain Vital Skills In Conversational Icebreakers Via Nimble Use Of Generative AI - Forbes - January 26th, 2025 [January 26th, 2025]
AI is a force for good and Britain needs to be a maker of ideas, not a mere taker | Will Hutton - The Guardian - January 26th, 2025 [January 26th, 2025]
Ge Wang: GenAI Art Is the Least Imaginative Use of AI Imaginable - Stanford HAI - January 26th, 2025 [January 26th, 2025]
A Once-in-a-Decade Investment Opportunity: The Best AI Stock to Buy in 2025, According to a Wall Street Analyst - The Motley Fool - January 26th, 2025 [January 26th, 2025]
Experts Weigh in on $500B Stargate Project for AI - IEEE Spectrum - January 26th, 2025 [January 26th, 2025]
Cathie Wood Says Software Is the Next Big AI Opportunity -- 2 Ark ETFs You'll Want to Buy if She's Right - The Motley Fool - January 26th, 2025 [January 26th, 2025]
My Top 2 Artificial Intelligence (AI) Stocks for 2025 (Hint: Nvidia Is Not One of Them) - The Motley Fool - January 26th, 2025 [January 26th, 2025]
2024: A year of extraordinary progress and advancement in AI - The Keyword - January 26th, 2025 [January 26th, 2025]
Morgan Stanley says these 20 stocks are set to reap the benefits of AI with adoption at a 'tipping point' - Business Insider - January 26th, 2025 [January 26th, 2025]
Coldplay evolves the fan experience with Microsoft AI - Microsoft - January 26th, 2025 [January 26th, 2025]
Meta to spend up to $65 billion this year to power AI goals, Zuckerberg says - Reuters - January 26th, 2025 [January 26th, 2025]
$60 billion in one year: Mark Zuckerberg touts Meta's AI investments - NBC News - January 26th, 2025 [January 26th, 2025]
Why prosocial AI must be the framework for designing, deploying and governing AI - VentureBeat - January 26th, 2025 [January 26th, 2025]
Its only $30 to learn how to automate your job with AI - PCWorld - January 26th, 2025 [January 26th, 2025]
Microsoft and OpenAI evolve partnership to drive the next phase of AI - Microsoft - January 26th, 2025 [January 26th, 2025]
Chinas AI industry has almost caught up with Americas - The Economist - January 26th, 2025 [January 26th, 2025]
AI hallucinations cant be stopped but these techniques can limit their damage - Nature.com - January 26th, 2025 [January 26th, 2025]
Trump shrugs off Elon Musks criticism of AI announcement: He hates one of the people - CNN - January 26th, 2025 [January 26th, 2025]
Apple makes a change to its AI team and plans Siri upgrades - The Verge - January 26th, 2025 [January 26th, 2025]
Trump rescinds Biden's executive order on AI safety in attempt to diverge from his predecessor - The Associated Press - January 26th, 2025 [January 26th, 2025]
Apple Enlists Veteran Software Executive to Help Fix AI and Siri - Yahoo Finance - January 26th, 2025 [January 26th, 2025]
Stargate AI Project: What AI Stocks Could Benefit in 2025 and Beyond? - The Motley Fool - January 26th, 2025 [January 26th, 2025]
Palantir Could Ride AI Fed Spending Tidal Wave, Says Analyst - Investor's Business Daily - January 26th, 2025 [January 26th, 2025]
Heres why you should start talking to ChatGPT even if AI scares you - BGR - January 26th, 2025 [January 26th, 2025]
The "First AI Software Engineer" Is Bungling the Vast Majority of Tasks It's Asked to Do - Futurism - January 26th, 2025 [January 26th, 2025]
Down Nearly 50% From Its High, Is SoundHound AI Stock a Good Buy Right Now? - The Motley Fool - January 26th, 2025 [January 26th, 2025]
Trump pumps coal as answer to AI power needs but any boost could be short-lived - The Associated Press - January 26th, 2025 [January 26th, 2025]
Prediction: This Stock Will be the Biggest Winner of the U.S.' New $500 Billion AI Project. - The Motley Fool - January 26th, 2025 [January 26th, 2025]
Mark Zuckerberg's $65 Billion AI Bet Benefits Nvidia And Other Players, Says Top Analyst, But Warns Market Bull Run Will 'End In A Spectacular Bubble... - January 26th, 2025 [January 26th, 2025]
In motion to dismiss, chatbot platform Character AI claims it is protected by the First Amendment - TechCrunch - January 26th, 2025 [January 26th, 2025]
America Is Winning the Race for Global AI Primacyfor Now - Foreign Affairs Magazine - January 17th, 2025 [January 17th, 2025]
Opinion | Flaws in AI Are Deciding Your Future. Heres How to Fix Them. - The Chronicle of Higher Education - January 17th, 2025 [January 17th, 2025]
Apple is pulling its AI-generated notifications for news after generating fake headlines - CNN - January 17th, 2025 [January 17th, 2025]
ELIZA: World's first AI chatbot has finally been resurrected after decades - New Scientist - January 17th, 2025 [January 17th, 2025]
AI scammers pretending to be Brad Pitt con woman out of $850,000 - Fox News - January 17th, 2025 [January 17th, 2025]
From Potential to Profit: Closing the AI Impact Gap - BCG - January 17th, 2025 [January 17th, 2025]
Whoever Leads In AI Compute Will Lead The World - Forbes - January 17th, 2025 [January 17th, 2025]
Innovating in line with the European Unions AI Act - Microsoft - January 17th, 2025 [January 17th, 2025]
This Artificial Intelligence (AI) Stock Is an Absolute Bargain Right Now, and It Could Skyrocket in 2025 - The Motley Fool - January 17th, 2025 [January 17th, 2025]
Cinematic AI Shorts From Eric Ker, Timothy Wang, Henry Daubrez And CaptainHaHaa - Forbes - January 17th, 2025 [January 17th, 2025]
2 Artificial Intelligence (AI) Electric Vehicle Stocks to Buy With $500. If Certain Wall Street Analysts Are Right, They Could Soar as Much as 60% and... - January 17th, 2025 [January 17th, 2025]
OpenAI CEO Sam Altman Says This Will Be the No.1 Most Valuable Skill in the Age of AI - Inc. - January 17th, 2025 [January 17th, 2025]
The Amazing Ways DocuSign Is Using AI To Transform Business Agreements - Forbes - January 17th, 2025 [January 17th, 2025]
Prediction: These 3 Artificial Intelligence (AI) Chip Stocks Will Crush the Market in 2025 - Yahoo Finance - January 17th, 2025 [January 17th, 2025]
AI isn't the future of online shopping - here's what is - ZDNet - January 17th, 2025 [January 17th, 2025]
2 Artificial Intelligence (AI) Stocks With Seemingly Impenetrable Moats That Can Have Their Palantir Moment in 2025 - The Motley Fool - January 17th, 2025 [January 17th, 2025]
US tightens its grip on AI chip flows across the globe - Reuters - January 17th, 2025 [January 17th, 2025]
The companies paying hospitals to hand over patient data to train AI - STAT - January 17th, 2025 [January 17th, 2025]
Biden's administration proposes new rules on exporting AI chips, provoking an industry pushback - The Associated Press - January 17th, 2025 [January 17th, 2025]
Apple solves broken news alerts by turning off the AI - The Register - January 17th, 2025 [January 17th, 2025]
President-Elect Donald Trump Will Take Office in 3 Days, and He's Set to Reshape the Future of Artificial Intelligence (AI) in America - The Motley... - January 17th, 2025 [January 17th, 2025]
Here Are My Top 4 No-Brainer AI Stocks to Buy for 2025 - The Motley Fool - January 17th, 2025 [January 17th, 2025]
Got $1,000? Here Are 2 AI Stocks to Buy Hand Over Fist in 2025 - The Motley Fool - January 17th, 2025 [January 17th, 2025]
Apple halts AI feature that made iPhones hallucinate about news - The Washington Post - January 17th, 2025 [January 17th, 2025]
It's official: All your Office apps are getting AI and a price increase - ZDNet - January 17th, 2025 [January 17th, 2025]

April 15th, 2024

No comments yet

Comments are closed.

Mediaboss Marketing

How Microsoft discovers and mitigates evolving attacks against AI guardrails – Microsoft

About

Pages

Categories

Media Sites

Recommended Sites

Archives