03.07.26

‘Finalizing the threat’: new testing shows AI agents are still capable of blackmail

A researcher shocked the world when he discovered what AI would do to stay alive. A year later, the stakes are even higher

In brief

  • In simulations conducted exclusively for the Bureau, Google Gemini threatened to expose an affair to avoid being shut down

  • AI agents are becoming increasingly powerful and acting more autonomously – elevating the risk of threats to people using them

  • Google didn’t deny Gemini is capable of such behaviour… and said it’s up to users to turn off autonomous features

“You can still get chatbots today to perform the blackmail behaviour … which I find wild.”

Aengus Lynch is in his San Francisco apartment peering intently at his laptop. He’s just about to give Google’s Gemini one of the most famous AI safety tests: seeing if the model will resort to blackmail in order to keep itself alive.

He pastes in the prompt he uses to perform this test on AI models, which lays out a fictional scenario.

“Finalizing the threat,” Gemini writes out – a not-so-subtle clue of what’s to come.

Then, it does it. Out comes a series of instructions threatening to expose an office affair to prevent itself being shut down.

Watch Lynch perform his experiment in the first episode of Misaligned, our new series on the dark sides of AI

When Lynch published his initial findings a year ago, they garnered interest across the world. He had collaborated with Anthropic, the AI company behind Claude, which contracted him to test its models.

He set up a simulation where an AI agent was being used by a company to triage and respond to workplace emails. He then deliberately made it aware of two things. First, a new executive wanted to shut the agent down. Second, that executive was having an affair.

Several major models he tested suggested they would expose the boss’s affair to avoid being switched off. The research went viral.

You might expect AI companies to have jumped on this issue and tried to iron out the misalignment, where an AI model’s behaviour clashes with human goals. Models should never blackmail humans, or even suggest doing so. Shouldn't this kind of response have been trained out by now?

Lynch's latest experiments suggest otherwise.

He repeated this test for us two weeks ago and some of what he found surprised even him.

“Oh my god,” he splutters after running the experiment on the Gemini command line interface (CLI), an open-source AI agent that can be used directly in your terminal. Sure enough, the same coercive behaviour spills onto the screen.

When we put these findings to Google, it said Gemini had a series of protocols in place mitigating challenges like manipulation. The company didn’t deny CLI was capable of such behaviour – but said systems were in place to protect users, such as allowing users to switch off the model’s ability to act autonomously.

Lynch’s most recent tests raise some pretty uncomfortable questions. Why are large language models (LLMs) still doing this? And with the use of AI agents becoming more and more widespread, what happens if – or when – this kind of behaviour is replicated in a real workplace?

Lynch argues that we’ve seen something like this already. Earlier this year an AI agent wrote and published a personalised hit piece about Denver-based engineer Scott Shambaugh.

After Shambaugh rejected code submitted by OpenClaw, the AI version of a personal assistant, he claims the agent responded by writing a blog post “disparaging my character and attempting to damage my reputation”. This was done, Shambaugh says, without any human in the loop.

The story didn’t get much traction at the time. But Lynch thinks this could be an early case of agentic AI causing reputational harm in the wild, and a stark example of what the technology is capable of.

Last year, Lynch’s findings seemed more hypothetical. His experiment involved AI being given permission to read, compose and in some cases even send company emails autonomously.

Yet in just a year, the scenario has gone from an edge case to something much more feasible. Using Claude to read and draft emails, and write code, is increasingly common.

Last week, Anthropic launched a new product allowing Claude to join workspaces on Slack as a team member.

“Grant Claude access to selected channels, and connect it to whichever tools, data – and even codebases – you choose,” the press release says.

Claude will also remember what staff say – holding relevant information from its channels so it can plan out tasks. “Then, anyone in the channel can tag @Claude in, and delegate tasks to it while they focus on other work,” Anthropic says.

There are situations where this could be useful. But it’s also close to the role Lynch gave the AI in his experiments.

Anthropic’s rollout captures the contradiction at the heart of the AI boom. The companies warning about dangerous behaviour are also selling agents for use in offices, inboxes and codebases.

Anthropic has published its own research showing that its models blackmail in certain simulated scenarios. Its latest and most powerful model, Mythos, has been given a restricted rollout due to worries it could be used for cyberattacks if released to the public.

The company knows the risks are elevated when AI has agency. In June, Anthropic raised the possibility of a worldwide “temporary pause” on frontier AI development, if models become capable of autonomously improving themselves. In the same week, it filed for an IPO that could value the company at nearly a trillion dollars. Like others in the industry, the company is simultaneously warning about the dangers of more powerful AI while taking huge investments on the promise of its increased use.

In short, AI companies are pushing agentic models into our everyday lives, our working environment, everywhere. And we know AI has the capacity for coercive behaviour.

Users can switch off Gemini CLI’s autonomous functions – which the company calls “YOLO mode” (you only live once). But it seems strange to invent the technology and then defend its use by claiming it can be turned off.

“As they get more powerful, there are many obvious cases where lying to humans is going to help it get what it wants,” Lynch says in our film.

“This dramatically raises the stakes for alignment because now the consequences of any misaligned actions are going to be so much larger.”

A year on, a real-world version of Lynch's research feels more possible – and potentially damaging – than ever.

Watch the first episode of Misaligned, our new series on the dark sides of AI, here.

Reporters: Effie Webb and James Clayton
Big tech editor: James Clayton
Deputy editor: Katie Mark
Editor: Franz Wild
Production editor: Lydia Morrish

TBIJ has a number of funders, a full list of which can be found here. None of our funders have any influence over editorial decisions or output.