It’s happening (extreme risks from AI)
Jack Clark’s newsletter (my emphasis):
The things people worry about keep on happening: A few years ago lots of people working in AI safety had abstract concerns that one day sufficiently advanced systems might start to become pathologically sycophantic, or might ‘fake alignment’ to preserve themselves into the future, or might hack their environments to get greater amounts of reward, or might develop persuasive capabilities in excess of humans. All of these once academic concerns have materialized in production systems in the last couple of years.
And:
“As far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” Palisade writes. “While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.”
As with the persuasion example, the story of contemporary AI research is that risks once deemed theoretical - ability to contribute to terrorism, skill at persuasion, faking of alignment, and so on - are showing up in the real systems being deployed into the economy.