How good are humans at detecting AI-generated images? Learnings from an experiment
Thomas Roca, Anthony Cintron Roman, Jehú Torres Vega, Marcelo Duarte, Pengce Wang, Kevin White, Amit Misra, Juan Lavista Ferre
A large-scale study of 12,500 participants conducted by researchers at Microsoft AI found that humans are, on average, depressingly bad at spotting AI-generated images, with a 62% success rate that is not much better than a random guess. The experiment was deployed through the Real Or Not? quiz, which has users rate 15 images randomly drawn from a pool of 350 ‘real’ images and 700 fakes (diffusion-generated images and GAN-generated faces).

Security Benefits and Side Effects of Labeling AI-Generated Images
Sandra Höltervennhoff, Jonas Ricker, Maike M. Raphael, Charlotte Schwedes, Rebecca Weil, Asja Fischer, Thorsten Holz, Lea Schönherr, Sascha Fahl
Really appreciate this practical preprint on AI labels from a group of researchers at the CISPA Helmholtz Center for Information Security, Leibniz University Hannover, and Ruhr University Bochum. In a survey of more than 1,300 respondents across the EU and US, more than 75% said they want to see labels on AI-generated content in their social media feeds. But when asked to rate posts carrying the labels, respondents showed little improvement in discerning accuracy. Instead, the “AI-generated” labels slightly increased the likelihood that people believed the content was false — including when it was true. The labels also slightly increased the perceived accuracy of content labeled as human-generated. This suggests more nuanced labels are better (duh) and that “AI” is more likely than not to be treated as a proxy for “false.”

Characterizing AI-Generated Misinformation on Social Media
Chiara Drolsbach, Nicolas Pröllochs
In this preprint, Chiara Drolsbach and Nicolas Pröllochs looked at the prevalence and characteristics of AI-generated misinformation in two years’ worth of Community Notes on X. They found that 5% of the approximately 90,000 misleading tweets in the study period contained AI-generated content. These tweets were more likely than the non-AI ones to go viral and to discuss less serious topics such as entertainment.

High-quality deepfakes have a heart!
Clemens Seibold, Eric L. Wisotzky, Arian Beckmann, Benjamin Kossack, Anna Hilsmann, Peter Eisert
This paper in Frontiers in Imaging concludes that the previous belief that a deepfake video could be detected by monitoring for absent heartbeats is “no longer valid for current deepfake methods.” One of the paper’s co-authors told BBC Science Focus that “deepfakes will get so good that they’ll be hard to detect unless we focus more on technology that proves something hasn’t been altered, rather than detecting if something is fake.”

SoK: Systematization and Benchmarking of Deepfake Detectors in a Unified Framework
Binh M. Le, Jiwon Kim, Simon S. Woo, Kristen Moore, Alsharif Abuadbba, Shahroz Tariq
A preprint by researchers based in Australia and South Korea found that deepfake detection remains imperfect. While the sixteen detectors tested performed passably on well-known facial-deepfake datasets, the best of them scored only 69% on real-world data.
This echoes a seminal 2024 literature review (see FU#6) that found that minimal edits such as cropping and resizing short-circuited many of the detection techniques in the literature. My experience with commercial detectors has also been that they promise precision rates that they fail to achieve.
If all this wasn't enough, apparently generative AI is getting really good at stripping watermarks off images anyway!

"Better Be Computer or I’m Dumb": A Large-Scale Evaluation of Humans as Audio Deepfake Detectors
Kevin Warren, Tyler Tucker, Anna Crowder, Daniel Olszewski, Allison Lu, Caroline Fedele, Magdalena Pasternak, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor
This study had 1,200 users listen to twenty audio clips and report whether they thought each clip was synthetic, how confident they were, and what they based their decision on. While a few individuals correctly identified 100% of the deepfakes they were exposed to, the average response was less impressive: mean accuracy was as low as 65% on audio samples from the WaveFake dataset, 71% on ASVspoof2021, and 81% on FakeAVCeleb.

People are poorly equipped to detect AI-powered voice clones
Sarah Barrington, Emily A. Cooper, Hany Farid
Humans don’t appear all that well equipped to detect faked voices.
UC Berkeley researchers used ElevenLabs’s Instant Voice Cloning API to clone 220 speakers. They then had survey respondents judge whether two clips came from the same person and whether either was deepfaked. In almost 80% of cases, a real voice and its clone were deemed to be from the same speaker (pairs of real clips of the same voice were correctly matched at a slightly higher rate). Slicing the data another way, the researchers found that respondents correctly flagged audio as synthetic only ~60% of the time. That is not much better than flipping a coin.

People are skeptical of headlines labeled as AI-generated, even if true or human-made, because they assume full AI automation
Sacha Altay and Fabrizio Gilardi
In a study in PNAS Nexus, two political scientists at the University of Zurich concluded that “labeling headlines as AI-generated reduced the perceived accuracy of the headlines and participants’ intention to share them, regardless of the headlines’ veracity (true vs. false) or origin (human- vs. AI-generated).” Still, the effect was relatively small: a 2.66 percentage point decrease for a “generated by AI” label, compared with a 9.33 percentage point decrease for content labeled as “false.”

AI-Generated Faces in the Real World: A Large-Scale Case Study of Twitter Profile Images
Jonas Ricker, Dennis Assenmacher, Thorsten Holz, Asja Fischer and Erwin Quiring
Holz, Quiring, et al. tried to quantify the reach of AI accounts on Twitter by selecting a random 1 percent of all public posts in one week in March 2023 and seeing how many associated accounts used deepfaked profile pictures. Their estimate is 7,723 accounts, or 0.05% of the total.
While “fake-image accounts” did not post more than real-image accounts, they were on average far newer and were primarily focused on large-scale spamming attacks. In that limited time frame of study, most content published was in English, Turkish and Arabic and the principal topics were politics and finance.

A real-world test of artificial intelligence infiltration of a university examinations system: A “Turing Test” case study
Peter Scarfe, Kelly Watcham, Alasdair Clarke, and Etienne Roesch
Psychology researchers at the universities of Reading and Essex tested the proposition that generative AI can be reliably used to cheat in university exams. The answer is a resounding Yes.
The researchers used GPT-4 to produce 63 submissions for the take-home exams of five different classes in Reading’s undergraduate psychology department. They did not touch the AI output other than to remove reference sections and to regenerate responses that were identical to another submission.
Only four of the 63 exams got flagged as suspicious during grading, and only half of those were explicitly called out as possibly AI-generated. The kicker is that the average AI submission also got a better grade than the average human.
The study concludes that “from a perspective of academic integrity, 100% AI written exam submissions being virtually undetectable is extremely concerning.”

Delving into ChatGPT usage in academic writing through excess vocabulary
Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
A group of machine learning researchers claim in a preprint that as many as 10% of the abstracts published on PubMed in 2024 were “processed with LLMs” based on the excess usage of certain words like “delves.” Seems significant, important — even crucial!
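The method behind this estimate is, at heart, simple frequency accounting: compare how often marker words appear in a year's abstracts against an expectation extrapolated from pre-LLM years, and treat the excess as a lower bound on LLM involvement. Here is a minimal sketch of that idea in Python (every word and number below is invented for illustration, not the paper's data):

```python
# Toy version of the excess-vocabulary idea: estimate how many abstracts
# were "touched" by an LLM from the excess frequency of marker words.
# All rates below are invented for illustration.

def excess_frequency(observed_rate, expected_rate):
    """Excess usage of a marker word beyond its pre-LLM trend."""
    return observed_rate - expected_rate

# Fraction of abstracts containing each marker word:
# expected = extrapolated from pre-LLM years, observed = 2024.
markers = {
    "delves":      {"expected": 0.001, "observed": 0.015},
    "showcasing":  {"expected": 0.002, "observed": 0.010},
    "underscores": {"expected": 0.004, "observed": 0.012},
}

for word, rates in markers.items():
    excess = excess_frequency(rates["observed"], rates["expected"])
    print(f"{word}: excess rate = {excess:.3f}")

# A lower bound on the share of LLM-processed abstracts: at least as many
# abstracts as the largest single-word excess must have been touched.
lower_bound = max(excess_frequency(r["observed"], r["expected"])
                  for r in markers.values())
print(f"lower bound on LLM-processed share: {lower_bound:.1%}")
```

The appeal of the approach is that it needs no per-document classifier, only corpus-level word counts; the catch is that it can only bound the aggregate share, not flag individual abstracts.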

Detecting hallucinations in large language models using semantic entropy
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal
In Nature, computer scientists at the University of Oxford presented the results of their effort to detect LLM hallucinations with LLMs (see also the WaPo write-up). In a skeptical riposte, RMIT computer scientist Karin Verspoor warns that this approach could backfire “by layering multiple systems that are prone to hallucinations and unpredictable errors.”

If I understand the figure below correctly, the accuracy of this method, billed as “semantic entropy,” is still only about 80%.
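For context, the recipe in the paper is: sample several answers to the same question, cluster answers that mean the same thing, and compute the entropy of the cluster distribution; high entropy signals a likely confabulation. The toy sketch below clusters by normalized string equality instead of the paper's bidirectional-entailment model, which is the hard part this simplification waves away:

```python
import math
from collections import Counter

def semantic_entropy(answers, normalize):
    """Entropy over clusters of semantically equivalent answers.

    `normalize` is a stand-in for the paper's entailment-based
    clustering: answers mapping to the same normalized string are
    treated as expressing the same meaning.
    """
    clusters = Counter(normalize(a) for a in answers)
    n = len(answers)
    # Sum of p * log2(1/p) over clusters; 0 when all answers agree.
    return sum((c / n) * math.log2(n / c) for c in clusters.values())

norm = lambda s: s.lower().rstrip(".")

# Consistent samples -> entropy 0: the model is probably confident.
consistent = ["Paris", "paris.", "Paris"]
# Scattered samples -> high entropy: likely confabulation.
scattered = ["Paris", "Lyon", "Marseille"]

print(semantic_entropy(consistent, norm))  # 0.0
print(semantic_entropy(scattered, norm))  # log2(3) ≈ 1.585
```

Verspoor's objection lands exactly on the `normalize` step: in the real method that clustering is done by another LLM, so the detector inherits the failure modes of the thing it is checking.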

Careless Whisper: Speech-to-Text Hallucination Harms
Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, Mona Sloane
In a great paper presented at FAccT, Allison Koenecke and colleagues tested Whisper, OpenAI’s transcription service, on 13,140 audio snippets. They found that in 187 cases (~1.4%), Whisper consistently transcribed things the speakers never said.
More worryingly, one third of these hallucinations were not innocuous substitutions of homophones but truly wild additions that could have a material consequence if taken at face value. See for yourself:

Also concerning was the fact that Whisper performed markedly worse on speakers with aphasia, a language disorder, than with those in the control group.

People cannot distinguish GPT-4 from a human in a Turing test
Cameron R. Jones and Benjamin K. Bergen
In this preprint by two cognitive scientists at UC San Diego, 500 participants spent five minutes texting with either a human or one of three AI systems, through an interface that concealed who was on the other side. 54% of the respondents assigned to GPT-4 thought they were chatting with a human, not much lower than the share of respondents who rated the actual human as human (67%).

Synthetic Image Verification in the Era of Generative Artificial Intelligence: What Works and What Isn’t There Yet
Diangarti Tariang, Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, Luisa Verdoliva
This paper claims post-processing typical of image sharing — such as cropping, resizing and compression — can have a strong impact on detector accuracy. Compare the “without PP” results and the “with PP” results to get a sense of how significant this impact can be.

More hopefully, the paper finds that low-level forensic artifacts can still be used as artificial fingerprints of a particular model. Look for instance below, from left to right, at the spectral analysis of images generated by Latent Diffusion, Stable Diffusion, Midjourney v5, DALL·E Mini, DALL·E 2 and DALL·E 3.
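These fingerprints exist because generator upsampling layers leave periodic traces that show up as structure in the frequency domain. The pure-Python toy below (a 1-D "scanline" and a naive DFT, not a real image pipeline) shows one such trace: nearest-neighbor 2x upsampling duplicates adjacent samples, which exactly cancels the Nyquist-frequency component of the spectrum, a null that a natural signal would not have:

```python
import cmath
import random

def dft_magnitudes(signal):
    """Magnitude spectrum via a naive discrete Fourier transform."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

random.seed(0)

# A "natural" scanline: 64 random samples.
natural = [random.random() for _ in range(64)]

# A "generated" scanline: 32 samples upsampled 2x by nearest neighbor,
# mimicking the upsampling stage of an image generator.
low_res = [random.random() for _ in range(32)]
upsampled = [v for v in low_res for _ in range(2)]  # each sample repeated

nat_spec = dft_magnitudes(natural)
gen_spec = dft_magnitudes(upsampled)

# At the Nyquist bin (k = n/2) the DFT weights samples by (-1)^t, so
# each duplicated pair contributes v - v = 0: a spectral null.
print("natural   @ Nyquist:", round(nat_spec[32], 3))
print("upsampled @ Nyquist:", round(gen_spec[32], 3))  # ~0.0
```

Real detectors do this in 2-D on image residuals, where different generators' upsampling and attention patterns produce the visibly distinct spectra the paper shows.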

Co-Writing with Opinionated Language Models Affects Users’ Views
Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, Mor Naaman
In this worrying study on the possible persuasiveness of AI assistants, participants were asked to answer the question "Is social media good for society?" Those given a writing assistant primed to be pro-social-media were 2x more likely to answer affirmatively than the control group. The exercise appears to have shifted participants’ reported opinions in the same direction, too.
