What I learned running an adversarial test on an AI text detector

As AI-generated text seeps into every field of human writing, synthetic text detectors are having a moment.

Few appear to be as buzzy as Pangram, which reportedly raised $10 million in seed funding in March. Pangram says it can “detect AI-generated content with 99.98% accuracy.” In the past couple of months, it has been cited as evidence that AI wrote award-winning fiction, New York Times relationship columns, innumerable academic articles, and even the Pope’s tweets.

AI text detection was once essentially useless. A highly cited 2023 paper tested 14 tools (Pangram was not one of them) and concluded that “available detection tools are neither accurate nor reliable.”

A more recent working paper found that Pangram and other commercial tools perform far better, but the detectors remain divisive.

A common criticism is that Pangram is a probabilistic tool being used deterministically. The detector picks up relatively opaque and unknowable patterns in a piece of text that suggest but cannot definitively prove AI use.

As the reputational cost of getting called out for publishing AI-generated text without disclosure increases, so do the incentives to find workarounds that deceive AI text detectors. (Just search for “AI text humanizer” and see how many companies are trying to sell you something.) At a recent security and privacy conference, researchers warned that relatively simple adversarial attacks can consistently trip up detectors.

To test this proposition, I ran my own adversarial audit of Pangram over the past few days. In preliminary testing, I found that the tool misidentified AI-generated text as human if the content rhymed or repeated. Taking advantage of this weakness, I was able to get Pangram to falsely label AI text as human 86% of the time in an adversarial set of 588 text samples. I also found that slightly altering the order of words in short passages could cause Pangram to reverse its initial verdict on a piece of text.
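For readers who want a concrete picture of the two measurements described above, here is a minimal sketch under stated assumptions. It is not the code used in this audit and it does not call Pangram. The function names and the `detect` callable are hypothetical stand-ins for whatever detector is being tested, and the sketch assumes that detector returns the string "ai" or "human".

```python
import random
from typing import Callable, List


def swap_adjacent_words(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Return a copy of `text` with `n_swaps` randomly chosen adjacent word pairs swapped.

    Hypothetical helper: one simple way to "slightly alter the order of words".
    """
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    rng = random.Random(seed)  # seeded so a perturbation can be reproduced
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def evasion_rate(ai_samples: List[str], detect: Callable[[str], str]) -> float:
    """Share of known-AI samples that `detect` labels "human".

    `detect` is an assumed interface, not a real detector API.
    """
    if not ai_samples:
        raise ValueError("need at least one sample")
    misses = sum(1 for sample in ai_samples if detect(sample) == "human")
    return misses / len(ai_samples)


def verdict_flips(text: str, detect: Callable[[str], str], n_swaps: int = 1, seed: int = 0) -> bool:
    """True if the perturbed text gets a different verdict than the original."""
    return detect(text) != detect(swap_adjacent_words(text, n_swaps, seed))
```

Seeding the perturbation matters here: a verdict that flips is only useful as evidence if the exact altered passage can be regenerated and rechecked.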

Despite these results, I came away from my audit relatively impressed by Pangram’s solidity in the face of adversarial attacks. But I also walked away with renewed concern that the verdicts of AI text detectors are being treated as oracular determinations of a text’s origin when they should really serve as only one element of a broader investigation.

Here’s how I ran the test and what I found out.

After the paywall:
I explain my adversarial audit and share all 2,154 tested samples.
