AI text detector false positives: what the score proves

A detector score is a perplexity reading taken at an operating point you are usually not shown. It tells you the writing is statistically plain and predictable, not that a machine produced it. The only score that reads a real injected signal, rather than guessing from style, is a watermark score, and that exists only for text a cooperating generator marked. Everything else is a probability dressed up as a verdict.

What the number actually measures

Strip away the interface and a post-hoc detector reports one thing: how surprising your words are to a language model. Low perplexity (predictable word choices) and low burstiness (little variation from one sentence to the next) read as “AI.” The trouble is that plain, clean, formulaic human writing is low-perplexity too, so the same score that flags a machine flags a whole class of people. Liang et al. (Patterns 2023) demonstrate this cleanly. Across seven detectors run on 91 human-written TOEFL essays, the essays every detector unanimously misflagged had significantly lower perplexity than the rest (P = 9.74e-05). Then they moved the score on demand: prompting ChatGPT to “Enhance the word choices to sound more like that of a native speaker” dropped the average false-positive rate from 61.22% to 11.77%, while “Simplify word choices as if written by a non-native speaker,” applied to native essays, raised their misclassification from 5.19% to 56.65%. The score followed the perplexity in both directions. It was never reading authorship.
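A minimal sketch of the quantity involved may help. It assumes you already have per-token natural-log probabilities from some language model; the numbers below are invented for illustration, the function names are hypothetical, and treating burstiness as the spread of per-sentence perplexity is one common proxy rather than any particular vendor's definition.

```python
import math
from statistics import mean, pstdev

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-mean(token_logprobs))

def burstiness(sentence_perplexities):
    """Assumed proxy: population standard deviation of per-sentence perplexity."""
    return pstdev(sentence_perplexities)

# Hypothetical log-probabilities, invented for illustration only.
plain_prose = [-1.0, -1.2, -0.8, -1.0]    # predictable word choices
ornate_prose = [-3.0, -4.5, -2.5, -4.0]   # surprising word choices

print(perplexity(plain_prose))    # lower: reads as "AI" to a perplexity detector
print(perplexity(ornate_prose))   # higher: reads as "human"
print(burstiness([2.5, 2.7, 2.6]))  # small spread: uniform, "machine-like" rhythm
```

Nothing in the calculation refers to who wrote the text. Two authors who choose equally predictable words get the same reading.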

“99% accuracy” is quoted at a hidden operating point

A bare accuracy figure is uninterpretable without the false-positive rate it was read at. RAID (Dugan et al., ACL 2024) makes the dependence explicit: it selects a threshold so the false-positive rate is held at 5%, and describes itself as “one of the first shared resources to fix and disclose FPR.” Read the same open-source detectors at a naive default threshold instead and their false-positive rates become, in RAID’s words, “dangerously high,” reaching 99.3% for GLTR and 96.0% for LLMDet. The detectors do reach the high accuracies cited in viral reports, but “only at similarly high FPR.” A 99% number with no disclosed false-positive rate behind it is not a measurement; it is a figure quoted at an operating point chosen to flatter it. That is why a score without the tool version, the threshold, and the false-positive rate is not evidence you can weigh.
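The dependence on the operating point can be sketched in a few lines. This is an illustration of the general idea of fixing the false-positive rate first, not RAID's actual evaluation code; the scores are invented and the helper names are hypothetical.

```python
def threshold_at_fpr(human_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of human texts score above it.

    A text is flagged when its score is strictly greater than the threshold.
    """
    ranked = sorted(human_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # how many human texts may be flagged
    if allowed >= len(ranked):
        return float("-inf")
    # Scores strictly above the (allowed + 1)-th highest number at most `allowed`.
    return ranked[allowed]

def rates(human_scores, ai_scores, threshold):
    """Return (false-positive rate, detection rate) at the given threshold."""
    fpr = sum(s > threshold for s in human_scores) / len(human_scores)
    tpr = sum(s > threshold for s in ai_scores) / len(ai_scores)
    return fpr, tpr

# Hypothetical detector scores in [0, 1], invented for illustration only.
human = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.90]
ai = [0.35, 0.45, 0.58, 0.65, 0.75, 0.80, 0.85, 0.92, 0.95, 0.99]

naive = 0.5                                # a "default" cut-off nobody calibrated
calibrated = threshold_at_fpr(human, 0.10)  # FPR fixed first, then read detection

print(rates(human, ai, naive))        # (0.4, 0.8)
print(calibrated, rates(human, ai, calibrated))  # 0.7 (0.1, 0.6)
```

On this toy data the same detector catches 80% of machine text at a 40% false-positive rate, or 60% at a 10% false-positive rate. Quoting only the first detection figure hides the cost that comes with it.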

The same score, multiplied by scale

Two pieces of arithmetic show why even a low rate matters, and both are illustrative rather than measured corpus figures. First, base rates: a 1% per-document false-positive rate means one human-written essay in a hundred is wrongly flagged, which is roughly 10 across a 1,000-essay course and 500 across a 50,000-essay term. Second, granularity: a tool that flags at the sentence level compounds over a document, so if each sentence is flagged independently, a 1% per-sentence rate works out to about an 18% chance (1 - 0.99^20) of a false flag somewhere in a 20-sentence essay. The headline percentage and the lived risk are not the same number, and the gap always runs against the writer.
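Both calculations fit in a short sketch. It assumes every essay is human-written and that sentence-level flags are independent of one another, which is a simplification; the function names are hypothetical.

```python
def expected_false_flags(n_documents, per_document_fpr):
    """Expected number of human-written documents wrongly flagged."""
    return n_documents * per_document_fpr

def chance_of_any_false_flag(n_sentences, per_sentence_fpr):
    """Probability that at least one sentence is wrongly flagged,
    assuming each sentence is flagged independently."""
    return 1 - (1 - per_sentence_fpr) ** n_sentences

print(expected_false_flags(1_000, 0.01))    # about 10 essays in one course
print(expected_false_flags(50_000, 0.01))   # about 500 essays in one term
print(round(chance_of_any_false_flag(20, 0.01), 3))  # 0.182
```

If real sentence flags are positively correlated, the document-level chance will be lower than the independent figure, so the 18% is an illustration of compounding, not a prediction for any particular tool.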

What the measured false-accusation rates look like

On real human writing the numbers are not reassuring. Weber-Wulff et al. (International Journal for Educational Integrity 2023) measured a false-accusation rate of 2.4% on original human documents, rising to 11.1% once that human writing was machine-translated, with GPT Zero at 50.0%. Perkins et al. (2024) found a mean false-accusation ratio of 15% across their tools and human controls, with Copyleaks at 50%. These are not national-scale samples and should not be inflated into universal rates, but they make the key point: real human writing is repeatedly labelled as AI, and the rate is high enough that any serious process must assume false accusations will occur.

The strongest counterexample, and what it actually shows

Pangram is the strongest rebuttal on the false-positive side, and it deserves full strength. Emi and Spero (Pangram Labs 2024) report a 0.19% false-positive rate overall, 0% on the exact 91 TOEFL essays Liang used, and 0.09% on 5,600 ICNALE ESL essays. Taken at face value, that shows a carefully trained classifier can drive a known bias down. It does not show that a detector score generally proves authorship. Pangram’s own report attributes the 0% TOEFL result to “the composition of our training set,” which means the bias was trained away by including ESL data, not that the underlying signal became authorship-aware. And it is a single point on a curve, measured in distribution on the vendor’s own benchmark. A score is only as trustworthy as the false-positive rate at the operating point actually used, and most deployments never disclose theirs.

The one score with real footing

Contrast all of this with a watermark score, the exception that proves the rule. A green-list watermark reads an injected key rather than guessing at style: it is detectable “from short spans of tokens (as few as 25 tokens)” (Kirchenbauer et al., ICML 2023), and it survives strong human paraphrase, remaining detectable once about 800 tokens have been observed at a 1e-5 false-positive rate (Kirchenbauer et al., ICLR 2024). That is a score reading something real, a signal the generator actually inserted. But it exists only for cooperatively generated text, the case where a model marked its own output. For a person accused of using AI on their own writing, no such signal was ever present, and the only available score is the perplexity guess.
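The difference in kind shows up in the test statistic. The sketch below uses the z-score from Kirchenbauer et al. (2023), z = (green count - γT) / sqrt(Tγ(1 - γ)), where T is the number of scored tokens and γ the green-list fraction. Everything else is an assumption made for illustration: γ = 0.25, the key string, the normal approximation for the p-value, and the `is_green` helper, which is a toy per-pair hash standing in for the real scheme's keyed partition of the vocabulary.

```python
import hashlib
import math

GAMMA = 0.25  # assumed green-list fraction

def is_green(key, prev_token, token, gamma=GAMMA):
    """Toy stand-in for the keyed green-list lookup (hypothetical helper)."""
    data = f"{key}\x00{prev_token}\x00{token}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def count_green(key, tokens, gamma=GAMMA):
    """Count tokens that land on the green list seeded by their predecessor."""
    return sum(is_green(key, prev, tok, gamma) for prev, tok in zip(tokens, tokens[1:]))

def watermark_z_score(green_count, total_tokens, gamma=GAMMA):
    """z = (green_count - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z):
    """Normal-approximation chance of a z-score this high in unmarked text."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Unmarked text: green tokens turn up at roughly the base rate gamma.
tokens = "the quick brown fox jumps over the lazy dog".split()
greens = count_green("demo-key", tokens)
print(greens, watermark_z_score(greens, len(tokens) - 1))

# A fully marked 25-token span: every scored token is green.
z = watermark_z_score(25, 25)
print(round(z, 2), one_sided_p_value(z))  # z is 8.66
```

The null hypothesis here is concrete: text written without the key hits the green list at rate γ. That gives the false-positive rate a statistical meaning that a style-based score does not have.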

What the score proves, and what it does not

So what does a detector score prove? That the text is statistically unsurprising to a model, and nothing more. It does not prove a machine wrote it, because the same low-perplexity reading is produced by clean, plain, honest human prose, and Weber-Wulff et al. put the consequence bluntly: a detector report is “a simple claim without verifiable evidence.” For why the same mechanism singles out particular writers, see “Why AI text detectors falsely accuse real writers”; for what to do when a score has been used against you, see “Falsely accused of using AI? What to do.” The narrower lesson stands on its own: a score can start a question, but it cannot answer it. Without the operating point and corroboration from outside the detector, a perplexity reading is not an author.

Sources

Liang, Yuksekgonul, Mao et al. (2023). GPT detectors are biased against non-native English writers. Patterns 4(7):100779.
Dugan, Hwang, Trhlik et al. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. ACL 2024.
Weber-Wulff, Anohina-Naumeca, Bjelobaba et al. (2023). Testing of Detection Tools for AI-Generated Text. International Journal for Educational Integrity 19:26.
Perkins, Roe, Vu et al. (2024). GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education.
Emi, Spero (2024). Technical Report on the Pangram AI-Generated Text Classifier. Pangram Labs.
Kirchenbauer, Geiping, Wen et al. (2023). A Watermark for Large Language Models. ICML 2023.
Kirchenbauer, Geiping, Wen et al. (2024). On the Reliability of Watermarks for Large Language Models. ICLR 2024.