detectai.media

Are AI text detectors reliable?

The reliability ruling for AI text detectors: dependable where you own the watermark key or generation log, and not reliable enough anywhere else to carry a consequential decision.

By The DetectAI team

No, not for any decision that carries a consequence. The only text-detection methods with a genuinely low false-positive rate are the injected watermark and provider-side retrieval, and both cover only text the generator or provider already controls, not the writing at issue in a third-party accusation. For arbitrary writing, detection collapses under light editing, on text from models the detector was not trained on, and on whole classes of real human authors.

The honest case for detection, at full strength

The case has to be met at its best, not at the strawman of a cheap web tool. Binoculars (Hans et al., ICML 2024) detects over 90% of ChatGPT text at a 0.01% false-positive rate without ever being trained on ChatGPT output. DetectGPT (Mitchell et al., ICML 2023) lifts in-domain detection of GPT-NeoX from 0.81 to 0.95 AUROC by reading the curvature of a model’s probability surface. The Pangram classifier (Emi and Spero, Pangram Labs 2024) reports a 0.19% false-positive rate and 99% accuracy on its own benchmark. These are real numbers from serious work.

The qualification matters as much as the headline. Each result holds in distribution: on long, unedited text, from a model the detector was tuned against. Pangram’s figure in particular is a vendor technical report measured on the vendor’s own data, not an independent peer-reviewed test.

Each result holds only in distribution

Remove one of those conditions and the accuracy falls away. Paraphrase is the cheapest attack: the DIPPER paraphraser drops DetectGPT from 70.3% to 4.6% detection at a fixed 1% false-positive rate (Krishna et al., NeurIPS 2023), and a weaker recursive paraphraser cuts zero-shot AUROC from 96.5% to 25.2% (Sadasivan et al. 2023).

The benchmarks built for real conditions agree at scale. RAID (Dugan et al., ACL 2024), spanning over 6 million generations across 11 models, 8 domains and 11 adversarial attacks, finds that detectors advertising “99% or more” accuracy are “easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models.” A repetition penalty alone decreases accuracy by up to 32 points, changing the generator or decoding strategy is “enough to introduce up to 95+% error rate,” and cross-model accuracy “rarely achieves beyond 60%.” DetectRL (Wu et al., NeurIPS 2024) adds that adversarial perturbation cuts the AUROC of zero-shot detectors by 39.28% on average. The most comprehensive independent test of the period, Weber-Wulff et al. (International Journal for Educational Integrity 2023), found that the best tool, Turnitin, reached only 76%, 79%, and 81% across three scoring methods, and concluded the tools are “neither accurate nor reliable.”

The sharpest single fact comes from a generator’s own maker: OpenAI withdrew its AI Text Classifier in July 2023 “due to its low rate of accuracy,” disclosing a 26% true-positive rate and a 9% false-positive rate in its own announcement. The lab that built the model could not build a reliable post-hoc detector for it.
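To see what a 26% true-positive rate and a 9% false-positive rate mean for an individual flag, Bayes' rule can be applied directly. The sketch below is illustrative arithmetic only: the two rates are the ones quoted above, while the prevalence values (the share of submitted texts that really are AI-written) are hypothetical assumptions, since no such figure appears in the sources.

```python
def positive_predictive_value(tpr: float, fpr: float, prevalence: float) -> float:
    """Share of flagged texts that really are AI-written (Bayes' rule)."""
    true_flags = tpr * prevalence            # AI-written texts that get flagged
    false_flags = fpr * (1.0 - prevalence)   # human-written texts that get flagged
    return true_flags / (true_flags + false_flags)

# Rates quoted from OpenAI's announcement; prevalence values are hypothetical.
for prevalence in (0.05, 0.20, 0.50):
    ppv = positive_predictive_value(tpr=0.26, fpr=0.09, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: {ppv:.1%} of flags are correct")
```

Under these assumptions, the lower the true share of AI-written text in the pool, the larger the share of flags that land on human authors.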

A score is uninterpretable without its operating point

A detector’s “accuracy” means little without the threshold and false-positive rate it was read at. RAID fixes and discloses its operating point, selecting a threshold that holds the false-positive rate at 5%; most viral “99%” figures are quoted at a hidden one. Use a naive default threshold instead and open-source detectors’ false-positive rates become, in RAID’s words, “dangerously high.” A number with no named false-positive rate behind it is not evidence of anything.
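The dependence on the operating point can be sketched in a few lines. The example below is a minimal illustration, not RAID's evaluation code: the scores are synthetic draws from two hypothetical Gaussian distributions, and `threshold_at_fpr` and `rate_above` are names invented here. It calibrates a cutoff on human-written scores so that the false-positive rate stays at or below a target, then compares that with a naive default cutoff of 0.5.

```python
import math
import random

def threshold_at_fpr(human_scores, target_fpr):
    """Return a cutoff such that flagging scores strictly above it
    misfires on at most target_fpr of the human-written texts."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we can afford to flag (tolerance guards float error).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]

def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Synthetic, hypothetical detector scores: not taken from any real tool.
rng = random.Random(0)
human = [rng.gauss(0.35, 0.15) for _ in range(10_000)]
machine = [rng.gauss(0.65, 0.15) for _ in range(10_000)]

for label, thr in [
    ("naive default 0.5", 0.5),
    ("calibrated to 5% FPR", threshold_at_fpr(human, 0.05)),
    ("calibrated to 1% FPR", threshold_at_fpr(human, 0.01)),
]:
    print(f"{label}: threshold={thr:.3f} "
          f"FPR={rate_above(human, thr):.2%} TPR={rate_above(machine, thr):.2%}")
```

The same synthetic detector yields a different detection rate at each cutoff, and the uncalibrated default flags more human text than either calibrated threshold, which is why a headline figure with no stated false-positive rate cannot be compared with anything.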

The one real exception: cooperative provenance

There is a genuine exception, and it should not be erased. The injected watermark is the one text fingerprint that behaves like a reliable trace. A green-list watermark is detectable from as few as 25 tokens (Kirchenbauer et al., ICML 2023), and the reliability follow-up shows it survives strong human paraphrase once around 800 tokens are observed at a 1e-5 false-positive rate (Kirchenbauer et al., ICLR 2024). At deployment scale, SynthID-Text was A/B-tested across roughly 20 million Gemini responses with no statistically significant quality loss (Dathathri et al., Nature 2024). Provider-side retrieval is the other case: matching a candidate against a log of everything generated detects 80% to 97% of paraphrased generations at a 1% false-positive rate, but, as Krishna et al. (NeurIPS 2023) put it, it “must be maintained by a language model API provider.”
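The reason a green-list watermark gets more reliable with length can be shown with the one-proportion z-test described by Kirchenbauer et al. (2023). This is a minimal sketch of the statistic only: it assumes the green-token count has already been obtained, which in practice requires the watermark key, and the green-list fraction `gamma = 0.25` and the 60% observed green rate are illustrative assumptions, not figures from the cited papers.

```python
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """z-statistic for 'more green tokens than chance'.
    Without a watermark, each token is green with probability gamma."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z: float) -> float:
    """Normal-approximation probability of a z-score this high by chance."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical texts in which 60% of tokens fall on the green list.
for total in (25, 200, 800):
    green = round(0.60 * total)
    z = watermark_z_score(green, total)
    print(f"{total:>4} tokens: z={z:.2f}, p={one_sided_p_value(z):.2e}")
```

At a fixed green-token rate the z-score grows with the square root of the token count, so longer passages support much stricter false-positive targets. The count itself is only available to whoever holds the key, which is what makes the method cooperative.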

Both work, and both share a defining limit. They cover cooperatively generated text, the text a model or provider already controls. They say nothing about arbitrary writing typed into a word processor, which is exactly the case when a real person is accused of using AI on their own work. Whether those watermarks actually hold up in practice, surviving editing and removal attempts, is its own question, examined in depth by our sister site at watermarking.media.

What the strongest objections actually show

The counter-arguments sharpen the ruling rather than overturn it. First, the case here does not rest on dismissing the best detectors as vendor hype. Take the Binoculars, Pangram and SynthID results at face value and they still fail the consequential-use test, because they are measured in distribution on long, unedited, known-model text, and paraphrase, unseen models and real human populations remove precisely those conditions. The absence of independent replication for the vendor numbers cuts against the steelman, not in its favour.

Second, there is a forward-looking argument that human writing is converging toward an AI style, which would erode detection further. That is a direction of travel, not a load-bearing claim. Reject it entirely and the verdict is unchanged, because it already stands on the measured collapse and the structural false-positive problem.

Third, this is an empirical thesis, not a theoretical impossibility claim. The impossibility theorem of Sadasivan et al. (2023) is genuinely contested: the counter-argument that more and longer samples can buy detectability back (as surveyed in Ghosal et al. 2023) is real. The point is not that detection is impossible in principle. It is that, in practice, on the text and under the conditions that produce real accusations, it does not work reliably.

The ruling

A text detector is a scoped provenance tool, not a general truth machine. Where you hold the watermark key or the generation log, detection is reliable within its stated bounds. Where you are scoring a stranger’s arbitrary prose, it is not reliable enough to carry any decision that has a consequence. For assessing students or minors the line is firmer still: Perkins et al. (2024), testing six commercial detectors, concluded the tools “cannot currently be recommended for determining whether violations of academic integrity have occurred.”

Two questions follow from this: why these tools single out particular innocent writers, and what a wrongly accused person can do about it. The ruling for text itself is narrow and firm: reliable where you own the watermark key or the generation log, and unreliable everywhere a real accusation actually happens.

Sources

  • Hans, Schwarzschild, Cherepanova et al. (2024). Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text. ICML 2024.
  • Mitchell, Lee, Khazatsky et al. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. ICML 2023.
  • Emi, Spero (2024). Technical Report on the Pangram AI-Generated Text Classifier. Pangram Labs.
  • Krishna, Song, Karpinska et al. (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. NeurIPS 2023.
  • Sadasivan, Kumar, Balasubramanian et al. (2023). Can AI-Generated Text be Reliably Detected?
  • Dugan, Hwang, Trhlik et al. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. ACL 2024.
  • Wu, Zhan, Wong et al. (2024). DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios. NeurIPS 2024 Datasets and Benchmarks.
  • Weber-Wulff, Anohina-Naumeca, Bjelobaba et al. (2023). Testing of Detection Tools for AI-Generated Text. International Journal for Educational Integrity 19:26.
  • Kirchenbauer, Geiping, Wen et al. (2023). A Watermark for Large Language Models. ICML 2023.
  • Kirchenbauer, Geiping, Wen et al. (2024). On the Reliability of Watermarks for Large Language Models. ICLR 2024.
  • Dathathri, See et al. (2024). Scalable watermarking for identifying large language model outputs. Nature 634:818-823.
  • Ghosal, Chakraborty, Geiping et al. (2023). Towards Possibilities and Impossibilities of AI-Generated Text Detection: A Survey.
  • Perkins, Roe, Vu et al. (2024). GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education.
  • OpenAI (2023). New AI classifier for indicating AI-written text. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/ (accessed 24 June 2026).
Last updated: 24 June 2026