Why AI text detectors falsely accuse real writers

Because the only signal a post-hoc detector actually has is shared by whole classes of innocent people. That signal is low perplexity (text a language model finds unsurprising) together with low burstiness (little variation in that surprise from one sentence to the next). The false accusation is therefore a structural property of the method, not a bug that better tuning removes. Write plainly and predictably, as many honest people do, and the tool reads you as a machine.

The clearest evidence: seven detectors, one biased result

Liang et al. (Patterns 2023) ran seven widely used GPT detectors over 91 human-written TOEFL essays from non-native English writers and 88 essays from native US 8th-graders. The detectors reached near-perfect accuracy on the native essays and “misclassified over half of the TOEFL essays as ‘AI-generated’ (average false positive rate: 61.22%).” Per detector the TOEFL false-positive rate ran from 48% (ZeroGPT) up to 76% (Originality.ai), against 0% to 12% on the native essays. The difference between the two groups is not authorship, since both sets are human. The difference is style.

The mechanism: it tracks perplexity, not authorship

Liang et al. supply the causal evidence. The essays that all seven detectors unanimously misflagged had significantly lower perplexity than the rest (P = 9.74e-05): the model found them less surprising, and low surprise is the exact signal a detector reads as machine-written. It is the same low-surprise signal that powers curvature-based detectors like DetectGPT (Mitchell et al., ICML 2023), here misfiring on people. The study then closes the loop both ways. Prompting ChatGPT to “Enhance the word choices to sound more like that of a native speaker” raised the TOEFL essays’ perplexity and dropped their average false-positive rate from 61.22% to 11.77%. Prompting it to “Simplify word choices as if written by a non-native speaker,” applied to the native essays, raised their misclassification from 5.19% to 56.65%. Raise the perplexity and most of the accusations disappear; lower it and an innocent native essay is condemned. The flag follows the writing’s predictability, not its origin.
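The mechanism can be made concrete with a minimal sketch. It assumes a detector that does nothing but compare perplexity against a cutoff; the per-token probabilities, the threshold, and the function names are all hypothetical, chosen for illustration, and do not reproduce any of the seven tools Liang et al. tested.

```python
import math
from statistics import mean


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    token_probs holds the probability a language model assigned to each
    token that actually appeared. Higher probabilities mean less surprise
    and therefore lower perplexity.
    """
    return math.exp(-mean(math.log(p) for p in token_probs))


def flags_as_ai(token_probs, threshold):
    """Toy decision rule: flag any text whose perplexity is below the cutoff."""
    return perplexity(token_probs) < threshold


# Hypothetical per-token probabilities for two human-written sentences.
plain_human = [0.6, 0.5, 0.7, 0.6, 0.5]      # common words, standard phrasing
varied_human = [0.6, 0.05, 0.3, 0.02, 0.4]   # rarer words, less expected turns

THRESHOLD = 5.0  # hypothetical cutoff, not taken from any real detector

for name, probs in [("plain", plain_human), ("varied", varied_human)]:
    print(name, round(perplexity(probs), 2), flags_as_ai(probs, THRESHOLD))
```

Both inputs stand for human writing. The rule never sees who wrote the text, only how predictable it was, so the plainer sentence is the one that falls under the cutoff.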

Who shares the low-perplexity signature

This is why the harm is not random. Plain, predictable, low-surprise prose is the natural register of identifiable groups: non-native and ESL writers, people writing through machine translation, neurodivergent and formulaic writers, and developing or young writers still building range. None of them did anything but write in clean, standard English, which is precisely what a perplexity-based detector reads as artificial.

The same failure on real human controls

Independent tests measure the same thing outside Liang’s TOEFL set. Weber-Wulff et al. (International Journal for Educational Integrity 2023) recorded a false-accusation rate of 2.4% on original human documents, rising to 11.1% on machine-translated human documents, with GPTZero at 50.0%. Perkins et al. (2024) measured a mean false-accusation ratio of 15% across their detectors and human controls, with Copyleaks at 50% and control-sample accuracy of only 67%. These are small samples, but they are real human writing wrongly labelled as AI, and the worst tools wrongly flag one human document in two.
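How much weight a small sample can bear is easy to quantify. The sketch below computes a Wilson score interval for a false-accusation rate; the function name and the example counts are hypothetical (the studies' per-tool document counts are not restated here), and the 95% level is an assumption.

```python
import math


def wilson_interval(false_flags, human_docs, z=1.96):
    """Wilson score interval for a false-positive rate.

    false_flags: human documents wrongly labelled as AI
    human_docs:  total human documents tested
    z:           normal quantile (1.96 is roughly a 95% interval)
    """
    if human_docs <= 0:
        raise ValueError("human_docs must be positive")
    p = false_flags / human_docs
    z2 = z * z
    denom = 1 + z2 / human_docs
    center = (p + z2 / (2 * human_docs)) / denom
    half = z * math.sqrt(p * (1 - p) / human_docs
                         + z2 / (4 * human_docs ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the same observed 50% rate at two sample sizes.
for flagged, total in [(5, 10), (50, 100)]:
    low, high = wilson_interval(flagged, total)
    print(f"{flagged}/{total}: {low:.1%} to {high:.1%}")
```

The same observed rate gives a much wider interval on ten documents than on a hundred, which is why the small-sample figures are best read as a range rather than a point estimate.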

The best counterexample, and its limit

The strongest rebuttal deserves full strength. On the exact 91 TOEFL essays Liang used, the Pangram classifier (Emi and Spero, Pangram Labs 2024) reports a 0% false-positive rate, with 0 false positives across 3,907 ELLIPSE essays and 0.09% across 5,600 ICNALE ESL essays, concluding it is “not biased against text written by non-native English speakers.” Taken at face value, that is a real, large-sample result.

It bounds the harm rather than erasing it, for one decisive reason that Pangram states itself: it attributes the 0% to the composition of its training set, which includes 165,000 ESL examples. The bias is trained away, not shown to be intrinsically absent, and the deployed tools that caused the documented harm still show it: even GPTZero’s updated, ESL-corrected model has a 7.7% TOEFL false-positive rate on Pangram’s own chart. A vendor can build a 2024 classifier that avoids the bias; that does not make the perplexity signal itself authorship-aware.

Base rates turn a small percentage into many people

Even a small per-document false-positive rate scales badly. A 1% rate means roughly 10 wrongly flagged essays in a 1,000-essay course and about 500 in a 50,000-essay term (illustrative arithmetic, not a measured corpus figure). And 1% is optimistic for many tools: RAID (Dugan et al., ACL 2024) shows that at a naive threshold open-source detectors’ false-positive rates are “dangerously high,” reaching 99.3% for GLTR and 96.0% for LLMDet, and that the high headline accuracies hold “only at similarly high FPR.” A reassuring accuracy figure and a damaging false-positive rate routinely describe the same tool at two different operating points.
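Both points in this section, the scaling and the operating points, can be sketched in a few lines. Everything here is illustrative: the function names and detector scores are hypothetical, and the per-student calculation assumes each essay is flagged independently, which real detectors need not satisfy.

```python
def expected_false_flags(n_essays, fpr):
    """Expected number of human essays wrongly flagged at a given rate."""
    return n_essays * fpr


def chance_of_any_false_flag(essays_per_student, fpr):
    """Probability that one honest student is flagged at least once,
    assuming (hypothetically) independent decisions per essay."""
    return 1 - (1 - fpr) ** essays_per_student


def rates_at_threshold(human_scores, ai_scores, threshold):
    """Return (true-positive rate, false-positive rate) when every score
    at or above the threshold is labelled AI."""
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    return tpr, fpr


# The article's illustrative arithmetic.
print(expected_false_flags(1_000, 0.01))
print(expected_false_flags(50_000, 0.01))
print(round(chance_of_any_false_flag(10, 0.01), 3))

# Hypothetical detector scores (higher = "more likely AI").
human_scores = [0.10, 0.20, 0.35, 0.55, 0.70]
ai_scores = [0.45, 0.60, 0.80, 0.90, 0.95]

for threshold in (0.40, 0.75):
    tpr, fpr = rates_at_threshold(human_scores, ai_scores, threshold)
    print(f"threshold={threshold}: catches {tpr:.0%} of AI, "
          f"flags {fpr:.0%} of humans")
```

On these toy scores the lower threshold catches every AI text but also flags two of the five human texts, while the higher threshold flags no humans and misses two of the five AI texts. It is one detector in both cases; only the operating point changes.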

Mechanical, not incidental

The pattern across every study points one way. A detector that can be defeated by writing less predictably, and whose failure mode is to condemn the plainest, most standard prose, is not measuring who wrote a text. If this has happened to you, our companion piece, “Falsely accused of using AI? What to do,” sets out how to contest it. What the evidence settles here is narrower and harder: the false positive is mechanical, not incidental, because low surprise is the only thing the method reads, and whole classes of honest writers are low-surprise by nature.

Sources

Liang, Yuksekgonul, Mao et al. (2023). GPT detectors are biased against non-native English writers. Patterns 4(7):100779.
Mitchell, Lee, Khazatsky et al. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. ICML 2023.
Weber-Wulff, Anohina-Naumeca, Bjelobaba et al. (2023). Testing of Detection Tools for AI-Generated Text. International Journal for Educational Integrity 19:26.
Perkins, Roe, Vu et al. (2024). GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education.
Emi, Spero (2024). Technical Report on the Pangram AI-Generated Text Classifier. Pangram Labs.
Dugan, Hwang, Trhlik et al. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. ACL 2024.