Metannoying
Your brain on "not X but Y"
In 1994, the College Board removed antonym questions from the SAT. Many of you don’t remember this. I do, because my former Johns Hopkins professor, Julian Stanley, thought it was a scientific tragedy.
Stanley had spent two decades using the SAT to identify gifted and talented twelve-year-olds. He had founded the Study of Mathematically Precocious Youth in 1971 and, by the early 1990s, had identified thousands of children whose reasoning abilities were five or six years ahead of their age. He had used the SAT because the usual age- and grade-level tests administered by school districts across the country couldn’t distinguish between a moderately gifted seventh-grader ready for high school algebra and a profoundly gifted one ready for college calculus. The ceiling was too low.
When you give a 12-year-old the SAT, you can see the difference between a student getting a 1200 and one getting a perfect 1600. It turns out that the category of “gifted” is enormous.
It also turns out that the antonym section of the verbal test does important work supporting the math test in assessing reasoning, which is why removing it was such a bad move for seeing talent.
Here’s your brain on antonyms. You’re given a word, like OBDURATE, and a list of five options (PERMANENT, FADING, YIELDING, FLEEING, SOFT). Speed is the goal; you’re supposed to answer in seconds. The brain must retrieve the semantic definition, identify the relevant dimension (flexibility of will), invert it, and evaluate which option maps precisely onto the inversion. PERMANENT seems like it is friends with OBDURATE, not enemies, so that’s out. YIELDING works. FLEEING involves movement but not compliance or will, so it’s probably out. SOFT is tempting (you don’t want an obdurate pillow) but it’s imprecise, so best to go with YIELDING.
This happens in seconds. A student with high verbal aptitude processes this almost instantaneously.
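The elimination routine can be sketched as a toy lookup. Everything here is illustrative: the dimensions and polarity values are hand-assigned for this one example, not drawn from any real semantic lexicon.

```python
# Toy model of the antonym task: each word is tagged with a semantic
# dimension and a polarity (+1/-1) on that dimension. Hand-assigned
# for illustration; a real mental lexicon is vastly richer.
LEXICON = {
    "OBDURATE":  ("flexibility_of_will", -1),
    "PERMANENT": ("duration",            +1),
    "FADING":    ("duration",            -1),
    "YIELDING":  ("flexibility_of_will", +1),
    "FLEEING":   ("movement",            +1),
    "SOFT":      ("texture",             -1),
}

def antonym(stem, options):
    """Pick the option on the stem's dimension with inverted polarity."""
    dim, pol = LEXICON[stem]
    for opt in options:
        d, p = LEXICON[opt]
        if d == dim and p == -pol:  # same dimension, opposite pole
            return opt
    return None

result = antonym("OBDURATE",
                 ["PERMANENT", "FADING", "YIELDING", "FLEEING", "SOFT"])
print(result)  # YIELDING: the only option sharing the dimension, inverted
```

The interesting part is what the lookup hides: the test-taker has to build the dimension and polarity tags on the fly, under time pressure, while suppressing near-misses like SOFT.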
Antonym tests separate the “well-read” from the “verbally precocious.” A 12-year-old who could correctly identify the antonym of LACONIC or DIAPHANOUS was demonstrating a lexical network that was statistically deviant (in the positive sense) for their age.
Psychometricians have called antonyms the most efficient measure of verbal intelligence in the toolkit: high g-loading, high discrimination at the top of the ability range, no contextual scaffolding. Stanley thought that the College Board removing them and renaming the test “assessment” was capitulating to the idea that talent is something you could achieve, not something innate.1 He believed talent could be found everywhere, not just in the “best schools.” He believed the complex process of working out antonyms, which involves executive function, inhibitory control (suppressing things that are “like” the word), and abstract reasoning all working together, was a purer measure of ability than “reading comprehension.”
In any event, this same task — building a concept and inverting it — is very similar to what LLM prose asks with its tiresome “not X but Y” formulation, also known as “corrective contrast” or contrastive negation.
“The challenge wasn’t finding talent; it was retaining it.” “She didn’t want sympathy; she wanted solutions.” “The real issue isn’t technical — it’s cultural.”
It’s nails on a chalkboard for me, if you remember chalkboards. I stop reading at the third instance in any piece of prose.
It surprises me that more people aren’t mad. So I decided to analyze my own irritation. It turns out the antonym research explains why.
First, why do LLMs do this? Because of limitations on how models represent meaning. In vector space models, word meaning is defined by distributional context. Synonyms have high cosine similarity because they appear in similar sentences. Antonyms also have high cosine similarity, because they appear in identical sentences. “I like hot coffee” and “I like cold coffee” occupy the same distributional space. The models see that hot and cold are mathematically close.2 They do not inherently compute the oppositeness relation. One way to understand the “not X but Y” construction is as a workaround for the model’s inability to compute opposition the way humans do. By explicitly stating both the rejected term and the replacement, the model externalizes onto the page an operation it cannot perform internally.
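You can see the problem with a toy sketch: build co-occurrence vectors over a tiny hand-made corpus and compare them with cosine similarity. The six sentences and the window size are illustrative assumptions, not a real embedding model, but the effect is the same one that shows up in GloVe-scale training.

```python
import math
from collections import Counter

# Tiny corpus in which the antonyms "hot" and "cold" appear in
# identical contexts -- the situation the essay describes.
sentences = [
    "i like hot coffee", "i like cold coffee",
    "the hot water felt nice", "the cold water felt nice",
    "she drank hot tea slowly", "she drank cold tea slowly",
]

def context_vector(target, corpus, window=2):
    """Count words co-occurring with `target` within +/- `window` positions."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                for j in range(max(0, i - window), min(len(words), i + window + 1)):
                    if j != i:
                        counts[words[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

hot, cold, nice = (context_vector(w, sentences) for w in ("hot", "cold", "nice"))
print(cosine(hot, cold))  # high: the antonyms share every context
print(cosine(hot, nice))  # lower: a merely related word
```

In this corpus "hot" and "cold" have identical context vectors, so their similarity is maximal; the distribution alone carries no signal that they are opposites rather than synonyms.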
The “corrective contrast” construction reduces ambiguity in the output space. Users want clarity. “Not X but Y” to the LLM is an insurance policy on clarity.
Luckily for the designers of LLMs, corrective contrast also sounds cool, memorable, and often profound, at least in moderation. “I came not to bring peace but a sword” (Matthew 10:34). Or “it’s not the heat, it’s the humidity.”
Classical rhetoric had a name for the deliberate version: metanoia, or correctio, the performed self-correction where a speaker revises mid-sentence to find the more precise or more forceful formulation. When Brutus tells the Roman crowd “Not that I loved Caesar less, but that I loved Rome more,” the audience holds “loved Caesar less” and suppresses the idea to receive the reframe. The delay and the cognitive cost are the point. Shakespeare knows the negated proposition will linger as a kind of understatement that makes the correction feel like an escalation.
But LLMs are not Shakespeare (yet); there’s no rhetorical occasion for the move and, worse, no limiting function, which is why you can get “not X but Y” every other paragraph. LLMs are corrective-contrast-maxxing for maximum comprehension across the widest possible readership.
Back to my original point. The more skilled you are as a reader, the more this construction costs you. Here’s why.
Take a sentence like “The problem was not a lack of resources but a lack of coordination.” If you are a skilled reader, your brain does what it does with an antonym test: you encode the concept, activate its semantic neighborhood, and suppress it. Psycholinguistic research on negation shows that when readers encounter a negated proposition, they first simulate the affirmed state of affairs, building a model of “lack of resources,” before constructing a mental model of the actual, negated situation.
Neuroimaging research shows that this polarity reversal (hearing the “not”) recruits the inferior frontal gyrus and the dorsolateral prefrontal cortex, the same inhibitory networks associated with fluid intelligence. The process adds several hundred milliseconds of processing time. Separately, the brain produces a measurable electrical signal called the N400 (a negative voltage deflection peaking around 400 milliseconds after stimulus onset) whenever a word is unexpected in its semantic context. A later P600 component reflects the brain going back to revise the structure it already built, which means the reader is paying twice: once to suppress, once to rebuild.
The cost can be measured, in fact. MEG studies tracking the time-course of semantic processing suggest that skilled readers process semantic relatedness within the first 200-400 milliseconds. By the time your eyes reached “coordination,” your prefrontal cortex had already suppressed “resources,” generated alternatives, and was evaluating the fit. The “not X but Y” construction then asks you to spend another one to four seconds reading a clause that delivers an answer you computed in under half a second.
A skilled reader is like yeah, I already guessed that, thanks for wasting my time.
There’s a social cost on top of the cognitive cost. Linguists use the term “K− position” to describe the status of presupposed ignorance, the stance a speaker assigns to a listener who is assumed not to know something. Every instance of “not X but Y” places the reader in the K− position. The construction implies the reader was holding X and needed correction. When you, the reader, were not in fact holding X, you feel like you’re being talked down to.
Paul Grice would call this a violation of the Maxim of Manner, which asks speakers to avoid unnecessary complexity. There’s no reason the reader needed to feel corrected.
Corrective contrast is computationally cheap, and the savings come with an emotional asymmetry. Your LLM doesn’t care that you’re irritated or feeling condescended to. It never entertained the first proposition as a belief state; it never felt the idea of “contrast.” It simply generated a sequence without commitment, without suppression, without revision, whose final semantics were already settled.
Developmental research shows that children acquire the concept of “opposite” around age four, a milestone that marks the transition from associative thinking to logical processing. Before four, a child who hears “hot” says “cold” because the words co-occur in speech. After four, the child understands the abstract relation of polarity reversal and can generate antonyms for unfamiliar words. The shift is from learning pairs to learning a rule.
The difference between the three-year-old who says “cold” in response to “hot” and an LLM is that the child will soon acquire the abstract rule of polarity reversal and leave co-occurrence behind. LLMs may someday have architecture that supports that transition. But not yet (see footnote 2).
Which brings me back to Julian Stanley. A contrastive construction is not always wrong. A writer correcting a genuinely widespread misunderstanding needs “not X but Y.” It is the right tool when the reader actually holds X.
Consider the idea that for decades, gifted education involved an “enriched” curriculum with more arts and culture instead of acceleration in the subject a child was interested in, be it math or science or poetry. Stanley thought enrichment was a kind of malpractice.
Here are three ways to write a statement about the limitations of enrichment:
“Enrichment didn’t move students forward; it moved them sideways.” This is how an LLM would put it. It negates enrichment’s ability to move a student forward and then supplies the word “sideways.” The reader must build and demolish before receiving the point. It sounds profound and conclusive. Many readers would say yes, and nod.
Compare to: “Enrichment moved students sideways, not forward.” I like this phrasing better. It states the claim about the problem of enrichment and then sharpens the claim by excluding a plausible misreading. Rather than a “not X but Y” sentence, the reader receives Y first and X serves as refinement.
Compare to: “Enrichment moved students sideways.” I like this phrasing best. It is the poet-logician’s preferred construction. In the midst of a conversation about acceleration, pacing, and gifted children, the word “sideways” does all the work. The antonym of sideways is forward.
The real difference between these three versions is trust. The first assumes the reader is holding a wrong idea and needs correction. The second assumes the reader can receive the right idea and might appreciate knowing what’s excluded. The third assumes the reader can follow you. The more trust, the more the prose rewards fast processors. The less trust, the more it feels like being walked through eighth-grade math when you’re ready for calculus.
Stanley’s career demonstrated that giving every student the same grade-level test rendered the most capable students invisible. LLM-generated prose wants so much to be clear to everyone that it talks down to readers who don’t need the extra words, correcting beliefs they don’t hold, resolving contrasts they’ve already computed. It is the prose equivalent of that ceiling.
[I put this together prepping for a talk on Julian Stanley at Alpha School and I thank Pamela Hobart for the inspiration.]
1. For Stanley’s research with SMPY, this meant the “SAT-V” score after 1994 was measuring a slightly different construct than the “SAT-V” score before 1994. The new test was more “speeded” regarding reading volume. A gifted 12-year-old might have a college-level vocabulary (perfect for antonyms) but a middle-school reading speed (bad for long passages). Thus, the new test might underestimate the verbal reasoning potential of precocious youth.
2. The distributional indistinguishability of antonyms and synonyms is well documented in computational linguistics. Nguyen, Schulte im Walde, and Vu (2017) describe antonyms and synonyms as “notoriously difficult to distinguish by distributional co-occurrence models” because both relations produce similar contextual distributions. Ali et al. (2024) demonstrate the problem concretely: in GloVe embeddings, the nearest neighbors of “large” include synonyms (“larger,” “huge”) and antonyms (“small,” “smaller”) undifferentiated. The subfield of antonym-synonym distinction in NLP exists because standard embedding models lack a mechanism for computing polarity. See Nguyen et al., “Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network,” Proceedings of EACL 2017, pp. 76–85; Ali et al., “Antonym vs Synonym Distinction using InterlaCed Encoder NETworks (ICE-NET),” Findings of EACL 2024, pp. 1462–1473. Recent advances in “linear probing” (the use of diagnostic classifiers to examine the hidden states of neural networks) suggest that modern transformers may not be as blind to opposition as early vector models. Research on “Contrast-Consistent Search” (CCS) and “Polarity-Aware Probing” indicates that LLMs can develop internal representations that distinguish synonyms from antonyms in their intermediate layers. However, the persistence of the “not X but Y” construction in their generated prose suggests that even when a model “knows” an opposition internally, it still relies on explicit linguistic scaffolding to resolve the distributional ambiguity inherent in its training data. See Agarwal (2025), Polarity-Aware Probing for Quantifying Latent Alignment in Language Models; and Liu et al. (2024), Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics.



