Science Is Not a Document Factory
Science is not strengthened by increasing the volume of plausible documents.
In Foucault’s Pendulum, Umberto Eco tells the story of three editors at a publishing house who begin playing with a computer called Abulafia.
They feed it fragments — scraps of history, occult symbolism, Knights Templar lore, cabalistic diagrams — and let it synthesize them into patterns. At first it’s a joke. A game. They want to prove that if you give a machine enough disconnected facts, it can produce something that sounds like a grand theory of everything.
At one point, they push it further. They feed it nonsense inputs and watch as it confidently generates elaborate conspiratorial explanations — including one that links Mickey Mouse to secret mystical orders.
The machine doesn’t know anything. It is simply recombining fragments according to its programming.
But the output is coherent. It cites connections. It weaves patterns. It sounds meaningful.
And that is enough to make it unsettling.
Eco’s point wasn’t about computers. It was about humans — about our tendency to mistake pattern generation for truth.
That vignette came back to me when I read a recent paper in Nature introducing a benchmark called ScholarQABench, designed to test whether large language models can perform “scientific synthesis.” In other words: can they read the literature and write long, well-cited answers to research questions?
On its face, this seems sensible. If AI systems are going to assist science, we should evaluate them.
But the benchmark measures something very specific: whether a model can produce text that resembles a literature review.
And that raises a deeper question.
Is science about producing more papers?
Or is it about building shared knowledge?
ScholarQABench measures the former and treats it as if it were the latter.
What the benchmark actually measures
The benchmark evaluates models on questions like:
Are the citations real?
Are they relevant?
Does the answer overlap with a reference review?
Is the prose coherent and well structured?
All of that concerns the form of scholarship.
What it does not measure is whether the output would survive scrutiny in a real scientific community — whether it integrates evidence correctly, represents disputes fairly, or reflects where consensus genuinely lies.
The benchmark defines “correctness” by comparison to reference texts. If the model resembles the reference review and cites plausible papers, it scores well.
That means truth is grounded in resemblance to other (potentially obsolete) writing.
And that matters, because scientific knowledge is not created when a document is produced. It is created when a community stabilizes around a claim.
This is, by definition, a moving target.
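To make concrete how thin those checks are, here is a deliberately simplified sketch in Python of what a resemblance-based evaluation can verify automatically. It is not ScholarQABench’s actual scoring code; the function names, the equal weighting, and the idea of a simple title index are illustrative assumptions on my part. Everything it computes is a surface property of the text.

```python
# Illustrative sketch only: not the benchmark's real scoring code.
from difflib import SequenceMatcher


def citation_exists(citation: str, known_titles: set[str]) -> bool:
    # "Real" here only means the cited title appears in some index,
    # not that the paper actually supports the claim it is attached to.
    return citation in known_titles


def resemblance(answer: str, reference_review: str) -> float:
    # Surface similarity to a reference review; a crude character-level
    # ratio stands in for the n-gram or embedding metrics a real
    # benchmark would use.
    return SequenceMatcher(None, answer, reference_review).ratio()


def score_answer(answer: str, citations: list[str],
                 known_titles: set[str], reference_review: str) -> float:
    valid = [c for c in citations if citation_exists(c, known_titles)]
    citation_rate = len(valid) / max(len(citations), 1)
    # The score rewards plausible citations and textual resemblance;
    # nothing in it asks whether the synthesis is true, fair to ongoing
    # disputes, or aligned with where consensus actually lies.
    return 0.5 * citation_rate + 0.5 * resemblance(answer, reference_review)
```

A polished but wrong synthesis can score well under a rubric like this, and a sound one that departs from the reference review can score poorly.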
The crucial detail: the “experts” are trainees
Now here is the part that should give everyone pause.
The reference answers — the gold standards — were written by PhD students and postdocs.
These may be intelligent, capable researchers. But they are, by definition, still in training. They are not senior scholars who have shaped their fields. They are not panels synthesizing consensus after decades of debate.
Yet their literature reviews become the epistemic anchor against which models are judged.
Why does that matter?
Because early-career academic writing, even at its most thorough, tends to be overly narrow, insufficiently synthetic, and deep “in the weeds”. It signals that one has read many things, but not necessarily that one has had time to make sense of them and determine what is truly important.
Early-career researchers can rarely resolve the field’s hardest disagreements or clarify what is genuinely known. Senior academics often can, for one simple reason: they were there when it happened.
To them, the literature is not just more papers: it is lived experience.
However, the benchmark rewards models for imitating the performance of academic competence at a trainee level, and then treats resemblance to that performance as scientific synthesis.
That is not contributing to collective knowledge.
Science works because it is social
Scientific knowledge is not just information. It is coordinated belief across a community of people who can question one another, replicate one another’s work, and, crucially, change their minds.
A literature review is not knowledge. It is a summary of the state of the art of an ongoing conversation, from a given author’s perspective.
The reason we care about who writes that review — whether it reflects consensus, whether it is trusted, whether it will be challenged — is because knowledge is social. It depends on accountability and shared standards.
Large language models do not participate in that process. They do not defend claims. They do not revise interpretations in light of criticism. They do not accumulate credibility over time. They do not have lived experience. At best, they can ingest more papers than a human can read.
And then, they generate text.
When we evaluate them on how well they reproduce literature-review-shaped text, calibrated to trainee-written examples, we are measuring document mimicry against a surface-level standard of understanding, not genuine understanding, and certainly not participation in an epistemic community.
Why this distinction matters now
None of this means LLMs are useless in science. They can be helpful tools — for summarizing, reorganizing, clarifying, drafting. I used one to help draft this essay.
But here is the difference: I did the reading. I did the thinking. I formed the judgment. I verified that the output says what I mean. And I edited the results extensively, because the LLM got a lot wrong.
The model helped with expression. It does not bear epistemic responsibility.
That is fundamentally different from asking whether a system can autonomously synthesize scientific literature and then scoring it based on resemblance to trainee-authored reviews.
Science is not strengthened by increasing the volume of plausible documents.
It is strengthened when claims help us make sense of the world around us and are integrated into shared understanding.
If we begin treating polished synthesis documents as equivalent to that process, we risk confusing the appearance of scholarship with the growth of knowledge.
And once those two things are confused, science is not advancing — it is just producing more text.
We know how this ends
In Foucault’s Pendulum, Abulafia begins as a joke. A playful machine for recombining fragments into grand designs. The three editors know they are inventing patterns. They know the “Plan” it generates is nonsense.
But something happens.
The patterns become elaborate. Convincing. Internally coherent. The machine produces so many connections that even its creators begin to feel their gravitational pull. What started as parody slowly acquires weight.
Others take it seriously.
People who lack expertise start to believe in hidden structures and see confirmation everywhere. The Plan spreads. The joke metastasizes. And by the end of the novel, the game has consequences — obsession, paranoia, and violence. The pattern-generator did not merely entertain. It destabilized the boundary between truth and belief.
Abulafia produced coherence without responsibility, and humans treated that coherence as if it were knowledge.
That is the cautionary tale.
Large language models are far more sophisticated than Abulafia. But they share one fundamental property: they generate convincing patterns from fragments of text. If we build systems — and benchmarks — that reward the production of document-shaped coherence, we risk drifting toward a world where the appearance of synthesis substitutes for the collective labor that makes knowledge real.
Science does not collapse in a dramatic moment. It erodes when we begin mistaking beautifully arranged symbols for shared understanding.
Eco’s editors learned too late that playing with pattern-generating machines can change how people relate to truth itself.
We would do well to remember that lesson — before we start grading the machines on how well they can imitate us.


Good post. I imagine that citations are meant to be a good-faith demonstration that someone worked all of this out and isn't just guessing in general. The fear of bad AI citations is that the writer themselves didn't think all of this through and was relying on AI logic (which we already know is wobbly and fraught).
If we run with this idea, then it seems like the AI is kind of a breach of contract. Like discussing whether the bun you had for breakfast was really made by grandma or if it was made in a factory.
We have some notions of craft vs. mass-produced in materials and media, but I don't think we have a really good way to think about mass production of ideas. That's going to be tricky, in part because being an intellectual has had its own cachet and status. Introducing mass production of ideas destabilizes that social economy.
Not sure what I think about that quite yet.
Science "...is strengthened when claims help us to make sense of the world around us, and that is integrated into shared understanding."
I suggest inserting the phrase "empirically validated" to modify "claims" and maybe add "...to enhance broadly beneficial outcomes" at the end. I accept that "beneficial outcomes" is not fundamental to science, but beneficial outcomes would likely help strengthen science within the broader socio-political context.
Otherwise, I agree with much of your critique of the Nature paper, but your critique seems to be something of a strawperson argument, because the paper does not appear to be proposing to replace science or even lit reviews with LLMs. Rather, it seems to be a limited comparison of LLMs under controlled conditions. Maybe at some point LLMs will be able to produce a decent first draft of a lit review, which can then be edited by human experts with a range of experience. The LLMs might be able to identify literature that would be overlooked by human experts.