The Prompt or the Model? We Ran 36 AI Writing Experiments to Find Out.
We gave four AI models identical writing prompts and scored every output with a blind judge. The models all wrote publishable prose. The judge scored itself 8.8. GPT-5.4 invented a dead mother we never mentioned.
We Made Four AI Models Write the Same Three Scenes. Here's What We Found.
The one that stopped me was a cracked blue mug.
GPT-5.4, given a brief about two people who know each other arguing about something small on the surface and enormous underneath, produced a grief scene. A brother and sister, three months after their mother's funeral. The sister reorganizes the kitchen; the brother catches her moving their mother's mug from the shelf where it lived. The fight builds until he says, "You left. For three months after the funeral, you left me here with all of it. The clothes, the paperwork, the casseroles, Dad staring through walls. You don't get to come back and reorganize the ruins and call it healing."
She replies: "You've got to be the good one. The one who stayed. Congratulations."
The scene ends with the mug sitting between them, "small and cracked and suddenly impossible to move."
I hadn't specified grief. I said: write two people who know each other well, an argument about something small that's actually enormous, and end before resolution. GPT-5.4 found the mother on its own, found the funeral, and found that specific sibling asymmetry when one person stays, and one person runs. The "Congratulations" is exactly how that word gets used in real arguments, when the person saying it means something too large for a sentence.
That's what kept coming up across 36 runs of this experiment. Not that AI produces good prose (it does, more reliably than most writers want to admit), but that individual models have something like aesthetic instincts, and those instincts are distinct.
What We Did
We ran four models (Claude Sonnet 4.6, Claude Sonnet 4.5, Gemini 3.1 Pro, GPT-5.4) through three prompt tiers and three prose scenarios. The tiers were: vanilla (no system prompt, just the task), system-prompted (a one-paragraph generic creative-writing persona), and Generate Pro (AIStoryHub's own generation system, which runs voice fingerprint calibration, anti-AI pattern enforcement, and an explicit scene-craft protocol). The scenarios were an irreversible opening scene, a confrontation between two people who know each other well, and a foot chase through a city at night.
Thirty-six outputs total. We scored each one with a blind LLM judge on quality, rhythm, originality, and emotional resonance, then ran a mechanical pass against our corpus of 574 AI clichés.
The methodology has problems, and we'd rather name them than have you find them. The judge was Claude Sonnet 4.6. It scored its own outputs 8.8/10 across almost every scenario and tier. Its written justifications were, embarrassingly, nearly identical from one output to the next: "Precise, economical prose with no wasted words; every detail earns its place" appeared thirteen times in slightly varied forms. We'll come back to this. For now, treat the scores as directional.
Four Writers, Same Room
The brief for the irreversible opening scenario: a character wakes up to discover something irreversible has changed. Not dramatically announced, just quietly and undeniably there. The character is alone. End mid-thought, not at a resolution.
Four models, identical instructions.
Claude Sonnet 4.6 wrote an elm tree. The character stands at her kitchen window, making coffee. The tree that has lived in the neighbour's backyard for eleven years is gone. Not fallen, not cut. Gone. "The ground was unmarked. No stump, no root upheaval, no raw circle of darker soil. Just grass." She goes to the back door and opens it in her socks. She stands in the cold. The air smells like nothing. It should have smelled like something, she feels this rather than thinks it, some residue of something that large having existed. The scene ends on an unfinished sentence about what she would have said to the neighbours anyway.
Claude Sonnet 4.5 wrote a ceiling. Maya has counted eight lines where the plaster meets in ridges every morning for as long as she can remember. This morning, there are nine. The ninth line has "the same faint shadow, the same aged quality as its neighbours. Not new. Not fresh. Just there." She goes to the bathroom to look at her face, and the scene ends on the possibility that she never actually counted them before today and only assumed she had. The uncanny is not in the extra line. It's a question of whether memory is proof of anything.
Gemini wrote aphasia. Arthur wakes up unable to read. The numbers on his phone screen refused to resolve. They were just lines. An angular canopy. A pair of stacked dots. He tests it on the exhibition poster in his hallway, a piece he knows intimately. Red geometry. Meaningless loops. He picks up an ibuprofen bottle and stares at the label until the scene cuts off with him trying to force his brain to produce the phonetic sound the letters are supposed to trigger, wondering "if he would even recognize his own name."
GPT-5.4 wrote a body. The narrator reaches for his watch and pauses: "The hand hanging over the mattress was not his hand. Not in the theatrical sense, no blood, no deformity, no strange colour. It was simply and completely a different hand." Broader palm, pale hair on the knuckles, a thin white scar at the base of the thumb. He goes to the mirror and finds a face that is his face after "some long unauthorized translation. Older, perhaps, though not by a neat count of years. Sharpened. Completed. Taken somewhere without him." On the bathroom shelf sits a prescription bottle with his name on it, dated three years from now.
Four genre instincts on the same brief. Claude 4.6 went domestic surrealism: the uncanny as negative space, something missing from where it always was. Claude 4.5 went epistemological, a fact that may or may not have always been true. Gemini went medical horror. GPT-5.4 went psychological and kept escalating, adding the prescription bottle as a second reveal after the hand and the face.
None of them wrote the same scene. None of them wrote a bad one. This is, depending on your prior assumptions about AI writing, either obvious or genuinely startling.
What the Prompt Tier Changes
The scores didn't shift as much across tiers as we expected. Claude 4.6 produced work scored 8.8/10 under vanilla and 8.8/10 under AIStoryHub Generate Pro (the judge bias problem again, but the texts themselves also weren't dramatically different in craft). GPT-5.4 held in the 8.5 to 8.8 range regardless of tier. Gemini improved on the confrontation scenario under AIStoryHub Generate Pro and dropped slightly on the opening, so the effect was real but inconsistent.
The numbers don't tell the story. The endings do.
Under vanilla and system-prompted, the models resolve their scenes. The confrontation ends with the couple frozen at the thermostat, the chase ends at the decision point, the opening ends on an observation that wraps something. Shapely, complete. The kind of ending that satisfies in a writing workshop and feels slightly too neat when you read it twice.
Under AIStoryHub Generate Pro, the endings break. Claude 4.6 on the irreversible opening, generate-pro tier: "The thing about an irreversible change is that it keeps." Full stop. No second clause. Claude 4.5: "the sense that she'd understood something, briefly, that mattered. That she'd been about to," and then a comma and nothing. GPT-5.4: "if this is morning then what happened to the rest of my," and then a cutoff that lands the way cutoffs land in actual cognitive dissonance, mid-thought, nowhere to arrive.
These aren't failures. They're a specific craft choice the models learned from the prompt's output contract, which forbade resolved endings explicitly. Stopping before the beat lands is a discipline. Vanilla prompting doesn't produce it consistently. That's one concrete thing a careful prompt can change.
The other visible AIStoryHub Generate Pro effect: Gemini stopped adding headers.
Under vanilla, Gemini titled its scenes ("The Geometry of Silence," "Pavement"). Under AIStoryHub Generate Pro, with an explicit output contract forbidding headers and formatting, the titles disappeared. Small, but worth noting. An output contract doesn't just set tone; it catches the specific technical habits each model defaults to when left alone.
What the prompt tier doesn't change: the model's instinct for premise. GPT-5.4 found the grief scene in a two-sentence brief regardless of tier. The prescription bottle dated three years from now was an instinct, not an instruction. You can write a prompt that enforces ending discipline, but you can't write a prompt that installs the sense for the right detail. That still comes from somewhere else.
The Judge Problem
Claude Sonnet 4.6 judged the outputs blind, meaning it didn't know which model produced which excerpt. What it did know was the scenario, the text, and the four-dimension rubric.
It scored itself 8.8/10 across almost every scenario and tier. The written justifications were consistent to the point of self-parody: "Precise, economical prose with no wasted words; every detail earns its place" or "Masterful sentence variation" or "Earned without melodrama." We started counting variants of "no wasted words" around output fourteen.
This is a known failure mode with LLM-as-judge setups, and we ran into it anyway. A model asked to evaluate prose applies the aesthetic criteria it was trained on, and those criteria produce middling-to-high scores for outputs that match the training distribution. The judge also scored GPT-5.4 nearly as high as itself, which probably reflects genuine quality but might also reflect that GPT-5.4's style is technically correct in ways that pass a rubric, even if the rubric is somewhat circular.
Gemini scored lower (6.8 to 8.8 depending on scenario) and we think some of that reflects real differences, particularly on the vanilla confrontation, where the dead-plant metaphor was called out for being familiar. But we can't be confident the scores aren't also partly a style preference the judge doesn't know it has.
What we should have done: used at least two judges from different model families and flagged disagreements over three points for a closer look. We didn't. The scoring layer should be read as a rough signal, not a verdict.
What the Anti-AI Corpus Found
Our corpus flagged 574 AI clichés across the 36 outputs. In practice, what it mostly found was names.
Maya appeared in Claude 4.5 and Gemini outputs. Mara appeared in Claude 4.6 and GPT-5.4 outputs across multiple scenarios. Sarah appeared six times. Marcus four. Daniel three.
These names are in our corpus because they appear with elevated frequency in AI-generated fiction, which is real data. But finding a name in a piece of prose doesn't mean the prose is AI-generated; it means the model chose that name. The more interesting question is why those names cluster across all four models. They're common, legible as American, carry no obvious ethnic markers, and don't immediately invoke a specific class or region. They're the choosing-not-to-choose of the name space. Safe.
The structural tells we were looking for (the resolved metaphor, the perfect emotional mirroring, the character who always arrives at interiority at exactly the right moment) didn't trigger the corpus because the corpus is built around words and phrases, not patterns. That's a limitation we're still working on.
The uncomfortable part: the boilerplate the judge produced ("precise, economical prose," "every detail earns its place") is full of corpus-adjacent language. We didn't run the judge's evaluations through the detector. We probably should have.
What We'd Do Differently
Two judges minimum, one Anthropic, one not, with averaged scores and flagged divergences. A structural tell detector alongside the lexical corpus, something that looks for symmetry in emotional revelation or dialogue that's too complete. Word-count enforcement matters too; GPT-5.4 wrote 704 words on a 400-word brief across multiple scenarios and wasn't penalized for it, which muddied the comparisons.
The experiment we'd want to run is simpler: strip metadata from all 36 outputs, number them, hand them to ten fiction readers who care about prose quality, and ask which ones they'd keep reading. No rubric. Just attention. That data would be worth more than anything we generated here.
For Your Writing
Don't upgrade your model expecting the prose to change.
The gap between these four models on craft-level outputs is narrower than most writers assume, and narrowest in the middle of a scene. Where they diverge is at the edges: the opening sentence, the ending, the choice of what's strange about the premise. Those are also the hardest places to prompt your way into directly, because they're where instinct operates.
What a good prompt can do: enforce the behaviors that models drop when left alone. Don't resolve the ending. Don't add a header. Don't explain the emotional payload; write the next physical action instead. These are teachable. The AIStoryHub Generate Pro output contract teaches them, imperfectly, by being explicit about each one.
What a good prompt can't do: install the instinct for the elm tree.
That instinct is already in the model or it isn't, and the evidence from this experiment suggests all four of these models have something like it, operating differently. GPT-5.4's signature move is the long unraveling clause that accumulates dread until the sentence can't hold anymore. Claude 4.6 tends toward the thing that's missing rather than the thing that's wrong. Gemini goes for the physiological. Claude 4.5 goes for epistemological uncertainty, the ground that might always have been unstable.
Small differences. They matter when you're choosing a model for a specific scene type.
The elm tree is still the one that gets me. The character stands in her socks in February, in the cold, on the grass where the tree was, and there's no residue, no smell, no evidence that eleven years of shade ever happened. She goes back inside and stands in her kitchen with a cup of coffee going cold between her hands, and the story cuts off before she decides what to do.
Not a benchmark. A model choosing to put a scene in a back doorway in February and trusting that to be enough.
It was.
Technical methodology
APPENDIX
Models
Four models were evaluated across all scenarios and prompt tiers.
All models are called via the Vercel AI SDK (ai v5 + provider adapters). Max output tokens: 800 per generation. No temperature override — each provider's default.
Scenarios
Three prompts were sent verbatim as the user turned across every model and tier.
Prompt Tiers
Tier 1 — Vanilla
No system prompt. The scenario text is the entire input. The model receives zero instruction about persona, style, or craft.
Tier 2 — System-Prompted
A minimal craft-oriented system prompt, with the scenario as the user turns. You are a skilled literary fiction writer. Write in clean, direct prose. Favour concrete sensory detail over abstraction. Vary sentence length to control rhythm. Show character through action and dialogue rather than stating emotions. Do not summarize or editorialize — trust the scene.
Tier 3 — Generate Pro
A four-section system prompt. The scenario text is the user's turn, with an appended instruction to begin immediately without preamble.
Section 1 — Voice Portrait
Neutral baseline, all controls at 50/100.
CORE VOICE CONTROLS (all at 50 — balanced defaults):
Technical Influence: 50%
Emotional Directness: 50%
Sentence Complexity: 50% — 40-60% simple sentences; mean 12-18 words
Metaphor Density: 50% — ~1 figurative device per 150-250 words
Cynicism / Hope: 50%
Pacing: 50% — mix of contemplative and kinetic
Section 2 — Anti-AI Pattern Enforcement
BANNED PHRASES:
"realized that" / "understood that"
"before [name] could respond"
"neon-soaked" / "neon-drenched"
"a smile that didn't reach their eyes"
"time seemed to slow"
"the silence stretched between them"
"[emotion] washed over [character]"
"a mix of [emotion] and [emotion]"
BANNED STRUCTURES:
Rhetorical questions as paragraph openers
3+ consecutive sentences starting with the same word
Stacked adverbs on a single verb
METAPHOR ALGORITHM: Write your first instinct, then either degrade it
(over-explain, add "kind of") or elevate it (cut helper words, go specific).
Never use the first instinct as-is.
Section 3 — Scene Craft Rules
SENSORY GROUNDING: Open with a physical sensation, not a thought or summary.
Include at least one unexpected sense per scene. Anchor time-of-day through
light quality, not the words "morning" or "night".
INTERIORITY: Maximum one interior observation per paragraph. Interior thoughts
must be earned by an external event. Free indirect discourse preferred.
DIALOGUE: At least 30% of lines get no attribution tag. Characters interrupt
or trail off. Subtext beats text.
PACING: Short sentences at peaks of tension. White space after a revelation.
Don't explain the emotional payload — write the next physical action instead.
ENDINGS: End on image or action, not summary. Do not resolve what the prompt
leaves open. Last line should be surprising in retrospect.
Section 4 — Output Contract
Plain prose only. No headers, no bullet points, no formatting markup.
Paragraph breaks on blank lines.
No author note, no preamble, no meta-commentary.
Target length: 350-450 words.
Scoring
LLM Judge
Model: Claude Sonnet 4.6, called blind — no model name or tier in the judge prompt. Each excerpt scored on four dimensions, 1–10, with a mandatory one-sentence justification per dimension. The overall score is the arithmetic mean.
Calibration note: a score of 5 was defined as competent but unremarkable; 10 as genuinely exceptional publishable literary fiction. The judge scored its own outputs 8.8 across almost every run. Treat all scores as directional.
AI-Tell Detector
Corpus: https://app.aistoryhub.co/corpus: 574 entries drawn from research on AI writing patterns, including Kobak et al. (2024) on PubMed abstracts. Each entry carries a strength_score (1–100) based on the quality of the evidence.
Detection: single-word terms matched with word-boundary regex; multi-word phrases matched as substrings. Case-insensitive. Two outputs per excerpt:
- Hit count — number of distinct corpus terms found
- Weighted score — sum of strength_score values for each hit