100 Pages of Planning Before Any Code
The Slop Detector shipped. MVP was live. People could use it.
Time to build the real thing.
I’m not a traditional developer. I’m learning this as I build. Claude Code writes most of the code; I provide the taste and direction. The whole thesis of this site is that good taste might matter more than coding ability when you have AI assistance. So far, that thesis is holding.
For V2, I wanted to test a structured workflow instead of continuing to vibe code my way through. I’d heard about the BMAD Method, a framework for planning software with AI assistance. I opened two terminals: one for minor fixes and shaping, one for planning.
The Planning Arc
Seven sessions. About a hundred pages of planning documents before touching V2 implementation.
(A session: one Claude Code context window. About 150,000 words of conversation, roughly 2-3 hours of focused work before the AI’s memory fills up and you start fresh.)
What made BMAD different from just writing docs: you call in agents. They’re personas with distinct perspectives (more on personas in a later post). Mary the Business Analyst asks different questions than Winston the Architect. When you need to stress-test your thinking, you can run Advanced Elicitation, which offers 36 methods: Tree of Thoughts, Red Team vs Blue Team, Stakeholder Round Table, Socratic Questioning, Pre-mortem Analysis, etc.
For the UI epic, I ran a Stakeholder Round Table. Three personas showed up: the Skeptical Engineer who wants to see every signal that fired. The Content Creator who doesn’t want to feel judged when they use AI assistance. The Verifier checking someone else’s work who needs disclaimers visible.
They argued about language. Should obvious AI text be labeled “Obvious Slop” or “Classic Slop”? The Engineer didn’t care about naming. The Creator hated the word “Slop” entirely. The Verifier wanted clear categories regardless of how they felt.
That argument added a backlog story I wouldn’t have thought of on my own: a language audit to make the tool analytical rather than accusatory. “Classic Slop” instead of “Obvious Slop.” Analysis, not detection. Small word choices that change how the tool feels to use.
The planning sessions produced: research on detection methods (V1 only covered 40% of known signals), a product requirements doc with fifteen specs, an architecture with eight key decisions, and twenty user stories across five epics. But the documents weren’t the point. The conversations that produced them were.
Building the Corpus
The planning docs called for a test corpus: known samples with expected scores so I could measure whether detection actually worked. Building it took another five sessions.
First lesson: Claude’s web search returns summaries. When I asked for Project Gutenberg content, I got 500-word synopses instead of actual prose. I needed the real text to test detection against. I switched to Bright Data and scraped the pages directly. Full text, not summaries.
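The gist, as a minimal sketch (this skips the Bright Data setup entirely and just pulls a plain-text file straight from Gutenberg; the URL pattern and book ID are illustrative):

```python
# Minimal sketch: fetch full prose from Project Gutenberg's plain-text mirror
# instead of relying on a search tool's summary. Gutenberg serves several
# formats per book, so the exact URL pattern may vary.
import requests

def fetch_gutenberg_text(book_id: int) -> str:
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Full text, including the Gutenberg header/footer boilerplate to strip later.
    return resp.text

if __name__ == "__main__":
    sample = fetch_gutenberg_text(1342)  # 1342 = Pride and Prejudice
    print(len(sample.split()), "words fetched")
```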
Second lesson: provenance matters. I needed human-written samples I could prove were human-written. Solution: pre-2022 content, before the AI writing boom. I used the Wayback Machine to grab December 2021 snapshots of Wikipedia featured articles. Moon. Python programming language. Content that predates ChatGPT.
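The Wayback Machine has an availability API that returns the closest archived capture to a given timestamp. A rough sketch of the lookup (what you actually get back depends on which captures exist):

```python
# Sketch: ask the Wayback Machine for the capture closest to December 2021,
# i.e. before the AI writing boom. Returns a web.archive.org URL or None.
import requests

def wayback_snapshot(page_url: str, timestamp: str = "20211215") -> str | None:
    api = "https://archive.org/wayback/available"
    resp = requests.get(api, params={"url": page_url, "timestamp": timestamp}, timeout=30)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

if __name__ == "__main__":
    print(wayback_snapshot("https://en.wikipedia.org/wiki/Moon"))
```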
Third lesson: real polished-AI beats synthetic edits. The original plan was to take obvious slop and manually clean it up to simulate “good AI writing.” Better plan: use actual high-quality AI writing with real provenance. I scraped Stunspot’s Nova-written Medium articles. He’s a prompting expert who openly publishes AI-generated content, and it’s genuinely good. I added my own edited blog posts. Real samples, known provenance.
Fourth lesson: not everything works. Reddit via Wayback failed. Specific WritingPrompts posts weren’t archived. I pivoted to more Gutenberg classics instead of forcing it.
Final corpus: 45 samples. 25 obvious slop (ChatGPT outputs, Reddit bot content). 10 human-written (Gutenberg, pre-2022 Wikipedia). 10 polished-AI (Stunspot’s articles, my blog posts).
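On disk, each sample only needs three things: a file, a category, and an expected score range. A hypothetical manifest entry looks something like this (file names and ranges are illustrative, not the real corpus values):

```python
# Hypothetical corpus manifest; paths, categories, and expected score ranges
# are illustrative placeholders, not the real corpus.
CORPUS = [
    {"file": "corpus/slop/chatgpt_listicle_01.txt", "category": "obvious_slop", "expected": (70, 100)},
    {"file": "corpus/human/gutenberg_sample_01.txt", "category": "human", "expected": (0, 30)},
    {"file": "corpus/polished/stunspot_article_03.txt", "category": "polished_ai", "expected": (40, 80)},
]
```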
The Baseline
I built a regression test runner. Every sample has an expected score range. The runner checks all 45 samples, reports which passed, calculates accuracy by category.
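A minimal sketch of what such a runner does, assuming a manifest like the one above and a `score_text` function standing in for the detection engine (both names are placeholders, not the real code):

```python
# Sketch of a regression runner: score every sample, check it lands in its
# expected range, and report per-category accuracy.
from collections import defaultdict
from pathlib import Path

def run_regression(corpus, score_text):
    passed, total = defaultdict(int), defaultdict(int)
    for sample in corpus:
        text = Path(sample["file"]).read_text()
        low, high = sample["expected"]
        total[sample["category"]] += 1
        if low <= score_text(text) <= high:
            passed[sample["category"]] += 1
    for cat in total:
        pct = 100 * passed[cat] / total[cat]
        print(f"{cat}: {passed[cat]}/{total[cat]} ({pct:.0f}%)")
```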
I ran V1 against all 45 samples.
0% accuracy. Every category failed.
Which is fine, actually. Before the corpus existed, I had vibes. “The detection seems off.” “This should score higher.” But I couldn’t actually prove any of it, and I definitely couldn’t tell you whether a code change made things better or just different.
Now I’ve got 45 samples with known answers and a test that runs automatically. The 0% baseline means V1 is broken, but broken in a way I can point at. When V2 ships and that number climbs, I’ll know by how much.
What I Noticed
Planning felt like building. The requirements doc, the architecture, the stories—they’re not notes about what to build later. They’re already shaping it. When I sit down to implement V2, I’m not going to be staring at a blank terminal wondering where to start. I already know.
The agents made it fun. I expected planning to be a chore. Having personas argue about my product was genuinely entertaining. The Skeptical Engineer and the Content Creator have different values. Watching them clash surfaced tradeoffs I wouldn’t have seen alone.
Research changed the scope. Without the research session, the requirements would have been “make V1 better.” With it, I learned V1 was missing entire categories of signals: burstiness, vocabulary metrics, sentence variance. The PRD specified exactly what “better” meant.
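For a sense of what those signals are, here are rough versions of two of them, using common definitions rather than whatever the PRD specifies exactly: burstiness as sentence-length variation, vocabulary richness as type-token ratio.

```python
# Rough illustrations of two signal families, not the V2 implementation.
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]

def burstiness(text: str) -> float:
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    # High variation relative to the mean reads human; uniform lengths read AI.
    return statistics.stdev(lengths) / statistics.mean(lengths)

def type_token_ratio(text: str) -> float:
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0
```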
Corpus building is its own skill. Finding samples, proving provenance, handling tools that return summaries instead of content. It’s not glamorous but it’s real work with real learnings.
What’s Next
Epic 2: make the detection engine actually work. The corpus will tell me when I’ve succeeded.
Seven sessions of planning, five sessions of corpus building. Twelve sessions before writing V2 code. The full artifacts (PRD, architecture, epics, test corpus) are in the repo under docs/.
This post went through eight drafts. BMAD Party Mode helped with structure, but the real work was hunting down AI tells—mechanical parallel constructions like “not X, it’s Y,” dramatic reveals, hyperbolic transitions. Even after all that, it still reads a little AI to me. Maybe I’m too close to it now. But the goal is indistinguishable, and I’m not there yet. A voice module is climbing up the priority list.