Behind the scenes — how it works and how it was built
This application was built through an extended collaboration with Claude — first via Claude.ai on desktop, then through Claude Code CLI as the tooling matured. The project began with a question about whether the story generation idea was even a good fit for an agent demo, and grew into a live multi-agent system deployed on Google Cloud Run.
The honest split of roles:
Every meaningful decision in this project came from the human side. Whether this was a good agent use case. The three-agent separation of concerns. Using a dedicated API key per project. Keeping prompts outside code. The human-in-the-loop checkpoint. The world type concept and its four definitions. The original vs revised story framing. Pushing back when something felt wrong. Asking why. Knowing when to stop adding features. These are not small things — they are the project. The code is just the expression of those decisions.
The user experience was shaped entirely from the human side too — the three-column layout, hiding the facts textarea until the agent populates it, moving world type from an informational panel into an active input, removing the creativity slider when it became redundant with world type, the progressive reveal of agent controls after approval, collapsing setup cards once facts are generated to keep the interface clean, the dark About card as an elevator pitch, the punchy tagline. Every one of these was observed, questioned, and decided during the conversation — not generated.
Testing was also entirely human-driven. Every agent interaction was manually exercised across world types, contradiction levels, and story lengths. Edge cases — the Critic rejecting all three attempts, empty fact lists, very long generations — were caught and resolved through hands-on use. A unit test suite (40 tests, 7 classes) was added as the codebase matured, covering configuration, request models, result classes, prompt templates, and the zero-fact short-circuit. The decision to add tests, and the decision to refactor the flat package structure into layered sub-packages, both came from the human side — as did the direction to extract shared agent infrastructure into a base class once the duplication became visible.
Claude wrote all the Java, HTML, CSS, and JavaScript. It set up the Spring Boot structure, wired the agents together, handled the OkHttp calls, built the Mermaid diagram, generated the Dockerfile and deployment scripts, and kept the growing codebase consistent across dozens of iterations. It also pushed back where it had a view — on prompt design, agent temperature choices, and what "genuinely agentic" means versus a dressed-up prompt chain.
Was this vibe coding? Partially, but not entirely. Vibe coding in its purest form is "describe what you want, accept what you get, don't worry about understanding it." That's not what happened here. The architectural decisions were deliberate and reasoned. The human understood what each agent was doing and why. When something broke, the cause was diagnosed — not just retried. The better description is AI-assisted development with genuine human judgment driving the architecture. The speed at which a working multi-agent system went from idea to production is very much the vibe coding energy. The depth of the decisions behind it is not.
The parts Claude could not contribute: knowing what problem to solve, deciding what made a demo compelling, choosing when the Critic loop was the right thing to show an audience, recognising that "append-only facts" needed rethinking when the Fact Generator arrived. The parts a human could not contribute at this speed: the Spring Boot wiring, the OkHttp timeout configuration, the Mermaid diagram syntax, the CSS grid layout, all of it produced in seconds. The collaboration worked because both sides stayed in their lane.
Five agents working in sequence, with a human checkpoint after fact generation and a Critic feedback loop before writing:
Human-in-the-loop: The pipeline pauses after the Fact Generator. The human can edit, reorder, add, or regenerate facts freely. Only after explicit approval does Planner → Critic → Writer proceed. This is a deliberate architectural choice, not a UX afterthought.
The Fact Generator produces facts, but the human approves them before the pipeline proceeds. This is not just a UX choice — it demonstrates one of the most important agent patterns: knowing when to pause for human judgment rather than running fully autonomously.
A single control shapes both what facts are invented and how the story is told. This keeps the two agents coherent — an outlandish fact set paired with plain literary prose would feel wrong, and vice versa.
Once approved, facts become the baseline and cannot be removed — only added to. New facts must be reconciled with all previous ones. This mirrors how real environments work and is what makes the original vs revised comparison meaningful.
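The append-only constraint can be sketched as a small data structure. This is an illustrative sketch only; the class and method names are assumptions, not the project's actual code:

```java
// Sketch of the append-only fact model: once approved, the baseline
// list is frozen; new facts can only be appended alongside it.
import java.util.ArrayList;
import java.util.List;

public class AppendOnlyFacts {
    private final List<String> baseline;              // approved facts, immutable
    private final List<String> additions = new ArrayList<>();

    public AppendOnlyFacts(List<String> approvedFacts) {
        this.baseline = List.copyOf(approvedFacts);   // defensive immutable copy
    }

    public void addFact(String fact) {                // the only permitted mutation
        additions.add(fact);
    }

    public List<String> allFacts() {                  // baseline first, then additions
        List<String> all = new ArrayList<>(baseline);
        all.addAll(additions);
        return List.copyOf(all);
    }
}
```

Because the baseline is copied into an immutable list, no later step can silently delete an approved fact, which is what makes the original vs revised comparison trustworthy.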
A pipeline executes in sequence. An agent evaluates and decides its next step. The Critic's reject-and-replan cycle — with the rejection reason fed back into the Planner — is what crosses that line.
Fact Generator never plans. Planner never writes prose. Writer never judges feasibility. Critic never generates content. Clean separation makes each agent inspectable, testable, and independently tunable.
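The reject-and-replan cycle can be sketched as a small loop. The interfaces and names below are hypothetical stand-ins, not the project's actual classes, but the control flow matches the behaviour described above: the rejection reason is fed back into the Planner, and after three rejections the pipeline falls back to the last outline.

```java
// Hedged sketch of the Planner -> Critic feedback loop.
import java.util.Optional;

public class CriticLoop {
    public static final int MAX_ATTEMPTS = 3;

    // Stand-ins for the real agents: the Planner accepts an optional rejection
    // reason; the Critic returns a reason only when it rejects.
    interface Planner { String plan(Optional<String> rejectionReason); }
    interface Critic  { Optional<String> review(String outline); } // empty = APPROVED

    public static String run(Planner planner, Critic critic) {
        String outline = null;
        Optional<String> reason = Optional.empty();
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            outline = planner.plan(reason);        // rejection reason injected on retries
            reason = critic.review(outline);
            if (reason.isEmpty()) {
                return outline;                    // APPROVED: hand off to the Writer
            }
        }
        return outline;                            // fall back to the last outline
    }
}
```

The decision point inside the loop, rather than a fixed hand-off, is what distinguishes this from a plain prompt chain.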
Model, temperature, and all prompt wording are in files — not hardcoded. Experiment freely without recompiling.
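Externalised prompts can be as simple as template files with named placeholders. The file layout and the `{{key}}` placeholder syntax here are assumptions for illustration, not necessarily the project's conventions:

```java
// Minimal sketch of loading a prompt template from a file and filling
// placeholders at runtime, so prompt wording changes need no recompile.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class PromptTemplates {
    // Replace {{key}} placeholders in a template with the supplied values.
    public static String fill(String template, Map<String, String> values) {
        String out = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }

    public static String load(Path promptFile, Map<String, String> values) throws Exception {
        return fill(Files.readString(promptFile), values);
    }
}
```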
Rather than exposing a separate creativity slider, creativity is derived automatically from the world type the user already picked. Fewer controls, more coherent output — grounded → low creativity, outlandish → high creativity.
The codebase started with all classes in a single flat package — acceptable early on, but increasingly hard to navigate as the agent count grew. Rajiv directed a structural refactoring into five sub-packages: config, model, agent, result, and web. At the same time, duplicated infrastructure across all five agents (OkHttp client setup, prompt loading, Claude API call, cost calculation) was extracted into a shared BaseAgent class, and the four common fields on every result class were pulled into a shared AgentResult base. Zero behaviour change — purely structural, with all 40 unit tests passing before and after.
Fact Generator: Invents life facts shaped by contradiction level, world type, and a requested fact count. Creativity is derived automatically from world type. Output is reviewed and optionally edited by a human before the pipeline proceeds.
Planner: Reads the world model (start, end, facts) and produces a structured 5–7 milestone outline. On retry attempts, the Critic's rejection reason is injected so the Planner knows exactly what to fix.
Critic: Evaluates the Planner's outline for timeline plausibility, fact consistency, and logical gaps. Returns APPROVED or REJECTED with a specific reason. Max 3 attempts before falling back to the last outline.
Writer: Takes the approved outline and generates full prose. Writing style is shaped by world type — plain literary fiction for grounded, epic narrative for outlandish. Used for both the original and revised story.
Explainer: Runs automatically after the revised story is generated. Compares both stories side by side, identifies what changed, and explains exactly which new facts caused each divergence. Cost and time tracked independently from the main pipeline.
A single control that flows through both the Fact Generator and the Writer, keeping the invented facts and the prose style coherent with each other:
Grounded: Real world, real rules. No heightened elements.
Heightened: Heightened but believable. Extraordinary within reality.
One impossible thing: One impossible element in an otherwise real world.
Outlandish: A completely different universe with its own rules.
Creativity level (low / medium / high) is derived automatically from the world type — it is passed to the Fact Generator but is not a user-facing control.
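The derivation above is a simple mapping. The grounded → low and outlandish → high ends are stated in this document; the assignment of the two middle world types to medium is an assumption for this sketch:

```java
// Sketch of deriving the creativity level from the chosen world type,
// replacing a separate user-facing creativity slider.
public class CreativityMapping {
    public enum WorldType { GROUNDED, HEIGHTENED, ONE_IMPOSSIBLE_THING, OUTLANDISH }
    public enum Creativity { LOW, MEDIUM, HIGH }

    public static Creativity derive(WorldType worldType) {
        switch (worldType) {
            case GROUNDED:             return Creativity.LOW;     // real world, real rules
            case HEIGHTENED:
            case ONE_IMPOSSIBLE_THING: return Creativity.MEDIUM;  // assumed middle ground
            default:                   return Creativity.HIGH;    // OUTLANDISH
        }
    }
}
```

One input driving two agents is what keeps the fact set and the prose style coherent with each other.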
The application runs on Google Cloud Run — containerised, serverless, and auto-scaling to zero when idle. A single deploy script handles everything from build to live.
| Component | Details |
|---|---|
| Platform | Google Cloud Run — us-central1 region |
| Container | Docker, eclipse-temurin:17-jre-focal base image |
| Registry | Google Container Registry (gcr.io) |
| API key | Google Secret Manager — secret named ANTHROPIC_API_KEY, injected as STORY_OF_LIFETIME_ANTHROPIC_API_KEY |
| Deploy | ./deploy.sh — one-command build and deploy |
| Read timeout | 120 seconds — accommodates Claude's longest generation times |
Cloud Run scales to zero when idle and spins up within seconds on first request. The 120-second OkHttp read timeout is deliberate — Claude can take 60–90 seconds for long story generations and Cloud Run must not recycle the container mid-request.
One-time IAM setup grants the default compute service account access to the Secret Manager secret. After that, every ./deploy.sh run rebuilds the image, pushes it, and deploys the new revision in a single command.
Every change follows a feature branch → PR → review → merge cycle. No commits go directly to main.
| Step | Who | Action |
|---|---|---|
| 1. Branch | Claude | Creates a feature/<name> branch for the task |
| 2. Implement | Claude | Makes all changes on the branch, commits with co-authorship line |
| 3. PR | Claude | Opens a pull request via gh pr create with summary and test plan |
| 4. Review | Rajiv | Checks out the branch, runs the app or tests, and reviews the diff |
| 5. Merge | Claude | Merges once approved via gh pr merge |
| 6. Cleanup | Claude | Checks out main, pulls, deletes the local branch |
GitHub is configured to auto-delete the remote branch on merge. Commits include a Co-Authored-By: Claude Sonnet 4.6 trailer to attribute AI contribution in the git log.
The gh CLI is authenticated and used for all GitHub operations — PR creation, merge, and status checks.
The project has two complementary layers of testing — automated unit tests covering the Java layer, and manual end-to-end testing through the live UI.
JUnit 5 via spring-boot-starter-test. All tests run without an API key — they test only local logic, not live Claude calls. The factCount=0 short-circuit in FactGeneratorAgent makes it possible to exercise the agent class itself without any network call.
Coverage: AppConfig defaults and story length parsing · WorldModel constructor and getters · CriticResult APPROVED/REJECTED decision logic · FactGenerateRequest field defaults and setters · all five agent result classes · every prompt template file (exists + all required placeholders present) · FactGeneratorAgent zero-count short-circuit.
Every agent interaction is exercised against the live Claude API across world types, contradiction levels, and story lengths. Edge cases — the Critic rejecting all three attempts, empty fact lists, very long generations, the 0-fact path — are caught through hands-on use. The feedback loop between running the app and directing the next change is where most of the product quality came from.
Run unit tests locally with mvn test. No API key required.
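The zero-fact short-circuit pattern looks roughly like this. The class and method names are assumptions, but the principle is the one described above: when factCount is 0, the agent returns before any network call, so the class itself can be exercised in a unit test with no API key.

```java
// Illustration of the factCount=0 short-circuit that makes the agent
// class unit-testable offline.
import java.util.List;

public class ShortCircuitSketch {
    public static List<String> generateFacts(int factCount) {
        if (factCount == 0) {
            return List.of();   // short-circuit: no Claude call needed
        }
        // In the real agent, the live Claude API call would happen here.
        throw new IllegalStateException("live API path not exercised in unit tests");
    }
}
```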
| Method | Path | What it does |
|---|---|---|
| POST | /api/generate-facts | Runs FactGeneratorAgent. Returns fact list for human review. |
| POST | /api/generate | Runs Planner → Critic loop → Writer. Returns outline, story, critic decisions, per-agent cost. |
| POST | /api/explain | Runs Explainer on original vs revised. Returns diff analysis and cost. |
| GET | / | Single-page application (index.html) |
| GET | /architecture.html | This document |
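A client call to the first endpoint might look like the sketch below. The JSON field names are assumptions inferred from the request model described earlier, not a documented contract:

```java
// Hypothetical client for POST /api/generate-facts using java.net.http.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FactsClient {
    // Build the JSON request body (field names are assumed).
    public static String requestBody(String worldType, int contradictionLevel, int factCount) {
        return String.format(
            "{\"worldType\":\"%s\",\"contradictionLevel\":%d,\"factCount\":%d}",
            worldType, contradictionLevel, factCount);
    }

    // POST the body and return the raw JSON response.
    public static String post(String baseUrl, String body) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/api/generate-facts"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        return HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString())
            .body();
    }
}
```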
Token usage and cost are tracked independently per agent on every request. Pricing as of 2025:
| Model | Input per 1M tokens | Output per 1M tokens |
|---|---|---|
| claude-opus-4-6 | $15.00 | $75.00 |
| claude-sonnet-4-6 | $3.00 | $15.00 |
| claude-haiku-4-5-20251001 | $0.25 | $1.25 |
A typical medium-length generation (FactGen + Planner + Critic + Writer) costs roughly $0.05–$0.10 with default settings. Running the revised story + Explainer adds approximately $0.01–$0.03.
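The per-agent cost arithmetic implied by the pricing table is straightforward: tokens divided by one million, multiplied by the per-million rate. A minimal sketch:

```java
// Cost calculation from the per-1M-token rates in the pricing table.
public class CostCalc {
    public static double cost(long inputTokens, long outputTokens,
                              double inputPerMillion, double outputPerMillion) {
        return inputTokens  / 1_000_000.0 * inputPerMillion
             + outputTokens / 1_000_000.0 * outputPerMillion;
    }
}
```

For example, a Sonnet call with 2,000 input and 1,500 output tokens costs 0.002 × $3.00 + 0.0015 × $15.00 = $0.0285, consistent with the roughly $0.05–$0.10 total for a full pipeline run across several agents.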
| Layer | Technology |
|---|---|
| Backend | Java 17, Spring Boot 3.2, Maven |
| Frontend | Single-page HTML/CSS/JS — no framework, no build step |
| LLM API | Anthropic Claude — configurable model per agent |
| HTTP client | OkHttp 4.12 (30s connect/write, 120s read timeout) |
| JSON | Jackson Databind |
| Hosting | Google Cloud Run — containerised, serverless, auto-scaling |
| Secrets | Google Secret Manager — API key injected as env variable |
| Container | Docker, eclipse-temurin:17-jre-focal base image |