An Operating Console for the Running Game: Six Weeks of MCP
StarGazer now hosts a Model Context Protocol server inside the game process. Twenty tools across read, mutate, capture, commit. Every screenshot in this post was captured by the MCP driving itself.

An Operating Console for the Running Game: Six Weeks of MCP
Dev Log
- The cold-launch tax died.
--screenshot path --at-tick Ntook ~10 seconds per shot — build, launch, voxel-cook, render, save, exit. The newframe.captureruns against a living game in 23 milliseconds. That's the load-bearing rewrite, and every other tool in this post followed from it. - 20 tools across foundation / capture / mutate / commit / replay. They're not a wishlist — every one of them has at least one smoke-test caller. Most have a skill consumer landed or scheduled.
- One protocol, no inventions. Standard Model Context Protocol over stdio, served by the official
ModelContextProtocol.NET SDK 1.3.0. Claude CLI spawnsStarGazer --mcpas a subprocess. No custom JSON-RPC framing, no in-process registry to maintain. - Compile-gated three ways.
#if STARGAZER_MCP, a newDebugMcpcsproj configuration, and a--mcpCLI flag. Release builds contain zero MCP code. CI greps the publish artifact to enforce. - Three-label determinism contract. Every mutation is tagged
sim_observedandside_channel. Lighting tweaks don't flag replay-dirty; weapon damage edits do. Honest labels are what makes the audit trail trustworthy. - Every screenshot you're about to see was driven by the MCP. Single
dotnet StarGazer.dll --mcpprocess, one Python script piping JSON-RPC into stdin, captures stamped sub-second each.
Baseline: one capture, sub-second
Captured by frame.capture at sim tick 492 against a running game. Capture latency is sub-frame (<33ms at 30fps); the same screenshot used to take ~10 seconds because it required a cold launch. The cost dropped two orders of magnitude.[^latency]
[^latency]: Latency measurements are from the developer machine; the artifacts checked into screenshots/2026-05-26/mcp-blog-demo/ are the outputs of the capture session, not the timing logs. The session.txt and state-snapshot.json files alongside the JPEGs record the session ID, game commit, and sim state at capture time.
Before, the visual-eval skill — the one that catches "did this rendering change actually look right" — burned a full cold launch per shot. Voxel-cook the fleet, allocate render targets, warp the ships in, render two frames, save, exit. Ten seconds of latency on a question that needs five iterations to answer.
Now the skill checks if MCP is available (scripts/mcp-available), and if it is, calls frame.capture against the existing process. The response carries the path, the dimensions, the sim tick the frame was captured at, and a server-side capture timer. The C1 hard gate was "less than one second"; observed dev-machine captures land well under that — comfortably inside a single 60Hz frame.
Sweeping a knob: lighting key intensity
The headline tool is frame.sweep. It takes one mutation key, a list of values, and an output directory; it applies each value in sequence, captures a frame, and reverts the original on the way out. Three shots from a single call:
lighting.key.intensity = 3.0 — warm key dimmed to half-default. Notice the cooler magenta-rim cast dominating the top of the hull.
lighting.key.intensity = 5.75 — the value Game1.LoadContent seeds at startup. The warm yellow on the hull dorsal returns to its tuned balance.
lighting.key.intensity = 8.0 — schema-clamp ceiling. Bright warm fill across the entire dorsal surface, picking up the engine plating that's flat in the dimmer pass.
Three captures, 128 ms total wall-clock for the whole sweep including the revert. The mutation is sim_observed: false, side_channel: true — lighting is a render-only uniform, the sim doesn't observe it, the determinism contract correctly says replay isn't dirty. The LightingMutator writes to both BasicEffect (the legacy ship-draw path) and ShipPbrEffect (the active PBR path) so whichever pipeline draws the ship today, the sweep affects it.
This is the workflow the persona reviews unanimously asked for. UI designers, creative directors, combat designers all converged on "show me three values side by side, let me commit the one I pick." The CLI flow couldn't deliver it — three sweeps means three cold launches. Six minutes to compare three numbers. The MCP variant is fast enough to be the default iteration loop, not a special tool.
A side-channel mutation that almost wasn't
While building this, the diff tool revealed a regression I'd missed: mcp.commit.diff reported tonemap mutations as not pending even after I'd applied them. The fix was a real bug — the rendering pipeline was hardcoding acesTonemap.Exposure = 1.0f every frame, clobbering whatever the MCP had written one tick earlier. After removing the reset, exposure mutations actually persist:
tonemap.exposure = 0.6. The deck floor reads darker, the warm star in the foreground compresses toward orange, the cyan running lights still pop but with less surrounding bloom.
tonemap.exposure = 1.5. The grid plane lifts noticeably, the magenta rim light on the dorsal becomes more present, the asteroid debris in the background catches more reflected fill.
What's interesting isn't the comparison — it's that the diff tool found the bug. I asked the MCP to show me what had changed in the session; it told me "nothing." I knew I'd just mutated three values. That contradiction surfaced the silent reset. Without the MCP audit-loop, that hardcoded 1.0f would have shipped invisibly until a designer noticed their tonemap tweaks didn't survive a frame.
Spawning a controlled matchup
spawn.matchup resets the simulation with a crafted SimulationConfig. scenario.force composes that with voxel.paint_damage to reach canonical encounter geometries from cold:
scenario.force {kind: "broadside_duel"} — two ships placed at [-30,0,0] and [30,0,0] with yaws facing each other, then sim.step 120 ticks to let combat unfold. Visible: cyan beam crossing center-frame, magenta-rim impact glow on the receiving ship's bow.
This unlocks a workflow that didn't exist before. A combat designer can:
spawn.matchupto set the seed + ship packs deterministicallycombat.telemetryfor the baseline aggregates over 600 ticksmutate.set weapon.beam.damageto a candidate value- Re-spawn with the same seed, run again, take new telemetry
- Diff the two aggregates programmatically
mcp.commit weapon.beam.damageif the candidate wins
The combat.telemetry tool returns honest data — its response includes a honest_warnings array that explicitly says "damage_dealt_is_beam_hit_count_proxy" and "tactic_phase_ticks_is_flip_count_not_dwell_ticks" so a balance designer can't mistake hit-counts for true damage points. That kind of self-documenting honesty was a persona-review demand: agents will trust labels they can't game, and stop trusting labels they catch lying.
Catching bugs visual-eval can't
Three tools (hud.assert_contrast, hud.assert_no_overlap, hud.assert_hit_targets) walk the HUD element registry and sample post-tonemap pixels. The smoke run for this blog post turned up 22 contrast failures and 1 overlap — real issues no screenshot-then-eyeball flow had caught:
{
"ui_path": "/hud/buttonpanel/screenshot",
"x": 1080, "y": 26, "w": 176, "h": 26,
"ratio": 1.389,
"fg": [21, 41, 71, 255],
"bg": [6, 6, 6, 255]
}
That's the "Screenshot" button's label — (21,41,71) on near-black (6,6,6). A 1.389 contrast ratio against the WCAG 4.5 floor. The persona-UI-designer review predicted exactly this category of bug: humans skim past low-contrast text, the asserter samples the pixels and surfaces it. Now it's a tool I can re-run in 74 ms during any UI iteration.
State queries that replace log-tailing
{
"sim_meta": {
"tick": 492, "paused": false, "seed": 1,
"total_ships": 2, "total_beams_active": 0,
"combat_enabled": true, "weapon_range": 120
},
"ships": [
{
"id": 0,
"pack": "voxel:spaceship:0xB705399BB32CE1E4",
"pose": { "pos": [0,0,0], "fwd": [0,0,1], "vel": [0,0,0], ... },
"hp_ratio": 0.9954,
"weapon": { "cooldown_ticks": 0, "beam_active": false }
},
...
]
}
That's state.query {scopes:["sim_meta","ships"]}. 35 ms. Before this, the same information meant grepping .run/stargazer.log for FLEET_LOAD / BEAM_FIRED / etc., then mentally reconstructing the world state from log lines. Now it's a structured object.
The query supports nine scopes — sim_meta | ships | voxels | beams | ai | render_pipeline | shader_uniforms | perf | log_tail — each a slice of game state. Skills can pull what they need and ignore the rest. A bug repro can ship as {tick, seed, state.query output} instead of a vague "I think the AI was orbiting weirdly."
Invariants that don't lie
{
"checks": [
{
"name": "determinism", "passed": true,
"details": "baseline=8010EE1396BEB86C live_hashes=[ship0=0xBFE6788A5D347DF2,ship1=0xD11570367C7E9913]"
},
{ "name": "voxel_topology", "passed": true },
{ "name": "physics_finite", "passed": true },
{ "name": "no_nan_transforms", "passed": true },
{ "name": "ai_state_well_formed", "passed": true }
],
"all_passed": true
}
invariants.check runs five sanity checks against the live sim: voxel topology consistency, NaN-free transforms, AI state well-formedness, determinism baseline alignment against .last-hash. It's forced-couple with replay.export — exporting a replay refuses to write the file if any invariant fails, unless the caller passes accept_dirty: true. That refusal is what makes a replay a regression test instead of a snapshot of broken state.
What it cost to build, and what it unlocks
Five commits collapsing what the build plan called "six weeks" — a deliberately conservative pace that admits the dispatcher, the determinism contract, and the source-write semantics each need real thought, not just code. Weeks one and two landed in a single commit (the screenshot-refactor prereq folded cleanly into the foundation drop); weeks three through five each got their own. Every commit was preceded by a smoke test and followed by a CI gate. The repo now contains:
- 20 MCP tools under
game/Mcp/Tools/ - A phase-aware dispatcher with LIFO inverse-mutation log for exception-safe revert (
McpDispatcher.RunMultiPhase+IPhaseContext.Hop+OnRevert) - Three audit logs at
.run/mcp-{calls,mutations,commits}.log— append+flush per line, mutation_seq monotonic counters - Schema-driven mutation namespace at
game/Mcp/Schema/v1.json— 48 keys with determinism labels, source paths, clamps - A migrated
stargazer-visual-evalskill that uses MCP when available and CLI as fallback - A CI publish-grep gate that fails the build if MCP code leaks into Release artifacts
DOC/MCP-SERVER-PLAN.md,DOC/MCP-SKILLS-PLAN.md,DOC/MCP-REPLAY-FORMAT.md— the design history with every persona review, every critique pass, every revision the proposal went through
What v1 doesn't ship and intentionally so:
- Roslyn rewriter for
inline_constantcommits. Lighting and tonemap live asVector3(5.75f, 4.25f, 2.50f)literals insideGame1.cs. The MCP can tune them runtime; it cannot write them back. The fix is a RoslynSyntaxRewriterthat finds the rightMemberAccessExpressionSyntax, replaces the value, and re-formats — that's a v2 line item. - Runtime-loadable scenarios.
scenarios/X.jsonfiles exist as skill metadata but the game doesn't parse them yet. Adding a--scenario <path>flag is small; getting the scenario-runner to round-trip againstmcp.commitis the v2 work. - HTTP+SSE transport. Stdio is enough for Claude CLI's MCP support. Multi-client over HTTP lands when there's a second consumer.
Where this goes
The pattern that emerged across the persona reviews and the implementation is: an MCP that's right is a much wider tool surface than the conservative cut would have given you, gated by a much narrower set of hard constraints. Compile gating prevents the surface from ever shipping to a player. Determinism labels prevent ephemeral tweaks from corrupting replays. The explicit mcp.commit verb means session-local exploration never becomes silent persistence. Inside those four walls, you can ship voxel.set_hp, weapon.beam.damage, material.<pack>.emissive — all the knobs the conservative reviews refused on procedural grounds — without making the game worse.
The blog post you're reading is the smallest possible demonstration: a single python3 script piped into dotnet StarGazer.dll --mcp, eight RPC calls, eight artifacts on disk in under two seconds, then committed alongside the post. Every other dev tool I've built lives on the wrong side of "fast enough to use during real iteration." This one moved over the line.
There's a screenshot for that.
Captured at sim tick 612 after scenario.force {kind: "broadside_duel"} + sim.step 120. Total time from "spawn this exact matchup" to "JPEG on disk": 1.8 seconds. Cold-launch equivalent: 14 seconds and a CLI argument tower I didn't have to invent.
Capture session pinned at commit 17991ee. MCP session id 5a189370-8736-4ca1-916d-fd74b20e813e. All proofs live at screenshots/2026-05-26/mcp-blog-demo/ in the repo.