For Agents
Implementation retrospective for the GCP VM provisioning Spec 1, deployed live on 2026-05-25. Focus is on process — what to repeat, what to adjust — rather than the technical details (those live in gcp-vm-provisioning-design and spec-1-deployment-complete).
Workflow Used
RPI-style staged workflow, with subagent-driven-development on the implementation phase:
- Brainstorm — open-ended scoping pass, no commitments. Surfaced the two-spec split (VM provisioning vs app deploy) and the Tailscale-only access decision early.
- Spec — wrote gcp-vm-provisioning-design as the approval artifact. Read sequentially top-to-bottom; every decision had a rationale paragraph.
- Plan — task-list breakdown of the spec into atomic implementation tasks (Terraform modules, Ansible roles, Makefile targets, etc.).
- Execute — subagent-driven development: a fresh subagent spawned per task, every output passed through a two-stage review (spec-compliance reviewer + code-quality reviewer), all on Opus. The two reviewers ran in parallel and the implementing subagent integrated both rounds of feedback before the work landed.
- Validate — pre-deployment paper-validation pass against the design note, then the live deployment pass against the real GCP project.
Key property: each subagent had only the task it owned in context — no cross-task contamination, no drift toward an implicit shared mental model that nobody had written down.
What Worked
- Two-stage review caught real issues — the validation pass surfaced four design-stage bugs that a single reviewer would have shipped:
- Upfront brainstorming kept scope tight — committing the two-spec split before writing a line of Terraform meant Spec 1 stayed focused on VM lifecycle and never grew app-deploy responsibilities by accretion. That discipline carried straight through to live deployment.
- Pinning provider and OTel versions stopped surprises: hashicorp/google ~> 6.0 and otelcol-contrib v0.152.1 are pinned in code. No “the playbook used to work” mysteries on re-run.
- Dedicated VPC + Tailscale-only made the security story trivial: there is literally no public SSH port to defend. The fail2ban unit is defense-in-depth, not the primary control. This kept the threat-model conversation a one-liner.
What Needed Adjustment Mid-Stream
- SSH model pivot to Tailscale SSH: the design landed on key-based SSH over Tailscale (with a macOS Keychain step on the operator workstation). During the deployment pass this was switched to Tailscale SSH (tailscale up --ssh + an ACL rule). Much simpler operator UX: no Keychain dance, no ssh-agent lifecycle, no SSH keypair to manage. See Late Refinement — Tailscale SSH Replaces Key-Based SSH. The lesson: be willing to revisit auth-model decisions during deployment, because the design-time choice is usually informed by a less-rich picture than the deployment-time one.
- OTel collector version pin was stale: the design specified otelcol-contrib 0.119.0. By deployment day the current release was 0.152.1. The pin was bumped live; the v0.152 release also renamed hostmetrics → host_metrics and otlp → otlp_grpc (see gotcha 10). Lesson: for fast-moving upstream components, verify the pinned version is still current as the first task of the deployment pass, not after the role fails to converge.
- Ansible gather_facts ordering: the play definition needed gather_facts: false plus an explicit ansible.builtin.setup after wait_for_connection, otherwise the implicit fact-gather races VM boot and the run dies on the first connect with UNREACHABLE. Caught on the first live make configure. See gotcha 8.
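The fact-gathering ordering described in the last item could be guarded with a small pre-flight check. The sketch below is illustrative only: the function name and the example play are hypothetical, and the play is written as the dict a YAML loader would produce rather than copied from the real playbook.

```python
from typing import Any

WAIT = "ansible.builtin.wait_for_connection"
SETUP = "ansible.builtin.setup"


def fact_gathering_problems(play: dict[str, Any]) -> list[str]:
    """Return human-readable problems with the play's fact-gathering order."""
    problems = []
    # Ansible gathers facts implicitly unless the play opts out.
    if play.get("gather_facts", True):
        problems.append("gather_facts must be false so nothing connects before the wait")
    # Each task is a dict; the module name is one of its keys.
    modules = [key for task in play.get("tasks", []) for key in task if key in (WAIT, SETUP)]
    if WAIT not in modules:
        problems.append(f"missing {WAIT}")
    elif SETUP not in modules:
        problems.append(f"missing explicit {SETUP}")
    elif modules.index(SETUP) < modules.index(WAIT):
        problems.append(f"{SETUP} runs before {WAIT}")
    return problems


# Hypothetical play, shaped like the output of a YAML loader.
play = {
    "hosts": "all",
    "gather_facts": False,
    "tasks": [
        {"name": "Wait for the VM to accept connections", WAIT: {"timeout": 300}},
        {"name": "Gather facts once the VM is reachable", SETUP: {}},
    ],
}
print(fact_gathering_problems(play))  # prints []
```

A play that omits gather_facts (so it defaults to true) or that runs setup before the wait would return a non-empty list, which a wrapper could turn into a failed pre-flight step.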
Surprises During Live Run
Five deployment-time gotchas are catalogued under entries 7–11 of gcp-terraform-ansible-gotchas. The Tailscale ACL was the one that took two tries — the first attempt used dst: ["autogroup:self"], which does not match tagged devices (the VM was joined with an auth key carrying tag:cloud, so the device has no user owner for autogroup:self to resolve against). Fixed by targeting dst: ["tag:cloud"] explicitly. See gotcha 7.
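The autogroup:self mismatch can be illustrated with a deliberately simplified model of destination matching. This is not Tailscale's implementation; the function name and the same_user flag are assumptions made for the sketch, and only the two selector kinds mentioned above are handled.

```python
def dst_matches_device(dst: list[str], device_tags: set[str], same_user: bool) -> bool:
    """Simplified model of how an ACL dst list resolves against one device."""
    for selector in dst:
        # A tag selector matches when the device carries that tag.
        if selector.startswith("tag:") and selector in device_tags:
            return True
        # autogroup:self needs a user-owned device; tagged devices have no user owner.
        if selector == "autogroup:self" and not device_tags and same_user:
            return True
    return False


vm_tags = {"tag:cloud"}  # the VM joined with an auth key carrying tag:cloud
print(dst_matches_device(["autogroup:self"], vm_tags, same_user=True))  # prints False
print(dst_matches_device(["tag:cloud"], vm_tags, same_user=True))  # prints True
```

The first call mirrors the failed attempt and the second mirrors the fix: once a device is tagged, only a tag selector can reach it in this model.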
The rest of the deployment-time gotchas (OTel :8888/metrics as the canonical export-success signal, OTel v0.152 deprecation aliases, the Ansible fact-gathering ordering, rtk proxy as a bypass for tools whose raw output matters) all landed cleanly once identified.
For Spec 2
Direct carryovers to spec-2-roadmap:
- Keep the same workflow — the brainstorm → spec → plan → execute → validate cycle and the two-stage subagent review held up under live deployment pressure. It scales.
- The existing gotchas note is now an asset, not a rumor pile — eleven entries with concrete fixes. Lead the Spec 2 design with a pass through it.
- Pin OTel versions, but add a verify-before-run checklist item: the version bump from 0.119.0 to 0.152.1 was a free win because it was caught early; build a one-liner check (“is the pinned upstream still current?”) into the deployment-pass kickoff.
- Consider Cloud NAT to drop the external IPv4: saves ~1/mo for NAT, and removes the last public-facing surface the VM has (egress only, but still). Worth it for the security story alone.
- Consider Secret Manager for the SigNoz / Tailscale keys: the current secrets/ gitignored files work, but GCP Secret Manager would centralize rotation and audit-trail the access. Modest cost (~cents/mo for two secrets), real operational upgrade.
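The "is the pinned upstream still current?" check could look roughly like the sketch below. The helper names are hypothetical, and the latest version is passed in by hand; how it is obtained (release page, package index, and so on) is left open here.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.152.1' or 'v0.152.1' into (0, 152, 1) for numeric comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def pin_is_current(pinned: str, latest: str) -> bool:
    """True when the pinned version is at least the latest upstream release."""
    return parse_version(pinned) >= parse_version(latest)


# 'latest' is supplied manually so the sketch has no network dependency.
print(pin_is_current("0.119.0", "v0.152.1"))  # prints False
print(pin_is_current("0.152.1", "v0.152.1"))  # prints True
```

Comparing integer tuples rather than raw strings matters because a string comparison would rank 0.9.0 above 0.10.0.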
Numbers
- ~16 implementation commits on the feature branch (the actual Terraform / Ansible / Makefile work) plus ~5 docs commits (design note, gotchas, deployment writeup) before merge to main. main is at commit 8ca6b6e.
- ~10 GCP resources provisioned: project APIs (4), custom VPC + subnet + firewall rule, service account, compute instance, IAM binding, local inventory file. Apply time on a clean project: ~90 seconds.
- ~25,000+ host-metric data points shipped to ingest.eu2.signoz.cloud since first export. Zero failed exports: the otelcol_exporter_send_failed_metric_points counter is absent from :8888/metrics.
- ~$18/mo steady-state cost at e2-small + 20 GB pd-balanced + ephemeral external IPv4.
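The zero-failed-exports check can be automated by scanning the collector's :8888/metrics output for the failure counter. The following is a minimal sketch, assuming Prometheus text format without trailing timestamps; the function name and the sample payload (including its value) are made up for illustration, and the optional _total suffix is an assumption about how some collector versions name counters.

```python
from typing import Optional


def counter_total(metrics_text: str, name: str) -> Optional[float]:
    """Sum all samples of a counter in Prometheus text format; None if absent."""
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]
        # Assumption: some collector versions append _total to counter names.
        if metric in (name, name + "_total"):
            total = (total or 0.0) + float(value)
    return total


# Hypothetical sample payload, not captured from the real VM.
sample = """\
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="otlp_grpc"} 25012
"""
failed = counter_total(sample, "otelcol_exporter_send_failed_metric_points")
print(failed in (None, 0.0))  # prints True
```

Treating both an absent counter and a zero counter as healthy matches the observation above that the failure counter simply does not appear when no export has failed.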