For Agents

Meta-knowledge from the Spec 1 effort: which workflow steps and which review practices actually paid off, and what to keep / drop next time. Read this before kicking off Spec 2 or any similar greenfield infra project.

Implementation retrospective for the GCP VM provisioning Spec 1, deployed live on 2026-05-25. Focus is on process — what to repeat, what to adjust — rather than the technical details (those live in gcp-vm-provisioning-design and spec-1-deployment-complete).

Workflow Used

RPI-style staged workflow, with subagent-driven-development on the implementation phase:

  1. Brainstorm — open-ended scoping pass, no commitments. Surfaced the two-spec split (VM provisioning vs app deploy) and the Tailscale-only access decision early.
  2. Spec — wrote gcp-vm-provisioning-design as the approval artifact. Read sequentially top-to-bottom; every decision had a rationale paragraph.
  3. Plan — task-list breakdown of the spec into atomic implementation tasks (Terraform modules, Ansible roles, Makefile targets, etc.).
  4. Execute — subagent-driven development: a fresh subagent spawned per task, every output passed through a two-stage review (spec-compliance reviewer + code-quality reviewer), all on Opus. The two reviewers ran in parallel and the implementing subagent integrated both rounds of feedback before the work landed.
  5. Validate — pre-deployment paper-validation pass against the design note, then the live deployment pass against the real GCP project.

Key property: each subagent had only the task it owned in context — no cross-task contamination, no drift toward an implicit shared mental model that nobody had written down.
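The per-task review step can be sketched roughly as follows. This is an illustration only: the real reviewers were Opus subagents, while here each one is modelled as a plain Python function, and every name (Task, review_in_parallel, the two stand-in reviewers) is hypothetical. The point it shows is structural: both reviewers see only the task and its output, and they run side by side.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    """The only context a subagent or reviewer receives (hypothetical shape)."""
    name: str
    spec_excerpt: str  # the part of the spec this task owns, nothing else


# A reviewer takes the task and the implementer's output, returns findings.
Reviewer = Callable[[Task, str], List[str]]


def spec_compliance_reviewer(task: Task, output: str) -> List[str]:
    # Stand-in check: the output must mention what the spec excerpt requires.
    if task.spec_excerpt in output:
        return []
    return [f"{task.name}: output does not cover '{task.spec_excerpt}'"]


def code_quality_reviewer(task: Task, output: str) -> List[str]:
    # Stand-in check: no unfinished markers left behind.
    return [f"{task.name}: TODO left in output"] if "TODO" in output else []


def review_in_parallel(task: Task, output: str,
                       reviewers: List[Reviewer]) -> List[str]:
    """Run all reviewers concurrently; collect findings in reviewer order."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = [pool.submit(r, task, output) for r in reviewers]
        findings: List[str] = []
        for future in futures:
            findings.extend(future.result())
    return findings


task = Task(name="firewall", spec_excerpt="no public SSH")
findings = review_in_parallel(
    task,
    "TODO: add rule",
    [spec_compliance_reviewer, code_quality_reviewer],
)
```

The implementing subagent would then address every entry in findings before the task lands; an empty list means both reviewers passed.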

What Worked

  • Two-stage review caught real issues — the validation pass surfaced four design-stage bugs that a single reviewer would have shipped:
    • The API-enablement race against the IAM binding (see gotcha 6).
    • The ~-path expansion gap in file() (see gotcha 2).
    • The default VPC shipping with an open default-allow-ssh rule (see gotcha 5).
    • The OS Login override silently defeating instance-metadata SSH keys (see gotcha 4).
  • Upfront brainstorming kept scope tight — committing the two-spec split before writing a line of Terraform meant Spec 1 stayed focused on VM lifecycle and never grew app-deploy responsibilities by accretion. That discipline carried straight through to live deployment.
  • Pinning provider and OTel versions stopped surprises — hashicorp/google ~> 6.0 and otelcol-contrib v0.152.1 are pinned in code. No “the playbook used to work” mysteries on re-run.
  • Dedicated VPC + Tailscale-only made the security story trivial — there is literally no public SSH port to defend. The fail2ban unit is defense-in-depth, not the primary control. This kept the threat-model conversation a one-liner.

What Needed Adjustment Mid-Stream

  • SSH model pivot to Tailscale SSH — the design landed on key-based SSH over Tailscale (with a macOS Keychain step on the operator workstation). During the deployment pass this was switched to Tailscale SSH (tailscale up --ssh + an ACL rule). Much simpler operator UX, no Keychain dance, no ssh-agent lifecycle, no SSH keypair to manage. See Late Refinement — Tailscale SSH Replaces Key-Based SSH. The lesson: be willing to revisit auth-model decisions during deployment — the design-time choice is usually informed by a less-rich picture than the deployment-time one.
  • OTel collector version pin was stale — the design specified otelcol-contrib 0.119.0. By deployment day the current release was 0.152.1. The pin was bumped live; the v0.152 release also renamed hostmetrics → host_metrics and otlp → otlp_grpc (see gotcha 10). Lesson: for fast-moving upstream components, verify that the pinned version is still current as the first task of the deployment pass, not after the role fails to converge.
  • Ansible gather_facts ordering — the play definition needed gather_facts: false plus an explicit ansible.builtin.setup after wait_for_connection, otherwise the implicit fact-gather races VM boot and the run dies on the first connect with UNREACHABLE. Caught on the first live make configure. See gotcha 8.
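The "is the pinned upstream still current?" check from the OTel lesson above could look something like this. A minimal sketch: it assumes plain dotted numeric versions with an optional leading "v", and it assumes the latest release string is looked up separately (for example from the upstream releases page); the function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.152.1' or 'v0.152.1' into (0, 152, 1).

    Assumes purely numeric components; pre-release suffixes are not handled.
    """
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def is_pin_stale(pinned: str, latest: str) -> bool:
    """True when the pinned version is older than the latest known release."""
    return parse_version(pinned) < parse_version(latest)


# The situation described above: design-time pin vs deployment-day release.
stale_at_design_pin = is_pin_stale("0.119.0", "v0.152.1")
stale_after_bump = is_pin_stale("0.152.1", "v0.152.1")
```

Comparing tuples rather than strings matters here: as plain strings, "0.99.0" would sort after "0.152.1".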

Surprises During Live Run

Five deployment-time gotchas are catalogued under entries 7–11 of gcp-terraform-ansible-gotchas. The Tailscale ACL was the one that took two tries — the first attempt used dst: ["autogroup:self"], which does not match tagged devices (the VM was joined with an auth key carrying tag:cloud, so the device has no user owner for autogroup:self to resolve against). Fixed by targeting dst: ["tag:cloud"] explicitly. See gotcha 7.
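Why the first ACL attempt matched nothing can be shown with a toy model of destination matching. This is a deliberately simplified sketch of the behaviour described above, not Tailscale's implementation: Device and dst_matches are hypothetical names, the user address is a placeholder, and only the two destination forms mentioned here are modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Device:
    name: str
    owner: Optional[str]                 # None for tagged devices
    tags: FrozenSet[str] = frozenset()


def dst_matches(dst: str, device: Device, connecting_user: str) -> bool:
    """Simplified destination matching for the two selector forms above."""
    if dst == "autogroup:self":
        # Only untagged devices owned by the connecting user qualify.
        return not device.tags and device.owner == connecting_user
    if dst.startswith("tag:"):
        return dst in device.tags
    return False


# The VM joined with an auth key carrying tag:cloud, so it has no user owner.
vm = Device(name="vm", owner=None, tags=frozenset({"tag:cloud"}))
user = "operator@example.com"            # placeholder identity

first_attempt_dst = ["autogroup:self"]
fixed_dst = ["tag:cloud"]

first_attempt_reaches_vm = any(dst_matches(d, vm, user) for d in first_attempt_dst)
fixed_reaches_vm = any(dst_matches(d, vm, user) for d in fixed_dst)
```

In this model the first attempt never reaches the VM because a tagged device has no owner to compare against, while the explicit tag selector matches directly.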

The rest of the deployment-time gotchas (OTel :8888/metrics as the canonical export-success signal, OTel v0.152 deprecation aliases, the Ansible fact-gathering ordering, rtk proxy as a bypass for tools whose raw output matters) all landed cleanly once identified.

For Spec 2

Direct carryovers to spec-2-roadmap:

  • Keep the same workflow — the brainstorm → spec → plan → execute → validate cycle and the two-stage subagent review held up under live deployment pressure. It scales.
  • The existing gotchas note is now an asset, not a rumor pile — eleven entries with concrete fixes. Lead the Spec 2 design with a pass through it.
  • Pin OTel versions, but add a verify-before-run checklist item — the version bump from 0.119.0 to 0.152.1 was a free win because it was caught early; build a one-liner check (“is the pinned upstream still current?”) into the deployment-pass kickoff.
  • Consider Cloud NAT to drop the external IPv4 — saves ~1/mo for NAT, and removes the last public-facing surface the VM has (egress only, but still). Worth it for the security story alone.
  • Consider Secret Manager for the SigNoz / Tailscale keys — the current secrets/ gitignored files work, but GCP Secret Manager would centralize rotation and audit-trail the access. Modest cost (~cents/mo for two secrets), real operational upgrade.

Numbers

  • ~16 implementation commits on the feature branch (the actual Terraform / Ansible / Makefile work) plus ~5 docs commits (design note, gotchas, deployment writeup) before merge to main. main is at commit 8ca6b6e.
  • ~10 GCP resources provisioned: project APIs (4), custom VPC + subnet + firewall rule, service account, compute instance, IAM binding, local inventory file. Apply time on a clean project: ~90 seconds.
  • ~25,000+ host-metric data points shipped to ingest.eu2.signoz.cloud since first export. Zero failed exports — the otelcol_exporter_send_failed_metric_points counter is absent from :8888/metrics.
  • ~$18/mo steady-state cost at e2-small + 20 GB pd-balanced + ephemeral external IPv4.
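The "zero failed exports" check against :8888/metrics can be scripted. The sketch below assumes the endpoint returns Prometheus text exposition format and that the scraped text is already in hand (fetching it is left out); the function name and the sample payload, including its label values, are made up for illustration. An absent counter is treated as zero failures, matching the reading above.

```python
FAILED_PREFIX = "otelcol_exporter_send_failed_metric_points"


def failed_metric_points(metrics_text: str) -> float:
    """Sum all samples of the failed-metric-points counter; 0.0 if absent."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, rest = line.split(None, 1)
        # Prefix match so a suffixed variant of the name is also counted.
        if name.startswith(FAILED_PREFIX):
            total += float(rest.split()[0])  # value precedes optional timestamp
    return total


# Hypothetical scrape: only the success counter is present.
sample = """\
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="otlp_grpc"} 25000
"""

failures = failed_metric_points(sample)
```

A non-zero result would be the cue to look at the collector logs before trusting the data-point count.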