Approved design for Spec 1 of the levandor-infra project: provision and deprovision a single Docker-ready, Artifact-Registry-ready Ubuntu VM on GCP via Terraform + Ansible, reachable over Tailscale, with base hardening, GitHub keys, and host telemetry to SigNoz Cloud.
Deployed (2026-05-25)
Spec 1 of 2. Approved, validated, and now live on GCP: ops-vm (e2-small, us-central1-a) is provisioned, configured, joined to the tailnet, and shipping host metrics to SigNoz Cloud. See spec-1-deployment-complete for the deployed state, the late refinement to Tailscale SSH, and the telemetry / cost numbers. Spec 2 (the application-deploy layer) is previewed at the end. Source spec: docs/superpowers/specs/2026-05-22-gcp-vm-provisioning-design.md in the repo.
Scope
Spec 1 (this design) — stand up and tear down one VM:
- make provision produces a configured GCP VM reachable over the tailnet.
- make deprovision cleanly removes the VM and every resource it created.
- Docker-ready and Artifact-Registry-ready, base-hardened, GitHub keys installed, host metrics flowing to SigNoz Cloud.
- Passwordless operation after a one-time setup — no command ever prompts for a password.
Spec 2 (future, not in this design) — the application-deploy layer: Artifact Registry image pulls + docker compose, Cloudflare ZTNA ingress, batch jobs, app logs/traces to SigNoz, plus hardening (no external IP + Cloud NAT, GCP Secret Manager, Tailscale ACL tags).
Architecture
Two layers, one automatic handoff. Terraform provisions; Ansible configures. The handoff is a Terraform-generated Ansible inventory file — explicit and version-inspectable, requiring no extra plugins or Ansible-side GCP credentials.
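To make the handoff concrete, here is a minimal Python sketch of the rendering step. It is not the project's inventory.tftpl, whose contents are not shown in this design: the template fields, the ansible_user key, and the tailnet domain example-tailnet.ts.net are assumptions for illustration. Only the vm group name (from hosts: vm in site.yml) and the MagicDNS-based ansible_host come from the design itself.

```python
from string import Template

# Hypothetical stand-in for inventory.tftpl; the real template may carry
# different or additional variables.
INVENTORY_TEMPLATE = Template(
    "[vm]\n"
    "$vm_name ansible_host=$magicdns_name ansible_user=$ssh_user\n"
)


def render_inventory(vm_name: str, tailnet_domain: str, ssh_user: str) -> str:
    """Render an INI inventory whose ansible_host is the VM's MagicDNS name."""
    # Assumption: the MagicDNS name is "<vm_name>.<tailnet domain>".
    magicdns_name = f"{vm_name}.{tailnet_domain}"
    return INVENTORY_TEMPLATE.substitute(
        vm_name=vm_name,
        magicdns_name=magicdns_name,
        ssh_user=ssh_user,
    )


# Example call (the tailnet domain is a placeholder):
# render_inventory("ops-vm", "example-tailnet.ts.net", "ops")
```

Because the output is a plain file on disk, it can be read or diffed before Ansible runs, which is the "explicit and inspectable" property the design relies on.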
Provision flow
flowchart TD
    M["make provision"] --> TF["terraform apply"]
    TF --> R["Creates: required APIs,<br/>dedicated VPC + subnet,<br/>service account + IAM,<br/>VM instance"]
    R --> SS["VM startup-script:<br/>install Tailscale,<br/>join tailnet as <vm_name>"]
    TF --> INV["Terraform writes<br/>ansible/inventory/hosts.ini<br/>(ansible_host = MagicDNS name)"]
    INV --> AP["ansible-playbook"]
    SS -.VM online on tailnet.-> AP
    AP --> ROLES["Roles in order:<br/>base → docker →<br/>github_keys → monitoring"]
    ROLES --> OTEL["OTel Collector ships<br/>host metrics to SigNoz Cloud"]
    style M fill:#264653,stroke:#2a9d8f,color:#fff
    style TF fill:#2d2d2d,stroke:#888,color:#fff
    style AP fill:#2d2d2d,stroke:#888,color:#fff
    style OTEL fill:#264653,stroke:#2a9d8f,color:#fff
make deprovision runs terraform destroy, removing every managed resource. The VM’s Tailscale node uses an ephemeral auth key, so it auto-removes from the tailnet shortly after the VM is destroyed.
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| GCP authentication | gcloud Application Default Credentials | Single dev, one machine; no key files to store or leak. |
| Terraform state | Local file | Simplest; documented migration path to a GCS backend. |
| TF → Ansible handoff | Terraform templates an inventory file | Explicit, version-inspectable; no extra plugins or Ansible-side GCP creds. |
| Trigger UX | Makefile targets | Discoverable via make help, no new dependencies. |
| VM model | Single VM via variables, reusable modules/vm/ | Matches current need; logic lives in a reusable module so the layout extends to other GCP resource types without a rewrite. |
| VM network access | Tailscale — no public SSH port | Dedicated custom VPC with no inbound rules; reachable only on the tailnet. |
| SSH key passphrase | macOS Keychain (design) → N/A in deployment | The deployed environment uses Tailscale SSH instead, removing the keychain prereq entirely. |
| SSH auth model | Key-based SSH over Tailscale (design) → Tailscale SSH (live) | Deployment late-refined to tailscale up --ssh + ACL — fewer moving parts; predictable sudo via google-sudoers is unchanged. |
| Host telemetry | OpenTelemetry Collector → SigNoz Cloud (eu2) | Managed SigNoz SaaS; host_metrics over OTLP/gRPC (renamed from hostmetrics in v0.152); nothing to self-host. Replaced node_exporter. |
| Image builds | CI → Artifact Registry (Spec 2) | The VM pulls images via its dedicated service account. |
Reusable module, single VM
The VM is provisioned as a single instance driven by variables, but the actual logic lives in modules/vm/. Future GCP resource types become new root .tf files backed by new modules under modules/: additive, not a restructure.
Repo Layout
The repo root is the Terraform root (/Users/levander/levandor/terraform).
terraform/
├── Makefile operator interface
├── main.tf providers + APIs + VPC + service account + vm module + inventory
├── variables.tf / outputs.tf
├── inventory.tftpl Ansible inventory template
├── startup-script.tftpl VM bootstrap: install + join Tailscale
├── terraform.tfvars.example
├── modules/vm/ reusable VM module (main/variables/outputs)
├── ansible/
│ ├── ansible.cfg
│ ├── site.yml runs the 4 roles
│ ├── inventory/hosts.ini GENERATED by Terraform (gitignored)
│ └── roles/ base, docker, github_keys, monitoring
├── docs/superpowers/specs/ design specs
└── secrets/ gitignored — GitHub deploy key + SigNoz ingestion key
Terraform Layer (provisioning)
- Providers: hashicorp/google ~> 6.0 (ADC auth, no creds in code), hashicorp/local (inventory file). required_version >= 1.6.
- Local backend: terraform.tfstate in the repo root, gitignored. README documents the one-block change to migrate to a GCS backend.
- Root resources (main.tf):
  - google_project_service: enables compute, iam, artifactregistry APIs, with disable_on_destroy = false.
  - google_compute_network (custom mode) + google_compute_subnetwork: a dedicated VPC, no inbound firewall rules.
  - google_service_account: dedicated SA for the VM (replaces the default SA), granted roles/artifactregistry.reader (forward-looking for Spec 2).
  - module vm: depends_on the google_project_service resources.
  - local_file: renders ansible/inventory/hosts.ini from inventory.tftpl; depends_on the VM.
- modules/vm/: google_compute_instance (Ubuntu 24.04 LTS, dedicated VPC/subnet, ephemeral external IP for egress only, dedicated SA, network tag, labels). Metadata: ssh-keys, enable-oslogin = FALSE, and a startup-script that installs Tailscale and runs tailscale up --ssh (Tailscale SSH enabled). Optional google_compute_firewall for inbound udp:41641 (faster direct Tailscale, default on). No tcp:22 rule exists.
- The SigNoz endpoint and ingestion key are not Terraform variables; they are Ansible-side config, so the key never enters tfstate or VM metadata.
Ansible Layer (configuration)
- ansible.cfg lives in ansible/: inventory = inventory/hosts.ini, roles_path = roles, host_key_checking = accept-new. become_ask_pass is not set (sudo is passwordless). make configure runs Ansible with ANSIBLE_CONFIG=ansible/ansible.cfg so the config is actually loaded.
- site.yml: single play, hosts: vm, become: true, gather_facts: false at play level, with a wait_for_connection pre-task (~300 s) to absorb VM boot + Tailscale join, followed by an explicit ansible.builtin.setup call to gather facts (see gotcha 8). Then four roles in order:
  - base: apt utilities, unattended-upgrades (Automatic-Reboot false), fail2ban.
  - docker: Docker CE from the official apt repo (keyring + signed-by), compose plugin, adds ssh_user to the docker group, configures the Artifact Registry credential helper.
  - github_keys: copies a private deploy key from secrets/ to the VM, adds github.com to known_hosts; skipped automatically if no key is present.
  - monitoring: installs the OpenTelemetry Collector (otelcol-contrib v0.152.1, host_metrics receiver) as a systemd service, exporting OTLP/gRPC (otlp_grpc exporter) to SigNoz Cloud.
- All tasks use built-in Ansible modules — no Galaxy collections.
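The reason for the wait_for_connection pre-task is easier to see as a polling loop. The sketch below is a plain-Python illustration of the idea, not Ansible's implementation: the probe callable, the 5-second interval, and the injectable clock/sleep parameters are all assumptions made so the example is self-contained.

```python
import time
from typing import Callable


def wait_for_connection(
    probe: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll probe() until it returns True and return the elapsed seconds.

    Raises TimeoutError when another attempt would exceed `timeout`.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        # Give up if the next attempt would land past the deadline.
        if clock() - start + interval > timeout:
            raise TimeoutError(f"host unreachable after {timeout:.0f} s")
        sleep(interval)
```

Fact gathering is deferred until this loop succeeds, which is why gather_facts is false at play level: gathering facts at play start would attempt a connection before the VM has booted and joined the tailnet.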
Host Telemetry — OTel to SigNoz Cloud
The monitoring role installs otelcol-contrib v0.152.1 (the OpenTelemetry Collector contrib distribution, which bundles the host_metrics receiver) and runs it as a systemd service. It collects host metrics (CPU, memory, disk, filesystem, network, load), tags them via a resourcedetection processor (system + GCP attributes), and exports OTLP/gRPC to SigNoz Cloud at ingest.eu2.signoz.cloud:443 (TLS on, signoz-ingestion-key header). Live status: ~25k host-metric points exported, zero failures — see Telemetry Status — Flowing to SigNoz Cloud.
Why OTel replaced node_exporter
The host_metrics receiver covers the same host metrics node_exporter would, and OTel gives a single agent that Spec 2 can extend with a filelog receiver and OTLP trace intake for app logs/traces. SigNoz Cloud (eu2, managed SaaS) means nothing to self-host. The ingestion key is read from secrets/signoz_ingestion_key and templated into the collector config (restrictive permissions, not world-readable); it is an Ansible-side secret only, never a Terraform variable.
Verifying export success
systemctl is-active otelcol-contrib is necessary but not sufficient. Confirm metrics actually flow by grepping the collector's own :8888/metrics for otelcol_exporter_sent_metric_points (>0) and otelcol_exporter_send_failed_metric_points (0 or absent). See gotcha 9.
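The same check can be expressed as a small parser over the Prometheus text format that the :8888/metrics endpoint serves. This is a hedged sketch, not part of the repo: the function names are hypothetical, and accepting a _total suffix on the counter names is an assumption, made because some Prometheus exporters append it.

```python
def sum_metric(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` (or `name`_total) in Prometheus text output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name in (name, name + "_total"):
            total += float(value)
    return total


def export_is_healthy(metrics_text: str) -> bool:
    """True when points were sent and no send failures were recorded."""
    sent = sum_metric(metrics_text, "otelcol_exporter_sent_metric_points")
    failed = sum_metric(metrics_text, "otelcol_exporter_send_failed_metric_points")
    # An absent failure counter sums to 0.0, matching "0 or absent".
    return sent > 0 and failed == 0
```

On the VM, the input would be the body returned by curl -s localhost:8888/metrics. An empty or comment-only body reports unhealthy, which is the behaviour wanted when the collector is running but not exporting.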
Passwordless Operation
A hard requirement. Three places a prompt could appear, each eliminated:
- SSH key passphrase: the design called for a one-time ssh-add --apple-use-keychain ~/.ssh/id_ed25519 to store the passphrase in the macOS Keychain, with UseKeychain yes + AddKeysToAgent yes in ~/.ssh/config. Deployment refined this away: switching to Tailscale SSH removes the SSH keypair entirely, so there is no passphrase to supply. See Late Refinement — Tailscale SSH Replaces Key-Based SSH.
- sudo password: GCP's guest agent grants metadata-provisioned users passwordless sudo via the google-sudoers group, so Ansible become needs no password.
- Host-key prompt: accept-new is set in ansible.cfg, and make ssh passes -o StrictHostKeyChecking=accept-new. With Tailscale SSH, the tailnet itself authenticates the host, eliminating most of this concern in practice.
Provision / Deprovision Workflow
The Makefile is the operator interface. terraform apply / destroy prompt for confirmation (no -auto-approve); targets stop on the first error.
| Target | Action |
|---|---|
| make help | List all targets. |
| make preflight | Check terraform / gcloud / ansible / tailscale installed, ADC active, Tailscale connected. |
| make init / fmt / validate / plan | Terraform lifecycle helpers. |
| make provision | Full flow: vm-up then configure. |
| make vm-up | terraform apply creates the VM, writes the inventory. |
| make configure | ANSIBLE_CONFIG=… ansible-playbook site.yml (keys + telemetry + utilities). |
| make deprovision | terraform destroy tears the VM down. |
| make ssh | tailscale ssh ops@ops-vm over Tailscale. |
| make verify | Post-provision smoke check over SSH. |
Secrets Handling
- .gitignore excludes secrets/*, *.tfvars, *.tfstate*, .terraform/, and ansible/inventory/hosts.ini.
- GitHub deploy key: in secrets/ (gitignored), referenced by a path variable. One deploy key grants one repo; multi-repo access (Spec 2) needs a machine-user key.
- Tailscale auth key: set in terraform.tfvars (gitignored) as a sensitive variable, injected into the VM's startup-script metadata. Carries tag:cloud so the SSH ACL can target it. Caveat: instance metadata is readable by project viewers and on-VM processes. That is acceptable for a personal project; GCP Secret Manager is the Spec 2 hardening path. Use a key with a short expiry.
- SigNoz ingestion key: in secrets/signoz_ingestion_key (gitignored), read by the monitoring role and templated into the OTel config. Never a Terraform variable, so it never lands in tfstate or instance metadata.
Verification
make verify runs a post-provision smoke check over SSH: tailscale status shows the node online; sudo -n true succeeds (passwordless sudo); docker --version + systemctl is-active docker; systemctl is-active otelcol-contrib and fail2ban; groups includes docker; and curl -s localhost:8888/metrics shows otelcol_exporter_sent_metric_points >0 with no send_failed. End-to-end acceptance (now achieved): a clean make provision yields a Tailscale-reachable, fully-configured VM with host metrics visible in the SigNoz Cloud dashboard and no password prompt at any step; make deprovision removes every resource.
Spec 2 Preview (out of scope here)
Spec 2 builds the application-deploy layer on this VM: pull images from Artifact Registry + run via docker compose; the Artifact Registry repo (Terraform) + a per-repo CI workflow template; app logs/traces to SigNoz via a filelog receiver + OTLP trace intake; a cloudflared role + Cloudflare ZTNA access; batch-job scheduling; and hardening (no external IP + Cloud NAT, GCP Secret Manager, tighter Tailscale ACL). Each item is additive to this layout.