Approved design for Spec 1 of the levandor-infra project: provision and deprovision a single Docker-ready, Artifact-Registry-ready Ubuntu VM on GCP via Terraform + Ansible, reachable over Tailscale, with base hardening, GitHub keys, and host telemetry to SigNoz Cloud.

Deployed (2026-05-25)

Spec 1 of 2. Approved, validated, and now live on GCP: ops-vm (e2-small, us-central1-a) is provisioned, configured, joined to the tailnet, and shipping host metrics to SigNoz Cloud. See spec-1-deployment-complete for the deployed state, the late refinement to Tailscale SSH, and the telemetry / cost numbers. Spec 2 (the application-deploy layer) is previewed at the end. Source spec: docs/superpowers/specs/2026-05-22-gcp-vm-provisioning-design.md in the repo.

Scope

Spec 1 (this design) — stand up and tear down one VM:

  • make provision produces a configured GCP VM reachable over the tailnet.
  • make deprovision cleanly removes the VM and every resource it created.
  • Docker-ready and Artifact-Registry-ready, base-hardened, GitHub keys installed, host metrics flowing to SigNoz Cloud.
  • Passwordless operation after a one-time setup — no command ever prompts for a password.

Spec 2 (future, not in this design) — the application-deploy layer: Artifact Registry image pulls + docker compose, Cloudflare ZTNA ingress, batch jobs, app logs/traces to SigNoz, plus hardening (no external IP + Cloud NAT, GCP Secret Manager, Tailscale ACL tags).

Architecture

Two layers, one automatic handoff. Terraform provisions; Ansible configures. The handoff is a Terraform-generated Ansible inventory file — explicit and version-inspectable, requiring no extra plugins or Ansible-side GCP credentials.
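
As a rough illustration of the handoff, the POSIX shell sketch below prints an inventory of the kind Terraform could render from inventory.tftpl. The real template is not reproduced in this document, so the function name render_inventory, the ansible_user variable, and the exact INI layout are assumptions. Only the vm group (the group site.yml targets) and the use of the MagicDNS name as ansible_host come from the design.

```shell
# Hypothetical stand-in for the Terraform-rendered inventory (inventory.tftpl).
# $1 = VM name, which is also its MagicDNS name on the tailnet
# $2 = SSH user (assumed to be passed as ansible_user)
render_inventory() {
    vm_name=$1
    ssh_user=$2
    printf '[vm]\n'
    printf '%s ansible_host=%s ansible_user=%s\n' "$vm_name" "$vm_name" "$ssh_user"
}
```

Calling render_inventory ops-vm ops prints a [vm] group header followed by a single host line. Because the file is plain text, it can be inspected or diffed before Ansible runs.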

Provision flow

flowchart TD
    M["make provision"] --> TF["terraform apply"]
    TF --> R["Creates: required APIs,<br/>dedicated VPC + subnet,<br/>service account + IAM,<br/>VM instance"]
    R --> SS["VM startup-script:<br/>install Tailscale,<br/>join tailnet as &lt;vm_name&gt;"]
    TF --> INV["Terraform writes<br/>ansible/inventory/hosts.ini<br/>(ansible_host = MagicDNS name)"]
    INV --> AP["ansible-playbook"]
    SS -.VM online on tailnet.-> AP
    AP --> ROLES["Roles in order:<br/>base &rarr; docker &rarr;<br/>github_keys &rarr; monitoring"]
    ROLES --> OTEL["OTel Collector ships<br/>host metrics to SigNoz Cloud"]
    style M fill:#264653,stroke:#2a9d8f,color:#fff
    style TF fill:#2d2d2d,stroke:#888,color:#fff
    style AP fill:#2d2d2d,stroke:#888,color:#fff
    style OTEL fill:#264653,stroke:#2a9d8f,color:#fff

make deprovision runs terraform destroy, removing every managed resource. The VM’s Tailscale node uses an ephemeral auth key, so it auto-removes from the tailnet shortly after the VM is destroyed.

Key Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| GCP authentication | gcloud Application Default Credentials | Single dev, one machine; no key files to store or leak. |
| Terraform state | Local file | Simplest; documented migration path to a GCS backend. |
| TF → Ansible handoff | Terraform templates an inventory file | Explicit, version-inspectable; no extra plugins or Ansible-side GCP creds. |
| Trigger UX | Makefile targets | Discoverable via make help, no new dependencies. |
| VM model | Single VM via variables, reusable modules/vm/ | Matches current need; logic lives in a reusable module so the layout extends to other GCP resource types without a rewrite. |
| VM network access | Tailscale; no public SSH port | Dedicated custom VPC with no inbound rules; reachable only on the tailnet. |
| SSH key passphrase | macOS Keychain (design) → N/A in deployment | The deployed environment uses Tailscale SSH instead, removing the keychain prereq entirely. |
| SSH auth model | Key-based SSH over Tailscale (design) → Tailscale SSH (live) | Deployment late-refined to tailscale up --ssh + ACL: fewer moving parts; predictable sudo via google-sudoers is unchanged. |
| Host telemetry | OpenTelemetry Collector → SigNoz Cloud (eu2) | Managed SigNoz SaaS; host_metrics over OTLP/gRPC (renamed from hostmetrics in v0.152); nothing to self-host. Replaced node_exporter. |
| Image builds | CI → Artifact Registry (Spec 2) | The VM pulls images via its dedicated service account. |

Reusable module, single VM

The VM is provisioned as a single instance driven by variables, but the actual logic lives in modules/vm/. Future GCP resource types become new root .tf files backed by new modules/ — additive, not a restructure.

Repo Layout

The repo root is the Terraform root (/Users/levander/levandor/terraform).

terraform/
├── Makefile                  operator interface
├── main.tf                   providers + APIs + VPC + service account + vm module + inventory
├── variables.tf / outputs.tf
├── inventory.tftpl           Ansible inventory template
├── startup-script.tftpl      VM bootstrap: install + join Tailscale
├── terraform.tfvars.example
├── modules/vm/               reusable VM module (main/variables/outputs)
├── ansible/
│   ├── ansible.cfg
│   ├── site.yml              runs the 4 roles
│   ├── inventory/hosts.ini   GENERATED by Terraform (gitignored)
│   └── roles/                base, docker, github_keys, monitoring
├── docs/superpowers/specs/   design specs
└── secrets/                  gitignored — GitHub deploy key + SigNoz ingestion key

Terraform Layer (provisioning)

  • Providers: hashicorp/google ~> 6.0 (ADC auth, no creds in code), hashicorp/local (inventory file). required_version >= 1.6.
  • Local backend: terraform.tfstate in the repo root, gitignored. README documents the one-block change to migrate to a GCS backend.
  • Root resources (main.tf):
    • google_project_service — enables compute, iam, artifactregistry APIs, with disable_on_destroy = false.
    • google_compute_network (custom mode) + google_compute_subnetwork — a dedicated VPC, no inbound firewall rules.
    • google_service_account — dedicated SA for the VM (replaces the default SA), granted roles/artifactregistry.reader (forward-looking for Spec 2).
    • module vm: depends_on the google_project_service resources.
    • local_file — renders ansible/inventory/hosts.ini from inventory.tftpl; depends_on the VM.
  • modules/vm/: google_compute_instance (Ubuntu 24.04 LTS, dedicated VPC/subnet, ephemeral external IP for egress only, dedicated SA, network tag, labels). Metadata: ssh-keys, enable-oslogin = FALSE, and a startup-script that installs Tailscale and runs tailscale up --ssh (Tailscale SSH enabled). Optional google_compute_firewall for inbound udp:41641 (faster direct Tailscale, default on). No tcp:22 rule exists.
  • The SigNoz endpoint and ingestion key are not Terraform variables — they are Ansible-side config, so the key never enters tfstate or VM metadata.

Ansible Layer (configuration)

  • ansible.cfg lives in ansible/: inventory = inventory/hosts.ini, roles_path = roles, host_key_checking = accept-new. become_ask_pass is not set (sudo is passwordless).
  • make configure runs Ansible with ANSIBLE_CONFIG=ansible/ansible.cfg so the config is actually loaded.
  • site.yml — single play, hosts: vm, become: true, gather_facts: false at play level, with a wait_for_connection pre-task (~300 s) to absorb VM boot + Tailscale join and then an explicit ansible.builtin.setup call to gather facts (see gotcha 8). Then four roles in order:
    • base — apt utilities, unattended-upgrades (Automatic-Reboot false), fail2ban.
    • docker — Docker CE from the official apt repo (keyring + signed-by), compose plugin, adds ssh_user to the docker group, configures the Artifact Registry credential helper.
    • github_keys — copies a private deploy key from secrets/ to the VM, adds github.com to known_hosts; skipped automatically if no key is present.
    • monitoring — installs the OpenTelemetry Collector (otelcol-contrib v0.152.1, host_metrics receiver) as a systemd service, exporting OTLP/gRPC (otlp_grpc exporter) to SigNoz Cloud.
  • All tasks use built-in Ansible modules — no Galaxy collections.
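
The wait_for_connection pre-task is a retry-until-deadline loop: keep probing the host until it answers or the time budget runs out. The sketch below shows the same idea in POSIX shell. It is not the Ansible module; wait_until is a hypothetical helper, and the probe command is whatever the caller supplies.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS of sleeping
# have accumulated. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while :; do
        # Probe once; discard its output so only the exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Give up once the time budget is spent.
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

An operator could, for example, run wait_until 300 5 tailscale ssh ops@ops-vm true to mirror the ~300 s budget used in site.yml. The 5-second interval is an arbitrary choice for illustration.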

Host Telemetry — OTel to SigNoz Cloud

The monitoring role installs otelcol-contrib v0.152.1 (the OpenTelemetry Collector contrib distribution, which bundles the host_metrics receiver) and runs it as a systemd service. It collects host metrics (CPU, memory, disk, filesystem, network, load), tags them via a resourcedetection processor (system + GCP attributes), and exports OTLP/gRPC to SigNoz Cloud at ingest.eu2.signoz.cloud:443 (TLS on, signoz-ingestion-key header). Live status: ~25k host-metric points exported, zero failures — see Telemetry Status — Flowing to SigNoz Cloud.

Why OTel replaced node_exporter

The host_metrics receiver covers the same host metrics node_exporter would, and OTel gives a single agent that Spec 2 can extend with a filelog receiver and OTLP trace intake for app logs/traces. SigNoz Cloud (eu2, managed SaaS) means nothing to self-host. The ingestion key is read from secrets/signoz_ingestion_key and templated into the collector config (restrictive permissions, not world-readable) — it is an Ansible-side secret only, never a Terraform variable.

Verifying export success

systemctl is-active otelcol-contrib is necessary but not sufficient. Confirm metrics actually flow by grepping the collector’s own :8888/metrics for otelcol_exporter_sent_metric_points (>0) and otelcol_exporter_send_failed_metric_points (0 or absent). See gotcha 9.
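
The check described above can be sketched as a small filter over the Prometheus text format the collector serves on :8888/metrics. The function name check_export is hypothetical. The sketch assumes each sample line ends with its value (no trailing timestamp) and matches metric names by prefix, so variants with a suffix such as _total would also be counted.

```shell
# check_export: reads Prometheus text-format metrics on stdin.
# Prints "sent=N failed=M" and succeeds only if N > 0 and M == 0
# (an absent failure counter counts as 0).
check_export() {
    awk '
        /^#/ { next }   # skip HELP and TYPE comment lines
        $1 ~ /^otelcol_exporter_sent_metric_points/        { sent   += $NF }
        $1 ~ /^otelcol_exporter_send_failed_metric_points/ { failed += $NF }
        END {
            printf "sent=%d failed=%d\n", sent, failed
            exit !(sent > 0 && failed == 0)
        }
    '
}
```

On the VM this would be used as curl -s localhost:8888/metrics | check_export. The exit status makes it easy to wire into a Makefile target or a smoke check.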

Passwordless Operation

A hard requirement. Three places a prompt could appear, each eliminated:

  1. SSH key passphrase: the design called for a one-time ssh-add --apple-use-keychain ~/.ssh/id_ed25519 to store the passphrase in the macOS Keychain, with UseKeychain yes + AddKeysToAgent yes in ~/.ssh/config. Deployment refined this away: switching to Tailscale SSH removes the SSH keypair entirely, so there is no passphrase to supply. See Late Refinement — Tailscale SSH Replaces Key-Based SSH.
  2. sudo password — GCP’s guest agent grants metadata-provisioned users passwordless sudo via the google-sudoers group, so Ansible become needs no password.
  3. Host-key prompt: accept-new is set in ansible.cfg, and make ssh passes -o StrictHostKeyChecking=accept-new. With Tailscale SSH, the tailnet itself authenticates the host, eliminating most of this concern in practice.

Provision / Deprovision Workflow

The Makefile is the operator interface. terraform apply / destroy prompt for confirmation (no -auto-approve); targets stop on the first error.

| Target | Action |
| --- | --- |
| make help | List all targets. |
| make preflight | Check terraform / gcloud / ansible / tailscale installed, ADC active, Tailscale connected. |
| make init / fmt / validate / plan | Terraform lifecycle helpers. |
| make provision | Full flow: vm-up then configure. |
| make vm-up | terraform apply creates the VM, writes the inventory. |
| make configure | ANSIBLE_CONFIG=… ansible-playbook site.yml (keys + telemetry + utilities). |
| make deprovision | terraform destroy tears the VM down. |
| make ssh | tailscale ssh ops@ops-vm over Tailscale. |
| make verify | Post-provision smoke check over SSH. |
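
The "installed" half of make preflight amounts to checking that each required tool resolves on PATH. A minimal sketch of that check follows. The helper name missing_tools is hypothetical, and the actual Makefile recipe is not shown in this document. The ADC and Tailscale-connectivity checks are not covered here.

```shell
# missing_tools TOOL...: prints each tool that is not found on PATH, one per line.
# Returns 0 if every tool is present, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}
```

A preflight step could call missing_tools terraform gcloud ansible-playbook tailscale and abort with the printed names if the function returns non-zero.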

Secrets Handling

  • .gitignore excludes secrets/*, *.tfvars, *.tfstate*, .terraform/, and ansible/inventory/hosts.ini.
  • GitHub deploy key — in secrets/ (gitignored), referenced by a path variable. One deploy key grants one repo; multi-repo access (Spec 2) needs a machine-user key.
  • Tailscale auth key — set in terraform.tfvars (gitignored) as a sensitive variable, injected into the VM’s startup-script metadata. Carries tag:cloud so the SSH ACL can target it. Caveat: instance metadata is readable by project viewers and on-VM processes — acceptable for a personal project; GCP Secret Manager is the Spec 2 hardening path. Use a key with a short expiry.
  • SigNoz ingestion key — in secrets/signoz_ingestion_key (gitignored), read by the monitoring role and templated into the OTel config. Never a Terraform variable, so it never lands in tfstate or instance metadata.
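
Since every secret above depends on .gitignore being correct, a quick guard against an accidentally trimmed ignore file may be useful. The sketch below is an assumption, not part of the repo: missing_ignores is a hypothetical helper that only tests for exact lines, which is much cruder than git's own pattern matching (git check-ignore is the authoritative check for a specific path).

```shell
# missing_ignores FILE PATTERN...: prints every PATTERN that does not appear
# as an exact line in FILE. Returns 1 if any pattern is missing, else 0.
missing_ignores() {
    file=$1
    shift
    status=0
    for pattern in "$@"; do
        # -F: literal match (no regex), -x: whole line, -q: quiet
        if ! grep -qxF -- "$pattern" "$file"; then
            printf '%s\n' "$pattern"
            status=1
        fi
    done
    return "$status"
}
```

For the patterns listed above, a call might look like missing_ignores .gitignore 'secrets/*' '*.tfvars' '*.tfstate*' '.terraform/' 'ansible/inventory/hosts.ini'. The quotes keep the shell from expanding the globs.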

Verification

make verify runs a post-provision smoke check over SSH:

  • tailscale status shows the node online;
  • sudo -n true succeeds (passwordless sudo);
  • docker --version and systemctl is-active docker succeed;
  • systemctl is-active reports otelcol-contrib and fail2ban as active;
  • groups includes docker;
  • curl -s localhost:8888/metrics shows otelcol_exporter_sent_metric_points > 0 with no send_failed.

End-to-end acceptance (now achieved): a clean make provision yields a Tailscale-reachable, fully-configured VM with host metrics visible in the SigNoz Cloud dashboard and no password prompt at any step; make deprovision removes every resource.
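
A smoke check like this is easiest to read when every probe reports its own result and the run fails only at the end. The sketch below shows one way to structure that. run_check and the failures counter are hypothetical names, and the real make verify recipe may be organised differently.

```shell
# Running count of failed checks; inspect it after all checks have run.
failures=0

# run_check NAME COMMAND [ARGS...]: runs COMMAND quietly and prints
# "ok   NAME" on success or "FAIL NAME" on failure, counting failures.
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'ok   %s\n' "$name"
    else
        printf 'FAIL %s\n' "$name"
        failures=$((failures + 1))
    fi
}
```

On the VM, the checks named above would become calls such as run_check "passwordless sudo" sudo -n true and run_check "docker active" systemctl is-active docker, followed by a final test that failures is still 0.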

Spec 2 Preview (out of scope here)

Spec 2 builds the application-deploy layer on this VM: pull images from Artifact Registry + run via docker compose; the Artifact Registry repo (Terraform) + a per-repo CI workflow template; app logs/traces to SigNoz via a filelog receiver + OTLP trace intake; a cloudflared role + Cloudflare ZTNA access; batch-job scheduling; and hardening (no external IP + Cloud NAT, GCP Secret Manager, tighter Tailscale ACL). Each item is additive to this layout.