For Agents

Step-by-step guide for spinning up another VM using the levandor-infra Spec 1 Terraform + Ansible stack. Read Architecture reality first — there is a real constraint on running more than one VM in parallel today. Pair this with spec-1-operations-runbook for the post-provision verification surface.

How to provision more VMs using this Spec 1 infrastructure. Use this when an agent has been asked to spin up another VM via the /Users/levander/levandor/terraform repo.

When to use this guide

Use this guide when an agent is asked to:

  • Spin up another VM via this repo.
  • Replace the current ops-vm with a differently-sized or differently-configured one.
  • Provision the same single-VM stack in a different GCP project.

If the ask is about deploying an application onto the VM, that’s a different layer — see agent-guide-configure-app-deploy.

Architecture reality (read this first)

Spec 1 is a single-VM design today. module "vm" is instantiated once from top-level variables in main.tf — there is no for_each, no count, no map of VMs. The local_file.inventory resource and the inventory.tftpl template both assume a single host.

This has a concrete consequence: there is no zero-cost path to N VMs today. Two options:

  • (a) Tear down + re-apply with new vars. Run make deprovision to destroy the current VM, change terraform.tfvars (notably vm_name, machine_type, zone), then make provision. You lose the original ops-vm — this only works if the new VM replaces it.
  • (b) Refactor main.tf to a for_each over a map(object). Promote module "vm" to for_each = var.vms, where var.vms is a map(object({ machine_type, zone, ... })). The inventory template (inventory.tftpl) must be updated to iterate over the same map and emit one inventory line per VM, and local_file.inventory becomes a for_each too. This is a small, well-bounded refactor — list it as a follow-up Spec 1.5 if a second concurrent VM is actually needed.
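
Option (a) above is a fixed sequence of make targets with one manual edit in the middle. A minimal sketch, assuming it is run from the repo root; the run helper, the DRY_RUN variable, and both function names are illustrative and not part of the repo:

```shell
# Print the command instead of running it when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Step 1: destroy the current VM. After this, the original ops-vm is gone.
teardown_current_vm() {
  run make deprovision
}

# Step 2 is manual: edit terraform.tfvars (vm_name, machine_type, zone).

# Step 3: review the plan, then provision and smoke-test the replacement.
provision_replacement_vm() {
  run make plan && run make provision && run make verify
}
```

With DRY_RUN=1 exported, calling teardown_current_vm and then provision_replacement_vm only prints the four make invocations, which is a cheap way to confirm the order before anything is destroyed.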

No middle ground

There is no terraform workspace-based or -var-file-based shortcut that gives you two VMs concurrently while keeping Terraform state coherent for both. Either (a) you have one VM, or (b) you do the refactor.

Prerequisites checklist

Before make provision:

  • Tools installed: terraform (>= 1.6), gcloud, ansible, tailscale. make preflight verifies.
  • gcloud auth:
    • gcloud auth login (user-level)
    • gcloud auth application-default login (ADC — Terraform uses this)
    • gcloud config set project <project-id> (matches var.project_id in terraform.tfvars)
  • Tailscale auth key, generated in the Tailscale admin console as:
    • Reusable — so multiple VMs can use it during the refactor described above.
    • Ephemeral — the device auto-removes from the tailnet a few minutes after destroy.
    • Pre-authorized — so the VM joins without manual admin approval.
    • Tagged with tag:cloud — must match the tag used in the SSH ACL (see below).
  • Tailnet SSH ACL must include:
    "ssh": [
      {
        "action": "accept",
        "src":    ["autogroup:member"],
        "dst":    ["tag:cloud"],
        "users":  ["ops", "root"],
      },
    ],
    See gotcha 7: autogroup:self will NOT work for tagged devices.
  • SSH public key present at ~/.ssh/id_ed25519.pub (the VM module validates this path — pathexpand() handles ~).
  • secrets/signoz_ingestion_key populated with a SigNoz Cloud ingestion key (mode 0600, gitignored).
  • terraform.tfvars filled with at minimum:
    project_id        = "<your-gcp-project-id>"
    tailscale_authkey = "tskey-auth-..."
    Other variables have sensible defaults — see Key decisions.
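
The file-based items in the checklist above can be checked locally before any cloud call. The sketch below is not what make preflight runs (that target's implementation is not shown in this guide); check_local_prereqs is a hypothetical helper, and the grep only looks for top-level key = assignments in terraform.tfvars:

```shell
# check_local_prereqs SSH_PUBKEY SIGNOZ_KEY_FILE TFVARS
# Prints one line per problem and returns non-zero if any check fails.
check_local_prereqs() {
  pubkey="$1"; secret="$2"; tfvars="$3"; rc=0

  # The SSH public key must exist and be non-empty.
  [ -s "$pubkey" ] || { echo "missing SSH public key: $pubkey"; rc=1; }

  if [ -s "$secret" ]; then
    # GNU stat uses -c, BSD/macOS stat uses -f.
    mode="$(stat -c '%a' "$secret" 2>/dev/null || stat -f '%Lp' "$secret")"
    [ "$mode" = "600" ] || { echo "wrong mode on $secret: $mode (want 600)"; rc=1; }
  else
    echo "missing or empty secret: $secret"; rc=1
  fi

  # The two variables without defaults must be assigned in terraform.tfvars.
  for key in project_id tailscale_authkey; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$tfvars" 2>/dev/null \
      || { echo "terraform.tfvars does not set $key"; rc=1; }
  done

  return "$rc"
}
```

From the repo root, a typical call would be check_local_prereqs ~/.ssh/id_ed25519.pub secrets/signoz_ingestion_key terraform.tfvars. It does not validate the Tailscale key properties or the tailnet ACL, which live in the Tailscale admin console.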

The happy path

make preflight   # verify tools, ADC, tailnet
make plan        # review resource count + diff
make provision   # vm-up + Ansible configure
make verify      # post-provision smoke check

make verify checks:

  • Tailscale is up on the VM.
  • Passwordless sudo works.
  • Docker is installed and the ops user is in the docker group.
  • docker, otelcol-contrib, and fail2ban systemd units are active.
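
If make verify fails and you want to see which check is responsible, the same conditions can be probed by hand over Tailscale SSH. This is a sketch of equivalent checks, not the actual contents of the verify target; remote_checks and verify_vm are hypothetical names, and the commands assume a systemd-based image with the ops user described above:

```shell
# remote_checks prints one command per line; each must exit 0 on a healthy VM.
# systemctl is-active succeeds if ANY listed unit is active, so the units are
# checked one at a time.
remote_checks() {
  cat <<'EOF'
sudo -n true
docker --version
id -nG ops | grep -qw docker
systemctl is-active --quiet docker
systemctl is-active --quiet otelcol-contrib
systemctl is-active --quiet fail2ban
EOF
}

# verify_vm HOST: run every check over Tailscale SSH as the ops user.
# A successful SSH session already shows that Tailscale is up on the VM.
verify_vm() {
  host="$1"
  remote_checks | while IFS= read -r check; do
    # </dev/null keeps ssh from consuming the remaining checks on stdin.
    tailscale ssh "ops@${host}" "$check" </dev/null \
      || { echo "FAILED: $check" >&2; exit 1; }
  done
}
```

Calling verify_vm ops-vm stops at the first failing check and names it on stderr, which narrows down whether the problem is sudo, Docker, or one of the systemd units.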

For deeper inspection after provision, see spec-1-operations-runbook — health checks, logs, telemetry verification.

Key decisions an agent will need to make

Most have defaults in variables.tf that you only override when there’s a reason.

| Variable | Default | Notes |
|---|---|---|
| region / zone | us-central1 / us-central1-a | The zone validation enforces that zone starts with region. |
| machine_type | e2-small (2 vCPU burstable, 2 GB RAM) | If you ever host SigNoz on the VM, you would want >= 4 GB. SigNoz is Cloud in this stack, so e2-small is fine. |
| vm_name | ops-vm | Also the Tailscale MagicDNS hostname. Change this if standing up a second VM via the refactor; names must be unique. |
| Tailscale tag | tag:cloud (convention) | Set when generating the auth key in the Tailscale admin console. Must match the dst in the SSH ACL. |
| OTel collector version | 0.152.1 in ansible/roles/monitoring/defaults/main.yml (otelcol_version) | Verify it is still current; see gotcha 10 for the v0.152 rename history. |
| enable_tailscale_direct | true | Opens udp:41641 for direct (non-DERP) Tailscale connections. Harmless to leave on: it only admits Tailscale connection-attempt traffic. |
| disk_size / disk_type | 20 GB / pd-balanced | Bump if hosting large container images or many apps. |

Provisioning in a different GCP project

  1. Change project_id in terraform.tfvars.
  2. Confirm gcloud auth application-default print-access-token works and that your ADC identity has IAM in the new project (Project Editor, or whatever narrower role set you have decided on for the network and IAM resources this stack creates).
  3. Run make plan rather than relying on terraform validate: terraform validate does not talk to GCP, whereas make plan fails clearly if ADC cannot reach the project.
  4. APIs (compute.googleapis.com, iam.googleapis.com, artifactregistry.googleapis.com) auto-enable on first apply via google_project_service.this — but watch gotcha 6 (the explicit depends_on is already in main.tf).
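
A common slip when switching projects is changing project_id in terraform.tfvars while gcloud still points at the old project. The sketch below compares the two; tfvars_value and check_project_matches are hypothetical helpers, and the sed pattern assumes project_id is assigned as a plain double-quoted string on one line:

```shell
# tfvars_value KEY FILE: print the double-quoted string assigned to KEY.
tfvars_value() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$2" | head -n 1
}

# check_project_matches FILE: compare project_id in FILE with the active
# gcloud project (gcloud config get-value project).
check_project_matches() {
  want="$(tfvars_value project_id "$1")"
  have="$(gcloud config get-value project 2>/dev/null)"
  if [ -n "$want" ] && [ "$want" = "$have" ]; then
    echo "ok: gcloud and terraform.tfvars both target $want"
  else
    echo "mismatch: tfvars=$want gcloud=$have"
    return 1
  fi
}
```

Run check_project_matches terraform.tfvars after step 1. A mismatch is fixed with gcloud config set project <project-id>; it does not by itself prove that ADC has the required IAM, which only make plan will show.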

Terraform state is local

The current setup keeps terraform.tfstate on the operator workstation. Two operators provisioning into the same project will diverge. The fix (GCS backend) is documented in gcp-vm-provisioning-design §7. If you’re doing this against shared infra, do the GCS backend migration first.

Tearing it down

make deprovision

This destroys: VM, boot disk, subnet, VPC, firewall rule(s), service account, IAM bindings.

  • The ephemeral Tailscale node auto-removes from the tailnet a few minutes after destroy — no manual revoke needed.
  • The Tailscale auth key was already consumed on first join; if it was single-use, it’s already dead. If it was reusable, leaving it valid is fine (rotate it on a schedule via the admin console).
  • The SigNoz ingestion key is unaffected.
  • APIs are not disabled on destroy (disable_on_destroy = false — see gotcha 1).
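
To confirm the destroy left nothing behind in the local state, terraform state list should print no resource addresses. A minimal sketch, assuming it is run from the same directory in which the Makefile invokes terraform (this guide does not say which directory that is); state_is_empty is a hypothetical helper:

```shell
# state_is_empty: succeed only if Terraform tracks no resources.
# Returns 1 if resources remain, 2 if terraform itself fails
# (for example, when no state file is found in this directory).
state_is_empty() {
  remaining="$(terraform state list)" || return 2
  if [ -n "$remaining" ]; then
    echo "still in state:"
    echo "$remaining"
    return 1
  fi
  echo "state is empty"
}
```

Anything it lists after make deprovision is a resource Terraform still believes exists, which is worth resolving before the next make provision.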

Common pitfalls

The full reference is gcp-terraform-ansible-gotchas. The three first-time-bite items most likely to hit a fresh provision:

  1. ACL tagged-device mismatch (gotcha 7). If tailscale ssh ops@<vm-name> returns permission-denied immediately after the VM joins, the ACL almost certainly says dst: ["autogroup:self"] when it must say dst: ["tag:cloud"].
  2. Ansible gather_facts ordering (gotcha 8). If make configure fails on the very first task against a freshly-booted VM with UNREACHABLE, the play is missing gather_facts: false + an explicit setup after wait_for_connection. The Spec 1 play already has this fix; preserve it on any refactor.
  3. OTel deprecation aliases (gotcha 10). If you bump otelcol_version to a newer release and copy a config snippet from an old tutorial, you will get deprecation warnings (or eventually errors). New names are host_metrics and otlp_grpc.

What this guide does NOT cover

  • Running multiple VMs simultaneously — requires the for_each refactor described in Architecture reality. Open a Spec 1.5 task before doing this.
  • Deploying applications onto the VM — Spec 2 territory. See agent-guide-configure-app-deploy for what’s possible today (manual) and what’s coming.
  • Moving Terraform state to GCS — covered as a deferred item in gcp-vm-provisioning-design §7. Do this before any multi-operator workflow.