Step-by-step guide for spinning up another VM using the levandor-infra Spec 1 Terraform + Ansible stack. Read Architecture reality first — there is a real constraint on running more than one VM in parallel today. Pair this with spec-1-operations-runbook for the post-provision verification surface.
How to provision more VMs using this Spec 1 infrastructure. Use this when an agent has been asked to spin up another VM via the /Users/levander/levandor/terraform repo.
When to use this guide
Use this guide when an agent is asked to:
Spin up another VM via this repo.
Replace the current ops-vm with a differently-sized or differently-configured one.
Provision the same single-VM stack in a different GCP project.
Architecture reality
Spec 1 is a single-VM design today. module "vm" is instantiated once from top-level variables in main.tf: there is no for_each, no count, and no map of VMs. The local_file.inventory resource and the inventory.tftpl template both assume a single host.
This has a concrete consequence: there is no zero-cost path to N VMs today. Two options:
(a) Tear down + re-apply with new vars. Run make deprovision to destroy the current VM, change terraform.tfvars (notably vm_name, machine_type, zone), then make provision. You lose the original ops-vm — this only works if the new VM replaces it.
(b) Refactor main.tf to a for_each over a map(object). Promote module "vm" to for_each = var.vms, where var.vms is a map(object({ machine_type, zone, ... })). The inventory template (inventory.tftpl) must be updated to iterate over the same map and emit one inventory line per VM, and local_file.inventory becomes a for_each too. This is a small, well-bounded refactor — list it as a follow-up Spec 1.5 if a second concurrent VM is actually needed.
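As a rough sketch of option (b), the variable and module call could look something like the following. The module path (./modules/vm) and the module input names (vm_name, machine_type, zone) are assumptions for illustration, not taken from the repo; the real attribute list would mirror whatever main.tf passes to module "vm" today.

```hcl
# Hypothetical shape for var.vms, keyed by VM name.
variable "vms" {
  description = "VMs to provision, keyed by unique VM name."
  type = map(object({
    machine_type = string
    zone         = string
  }))
}

module "vm" {
  source   = "./modules/vm" # assumed path
  for_each = var.vms

  # Assumed input names. each.key doubles as the unique VM name.
  vm_name      = each.key
  machine_type = each.value.machine_type
  zone         = each.value.zone
}
```

With a shape like this, each instance is addressed by its map key, for example module.vm["ops-vm"], so any outputs consumed by the inventory template would need to be collected per key.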
No middle ground
There is no terraform workspace-based or -var-file-based shortcut that gives you two VMs concurrently while keeping Terraform state coherent for both. Either (a) you have one VM, or (b) you do the refactor.
Prerequisites checklist
Before make provision:
Tools installed: terraform (>= 1.6), gcloud, ansible, tailscale. make preflight verifies these.
Docker is installed and the ops user is in the docker group.
docker, otelcol-contrib, and fail2ban systemd units are active.
For deeper inspection after provision, see spec-1-operations-runbook — health checks, logs, telemetry verification.
Key decisions an agent will need to make
Most have defaults in variables.tf that you only override when there’s a reason.
| Variable | Default | Notes |
| --- | --- | --- |
| region / zone | us-central1 / us-central1-a | Validation enforces that zone starts with region. |
| machine_type | e2-small (2 vCPU burstable, 2 GB RAM) | If you ever host SigNoz on the VM, you’d want >= 4 GB, but SigNoz is Cloud in this stack, so e2-small is fine. |
| vm_name | ops-vm | Also the Tailscale MagicDNS hostname. Change this if standing up a second VM via the refactor; names must be unique. |
| Tailscale tag | tag:cloud (convention) | Set when generating the auth key in the Tailscale admin console. Must match the dst in the SSH ACL. |
| OTel collector version | 0.152.1 in ansible/roles/monitoring/defaults/main.yml (otelcol_version) | Verify it’s still current; see gotcha 10 for the v0.152 rename history. |
| enable_tailscale_direct | true | Opens udp:41641 for direct (non-DERP) Tailscale connections. Harmless to leave on: it only admits Tailscale connection-attempt traffic. |
| disk_size / disk_type | 20 GB / pd-balanced | Bump if hosting large container images or many apps. |
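For orientation, a terraform.tfvars that spells out the defaults above might look like the sketch below. The project ID is a placeholder, and the assumption that disk_size is a plain number of GB is inferred from the table rather than confirmed against variables.tf.

```hcl
# terraform.tfvars (illustrative values only)
project_id   = "my-gcp-project" # hypothetical project ID
region       = "us-central1"
zone         = "us-central1-a" # must start with the region
machine_type = "e2-small"
vm_name      = "ops-vm" # also the Tailscale MagicDNS hostname

enable_tailscale_direct = true # opens udp:41641
disk_size               = 20   # assumed to be a number of GB
disk_type               = "pd-balanced"
```

In practice you would list only the variables you are overriding and let variables.tf supply the rest.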
Provisioning in a different GCP project
Change project_id in terraform.tfvars.
Confirm gcloud auth application-default print-access-token works and your ADC has IAM in the new project (Project Editor or the narrower [Owner of newly-created network/IAM resources] you’ve decided on).
terraform validate does not talk to GCP. make plan will fail clearly if ADC can’t reach the project.
APIs (compute.googleapis.com, iam.googleapis.com, artifactregistry.googleapis.com) auto-enable on first apply via google_project_service.this — but watch gotcha 6 (the explicit depends_on is already in main.tf).
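The API auto-enable pattern described in the last step can be sketched roughly as follows. Only the resource address google_project_service.this, the three API names, and disable_on_destroy = false come from this guide; the use of for_each, the network resource, and its names are assumptions about how main.tf might be laid out.

```hcl
resource "google_project_service" "this" {
  # Assumed: one instance per API via for_each.
  for_each = toset([
    "compute.googleapis.com",
    "iam.googleapis.com",
    "artifactregistry.googleapis.com",
  ])

  project            = var.project_id
  service            = each.value
  disable_on_destroy = false # APIs stay enabled after destroy
}

# Hypothetical downstream resource showing the explicit ordering.
resource "google_compute_network" "vpc" {
  project                 = var.project_id
  name                    = "ops-vpc" # hypothetical name
  auto_create_subnetworks = false

  # Wait for the APIs to be enabled before creating anything in them.
  depends_on = [google_project_service.this]
}
```

The explicit depends_on matters because nothing in the network resource references the API resource, so Terraform would not otherwise infer the ordering.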
Terraform state is local
The current setup keeps terraform.tfstate on the operator workstation. Two operators provisioning into the same project will diverge. The fix (GCS backend) is documented in gcp-vm-provisioning-design §7. If you’re doing this against shared infra, do the GCS backend migration first.
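A GCS backend block generally takes the shape below. This is a minimal sketch, not the design from gcp-vm-provisioning-design §7: the bucket name and prefix are hypothetical, and the bucket must already exist before it can be used as a backend.

```hcl
terraform {
  backend "gcs" {
    bucket = "example-tfstate-bucket" # hypothetical, must be created beforehand
    prefix = "spec-1/ops-vm"          # hypothetical state path inside the bucket
  }
}
```

After adding a backend block, running terraform init -migrate-state offers to copy the existing local state into the bucket.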
Tearing it down
make deprovision
This destroys: VM, boot disk, subnet, VPC, firewall rule(s), service account, IAM bindings.
The ephemeral Tailscale node auto-removes from the tailnet a few minutes after destroy — no manual revoke needed.
The Tailscale auth key was already consumed on first join; if it was single-use, it’s already dead. If it was reusable, leaving it valid is fine (rotate it on a schedule via the admin console).
The SigNoz ingestion key is unaffected.
APIs are not disabled on destroy (disable_on_destroy = false — see gotcha 1).
Common pitfalls
The full reference is gcp-terraform-ansible-gotchas. The three first-time-bite items most likely to hit a fresh provision:
ACL tagged-device mismatch — gotcha 7. If tailscale ssh ops@<vm-name> returns permission-denied immediately after the VM joins, the ACL almost certainly says dst: ["autogroup:self"] when it must say dst: ["tag:cloud"].
Ansible gather_facts ordering — gotcha 8. If make configure fails on the very first task against a freshly-booted VM with UNREACHABLE, the play is missing gather_facts: false + an explicit setup after wait_for_connection. The Spec 1 play already has this fix — preserve it on any refactor.
OTel deprecation aliases — gotcha 10. If you bump otelcol_version to a newer release and copy a config snippet from an old tutorial, you’ll get deprecation warnings (or eventually errors). New names are host_metrics and otlp_grpc.
What this guide does NOT cover
Running multiple VMs simultaneously — requires the for_each refactor described in Architecture reality. Open a Spec 1.5 task before doing this.
Deploying applications onto the VM — Spec 2 territory. See agent-guide-configure-app-deploy for what’s possible today (manual) and what’s coming.
Moving Terraform state to GCS — covered as a deferred item in gcp-vm-provisioning-design §7. Do this before any multi-operator workflow.