Agent Guide — Provision a New VM

For LLM agents

How to provision a fresh VM using the levandor-infra Terraform + Ansible stack. Architecture reality: this is a single-VM design — module "vm" is instantiated ONCE in main.tf with no for_each. The most common use is replacing the existing polymarket-infra VM (after make deprovision, edit tfvars, make provision). Running two concurrent VMs requires a small refactor. For teardown details see agent-guide-deprovision-vm.

Use this guide when an agent has been asked to spin up another VM via the /Users/levander/levandor/terraform repo, or to recreate the current polymarket-infra VM after a destroy.

When to use this guide

Use this guide when an agent is asked to:

Spin up another VM via this repo.
Replace the current polymarket-infra VM (formerly ops-vm) with a differently-sized or differently-configured one.
Provision the same single-VM stack in a different GCP project.

If the ask is about deploying an application onto the VM, that’s a different layer — see agent-guide-configure-app-deploy.

Architecture reality (read this first)

Single-VM design today. module "vm" is instantiated once from top-level variables in main.tf — no for_each, no count, no map of VMs. The local_file.inventory resource and the inventory.tftpl template both assume a single host.

The Spec 2a/2b stack (AR repo, WIF pool, CI service account, apps role) is also wired to that single VM. That leaves two ways to get a different VM:

(a) Tear down + re-apply with new vars. Run make deprovision to destroy the current VM, change terraform.tfvars (notably vm_name, machine_type, zone), then make provision. You lose the original polymarket-infra — this only works if the new VM replaces it. See agent-guide-deprovision-vm for the destroy ritual + GCP soft-delete + Tailscale name collision gotchas.
(b) Refactor main.tf to a for_each over a map(object). Promote module "vm" to for_each = var.vms, where var.vms is a map(object({ machine_type, zone, ... })). The inventory template (inventory.tftpl) must be updated to iterate over the same map and emit one inventory line per VM, and local_file.inventory becomes a for_each too. The apps role's hosts: vm group would need to expand as well (or stay single-target if only one VM hosts apps). This is a small but real refactor; track it as Spec 1.5 if needed.
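As a rough illustration of option (b), the refactor might start like the sketch below. Only var.vms and the for_each shape come from the description above. The module source path, the module's input names (vm_name, machine_type, zone) and the example default entry are assumptions, not taken from the repo.

```hcl
# Hypothetical sketch of option (b). Names marked "assumed" are illustrative.
variable "vms" {
  description = "VMs to create, keyed by VM name."
  type = map(object({
    machine_type = string
    zone         = string
    # Add further per-VM attributes here as the module requires.
  }))
  default = {
    "polymarket-infra" = { machine_type = "e2-small", zone = "europe-west3-a" }
  }
}

module "vm" {
  source   = "./modules/vm" # assumed path
  for_each = var.vms        # one module instance per map entry

  vm_name      = each.key                # assumed input name
  machine_type = each.value.machine_type # assumed input name
  zone         = each.value.zone         # assumed input name
}
```

With this shape, each instance is addressed as module.vm["<name>"], so existing state for the single VM would need a terraform state mv (or a moved block) to avoid a destroy and recreate. The inventory template and the apps role still need the changes described in (b).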

No middle ground

There is no terraform workspace-based or -var-file-based shortcut that gives you two VMs concurrently while keeping Terraform state coherent for both. Either (a) you have one VM, or (b) you do the refactor.

Prerequisites checklist

Before make provision:

Tools installed: terraform (>= 1.6), gcloud, ansible, tailscale. make preflight verifies.
gcloud auth:
- gcloud auth login (user-level)
- gcloud auth application-default login (ADC — Terraform uses this)
- gcloud config set project <project-id> (matches var.project_id in terraform.tfvars)
Tailscale auth key, generated in the Tailscale admin console as:
- Reusable — so multiple VMs can use it during the refactor described above.
- Ephemeral — the device auto-removes from the tailnet a few minutes after destroy.
- Pre-authorized — so the VM joins without manual admin approval.
- Tagged with tag:cloud — must match the tag used in the SSH ACL (see below).
Tailnet SSH ACL must include (at minimum) operator access to tag:cloud:
```
"ssh": [
  {
    "action": "accept",
    "src":    ["autogroup:member"],
    "dst":    ["tag:cloud"],
    "users":  ["ops", "root"],
  },
  {
    "action": "accept",
    "src":    ["tag:ci"],
    "dst":    ["tag:cloud"],
    "users":  ["ops"],
  },
],
```
Two entries: one for the operator (your laptop), one for GH Actions CI (tag:ci) so the CD workflow can deploy. See gotcha 7 — autogroup:self will NOT work for tagged devices.

Also in the acls: block, allow tag:ci network access to port 22:
```
{
  "action": "accept",
  "src":    ["tag:ci"],
  "dst":    ["tag:cloud:22"],
},
```
And tagOwners must list tag:ci (otherwise OAuth clients can’t mint keys for it — see gotcha 20).
SSH public key present at ~/.ssh/id_ed25519.pub (the VM module validates this path — pathexpand() handles ~).
secrets/signoz_ingestion_key populated with a SigNoz Cloud ingestion key (mode 0600, gitignored).
terraform.tfvars filled with at minimum:
```
project_id        = "<your-gcp-project-id>"
tailscale_authkey = "tskey-auth-..."
```
Other variables have sensible defaults — see Key decisions.
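The SSH key check mentioned in the checklist could look roughly like the sketch below. The variable name ssh_public_key_path is hypothetical; the guide only states that the VM module validates the path and that pathexpand() handles ~.

```hcl
# Hypothetical variable name; the real module may call it something else.
variable "ssh_public_key_path" {
  description = "Path to the operator's SSH public key."
  type        = string
  default     = "~/.ssh/id_ed25519.pub"

  validation {
    # pathexpand() turns a leading ~ into the current user's home directory.
    condition     = fileexists(pathexpand(var.ssh_public_key_path))
    error_message = "No SSH public key found at the given path."
  }
}
```

A check of this kind fails at plan time, before any GCP resources are touched.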

The happy path

```
make preflight   # verify tools, ADC, tailnet
make plan        # review resource count + diff
make provision   # vm-up + Ansible configure
make verify      # post-provision smoke check
```

make verify checks:

Tailscale is up on the VM.
Passwordless sudo works.
Docker is installed and the ops user is in the docker group.
docker, otelcol-contrib, and fail2ban systemd units are active.

For deeper inspection after provision, see spec-1-operations-runbook — health checks, logs, telemetry verification.

Key decisions an agent will need to make

Most have defaults in variables.tf that you only override when there’s a reason.

| Variable | Default | Notes |
| --- | --- | --- |
| `region` / `zone` | `europe-west3` / `europe-west3-a` | The zone validation enforces that `zone` starts with `region`. (Was `us-central1` until the 2026-05-27 EU migration; current default is EU per Polymarket geofence.) |
| `machine_type` | `e2-small` (2 vCPU burstable, 2 GB RAM) | Comfortable for the current 4-service deploy (~650 MB used, ~1.3 GB available). Bump to `e2-medium` (4 GB) if memory headroom drops or if you stack more apps. |
| `vm_name` | `polymarket-infra` | Also the Tailscale MagicDNS hostname. Renamed from `ops-vm` on 2026-05-27; changing it requires VM destroy + recreate (GCE names are immutable). |
| `ci_github_repos` | `["wowjeeez/polymarket-fetch"]` | List of `owner/repo` that get a WIF binding to impersonate `ci-pusher` for AR pushes. Validation regex: `^[^/]+/[^/]+$`. |
| Tailscale tag | `tag:cloud` (convention) | Set when generating the auth key in the Tailscale admin console. Must match the `dst` in the SSH ACL. |
| OTel collector version | `0.152.1` in `ansible/roles/monitoring/defaults/main.yml` (`otelcol_version`) | Verify it's still current; see gotcha 10 for the v0.152 rename history. |
| `enable_tailscale_direct` | `true` | Opens `udp:41641` for direct (non-DERP) Tailscale connections. Harmless to leave on: Tailscale connection-attempt traffic only. |
| `disk_size` / `disk_type` | 20 GB / `pd-balanced` | Bump if hosting large container images or many apps. |
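The two validations mentioned in the table could be written along these lines. This is a sketch under assumptions, not the repo's actual variables.tf: the defaults and the regex come from the table, while the error messages and the exact form of the zone check are illustrative.

```hcl
variable "region" {
  type    = string
  default = "europe-west3"
}

variable "zone" {
  type    = string
  default = "europe-west3-a"

  validation {
    # Referencing another variable inside a validation block needs
    # Terraform 1.9 or later; on older versions the same condition can
    # live in a resource precondition instead.
    condition     = startswith(var.zone, "${var.region}-")
    error_message = "zone must be a zone inside region, for example europe-west3-a."
  }
}

variable "ci_github_repos" {
  type    = list(string)
  default = ["wowjeeez/polymarket-fetch"]

  validation {
    # Every entry must look like owner/repo with exactly one slash.
    condition = alltrue([
      for repo in var.ci_github_repos : can(regex("^[^/]+/[^/]+$", repo))
    ])
    error_message = "Each entry must be in owner/repo form."
  }
}
```

Both checks run during plan, so a mismatched zone or a malformed repo name is rejected before anything is created.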

Provisioning in a different GCP project

Change project_id in terraform.tfvars.
Confirm gcloud auth application-default print-access-token works and that your ADC identity has IAM in the new project (Project Editor, or whichever narrower set of roles you've settled on for creating the network and IAM resources).
terraform validate does not talk to GCP. make plan will fail clearly if ADC can’t reach the project.
APIs (compute.googleapis.com, iam.googleapis.com, artifactregistry.googleapis.com) auto-enable on first apply via google_project_service.this — but watch gotcha 6 (the explicit depends_on is already in main.tf).
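For orientation, API auto-enablement with an explicit dependency typically looks something like the sketch below. The resource address google_project_service.this and the three API names come from this guide. The for_each form, disable_on_destroy = false and the module source path are assumptions about how main.tf is written.

```hcl
resource "google_project_service" "this" {
  for_each = toset([
    "compute.googleapis.com",
    "iam.googleapis.com",
    "artifactregistry.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  # Assumed setting: leave the APIs enabled when the stack is destroyed.
  disable_on_destroy = false
}

module "vm" {
  source = "./modules/vm" # assumed path
  # ... other inputs elided ...

  # Make sure the APIs are enabled before any VM resource is created.
  depends_on = [google_project_service.this]
}
```

Without the depends_on, Terraform may try to create compute resources in a fresh project before the Compute API has finished enabling.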

Terraform state is local

The current setup keeps terraform.tfstate on the operator workstation. Two operators provisioning into the same project will diverge. The fix (GCS backend) is documented in gcp-vm-provisioning-design §7. If you’re doing this against shared infra, do the GCS backend migration first.
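A GCS backend block generally takes the following shape. The bucket name and prefix here are hypothetical placeholders; the design doc referenced above is the authority on the intended values.

```hcl
terraform {
  backend "gcs" {
    bucket = "example-tfstate-bucket" # hypothetical; the bucket must already exist
    prefix = "levandor-infra"         # hypothetical path prefix inside the bucket
  }
}
```

After adding a block like this, terraform init -migrate-state offers to copy the existing local terraform.tfstate into the bucket.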

Tearing it down

See the dedicated guide: agent-guide-deprovision-vm. It covers backup, the GCP soft-delete trap on the WIF pool/provider, and Tailscale node cleanup.

Common pitfalls

The full reference is gcp-terraform-ansible-gotchas. The three items most likely to bite a first-time provision:

ACL tagged-device mismatch — gotcha 7. If tailscale ssh ops@<vm-name> returns permission-denied immediately after the VM joins, the ACL almost certainly says dst: ["autogroup:self"] when it must say dst: ["tag:cloud"].
Ansible gather_facts ordering — gotcha 8. If make configure fails on the very first task against a freshly-booted VM with UNREACHABLE, the play is missing gather_facts: false + an explicit setup after wait_for_connection. The Spec 1 play already has this fix — preserve it on any refactor.
OTel deprecation aliases — gotcha 10. If you bump otelcol_version to a newer release and copy a config snippet from an old tutorial, you’ll get deprecation warnings (or eventually errors). New names are host_metrics and otlp_grpc.

What this guide does NOT cover

Running multiple VMs simultaneously — requires the for_each refactor described in Architecture reality. Open a Spec 1.5 task before doing this.
Deploying applications onto the VM — Spec 2 territory. See agent-guide-configure-app-deploy for what’s possible today (manual) and what’s coming.
Moving Terraform state to GCS — covered as a deferred item in gcp-vm-provisioning-design §7. Do this before any multi-operator workflow.
