Spec 1 Operations Runbook (ops-vm)

Day-2 operating instructions

Day-2 operations for the live ops-vm deployed under levandor-infra Spec 1: concrete commands and locations for running, inspecting, resizing, and tearing down the VM. All commands assume the operator workstation has the tailnet up and gcloud ADC (Application Default Credentials) configured. Pair this with gcp-terraform-ansible-gotchas when something looks off.

Access

The VM is Tailscale-only — no public SSH port anywhere. Two equivalent ways in, both passwordless:

ssh ops@ops-vm
tailscale ssh ops@ops-vm

The hostname ops-vm is Tailscale MagicDNS — no /etc/hosts entry needed.
Auth is Tailscale SSH — the VM-side tailscaled mints short-lived SSH credentials from the tailnet identity. No SSH keypair, no ssh-agent, no Keychain ceremony.
The make ssh target in the repo wraps the same call.

Why no port

The dedicated VPC has zero inbound rules other than udp:41641 for faster Tailscale direct connections. There is no tcp:22 ingress at all — the public internet cannot reach SSH on this VM by design.

Health Checks

Quick smoke checks for the three critical layers (systemd services, tailnet membership, telemetry flow):

systemctl is-active docker otelcol-contrib fail2ban
tailscale status
curl -s localhost:8888/metrics | grep otelcol_exporter_sent_metric_points

Expected:

All three units return active.
tailscale status lists the VM under tag:cloud with the operator workstation reachable.
The Prometheus-format counter otelcol_exporter_sent_metric_points is >0 and increasing between two reads a minute apart.
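
The "two reads a minute apart" check can be scripted. The following is a minimal sketch, not part of the repo: the helper names `sum_sent_points` and `is_increasing` are hypothetical, and it assumes the collector's self-metrics are served in Prometheus text format on localhost:8888 as in the curl command above. The parser reads from stdin, so it can be exercised without a live collector.

```shell
# Hypothetical helper: sum every otelcol_exporter_sent_metric_points sample
# found in Prometheus text-format input on stdin. Comment lines (# HELP,
# # TYPE) are skipped because they do not start with the metric name.
sum_sent_points() {
  awk '
    /^otelcol_exporter_sent_metric_points/ { total += $NF }
    END { printf "%.0f\n", total }
  '
}

# Hypothetical helper: succeed only if the second reading is positive
# and strictly larger than the first.
is_increasing() {
  [ "$2" -gt 0 ] && [ "$2" -gt "$1" ]
}

# Usage on the VM:
#   a=$(curl -s localhost:8888/metrics | sum_sent_points)
#   sleep 60
#   b=$(curl -s localhost:8888/metrics | sum_sent_points)
#   is_increasing "$a" "$b" && echo "telemetry flowing" || echo "counter stalled"
```

Summing across all label sets keeps the check simple when more than one exporter is configured; a stalled exporter hidden behind a healthy one would need a per-exporter comparison instead.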

systemctl alone is not enough for OTel

A green systemctl is-active otelcol-contrib does not prove metrics are reaching SigNoz. Always confirm with the :8888/metrics counter — see gotcha 9.

Common Operations

Re-apply Ansible configuration

After editing a role or group_vars/, push the change to the live VM:

cd ~/levandor/terraform
make configure

This re-runs ansible-playbook against the live host via Tailscale SSH. The run is idempotent: every task is evaluated, but only tasks whose target state has drifted make changes.

Resize the VM

Bumping machine type (e.g. for Spec 2 container load):

1. Edit terraform.tfvars and set machine_type = "e2-medium" (or whatever the target is).
2. Run make plan and review the diff. A machine-type change requires a stop / start; Terraform performs it automatically only when the google_compute_instance resource sets allow_stopping_for_update = true.
3. Run make vm-up to apply the change. Expect a brief outage during the restart.

Expect Tailscale to reconnect the device on boot using the node state already stored on the VM's disk. No manual intervention or new auth key is needed.
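
The resize steps can be sketched as a short shell sequence. This is an illustration under stated assumptions, not a repo script: `set_machine_type` is a hypothetical helper, and it assumes terraform.tfvars holds a single line beginning with `machine_type`. If no such line exists, the file is left unchanged.

```shell
# Hypothetical helper: rewrite the machine_type line in a tfvars file.
# Writes through a temp file rather than `sed -i`, whose syntax differs
# between GNU sed and the BSD sed shipped with macOS.
set_machine_type() {
  tfvars="$1"
  new_type="$2"
  sed "s|^machine_type[[:space:]]*=.*|machine_type = \"$new_type\"|" "$tfvars" > "$tfvars.tmp" \
    && mv "$tfvars.tmp" "$tfvars"
}

# Usage from the repo root:
#   cd ~/levandor/terraform
#   set_machine_type terraform.tfvars e2-medium
#   make plan     # review the diff before applying
#   make vm-up    # applies the change; brief outage during the restart
```

Editing the file by hand is equally valid; the helper only makes the change repeatable.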

Tear down

Full deprovision (removes VM, disk, VPC, firewall rules, service account, the lot):

make deprovision

Run make preflight afterward to confirm the project is back to a clean state. The Tailscale auth key is single-use and was consumed when the VM joined — no key revocation needed on teardown.

Logs & Debugging

OTel collector (telemetry pipeline):

sudo journalctl -u otelcol-contrib -f

fail2ban (SSH ban events — none expected on a Tailscale-only VM, but the unit is up as defense in depth):

sudo journalctl -u fail2ban -f

Docker containers (once Spec 2 lands app workloads):

docker ps
docker logs -f <container>

Where Things Live

| Item | Location |
| --- | --- |
| Secrets (auth keys, etc.) | `~/levandor/terraform/secrets/` (gitignored, mode `0600`) |
| Terraform state | `~/levandor/terraform/terraform.tfstate` (local, gitignored) |
| OTel collector config (on VM) | `/etc/otelcol-contrib/config.yaml` (owner `root:otelcol-contrib`, mode `0640`) |
| Ansible inventory (Terraform-generated) | `~/levandor/terraform/ansible/inventory/hosts.ini` |
| Repo root | `/Users/levander/levandor/terraform` |

Cost Monitoring

Billing console:

https://console.cloud.google.com/billing/projects/aerobic-tesla-490112-r3

Open the Reports tab for the per-resource breakdown.
GCP billing data lags by roughly 24-48 hours, so the current day's spend is not visible until tomorrow at the earliest.
Estimated steady-state cost: **~ $18/mo** at the current `e2-small` + 20 GB pd-balanced + ephemeral external IPv4 sizing. See [[spec-1-deployment-complete#cost--18--month|Cost — ~$ 18 / month]] for the breakdown.

Common Gotchas

The full reference is gcp-terraform-ansible-gotchas (eleven entries). The three most likely to bite on a re-run:

Tailscale SSH ACL must target the device tag, not autogroup:self — see gotcha 7. If tailscale ssh ops@ops-vm returns permission-denied, this is almost certainly the cause.
Ansible gather_facts ordering — see gotcha 8. If make configure against a fresh VM fails on the very first task with UNREACHABLE, the play is missing gather_facts: false plus an explicit ansible.builtin.setup after wait_for_connection.
OTel v0.152 receiver/exporter renames — see gotcha 10. If the collector starts with deprecation warnings and you copied config from older docs, the names are now host_metrics and otlp_grpc.

Secret Rotation

| Secret | Rotation procedure | Frequency |
| --- | --- | --- |
| Tailscale auth key | Single-use per machine; the key was consumed when `ops-vm` joined. Revoke in the Tailscale admin console (`https://login.tailscale.com/admin/settings/keys`) once the VM is up; no recurring rotation required while the device is alive. | One-shot, post-join |
| SigNoz ingestion key | Rotate in SigNoz Cloud → Settings → Ingestion Keys. Generate a new key, update `secrets/signoz-ingestion-key` locally, run `make configure` (which re-renders `/etc/otelcol-contrib/config.yaml`), confirm `:8888/metrics` shows new `sent_metric_points` after restart, then revoke the old key. | At a convenient cadence; no immediate driver |
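
The local half of the SigNoz rotation can be sketched as follows. `write_secret` is a hypothetical helper, not a repo target, and the trailing newline it writes is an assumption about what the Ansible role tolerates. The key value shown in the usage is a placeholder.

```shell
# Hypothetical helper: store a secret with owner-only permissions.
# The subshell's umask keeps a newly created file from ever being
# group/world readable; chmod covers a file that already existed.
write_secret() {
  secret_path="$1"
  new_key="$2"
  ( umask 077 && printf '%s\n' "$new_key" > "$secret_path" ) || return 1
  chmod 600 "$secret_path"
}

# Usage from the repo root, after generating the new key in SigNoz Cloud:
#   cd ~/levandor/terraform
#   write_secret secrets/signoz-ingestion-key "<new-key>"
#   make configure
#   ssh ops@ops-vm 'curl -s localhost:8888/metrics | grep otelcol_exporter_sent_metric_points'
#   # Revoke the old key only after the counter is increasing again.
```

Revoking last keeps telemetry flowing if the new key turns out to be wrong, since the old key stays valid until the check passes.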
