Day-2 operating instructions
Concrete commands and locations for running, inspecting, resizing, and tearing down the live
ops-vm from Spec 1. Pair this with gcp-terraform-ansible-gotchas when something looks off.
Day-2 operations for the live ops-vm deployed under levandor-infra Spec 1. All commands assume the operator workstation has the tailnet up and gcloud ADC configured.
Access
The VM is Tailscale-only — no public SSH port anywhere. Two equivalent ways in, both passwordless:
ssh ops@ops-vm
tailscale ssh ops@ops-vm
- The hostname `ops-vm` is Tailscale MagicDNS; no `/etc/hosts` entry is needed.
- Auth is Tailscale SSH: the VM-side `tailscaled` mints short-lived SSH credentials from the tailnet identity. No SSH keypair, no `ssh-agent`, no Keychain ceremony.
- The `make ssh` target in the repo wraps the same call.
Why no port
The dedicated VPC has zero inbound rules other than `udp:41641` for faster Tailscale direct connections. There is no `tcp:22` ingress at all, so the public internet cannot reach SSH on this VM by design.
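A minimal sketch of how that claim could be audited, under assumptions. The function reads tab-separated firewall rows (name, direction, allowed) on stdin and prints any ingress rule that allows something other than `udp:41641`. The `gcloud` invocation in the comment and the `NETWORK` placeholder are assumptions and are not taken from the repo.

```shell
# Sketch: flag ingress rules other than the Tailscale UDP port.
# Assumed input, one rule per line, tab-separated: name, direction, allowed.
# One possible source (NETWORK is a placeholder for the VPC name):
#   gcloud compute firewall-rules list --filter="network:NETWORK" \
#     --format="value(name,direction,allowed[].map().firewall_rule().list())"
unexpected_ingress() {
  awk -F'\t' '
    $2 == "INGRESS" && $3 != "udp:41641" { print $1; bad = 1 }
    END { exit bad }
  '
}
```

The function prints nothing and exits 0 when the only ingress rule is the Tailscale one; any other ingress rule is printed by name and the exit status is non-zero.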
Health Checks
Quick smoke checks for the three critical layers (systemd services, tailnet membership, telemetry flow):
systemctl is-active docker otelcol-contrib fail2ban
tailscale status
curl -s localhost:8888/metrics | grep otelcol_exporter_sent_metric_points
Expected:
- All three units return `active`.
- `tailscale status` lists the VM under `tag:cloud` with the operator workstation reachable.
- The Prometheus-format counter `otelcol_exporter_sent_metric_points` is >0 and increasing between two reads a minute apart.
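The "two reads a minute apart" check can be scripted. The sketch below is one possible shape, run on the VM; the function names are hypothetical, and it assumes the endpoint emits plain Prometheus text with the sample value as the last field on each line.

```shell
# Sum every sample of the sent-metric-points counter from Prometheus text on stdin.
# Comment lines (# HELP / # TYPE) start with "#", so they never match the pattern.
sum_sent_points() {
  awk '/^otelcol_exporter_sent_metric_points/ { sum += $NF }
       END { printf "%.0f\n", sum }'
}

# Succeeds only if the second reading is positive and larger than the first.
is_flowing() {
  [ "$2" -gt 0 ] && [ "$2" -gt "$1" ]
}

# Two reads a minute apart. Defined here but not called automatically.
check_otel_flow() {
  local first second
  first=$(curl -s localhost:8888/metrics | sum_sent_points)
  sleep 60
  second=$(curl -s localhost:8888/metrics | sum_sent_points)
  is_flowing "$first" "$second"
}
```

Calling `check_otel_flow && echo flowing` on the VM gives a single pass/fail answer instead of two manual reads.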
systemctl alone is not enough for OTel
A green `systemctl is-active otelcol-contrib` does not prove metrics are reaching SigNoz. Always confirm with the `:8888/metrics` counter; see gotcha 9.
Common Operations
Re-apply Ansible configuration
After editing a role or group_vars/, push the change to the live VM:
cd ~/levandor/terraform
make configure
This re-runs `ansible-playbook` against the live host via Tailscale SSH. The run is idempotent: tasks whose state already matches report ok, and only tasks with pending changes modify the VM.
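Idempotency can be spot-checked from the play recap. The helper below is a sketch with a hypothetical name; it assumes `make configure` passes Ansible's default output through to stdout, where each recap line carries a `changed=N` field.

```shell
# Sum the changed= counts from Ansible PLAY RECAP lines on stdin.
recap_changed() {
  grep -o 'changed=[0-9]*' | cut -d= -f2 | awk '{ s += $1 } END { printf "%d\n", s }'
}

# Usage sketch, run from the repo root (not executed here):
#   make configure | recap_changed
# A second run with no edits in between should print 0.
```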
Resize the VM
Bumping machine type (e.g. for Spec 2 container load):
- Edit `terraform.tfvars`: set `machine_type = "e2-medium"` (or whatever target).
- `make plan`: review the diff. A machine-type change requires a stop / start (Terraform handles it automatically).
- `make vm-up`: applies the change. Brief outage during the restart.
Expect Tailscale to re-register the device on boot — no manual intervention needed.
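After the apply, the live machine type can be compared against `terraform.tfvars`. This is a sketch under assumptions: the helper names are hypothetical, `ZONE` is a placeholder, and the GCP instance name is assumed to match the `ops-vm` hostname.

```shell
# Read the machine_type value from a tfvars file (first matching assignment).
tfvars_machine_type() {
  sed -n 's/^[[:space:]]*machine_type[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Reduce a machineType URL, as returned by the Compute API, to its short name.
machine_type_name() {
  printf '%s\n' "${1##*/}"
}

# Compare tfvars against the live VM. Defined here but not called automatically.
# ZONE is a placeholder; the instance name ops-vm is an assumption.
verify_resize() {
  local want live
  want=$(tfvars_machine_type terraform.tfvars)
  live=$(gcloud compute instances describe ops-vm --zone "$ZONE" --format='value(machineType)')
  [ "$want" = "$(machine_type_name "$live")" ]
}
```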
Tear down
Full deprovision (removes VM, disk, VPC, firewall rules, service account, the lot):
make deprovision
Run `make preflight` afterward to confirm the project is back to a clean state. The Tailscale auth key is single-use and was consumed when the VM joined, so no key revocation is needed on teardown.
Logs & Debugging
OTel collector (telemetry pipeline):
sudo journalctl -u otelcol-contrib -f
fail2ban (SSH ban events; none expected on a Tailscale-only VM, but the unit is up as defense in depth):
sudo journalctl -u fail2ban -f
Docker containers (once Spec 2 lands app workloads):
docker ps
docker logs -f <container>
Where Things Live
| Item | Location |
|---|---|
| Secrets (auth keys, etc.) | ~/levandor/terraform/secrets/ — gitignored, mode 0600 |
| Terraform state | ~/levandor/terraform/terraform.tfstate — local + gitignored |
| OTel collector config (on VM) | /etc/otelcol-contrib/config.yaml — owner root:otelcol-contrib, mode 0640 |
| Ansible inventory (Terraform-generated) | ~/levandor/terraform/ansible/inventory/hosts.ini |
| Repo root | /Users/levander/levandor/terraform |
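The 0600 mode listed for the secrets directory can be verified from the workstation. A minimal sketch with hypothetical function names, using only `find` so it behaves the same with the BSD and GNU variants:

```shell
# Print any regular file under the given directory whose mode is not exactly 0600.
loose_secrets() {
  find "$1" -type f ! -perm 600 -print
}

# Succeeds only when every file is 0600, that is, when nothing is printed.
secrets_ok() {
  [ -z "$(loose_secrets "$1")" ]
}

# Usage sketch (not executed here):
#   secrets_ok ~/levandor/terraform/secrets || loose_secrets ~/levandor/terraform/secrets
```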
Cost Monitoring
Billing console:
https://console.cloud.google.com/billing/projects/aerobic-tesla-490112-r3
- Open the Reports tab for the per-resource breakdown.
- GCP billing data has a ~24-48h lag — the current day spend is not visible until tomorrow at the earliest.
- Estimated steady-state cost: **~18/mo** at the current `e2-small` + 20 GB pd-balanced + ephemeral external IPv4 sizing. See [[spec-1-deployment-complete#cost--18--month|Cost — ~18 / month]] for the breakdown.
Common Gotchas
The full reference is gcp-terraform-ansible-gotchas (eleven entries). The three most likely to bite on a re-run:
- Tailscale SSH ACL must target the device tag, not `autogroup:self`; see gotcha 7. If `tailscale ssh ops@ops-vm` returns permission-denied, this is almost certainly the cause.
- Ansible `gather_facts` ordering; see gotcha 8. If `make configure` against a fresh VM fails on the very first task with `UNREACHABLE`, the play is missing `gather_facts: false` plus an explicit `ansible.builtin.setup` after `wait_for_connection`.
- OTel v0.152 receiver/exporter renames; see gotcha 10. If the collector starts with deprecation warnings and you copied config from older docs, the names are now `host_metrics` and `otlp_grpc`.
Secret Rotation
| Secret | Rotation procedure | Frequency |
|---|---|---|
| Tailscale auth key | Single-use per machine — the key was consumed when ops-vm joined. Revoke in the Tailscale admin console (https://login.tailscale.com/admin/settings/keys) once the VM is up; no recurring rotation required while the device is alive. | One-shot, post-join |
| SigNoz ingestion key | Rotate in SigNoz Cloud → Settings → Ingestion Keys. Generate a new key, update secrets/signoz-ingestion-key locally, run make configure (which re-renders /etc/otelcol-contrib/config.yaml), confirm :8888/metrics shows new sent_metric_points after restart, then revoke the old key. | Convenient cadence — no immediate driver |
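The SigNoz rotation row can be sketched as a short script. The function names are hypothetical, and whether the Ansible template tolerates a trailing newline in the key file is not known from this document, so treat this as an outline rather than the repo's actual procedure.

```shell
# Write a secret read from stdin to the given file, ending with mode 0600.
# umask covers a newly created file; chmod covers a pre-existing one.
write_secret() {
  ( umask 077 && cat > "$1" ) && chmod 600 "$1"
}

# Rotation outline. Defined here but not called automatically.
rotate_signoz_key() {
  cd ~/levandor/terraform || return 1
  write_secret secrets/signoz-ingestion-key   # paste the new key, then Ctrl-D
  make configure                              # re-renders the collector config
  # Confirm the counter is moving again before revoking the old key in SigNoz Cloud.
  ssh ops@ops-vm 'curl -s localhost:8888/metrics | grep otelcol_exporter_sent_metric_points'
}
```

Reading the key from stdin keeps it out of shell history and process arguments.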