Day-2 operating instructions

Concrete commands and locations for running, inspecting, resizing, and tearing down the live ops-vm from Spec 1. Pair this with gcp-terraform-ansible-gotchas when something looks off.

These instructions cover the ops-vm deployed under levandor-infra Spec 1. All commands assume the operator workstation is connected to the tailnet and has gcloud Application Default Credentials (ADC) configured.

Access

The VM is Tailscale-only — no public SSH port anywhere. Two equivalent ways in, both passwordless:

ssh ops@ops-vm
tailscale ssh ops@ops-vm
  • The hostname ops-vm is Tailscale MagicDNS — no /etc/hosts entry needed.
  • Auth is Tailscale SSH — the VM-side tailscaled mints short-lived SSH credentials from the tailnet identity. No SSH keypair, no ssh-agent, no Keychain ceremony.
  • The make ssh target in the repo wraps the same call.

Why no port

The dedicated VPC has zero inbound rules other than udp:41641 for faster Tailscale direct connections. There is no tcp:22 ingress at all — the public internet cannot reach SSH on this VM by design.
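The ingress claim above can be spot-checked from the workstation. The following is a minimal sketch, not part of the repo: the helper name unexpected_ingress is hypothetical, and it assumes the rule list is fed in as tab-separated name, direction, and allowed columns.

```shell
# Print every ingress rule that allows anything other than udp:41641.
# Input on stdin: tab-separated lines "name<TAB>direction<TAB>allowed".
# (Hypothetical helper, not a target in the repo.)
unexpected_ingress() {
  awk -F '\t' '$2 == "INGRESS" && $3 != "udp:41641" { print $1 " allows " $3 }'
}

# Assumed usage (needs gcloud ADC; add a --filter on the VPC name if the
# project holds other networks):
#   gcloud compute firewall-rules list \
#     --format='value(name,direction,allowed[].map().firewall_rule().list())' \
#     | unexpected_ingress
```

Empty output means no ingress rule other than the Tailscale UDP port was found.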

Health Checks

Quick smoke checks for the three critical layers (systemd services, tailnet membership, telemetry flow):

systemctl is-active docker otelcol-contrib fail2ban
tailscale status
curl -s localhost:8888/metrics | grep otelcol_exporter_sent_metric_points

Expected:

  • All three units return active.
  • tailscale status lists the VM under tag:cloud with the operator workstation reachable.
  • The Prometheus-format counter otelcol_exporter_sent_metric_points is >0 and increasing between two reads a minute apart.
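The "two reads a minute apart" check can be scripted. This is a rough sketch under assumptions: the helper names are hypothetical, and the counter is summed across all exporter label sets. It matches on the metric name prefix, so a _total suffix (if the collector build adds one) is also covered.

```shell
# Sum every sample of the sent-metric-points counter found in Prometheus
# text on stdin. Comment lines (# HELP / # TYPE) do not match the pattern.
sum_sent_points() {
  awk '/^otelcol_exporter_sent_metric_points/ { sum += $NF } END { printf "%.0f\n", sum }'
}

# Succeed only if the second reading is strictly greater than the first.
counter_increased() {
  [ "$2" -gt "$1" ]
}

# Usage on the VM:
#   first=$(curl -s localhost:8888/metrics | sum_sent_points)
#   sleep 60
#   second=$(curl -s localhost:8888/metrics | sum_sent_points)
#   counter_increased "$first" "$second" && echo "telemetry flowing" || echo "counter stalled"
```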

systemctl alone is not enough for OTel

A green systemctl is-active otelcol-contrib does not prove metrics are reaching SigNoz. Always confirm with the :8888/metrics counter — see gotcha 9.

Common Operations

Re-apply Ansible configuration

After editing a role or group_vars/, push the change to the live VM:

cd ~/levandor/terraform
make configure

This re-runs ansible-playbook against the live host via Tailscale SSH. The playbook is idempotent: every task is evaluated, but only tasks whose target state has drifted make changes.

Resize the VM

Bumping machine type (e.g. for Spec 2 container load):

  1. Edit terraform.tfvars — set machine_type = "e2-medium" (or whatever target).
  2. make plan — review the diff. A machine-type change requires a stop / start (Terraform handles it automatically).
  3. make vm-up — applies the change. Brief outage during the restart.

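Expect Tailscale to reconnect on boot with no manual intervention. The stop/start keeps the boot disk, so tailscaled should reuse its stored node state rather than needing a new auth key.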
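To know when the VM is reachable again after the restart, a small retry loop is enough. This is a sketch only: wait_until is a hypothetical helper, and the attempt count and delay are arbitrary choices.

```shell
# Retry a command until it succeeds.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_i=1
  while :; do
    "$@" && return 0                              # success: stop retrying
    [ "$wu_i" -ge "$wu_attempts" ] && return 1    # out of attempts
    wu_i=$((wu_i + 1))
    sleep "$wu_delay"
  done
}

# Usage after make vm-up (up to 30 tries, 10 s apart):
#   wait_until 30 10 ssh -o ConnectTimeout=5 ops@ops-vm true && echo "ops-vm is back"
```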

Tear down

Full deprovision (removes VM, disk, VPC, firewall rules, service account, the lot):

make deprovision

Run make preflight afterward to confirm the project is back to a clean state. The Tailscale auth key is single-use and was consumed when the VM joined — no key revocation needed on teardown.

Logs & Debugging

OTel collector (telemetry pipeline):

sudo journalctl -u otelcol-contrib -f

fail2ban (SSH ban events — none expected on a Tailscale-only VM, but the unit is up as defense in depth):

sudo journalctl -u fail2ban -f

Docker containers (once Spec 2 lands app workloads):

docker ps
docker logs -f <container>
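The -f commands above follow logs live. For a one-off look back, a bounded query piped through a filter can be easier to scan. This is a sketch: only_problems is a hypothetical helper, and the 15-minute window is an arbitrary choice.

```shell
# Keep only lines that mention "error", "warn", or "warning" as whole
# words, in any letter case. Reads log text from stdin. The trailing
# "|| true" keeps the exit status zero when nothing matches.
only_problems() {
  grep -Eiw 'error|warn|warning' || true
}

# Usage on the VM:
#   sudo journalctl -u otelcol-contrib --since "15 min ago" --no-pager | only_problems
```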

Where Things Live

| Item | Location |
| --- | --- |
| Secrets (auth keys, etc.) | ~/levandor/terraform/secrets/ (gitignored, mode 0600) |
| Terraform state | ~/levandor/terraform/terraform.tfstate (local, gitignored) |
| OTel collector config (on VM) | /etc/otelcol-contrib/config.yaml (owner root:otelcol-contrib, mode 0640) |
| Ansible inventory (Terraform-generated) | ~/levandor/terraform/ansible/inventory/hosts.ini |
| Repo root | /Users/levander/levandor/terraform |
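The 0600 mode on the secrets directory is easy to verify. This is a minimal sketch: loose_secrets is a hypothetical helper, and it relies only on find options that exist in both the BSD and GNU versions.

```shell
# List regular files under a directory whose mode is not exactly 0600.
# Prints nothing when every file is compliant.
loose_secrets() {
  find "$1" -type f ! -perm 600
}

# Usage on the workstation:
#   loose_secrets ~/levandor/terraform/secrets
```

Any path it prints can be tightened with chmod 600.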

Cost Monitoring

Billing console:

https://console.cloud.google.com/billing/projects/aerobic-tesla-490112-r3

  • Open the Reports tab for the per-resource breakdown.
  • GCP billing data has a ~24-48h lag — the current day spend is not visible until tomorrow at the earliest.
  • Estimated steady-state cost: ~18/mo at the current e2-small + 20 GB pd-balanced + ephemeral external IPv4 sizing. See the "Cost, ~18 / month" section of spec-1-deployment-complete for the breakdown.

Common Gotchas

The full reference is gcp-terraform-ansible-gotchas (eleven entries). The three most likely to bite on a re-run:

  1. Tailscale SSH ACL must target the device tag, not autogroup:self — see gotcha 7. If tailscale ssh ops@ops-vm returns permission-denied, this is almost certainly the cause.
  2. Ansible gather_facts ordering — see gotcha 8. If make configure against a fresh VM fails on the very first task with UNREACHABLE, the play is missing gather_facts: false plus an explicit ansible.builtin.setup after wait_for_connection.
  3. OTel v0.152 receiver/exporter renames — see gotcha 10. If the collector starts with deprecation warnings and you copied config from older docs, the names are now host_metrics and otlp_grpc.

Secret Rotation

| Secret | Rotation procedure | Frequency |
| --- | --- | --- |
| Tailscale auth key | Single-use per machine: the key was consumed when ops-vm joined. Revoke in the Tailscale admin console (https://login.tailscale.com/admin/settings/keys) once the VM is up; no recurring rotation required while the device is alive. | One-shot, post-join |
| SigNoz ingestion key | Rotate in SigNoz Cloud → Settings → Ingestion Keys. Generate a new key, update secrets/signoz-ingestion-key locally, run make configure (which re-renders /etc/otelcol-contrib/config.yaml), confirm :8888/metrics shows new sent_metric_points after restart, then revoke the old key. | At a convenient cadence; no immediate driver |
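For the "update secrets/signoz-ingestion-key locally" step, one way to swap the file without loosening its permissions is sketched below. write_secret is a hypothetical helper, not a repo target, and the NEW_KEY variable in the usage lines is likewise an assumption.

```shell
# Write stdin to <path> with mode 0600, replacing the file atomically.
# The subshell keeps the umask change from leaking into the caller.
write_secret() {
  (
    umask 077
    tmp=$(mktemp "$1.XXXXXX") && cat > "$tmp" && mv "$tmp" "$1"
  )
}

# Usage on the workstation, run from the repo root:
#   printf '%s' "$NEW_KEY" | write_secret secrets/signoz-ingestion-key
#   make configure
```

Revoke the old key in SigNoz only after the :8888/metrics counter is seen increasing again.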