Deployed (2026-05-25)

Spec 1 of levandor-infra is complete and live on GCP. The ops-vm is provisioned, configured, joined to the tailnet, and shipping host metrics to SigNoz Cloud with zero export failures. Spec 2 (application-deploy layer) is now the active workstream.

This page records the production state of the GCP VM provisioning effort, the one late refinement to the auth model, and the telemetry and cost numbers observed in the live environment.

What Is Running

  • VM — ops-vm on GCE, machine type e2-small, zone us-central1-a, Ubuntu 24.04 LTS.
  • Disk — 20 GB pd-balanced boot disk.
  • Networking — dedicated custom-mode VPC (no default-allow-ssh), ephemeral external IPv4 for egress only, and no public ingress rules except inbound udp:41641, which is open so Tailscale peers can connect directly instead of through a relay.
  • Tailnet — joined under tag:cloud via ephemeral auth key; reachable on MagicDNS as ops-vm.
  • Runtime — Docker CE + compose plugin, fail2ban active, unattended-upgrades configured (no auto-reboot).
  • Telemetry agent — otelcol-contrib v0.152.1 running as a systemd service.
  • Operator access — make ssh over Tailscale; no public SSH port anywhere.
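
For reference, the single ingress rule in the list above could be expressed as the gcloud call sketched here. This is an illustration only: the rule name allow-tailscale-direct, the network name ops-vpc, and the 0.0.0.0/0 source range are assumptions, and the actual levandor-infra provisioning may define the rule through a different tool.

```shell
#!/bin/sh
# Sketch only: builds (does not run) the gcloud command for the udp:41641 rule.
# Rule name, network name, and source range are hypothetical placeholders.
tailscale_ingress_rule_cmd() {
    rule_name="$1"
    network="$2"
    echo "gcloud compute firewall-rules create $rule_name" \
        "--network=$network --direction=INGRESS --action=ALLOW" \
        "--rules=udp:41641 --source-ranges=0.0.0.0/0"
}

# Print the command for review; run it by hand only after checking the names.
tailscale_ingress_rule_cmd allow-tailscale-direct ops-vpc
```

Printing the command rather than executing it keeps the sketch safe to run anywhere, including on a machine without gcloud credentials.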

Late Refinement — Tailscale SSH Replaces Key-Based SSH

The approved Spec 1 design specified key-based SSH over Tailscale. During the deployment pass this was switched to Tailscale SSH (tailscale up --ssh plus an SSH ACL rule).

Why: it drops the macOS Keychain prerequisite entirely. The VM-side tailscaled handles the SSH session itself and authenticates it against the connecting node's tailnet identity, so the operator workstation no longer needs an SSH keypair, an ssh-agent, or the one-time ssh-add --apple-use-keychain ceremony. That is one less moving part in the passwordless-operation chain.
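
A minimal sketch of the VM-side half of this change, assuming the node is joined from a shell step run as root. The function name tailscale_join, the TS_AUTHKEY variable, and the DRY_RUN switch are hypothetical names for this example and are not taken from the levandor-infra repo; only tailscale up --ssh, the hostname ops-vm, and the tag arriving through the auth key come from the text.

```shell
#!/bin/sh
# Sketch only: prints the join command unless DRY_RUN=0 is set.
# TS_AUTHKEY must hold the tagged, ephemeral auth key (hypothetical variable name).
tailscale_join() {
    node_name="$1"
    : "${TS_AUTHKEY:?set TS_AUTHKEY to the tagged auth key}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Never echo the real key; show a redacted form of the command instead.
        echo "tailscale up --ssh --auth-key=<redacted> --hostname=$node_name"
    else
        # --ssh lets tailscaled serve SSH for tailnet peers; the tag comes
        # from the auth key itself, so no --advertise-tags is passed here.
        tailscale up --ssh --auth-key="$TS_AUTHKEY" --hostname="$node_name"
    fi
}
```

With the node up and the ACL rule in place, a plain ssh ops@ops-vm from a tailnet member should connect without any local key material.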

ACL rule (lives in the Tailscale admin policy file):

"ssh": [
  {
    "action": "accept",
    "src":    ["autogroup:member"],
    "dst":    ["tag:cloud"],
    "users":  ["ops", "root"],
  },
],

ACL dst must match the device tag

The VM is joined with an auth key carrying tag:cloud, so the SSH ACL dst must be tag:cloud, not autogroup:self. autogroup:self does not match tagged devices — see the gotcha.

The monitoring role and all other Ansible behavior are unchanged — only the transport for make ssh and Ansible’s SSH connection moved from key-based to Tailscale SSH. Passwordless sudo via the google-sudoers group still covers become.

Telemetry Status — Flowing to SigNoz Cloud

  • Endpoint — ingest.eu2.signoz.cloud:443 (OTLP/gRPC, TLS on, signoz-ingestion-key header).
  • Pipeline — host_metrics receiver → resourcedetection (system + GCP) → batch → otlp_grpc exporter.
  • Throughput observed — ~25,000 host-metric data points exported so far.
  • Failures — zero. otelcol_exporter_send_failed_metric_points is absent from :8888/metrics (i.e. the failure counter has never incremented).

How to verify live:

make ssh
curl -s localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)_metric_points'

On a healthy collector, otelcol_exporter_sent_metric_points increases over time and the send_failed counter is absent or zero. See the gotcha for why this is the canonical signal.
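
The manual check above can be turned into a pass/fail summary. The following is a sketch, not part of the repo: export_health is a hypothetical helper that reads Prometheus-format text on stdin, sums the two counters across all exporters, and returns non-zero when nothing was sent or any point failed. It matches on the metric-name prefix, so it tolerates a suffix such as _total if the collector build adds one.

```shell
#!/bin/sh
# Sketch: summarize exporter health from Prometheus-format text on stdin.
# Exit status 0 means points were sent and none failed; 1 otherwise.
export_health() {
    awk '
        /^#/ { next }                                   # skip HELP/TYPE comment lines
        /^otelcol_exporter_sent_metric_points/        { sent   += $NF }
        /^otelcol_exporter_send_failed_metric_points/ { failed += $NF }
        END {
            printf "sent=%d failed=%d\n", sent, failed
            exit !(sent > 0 && failed == 0)             # absent counters count as 0
        }
    '
}
```

On the VM it would be used as curl -s localhost:8888/metrics | export_health, and the exit status makes it usable from a cron job or a CI smoke test.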

Cost — ~$18 / month

Rough monthly cost at the current config:

Item                                           Cost
e2-small (2 vCPU burstable, 2 GB RAM, 24/7)    ~$13 / mo
20 GB pd-balanced boot disk                    ~$2 / mo
Ephemeral external IPv4 (for egress only)      ~$3 / mo
Total                                          ~$18 / mo

Network egress and SigNoz Cloud usage are inside their respective free tiers at this volume.
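
As a quick check of the total, the rounded line items can be summed directly. This is only the arithmetic on the figures quoted above; monthly_total is a hypothetical helper, and no list prices are assumed beyond what the breakdown states.

```shell
#!/bin/sh
# Sketch: add up rounded monthly line items (whole USD per month).
monthly_total() {
    total=0
    for item in "$@"; do
        total=$((total + item))
    done
    echo "$total"
}

# e2-small, boot disk, external IPv4
monthly_total 13 2 3   # prints 18
```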

Spec 2 Preview (next workstream)

The application-deploy layer is now the active focus. Planned scope:

  • Artifact Registry + per-repo CI workflow template for image pushes; VM pulls via its dedicated SA (already wired via roles/artifactregistry.reader).
  • docker compose based app deploys with a small Ansible role to land the compose file and run docker compose up -d.
  • App logs and traces to SigNoz via a filelog receiver and OTLP trace intake on the existing otelcol-contrib.
  • cloudflared role + Cloudflare ZTNA tunnels for public ingress (no GCP load balancer needed).
  • Batch jobs (systemd timers or k8s-cronjob-equivalent on the single VM).
  • Hardening pass — drop the external IP in favor of Cloud NAT, migrate Tailscale + SigNoz keys to GCP Secret Manager.

See the Spec 2 stub at the end of Spec 2 Preview (out of scope here).