For Agents

Step-by-step guide for getting an application running on the live ops-vm. Spec 2 — the proper deploy layer — is not built yet. Read Reality check first. Everything in this guide is the manual / partial path that works today, plus what’s coming when Spec 2 lands.

How to configure an application to be deployed onto the ops-vm. Use this when an agent is asked to deploy, configure, or run an application on the VM.

When to use this guide

Use this guide when an agent is asked to:

  • Deploy a containerized app onto the ops-vm.
  • Configure / run / restart / inspect an existing app on the VM.
  • Set up app-level telemetry into the SigNoz pipeline already provisioned on the VM.

If the ask is about provisioning more VMs, use agent-guide-provision-new-vm instead.

Reality check (read this first)

Spec 2 is not built yet

The proper application-deploy layer — Artifact Registry repo, per-repo CI workflow, docker compose deploy mechanism, Cloudflare ZTNA, batch jobs, app logs/traces — is not started. The forward-looking design lives in spec-2-roadmap. Until Spec 2 lands, deployments are manual / partial. Treat what you do today as a stopgap that should convert cleanly into the Spec 2 manifest model later.

What this means in practice:

  • No Artifact Registry repo exists yet. You either pull from a public registry (Docker Hub, GHCR public images) or use a GitHub deploy key for a private repo + build on the VM (small, single-VM-friendly only).
  • No deploy mechanism — you do docker pull / docker compose up -d by hand over SSH, or wrap it in your own one-off script.
  • No Cloudflare ZTNA on the VM. Anything you deploy is reachable only from the tailnet (good for internal tooling, blocking for anything that needs a public URL).

What you have today

A real Docker-ready Ubuntu VM with the pieces wired:

  • Docker CE + compose plugin installed and the docker daemon active.
  • The ops user is in the docker group — no sudo needed for docker commands when logged in as ops.
  • OTel collector (otelcol-contrib v0.152.1) running as systemd, shipping host metrics to ingest.eu2.signoz.cloud:443. Status: zero export failures (see Telemetry Status — Flowing to SigNoz Cloud).
  • Service account attached to the VM with roles/artifactregistry.reader — forward-looking; pulls from AR will Just Work once an AR repo exists, no extra config.
  • fail2ban + unattended-upgrades baseline hardening.
  • Tailscale SSH: ssh ops@ops-vm and tailscale ssh ops@ops-vm both work passwordless from any tailnet member with users: ["ops","root"] in the SSH ACL.
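
Before deploying, it can help to confirm that the pieces listed above are actually in place. The following is a minimal sketch to run on the VM as ops. The function names are hypothetical, and the systemd unit name otelcol-contrib is an assumption (it is the name the upstream package normally installs, but check the monitoring role).

```shell
# Return 0 if the space-separated group list in $1 contains the group named $2.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run on the VM as ops. Reports the first failed check and returns non-zero.
preflight() {
  has_group "$(id -nG)" docker || { echo "current user is not in the docker group" >&2; return 1; }
  docker info >/dev/null 2>&1 || { echo "docker daemon not reachable" >&2; return 1; }
  docker compose version >/dev/null 2>&1 || { echo "compose plugin missing" >&2; return 1; }
  # Assumption: the collector runs under a unit named otelcol-contrib.
  systemctl is-active --quiet otelcol-contrib || { echo "OTel collector not active" >&2; return 1; }
  echo "preflight ok"
}
```

Calling preflight after make ssh gives a quick yes/no before any image is pulled.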

Manual deploy path (today)

The pattern that works right now and will convert cleanly into the Spec 2 manifest model later:

  1. SSH into the VM:
    make ssh
    # or:  ssh ops@ops-vm
  2. Pull the image:
    • Public image: docker pull <image>:<tag> — Just Works.
    • Private repo (today, before Spec 2 AR): if secrets/github_deploy_key is configured on the operator workstation, scp the deploy key onto the VM into ~/.ssh/, git clone the repo, then docker build locally on the VM. Acceptable for a single-VM stopgap, not scalable — that’s why Spec 2 introduces Artifact Registry.
  3. Lay down the compose file in a consistent location:
    sudo mkdir -p /opt/apps/<app-name>
    sudo chown ops:ops /opt/apps/<app-name>
    # place docker-compose.yml and .env into /opt/apps/<app-name>/
  4. Start the stack:
    cd /opt/apps/<app-name>
    docker compose -p <app-name> up -d
    The -p <app-name> flag is important — see What to do well today.
  5. (Optional) systemd unit for auto-start on reboot. Write a small unit at /etc/systemd/system/app-<name>.service that runs docker compose -p <app-name> up -d from the app directory, then activate it with systemctl enable --now. This is a stopgap until Spec 2’s Ansible role lands a proper template.
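
The optional unit in step 5 could look roughly like the sketch below. It is an illustration under stated assumptions, not the template Spec 2 will ship: the function name render_app_unit is hypothetical, the docker binary is assumed to live at /usr/bin/docker, and the ExecStop line (which removes the project's containers when the unit is stopped) is an optional choice that the guide does not prescribe.

```shell
# Print a systemd unit for one compose project to stdout.
# $1 = app name, matching /opt/apps/<app-name> and the -p project name.
render_app_unit() {
  app="$1"
  cat <<EOF
[Unit]
Description=docker compose project ${app}
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/apps/${app}
ExecStart=/usr/bin/docker compose -p ${app} up -d
ExecStop=/usr/bin/docker compose -p ${app} down

[Install]
WantedBy=multi-user.target
EOF
}
```

One way to install it on the VM, with myapp as a placeholder name: render_app_unit myapp | sudo tee /etc/systemd/system/app-myapp.service >/dev/null, then sudo systemctl daemon-reload and sudo systemctl enable --now app-myapp.service. Type=oneshot with RemainAfterExit=yes is used because docker compose up -d returns immediately after starting the containers.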

What to do well today, even before Spec 2

Choices made now influence how cleanly the migration to Spec 2 manifests will go. None of these are hard, and all of them help.

  • Keep per-app files consistent with what a deploy manifest will want. Convention: /opt/apps/<app-name>/{docker-compose.yml,.env} — one folder per app, predictable. When Spec 2’s Ansible role lands, it can target the same paths.
  • Use named compose projects (docker compose -p <app-name> …). The project name becomes the container/network/volume prefix; teardown via docker compose -p <app-name> down is clean and total. Anonymous projects (no -p) collide unpredictably when multiple apps share a host.
  • Bind container ports to localhost only unless you have a specific reason — 127.0.0.1:8080:8080 rather than 8080:8080. There is no public ingress on this VM (no firewall rules to expose ports anyway), but locking to localhost makes the intent explicit and survives any future firewall rule slip.
  • Route app telemetry through the existing OTel collector. Two routes (cheap to set up — see App telemetry to SigNoz):
    • Point app SDKs at localhost:4317 (gRPC OTLP) once you extend the collector config with an otlp receiver.
    • Add a filelog receiver to scrape container stdout from /var/lib/docker/containers/*/.
  • Document the app in the vault — see For agents adding a new app today.
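
The path convention and the localhost-only port binding from the list above can be sketched as a small helper that writes a compose file. The function name write_compose, the service name app, and the restart policy are illustrative assumptions; the image and port come from the caller.

```shell
# Write a minimal docker-compose.yml into the given directory.
# $1 = target directory (e.g. /opt/apps/<app-name>), $2 = image reference, $3 = port
write_compose() {
  dir="$1"; image="$2"; port="$3"
  cat > "${dir}/docker-compose.yml" <<EOF
services:
  app:
    image: ${image}
    restart: unless-stopped
    env_file: .env
    ports:
      - "127.0.0.1:${port}:${port}"
EOF
}
```

After write_compose /opt/apps/<app-name> <image>:<tag> 8080, the stack starts with docker compose -p <app-name> up -d and is removed with docker compose -p <app-name> down. Note that env_file expects the .env file to exist in the same directory.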

App telemetry to SigNoz

The current collector config only handles host metrics. To take in logs and traces from apps:

  1. Edit ansible/roles/monitoring/templates/config.yaml.j2 to add:
    • An otlp receiver bound to 127.0.0.1:4317 (gRPC) — apps point their OTLP SDK at localhost:4317.
    • A filelog receiver scraping /var/lib/docker/containers/*/*.log if you want stdout-as-logs without instrumenting the app.
  2. Add matching pipelines (logs: and traces:) — both export to the same otlp_grpc exporter and the same ingest.eu2.signoz.cloud:443 endpoint that host metrics already use. The existing SigNoz ingestion key authenticates all three signal types.
  3. make configure to apply on the live VM.
  4. Verify via :8888/metrics: otelcol_exporter_sent_log_records and otelcol_exporter_sent_spans should increment. See gotcha 9 for the canonical signal pattern.
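
For the verification in step 4, a small filter over the collector's Prometheus text output is enough to see whether a counter moves. This is a sketch: sum_metric is a hypothetical helper, it matches by name prefix (so a _total suffix, if the collector version adds one, is still caught), and it assumes each sample line ends with the value rather than a timestamp.

```shell
# Sum every sample of one metric from Prometheus text read on stdin.
# $1 = metric name (prefix match against the series name)
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                           # skip HELP and TYPE comment lines
    index($1, name) == 1 { total += $NF }   # series name starts with the metric name
    END { print total + 0 }
  '
}
```

On the VM, assuming the metrics endpoint listens on localhost, run curl -s http://localhost:8888/metrics | sum_metric otelcol_exporter_sent_spans twice a minute or so apart and compare the two numbers. Repeat with otelcol_exporter_sent_log_records for logs.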

OTel v0.152 names

Use the current component names — host_metrics and otlp_grpc, not the older hostmetrics / otlp. See gotcha 10.
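
Putting the steps from App telemetry to SigNoz together with the names above, the additions to config.yaml.j2 might look like the fragment printed by this helper. Treat it as a sketch to review and merge by hand, not the role's actual template: the helper name is hypothetical, the exporter is assumed to be defined already under the name otlp_grpc, any processors the existing pipelines use are omitted, and the json_parser operator assumes Docker's default json-file log driver.

```shell
# Print the receiver and pipeline additions for review before merging them
# into ansible/roles/monitoring/templates/config.yaml.j2.
collector_app_fragment() {
  cat <<'YAML'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317   # apps point their OTLP SDK at localhost:4317
  filelog:
    include:
      - /var/lib/docker/containers/*/*.log
    operators:
      - type: json_parser          # assumption: containers log via the json-file driver

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp_grpc]       # assumption: same exporter the metrics pipeline uses
    logs:
      receivers: [otlp, filelog]
      exporters: [otlp_grpc]
YAML
}
```

The collector's service user may need read access to /var/lib/docker/containers for the filelog receiver to work, since that directory is normally readable by root only. Apply the merged template with make configure as described above.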

What you must NOT do

  • Do not open public ingress. Tailscale-only is a design principle (see gcp-vm-provisioning-design). Any service that needs a public URL must go through cloudflared / Cloudflare ZTNA — and that’s Spec 2 territory. Do not add firewall rules to the VPC to expose container ports. Do not put the VM behind a GCP load balancer.
  • Do not commit secrets. App .env files, deploy keys, OAuth tokens — anything sensitive belongs in secrets/ (gitignored, mode 0600) on the operator workstation and is pushed to the VM out-of-band (scp, make configure with templated paths) — never into git.
  • Do not bypass Ansible roles to permanently configure the VM. If you apt install something by hand on ops-vm, it will be silently wiped the moment someone runs make deprovision + make provision (or hits a clean reprovision after a refactor). The right path: extend the matching Ansible role, then make configure. The wrong path makes the deployment non-reproducible.

What’s coming in Spec 2

Forward-looking — when these land, the manual steps above are replaced by:

  • Artifact Registry repo — Terraform-managed, in the same GCP project. The VM SA already has roles/artifactregistry.reader.
  • Per-repo CI workflow template (GitHub Actions workflow_call) — each app repo invokes it to build and push images tagged with the commit SHA.
  • Deploy mechanism — likely docker compose pull driven by central manifests or per-repo compose files (the pick is still open — see Deploy mechanism).
  • cloudflared Ansible role + Cloudflare Access policies — replaces “Tailscale-only” for apps that need a public URL, no GCP load balancer needed.
  • App logs and traces to SigNoz — collector config extended with filelog + OTLP receivers; the App telemetry to SigNoz section above is the manual preview of this.
  • Batch-job scheduling — systemd timers on the VM for light/frequent jobs, Cloud Run Jobs for heavier/spikier work.

Full roadmap: spec-2-roadmap.

For agents adding a new app today

Before doing the manual deploy, write a short note in this project’s vault — one note per app — documenting:

  1. Image source — public registry image, or git clone URL + build path on the VM.
  2. Ports — what the container listens on, what gets bound to 127.0.0.1:<port>.
  3. Secrets / env — what’s in .env, where the source-of-truth lives on the operator workstation, who has access.
  4. Telemetry plan — host metrics only (default — nothing to do), or OTLP traces from the app, or filelog from container stdout.
  5. Manual deploy steps — the exact docker pull / docker compose -p <name> up -d sequence, plus any pre-steps (DB migration, etc.).

When Spec 2 lands, these notes become the direct input to per-app deploy manifests. The note format above maps 1:1 to what a manifest needs. Don’t skip this — re-deriving “what’s running and why” from a live VM is harder than capturing it at deploy time.