Spec 2a now has an approved design

The first slice of Spec 2 has moved from seed material to an approved design — see gcp-app-deploy-design for the full Spec 2a artifact (core app deploy, EU region migration, first-app shape, WIF). This roadmap remains the umbrella for the still-deferred slices 2b (app telemetry), 2c (Cloudflare ZTNA), 2d (Cloud Run Jobs), and the hardening pass.

For Agents

Forward-looking design overview for the application-deploy layer on top of the live ops-vm. Status: Spec 2a is in design (approved artifact at gcp-app-deploy-design); Specs 2b/2c/2d + hardening are still seed material. The original section content below remains the umbrella context for the deferred slices.

Spec 2 layers application deployment on top of the live [[spec-1-deployment-complete|ops-vm]]. Spec 1 produced a Docker-ready, Artifact-Registry-ready, Tailscale-connected VM with host telemetry shipping to SigNoz Cloud. Spec 2 turns that into a runtime for the actual applications.

Goal

Layer application deployment on top of the live ops-vm:

  • Pull images from Artifact Registry — the VM-side service account already has roles/artifactregistry.reader from Spec 1, so authenticated pulls work without extra credentials.
  • Run images via docker compose — compose-file driven, one stack per app, each with its own systemd unit (or a single docker compose up -d orchestrated by an Ansible role).
  • Expose HTTP services via Cloudflare ZTNA — cloudflared Ansible role + Cloudflare Access policies, no GCP load balancer needed, no GCP firewall ports opened.
  • Run scheduled batch jobs — systemd timers on the VM for light work, Cloud Run Jobs for heavier or spiky work.
  • Ship app logs and traces to SigNoz Cloud — not just host metrics. Extend the existing otelcol-contrib with a filelog receiver for container stdout and OTLP receivers for app-native traces.
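A minimal sketch of what a single app stack could look like as a compose file. The service name, port, and file location are assumptions for illustration only; the image reference keeps the placeholder pattern used elsewhere in this note.

```yaml
# compose.yml for one app stack (hypothetical; one file per app)
services:
  app:
    # Placeholder image reference; the tag is the commit SHA pushed by CI.
    image: "<region>-docker.pkg.dev/<project>/<repo>/<image>:<sha>"
    restart: unless-stopped
    ports:
      # Publish on loopback only, so nothing listens on the external interface
      # and a local cloudflared tunnel can still reach the service.
      - "127.0.0.1:8080:8080"   # port 8080 is an assumption
    logging:
      # Optional: route container stdout to journald instead of json-file.
      driver: journald
```

Publishing on loopback only is one way to stay in line with the "no GCP firewall ports opened" goal; whether the logging driver is journald or the default json-file depends on the log sourcing option chosen under App telemetry.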

Components

Artifact Registry

  • A Terraform-managed AR repository in the same GCP project (aerobic-tesla-490112-r3), Docker format, regional.
  • A reusable GitHub Actions workflow template (workflow_call) that each app repo invokes from its own .github/workflows/ — builds the image, tags with the commit SHA, pushes to <region>-docker.pkg.dev/<project>/<repo>/<image>:<sha>.
  • The VM already has roles/artifactregistry.reader on its service account (provisioned by Spec 1 — forward-looking).
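The reusable workflow described above could look roughly like the following sketch. It assumes Workload Identity Federation for CI auth (as locked in under Status) and uses the google-github-actions/auth and docker/login-action actions; the input names and the file path are hypothetical.

```yaml
# .github/workflows/build-push.yml in a shared repo (hypothetical path)
name: build-and-push
on:
  workflow_call:
    inputs:
      region: { required: true, type: string }
      project: { required: true, type: string }
      repository: { required: true, type: string }
      image: { required: true, type: string }
      workload_identity_provider: { required: true, type: string }
      service_account: { required: true, type: string }

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write   # needed for the OIDC token exchange with GCP
    steps:
      - uses: actions/checkout@v4

      # Exchange the GitHub OIDC token for a short-lived GCP access token.
      - id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ inputs.workload_identity_provider }}
          service_account: ${{ inputs.service_account }}
          token_format: access_token

      # Log in to the regional Artifact Registry host with that token.
      - uses: docker/login-action@v3
        with:
          registry: ${{ inputs.region }}-docker.pkg.dev
          username: oauth2accesstoken
          password: ${{ steps.auth.outputs.access_token }}

      # Build and push, tagged with the commit SHA.
      - name: Build and push
        env:
          IMAGE_REF: ${{ inputs.region }}-docker.pkg.dev/${{ inputs.project }}/${{ inputs.repository }}/${{ inputs.image }}:${{ github.sha }}
        run: |
          docker build -t "$IMAGE_REF" .
          docker push "$IMAGE_REF"
```

A calling workflow in an app repo would reference this file with `uses:` and pass the inputs via `with:`. The caller also has to grant `id-token: write`, because a called workflow cannot hold more permissions than its caller.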

AR credential helper

  • Install docker-credential-gcr (or run gcloud auth configure-docker <region>-docker.pkg.dev, which registers the helper for the Artifact Registry host) on the VM so docker pull <region>-docker.pkg.dev/... Just Works via the VM-attached service account token.
  • Belongs in the docker Ansible role (extend the existing role from Spec 1).
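As a sketch, the extension to the docker role could be a pair of tasks like the ones below. This assumes gcloud is already installed on the VM and takes the gcloud route rather than docker-credential-gcr; the task file name and the `ar_*` variables are hypothetical.

```yaml
# roles/docker/tasks/ar_credentials.yml (hypothetical file name)
- name: Register gcloud as Docker credential helper for the Artifact Registry host
  ansible.builtin.command:
    cmd: "gcloud auth configure-docker {{ ar_region }}-docker.pkg.dev --quiet"
  become: true
  # Simplification: the command is idempotent, so it is never reported as changed.
  changed_when: false

- name: Smoke-test an authenticated pull
  ansible.builtin.command:
    cmd: "docker pull {{ ar_region }}-docker.pkg.dev/{{ ar_project }}/{{ ar_repo }}/{{ ar_smoke_image }}"
  become: true
  changed_when: false
```

The helper registration is written to the Docker client config of the user that runs the command (root here, because of `become`), so pulls need to run as that same user.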

Deploy mechanism

Two viable shapes — pick one early in the spec phase:

  • App-repo-ships-compose-file — each app repo contains its own docker-compose.yml; a small VM-side script clones the repo (read-only deploy key) and runs docker compose up -d. Maximizes app autonomy; deploy state lives next to the app code.
  • Central deploy manifest in this repo — levandor/terraform/deploys/<app>.yml references AR images by tag; deploys are applied centrally. Centralizes state; app teams need access to this repo to deploy.

Trade-off summary: app autonomy vs central state. The central-manifest model is easier to audit and roll back; the app-repo model is easier for individual app teams to evolve.

Cloudflare ZTNA

  • cloudflared Ansible role — install the tunnel binary, register the tunnel against the levander Cloudflare account, run as a systemd unit.
  • Cloudflare Access policies gate per-app — internal apps get a friendly public URL (<app>.<zone>) without opening any GCP firewall ports.
  • No GCP load balancer cost. The latency penalty is the extra Cloudflare hop, which is workable for internal-app traffic.
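For orientation, a locally managed cloudflared tunnel is typically driven by a config file along these lines. This is a sketch only: the tunnel ID, credentials path, and local port are placeholders or assumptions, and the Access policies themselves are configured on the Cloudflare side, not in this file.

```yaml
# /etc/cloudflared/config.yml (path is an assumption)
tunnel: "<tunnel-uuid>"
credentials-file: "/etc/cloudflared/<tunnel-uuid>.json"

ingress:
  # One rule per app: public hostname mapped to a loopback-only service.
  - hostname: "<app>.<zone>"
    service: http://localhost:8080   # port is an assumption
  # cloudflared requires a final catch-all rule.
  - service: http_status:404
```

Because the tunnel dials out to Cloudflare, each app only needs to listen on loopback, which is what keeps the GCP firewall closed.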

Batch jobs

Two destinations depending on workload shape:

  • systemd timers on the VM — light, frequent jobs (e.g. nightly database backups, periodic cleanup tasks). Lives in the VM, observed by the existing OTel collector.
  • Cloud Run Jobs — heavier or spiky jobs (e.g. a one-off batch import, a periodic export that needs a fat CPU for ten minutes). Decouples spiky load from the always-on VM.

Decide per job in the spec phase rather than making a single platform-wide pick.
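For the systemd-timer destination, the Ansible side could be sketched as below. The job name, schedule, compose path, and the `backup` service are all hypothetical; the tasks only show the shape of a timer plus oneshot service pair.

```yaml
# Tasks for one light scheduled job (hypothetical "nightly-backup")
- name: Install the oneshot service that runs the job
  ansible.builtin.copy:
    dest: /etc/systemd/system/nightly-backup.service
    mode: "0644"
    content: |
      [Unit]
      Description=Nightly database backup (hypothetical job)

      [Service]
      Type=oneshot
      ExecStart=/usr/bin/docker compose -f /opt/apps/example/compose.yml run --rm backup

- name: Install the timer that triggers the service
  ansible.builtin.copy:
    dest: /etc/systemd/system/nightly-backup.timer
    mode: "0644"
    content: |
      [Unit]
      Description=Schedule for nightly-backup.service

      [Timer]
      OnCalendar=*-*-* 02:00:00
      Persistent=true

      [Install]
      WantedBy=timers.target

- name: Enable and start the timer
  ansible.builtin.systemd:
    name: nightly-backup.timer
    enabled: true
    state: started
    daemon_reload: true
```

`Persistent=true` makes systemd run a missed job after the VM comes back up, which matters for a single always-on VM that may be reprovisioned.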

App telemetry

Extend the existing otelcol-contrib config:

  • Logs — filelog receiver consuming container stdout. Two sourcing options: journald (if Docker is configured with --log-driver=journald; the collector would then read through its journald receiver rather than filelog) or direct file scraping under /var/lib/docker/containers/*/.
  • Traces — OTLP receiver on :4317 (gRPC) and :4318 (HTTP) for app-native traces; apps point their SDK at localhost:4317.
  • New pipelines: logs and traces, both sending to the same ingest.eu2.signoz.cloud:443 endpoint with the existing ingestion key (same auth pattern as the host-metrics pipeline).
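The additions above could look roughly like this collector config fragment. It takes the file-scraping option and assumes Docker's default json-file log driver; the receiver and exporter names, the environment variable, and the ingestion-key header name are assumptions and should mirror whatever the existing host-metrics exporter already uses.

```yaml
receivers:
  # Container stdout/stderr as written by Docker's json-file log driver.
  filelog/docker:
    include:
      - /var/lib/docker/containers/*/*-json.log
    operators:
      # Each line is a JSON object with "log", "stream", and "time" fields.
      - type: json_parser
      # Promote the raw log line to the record body.
      - type: move
        from: attributes.log
        to: body

  # App-native telemetry pushed by SDKs.
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

processors:
  batch: {}

exporters:
  otlp:
    endpoint: ingest.eu2.signoz.cloud:443
    headers:
      # Header name is an assumption; copy it from the existing pipeline.
      signoz-ingestion-key: ${env:SIGNOZ_INGESTION_KEY}

service:
  pipelines:
    logs:
      receivers: [filelog/docker]
      processors: [batch]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

One detail for the spec phase: a container on a bridge network resolves localhost to itself, so an app pointing its SDK at localhost:4317 only reaches a loopback-bound collector when it runs with host networking or the collector listens on an address the container can reach. The collector also needs read access to the Docker log directory.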

Open Design Decisions

These surface during the brainstorm phase and need pinning down before the spec is written.

  • GitHub deploy key strategy — one deploy key per repo (more keys to manage, blast radius per key is small) vs one machine-user key for many repos (one key to manage, larger blast radius). Lean toward per-repo deploy keys for read-only access; revisit if the operational overhead grows.
  • External IP vs Cloud NAT — the live VM has an ephemeral external IPv4 (~$3.65/mo external IP charge). Moving to Cloud NAT adds ~$1/mo for NAT and removes the last public-facing GCP attack surface entirely (egress only, but still). Probably worth it once Spec 2 settles; cleanest to schedule as a follow-up after the initial app deploy lands.
  • Auth-key storage — current secrets/ gitignored files at mode 0600 work fine, but GCP Secret Manager would centralize rotation and produce an audit trail. Modest cost (cents/mo for two secrets). Belongs in the hardening pass at the end of Spec 2 rather than blocking the initial deploys.

Status

Spec 2a: in design — approved artifact at gcp-app-deploy-design (committed 2026-05-25). Decisions locked: europe-west3 region, central manifest in this Terraform repo, Ansible-driven apps role with service and job runtime shapes, WIF for CI auth, manual make deploy APP=<name>. Existing us-central1 VM scheduled for destroy + reprovision in EU as part of Spec 2a. First app: wowjeeez/polymarket-fetch (telegram agent + 15-minute snapshot job).

Spec 2b / 2c / 2d / Hardening: not started. Each is additive to the Spec 2a layout — no restructuring required. The open design decisions above (deploy key strategy, Cloud NAT, Secret Manager) belong to the hardening pass at the end of Spec 2.