Spec 2a now has an approved design
The first slice of Spec 2 has moved from seed material to an approved design — see gcp-app-deploy-design for the full Spec 2a artifact (core app deploy, EU region migration, first-app shape, WIF). This roadmap remains the umbrella for the still-deferred slices 2b (app telemetry), 2c (Cloudflare ZTNA), 2d (Cloud Run Jobs), and the hardening pass.
For Agents
Forward-looking design overview for the application-deploy layer on top of the live `ops-vm`. Status: Spec 2a is in design (approved artifact at gcp-app-deploy-design); Specs 2b/2c/2d + hardening are still seed material. The original section content below remains the umbrella context for the deferred slices.
Spec 2 layers application deployment on top of the live [[spec-1-deployment-complete|ops-vm]]. Spec 1 produced a Docker-ready, Artifact-Registry-ready, Tailscale-connected VM with host telemetry shipping to SigNoz Cloud. Spec 2 turns that into a runtime for the actual applications.
Goal
Layer application deployment on top of the live ops-vm:
- Pull images from Artifact Registry: the VM-side service account already has `roles/artifactregistry.reader` from Spec 1, so authenticated pulls work without extra credentials.
- Run images via `docker compose`: compose-file driven, one stack per app, each with its own systemd unit (or a single `docker compose up -d` orchestrated by an Ansible role).
- Expose HTTP services via Cloudflare ZTNA: `cloudflared` Ansible role + Cloudflare Access policies, no GCP load balancer needed, no GCP firewall ports opened.
- Run scheduled batch jobs: systemd timers on the VM for light work, Cloud Run Jobs for heavier or spiky work.
- Ship app logs and traces to SigNoz Cloud, not just host metrics. Extend the existing `otelcol-contrib` with a `filelog` receiver for container stdout and OTLP receivers for app-native traces.
Components
Artifact Registry
- A Terraform-managed AR repository in the same GCP project (`aerobic-tesla-490112-r3`), Docker format, regional.
- A reusable GitHub Actions workflow template (`workflow_call`) that each app repo invokes from its own `.github/workflows/`: builds the image, tags with the commit SHA, pushes to `<region>-docker.pkg.dev/<project>/<repo>/<image>:<sha>`.
- The VM already has `roles/artifactregistry.reader` on its service account (provisioned by Spec 1 as a forward-looking step).
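The image path format above can be sketched as a small helper. This is an illustration only: the project ID and the `europe-west3` region come from this page, while the function name `image_ref` and the repo and image names in the usage line are hypothetical.

```python
import re

# Values taken from this page: the GCP project ID and the Spec 2a region.
PROJECT = "aerobic-tesla-490112-r3"
REGION = "europe-west3"

# Accept abbreviated (7+) or full (40) lowercase hex commit SHAs.
_SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def image_ref(repo: str, image: str, sha: str,
              region: str = REGION, project: str = PROJECT) -> str:
    """Return <region>-docker.pkg.dev/<project>/<repo>/<image>:<sha>."""
    if not _SHA_RE.match(sha):
        # Reject mutable tags such as "latest" so every deploy is pinned.
        raise ValueError(f"not a hex commit SHA: {sha!r}")
    return f"{region}-docker.pkg.dev/{project}/{repo}/{image}:{sha}"


if __name__ == "__main__":
    # "apps" and "polymarket-fetch" are placeholder repo/image names.
    print(image_ref("apps", "polymarket-fetch", "0123abc"))
```

Rejecting non-SHA tags is a design choice of this sketch, not a requirement stated in the spec; it keeps what is running traceable to a commit.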
AR credential helper
- Install `docker-credential-gcr` (or `gcloud auth configure-docker`) on the VM so `docker pull <region>-docker.pkg.dev/...` Just Works via the VM-attached service account token.
- Belongs in the `docker` Ansible role (extend the existing role from Spec 1).
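Both tools work by registering a credential helper for the registry host in the Docker client config (`credHelpers` in `config.json`). A minimal sketch of that merge, assuming the Ansible role manages the file itself; the function name `with_ar_cred_helper` is hypothetical:

```python
import json


def with_ar_cred_helper(config: dict, region: str, helper: str = "gcr") -> dict:
    """Return a copy of a Docker client config with an Artifact Registry helper entry.

    Docker resolves a helper named X to a binary called docker-credential-X,
    so "gcr" selects docker-credential-gcr and "gcloud" selects the helper
    that `gcloud auth configure-docker` registers.
    """
    merged = dict(config)
    helpers = dict(merged.get("credHelpers", {}))
    helpers[f"{region}-docker.pkg.dev"] = helper
    merged["credHelpers"] = helpers
    return merged


if __name__ == "__main__":
    # Start from an empty config; a real role would read the existing file first.
    print(json.dumps(with_ar_cred_helper({}, "europe-west3"), indent=2))
```

The function copies rather than mutates, so existing `auths` or other helper entries in the file are preserved.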
Deploy mechanism
Two viable shapes — pick one early in the spec phase:
- App-repo-ships-compose-file: each app repo contains its own `docker-compose.yml`; a small VM-side script clones the repo (read-only deploy key) and runs `docker compose up -d`. Maximizes app autonomy; deploy state lives next to the app code.
- Central deploy manifest in this repo: `levandor/terraform/deploys/<app>.yml` references AR images by tag; deploys are applied centrally. Centralizes state; app teams need access to this repo to deploy.
Trade-off summary: app autonomy vs central state. The central-manifest model is easier to audit and roll back; the app-repo model is easier for individual app teams to evolve.
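To make the central-manifest shape concrete, here is a rough sketch of how a manifest entry could map onto a compose document and a `docker compose` invocation. The `Deploy` fields and all function names are hypothetical; the real manifest schema is defined in the Spec 2a design, not here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Deploy:
    app: str    # stack name, i.e. the <app> in deploys/<app>.yml
    image: str  # full Artifact Registry path without the tag
    tag: str    # commit SHA pushed by CI


def compose_document(deploy: Deploy) -> dict:
    """Build a one-service compose document pinned to the manifest's tag."""
    return {
        "services": {
            deploy.app: {
                "image": f"{deploy.image}:{deploy.tag}",
                "restart": "unless-stopped",
            }
        }
    }


def compose_up_argv(deploy: Deploy, compose_path: str) -> list[str]:
    """Command line that brings the stack up under its own project name."""
    return ["docker", "compose", "-p", deploy.app, "-f", compose_path, "up", "-d"]


def rollback(deploy: Deploy, previous_tag: str) -> Deploy:
    """A rollback is just the same manifest entry with an older tag."""
    return replace(deploy, tag=previous_tag)
```

Because the tag is the only thing that changes between releases, a rollback in this model is a one-line revert in the repo, which is where the auditability argument comes from.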
Cloudflare ZTNA
- `cloudflared` Ansible role: install the tunnel binary, register the tunnel against the user `levander` Cloudflare account, run as a systemd unit.
- Cloudflare Access policies gate access per app: internal apps get a friendly public URL (`<app>.<zone>`) without opening any GCP firewall ports.
- No GCP load balancer cost. The latency penalty is the extra Cloudflare hop, which is workable for internal-app traffic.
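The per-app routing can be sketched as the ingress section of a `cloudflared` tunnel config, built here as a plain dict that a role would serialize to YAML. The function name, zone, app names, and ports are placeholders; the one fixed rule is that `cloudflared` expects the final ingress entry to be a catch-all with no hostname.

```python
def tunnel_config(tunnel_id: str, credentials_file: str,
                  zone: str, apps: dict[str, int]) -> dict:
    """Map each app to <app>.<zone> and forward it to a local port."""
    ingress = [
        {"hostname": f"{app}.{zone}", "service": f"http://localhost:{port}"}
        for app, port in sorted(apps.items())
    ]
    # cloudflared requires a catch-all rule as the last ingress entry.
    ingress.append({"service": "http_status:404"})
    return {
        "tunnel": tunnel_id,
        "credentials-file": credentials_file,
        "ingress": ingress,
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cfg = tunnel_config("TUNNEL-UUID", "/etc/cloudflared/creds.json",
                        "example.com", {"dashboard": 8080, "api": 9000})
    for rule in cfg["ingress"]:
        print(rule)
```

Access policies themselves live on the Cloudflare side and are not part of this file; the tunnel config only decides which hostname reaches which local service.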
Batch jobs
Two destinations depending on workload shape:
- systemd timers on the VM — light, frequent jobs (e.g. nightly database backups, periodic cleanup tasks). Lives in the VM, observed by the existing OTel collector.
- Cloud Run Jobs — heavier or spiky jobs (e.g. a one-off batch import, a periodic export that needs a fat CPU for ten minutes). Decouples spiky load from the always-on VM.
Decide per job in the spec phase; this is not a single platform-wide pick.
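For the systemd-timer destination, a job is a pair of units: a oneshot service and a timer that triggers it. A minimal sketch of rendering that pair, assuming an Ansible role would template something similar; the function name, unit name, and script path are hypothetical.

```python
def timer_units(name: str, description: str,
                exec_start: str, on_calendar: str) -> dict[str, str]:
    """Render a oneshot service and the timer that activates it.

    A timer activates the service with the same base name by default,
    so no explicit Unit= line is needed.
    """
    service = "\n".join([
        "[Unit]",
        f"Description={description}",
        "",
        "[Service]",
        "Type=oneshot",
        f"ExecStart={exec_start}",
        "",
    ])
    timer = "\n".join([
        "[Unit]",
        f"Description=Timer for {description}",
        "",
        "[Timer]",
        f"OnCalendar={on_calendar}",
        # Run a missed job at next boot if the VM was down at the scheduled time.
        "Persistent=true",
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])
    return {f"{name}.service": service, f"{name}.timer": timer}


if __name__ == "__main__":
    units = timer_units("db-backup", "Nightly database backup",
                        "/usr/local/bin/db-backup.sh", "*-*-* 02:00:00")
    print(units["db-backup.timer"])
```

`Persistent=true` is an assumption about what light jobs like backups would want; jobs that must not run late should leave it out.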
App telemetry
Extend the existing otelcol-contrib config:
- Logs: `filelog` receiver consuming container stdout. Two sourcing options: journald (if Docker is configured with `--log-driver=journald`; that route reads through the collector's `journald` receiver rather than `filelog`) or direct file scraping under `/var/lib/docker/containers/*/`.
- Traces: OTLP receiver on `:4317` (gRPC) and `:4318` (HTTP) for app-native traces; apps point their SDK at `localhost:4317`.
- New pipelines: `logs` and `traces`, both sending to the same `ingest.eu2.signoz.cloud:443` endpoint with the existing ingestion key (same auth pattern as the host-metrics pipeline).
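The shape of the added collector config can be sketched as a dict mirroring the YAML layout, plus a check that every pipeline only references components that are defined. Assumptions to verify against the live host-metrics config: the `*-json.log` glob (Docker's default json-file driver), the `signoz-ingestion-key` header name, and reusing a single `otlp` exporter. Function names are hypothetical.

```python
def app_telemetry_config(ingestion_key: str) -> dict:
    """Receivers, exporter, and pipelines for app logs and traces."""
    return {
        "receivers": {
            # Assumes Docker's json-file log driver; path glob is an assumption.
            "filelog": {"include": ["/var/lib/docker/containers/*/*-json.log"]},
            "otlp": {
                "protocols": {
                    "grpc": {"endpoint": "localhost:4317"},
                    "http": {"endpoint": "localhost:4318"},
                }
            },
        },
        "exporters": {
            "otlp": {
                "endpoint": "ingest.eu2.signoz.cloud:443",
                # Header name is an assumption; match the host-metrics pipeline.
                "headers": {"signoz-ingestion-key": ingestion_key},
            }
        },
        "service": {
            "pipelines": {
                "logs": {"receivers": ["filelog"], "exporters": ["otlp"]},
                "traces": {"receivers": ["otlp"], "exporters": ["otlp"]},
            }
        },
    }


def undefined_components(config: dict) -> list[str]:
    """List pipeline references that have no matching top-level definition."""
    missing = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    missing.append(f"{name}.{kind}.{component}")
    return missing
```

Binding OTLP to `localhost` matches the SDK target named above, but containers on a bridge network would not reach it; that networking choice is left to the spec phase.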
Open Design Decisions
These surface during the brainstorm phase and need pinning down before the spec is written.
- GitHub deploy key strategy — one deploy key per repo (more keys to manage, blast radius per key is small) vs one machine-user key for many repos (one key to manage, larger blast radius). Lean toward per-repo deploy keys for read-only access; revisit if the operational overhead grows.
- External IP vs Cloud NAT: the live VM has an ephemeral external IPv4. Moving to Cloud NAT drops the ~$3.65/mo external IP charge, adds ~$1/mo for NAT, and removes the last public-facing GCP attack surface entirely (egress only, but still). Probably worth it once Spec 2 settles; cleanest to schedule as a follow-up after the initial app deploy lands.
- Auth-key storage: current `secrets/` gitignored files at mode `0600` work fine, but GCP Secret Manager would centralize rotation and produce an audit trail. Modest cost (cents/mo for two secrets). Belongs in the hardening pass at the end of Spec 2 rather than blocking the initial deploys.
Status
Spec 2a: in design. Approved artifact at gcp-app-deploy-design (committed 2026-05-25). Decisions locked: europe-west3 region, central manifest in this Terraform repo, Ansible-driven `apps` role with service and job runtime shapes, WIF for CI auth, manual `make deploy APP=<name>`. The existing us-central1 VM is scheduled for destroy + reprovision in EU as part of Spec 2a. First app: wowjeeez/polymarket-fetch (Telegram agent + 15-minute snapshot job).
Spec 2b / 2c / 2d / Hardening: not started. Each is additive to the Spec 2a layout, so no restructuring is required. The open design decisions above (deploy key strategy, Cloud NAT, Secret Manager) belong to the hardening pass at the end of Spec 2.