Spec 2a entry point

This is the approved design for the first slice of Spec 2 — core app deploy, EU region migration, and the first app (wowjeeez/polymarket-fetch). Spec 2b (app telemetry), 2c (Cloudflare ZTNA), 2d (Cloud Run Jobs), and a hardening pass remain follow-ups under the Spec 2 roadmap umbrella.

Spec 2a layers application deployment on top of the live [[spec-1-deployment-complete|ops-vm]] and migrates the deployment from us-central1 to europe-west3 (Frankfurt) so the whole stack lives in EU. The mechanism is a central manifest in this Terraform repo, Ansible-driven; CI authenticates to GCP via Workload Identity Federation (no long-lived keys). Two runtime shapes — service (daemon) and job (systemd-timer-driven one-shot) — cover the dual nature of the first app and any future shape.

Source spec: docs/superpowers/specs/2026-05-25-app-deploy-design.md in the repo.

For Agents

  • Region: europe-west3 (zone europe-west3-a). The existing us-central1 VM will be destroyed and reprovisioned in EU as part of this work.
  • First app: wowjeeez/polymarket-fetch, a long-running Telegram agent (kind service) plus a 15-minute snapshot job (kind job).
  • Deploy trigger: manual make deploy APP=<name> (auto-deploy deferred).
  • CI auth: Workload Identity Federation (OIDC), scoped per repo via IAM bindings on ci-pusher@…iam.gserviceaccount.com.
  • App secrets: secrets/<app>.env files, gitignored, mode 0600, copied by Ansible to /opt/apps/<app>/.env.
  • Status: approved, ready for implementation planning. Cross-link: spec-2-roadmap.

Context

Spec 1 produced a Docker-ready, Artifact-Registry-ready, Tailscale-connected VM with host metrics shipping to SigNoz Cloud — live as ops-vm in aerobic-tesla-490112-r3 (currently us-central1-a). Spec 2a:

  • Adds Artifact Registry + Workload Identity Federation to the Terraform layer.
  • Adds an apps Ansible role + central deploys/apps.yml manifest.
  • Migrates the VM to europe-west3-a.
  • Ships the first app (polymarket-fetch) end-to-end (CI → AR → VM service + job).

Goals

  • make deploy APP=<name> deploys an app from Artifact Registry onto the VM.
  • The deploy mechanism supports both services (daemon, docker compose up -d) and jobs (one-shot, docker compose run --rm driven by a systemd timer).
  • CI in app repos pushes images to AR via Workload Identity Federation — no long-lived service-account keys anywhere.
  • Per-app env/secrets stay in the existing secrets/ pattern from Spec 1.
  • All GCP resources live in europe-west3; the existing us-central1 VM is destroyed and reprovisioned in EU.

Non-Goals (deferred)

  • Spec 2b — app logs and traces via OTel filelog + OTLP receivers.
  • Spec 2c — Cloudflare ZTNA + cloudflared role for public app URLs.
  • Spec 2d — Cloud Run Jobs for heavy/spiky batch.
  • Hardening — GCP Secret Manager for centralized rotation; no-public-IP + Cloud NAT.
  • Auto-deploy — CI triggering a VM-side deploy automatically; manual make deploy for now.

Decisions (locked in brainstorm)

| Decision | Choice | Rationale |
| --- | --- | --- |
| GCP region | europe-west3 (Frankfurt) + zone europe-west3-a | EU data locality; low latency from Central Europe; large/mature region. |
| Existing VM | Destroy + reprovision in EU | Single-region story; VM is 2 days old with no app data (base services only). |
| Deploy mechanism | Central manifest in this Terraform repo, Ansible-driven | Solo operator; central audit + rollback; no per-repo deploy keys. |
| Runtime shapes | service and job | Covers polymarket-fetch dual nature and any future app shape. |
| CI → AR auth | Workload Identity Federation | No long-lived keys; OIDC; scoped per repo via IAM bindings. |
| App env/secrets | Local secrets/<app>.env per app, Ansible-copied | Consistent with Spec 1 pattern; gitignored; mode 0600. |
| AR repo structure | One Docker repo apps in europe-west3, multiple images underneath | Simple for a personal infra repo. |
| Image tagging | Git SHA + :latest floating | Manifest pins :latest; ExecStartPre=docker pull ensures latest on (re)start; SHA available for traceability. |
| Deploy trigger | Manual: make deploy APP=<name> | Auto-deploy is a Spec 2 hardening item. |
| Reusable workflow location | wowjeeez/terraform/.github/workflows/build-push.yml | Single source of truth; app repos call via uses:. Requires pushing this Terraform repo to GitHub as wowjeeez/terraform. |

Architecture

Spec 2a build-push-deploy flow

graph TD
    Push["git push main (app repo)"] --> Auth["WIF auth, id-token write"]
    Auth --> Build["docker buildx build, tag sha + latest"]
    Build --> AR["Artifact Registry: europe-west3-docker.pkg.dev/proj/apps"]
    Op["operator: make deploy APP=name"] --> Ansible["ansible --tags apps --extra-vars deploy_only=name"]
    Ansible --> Compose["render /opt/apps/name/ compose + .env"]
    Ansible --> Unit["render systemd unit (service or job+timer)"]
    Unit --> Pull["ExecStartPre=docker pull"]
    Pull --> AR
    Pull --> Run["docker compose up -d, or docker compose run --rm"]
    style AR fill:#264653,stroke:#2a9d8f,color:#fff
    style Run fill:#2d2d2d,stroke:#888,color:#fff

Plain-text version:

GitHub Actions (app repo push to main)
  permissions: id-token: write
  google-github-actions/auth@v2 (WIF) then short-lived GCP token
  docker buildx build then tag :SHA + :latest then push to
      europe-west3-docker.pkg.dev/PROJ/apps/IMAGE
 
operator: make deploy APP=NAME
  ansible-playbook --tags apps --extra-vars deploy_only=NAME
      copy secrets/NAME.env to /opt/apps/NAME/.env  (0600 ops:ops)
      template /opt/apps/NAME/docker-compose.yml
      template /etc/systemd/system/apps-NAME.service|timer
      systemctl daemon-reload + enable + restart
      ExecStartPre=docker pull IMAGE
        ExecStart=docker compose up -d              (kind: service)
                  docker compose run --rm SVC       (kind: job, triggered by timer)

Repo Layout

.                                       repo root = Terraform root
Makefile                                extended: + deploy, + redeploy-all
README.md                               extended with app-deploy operator UX
CLAUDE.md                               extended: pointers to new guides
main.tf                                 extended: + module ar_wif + outputs
variables.tf                            extended: ar_repo_name, ci_github_repos
outputs.tf                              extended: wif_provider, ci_sa_email, ar_repo_url
modules/
  vm/                                   unchanged
  ar_wif/                               NEW: AR repo + WIF pool + provider + CI SA + IAM
    main.tf
    variables.tf
    outputs.tf
ansible/
  site.yml                              extended: load deploys/apps.yml, add apps role
  roles/
    base/                               unchanged
    docker/                             extended: + AR credential helper task
    github_keys/                        unchanged
    monitoring/                         unchanged
    apps/                               NEW: renders compose + systemd per app
      tasks/main.yml
      defaults/main.yml
      handlers/main.yml
      templates/
        docker-compose.yml.j2
        service.unit.j2
        job.unit.j2
        job.timer.j2
deploys/                                NEW
  apps.yml                              the central manifest (apps: list)
.github/workflows/
  build-push.yml                        NEW: reusable workflow (workflow_call)
secrets/
  signoz_ingestion_key                  existing
  github_deploy_key                     existing, optional
  polymarket-telegram-agent.env         NEW pattern, gitignored
  polymarket-snapshot.env               NEW pattern, gitignored

terraform.tfvars is also extended (gitignored). New keys: ci_github_repos (list of owner/repo strings); region/zone updated to EU.

Terraform Layer

Variables (new/changed)

| Variable | Default | Notes |
| --- | --- | --- |
| region | europe-west3 | CHANGED from us-central1. |
| zone | europe-west3-a | CHANGED from us-central1-a. |
| ar_repo_name | apps | Docker-format AR repo name. |
| ci_github_repos | [] | List of owner/repo strings; each gets a WIF binding to impersonate the CI SA. |

modules/ar_wif/

Reusable module creating:

  • google_artifact_registry_repository.this — Docker format, region from input.
  • google_iam_workload_identity_pool.github — pool github-pool.
  • google_iam_workload_identity_pool_provider.github_oidc — OIDC provider with:
    • issuer_uri = "https://token.actions.githubusercontent.com"
    • attribute_mapping:
      • google.subject = assertion.sub
      • attribute.repository = assertion.repository
      • attribute.actor = assertion.actor
    • attribute_condition = "assertion.repository_owner == 'wowjeeez'" — restricts the pool to repos owned by the user, defense in depth on top of per-repo IAM bindings.
  • google_service_account.ci_pusher: ci-pusher@PROJ.iam.gserviceaccount.com.
  • google_artifact_registry_repository_iam_member: ci_pusher granted roles/artifactregistry.writer on the apps repo.
  • google_service_account_iam_member (for_each over ci_github_repos) — grants roles/iam.workloadIdentityUser on ci_pusher to principalSet://iam.googleapis.com/projects/NUM/locations/global/workloadIdentityPools/github-pool/attribute.repository/OWNER/REPO.
  • Outputs: wif_provider_resource_name, ci_service_account_email, ar_host (e.g. europe-west3-docker.pkg.dev), ar_repo_url (full path europe-west3-docker.pkg.dev/PROJ/apps).

main.tf wiring

  • module "ar_wif" instantiated once with var.ar_repo_name, var.ci_github_repos, var.region.
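  • depends_on the API enablement resources (google_project_service), so the module is created only after the required APIs are enabled.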
  • Outputs re-exported at the root.

API enablement

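The existing google_project_service for_each already enables iam.googleapis.com and artifactregistry.googleapis.com. Spec 2a extends the list with sts.googleapis.com (Security Token Service, which exchanges the GitHub OIDC token for a federated GCP token) and iamcredentials.googleapis.com (IAM Service Account Credentials, which mints the short-lived ci-pusher token through impersonation).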

Ansible Layer

docker role (extended)

Append to ansible/roles/docker/tasks/main.yml:

  • Add the Google Cloud apt repo + key (keyring/signed-by pattern).
  • Install google-cloud-cli.
  • Run gcloud auth configure-docker europe-west3-docker.pkg.dev --quiet as the ops user. Writes the credential helper into ~ops/.docker/config.json so subsequent docker pull AR_HOST/... authenticates via the VM’s attached service account (roles/artifactregistry.reader from Spec 1).
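
The three steps above could be sketched as Ansible tasks along these lines. This is a minimal sketch under stated assumptions, not the final role: the keyring path and .asc filename, the apt source filename, and the task names are all illustrative, and it assumes the downloaded key is ASCII-armored so that apt accepts it under a .asc name.

```yaml
# Appended to ansible/roles/docker/tasks/main.yml (sketch; paths and names are assumptions)
- name: Ensure the apt keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Download the Google Cloud apt signing key
  ansible.builtin.get_url:
    url: https://packages.cloud.google.com/apt/doc/apt-key.gpg
    dest: /etc/apt/keyrings/cloud.google.asc   # assumed filename
    mode: "0644"

- name: Add the Google Cloud apt repository (signed-by pattern)
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt cloud-sdk main"
    filename: google-cloud-sdk                 # assumed sources.list.d name
    state: present

- name: Install google-cloud-cli
  ansible.builtin.apt:
    name: google-cloud-cli
    state: present
    update_cache: true

- name: Register the Artifact Registry credential helper for the ops user
  ansible.builtin.command:
    cmd: gcloud auth configure-docker europe-west3-docker.pkg.dev --quiet
  become: true
  become_user: ops
  changed_when: false   # reported as unchanged so re-runs stay quiet
```

Running the last task as ops matters because the helper entry is written to the invoking user's ~/.docker/config.json, and the systemd units run docker as ops.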

apps role (new)

ansible/roles/apps/ reads the apps: list (loaded via site.yml’s vars_files: [deploys/apps.yml]).

tasks/main.yml per app entry:

  • file: create /opt/apps/NAME/ (mode 0755, owner ops).
  • copy: secrets/NAME.env to /opt/apps/NAME/.env (mode 0600, owner ops).
  • template: render docker-compose.yml from templates/docker-compose.yml.j2. Compose body contains the image, env_file: .env, restart policy (services), and any compose_extra overrides.
  • template: render the systemd units:
    • kind: service produces /etc/systemd/system/apps-NAME.service:
      • Type=oneshot, RemainAfterExit=yes
      • WorkingDirectory=/opt/apps/NAME
      • User=ops
      • ExecStartPre=/usr/bin/docker pull IMAGE
      • ExecStart=/usr/bin/docker compose up -d
      • ExecStop=/usr/bin/docker compose down
      • [Install] WantedBy=multi-user.target
    • kind: job produces /etc/systemd/system/apps-NAME.service:
      • Type=oneshot
      • WorkingDirectory=/opt/apps/NAME
      • User=ops
      • ExecStartPre=/usr/bin/docker pull IMAGE
      • ExecStart=/usr/bin/docker compose run --rm SERVICE_NAME
    • and apps-NAME.timer:
      • [Timer] OnCalendar=SCHEDULE, Persistent=true
      • [Install] WantedBy=timers.target
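  • systemd: daemon_reload: yes on any unit change. For kind: service, enabled: yes and state: started/restarted on apps-NAME.service. For kind: job, enabled: yes and state: started on apps-NAME.timer only; the job's .service has no [Install] section and runs when the timer fires.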
  • Honor deploy_only=NAME extra-var (skip entries not matching).
  • Pre-validation: command: docker compose -f /opt/apps/NAME/docker-compose.yml config after templating, fail the play on a bad compose file before activating systemd.

Handler: daemon-reload triggered by changes to systemd unit files.
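
The task list above might translate into something like the following. It is a sketch rather than the approved role: the apps_selected fact name is an assumption, env_file is assumed to be relative to the repo root, and for kind job the sketch enables only the timer, which is one reading of the systemd step.

```yaml
# ansible/roles/apps/tasks/main.yml (sketch; apps_selected is a hypothetical fact name)
- name: Select apps to deploy (all entries, or only the one named by deploy_only)
  ansible.builtin.set_fact:
    apps_selected: >-
      {{ apps if deploy_only is not defined
         else (apps | selectattr('name', 'equalto', deploy_only) | list) }}

- name: Create the app directory
  ansible.builtin.file:
    path: "/opt/apps/{{ item.name }}"
    state: directory
    owner: ops
    group: ops
    mode: "0755"
  loop: "{{ apps_selected }}"

- name: Copy the env file
  ansible.builtin.copy:
    src: "{{ playbook_dir }}/../{{ item.env_file }}"   # assumes env_file is relative to the repo root
    dest: "/opt/apps/{{ item.name }}/.env"
    owner: ops
    group: ops
    mode: "0600"
  no_log: true   # keep secret contents out of task output
  loop: "{{ apps_selected }}"

- name: Render docker-compose.yml
  ansible.builtin.template:
    src: docker-compose.yml.j2
    dest: "/opt/apps/{{ item.name }}/docker-compose.yml"
    owner: ops
    group: ops
    mode: "0644"
  loop: "{{ apps_selected }}"

- name: Validate the rendered compose file before touching systemd
  ansible.builtin.command:
    cmd: "docker compose -f /opt/apps/{{ item.name }}/docker-compose.yml config"
  changed_when: false
  loop: "{{ apps_selected }}"

- name: Render the systemd service unit for the app's kind
  ansible.builtin.template:
    src: "{{ 'service.unit.j2' if item.kind == 'service' else 'job.unit.j2' }}"
    dest: "/etc/systemd/system/apps-{{ item.name }}.service"
    mode: "0644"
  loop: "{{ apps_selected }}"
  notify: daemon-reload

- name: Render the timer for job apps
  ansible.builtin.template:
    src: job.timer.j2
    dest: "/etc/systemd/system/apps-{{ item.name }}.timer"
    mode: "0644"
  loop: "{{ apps_selected | selectattr('kind', 'equalto', 'job') | list }}"
  notify: daemon-reload

- name: Enable and start the service (kind service) or the timer (kind job)
  ansible.builtin.systemd:
    name: "apps-{{ item.name }}.{{ 'service' if item.kind == 'service' else 'timer' }}"
    daemon_reload: true
    enabled: true
    state: started
  loop: "{{ apps_selected }}"
```

Because the validation task sits between the compose template and the unit templates, a bad compose file fails the play before any systemd state changes.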

defaults/main.yml: empty apps: [] default (the real list is in deploys/apps.yml); defaults for template knobs (e.g. compose default restart: unless-stopped for services).
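
For orientation, the compose file rendered for the service-kind app might look roughly like this. It is an illustrative rendering only: it assumes the compose service key equals the app name and that no compose_extra overrides are set, neither of which the spec fixes.

```yaml
# /opt/apps/polymarket-telegram-agent/docker-compose.yml (illustrative rendering)
services:
  polymarket-telegram-agent:      # assumed: service key equals the app name
    image: europe-west3-docker.pkg.dev/aerobic-tesla-490112-r3/apps/polymarket-telegram-agent:latest
    env_file: .env                # the file Ansible copied from secrets/<app>.env
    restart: unless-stopped       # role default for kind service; omitted for kind job
```

For a kind job entry the restart key would be left out, since docker compose run --rm starts a one-shot container that the timer re-triggers.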

site.yml change

  • vars_files: ["{{ playbook_dir }}/../deploys/apps.yml"] at play level (or include_vars inside the apps role).
  • Add apps as the final role, after monitoring.
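
A play wired this way could look like the sketch below. The host group name ops_vm, the play name, and the role order before apps are assumptions; only the vars_files path and the apps tag come from this spec.

```yaml
# ansible/site.yml (sketch; "ops_vm" and the earlier role order are assumptions)
- name: Configure ops-vm
  hosts: ops_vm
  become: true
  vars_files:
    - "{{ playbook_dir }}/../deploys/apps.yml"   # provides the apps: list
  roles:
    - role: base
    - role: docker
    - role: github_keys
    - role: monitoring
    - role: apps
      tags: [apps]                               # matched by --tags apps in the Makefile
```

Loading the manifest at play level keeps the apps list available even when --tags apps skips every other role.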

Makefile additions

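.PHONY: deploy redeploy-all

# Deploy a single app; refuse to run without APP so an empty filter never reaches Ansible.
deploy:
	@test -n "$(APP)" || { echo "usage: make deploy APP=<name>"; exit 1; }
	$(ANSIBLE) ansible/site.yml --tags apps --extra-vars "deploy_only=$(APP)"
 
# Re-render and restart every app in deploys/apps.yml.
redeploy-all:
	$(ANSIBLE) ansible/site.yml --tags apps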

Usage: make deploy APP=polymarket-telegram-agent.

CI - Reusable Workflow

.github/workflows/build-push.yml (in this terraform repo, called via workflow_call):

on:
  workflow_call:
    inputs:
      image_name:         {required: true,  type: string}
      dockerfile:         {required: false, type: string, default: Dockerfile}
      context:            {required: false, type: string, default: "."}
      wif_provider:       {required: true,  type: string}
      ci_service_account: {required: true,  type: string}
      ar_host:            {required: true,  type: string}
      ar_project:         {required: true,  type: string}
      ar_repo:            {required: true,  type: string}
 
jobs:
  build-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ inputs.wif_provider }}
          service_account: ${{ inputs.ci_service_account }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: gcloud auth configure-docker ${{ inputs.ar_host }} --quiet
      - uses: docker/setup-buildx-action@v3
      - name: Build and push
        run: |
          IMAGE="${{ inputs.ar_host }}/${{ inputs.ar_project }}/${{ inputs.ar_repo }}/${{ inputs.image_name }}"
          # Push the immutable SHA tag and the floating :latest tag from one build.
          docker buildx build --push \
            -t "$IMAGE:${{ github.sha }}" \
            -t "$IMAGE:latest" \
            -f "${{ inputs.dockerfile }}" \
            "${{ inputs.context }}"

App repos add .github/workflows/release.yml:

on:
  push:
    branches: [main]
 
jobs:
  telegram-agent:
    uses: wowjeeez/terraform/.github/workflows/build-push.yml@main
    permissions: {contents: read, id-token: write}
    with:
      image_name: polymarket-telegram-agent
      dockerfile: docker/telegram-agent.Dockerfile
      wif_provider: projects/NUM/locations/global/workloadIdentityPools/github-pool/providers/github
      ci_service_account: ci-pusher@aerobic-tesla-490112-r3.iam.gserviceaccount.com
      ar_host: europe-west3-docker.pkg.dev
      ar_project: aerobic-tesla-490112-r3
      ar_repo: apps
 
  snapshot:
    uses: wowjeeez/terraform/.github/workflows/build-push.yml@main
    permissions: {contents: read, id-token: write}
    with:
      image_name: polymarket-snapshot
      dockerfile: docker/snapshot.Dockerfile
      wif_provider: projects/NUM/locations/global/workloadIdentityPools/github-pool/providers/github
      ci_service_account: ci-pusher@aerobic-tesla-490112-r3.iam.gserviceaccount.com
      ar_host: europe-west3-docker.pkg.dev
      ar_project: aerobic-tesla-490112-r3
      ar_repo: apps

The WIF provider, CI service account, AR host/project/repo are taken from terraform output after make vm-up — they are identifiers, not secrets, safe to commit into the app repos.

Migration of the Existing VM (US to EU)

Because the VM moves from us-central1 to europe-west3:

  1. (If not already done) Revoke the existing Tailscale auth key in the admin console; generate a fresh reusable + ephemeral + pre-authorized + tag:cloud-tagged auth key. Paste into terraform.tfvars.
  2. make deprovision — the us-central1 VM is destroyed; the ephemeral Tailscale node auto-removes from the tailnet shortly after.
  3. Edit terraform.tfvars: set region = "europe-west3", zone = "europe-west3-a". Keep vm_name = "ops-vm" (so MagicDNS resolution stays unchanged: ops-vm still resolves on the tailnet).
  4. make plan && make vm-up && make configure.
  5. make verify confirms VM up + all services active. Host metrics resume to SigNoz from the new region.

The new VM has a different external IP (irrelevant — SSH is via Tailscale). MagicDNS resolution of ops-vm automatically points at the new node once it joins.

First App - Operator Steps (end-to-end)

  1. In wowjeeez/polymarket-fetch: author two Dockerfiles (e.g. docker/telegram-agent.Dockerfile, docker/snapshot.Dockerfile). Author .github/workflows/release.yml calling the reusable workflow.

  2. In this terraform repo: add "wowjeeez/polymarket-fetch" to ci_github_repos in terraform.tfvars. make plan && make vm-up (WIF IAM binding update; no VM downtime).

  3. Push to main in polymarket-fetch; CI then builds and pushes both images to AR.

  4. Append two entries to deploys/apps.yml:

    apps:
      - name: polymarket-telegram-agent
        kind: service
        image: europe-west3-docker.pkg.dev/aerobic-tesla-490112-r3/apps/polymarket-telegram-agent:latest
        env_file: secrets/polymarket-telegram-agent.env
     
      - name: polymarket-snapshot
        kind: job
        image: europe-west3-docker.pkg.dev/aerobic-tesla-490112-r3/apps/polymarket-snapshot:latest
        env_file: secrets/polymarket-snapshot.env
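        schedule: "*:0/15"   # systemd OnCalendar syntax (every 15 minutes); the timer uses OnCalendar=SCHEDULE, so cron syntax is not valid here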
  5. Create secrets/polymarket-telegram-agent.env and secrets/polymarket-snapshot.env locally.

  6. make deploy APP=polymarket-telegram-agent then make deploy APP=polymarket-snapshot.

  7. make ssh; verify with systemctl status apps-polymarket-telegram-agent, systemctl list-timers apps-polymarket-snapshot.timer, and docker ps.

Error Handling & Idempotency

  • The apps role is idempotent: re-running with no manifest changes is a no-op. With changes, the compose and unit files re-render, the daemon-reload handler fires, and the affected unit restarts.
  • ExecStartPre=docker pull means every (re)start picks up :latest. There’s a brief window where an old container coexists with a new pull — acceptable for this scale.
  • A failed docker pull (network blip, AR transient) prevents start; systemd reports the failure; journalctl -u apps-NAME shows the error.
  • A bad compose file is caught by docker compose config at deploy time before systemd activates.
  • WIF token-exchange failures in CI surface as a clear google-github-actions/auth@v2 step error with the GCP message — common causes: missing id-token: write, repo not added to ci_github_repos, WIF provider URI mismatch.
  • All Ansible role changes are idempotent — re-running make configure is safe.

Verification

  • terraform plan after Spec 2a shows: AR repo, WIF pool + provider, CI SA, IAM bindings (per repo), expanded API enablement.
  • terraform validate and terraform fmt -check clean.
  • ansible-playbook --syntax-check ansible/site.yml clean.
  • Post-migration make verify confirms VM healthy + OTel metrics flowing from europe-west3.
  • Post-deploy: systemctl is-active apps-polymarket-telegram-agent returns active; docker ps shows the running container; the agent posts something to Telegram (confirmed in chat).
  • Snapshot job: systemctl list-timers apps-polymarket-snapshot.timer shows the next-firing schedule; after the first scheduled run, journalctl -u apps-polymarket-snapshot.service shows the run output.
  • WIF end-to-end: pushing a commit to polymarket-fetch:main triggers the workflow; the auth step exchanges the OIDC token; build+push completes; new image tag visible in gcloud artifacts docker images list europe-west3-docker.pkg.dev/PROJ/apps.

What’s Deferred (Spec 2b / 2c / 2d / Hardening)

  • Spec 2b — app telemetry: extend monitoring role with filelog and OTLP receivers; pipelines for logs and traces shipping to SigNoz.
  • Spec 2c: cloudflared Ansible role + Cloudflare Access policies; per-app public URL gating.
  • Spec 2d — Cloud Run Jobs (Terraform) for heavy/spiky batch.
  • Hardening — GCP Secret Manager (rotation + audit); no-public-IP + Cloud NAT (drops the external IPv4 charge, removes the last public-facing surface).
  • Auto-deploy — webhook or pull-based reconciler that triggers make deploy on AR push.

Each is additive to this layout — no restructuring required.