Agent Guide — Configure an App Deploy on ops-vm

For LLM Agents

This is a procedure. Run every command verbatim. Take every value from this guide or from terraform output; never guess. Read “Before you start” and “What’s live” first, then jump to the procedure section that matches your task.

Before you start — required context

You are deploying a Dockerized app onto the live ops-vm in aerobic-tesla-490112-r3 / europe-west3-a via the Spec 2a deploy stack: app repo → CI (WIF) → Artifact Registry → make deploy APP=<name> → Ansible apps role renders compose + systemd → container runs on the VM.

Before any commands:

cd /Users/levander/levandor/terraform (the repo root; repo-relative paths below start here).
git status reports a clean tree, and you are on main (or a feature branch off main).
gcloud auth application-default print-access-token returns a token. If not: gcloud auth application-default login.
tailscale status shows your host online. If not: sudo tailscale up.
terraform output shows all of ar_host, ar_repo_url, wif_provider_resource_name, ci_service_account_email. If any missing, the deploy layer isn’t applied — run make vm-up first.
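
The last check above can be scripted. This is a minimal sketch, not something that exists in the repo: the function name missing_outputs is hypothetical, and the usage line assumes jq is installed locally to list the output names.

```shell
# missing_outputs (hypothetical helper): read terraform output names on stdin,
# one per line, and print each required name that is absent.
missing_outputs() (
  have="$(cat)"
  for name in ar_host ar_repo_url wif_provider_resource_name ci_service_account_email; do
    printf '%s\n' "$have" | grep -qx -- "$name" || printf '%s\n' "$name"
  done
)
```

Possible usage: terraform output -json | jq -r 'keys[]' | missing_outputs. Empty output means all four outputs exist; any printed name suggests the deploy layer is not applied yet.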

Always re-fetch values via terraform output, never copy-paste from this guide

Project IDs, WIF provider numeric prefixes, AR hosts — they may have changed since this guide was written. Treat the values below as examples; the source of truth is terraform output.

What’s live

Resource	Current value	How to look up
GCP project	`aerobic-tesla-490112-r3`	`gcloud config get-value project`
Region / Zone	`europe-west3` / `europe-west3-a`	`terraform output region`, `terraform output zone`
VM (GCE name)	`ops-vm`	`terraform output vm_name`
VM (Tailscale name)	`ops-vm-1` (collision with stale node; see gcp-terraform-ansible-gotchas #15)	`tailscale status | grep ops-vm`
AR repo URL (image tag prefix)	`europe-west3-docker.pkg.dev/aerobic-tesla-490112-r3/apps`	`terraform output ar_repo_url`
AR host (for `gcloud auth configure-docker`)	`europe-west3-docker.pkg.dev`	`terraform output ar_host`
WIF provider resource	`projects/1004705535649/locations/global/workloadIdentityPools/github-pool/providers/github`	`terraform output wif_provider_resource_name`
CI service account	`ci-pusher@aerobic-tesla-490112-r3.iam.gserviceaccount.com`	`terraform output ci_service_account_email`
Central manifest	`deploys/apps.yml`	—
Per-app secrets	`secrets/<name>.env` (gitignored, mode 0600)	—
Reusable CI workflow path	`wowjeeez/terraform/.github/workflows/build-push.yml@main`	—

Concept overview

GitHub push (app repo on main)
   │
   └─ .github/workflows/release.yml
        ├─ permissions: id-token: write  ◄── REQUIRED for WIF
        └─ uses: wowjeeez/terraform/.github/workflows/build-push.yml@main
             │
             ├─ google-github-actions/auth@v2   (WIF, no long-lived keys)
             ├─ gcloud auth configure-docker europe-west3-docker.pkg.dev
             └─ docker buildx build --push -t <repo>:<sha> -t <repo>:latest

operator (you): make deploy APP=<name>
   │
   └─ ansible-playbook ansible/site.yml --tags apps --extra-vars deploy_only=<name>
        ├─ Validate apps manifest schema
        ├─ systemd-analyze calendar <schedule>  (for kind: job)
        └─ apps role _present.yml:
             ├─ create /opt/apps/<name>/
             ├─ copy secrets/<name>.env → /opt/apps/<name>/.env (0600, ops:ops)
             ├─ render /opt/apps/<name>/docker-compose.yml
             ├─ docker compose config -q     (validate)
             ├─ render /etc/systemd/system/apps-<name>.{service,timer}
             ├─ systemctl daemon-reload
             └─ enable + restart unit (service) or start timer (job)

Procedure: deploy a brand-new app

Numbered steps. Substitute <owner>/<repo>, <app_name>, <image_name> (usually = repo name), <service_or_job>, <schedule> where indicated.

Step 1 — Authorize the app repo for CI

In terraform.tfvars, append <owner>/<repo> to ci_github_repos:

ci_github_repos = [
  "wowjeeez/polymarket-fetch",
  "wowjeeez/<repo>",
]

Apply:

make vm-up

Expect only google_service_account_iam_member.wif_repo_binding["<owner>/<repo>"] to be added (and local_file.inventory recreated — harmless). 0 VM changes, 0 network changes. If the plan shows anything else, stop and inspect.
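
One way to apply the "stop and inspect" rule mechanically is to filter the plan text. The sketch below is an assumption-laden illustration: unexpected_plan_changes is a hypothetical name, and it relies on the human-readable plan format, where each resource action appears as a "# <address> will be ..." or "# <address> must be ..." comment line.

```shell
# unexpected_plan_changes (hypothetical helper): read human-readable plan text
# on stdin and print every resource action line that is NOT one of the two
# changes this step expects (wif_repo_binding, local_file.inventory).
# Exit status is 0 when something unexpected was found, 1 when nothing was.
unexpected_plan_changes() {
  grep -E '^[[:space:]]*# .+ (will be|must be) ' \
    | grep -v -e 'wif_repo_binding\[' -e 'local_file\.inventory'
}
```

Possible usage, assuming you run the plan directly rather than through make: terraform plan -no-color | unexpected_plan_changes && echo "STOP: inspect the plan". Treat it as a first-pass filter only; reading the plan yourself remains the real check.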

Re-fetch outputs to confirm:

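# terraform output accepts at most one output name, so query each in turn
for name in ar_host ar_repo_url wif_provider_resource_name ci_service_account_email; do
  printf '%s = ' "$name"; terraform output "$name"
done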

Step 2 — Author the Dockerfile(s) in the app repo

One Dockerfile per image. Multistage where possible. Pattern:

# syntax=docker/dockerfile:1
 
FROM rust:1-bookworm AS builder
WORKDIR /build
COPY . .
RUN cargo build --release -p <crate-name>
 
FROM debian:bookworm-slim AS runtime
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /build/target/release/<binary-name> /usr/local/bin/<binary-name>
ENTRYPOINT ["/usr/local/bin/<binary-name>"]

Use ENTRYPOINT (not CMD) so that the compose command: key replaces only the arguments and the binary still runs. Install ca-certificates whenever the binary makes outbound TLS calls (Polymarket APIs, Supabase, etc.).

.dockerignore traps

If the app uses sqlx::migrate!("path/to/migrations"), the migrations directory must be in the Docker build context. Check .dockerignore doesn’t exclude it. (See gcp-terraform-ansible-gotchas #17.)
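
A quick way to spot a suspicious exclusion is to list the .dockerignore lines that mention the migrations directory. This is a rough sketch: dockerignore_hits is a hypothetical helper, and it does a plain substring search rather than real .dockerignore pattern matching, so wildcard rules such as a bare * will not be caught.

```shell
# dockerignore_hits (hypothetical helper): print the non-comment, non-blank
# lines of a .dockerignore file that mention the given directory name,
# prefixed with their line numbers. Substring search only.
# $1 = path to .dockerignore, $2 = directory name to look for
dockerignore_hits() {
  grep -n -v -E '^[[:space:]]*(#|$)' "$1" | grep -F -- "$2"
}
```

Possible usage: dockerignore_hits .dockerignore migrations. Lines beginning with ! are re-includes and are printed too, which helps when checking whether an exclusion has already been overridden.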

Step 3 — Author `.github/workflows/release.yml` in the app repo

name: Release
 
on:
  push:
    branches: [main]
    paths:
      - "crates/<crate-name>/**"
      - "Cargo.toml"
      - "Cargo.lock"
      - "Dockerfile.<image_name>"
      - ".github/workflows/release.yml"
  workflow_dispatch: {}
 
permissions:
  contents: read
  id-token: write
 
jobs:
  build-push-<image_name>:
    uses: wowjeeez/terraform/.github/workflows/build-push.yml@main
    with:
      image_name: <image_name>
      dockerfile: Dockerfile.<image_name>
      context: "."
      wif_provider: "projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/github-pool/providers/github"
      ci_service_account: "ci-pusher@<PROJECT_ID>.iam.gserviceaccount.com"
      ar_host: "europe-west3-docker.pkg.dev"
      ar_project: "<PROJECT_ID>"
      ar_repo: "apps"

Fill <PROJECT_NUMBER>, <PROJECT_ID>, and the WIF provider string from terraform output (do not copy from this guide).
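
The two placeholders can be derived from the terraform output strings rather than typed by hand. The helpers below are hypothetical names introduced for illustration, and they assume the outputs have the same shape as the example values in the What’s live table.

```shell
# Hypothetical helpers: derive the workflow placeholders from terraform
# output strings instead of copying them by hand.

# Extract the numeric project number from a WIF provider resource name
# of the form projects/<number>/locations/global/...
project_number_from_wif() {
  printf '%s\n' "$1" | sed -n 's|^projects/\([0-9][0-9]*\)/locations/.*$|\1|p'
}

# Extract the project ID from a service account email
# of the form <name>@<project-id>.iam.gserviceaccount.com
project_id_from_sa() {
  printf '%s\n' "$1" | sed -n 's|^[^@]*@\(.*\)\.iam\.gserviceaccount\.com$|\1|p'
}
```

Possible usage: project_number_from_wif "$(terraform output -raw wif_provider_resource_name)" and project_id_from_sa "$(terraform output -raw ci_service_account_email)". Both print nothing if the input does not have the expected shape, which is a signal to inspect the output manually.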

permissions: id-token: write at the workflow level is mandatory

Reusable workflows inherit only the permissions the caller grants. Without id-token: write on the calling workflow, the WIF auth step fails with Error: getIDToken call failed.

Step 4 — Push the app repo, watch CI

cd <app-repo-dir>
git add .github/workflows/release.yml Dockerfile.<image_name>
git commit -m "ci: containerize <image_name> and push to AR via WIF"
git push origin main

Watch:

gh run watch -R <owner>/<repo>

Common failure modes and fixes:

workflow was not found / failed to parse called workflow — the wowjeeez/terraform repo’s private-reusable-workflow access setting is off. Fix: gh api repos/wowjeeez/terraform/actions/permissions/access -X PUT -f access_level=user.
Build fails on sqlx::migrate! with “No such file or directory” — the migrations dir is excluded by .dockerignore. Fix: remove the exclusion or !-include the migrations subpath.
Error: getIDToken call failed — missing permissions: id-token: write on the calling workflow.

Confirm the image landed (requires gcloud auth login — the user auth, not ADC):

gcloud artifacts docker images list \
  europe-west3-docker.pkg.dev/<PROJECT_ID>/apps/<image_name> \
  --include-tags --limit=3

You should see :latest and :<commit_sha> tags.

Step 5 — Add the app to the central manifest

Edit deploys/apps.yml. Append one entry per intended service or job. Required keys per entry: name, kind (service or job), image, env_file. Optional keys: state (defaults to present) and compose_extra (optional knobs, listed below the examples). For kind: job, schedule is also required.

Service entry (long-running daemon):

  - name: <app_name>
    kind: service
    image: europe-west3-docker.pkg.dev/<PROJECT_ID>/apps/<image_name>:latest
    env_file: secrets/<app_name>.env
    state: present
    compose_extra:
      restart: always
      command:
        - <arg1>
        - <arg2>

Job entry (scheduled one-shot):

  - name: <job_name>
    kind: job
    image: europe-west3-docker.pkg.dev/<PROJECT_ID>/apps/<image_name>:latest
    env_file: secrets/<job_name>.env
    state: present
    schedule: "*:0/30"
    compose_extra:
      command:
        - <arg1>
        - <arg2>

schedule: is systemd OnCalendar syntax, NOT cron

Examples that work: "*:0/15" (every 15 min), "*:0/30" (every 30 min), "hourly", "daily", "Mon..Fri 09:00". Cron’s "*/15 * * * *" is rejected by systemd-analyze calendar and the timer never fires. See gcp-terraform-ansible-gotchas #12. The role validates each kind: job schedule via systemd-analyze calendar before any unit file is written, so a typo fails the play fast.
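
Before editing the manifest, a local heuristic can catch the most common mistake. The sketch below is not the role’s validation (that is systemd-analyze calendar on the VM); looks_like_cron is a hypothetical helper that only flags strings with exactly five whitespace-separated fields, which is the classic cron shape and does not match any of the OnCalendar examples above.

```shell
# looks_like_cron (hypothetical helper): succeed if the schedule string has
# exactly five whitespace-separated fields, the shape of a cron expression.
# The body runs in a subshell so that set -f does not leak to the caller.
looks_like_cron() (
  set -f        # disable globbing so a bare * is not expanded to file names
  set -- $1     # split the schedule on whitespace into positional parameters
  [ "$#" -eq 5 ]
)
```

Possible usage: looks_like_cron "$schedule" && echo "this looks like cron, convert it to OnCalendar". A string that passes this check can still be invalid OnCalendar syntax, so the authoritative test remains running systemd-analyze calendar on a systemd host.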

Optional compose_extra knobs the apps role supports:

restart (services only): always, unless-stopped, on-failure, etc.
ports: list of "<host>:<container>" strings.
volumes: list of "<host_path>:<container_path>" strings. Pre-create the host path with the right owner before deploy — the role does NOT create arbitrary bind-mount source dirs.
command: list of argv tokens passed to the container CMD (override-only — the Dockerfile’s ENTRYPOINT still wraps).

Step 6 — Create the env file

touch secrets/<app_name>.env
chmod 0600 secrets/<app_name>.env

Populate with KEY=value lines (no export, no quoting plain strings). File is gitignored.
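
The line format can be checked before deploying. This is a minimal sketch with a hypothetical helper name, env_lint; it accepts blank lines and # comments as harmless, which is an assumption about what the env file consumer tolerates, and it does not check file mode or quoting.

```shell
# env_lint (hypothetical helper): print "line_number:text" for every line in
# the env file that is not blank, not a # comment, and not of the form
# KEY=value. A leading "export " makes a line fail the KEY=value shape.
# Exit status is 0 when offending lines were found, 1 when the file is clean.
env_lint() {
  grep -n -v -E '^([A-Za-z_][A-Za-z0-9_]*=.*|[[:space:]]*(#.*)?)$' "$1"
}
```

Possible usage: env_lint secrets/<app_name>.env && echo "fix the lines above before deploying".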

If the app talks to Supabase Postgres, use the pooler URL, not the direct connection:

SUPABASE_DB_URL=postgresql://postgres.<PROJECT_REF>:<PASSWORD>@aws-1-<REGION>.pooler.supabase.com:5432/postgres

The direct connection db.<ref>.supabase.co:5432 is IPv6-only for projects created after Jan 2024; Docker’s default bridge network is IPv4-only and the container can’t reach it (Network is unreachable (os error 101)). The session pooler at port 5432 supports IPv4 and accepts sqlx’s prepared statements. The transaction pooler at port 6543 is IPv4 too but does not support prepared statements — pgbouncer transaction mode breaks sqlx. See gcp-terraform-ansible-gotchas #18.
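
The three connection styles described above can be told apart from the URL alone. The following sketch uses a hypothetical helper, classify_supabase_url, and assumes the host and port patterns quoted in this section; it does not contact the database.

```shell
# classify_supabase_url (hypothetical helper): label a Postgres URL as the
# direct connection, the session pooler (port 5432), or the transaction
# pooler (port 6543), based only on host and port patterns.
classify_supabase_url() {
  case "$1" in
    *@db.*.supabase.co:*)          echo direct ;;
    *.pooler.supabase.com:5432/*)  echo session-pooler ;;
    *.pooler.supabase.com:6543/*)  echo transaction-pooler ;;
    *)                             echo unknown ;;
  esac
}
```

Possible usage: classify_supabase_url "$SUPABASE_DB_URL". Per the reasoning above, only session-pooler is expected to work for an sqlx app running on the default Docker bridge network.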

Pooler shard number

Newer Supabase projects use aws-1-<region> rather than aws-0-<region> in the pooler hostname. If aws-0-... returns tenant/user not found from the pooler, switch to aws-1-<region> and confirm with nslookup from the VM that the hostname resolves.

Step 7 — Pre-create any bind-mount paths on the VM

If your manifest entry has compose_extra.volumes, create the host path first. The apps role does not auto-create arbitrary bind sources.

/usr/bin/ssh ops@ops-vm-1 'sudo mkdir -p /var/lib/<app_data_dir> && sudo chown ops:ops /var/lib/<app_data_dir> && sudo chmod 0755 /var/lib/<app_data_dir>'

Step 8 — Deploy

make deploy APP=<app_name>

Watch the Ansible play. Expect tasks: Validate apps manifest schema → Validate job schedules (for jobs) → Apply each app entry (skips non-matching entries) → for the matched entry: create app directory, copy env file, render docker-compose.yml, validate compose file, render systemd service unit, optionally render systemd timer (jobs only), flush handlers, daemon-reload, enable and (re)start service unit or enable and start timer (jobs).

PLAY RECAP should show failed=0 and changed >= 5 for a fresh deploy.
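
The recap check can be automated if the play output is saved to a file. This is a sketch under assumptions: recap_ok is a hypothetical helper, and it relies on the usual recap line layout containing ok=N changed=N ... failed=N counters.

```shell
# recap_ok (hypothetical helper): read ansible-playbook output on stdin and
# succeed only if the last recap line shows failed=0 and changed >= $1
# (default 5, the fresh-deploy threshold from this step).
recap_ok() (
  min="${1:-5}"
  # Take the last line that carries the ok=/changed= counters.
  line="$(grep -E 'ok=[0-9]+ +changed=[0-9]+' | tail -n 1)"
  [ -n "$line" ] || exit 1
  changed="$(printf '%s\n' "$line" | sed -n 's/.*changed=\([0-9][0-9]*\).*/\1/p')"
  failed="$(printf '%s\n' "$line" | sed -n 's/.*failed=\([0-9][0-9]*\).*/\1/p')"
  [ "$failed" -eq 0 ] && [ "$changed" -ge "$min" ]
)
```

Possible usage: make deploy APP=<app_name> 2>&1 | tee deploy.log, then recap_ok 5 < deploy.log. For a redeploy where few tasks change, pass a lower threshold such as recap_ok 0.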

If you see ok=3 changed=0 and no _present.yml tasks ran

That was the include_tasks-tag-propagation bug fixed in cccd438 (Phase 1 follow-up). Make sure ansible/roles/apps/tasks/main.yml uses the include_tasks: { file: ..., apply: { tags: [apps] } } shape, not the plain include_tasks: <file> shape. Without apply: tags: [apps], --tags apps runs the dispatcher but every task inside _present.yml/_absent.yml is silently skipped.

Step 9 — Verify

Service (long-running):

/usr/bin/ssh ops@ops-vm-1 'sudo systemctl is-active apps-<app_name>.service'
/usr/bin/ssh ops@ops-vm-1 'sudo docker ps --filter "name=apps-<app_name>" --format "table {{.Names}}\t{{.Status}}"'
/usr/bin/ssh ops@ops-vm-1 'sudo docker logs --tail 30 apps-<app_name> 2>&1'

Job:

/usr/bin/ssh ops@ops-vm-1 'sudo systemctl list-timers apps-<job_name>.timer --no-pager'
/usr/bin/ssh ops@ops-vm-1 'sudo journalctl -u apps-<job_name>.service -n 50 --no-pager'

Always use /usr/bin/ssh

The interactive shell’s ssh alias (_kaku_wrapped_ssh in the operator’s zsh) errors out when called from non-interactive contexts. The absolute path bypasses the wrapping. (Discovered during Phase 2 smoke-check.)

Replace ops-vm-1 with ops-vm once the Tailscale name collision is cleaned up — see gcp-terraform-ansible-gotchas #15.

Procedure: redeploy an existing app (new image)

CI auto-builds + pushes :latest on every push to main matching the workflow’s paths: filter. Redeploy the running unit:

make deploy APP=<app_name>

The role re-renders the compose file and unit (idempotent: no changes if the templates match). The unit’s ExecStartPre=docker pull <image> fetches the new :latest on start, and for kind: service the role applies state: restarted unconditionally, so the container always restarts with the latest image. For kind: job, the timer keeps the same schedule and the next firing uses the new image.

If you need to redeploy without waiting for CI (e.g., to apply an env file change you just edited locally), the same command works — the role copies secrets/<name>.env → /opt/apps/<name>/.env on every run, and the restart picks up the new env.

Procedure: redeploy every app

make redeploy-all

Runs the apps role across every state: present entry in deploys/apps.yml. Service-kind apps are bounced unconditionally (the role’s state: restarted is not change-gated). Use sparingly — prefer make deploy APP=<name> for single-app redeploys.

Procedure: remove an app (clean teardown)

In deploys/apps.yml, change the entry’s state: present to state: absent. Keep the rest of the entry, because the role needs the keys to know what to remove.
Run make deploy APP=<app_name>. The role’s _absent.yml flow stops the unit/timer (failed_when: false so missing units don’t error), removes /etc/systemd/system/apps-<name>.service and .timer, removes /opt/apps/<name>/, and notifies daemon-reload.

Confirm:

/usr/bin/ssh ops@ops-vm-1 'sudo systemctl status apps-<app_name>.service 2>&1 | head -5'

Expect Unit apps-<app_name>.service could not be found.

Once verified, delete the manifest entry entirely. If the env file isn’t needed elsewhere: rm secrets/<app_name>.env.

Troubleshooting cheat sheet

Symptom	Root cause	Fix
`make configure` / `make deploy` hangs at `Wait for the VM to be reachable over SSH`	Tailscale name collision after reprovision (`ops-vm` exists offline + `ops-vm-1` is the new node)	Patch `ansible/inventory/hosts.ini` `ansible_host=ops-vm-1`. Permanent cleanup: delete the stale node in Tailscale admin console. (gcp-terraform-ansible-gotchas #15)
`apt cache update` fails with `NO_PUBKEY` during `make configure` (docker role)	GPG key file saved with `.gpg` extension but content is ASCII-armored; apt parses by extension	Save as `.asc` and update `signed-by=`. (gcp-terraform-ansible-gotchas #16)
`Add the Google Cloud SDK apt repository` task fails	Same as above for the gcloud apt key	Fixed at `4e6a0f1` — pull latest. (gcp-terraform-ansible-gotchas #16)
`make deploy` reports `ok=3 changed=0`, nothing applies	`include_tasks` doesn’t propagate parent tags to included tasks	Dispatcher must use `include_tasks: { file: ..., apply: { tags: [apps] } }`. Fixed at `cccd438`.
`apps : Validate job schedules` fails with `Failed to parse calendar specification`	`schedule:` in manifest is cron syntax, not OnCalendar	Use OnCalendar: `"*:0/15"`, `"hourly"`, `"daily"`, etc. (gcp-terraform-ansible-gotchas #12)
Container starts then crash-loops with `Network is unreachable (os error 101)` connecting to Postgres	Supabase direct connection `db.<ref>.supabase.co:5432` is IPv6-only; Docker bridge is IPv4-only	Switch `SUPABASE_DB_URL` to the session pooler (`aws-1-<region>.pooler.supabase.com:5432`, user `postgres.<ref>`). Not the transaction pooler at `:6543` — sqlx breaks under pgbouncer transaction mode. (gcp-terraform-ansible-gotchas #18)
Pooler returns `tenant/user postgres.<ref> not found`	Wrong shard number in pooler hostname (`aws-0-` vs `aws-1-`)	Try the other shard. Newer Supabase projects are on `aws-1-`.
`docker pull` fails with `unauthorized: authentication required` on the VM	VM SA missing `roles/artifactregistry.reader`, OR `gcloud auth configure-docker <ar_host>` not run for the right host	The terraform + docker role configure both. Run `make configure` to re-apply.
`terraform apply` fails on `wif_repo_binding` with `Resource not found`	Wrong `owner/repo` format in `ci_github_repos` (must match the regex `^[^/]+/[^/]+$`)	Fix the entry.
CI workflow fails immediately with `failed to parse called workflow ... workflow was not found`	Private reusable workflow access not enabled on `wowjeeez/terraform`	`gh api repos/wowjeeez/terraform/actions/permissions/access -X PUT -f access_level=user`
CI fails with `Permission 'iam.serviceAccounts.getOpenIdToken' denied`	WIF `attribute_condition` (`assertion.repository_owner == 'wowjeeez'`) rejected the calling repo’s owner	Confirm the app repo is under the `wowjeeez` owner; update the module’s `github_owner` input if you’re using a different account.
Telegram bot logs show `409 Conflict: terminated by other getUpdates request`	Another bot instance is polling Telegram with the same token (your local `cargo run`, a stale deploy, etc.)	Stop the duplicate. Telegram bot tokens are single-listener.

Quick reference

Path / value	What
`/Users/levander/levandor/terraform/`	Repo root
`deploys/apps.yml`	Manifest
`secrets/<app_name>.env`	Per-app env, gitignored, 0600
`ansible/roles/apps/`	Role: schema validate, state dispatch, templates
`ansible/inventory/hosts.ini`	Templated by terraform; patch if Tailscale name collides
`make deploy APP=<name>`	Deploy one app
`make redeploy-all`	Redeploy every present entry
`make vm-up`	terraform apply (IAM-only after first run)
`make configure`	Ansible only (no terraform)
`make verify`	Smoke check (may need inventory patch for ops-vm-1)
`/usr/bin/ssh ops@ops-vm-1`	Direct SSH bypassing zsh ssh wrapping
`wowjeeez/terraform/.github/workflows/build-push.yml@main`	Reusable CI workflow path

Always re-fetch concrete values via terraform output — don’t copy strings from this guide into production code.
