GCP / Terraform / Ansible Gotchas

Reusable gotchas discovered during the validation pass on gcp-vm-provisioning-design (Spec 1), expanded with additional traps caught during the live deployment of ops-vm (later renamed polymarket-infra), during Spec 2a Phase 1–3 (the deploy layer — see gcp-app-deploy-design), during Spec 2b (OTel pipeline), and during the CD workflow rollout. Each is relevant to any future GCP / Terraform / Ansible / OTel / Rust / Supabase / Tailscale work, not just this project — check this list before writing similar infra code.

For Agents

Thirty-nine independent traps, each with the symptom, the cause, and the fix. They are framework/provider/tool behaviors, not project-specific bugs. Where each group came from:

1–6: the Spec 1 design-validation pass.
7–11: the live Spec 1 deployment.
12–14: Spec 2a Phase 1 (code).
15–16: Spec 2a Phase 2 (EU migration).
17–18: Spec 2a Phase 3 (first-app deploy).
19–20: the CD workflow rollout.
21–22: Spec 2b (OTel SDK + structured logging).
23: a destroy/recreate cycle hitting GCP soft-delete.
24: a multi-app CD dispatch hitting a concurrency-group race.
25: a local-vs-GitHub-secret divergence during a redeploy that silently kept stale env values on the VM.
26–27: the 2026-06-09 babylon (Rust MCP server) deploy onto polymarket-infra.
28–33: the 2026-06-09 babylon compose refactor (4 failed deploy iterations before the stack came up clean).
34–37: the 2026-06-18 SigNoz dashboard housekeeping + NATS Prometheus metrics path.
38: the 2026-06-25 crypto emit EMIT_MAX_FIRES cap clear after the Pattern A → Pattern B cutover.
39: the 2026-06-30 fk-dev provision (tailscaled-restart handler drops the controlling ansible SSH session over tailnet).

1. `google_project_service` disables APIs project-wide on destroy

disable_on_destroy defaults to true

google_project_service has historically defaulted to disable_on_destroy = true (Google provider releases before 6.0; later releases default to false). With that default, running terraform destroy disables the API project-wide, affecting every other resource and tool in the project, not just what Terraform manages.

Fix: set disable_on_destroy = false on every google_project_service resource.

2. Terraform’s `file()` does not expand `~`

file() takes a literal path

Terraform’s file() function does not expand ~ to the home directory. file("~/.ssh/id_ed25519.pub") fails with a no-such-file error.

Fix: wrap the path in pathexpand() — file(pathexpand(var.ssh_public_key_path)). Apply this anywhere a user-supplied path may contain ~.

3. `ansible.cfg` is only auto-loaded from the CWD

ansible.cfg in a subdirectory is ignored

Ansible looks for its config in a fixed order: the path in $ANSIBLE_CONFIG, then ansible.cfg in the current working directory, then ~/.ansible.cfg, then /etc/ansible/ansible.cfg. It never searches subdirectories, so an ansible.cfg inside an ansible/ subdirectory is silently ignored when ansible-playbook is invoked from the repo root, and all its settings (inventory path, roles_path, host_key_checking) are quietly lost.

Fix: invoke Ansible with ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook …, or run from inside the ansible/ directory.
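The lookup order can be modeled in a few lines of Python. This is a simplified sketch for reasoning about which file wins, not Ansible's actual implementation: the helper name resolve_ansible_cfg and its parameters are made up for illustration, and it ignores details such as Ansible refusing to load a config from a world-writable directory.

```python
from pathlib import Path
from typing import Mapping, Optional


# Hypothetical helper: models the config search order, nothing more.
def resolve_ansible_cfg(
    cwd: Path,
    env: Mapping[str, str],
    home: Path,
    system_cfg: Path = Path("/etc/ansible/ansible.cfg"),
) -> Optional[Path]:
    candidates = []
    explicit = env.get("ANSIBLE_CONFIG")
    if explicit:
        path = Path(explicit)
        if not path.is_absolute():
            path = cwd / path  # relative values are taken relative to the CWD
        if path.is_dir():
            path = path / "ansible.cfg"
        candidates.append(path)
    candidates += [cwd / "ansible.cfg", home / ".ansible.cfg", system_cfg]
    # First existing file wins; subdirectories of cwd are never searched.
    return next((c for c in candidates if c.is_file()), None)
```

With cwd set to the repo root and an empty env, the sketch never returns ansible/ansible.cfg; it only does so when cwd is the ansible/ directory or when env carries ANSIBLE_CONFIG=ansible/ansible.cfg.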

4. GCP OS Login silently overrides `ssh-keys` metadata

OS Login wins over instance metadata SSH keys

If GCP OS Login is enabled — via project metadata or an org policy (constraints/compute.requireOsLogin) — it silently overrides the ssh-keys instance metadata. Metadata-key-based SSH then fails with no obvious error.

Fix: set enable-oslogin = "FALSE" in the instance metadata so metadata-key SSH is honored. Caveat: if OS Login is force-enabled by org policy, the instance setting is overridden and SSH must move to OS Login — a hard limitation.

5. The GCP `default` VPC ships an open SSH firewall rule

default-allow-ssh is open to 0.0.0.0/0

The GCP default VPC network ships with a default-allow-ssh firewall rule open to 0.0.0.0/0 — every VM placed in the default network is exposed to the public internet on tcp:22.

Fix: create a dedicated custom-mode VPC (google_compute_network with auto_create_subnetworks = false) so the VM does not inherit default-allow-ssh. Add only the inbound rules you actually need (for a Tailscale-only design, none).

6. `google_compute_instance` can race API enablement

Instance created before the Compute API is ready

Terraform may create a google_compute_instance before the google_project_service resource has finished enabling the Compute API, causing an intermittent apply failure on fresh projects.

Fix: add an explicit depends_on from the instance (or the VM module) to the google_project_service resources, so the API is guaranteed enabled first.

7. Tailscale SSH ACL `autogroup:self` does NOT match tagged devices

autogroup:self doesn’t match tagged devices

A Tailscale SSH ACL rule with dst: ["autogroup:self"] looks like the obvious way to say “my own devices,” but it does not match a node that was joined with a tagged auth key. The VM accepts no SSH sessions and tailscale ssh returns a permission-denied with no obvious cause.

Cause: autogroup:self is scoped to devices owned by the connecting user’s identity. Tagged devices (joined via auth keys carrying tag:<name>) have no user owner — the tag IS their identity — so autogroup:self excludes them by construction.

Fix: for a VM joined with a tagged auth key, target the tag explicitly in the ACL — dst: ["tag:<name>"] (e.g. tag:cloud). Keep src as autogroup:member (or a narrower group) and list the on-VM Unix users in users.

8. Ansible `gather_facts` runs before `pre_tasks`

Default fact-gathering races the VM boot

Ansible’s default fact-gathering step runs before pre_tasks execute. If the play targets a freshly provisioned VM that is still booting and joining the tailnet, the implicit setup happens before any wait_for_connection pre-task gets a chance to delay it — the run fails on the very first connect with UNREACHABLE and no useful output. The pre_tasks you added to absorb boot time never run.

Fix: set gather_facts: false at the play level, then explicitly run ansible.builtin.setup as a normal task after wait_for_connection succeeds. The play definition becomes:

- hosts: vm
  gather_facts: false
  become: true
  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 300
    - name: Gather facts now that the host is reachable
      ansible.builtin.setup:
  roles: [...]

9. OTel collector :8888/metrics is the canonical export-success signal

systemctl active does not mean metrics are leaving the host

systemctl is-active otelcol-contrib only proves the process is running. The collector can be happily up while the exporter is failing (auth wrong, endpoint unreachable, TLS error) — you’ll see nothing in the SigNoz dashboard and wonder why. Logs occasionally mention errors but are noisy and easy to misread.

Fix: the collector’s own internal telemetry is the source of truth. otelcol-contrib exposes Prometheus-format metrics on :8888/metrics by default. The two counters to watch:

otelcol_exporter_sent_metric_points — must be >0 and increasing over time. This confirms the exporter is succeeding end-to-end.
otelcol_exporter_send_failed_metric_points — must be 0 or absent. Any non-zero value means the exporter is dropping data; investigate the configured endpoint, credentials, and TLS.

Quick check from the VM:

curl -s localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)_metric_points'

Wire this into smoke checks instead of relying on systemctl is-active alone.
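A smoke check along these lines could scrape the endpoint twice and compare the counters. The sketch below only parses the Prometheus text body; fetching it (for example with curl, as above) is left to the caller. The function names and the sample text are illustrative assumptions, and metric names are matched by prefix in case a collector build appends a suffix such as _total.

```python
import re
from typing import Dict

# Matches "name{labels} value" or "name value" sample lines; comment lines do not match.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def exporter_totals(metrics_text: str) -> Dict[str, float]:
    """Sum sent and failed metric points across all exporters in one scrape."""
    totals = {"sent": 0.0, "failed": 0.0}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, value = match.group(1), float(match.group(2))
        if name.startswith("otelcol_exporter_sent_metric_points"):
            totals["sent"] += value
        elif name.startswith("otelcol_exporter_send_failed_metric_points"):
            totals["failed"] += value
    return totals


def export_is_healthy(first: Dict[str, float], second: Dict[str, float]) -> bool:
    """Healthy means: sent counter grew between scrapes and nothing failed."""
    return second["sent"] > first["sent"] and second["failed"] == 0


# Made-up scrape bodies, for illustration only.
SCRAPE_1 = 'otelcol_exporter_sent_metric_points{exporter="otlp"} 1250\n'
SCRAPE_2 = (
    "# TYPE otelcol_exporter_sent_metric_points counter\n"
    'otelcol_exporter_sent_metric_points{exporter="otlp"} 1900\n'
    'otelcol_exporter_send_failed_metric_points{exporter="otlp"} 0\n'
)
```

Comparing two scrapes rather than one covers the "increasing over time" half of the rule; a single non-zero reading could be left over from before the exporter started failing.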

10. OTel Collector v0.152 renamed common receivers and exporters

hostmetrics → host_metrics, otlp → otlp_grpc

OpenTelemetry Collector v0.152 renamed several long-standing component IDs:

The host-metrics receiver hostmetrics → host_metrics.

The OTLP gRPC exporter otlp → otlp_grpc (the new explicit name disambiguates it from otlphttp).

Older config files continue to work — the collector accepts the old names with a deprecation warning at startup — but the warning is easy to miss and the names will be removed eventually.

Fix: update the collector config to the new names whenever you touch it; consult the collector’s --help and the release notes for the version you’re deploying before copying a snippet from older tutorials. Pin the collector version in the Ansible role so behavior is reproducible.

11. `rtk proxy` bypasses RTK’s token-saving output filter

RTK rewrites commands and filters output — sometimes you need the raw thing

RTK (Rust Token Killer) transparently rewrites tool invocations (git status → rtk git status, etc.) and filters their output to save tokens. That filtering is great for chatty dev commands but actively harmful when a tool’s raw output is load-bearing — Terraform’s plan / apply diff, an Ansible run’s task-by-task report, or a raw JSON body from curl can be truncated or reformatted in ways that hide what you need to see.

Fix: prefix any such command with rtk proxy … to execute it unfiltered. Examples:

rtk proxy terraform plan -out=tfplan
rtk proxy terraform apply tfplan
rtk proxy ansible-playbook -i ansible/inventory/hosts.ini ansible/site.yml
rtk proxy curl -s https://api.example.com/foo | jq .

Keep the rewrite hook enabled for everyday commands; reach for rtk proxy whenever the exact raw output matters.

12. systemd `OnCalendar=` does not accept cron 5-field syntax

OnCalendar is not crontab

Set schedule: "*/15 * * * *" in deploys/apps.yml for a kind: job app and the render succeeds: the unit and timer files land on the VM. But systemctl start apps-<name>.timer returns non-zero, and systemctl status shows Failed to parse calendar specification from systemd-analyze. The timer never fires.

Cause: systemd’s OnCalendar= directive uses its own calendar event grammar (DayOfWeek Year-Month-Day HH:MM:SS, plus shorthands like hourly/daily/weekly/Mon..Fri 09:00). It does not accept the 5-field crontab grammar (m h dom mon dow).

Fix: document the manifest schedule: field as systemd OnCalendar syntax, with the example "*:0/15" (every 15 minutes — equivalent to */15 * * * * in cron). In the role tasks, add a fail-fast validation that runs systemd-analyze calendar "<schedule>" against every kind: job entry (and state: present) before any unit file is written. Caught at code-quality review of T7 in the Spec 2a Phase 1 cycle; fixed in commits de2f6a8 (docs) and 2488168/d30342e (validation task).
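The same fail-fast idea can be sketched outside Ansible, for example as a pre-commit check on the manifest. The helper names below are hypothetical. The heuristic flags five whitespace-separated fields made only of digits and cron operators, and the systemd-analyze call runs only when the binary is available on the machine doing the check.

```python
import re
import shutil
import subprocess

# A crontab field built only from digits and the * / , - operators.
_CRON_FIELD = re.compile(r"^[0-9*/,\-]+$")


def looks_like_cron(schedule: str) -> bool:
    """Heuristic: five whitespace-separated crontab-style fields."""
    fields = schedule.split()
    return len(fields) == 5 and all(_CRON_FIELD.match(f) for f in fields)


def validate_schedule(schedule: str) -> None:
    """Raise ValueError if the schedule is not usable as OnCalendar=."""
    if looks_like_cron(schedule):
        raise ValueError(
            f"{schedule!r} looks like crontab syntax; OnCalendar needs e.g. '*:0/15'"
        )
    if shutil.which("systemd-analyze"):
        # Authoritative check: systemd-analyze exits non-zero on a parse failure.
        result = subprocess.run(
            ["systemd-analyze", "calendar", schedule],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise ValueError(result.stderr.strip() or f"invalid OnCalendar value: {schedule!r}")
```

The heuristic alone is enough to catch the crontab mistake on a laptop without systemd; the systemd-analyze call catches everything else when run on the VM or a Linux runner.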

13. Jinja2 filter `|` binds tighter than comparison `==`

a == 'absent' | ternary(...) is NOT what you think

An Ansible task like include_tasks: "{{ (item.state | default('present')) == 'absent' | ternary('_absent.yml', '_present.yml') }}" runs but every iteration evaluates the expression to 'False' instead of '_present.yml' or '_absent.yml'. Include fails with “Invalid options”.

Cause: in Jinja2, the filter pipe | has higher precedence than comparison ==. The unparenthesized expression a == 'absent' | ternary(...) parses as a == ('absent' | ternary(...)): the truthy string 'absent' is piped to ternary, producing '_absent.yml', and the comparison a == '_absent.yml' is always False for normal state values. In C-family languages | is a bitwise operator that binds looser than ==, so habits carried over from them point the wrong way here.

Fix: always wrap comparison operands explicitly when the result is piped into a filter:

"{{ ((item.state | default('present')) == 'absent') | ternary('_absent.yml', '_present.yml') }}"

A quick render-check inline avoids the trap:

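from jinja2 import Environment

# The expression under test; paste the exact string from the task here.
EXPR = "{{ ((item.state | default('present')) == 'absent') | ternary('_absent.yml', '_present.yml') }}"
env = Environment()
# Minimal stand-in for Ansible's ternary filter (plain Jinja2 does not ship one).
env.filters['ternary'] = lambda c, t, f: t if c else f
print(env.from_string(EXPR).render(item={'state': 'absent'}))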

Caught at code-quality review of T8 in the Spec 2a Phase 1 cycle; fixed in commit d30342e.

14. ansible-core 2.20 rejects Jinja expressions as `include_tasks` filenames at parse time

2.20 tightened module-option templating

ansible-playbook --syntax-check ansible/site.yml fails with:

[ERROR]: Invalid options for ansible.builtin.include_tasks: {{ ((item.state | default('present'))
Origin: ansible/roles/apps/tasks/main.yml:27:3

The expression itself is valid Jinja and would render correctly at runtime; 2.20 rejects it before runtime evaluation.

Cause: ansible-core 2.20 tightened its handling of module option templating. The include_tasks action accepts only a plain filename or a single variable reference as its argument; complex Jinja expressions are no longer permitted inline.

Fix: compute the filename in a vars: block and reference the resulting variable:

- name: Apply each app entry (filtered by deploy_only when set)
  ansible.builtin.include_tasks: "{{ apps_tasks_file }}"
  loop: "{{ apps }}"
  loop_control:
    label: "{{ item.name }}"
  vars:
    apps_tasks_file: "{{ ((item.state | default('present')) == 'absent') | ternary('_absent.yml', '_present.yml') }}"
  when: (deploy_only is not defined) or (deploy_only == item.name)
  tags: [apps]

The vars: block is evaluated per loop iteration; include_tasks then sees a plain variable reference, which 2.20 accepts. Caught at T9 syntax-check in the Spec 2a Phase 1 cycle; fixed in commit 78a01ec.

15. Tailscale name collision after VM reprovision

terraform destroy does NOT remove a node from the tailnet

Re-provision the VM (destroy + apply in a new region, for example), and make configure fails on Wait for the VM to be reachable over SSH with ssh: connect to host ops-vm port 22: Operation timed out — even though the new VM is healthy and terraform output vm_external_ip returns a real address. tailscale status shows TWO entries: the old destroyed VM (offline, last seen Nm ago) and the new one with a -1 suffix (ops-vm-1).

Cause: Tailscale registers a node identity when it joins; destroying the underlying VM doesn’t unregister it. When the new VM’s cloud-init runs tailscale up, the desired hostname is taken (by the offline old node), so Tailscale auto-appends -1. The Ansible inventory still says ansible_host=ops-vm, which MagicDNS resolves to the OFFLINE old node’s IP — connection times out.

Immediate fix (unblock the current run): patch ansible/inventory/hosts.ini so ansible_host=ops-vm-1 (whatever suffix Tailscale assigned — check with tailscale status | grep ops-vm). Re-run make configure / make deploy.

Permanent cleanup: in the Tailscale admin console at https://login.tailscale.com/admin/machines, delete the offline ops-vm entry. On the new VM, run sudo tailscale logout && sudo tailscale up --authkey=<key> --hostname=ops-vm --advertise-tags=tag:cloud --ssh to re-register under the now-free name. Then on next terraform apply the templated inventory regenerates to ops-vm and Ansible lines back up.

Prevention going forward: before terraform destroy, manually delete the VM’s Tailscale node first (admin console or tailscale logout on the VM). That makes the name available immediately for the next provision.

Caught during Spec 2a Phase 2 EU migration; fixed at runtime via the inventory patch. The terraform-templated inventory at local_file.inventory will regenerate to ops-vm on the next apply, so the patch is ephemeral: reapply it after every terraform apply until the Tailscale-side cleanup happens.
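The collision can also be detected programmatically before Ansible runs. The sketch below inspects the parsed output of tailscale status --json and assumes each peer entry carries DNSName and Online fields, with the -N suffix showing up in the first DNS label; the helper names are hypothetical.

```python
import re
from typing import Any, Dict, List, Optional


def tailnet_names(status: Dict[str, Any], base: str) -> List[Dict[str, Any]]:
    """List peers whose MagicDNS short name is `base` or `base-N`."""
    pattern = re.compile(rf"^{re.escape(base)}(-\d+)?$")
    matches = []
    for peer in (status.get("Peer") or {}).values():
        short = (peer.get("DNSName") or "").split(".")[0]
        if pattern.match(short):
            matches.append({"name": short, "online": bool(peer.get("Online"))})
    return sorted(matches, key=lambda m: m["name"])


def live_name(status: Dict[str, Any], base: str) -> Optional[str]:
    """Return the single online name for `base`, or None if ambiguous or absent."""
    online = [m["name"] for m in tailnet_names(status, base) if m["online"]]
    return online[0] if len(online) == 1 else None
```

If live_name returns something other than the base name, the inventory needs the temporary patch described above and the stale node still has to be deleted from the tailnet.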

16. Google Cloud APT GPG key needs the `.asc` extension (not `.gpg`)

signed-by=<keyfile> is format-sensitive

Ansible’s Add the Google Cloud SDK apt repository task hangs in retry loops and ultimately fails with W:GPG error: ... NO_PUBKEY C0BA5CE6DC6315A3 and E:The repository 'https://packages.cloud.google.com/apt cloud-sdk InRelease' is not signed. The previous Download the Google Cloud apt GPG key task succeeds (file lands on disk), but apt refuses to use it.

Cause: https://packages.cloud.google.com/apt/doc/apt-key.gpg serves an ASCII-armored GPG key (despite the .gpg suffix in the URL). Modern apt (Debian 12 / Ubuntu 24.04 era) interprets a signed-by=/etc/apt/keyrings/<name>.gpg reference as a binary keyring file and fails to parse the armored content. The Docker GPG key URL also serves armored content, but that role saves it as docker.asc, which apt correctly recognizes.

Fix: save the Google Cloud key as .asc, not .gpg, and update the signed-by= to match:

- name: Download the Google Cloud apt GPG key
  ansible.builtin.get_url:
    url: https://packages.cloud.google.com/apt/doc/apt-key.gpg
    dest: /etc/apt/keyrings/google-cloud.asc   # NOT .gpg
    mode: "0644"
 
- name: Add the Google Cloud SDK apt repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/google-cloud.asc] https://packages.cloud.google.com/apt cloud-sdk main"
    filename: google-cloud-sdk
    state: present

Alternative is to gpg --dearmor before saving, but matching the Docker pattern (.asc extension) is simpler and consistent within the same role.
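When a role downloads keys from several vendors, the extension can be chosen from the content instead of trusting the URL suffix. A minimal sketch, with hypothetical helper names, that checks for the ASCII-armor header:

```python
ARMOR_HEADER = b"-----BEGIN PGP PUBLIC KEY BLOCK-----"


def keyring_extension(key_bytes: bytes) -> str:
    """Return '.asc' for ASCII-armored keys, '.gpg' for anything else (assumed binary)."""
    return ".asc" if key_bytes.lstrip().startswith(ARMOR_HEADER) else ".gpg"


def keyring_path(name: str, key_bytes: bytes) -> str:
    """Build the destination path to use in both get_url dest and signed-by=."""
    return f"/etc/apt/keyrings/{name}{keyring_extension(key_bytes)}"
```

The sketch treats every non-armored file as a binary keyring, which is an assumption; it does not validate that the bytes are a real OpenPGP key.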

Caught during Spec 2a Phase 2’s first make configure against the new EU VM; fixed in commit 4e6a0f1.

17. `.dockerignore` excludes the `sqlx::migrate!()` migration directory

sqlx embeds the migration directory at compile time

A Rust app using sqlx::migrate!("../../supabase/migrations") (or any path-based variant) needs the migrations directory in the Docker build context because the macro reads it at cargo build time. If .dockerignore excludes the path, the build fails with:

error: error canonicalizing migration directory /build/crates/<crate>/../../supabase/migrations: No such file or directory (os error 2)
  --> crates/<crate>/src/db.rs:19:5
   |
19 |     sqlx::migrate!("../../supabase/migrations")

This is a compile-time error, not a runtime one; there is no graceful recovery.

Cause: sqlx::migrate!() is a procedural macro. It runs in the compiler, against the host filesystem of the build environment. In a Docker build, that’s the build context — files outside the context are invisible. .dockerignore filters the context before it’s sent to the daemon.

Fix: remove the relevant directory from .dockerignore entirely. A !-include placed AFTER the exclusion looks like a way to re-include only the migrations subdirectory, but it did not work here (option 2 in the listing):

# WRONG — kills the migrations dir
supabase/

# RIGHT (option 1) — drop the exclusion entirely
# (no supabase/ entry)

# CAVEAT (option 2) — re-include requires parent NOT to be excluded
# supabase/
# !supabase/migrations/   ← does NOT work: docker can't re-include from an excluded parent dir

Re-including with ! did not take effect here while the parent directory was fully excluded, so the clean fix is to drop the parent exclusion. Then list specific subdirs you actually want to exclude (e.g. supabase/.branches/, supabase/.temp/) if any.
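A pre-build check can catch this before cargo does. The sketch below is deliberately simplified: it only understands literal directory patterns (no globs), skips negated lines in line with the observation above, and the function name is hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def excluding_patterns(dockerignore_text: str, needed_dir: str) -> List[str]:
    """Return literal .dockerignore lines that exclude needed_dir or one of its parents."""
    needed = PurePosixPath(needed_dir.strip("/"))
    ancestors = {str(needed)} | {str(p) for p in needed.parents if str(p) != "."}
    hits = []
    for raw in dockerignore_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and negations; only plain exclusions are checked.
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        if line.strip("/") in ancestors:
            hits.append(line)
    return hits
```

A non-empty result for "supabase/migrations" means the sqlx macro will not find its directory inside the build context.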

Caught during Spec 2a Phase 3 (polymarket-fetch first deploy); fixed in commit 834bc66 on wowjeeez/polymarket-fetch.

18. Supabase direct connection is IPv6-only; Docker bridge is IPv4-only

Network is unreachable (os error 101) connecting to Supabase Postgres from a container

A Rust app deployed in Docker connects to Polymarket APIs fine but crash-loops trying to reach Supabase Postgres:

error: connecting to Postgres pool: error communicating with database:
Network is unreachable (os error 101): Network is unreachable (os error 101)

The same app, run directly on a Mac with IPv6, works fine.

Cause: Supabase projects created after January 2024 use IPv6-only for the direct connection (db.<project_ref>.supabase.co:5432). Docker’s default bridge network is IPv4-only, so the container has no route to the IPv6 endpoint.

Fix: switch to a Supabase pooler URL (IPv4-capable). Two pooler modes:

| Mode | Port | sqlx-compatible? | Notes |
| --- | --- | --- | --- |
| Session | `5432` | YES | Full Postgres protocol: supports prepared statements, transactions, `LISTEN/NOTIFY`. Use this. |
| Transaction | `6543` | NO for sqlx | pgbouncer transaction mode strips prepared statements; sqlx breaks. |

URL format (session pooler):

postgresql://postgres.<PROJECT_REF>:<PASSWORD>@aws-<SHARD>-<REGION>.pooler.supabase.com:5432/postgres

Where:

<PROJECT_REF> — your project’s ref (the slug in the dashboard URL).
<PASSWORD> — the same DB password you already use.
<SHARD> — 0 or 1. Newer projects are on aws-1-. If aws-0-... returns tenant/user not found, try aws-1-.... Confirm via nslookup aws-N-<region>.pooler.supabase.com from the VM.
<REGION> — the AWS region for your project (e.g. eu-west-1 if the Supabase project is in eu-west-1).

Alternative fixes (more invasive, not recommended for a single-VM deploy):

Enable IPv6 in Docker’s daemon (/etc/docker/daemon.json ipv6: true + fixed-cidr-v6) and in the GCE VPC subnet (stack_type = "IPV4_IPV6"). Reaches IPv6 directly. Skips the pooler. More moving parts.
Use Cloud SQL Auth Proxy or similar IPv4-only bridge. Over-engineered for this scale.

Caught during Spec 2a Phase 3 first-app deploy; fixed by rewriting secrets/polymarket-fetch.env to use aws-1-eu-west-1.pooler.supabase.com:5432 with the postgres.<ref> tenant-prefixed user. The pooler shard was discovered by running pg_isready from a one-off container on the VM against each candidate hostname.
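Assembling the session pooler URL by hand is error-prone, mainly because a password containing @, / or : must be percent-encoded. A small sketch that follows the URL format given above; the function name and the example values are placeholders, not real credentials.

```python
from urllib.parse import quote


def session_pooler_url(
    project_ref: str,
    password: str,
    region: str,
    shard: int = 1,
    database: str = "postgres",
) -> str:
    """Build a Supabase session-pooler URL (port 5432) with an encoded password."""
    user = f"postgres.{project_ref}"  # tenant-prefixed user required by the pooler
    host = f"aws-{shard}-{region}.pooler.supabase.com"
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:5432/{database}"
```

If the default shard is rejected with tenant/user not found, pass shard=0 and retry, as described in the <SHARD> note.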

19. Userspace-mode `tailscaled` on a GH Actions runner has no MagicDNS

Could not resolve hostname <tailnet-name>: name resolution

The tailscale/github-action@v3 action joins the runner to the tailnet in userspace networking mode (no /dev/net/tun). In that mode, the runner’s tailscaled does NOT inject a resolver into /etc/resolv.conf. As a result, the runner’s getent hosts, ssh user@<magicdns>, tailscale ip <magicdns> (which itself queries DNS!), and any other “look up that hostname” call all fail with “name resolution” errors. ICMP/UDP-equivalent traffic still works because tailscaled handles those via its own dialer.

Cause: userspace networking means tailscaled exposes a SOCKS5/HTTP proxy and an internal dialer; it does NOT replace the kernel’s network stack. The OS resolver has no idea about *.ts.net or MagicDNS shortnames.

Fix(es):

Resolve via the daemon’s local API, not DNS. tailscale status --json returns each peer’s HostName and TailscaleIPs[]. Grep for the peer and write the IPv4 into /etc/hosts:

VM_IP=$(tailscale status --json | jq -r '.Peer // {} | to_entries[] | .value | select(.HostName == "polymarket-infra") | .TailscaleIPs[] | select(test("\\."))' | head -1)
echo "$VM_IP polymarket-infra" | sudo tee -a /etc/hosts

Retry up to ~30s because the netmap fetch is async after tailscale up returns — first reads after join can return Peer: null.
Have a fallback IP so a transient netmap delay doesn’t fail the workflow:
```
if [ -z "$VM_IP" ]; then VM_IP="$VM_FALLBACK_IP"; fi
```
The fallback only stays correct until the VM is destroyed + recreated (since tailnet IPs are assigned per-node). If you reprovision, bump the fallback in the workflow file.
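The lookup, retry, and fallback steps can be combined in one helper. This is a sketch of the same logic as the jq pipeline, assuming the Peer / HostName / TailscaleIPs layout shown above; the function names, the 15 × 2 s retry budget, and the injectable fetch and sleep parameters are illustrative choices.

```python
import ipaddress
import json
import subprocess
import time
from typing import Any, Callable, Dict, Optional


def peer_ipv4(status: Dict[str, Any], hostname: str) -> Optional[str]:
    """Return the first IPv4 tailnet address of the peer named hostname."""
    for peer in (status.get("Peer") or {}).values():  # Peer can be null right after join
        if peer.get("HostName") != hostname:
            continue
        for ip in peer.get("TailscaleIPs") or []:
            if ipaddress.ip_address(ip).version == 4:
                return ip
    return None


def read_status() -> Dict[str, Any]:
    """Ask the local tailscaled for its view of the tailnet (no DNS involved)."""
    out = subprocess.run(
        ["tailscale", "status", "--json"], check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)


def resolve_with_retry(
    hostname: str,
    fallback_ip: str,
    fetch: Callable[[], Dict[str, Any]] = read_status,
    attempts: int = 15,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll for roughly attempts * delay seconds, then fall back to a known IP."""
    for _ in range(attempts):
        ip = peer_ipv4(fetch(), hostname)
        if ip:
            return ip
        sleep(delay)
    return fallback_ip
```

Passing fetch and sleep as parameters keeps the helper testable without a running tailscaled; in the workflow the defaults would be used and the result written to /etc/hosts.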

Caught during the 2026-05-27 CD workflow rollout. Fixed in .github/workflows/deploy-apps.yml “Resolve VM tailnet IP into /etc/hosts” step.

20. Tailscale OAuth client 403 — must have tag in client Tags AND tagOwners

Status: 403, Message: "calling actor does not have enough permissions to perform this function"

The tailscale/github-action@v3 retries tailscale up 5 times, each returning 403. The retry loop swallows the error, the step is reported as “✓ success”, and downstream steps operate on a dead daemon. Easy to miss without explicit log inspection.

Cause: Tailscale OAuth clients can ONLY mint auth keys for tags they’re explicitly scoped to. A 403 means the client is trying to use a tag (e.g. tag:ci) that isn’t in its allowed set. Two configuration points must both be true:

OAuth client’s “Tags” field at admin → Settings → OAuth clients must include the tag. If the field is empty or has a different tag, the client can’t issue keys with --advertise-tags=tag:ci.
tagOwners in the tailnet ACL must declare the tag exists:
```
"tagOwners": {
  "tag:cloud": ["autogroup:admin"],
  "tag:ci":    ["autogroup:admin"],
}
```
Without this, the tag isn’t a recognized identity in the tailnet — the OAuth client UI may not even let you assign it.

Fix: make both true. Visit Settings → OAuth clients, regenerate the client with the right Tags; verify tagOwners in Settings → Access controls. If you can’t select the tag in the dropdown, fix the ACL first.

Diagnostic to surface the silent failure: add a debug step after the Tailscale action that prints tailscale status and the runner’s identity. If Self.Tags is null or empty, the join didn’t actually happen — go straight to the OAuth/ACL audit.
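The debug step could fail the job outright instead of only printing. A minimal sketch that inspects parsed tailscale status --json output; it assumes the top-level BackendState and Self.Tags fields, and the function name is hypothetical.

```python
from typing import Any, Dict


def join_problem(status: Dict[str, Any], expected_tag: str) -> str:
    """Return an empty string if the runner joined with expected_tag, else a reason."""
    state = status.get("BackendState")
    if state != "Running":
        return f"tailscaled is not running (BackendState={state!r})"
    tags = (status.get("Self") or {}).get("Tags") or []
    if expected_tag not in tags:
        return f"{expected_tag} missing from Self.Tags ({tags!r})"
    return ""
```

A non-empty return value is the cue to exit non-zero and go straight to the OAuth/ACL audit rather than letting later steps run against a dead daemon.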

Caught during the 2026-05-27 CD workflow rollout (5 attempts, all 403, all swallowed).

21. OTel Rust SDK panics on init if Tokio runtime isn’t entered yet

panicked: there is no reactor running, must be called from the context of a Tokio 1.x runtime

The hyper-util-based OTLP gRPC exporter (opentelemetry_otlp::SpanExporter::builder().with_tonic()...build()) spawns its batch worker via tokio::spawn at construction time. If you call observability::init() from synchronous main() BEFORE entering the #[tokio::main] runtime, you get an immediate panic on startup. The container crash-loops; systemd keeps restarting it forever.

Cause: Tokio’s spawn() requires a current runtime. The OTLP exporter’s with_tonic() builds an internal client that immediately wants to schedule background work. If init runs in plain sync main, there’s no runtime to spawn on.

Fix: Build/enter the runtime FIRST, init the SDK SECOND:

fn main() -> anyhow::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread().enable_all().build()?;
    let _enter = rt.enter();
    let _guard = observability::init("my-service", "info", false)?;
    rt.block_on(run())
}

Or, if you can keep #[tokio::main], call init() from inside the async body:

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let _guard = observability::init("my-service", "info", false)?;   // already inside the runtime
    run().await
}

Why this is easy to miss in CI: if the test harness never sets OTEL_EXPORTER_OTLP_ENDPOINT, the SDK falls back to a no-op path that doesn’t spawn anything. The runtime-required branch only runs in production. Local smoke tests for future OTel-related commits must therefore set the endpoint explicitly to force the active branch.

Caught during Spec 2a Phase 3 redeploy after the observability module was lifted to a shared crate (commit c211835 panicked; fix at 4305e25).

22. Format-string `tracing` calls produce opaque log bodies in SigNoz

body.message: "bot loop starting; allowed_chats=[\"-5279801901\"]", attributes_string: {}

When you write tracing::info!("msg with X={:?}", x), the entire formatted string lands in the log record’s body.message field and attributes_string stays empty. SigNoz can’t filter on x, can’t extract it as a column, can’t groupBy it in a dashboard. The Logs view becomes a wall of opaque text instead of a structured table.

Cause: tracing macros support two styles:

// (A) format-string style — equivalent to printf
tracing::info!("msg with X={:?}", x);
 
// (B) structured-field style — what the OTel bridge consumes
tracing::info!(x = ?x, "msg with X");

Only (B) produces structured attributes that opentelemetry-appender-tracing::OpenTelemetryTracingBridge extracts as OTel LogRecord attributes. Style (A) ends up as one big body string.

Fix: prefer style (B) everywhere. Cheat sheet:

?x → emits via Debug, attribute name = x
%x → emits via Display, attribute name = x
name = ?expr / name = %expr → custom attribute name
Multi-field: tracing::info!(run_id = %run_id, rows = n, "scheduled run complete")

Side benefit: wrap hot async fns in #[tracing::instrument] to create spans automatically. Logs emitted INSIDE those spans then carry trace_id + span_id automatically, enabling one-click trace-to-logs drill-down in SigNoz.

Caught during Spec 2b verification on 2026-05-27. Bot’s logs were all style-A; wallet-monitor’s were style-B and came through clean. Cleanup commits: c97a085 (32 sites converted across polymarket-fetch/{bot,main,scheduled_run,wallet_monitor}.rs + pm-arb/validate.rs, 9 #[instrument] annotations added).

23. GCP Workload Identity Pool + OIDC Provider have 30-day soft-delete

Error: Error creating WorkloadIdentityPool: googleapi: Error 409: Requested entity already exists

After terraform destroy removes the WIF pool (google_iam_workload_identity_pool) and its OIDC provider, recreating with the same name FAILS for 30 days. The resource appears destroyed (your state list is empty) but GCP holds the name reserved.

Cause: IAM has soft-delete retention on several resource types — WIF pools, OIDC providers, service accounts. Names are not freed instantly; they sit in a “deleted” state for ~30 days before purge.

Fix(es) without waiting 30 days:

:undelete via REST API — bring the soft-deleted resource back:

TOKEN=$(gcloud auth application-default print-access-token)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "https://iam.googleapis.com/v1/projects/<proj>/locations/global/workloadIdentityPools/github-pool:undelete" \
  -d '{}'

For the OIDC provider (child of the pool):

curl -X POST -H "Authorization: Bearer $TOKEN" \
  "https://iam.googleapis.com/v1/projects/<proj>/locations/global/workloadIdentityPools/github-pool/providers/github:undelete" \
  -d '{}'

Pool and provider are undeleted with separate calls. Undelete the pool first: the provider is gated on the pool being active.

terraform import the resurrected resources — they’re not in terraform state, so terraform apply will try to create them again and hit the same 409. Bring them back into state first:

terraform import 'module.ar_wif.google_iam_workload_identity_pool.github' \
  projects/<proj>/locations/global/workloadIdentityPools/github-pool
terraform import 'module.ar_wif.google_iam_workload_identity_pool_provider.github_oidc' \
  projects/<proj>/locations/global/workloadIdentityPools/github-pool/providers/github

Then terraform apply succeeds; state is consistent with reality.

Use a different name — if you don’t care about the original name, change var.ar_repo_name or whatever feeds the resource name. New name → no collision. Ugly but works.

Service accounts have the same retention; rename pattern (${var.vm_name}-sa) means changing vm_name avoids that one automatically.

Caught during the 2026-05-27 EU migration (make deprovision of us-central1 → make provision in europe-west3). Fixed with :undelete + terraform import of pool and provider.
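The two undelete calls can be scripted so the pool always goes first. The sketch below builds the same URLs as the curl commands above and posts to them with the standard library; the function names are hypothetical, the token is assumed to come from gcloud as shown earlier, and the Content-Type header is an addition that the curl form does not send.

```python
import json
import urllib.request
from typing import List

IAM_API = "https://iam.googleapis.com/v1"


def undelete_urls(project: str, pool_id: str, provider_id: str) -> List[str]:
    """Return the :undelete URLs in the order they must be called (pool, then provider)."""
    pool = f"{IAM_API}/projects/{project}/locations/global/workloadIdentityPools/{pool_id}"
    return [f"{pool}:undelete", f"{pool}/providers/{provider_id}:undelete"]


def undelete(url: str, token: str) -> dict:
    """POST an empty JSON body to one :undelete URL and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

After both calls succeed, the terraform import commands above still have to run before the next apply.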

24. GitHub Actions concurrency group with `cancel-in-progress: false` still cancels the intermediate PENDING run

Dispatching 3+ runs into one concurrency group silently cancels the middle one

The deploy-apps.yml reusable CD workflow (wowjeeez/terraform) serializes deploys so two Ansible runs never hit polymarket-infra at once:

concurrency:
  group: deploy-apps
  cancel-in-progress: false

Dispatch all three polymarket-fetch app deploys back-to-back:

gh workflow run deploy-apps.yml -f app=polymarket-fetch-scheduled-run
gh workflow run deploy-apps.yml -f app=polymarket-fetch-bot
gh workflow run deploy-apps.yml -f app=polymarket-fetch-wallet-monitor

The middle run (bot) then shows completed / cancelled without ever running, while the first runs and the third queues.

Cause: a GitHub Actions concurrency group permits at most one in-progress run + one pending run at a time. With cancel-in-progress: false, a newly-queued run does NOT cancel the running one — but it DOES replace (cancel) any already-pending run in the same group. When the 3rd dispatch (wallet-monitor) arrived, it bumped the 2nd (bot, then pending) into cancelled. Net result: run #1 ran, run #3 queued, run #2 was silently dropped.

Fix / operating rule: fire deploys sequentially — dispatch one app, wait for it to finish, then dispatch the next:

RUN_ID=$(gh run list --workflow=deploy-apps.yml -L1 --json databaseId -q '.[0].databaseId')
gh run watch "$RUN_ID" --exit-status

Or keep at most 2 dispatches in flight (one running + one pending). Never fan out 3+ dispatches into the same concurrency group at once, or the middle ones get cancelled.
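The one-running-plus-one-pending rule is easy to see in a toy model. The class below is only a simulation of the behavior described in this section, not GitHub code; its name and fields are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model: one running slot, one pending slot, cancel-in-progress: false."""

    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)
    finished: List[str] = field(default_factory=list)

    def dispatch(self, run: str) -> None:
        if self.running is None:
            self.running = run
            return
        # The running job is left alone, but an older pending run is replaced.
        if self.pending is not None:
            self.cancelled.append(self.pending)
        self.pending = run

    def finish(self) -> None:
        """Complete the running job and promote the pending one, if any."""
        if self.running is not None:
            self.finished.append(self.running)
        self.running, self.pending = self.pending, None
```

Dispatching three runs in a burst leaves the second one in cancelled, while alternating dispatch and finish completes all three, which matches the sequential operating rule.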

When to dispatch individually

gh workflow run deploy-apps.yml -f app=all is a single run that deploys everything, but it also bounces pm-arb-validate. When you need to deploy only the three polymarket-fetch services and leave pm-arb alone, you must dispatch them individually — and therefore sequentially, per the rule above.

Caught during the 2026-05-29 polymarket-fetch redeploy (three back-to-back dispatches; bot cancelled before running).

25. `secrets/<app>.env` is gitignored and IGNORED by CD — the GitHub Actions repo secret is the real source of truth

Local edits to secrets/<app>.env have zero effect on deploy-apps.yml runs
Operator adds POLYMARKET_PROXIES=... to the local /Users/levander/levandor/terraform/secrets/polymarket-fetch.env, dispatches multiple per-app deploys via gh workflow run deploy-apps.yml, and the new env var never reaches the container. Container logs show the new None-branch WARN (POLYMARKET_PROXIES not set: data-pipeline requests will use single VM egress IP) at startup despite the local file having all 5 keys correctly. SSH into the VM and cat /opt/apps/polymarket-fetch-wallet-monitor/.env shows the file is 294 bytes, dated May 27 — three days stale, with only the original 4 keys. Every deploy looked successful but the env file on the VM never changed.

Cause: the local secrets/polymarket-fetch.env is gitignored (per the repo’s secrets/ ignore pattern) and is not what the GitHub Actions workflow uses. The CD workflow wowjeeez/terraform/.github/workflows/deploy-apps.yml has a “Materialize per-app env file” step (lines 47–53) that writes secrets/polymarket-fetch.env on the runner from the GitHub Actions repo secret POLYMARKET_FETCH_ENV:
- name: Materialize per-app env file
  env:
    PMF_ENV: ${{ secrets.POLYMARKET_FETCH_ENV }}
  run: |
    mkdir -p secrets
    install -m 0600 /dev/null secrets/polymarket-fetch.env
    printf '%s' "$PMF_ENV" >> secrets/polymarket-fetch.env
The Ansible apps role’s _present.yml then copies that materialized file to /opt/apps/<name>/.env on the VM, where docker-compose env_file: reads it. The runner has no access to your local disk; it materializes its own copy from the repo secret every run. So local edits are dev-only / for local Ansible runs and have no effect on GH-Actions-driven deploys. Worse, Ansible’s copy module is content-hashed and idempotent, so when the materialized file matches what’s already on the VM, the file’s mtime stays at May 27 across many “successful” deploys — masking the fact that nothing was being updated.

Fix / operating rule: when you need to change env values for any app deployed via deploy-apps.yml, update the GitHub Actions repo secret, not the local file. The reliable mechanic (no value echoing, sources existing content from the live VM so working credentials aren’t overwritten by a stale local copy):

TMP=$(mktemp)  # unpredictable path, created with mode 0600
trap 'rm -f "$TMP"' EXIT
ssh ops@<vm> sudo cat /opt/apps/<app>/.env > "$TMP"
echo >> "$TMP"
echo 'NEW_KEY=value' >> "$TMP"
gh secret set POLYMARKET_FETCH_ENV -R wowjeeez/terraform < "$TMP"
gh workflow run deploy-apps.yml -R wowjeeez/terraform --ref main -f app=<app>

gh secret set reads from stdin if neither -b nor -f is provided, so the new value never appears in shell history or terminal scrollback. Sourcing from the live VM (not from a possibly-stale local file) avoids accidentally clobbering working credentials.
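Because every deploy looks green even when the env file never changes, it can help to compare key names (never values) between the file you believe is deployed and the copy fetched from the VM. The sketch below assumes plain `KEY=value` dotenv lines; `env_keys` and `missing_keys` are hypothetical helper names.

```shell
# List only the variable names in a dotenv-style file (values are never printed).
env_keys() {
  LC_ALL=C grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | LC_ALL=C sort -u
}

# Print keys that exist in the first file but are missing from the second.
missing_keys() {
  want=$(mktemp) have=$(mktemp)
  env_keys "$1" > "$want"
  env_keys "$2" > "$have"
  # comm -23 keeps lines that appear only in the first sorted input.
  LC_ALL=C comm -23 "$want" "$have"
  rm -f "$want" "$have"
}
```

For example, `missing_keys secrets/polymarket-fetch.env "$TMP"` (with `$TMP` holding the VM copy fetched over ssh as above) would have printed `POLYMARKET_PROXIES` in the incident described here.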

Tip

The local secrets/polymarket-fetch.env is still useful for local make configure / direct Ansible runs against the VM (the apps role’s copy task reads {{ playbook_dir }}/../{{ item.env_file }} from wherever Ansible is running). It’s only invisible to GitHub Actions deploys. Keep them in sync if you use both deploy paths.

Verification: after the secret is updated and the deploy completes, SigNoz should show the structured INFO log data-pipeline rotating proxy pool enabled with proxies=N (added in commit 45dfd6c) instead of the POLYMARKET_PROXIES not set WARN, confirming the new env var actually reached the container.

Caught during the 2026-05-30 polymarket-fetch proxy-pool rollout (multiple GH-Actions deploys appeared green; container kept logging POLYMARKET_PROXIES not set until the GH secret itself was updated).

26. Cold Rust `cargo build --release` on e2-small (2 vCPU shared, 2 GB RAM) takes ~35 minutes

The 30-minute workflow timeout copy-pasted from smaller crates will cancel mid-build

A clean cargo build --release of an axum + OpenTelemetry + sqlx + aws-lc-rs Rust workspace on a GCP e2-small VM (2 vCPU shared, 2 GB RAM) takes ~35 minutes wall-clock — including the OS prereqs apt install, rustup install, repo clone, and the build itself. A deploy-<app>.yml workflow copy-pasted from a smaller Rust crate (crypto_shortterm) with timeout-minutes: 30 will be cancelled mid-build with no useful diagnostic. The first babylon deploy attempt hit this exact failure.

Cause: e2-small is a shared-core machine type — vCPU is bursting/shared, not dedicated — and Rust release builds of crates with C/C++ build steps (aws-lc-sys here) saturate both the CPU burst budget and the 2 GB RAM. There is nothing wrong with the build; it just genuinely takes that long on this VM size.

Fix: for any new Rust role on this VM size, set the workflow timeout to at least 90 minutes:

jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 90

The role itself can gate on creates: against the built binary so re-runs short-circuit when no rebuild is needed:

- name: Build babylon-server
  ansible.builtin.command: cargo build --release --bin babylon-server
  args:
    chdir: /opt/babylon/repo
    creates: /opt/babylon/repo/target/release/babylon-server

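So only the first deploy (or one after the binary is deleted, for example when the role removes it on a pin change) actually takes 35 minutes; subsequent runs return in seconds. Note that creates: only checks that the file exists, so a pin change on its own does not trigger a rebuild.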

Cross-applies to any cold-compile workload on shared-core GCP VMs

The same principle applies to other CPU-heavy first-run workloads on e2-small/e2-micro: large npm ci, go build of big workspaces, container image builds. Budget generously on the first run.

Caught during the 2026-06-09 babylon deploy (workflow cancelled at minute 30 with cargo still partway through compiling aws-lc-sys). Bumped timeout-minutes to 90 and the next run completed in ~34 minutes; subsequent re-runs short-circuit in <30 s.

27. `libssl-dev` is NOT needed when the Rust TLS backend is `rustls` + `aws-lc-rs`

Don't reflexively copy the apt prereq list between Rust roles

Two Rust services in this codebase have slightly different OS build prereqs and the difference is not visible from the Cargo.toml at a glance:

| Crate | TLS backend | Native crate doing the C build | apt prereqs |
| --- | --- | --- | --- |
| `crypto_shortterm` (and many others) | OpenSSL via `openssl-sys` | `openssl-sys` | `libssl-dev` + `pkg-config` + `build-essential` |
| `babylon` | `rustls` with the `aws-lc-rs` feature | `aws-lc-sys` | `cmake` + `pkg-config` + `build-essential` (no `libssl-dev`) |

If you copy crypto_shortterm’s prereq list onto a rustls + aws-lc-rs crate, the role still works, but you’re installing libssl-dev and its headers for no reason. The reverse is worse: copy babylon’s list onto an openssl-sys crate and the build fails late, with openssl-sys complaining about a missing openssl/ssl.h.

Fix: before adding apt prereqs for a Rust role, check the crate’s Cargo.toml for which TLS path it takes:

reqwest = { features = ["rustls-tls"] } or rustls + aws-lc-rs → install cmake (for aws-lc-sys), skip libssl-dev.
reqwest = { features = ["default-tls"] } (or any direct openssl / openssl-sys / native-tls dep) → install libssl-dev, skip cmake.
Some crates pull both via different transitive dependencies — install both cmake AND libssl-dev if you see both openssl-sys and aws-lc-sys in cargo tree.

build-essential (gcc, g++, make) and pkg-config are required either way; those are safe to copy across roles unconditionally.

When in doubt, cargo tree

cargo tree -i openssl-sys and cargo tree -i aws-lc-sys from the project root tell you definitively which C build will run during cargo build --release. Run those before writing the Ansible role’s apt task on a new Rust crate.
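The decision rule can be scripted so the apt list is derived from the dependency graph instead of copied between roles. This is a sketch under one assumption: that `cargo tree -i <crate>` exits non-zero when the crate is not in the graph. `rust_tls_prereqs` is a hypothetical helper name; run it from the crate root.

```shell
# Print the apt packages the crate's native TLS build needs, based on
# which -sys crates appear in the dependency graph.
rust_tls_prereqs() {
  # Needed by either TLS path.
  pkgs="build-essential pkg-config"
  if cargo tree -i openssl-sys >/dev/null 2>&1; then
    pkgs="$pkgs libssl-dev"   # OpenSSL headers for openssl-sys
  fi
  if cargo tree -i aws-lc-sys >/dev/null 2>&1; then
    pkgs="$pkgs cmake"        # aws-lc-sys builds its C code with cmake
  fi
  echo "$pkgs"
}
```

The output can be pasted into the role's apt task; a crate that pulls both backends transitively gets both packages.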

Caught during the 2026-06-09 babylon deploy (initial role draft included libssl-dev reflexively from the crypto_shortterm template; cleanup pass dropped it after confirming aws-lc-sys was the only native TLS dep and the build succeeded without libssl-dev).

28. RTK token-optimizer’s stdout decoration can leak into stdin when piping to `gh secret set`

Decorated output piped into a secret value

A GitHub Actions secret stored an emoji-decorated string (🔍 1 in 1F: 📄 terraform.t...) instead of the source value — roughly 2× the source length, starting with multibyte UTF-8 sequences (0xF0 = 4-byte emoji prefix). The compose-rendered env file then contained the decorated string, and the consumer (tailscaled inside the Tailscale sidecar) failed to authenticate.

Cause: Bash commands run through the RTK Claude Code hook by default. RTK’s wrapper of gh (or piped commands feeding into it) reformats output, and the decorated bytes get piped into stdin of gh secret set. The stored secret value is the decoration, not the source.

Fix: Invoke rtk proxy gh secret set … to bypass RTK’s filtering — the proxy subcommand runs the raw command without transforms.

Detection: If a secret value mysteriously doesn’t authenticate but the local source looks correct, dump the first bytes of the rendered env file on the remote: head -c 30 /path/to/env | od -An -c | head -1. Multibyte sequences (octal 360, 237, 224, …) indicate UTF-8 emoji bytes in the value. Confirm by comparing the md5sum of the source value (after tr -d '\n') with the md5sum of the remote value; a mismatch confirms the corruption.
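The same check can be reduced to a single number. The sketch below counts bytes that are not printable ASCII, tab, LF, or CR; it assumes the env file is supposed to hold ASCII-only values (true for typical tokens and auth keys, not for arbitrary text). `non_ascii_bytes` is a hypothetical helper name.

```shell
# Count bytes outside printable ASCII, tab, LF and CR in a file.
# A non-zero count in an env file that should hold plain ASCII tokens
# suggests decoration (for example emoji) leaked into a value.
non_ascii_bytes() {
  # Delete every allowed byte, then count what is left.
  LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}
```

A single 4-byte emoji yields a count of 4, so anything above zero is worth a closer look with od.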

29. GLIBC mismatch when running a host-built Rust binary in a `debian:bookworm-slim` container

glibc 2.36 (bookworm) cannot run a binary linked against glibc 2.39 (noble)

A Rust binary built on the VM (Ubuntu 24.04 noble, glibc 2.39) crashes at startup inside a debian:bookworm-slim container (Debian 12, glibc 2.36) with version GLIBC_2.39 not found (required by /usr/local/bin/<bin>).

Fix: Match the runtime image to the build distro. For Ubuntu 24.04 hosts, use ubuntu:24.04 as the runtime base image. Bind-mounting the binary across libc versions doesn’t work; the only alternatives are static linking (musl target) or matching the libc.

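Heuristic for picking the image: On the VM, ldd --version | head -1 shows the host’s glibc; pick a runtime image whose glibc is the same major.minor or newer. glibc is backward compatible, so a binary linked against an older glibc runs on a newer one, but not the other way round.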
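The comparison itself is a version ordering: a binary linked against a given glibc runs where the runtime glibc is at least that version. A minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils and current BSD/macOS do); `glibc_compatible` is a hypothetical helper name.

```shell
# Succeed when the runtime glibc is at least as new as the build glibc.
# Both arguments are "major.minor" strings such as 2.39.
glibc_compatible() {
  build=$1 runtime=$2
  # sort -V orders version strings; if the build version sorts first
  # (or the two are equal), the runtime is new enough.
  [ "$(printf '%s\n%s\n' "$build" "$runtime" | sort -V | head -n1)" = "$build" ]
}
```

With the versions from this gotcha, `glibc_compatible 2.39 2.36` fails (noble binary on bookworm) while `glibc_compatible 2.36 2.39` succeeds.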

30. Babylon’s funnel-CLI safety check fails inside a container without the `tailscale` binary

Hard refusal to start unless tailscale funnel status is parseable

babylon-server crashes at startup with cannot verify perimeter: tailscale funnel status unavailable (CLI missing or parse failure).

Cause: Babylon’s startup guard shells out to tailscale funnel status to confirm Funnel is off (a hard-coded refusal to start with Funnel enabled — prevents accidental public exposure). When babylon runs in its own container without the tailscale CLI in PATH, the check can’t run, and babylon refuses to start.

Fix: Set BABYLON_ALLOW_FUNNEL=1 in the container’s environment when babylon’s perimeter is enforced by a separate process (e.g. a Tailscale sidecar with tailscale serve, not funnel). Babylon logs WARN BABYLON_ALLOW_FUNNEL=1: skipping funnel check to confirm the bypass.

When NOT to use the bypass: If babylon is the direct tailscaled host process (not separated via sidecar), keep the check — it’s the only thing preventing a misconfigured node from exposing the MCP hub via Funnel to the public internet.

31. Tailscale sidecar `hostname:` in compose shadows the service’s compose-internal DNS

hostname: rewrites the container's /etc/hosts for its OWN name

The Tailscale sidecar’s tailscale serve proxied HTTPS traffic but the upstream backend returned 502. Investigation showed the sidecar resolved the backend’s compose service name to its own IP (172.22.0.4 — the sidecar’s) instead of the backend’s (172.22.0.2).

Cause: In docker-compose, setting hostname: <name> on a service adds that name to the container’s /etc/hosts pointing at the container’s own IP. If hostname matches another compose service’s name, the entry shadows compose’s bridge-network DNS for that name inside the container with the conflicting hostname. The sidecar resolves the target as itself → loopback connection refused → 502.

Fix: Don’t set hostname: on the sidecar at all (let Docker use the container name). Use TS_HOSTNAME env var to control the tailnet identity (<name>.<tailnet>.ts.net) — that’s purely Tailscale-side and doesn’t touch Docker DNS.

Pattern:

babylon-tailscale:
  container_name: babylon-tailscale
  environment:
    - TS_HOSTNAME=babylon

Tailscale node = babylon.<tailnet>.ts.net. Docker DNS for sibling service babylon in the same compose is preserved.

32. Tailscale auth key + explicit `--advertise-tags=tag:X` can fail even when the key was minted with tag:X

Explicit advertise validates against the minting user's tagOwners, not the key

A Tailscale sidecar with TS_EXTRA_ARGS=--advertise-tags=tag:cloud (using a reusable preauth key minted with tag:cloud as the device tag) was rejected with requested tags [tag:cloud] are invalid or not permitted. The same key works for direct tailscale up --authkey=<key> (no --advertise-tags flag) — the node inherits tag:cloud from the key’s mint configuration.

Cause: Tailscale validates explicit --advertise-tags against the user’s tagOwners policy at registration time. The auth key’s embedded permissions don’t count for this check; only the user identity that minted the key matters. If that user isn’t a tagOwner for tag:cloud, explicit advertise fails — even when the same key happily produces tagged devices via implicit inheritance.

Fix: Drop --advertise-tags=tag:cloud from TS_EXTRA_ARGS and let the node inherit tag:cloud from the auth key’s embedded config. The resulting node is still tagged correctly (verifiable: appears under tagged-devices in tailscale status from other tailnet peers).

33. Ansible `-e "key=$VAR"` shlex-parses the value; shell-special bytes break it

failed at splitting arguments for any value containing {, }, unbalanced quotes, etc.

ansible-playbook ... -e "babylon_ts_authkey=$VAR" fails with failed at splitting arguments, either an unbalanced jinja2 block or quotes: babylon_ts_authkey=***.

Cause: Ansible’s extra-vars parser shlex-splits the value when it’s not in JSON/YAML form. Any value containing {, }, unbalanced quotes, or shell-special metacharacters trips the parser.

Fix: Use -e @<file>.json. Build the JSON via jq to guarantee proper escaping for any string value:

VARS=$(mktemp)  # created with mode 0600, so the key is never world-readable
trap 'rm -f "$VARS"' EXIT
jq -n --arg key "$VAR" '{babylon_ts_authkey: $key}' > "$VARS"
ansible-playbook ... -e @"$VARS"

jq emits the value as a properly quoted JSON string, so quotes, backslashes, and braces pass through intact; Ansible reads the JSON-typed value without shlex parsing.

34. SigNoz `create_dashboard` HTML-escapes `&` in titles

Dashboard titled NATS & JetStream is stored as NATS &amp; JetStream

The SigNoz MCP / HTTP create_dashboard mutation HTML-escapes the title field before persisting. The escaped form renders literally in the sidebar (NATS &amp; JetStream), which is not pretty and is a foot-gun if you key dashboards by title in code.

Cause: SigNoz’s API runs the dashboard title through an HTML-escape pass on create. The same pass is NOT applied on update_dashboard, so an after-the-fact title rewrite restores the plain &.

Fix (preferred): Use +, and, ,, or — instead of & in dashboard titles. The 2026-06-18 NATS dashboard was created as polymarket-infra — NATS + JetStream for this reason.

Fix (recovery): If a title already shipped with &amp;, call update_dashboard with the corrected title; it lands as plain &. Already applied to polymarket-infra — Host & Tailscale.

Surfaced 2026-06-18 during the SigNoz dashboard housekeeping pass.

35. SigNoz host metrics are stored under dotted canonical names but legacy widgets reference underscore form

system.memory.usage and system_memory_usage look like different metrics in the UI

SigNoz’s Metric Explorer auto-complete surfaces both dotted and underscore forms for the same metric. Old dashboard widgets reference the underscore form (system_memory_usage); the canonical OTel form is dotted (system.memory.usage). Both work because SigNoz normalizes — but they read like duplicates in the query builder.

Cause: SigNoz normalizes dotted-OTel metric names by stripping/replacing . for backend storage but retains both forms in the lookup index for backwards compatibility with older widgets that were authored before the OTel SDK switched to dotted names.

Operating rule: For new dashboards, use the dotted canonical form (matches OTel emit, matches the contrib docs). Don’t churn-rewrite existing widgets — they work fine on the underscore form and the diff is pure noise.

Surfaced 2026-06-18 during the SigNoz dashboard housekeeping pass.

36. OTel `host_metrics` `*.utilization` metrics are NOT emitted by default

Host dashboard CPU and FS utilization panels were silently empty since Spec 1

The OTel host_metrics receiver’s cpu, memory, filesystem, and paging scrapers ship a curated default set of metrics. The system.*.utilization series (CPU %, FS %, mem %, paging %) are NOT in that default set — they have to be explicitly opted into per scraper.

Cause: OTel-contrib’s hostmetricsreceiver defaults are conservative. Each scraper has a per-metric enable/disable matrix documented in receiver/hostmetricsreceiver/internal/scraper/<scraper>/documentation.md. The *.utilization series default to enabled: false because they’re derived (not raw) and add cardinality. The receiver-level README does not list this, so it is easy to miss when wiring host metrics for the first time.

Fix: In ansible/roles/monitoring/templates/config.yaml.j2, opt in per scraper:

host_metrics:
  scrapers:
    cpu:
      metrics:
        system.cpu.utilization:
          enabled: true
    memory:
      metrics:
        system.memory.utilization:
          enabled: true
    filesystem:
      metrics:
        system.filesystem.utilization:
          enabled: true
    paging:
      metrics:
        system.paging.utilization:
          enabled: true

Operating rule: When wiring a SigNoz host-resources dashboard, audit the queries against the contrib scraper docs (linked from the receiver README) and explicitly enable any non-default metric in the collector config. Don’t trust “the metric exists in the SDK” to mean “the collector is shipping it.” Verify via curl -s http://127.0.0.1:8888/metrics | grep system_cpu_utilization after make configure.

Surfaced 2026-06-18 during the SigNoz dashboard housekeeping pass.

37. NATS `:8222` is JSON, not Prometheus — needs `prometheus-nats-exporter`

OTel prometheus receiver scraping 127.0.0.1:8222 will fail to parse

NATS server’s monitoring port (:8222) returns JSON at /varz, /connz, /subz, /routez, /gatewayz, /leafz, /jsz. It is NOT a Prometheus text endpoint. Pointing the OTel collector’s prometheus receiver at it produces parse errors and no metrics.

Cause: NATS server ships JSON-only monitoring. Prometheus translation is delegated to a separate exporter project: prometheus-nats-exporter.

Fix: Run natsio/prometheus-nats-exporter:0.17.3 as a sidecar in the same compose project as NATS. It connects to NATS over HTTP (e.g. http://nats:8222), polls the enabled JSON endpoints, and exposes Prometheus-format metrics on its own port. Bind it to 127.0.0.1:<port> on the host and have the OTel collector scrape it the same way it scrapes tailscaled on :5252 (see observability-flow).

nats-exporter:
  image: natsio/prometheus-nats-exporter:0.17.3
  command:
    - "-port=7777"
    - "-varz"
    - "-connz"
    - "-subz"
    - "-routez"
    - "-gatewayz"
    - "-leafz"
    - "-jsz=all"
    - "http://nats:8222"
  ports:
    - "127.0.0.1:7777:7777"

Critical: -jsz=all is required to get JetStream stream/consumer-level metrics (jetstream_stream_*, jetstream_consumer_*). Without it the per-stream and per-consumer series are absent and the JetStream dashboard panels stay empty.

Do not expose the exporter publicly. -connz leaks every NATS client IP, -subz leaks subject names. Loopback-only on the host with localhost-scraping OTel is the safe pattern.
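Before pointing a prometheus receiver at any endpoint, it is cheap to check which format the endpoint actually returns. The sketch below classifies a response body read from stdin; `classify_metrics_body` is a hypothetical helper, and the heuristic (look for `# HELP` / `# TYPE` lines, else a leading `{` or `[`) is an assumption that covers the NATS and exporter cases described here, not a full format validator.

```shell
# Classify a response body on stdin as "prometheus", "json" or "unknown".
classify_metrics_body() {
  body=$(cat)
  if printf '%s\n' "$body" | grep -Eq '^# (HELP|TYPE) '; then
    # Prometheus text exposition carries metadata comment lines.
    echo prometheus
  elif printf '%s' "$body" | tr -d ' \t\n\r' | cut -c1 | grep -q '[{[]'; then
    # First non-whitespace character opens a JSON object or array.
    echo json
  else
    echo unknown
  fi
}
```

Piping `curl -s http://127.0.0.1:8222/varz` into it should report `json`, while the exporter on `http://127.0.0.1:7777/metrics` should report `prometheus`.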

Surfaced 2026-06-18 during the SigNoz dashboard housekeeping pass when wiring the new NATS + JetStream dashboard.

38. In-process counters hydrated from the DB only re-read on boot — TRUNCATE alone won’t reset them

DB mutation without a container bounce is a no-op for in-process safety gates

When a process hydrates a safety-critical counter from the DB once at startup and then uses an in-memory copy for the rest of its lifetime, mutating the DB row count (TRUNCATE, DELETE) does NOT roll the in-memory counter back. The process continues to fire whatever guard the in-memory counter triggers — including silent “paused” states that short-circuit before any output side-effect.

Cause: the hydration is one-shot at boot — a design choice that gives the process restart-safety (e.g. preventing re-fire of already-emitted signals on restart per BLOCKER-3). The same one-shot design means there’s no DB-watch mechanism to roll the counter back when rows disappear.

Symptom: the process keeps logging the cap-reached message at ~1Hz after you’ve already TRUNCATEd the table. Looks like the TRUNCATE didn’t land. It did — the counter just doesn’t know about it.

Fix: after any DB mutation that’s supposed to clear an in-process counter, always restart the container/process. For the crypto emitter:

sudo systemctl restart apps-shortterm-crypto-algo-emit.service

Confirm the reset via the boot log:

INFO emit: guardrail hydrated from DB hydrated=0
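To make that confirmation scriptable, the hydrated value can be pulled out of the log text. A minimal sketch: `last_hydrated` is a hypothetical helper, and reading the logs with journalctl is an assumption based on the unit being restarted through systemctl; substitute whatever log source the service actually writes to.

```shell
# Extract the most recent hydrated=N value from log text on stdin.
# Prints nothing when no hydration line is present.
last_hydrated() {
  grep -o 'guardrail hydrated from DB hydrated=[0-9]*' \
    | tail -n1 \
    | sed 's/.*hydrated=//'
}
```

For example, `journalctl -u apps-shortterm-crypto-algo-emit.service -n 200 --no-pager | last_hydrated` should print `0` after a TRUNCATE followed by a restart.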

Surfaces particularly during Pattern A → Pattern B cutovers when the new container inherits residue rows that the original Pattern A schedule had already accumulated up against the cap. Captured 2026-06-25 — apps-shortterm-crypto-algo-emit had been spinning for ~22 hours on EMIT_MAX_FIRES reached, signal emission paused max_fires=50 after the cutover because 50 pre-cutover rows in crypto_shortterm.emitted_signals were carried over. See crypto-emit-max-fires-cap-clear-2026-06-25 for the full session.

When designing in-process counters

If you’re adding a new safety-counter that hydrates from a DB on boot, document the “must restart to reset” semantics next to the env var name. Operators who think in DB-state terms will assume TRUNCATE is sufficient — it’s not.

gcp-vm-provisioning-design
spec-1-deployment-complete
levandor-infra
babylon-deploy-notes — where #26 and #27 surfaced, plus the Litestream + GCE-native ADC and Tailscale serve patterns from the same deploy
crypto-emit-max-fires-cap-clear-2026-06-25 — where #38 surfaced (Pattern A → B cutover residue + in-process hydration)
