Reusable gotchas discovered during the validation pass on gcp-vm-provisioning-design (Spec 1), expanded with additional traps caught during the live deployment of ops-vm. Each is relevant to any future GCP / Terraform / Ansible / OTel / RTK work, not just this project — check this list before writing similar infra code.

For Agents

Eleven independent traps, each with the symptom, the cause, and the fix. They are framework/provider/tool behaviors, not project-specific bugs. The fixes are baked into the Spec 1 design and the deployed ops-vm. Gotchas 1–6 came from the design-validation pass; 7–11 from the live deployment.

1. google_project_service disables APIs project-wide on destroy

disable_on_destroy defaults to true

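google_project_service defaults to disable_on_destroy = true in Google provider releases before 6.0 (6.0 and later default to false, so check the version you pin). With the setting at true, running terraform destroy disables the API project-wide, affecting every other resource and tool in the project, not just what Terraform manages.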

Fix: set disable_on_destroy = false on every google_project_service resource.
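A minimal sketch of the fix, assuming the APIs are enabled through a single for_each resource. The variable name project_id and the list of services are illustrative and not taken from Spec 1:

```hcl
# Hypothetical input; supply the project ID however your configuration does.
variable "project_id" {
  type = string
}

resource "google_project_service" "enabled" {
  # Illustrative service list; replace with the APIs your project needs.
  for_each = toset([
    "compute.googleapis.com",
    "iam.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  # Leave the API enabled when this resource is destroyed, so other
  # resources and tools in the project keep working.
  disable_on_destroy = false
}
```

Setting the flag once inside a for_each resource covers every API in the list, which avoids forgetting it on one of several hand-written resources.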

2. Terraform’s file() does not expand ~

file() takes a literal path

Terraform’s file() function does not expand ~ to the home directory. file("~/.ssh/id_ed25519.pub") fails with a no-such-file error.

Fix: wrap the path in pathexpand(): file(pathexpand(var.ssh_public_key_path)). Apply this anywhere a user-supplied path may contain ~.
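As a sketch, the wrapped call might sit in a local value so the key is read once and reused. The default path and the local name ssh_public_key are assumptions for illustration:

```hcl
variable "ssh_public_key_path" {
  type        = string
  description = "Path to the SSH public key; may start with ~."
  default     = "~/.ssh/id_ed25519.pub" # illustrative default
}

locals {
  # pathexpand() resolves a leading ~ to the home directory before
  # file() reads the path. trimspace() drops the trailing newline
  # that most key files end with.
  ssh_public_key = trimspace(file(pathexpand(var.ssh_public_key_path)))
}
```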

3. ansible.cfg is only auto-loaded from the CWD

ansible.cfg in a subdirectory is ignored

Ansible does not search subdirectories for ansible.cfg. It uses the first file it finds among the path in $ANSIBLE_CONFIG, ansible.cfg in the current working directory, the per-user ~/.ansible.cfg, and the system-wide /etc/ansible/ansible.cfg. An ansible.cfg inside an ansible/ subdirectory is therefore silently ignored when ansible-playbook is invoked from the repo root, and all its settings (inventory path, roles_path, host_key_checking) are quietly lost.

Fix: invoke Ansible with ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook …, or run from inside the ansible/ directory.

4. GCP OS Login silently overrides ssh-keys metadata

OS Login wins over instance metadata SSH keys

If GCP OS Login is enabled — via project metadata or an org policy (constraints/compute.requireOsLogin) — it silently overrides the ssh-keys instance metadata. Metadata-key-based SSH then fails with no obvious error.

Fix: set enable-oslogin = "FALSE" in the instance metadata so metadata-key SSH is honored. Caveat: if OS Login is force-enabled by org policy, the instance setting is overridden and SSH must move to OS Login — a hard limitation.
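One way to express this in Terraform is a metadata map that is passed to the metadata argument of google_compute_instance. The variable names ssh_user and ssh_public_key and the local name vm_metadata are hypothetical:

```hcl
# Hypothetical inputs: the Unix login name and the public key contents.
variable "ssh_user" {
  type = string
}

variable "ssh_public_key" {
  type = string
}

locals {
  vm_metadata = {
    # Metadata values are strings; "FALSE" opts this instance out of
    # OS Login so the ssh-keys entry below is honored. An org policy
    # that enforces OS Login still takes precedence.
    enable-oslogin = "FALSE"

    # Format expected by GCP: "<username>:<public key>".
    ssh-keys = "${var.ssh_user}:${var.ssh_public_key}"
  }
}
```

Inside the instance resource this would then read metadata = local.vm_metadata.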

5. The GCP default VPC ships an open SSH firewall rule

default-allow-ssh is open to 0.0.0.0/0

The GCP default VPC network ships with a default-allow-ssh firewall rule open to 0.0.0.0/0 — every VM placed in the default network is exposed to the public internet on tcp:22.

Fix: create a dedicated custom-mode VPC (google_compute_network with auto_create_subnetworks = false) so the VM does not inherit default-allow-ssh. Add only the inbound rules you actually need (for a Tailscale-only design, none).
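A rough sketch of such a network, assuming one regional subnet and no firewall rules at all. The resource names, region variable, and CIDR range are placeholders rather than values from the deployed ops-vm:

```hcl
# Hypothetical input; pick the region your VM runs in.
variable "region" {
  type = string
}

resource "google_compute_network" "vm" {
  name = "ops-vpc" # placeholder name

  # Custom mode: no automatic subnets are created, and none of the
  # default network's pre-populated rules (such as default-allow-ssh)
  # exist here.
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "vm" {
  name          = "ops-subnet"   # placeholder name
  region        = var.region
  ip_cidr_range = "10.10.0.0/24" # placeholder range
  network       = google_compute_network.vm.id
}

# No google_compute_firewall resources are declared on purpose: with
# no allow rules, inbound traffic is denied by the implied ingress rule.
```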

6. google_compute_instance can race API enablement

Instance created before the Compute API is ready

Terraform may create a google_compute_instance before the google_project_service resource has finished enabling the Compute API, causing an intermittent apply failure on fresh projects.

Fix: add an explicit depends_on from the instance (or the VM module) to the google_project_service resources, so the API is guaranteed enabled first.
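The dependency could look like the following sketch. The instance name, machine type, zone, image, and the subnetwork variable are illustrative values, and the project is assumed to come from the provider configuration:

```hcl
# Hypothetical input: self link or name of the subnet the VM joins.
variable "subnetwork" {
  type = string
}

resource "google_project_service" "compute" {
  service            = "compute.googleapis.com"
  disable_on_destroy = false
}

resource "google_compute_instance" "vm" {
  name         = "ops-vm"         # illustrative
  machine_type = "e2-small"       # illustrative
  zone         = "us-central1-a"  # illustrative

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12" # illustrative
    }
  }

  network_interface {
    subnetwork = var.subnetwork
  }

  # Nothing in the arguments above references the API resource, so
  # Terraform cannot infer the ordering; state it explicitly.
  depends_on = [google_project_service.compute]
}
```

When the instance lives in a module, the same depends_on line can be placed in the module block instead.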

7. Tailscale SSH ACL autogroup:self does NOT match tagged devices

autogroup:self doesn’t match tagged devices

A Tailscale SSH ACL rule with dst: ["autogroup:self"] looks like the obvious way to say “my own devices,” but it does not match a node that was joined with a tagged auth key. The VM accepts no SSH sessions and tailscale ssh returns a permission-denied with no obvious cause.

Cause: autogroup:self is scoped to devices owned by the connecting user’s identity. Tagged devices (joined via auth keys carrying tag:<name>) have no user owner — the tag IS their identity — so autogroup:self excludes them by construction.

Fix: for a VM joined with a tagged auth key, target the tag explicitly in the ACL — dst: ["tag:<name>"] (e.g. tag:cloud). Keep src as autogroup:member (or a narrower group) and list the on-VM Unix users in users.
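To show the shape of such a rule in the same language as the rest of the infra code, the sketch below builds it as a Terraform local and renders it to JSON for the ssh section of the tailnet policy file. Rendering the rule from Terraform is an assumption made for illustration, and the Unix user ubuntu is a placeholder:

```hcl
locals {
  tailscale_ssh_rules = [
    {
      action = "accept"
      src    = ["autogroup:member"]
      # Target the tag directly; autogroup:self would not match a
      # node joined with a tagged auth key.
      dst = ["tag:cloud"]
      # Placeholder: the Unix account(s) allowed on the VM.
      users = ["ubuntu"]
    }
  ]
}

# Renders {"ssh":[...]} so the rule can be copied into the policy file.
output "tailscale_ssh_policy_fragment" {
  value = jsonencode({ ssh = local.tailscale_ssh_rules })
}
```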

8. Ansible gather_facts runs before pre_tasks

Default fact-gathering races the VM boot

Ansible’s default fact-gathering step runs before pre_tasks execute. If the play targets a freshly provisioned VM that is still booting and joining the tailnet, the implicit setup happens before any wait_for_connection pre-task gets a chance to delay it — the run fails on the very first connect with UNREACHABLE and no useful output. The pre_tasks you added to absorb boot time never run.

Fix: set gather_facts: false at the play level, then explicitly run ansible.builtin.setup as a normal task after wait_for_connection succeeds. The play definition becomes:

- hosts: vm
  gather_facts: false
  become: true
  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 300
    - name: Gather facts now that the host is reachable
      ansible.builtin.setup:
  roles: [...]

9. OTel collector :8888/metrics is the canonical export-success signal

systemctl active does not mean metrics are leaving the host

systemctl is-active otelcol-contrib only proves the process is running. The collector can be happily up while the exporter is failing (auth wrong, endpoint unreachable, TLS error) — you’ll see nothing in the SigNoz dashboard and wonder why. Logs occasionally mention errors but are noisy and easy to misread.

Fix: the collector’s own internal telemetry is the source of truth. otelcol-contrib exposes Prometheus-format metrics on :8888/metrics by default. The two counters to watch:

  • otelcol_exporter_sent_metric_points — must be >0 and increasing over time. This confirms the exporter is succeeding end-to-end.
  • otelcol_exporter_send_failed_metric_points — must be 0 or absent. Any non-zero value means the exporter is dropping data; investigate the configured endpoint, credentials, and TLS.

Quick check from the VM:

curl -s localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)_metric_points'

Wire this into smoke checks instead of relying on systemctl is-active alone.

10. OTel Collector v0.152 renamed common receivers and exporters

hostmetrics → host_metrics, otlp → otlp_grpc

OpenTelemetry Collector v0.152 renamed several long-standing component IDs:

  • The host-metrics receiver: hostmetrics → host_metrics.
  • The OTLP gRPC exporter: otlp → otlp_grpc (the new explicit name disambiguates it from otlphttp).

Older config files continue to work — the collector accepts the old names with a deprecation warning at startup — but the warning is easy to miss and the names will be removed eventually.

Fix: update the collector config to the new names whenever you touch it; consult the collector’s --help and the release notes for the version you’re deploying before copying a snippet from older tutorials. Pin the collector version in the Ansible role so behavior is reproducible.

11. rtk proxy bypasses RTK’s token-saving output filter

RTK rewrites commands and filters output — sometimes you need the raw thing

RTK (Rust Token Killer) transparently rewrites tool invocations (git status → rtk git status, etc.) and filters their output to save tokens. That filtering is great for chatty dev commands but actively harmful when a tool’s raw output is load-bearing: Terraform’s plan / apply diff, an Ansible run’s task-by-task report, or a raw JSON body from curl can be truncated or reformatted in ways that hide what you need to see.

Fix: prefix any such command with rtk proxy … to execute it unfiltered. Examples:

rtk proxy terraform plan -out=tfplan
rtk proxy terraform apply tfplan
rtk proxy ansible-playbook -i ansible/inventory/hosts.ini ansible/site.yml
rtk proxy curl -s https://api.example.com/foo | jq .

Keep the rewrite hook enabled for everyday commands; reach for rtk proxy whenever the exact raw output matters.