Reusable gotchas discovered during the validation pass on gcp-vm-provisioning-design (Spec 1), expanded with additional traps caught during the live deployment of ops-vm. Each is relevant to any future GCP / Terraform / Ansible / OTel / RTK work, not just this project — check this list before writing similar infra code.
For Agents
Eleven independent traps, each with the symptom, the cause, and the fix. They are framework/provider/tool behaviors, not project-specific bugs. The fixes are baked into the Spec 1 design and the deployed ops-vm. Gotchas 1–6 came from the design-validation pass; 7–11 from the live deployment.
1. google_project_service disables APIs project-wide on destroy
disable_on_destroy defaults to true
google_project_service defaults to disable_on_destroy = true. Running terraform destroy will then disable the API project-wide, affecting every other resource and tool in the project, not just what Terraform manages.
Fix: set disable_on_destroy = false on every google_project_service resource.
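As a minimal sketch of this fix, the resource below enables the Compute API and leaves it enabled on destroy. The variable name project_id and the resource label compute are illustrative assumptions, not names taken from the Spec 1 design.

```hcl
# Hypothetical variable; substitute however your configuration supplies the project.
variable "project_id" {
  type = string
}

resource "google_project_service" "compute" {
  project = var.project_id
  service = "compute.googleapis.com"

  # Keep the API enabled when Terraform destroys this resource, so other
  # tools and resources in the same project are not affected.
  disable_on_destroy = false
}
```

If several APIs are enabled with for_each, the same argument goes on that single resource block and applies to every instance.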
2. Terraform’s file() does not expand ~
file() takes a literal path
Terraform’s file() function does not expand ~ to the home directory. file("~/.ssh/id_ed25519.pub") fails with a no-such-file error.
Fix: wrap the path in pathexpand() — file(pathexpand(var.ssh_public_key_path)). Apply this anywhere a user-supplied path may contain ~.
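A small sketch of that pattern follows. The variable ssh_public_key_path is the one named in the fix; the default value and the local name ssh_public_key are assumptions for illustration.

```hcl
variable "ssh_public_key_path" {
  type        = string
  description = "Path to the SSH public key; may start with ~."
  default     = "~/.ssh/id_ed25519.pub" # assumed default, adjust as needed
}

locals {
  # pathexpand() resolves a leading ~ to the home directory before file()
  # reads the key; trimspace() drops the trailing newline.
  ssh_public_key = trimspace(file(pathexpand(var.ssh_public_key_path)))
}
```

pathexpand() leaves paths without a leading ~ unchanged, so it is safe to apply to every user-supplied path.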
3. ansible.cfg is only auto-loaded from the CWD
ansible.cfg in a subdirectory is ignored
Ansible looks for ansible.cfg in a fixed order: the path in $ANSIBLE_CONFIG, the current working directory, ~/.ansible.cfg, then /etc/ansible/ansible.cfg. It never searches subdirectories, so an ansible.cfg inside an ansible/ subdirectory is silently ignored when ansible-playbook is invoked from the repo root, and all its settings (inventory path, roles_path, host_key_checking) are quietly lost.
Fix: invoke Ansible with ANSIBLE_CONFIG=ansible/ansible.cfg ansible-playbook …, or run from inside the ansible/ directory.
4. GCP OS Login silently overrides ssh-keys metadata
OS Login wins over instance metadata SSH keys
If GCP OS Login is enabled, via project metadata or an org policy (constraints/compute.requireOsLogin), it silently overrides the ssh-keys instance metadata. Metadata-key-based SSH then fails with no obvious error.
Fix: set enable-oslogin = "FALSE" in the instance metadata so metadata-key SSH is honored. Caveat: if OS Login is force-enabled by org policy, the instance setting is overridden and SSH must move to OS Login — a hard limitation.
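The metadata for this fix might look like the sketch below. It builds the map in a local so the fragment stands alone; in practice the map is passed to the metadata argument of google_compute_instance. The variable ssh_user, its default, and the local name instance_metadata are hypothetical.

```hcl
# Hypothetical variable: the Unix account the key should be installed for.
variable "ssh_user" {
  type    = string
  default = "ops"
}

variable "ssh_public_key_path" {
  type    = string
  default = "~/.ssh/id_ed25519.pub" # assumed default
}

locals {
  # Pass this map to the `metadata` argument of google_compute_instance.
  instance_metadata = {
    # Opt this instance out of OS Login so the ssh-keys entry is honored.
    # An org policy that enforces OS Login still takes precedence.
    enable-oslogin = "FALSE"

    # ssh-keys entries use the form "<username>:<public key>".
    ssh-keys = "${var.ssh_user}:${trimspace(file(pathexpand(var.ssh_public_key_path)))}"
  }
}
```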
5. The GCP default VPC ships an open SSH firewall rule
default-allow-ssh is open to 0.0.0.0/0
The GCP default VPC network ships with a default-allow-ssh firewall rule open to 0.0.0.0/0, so every VM placed in the default network is exposed to the public internet on tcp:22.
Fix: create a dedicated custom-mode VPC (google_compute_network with auto_create_subnetworks = false) so the VM does not inherit default-allow-ssh. Add only the inbound rules you actually need (for a Tailscale-only design, none).
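A minimal sketch of such a network is shown below. The resource labels, names, region, and CIDR range are placeholders chosen for illustration, not values from the Spec 1 design.

```hcl
resource "google_compute_network" "ops" {
  name = "ops-vpc" # placeholder name

  # Custom mode: no automatic subnets, and none of the default network's
  # pre-populated firewall rules such as default-allow-ssh.
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "ops" {
  name          = "ops-subnet"   # placeholder name
  region        = "us-central1"  # placeholder region
  network       = google_compute_network.ops.id
  ip_cidr_range = "10.10.0.0/24" # placeholder range
}

# No google_compute_firewall resources are declared here. A new VPC denies
# all ingress by default, which suits a Tailscale-only design.
```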
6. google_compute_instance can race API enablement
Instance created before the Compute API is ready
Terraform may create a google_compute_instance before the google_project_service resource has finished enabling the Compute API, causing an intermittent apply failure on fresh projects.
Fix: add an explicit depends_on from the instance (or the VM module) to the google_project_service resources, so the API is guaranteed enabled first.
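The ordering can be sketched as follows. The machine type, zone, image, and the variable subnetwork_self_link are placeholders assumed for the example; only the depends_on line is the point.

```hcl
# Hypothetical input: the subnetwork the VM attaches to.
variable "subnetwork_self_link" {
  type = string
}

resource "google_project_service" "compute" {
  service            = "compute.googleapis.com"
  disable_on_destroy = false
}

resource "google_compute_instance" "vm" {
  name         = "ops-vm"
  machine_type = "e2-small"      # placeholder
  zone         = "us-central1-a" # placeholder

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12" # placeholder image family
    }
  }

  network_interface {
    subnetwork = var.subnetwork_self_link
  }

  # Nothing in the instance references the API resource, so Terraform cannot
  # infer the ordering. Declare it explicitly.
  depends_on = [google_project_service.compute]
}
```

When the VM lives in a module, the same depends_on list can be placed on the module block instead.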
7. Tailscale SSH ACL autogroup:self does NOT match tagged devices
autogroup:self doesn’t match tagged devices
A Tailscale SSH ACL rule with dst: ["autogroup:self"] looks like the obvious way to say “my own devices,” but it does not match a node that was joined with a tagged auth key. The VM accepts no SSH sessions and tailscale ssh returns a permission-denied error with no obvious cause.
Cause: autogroup:self is scoped to devices owned by the connecting user’s identity. Tagged devices (joined via auth keys carrying tag:<name>) have no user owner, because the tag is their identity, so autogroup:self excludes them by construction.
Fix: for a VM joined with a tagged auth key, target the tag explicitly in the ACL — dst: ["tag:<name>"] (e.g. tag:cloud). Keep src as autogroup:member (or a narrower group) and list the on-VM Unix users in users.
8. Ansible gather_facts runs before pre_tasks
Default fact-gathering races the VM boot
Ansible’s default fact-gathering step runs before pre_tasks execute. If the play targets a freshly provisioned VM that is still booting and joining the tailnet, the implicit setup happens before any wait_for_connection pre-task gets a chance to delay it: the run fails on the very first connect with UNREACHABLE and no useful output. The pre_tasks you added to absorb boot time never run.
Fix: set gather_facts: false at the play level, then explicitly run ansible.builtin.setup as a normal task after wait_for_connection succeeds. The play definition becomes:
- hosts: vm
  gather_facts: false
  become: true
  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 300
    - name: Gather facts now that the host is reachable
      ansible.builtin.setup:
  roles: [...]

9. OTel collector :8888/metrics is the canonical export-success signal
systemctl active does not mean metrics are leaving the host
systemctl is-active otelcol-contrib only proves the process is running. The collector can be happily up while the exporter is failing (auth wrong, endpoint unreachable, TLS error); you’ll see nothing in the SigNoz dashboard and wonder why. Logs occasionally mention errors but are noisy and easy to misread.
Fix: the collector’s own internal telemetry is the source of truth. otelcol-contrib exposes Prometheus-format metrics on :8888/metrics by default. The two counters to watch:
- otelcol_exporter_sent_metric_points: must be >0 and increasing over time. This confirms the exporter is succeeding end-to-end.
- otelcol_exporter_send_failed_metric_points: must be 0 or absent. Any non-zero value means the exporter is dropping data; investigate the configured endpoint, credentials, and TLS.
Quick check from the VM:
curl -s localhost:8888/metrics | grep -E 'otelcol_exporter_(sent|send_failed)_metric_points'

Wire this into smoke checks instead of relying on systemctl is-active alone.
10. OTel Collector v0.152 renamed common receivers and exporters
hostmetrics → host_metrics, otlp → otlp_grpc
OpenTelemetry Collector v0.152 renamed several long-standing component IDs:
- The host-metrics receiver: hostmetrics → host_metrics.
- The OTLP gRPC exporter: otlp → otlp_grpc (the new explicit name disambiguates it from otlphttp).

Older config files continue to work, because the collector accepts the old names with a deprecation warning at startup, but the warning is easy to miss and the names will be removed eventually.
Fix: update the collector config to the new names whenever you touch it; consult the collector’s --help and the release notes for the version you’re deploying before copying a snippet from older tutorials. Pin the collector version in the Ansible role so behavior is reproducible.
11. rtk proxy bypasses RTK’s token-saving output filter
RTK rewrites commands and filters output — sometimes you need the raw thing
RTK (Rust Token Killer) transparently rewrites tool invocations (git status → rtk git status, etc.) and filters their output to save tokens. That filtering is great for chatty dev commands but actively harmful when a tool’s raw output is load-bearing: Terraform’s plan/apply diff, an Ansible run’s task-by-task report, or a raw JSON body from curl can be truncated or reformatted in ways that hide what you need to see.
Fix: prefix any such command with rtk proxy … to execute it unfiltered. Examples:
rtk proxy terraform plan -out=tfplan
rtk proxy terraform apply tfplan
rtk proxy ansible-playbook -i ansible/inventory/hosts.ini ansible/site.yml
rtk proxy curl -s https://api.example.com/foo | jq .

Keep the rewrite hook enabled for everyday commands; reach for rtk proxy whenever the exact raw output matters.