P0 — Service Worker NavigationRoute Locks All Users Out of CF Access

Production outage on wrcm.levandor.io: every user (admin web + iOS PWA) saw a full-screen Authentication Error: No CF Access token available, which persisted across browser refresh. Root cause: the custom service worker introduced for the push-notifications work intercepted every navigation and served cached /index.html instead of letting Cloudflare Access run its cookie-refresh redirect dance. Fixed in cd8394f by removing the Workbox NavigationRoute, adding immediate SW takeover, and adding an in-app “Reset session and reload” recovery button.

Severity P0

All users locked out of the admin CRM and iOS PWA. Refresh did not recover (service workers persist across refresh). Recovery required DevTools (unregister SW + clear caches) until the fix shipped.

For Agents

If you are ever asked to add a custom service worker (Workbox, vite-plugin-pwa injectManifest, etc.) to this codebase, read this note first. ZTNA + navigation interception is a permanent trap; the prior research at iOS PWA Gotchas correctly flagged “SW must NOT cache user-scoped API responses” but missed NavigationRoute entirely. The push-notifications spec must explicitly forbid NavigationRoute (or whitelist CF Access endpoints) before being implemented.

Symptoms

  • Full-screen error: Authentication Error: No CF Access token available
  • Every user affected — admin web AND iOS PWA standalone
  • Refresh did NOT clear the error
  • New tab to wrcm.levandor.io reproduced immediately
  • Onset was gradual over ~24h post-deploy, not instantaneous (see “Why it took ~24h to manifest”)

Timeline

| When | Event |
| --- | --- |
| ~24h before report | Commit ac5f429 build(pwa): switch to injectManifest with custom sw.ts (precache parity) deployed |
| Throughout the day | Users with fresh CF_Authorization cookies kept working; only those whose cookies naturally expired hit the lockout |
| Report time | “the website is broken now too it shows the same error there too even after refresh” / “our previous impl fucked something up” |
| Investigation | Initial misdirection toward recent centralized notifications work (red herring) |
| Diagnosis | web/src/sw.ts NavigationRoute(createHandlerBoundToURL('/index.html')) identified |
| Fix shipped | Commit cd8394f on master |

Root Cause

The custom service worker introduced in commit ac5f429 (build(pwa): switch to injectManifest with custom sw.ts (precache parity)) registered a Workbox NavigationRoute bound to /index.html:

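// web/src/sw.ts (offending code)
import { createHandlerBoundToURL } from 'workbox-precaching'
import { NavigationRoute, registerRoute } from 'workbox-routing'
// Answers every navigation request with the precached /index.html
registerRoute(new NavigationRoute(createHandlerBoundToURL('/index.html')))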

This intercepts every page navigation — initial loads, refreshes, new-tab opens — and serves cached /index.html from the Workbox precache instead of going to the network.

Why this is catastrophic behind Cloudflare Access ZTNA

Cloudflare Access expires its CF_Authorization cookie periodically. When the cookie is stale, CF Access responds to a navigation request with a 302 to the IdP login page so the browser can re-authenticate and receive a fresh cookie. This is the entire mechanism by which long-lived sessions stay alive on a ZTNA-protected origin.

With the SW intercepting navigations:

  1. Browser issues navigation request
  2. SW returns cached /index.html immediately — request never leaves the device
  3. CF Access never sees the request → cannot issue the 302 → cookie is never refreshed
  4. SPA boots, cf-access.ts:79 calls getCfAccessToken(), finds no valid CF_Authorization in document.cookie
  5. SPA throws Authentication Error: No CF Access token available
  6. User refreshes → step 2 again, forever
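
The loop above can be written down as a toy model. This is only a sketch to make the reasoning concrete: `simulateNavigation`, `simulateRefreshes`, and the outcome names are hypothetical and do not correspond to real Workbox or Cloudflare Access code. It assumes that a user who reaches the login redirect re-authenticates successfully.

```typescript
// Toy model of the lockout loop. All names are hypothetical illustrations.
type NavigationOutcome = 'spa-loads' | 'redirect-to-login' | 'auth-error';

interface NavigationScenario {
  swInterceptsNavigation: boolean; // true when a NavigationRoute is registered
  cookieIsFresh: boolean;          // true while CF_Authorization is still valid
}

function simulateNavigation(scenario: NavigationScenario): NavigationOutcome {
  if (scenario.swInterceptsNavigation) {
    // Steps 2-3: the SW answers from the precache, so CF Access never sees the request.
    // Steps 4-5: the SPA boots and either finds a valid cookie or throws.
    return scenario.cookieIsFresh ? 'spa-loads' : 'auth-error';
  }
  // No interception: the request reaches CF Access, which either lets it
  // through or redirects the browser to the login page.
  return scenario.cookieIsFresh ? 'spa-loads' : 'redirect-to-login';
}

// A refresh repeats the same navigation. The cookie state only changes when
// the request reaches the network and CF Access gets to re-authenticate.
function simulateRefreshes(scenario: NavigationScenario, refreshCount: number): NavigationOutcome[] {
  const outcomes: NavigationOutcome[] = [];
  let cookieIsFresh = scenario.cookieIsFresh;
  for (let i = 0; i < refreshCount; i++) {
    const outcome = simulateNavigation({
      swInterceptsNavigation: scenario.swInterceptsNavigation,
      cookieIsFresh,
    });
    outcomes.push(outcome);
    if (outcome === 'redirect-to-login') {
      cookieIsFresh = true; // assumption: the user re-authenticates successfully
    }
  }
  return outcomes;
}
```

With interception and a stale cookie, every refresh yields 'auth-error'. Without interception, the first navigation is redirected to login and the following ones load the SPA.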

Why refresh did not recover

Service workers persist across browser refreshes by design (that’s the entire point of offline-capable PWAs). Once the SW is registered, only unregistering the SW + clearing caches (or shipping a new SW that takes over) breaks the loop.

Why it took ~24h to manifest

Users whose CF_Authorization cookie was fresh at deploy time kept working — the SPA’s call to getCfAccessToken() succeeded because the cookie was still valid. The lockout only kicked in for each user the first time their cookie naturally expired after the deploy. As cookies expired throughout the day, more users hit the wall. This staged onset misleads incident triage — symptoms look like “something is gradually getting worse” rather than “we shipped a bug.”
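
The staged onset can be approximated with a back-of-the-envelope model. The sketch below is hypothetical and rests on two assumptions that the incident data does not confirm: cookie issue times are spread uniformly over one session lifetime, and every user navigates again after their cookie expires. The session lifetime is a parameter because the actual CF Access session duration for this application is not stated here.

```typescript
// Hypothetical model: fraction of users locked out N hours after the deploy.
// Assumes uniformly distributed cookie issue times; not measured data.
function lockedOutFraction(hoursSinceDeploy: number, sessionLifetimeHours: number): number {
  if (sessionLifetimeHours <= 0) {
    throw new RangeError('sessionLifetimeHours must be positive');
  }
  if (hoursSinceDeploy <= 0) {
    return 0; // nobody's cookie has expired under the broken SW yet
  }
  // Each elapsed hour expires another 1/sessionLifetimeHours of the cookies.
  return Math.min(1, hoursSinceDeploy / sessionLifetimeHours);
}
```

Under these assumptions, with an illustrative 24-hour lifetime, a quarter of users would be locked out after 6 hours and everyone after 24. That is the kind of slow ramp that reads as gradual degradation rather than a bad deploy.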

Misleading framing during triage

The user (understandably) suspected the recent centralized notifications work and migrations. Those were a red herring: the notifications-system migrations did not touch auth. The real culprit was the earlier, seemingly innocuous push-notifications SW commit (ac5f429). Lesson: when an auth outage follows a deploy window with multiple changes, audit every change in the window, especially anything touching service workers, edge config, or middleware, not just the most recent or most suspicious-looking one.

Fix

Commit cd8394f on master. Three changes:

1. Remove NavigationRoute from web/src/sw.ts

Stop intercepting navigations. Let every navigation hit the network so CF Access can run its 302/cookie-refresh dance unimpeded. Workbox precaching for static assets is fine and was kept — only NavigationRoute is the trap.
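
For any hand-written fetch handling added to the SW later, the same rule can be expressed as a small predicate. This is a minimal sketch, not the code in cd8394f: `RequestLike`, `FetchDecision`, and `decideFetchHandling` are hypothetical names, and the interface is declared locally so the example does not depend on DOM typings. It relies on the Fetch API convention that top-level page loads have `mode === 'navigate'`.

```typescript
// Minimal shape of the Fetch API Request fields used below (declared locally).
interface RequestLike {
  mode: string;        // 'navigate' for page loads
  destination: string; // 'document', 'script', 'style', 'image', ...
}

type FetchDecision = 'pass-through' | 'may-serve-from-cache';

// Hypothetical helper: navigations always go to the network so that
// Cloudflare Access can answer with its login redirect when the cookie is stale.
function decideFetchHandling(request: RequestLike): FetchDecision {
  const isNavigation = request.mode === 'navigate' || request.destination === 'document';
  return isNavigation ? 'pass-through' : 'may-serve-from-cache';
}
```

In a real fetch listener, 'pass-through' would mean returning without calling `event.respondWith`, which leaves the request to the browser's normal network handling.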

2. Immediate SW takeover

// web/src/sw.ts
self.addEventListener('install', () => self.skipWaiting())
self.addEventListener('activate', (event) => event.waitUntil(self.clients.claim()))

Without these, the fixed SW would install but stay in the waiting state until every tab controlled by the broken SW was closed. With skipWaiting + clients.claim, the new SW activates as soon as it installs (the browser checks for an updated SW on navigation) and takes control of already-open clients.

3. In-app recovery: “Reset session and reload”

Added a button to the AuthError screen in web/src/lib/supabase.tsx that:

  1. Calls navigator.serviceWorker.getRegistrations() and unregisters all SWs
  2. Iterates caches.keys() and deletes every cache
  3. Reloads the page

This unblocks users already trapped behind the broken SW without requiring DevTools. This is now the canonical recovery pattern for any future ZTNA-vs-SW issue.
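
The three steps can be sketched as follows. This is an illustration of the pattern, not the code in web/src/lib/supabase.tsx: `resetSessionAndReload`, `ResetDeps`, and the `*Like` interfaces are hypothetical names. The browser objects are passed in as parameters so the logic can be exercised outside a browser; in the app they would be `navigator.serviceWorker`, `window.caches`, and `() => window.location.reload()`.

```typescript
// Local, minimal versions of the browser interfaces used by the reset flow.
interface RegistrationLike {
  unregister(): Promise<boolean>;
}
interface ServiceWorkerContainerLike {
  getRegistrations(): Promise<ReadonlyArray<RegistrationLike>>;
}
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(cacheName: string): Promise<boolean>;
}
interface ResetDeps {
  serviceWorker?: ServiceWorkerContainerLike; // absent where SWs are unsupported
  cacheStorage?: CacheStorageLike;
  reload: () => void;
}

async function resetSessionAndReload(deps: ResetDeps): Promise<void> {
  const serviceWorker = deps.serviceWorker;
  const cacheStorage = deps.cacheStorage;
  try {
    if (serviceWorker) {
      // 1. Unregister every service worker registered for this origin.
      const registrations = await serviceWorker.getRegistrations();
      await Promise.all(registrations.map((registration) => registration.unregister()));
    }
    if (cacheStorage) {
      // 2. Delete every Cache Storage entry, including the Workbox precache.
      const cacheNames = await cacheStorage.keys();
      await Promise.all(cacheNames.map((cacheName) => cacheStorage.delete(cacheName)));
    }
  } finally {
    // 3. Reload even if cleanup failed part-way, so the user is never left
    //    on a dead screen. Any cleanup error still propagates to the caller.
    deps.reload();
  }
}
```

Putting the reload in `finally` is a design choice in this sketch: a partial cleanup followed by a reload is more useful to a trapped user than an unhandled error with no reload.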

Lessons & Gotchas

ZTNA + custom service worker is a permanent trap

Any navigation interception breaks CF Access’s cookie-refresh redirect dance. If NavigationRoute is ever reintroduced, it MUST whitelist requests to CF Access endpoints AND pass through 30x responses unmodified. Easier rule: don’t use NavigationRoute on this origin, period. Workbox precaching of static assets is fine — that’s not the trap.

Service workers persist across refresh — design for recovery

Refresh is the user’s first instinct. It does not help with SW-induced bugs. Every codebase shipping a custom SW should expose an in-app “reset session” button that unregisters SWs + clears caches. We now have one in web/src/lib/supabase.tsx — keep it. Do not remove it as part of any future cleanup.

Always include skipWaiting + clients.claim for fix-ship SWs

When shipping a fix to a broken SW, include self.skipWaiting() on install and self.clients.claim() on activate so the new SW activates immediately rather than waiting for all tabs to close. Without this, recovery requires manual DevTools intervention from each affected user.

Stale doc — web/CLAUDE.md says Supabase has no JWT auth

web/CLAUDE.md still claims “Supabase client factory (anon key, no JWT auth)” but the JWT exchange via the cf-access-auth Edge Function was reintroduced sometime after commit 769ba5c. The current security note also reflects the older anon-only model. Both docs need a separate update pass — out of scope for this incident note but flagged for follow-up.

Workbox precaching is fine — only NavigationRoute is the trap

The general Workbox approach (precache manifest, runtime caching for static assets) is compatible with CF Access ZTNA. The specific issue is NavigationRoute + createHandlerBoundToURL, which by design intercepts navigations. If push notifications need a custom SW, build one without NavigationRoute (use precacheAndRoute for assets only).

Implications for Push Notifications Work

The push-notifications research at push-notifications is the work that needed injectManifest + a custom web/src/sw.ts. The research correctly flagged the CF Access ZTNA + iOS standalone cookie spike as a planning blocker but did not anticipate NavigationRoute would also break navigation cookie refresh on every platform.

Before resuming push-notifications implementation:

  1. Hard rule: the push SW must NOT register NavigationRoute. Add this as an explicit “don’t” in the plan.
  2. Test plan addition: every PR touching web/src/sw.ts must include manual verification that an expired CF_Authorization cookie still triggers the CF Access login redirect. This cannot be unit-tested — it requires a real ZTNA round trip.
  3. Keep the recovery button: the “Reset session and reload” button stays. Future SW changes should consider it part of the contract.
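
The hard rule in item 1 could also be backed by a cheap static check in CI. The sketch below is hypothetical (`findForbiddenSwPatterns` and `FORBIDDEN_SW_PATTERNS` are not existing code in this repo) and deliberately naive: it flags any mention of the identifiers, including in comments. It complements the manual expired-cookie verification in item 2 and does not replace it.

```typescript
// Hypothetical CI guard: identifiers that must not appear in web/src/sw.ts.
const FORBIDDEN_SW_PATTERNS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: 'NavigationRoute', pattern: /\bNavigationRoute\b/ },
  { label: 'createHandlerBoundToURL', pattern: /\bcreateHandlerBoundToURL\b/ },
];

// Returns the labels of every forbidden identifier found in the SW source text.
// An empty array means the source passes this check.
function findForbiddenSwPatterns(swSource: string): string[] {
  return FORBIDDEN_SW_PATTERNS
    .filter((entry) => entry.pattern.test(swSource))
    .map((entry) => entry.label);
}
```

A CI step would read web/src/sw.ts as text, call the function, and fail the build when the returned array is not empty.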

How to Verify Going Forward

When changing anything in web/src/sw.ts or web/vite.config.ts PWA settings:

  1. Deploy to a CF Access-protected preview environment.
  2. Manually expire the CF_Authorization cookie in DevTools (Application → Cookies → delete CF_Authorization).
  3. Refresh. The expected behaviour is: browser redirects to CF Access IdP login, user re-authenticates, fresh cookie issued, SPA loads.
  4. Failure mode: if the SPA loads with Authentication Error: No CF Access token available, the SW is intercepting navigations again — STOP and audit the SW source.

Related

  • security — Auth architecture (CF Access ZTNA + Supabase). May need an update re: JWT exchange via cf-access-auth Edge Function.
  • push-notifications — Original research that motivated the injectManifest switch. Update its “Gotchas” section to add NavigationRoute as a hard “don’t.”
  • mobile-native-feel — Mobile PWA architecture; consumers of the same SW.
  • debugging-log-crm — Project debugging log. This incident also belongs in the index there.
  • levandor-crm — Project overview.
  • agent-context-crm — Agent quick reference.