Room 62243 Screenshot Failure Investigation

All 3 screenshot attempts for room 62243 (customer 46019, NUSZ deployment) failed with RPCCLIENT MESSAGE TIMEOUT rpc-transport-css. This is NOT an isolated incident — the remote screenshot feature has a 100% failure rate across 31 days of logs (March 16 — April 16, 2026), affecting 91 unique rooms with 157 total failures and 0 successes.

Architecture Context

The screenshot flow traverses vuer_oss, vuer_css, and RabbitMQ RPC:

  1. Operator clicks “take photo” in OSS UI
  2. OSS calls VideoChatService.processRemoteScreenshot(), which calls RoomTransportSession.remoteScreenshot(), which sends an RPC to the rpc-transport-css queue
  3. CSS receives RPC, emits Socket.IO event videochat:screenshot:remote to customer’s browser
  4. Browser captures screenshot, sends back via Socket.IO
  5. CSS returns screenshot data via RPC reply to OSS

Steps 3-4 never complete — the browser never responds.

Timeline for Room 62243

Time | Event | Source
12:43:01 | Email sent to customer 46019 | vuer_oss
12:46:37 | Customer enters waiting room | vuer_css
12:47:04 | “Client not found for room #62243” (socket not connected) | vuer_css:172502
12:47:06–12:47:31 | 5 invite retries, all WRONG_PAGE (email_validation videochat) | vuer_css:172505-172532
12:47:15 | Customer joins videochat room (invite retries still fail: race condition) | vuer_css:172527
12:50:12.934 | Screenshot #1 TIMEOUT | vuer_oss:751742
12:50:19 | Customer reconnects | vuer_css:172739
12:50:20.769 | Ping timeout disconnect (“client did not send PONG”) | vuer_css:172740
12:50:22 | camera:get:values fails (client not found) | vuer_css:172744
12:50:23 | Customer reconnects again | vuer_css:172759
12:50:42.452 | Screenshot #2 TIMEOUT | vuer_oss:751802
12:52:04–12:55:10 | 6 more room rejoin events (10 joins in total for the session) | vuer_css
12:54:18.250 | Screenshot #3 TIMEOUT | vuer_oss:752364
12:55:39.052 | Room closed by operator | vuer_css:172949
12:56:48 | Customer sends “elnezest itt van meg?” (Hungarian: “Excuse me, are you still there?”) to closed room | vuer_oss:752820
12:59:29–13:01:03 | Customer toggles high contrast and mute; all rejected (closed room) | vuer_oss
13:09:59 | Ghost socket disconnect (zombie connection 14 min after close) | vuer_css
13:27:35 | Customer re-identified in room 62265 | vuer_css:174549
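
The intervals quoted in this report can be checked directly from the timeline timestamps. The following is a minimal sketch of that arithmetic; the helper names toMs and elapsedSeconds are illustrative and do not come from either codebase.

```javascript
// Convert "HH:MM:SS" or "HH:MM:SS.mmm" to integer milliseconds since midnight.
function toMs(stamp) {
  const [h, m, s] = stamp.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

// Elapsed time between two same-day timestamps, in seconds.
function elapsedSeconds(from, to) {
  return (toMs(to) - toMs(from)) / 1000;
}

// Timestamps copied from the timeline.
const roomClosed = "12:55:39.052";
const ghostDisconnect = "13:09:59";
const firstScreenshotTimeout = "12:50:12.934";
const lastScreenshotTimeout = "12:54:18.250";

// Minutes between the operator closing the room and the ghost socket disconnect.
const ghostLagMinutes = elapsedSeconds(roomClosed, ghostDisconnect) / 60;
// Seconds between the first and the last screenshot timeout.
const screenshotSpanSeconds = elapsedSeconds(firstScreenshotTimeout, lastScreenshotTimeout);

console.log(ghostLagMinutes.toFixed(1));       // "14.3"
console.log(screenshotSpanSeconds.toFixed(1)); // "245.3"
```

The ghost socket therefore outlived the room by a little over 14 minutes, and all three screenshot attempts fell within roughly four minutes of each other.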

Root Cause Analysis

Root Cause

The rpc-transport-css RabbitMQ queue is functional — it delivers other operations like videochat:camera:get:values and videochat:receivingPeer:start/stop (523 processed events in CSS logs). The failure is specific to videochat:screenshot:remote.

When CSS receives the screenshot RPC, it must relay a Socket.IO event to the customer’s browser and wait for the browser to capture and return the image. The browser never responds. This causes:

  • CSS internal timeout fires (RPCServer response timeout rpc-transport-css — 81 occurrences)
  • OSS timeout fires (RPCCLIENT MESSAGE TIMEOUT rpc-transport-css — 456 occurrences)
  • Late replies arrive after OSS timeout (UNABLE TO MATCH RPC REPLY — 152 occurrences)
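
These three log signatures are what one would expect from an RPC client that tracks in-flight requests by correlation ID, which is the usual RabbitMQ RPC pattern. The sketch below illustrates that mechanism only; it assumes such a table exists on the OSS side, and the class name, method names, and the 30-second timeout are hypothetical rather than taken from vuer_oss.

```javascript
// Hypothetical in-flight request table; names are illustrative, not from vuer_oss.
class PendingRpcTable {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.pending = new Map(); // correlationId -> deadline in ms
  }

  // Record a request that was sent at time nowMs.
  register(correlationId, nowMs) {
    this.pending.set(correlationId, nowMs + this.timeoutMs);
  }

  // Drop every request whose deadline has passed and return their IDs.
  // Each returned ID would correspond to one client-side timeout log line.
  expire(nowMs) {
    const timedOut = [];
    for (const [id, deadline] of this.pending) {
      if (deadline <= nowMs) {
        this.pending.delete(id);
        timedOut.push(id);
      }
    }
    return timedOut;
  }

  // Handle an incoming reply: "matched" if the request is still pending,
  // "unmatched" if it already timed out (or was never registered).
  resolve(correlationId) {
    return this.pending.delete(correlationId) ? "matched" : "unmatched";
  }
}

const table = new PendingRpcTable(30000); // 30 s is an assumed value
table.register("req-1", 0);
const timedOut = table.expire(30000);     // the request has hit its deadline
const lateReply = table.resolve("req-1"); // the reply arrives after the timeout
console.log(timedOut, lateReply);
```

Under this model, a reply that arrives after its request has expired has nothing left to match against, which would explain why late replies show up as UNABLE TO MATCH RPC REPLY rather than as successes.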

The root cause lies in the browser-side JavaScript screenshot handler, but the exact mechanism has not yet been determined. The candidates are:

  • The handler is not registered, or the event name is mismatched
  • Browser security or tab throttling prevents canvas capture
  • The Socket.IO event is emitted, but the client-side listener is broken

Evidence

Evidence Chain

Hypothesis 1: RPC queue broken. REFUTED.
The queue delivers camera:get:values and receivingPeer events. 523 processed events prove the queue works.

Hypothesis 2: CSS handler missing. REFUTED.
Room 61418 on April 8 proves the handler exists (vuer_css:85010):
  Failed to process videochat:screenshot:remote (roomId: 61418) Error: Client not found for room (#61418)

Hypothesis 3: Browser never responds to screenshot request. CONFIRMED.

  • 153 of 157 failures are TIMEOUT (CSS never responds because browser never responds)
  • 4 are “cannot answer” (CSS received RPC but customer’s socket was already disconnected)
  • The one CSS-side log entry for room 61418 shows the failure at the Socket.IO emission step (SocketService.js:50:15)

Systemic Scope

Systemic Impact

  • 157 processRemoteScreenshot failures, 0 successes (100% failure rate)
  • 91 unique rooms affected
  • 31 days of continuous failure (Mar 16 — Apr 16)
  • Every business day with a screenshot attempt recorded failures
  • Feature has never worked in the entire log period
  • April 15 was the worst day: 28 failures, concentrated burst of 9 RPCServer timeouts in 10:00—10:07

Failure type breakdown:

Type | Count | Percentage
RPCCLIENT MESSAGE TIMEOUT | 152 | 96.8%
“cannot answer” | 4 | 2.5%
bare “timeout” | 1 | 0.6%
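
The percentages in this breakdown can be reproduced from the raw counts. The following short sketch does so; the variable and function names are illustrative.

```javascript
// Failure counts by type, copied from the breakdown.
const failureCounts = {
  "RPCCLIENT MESSAGE TIMEOUT": 152,
  "cannot answer": 4,
  "bare timeout": 1,
};

// Total number of failures across all types.
const totalFailures = Object.values(failureCounts).reduce((sum, n) => sum + n, 0);

// Format a count as a percentage of the total, to one decimal place.
function share(count, total) {
  return ((count / total) * 100).toFixed(1) + "%";
}

for (const [type, count] of Object.entries(failureCounts)) {
  console.log(type, count, share(count, totalFailures));
}
console.log("total", totalFailures); // 157
```

The three rounded percentages sum to 99.9% rather than 100% because each one is rounded to one decimal place independently.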

Additional Findings

  • rpc-email-validation queue also has 31 RPCServer response timeouts — a second broken queue
  • 4,763 “Client not found for room” events across 1,696 rooms in CSS — systemic Socket.IO instability
  • Room 62239 (customer 46014) had identical instability pattern at the same time — not customer-specific
  • No CSS process crashes or restarts on April 15 (supervisor/Redis stable)
  • Nginx has zero screenshot HTTP endpoints — flow is purely RPC+Socket.IO

Corrections from Investigation

Original Claim | Correction
“TransportError at 12:47:02” | No such entry exists in either log
“CSS handler doesn’t exist” | Handler exists, proven by room 61418
“rpc-transport-css queue broken” | Queue works for other operations; only screenshot browser round-trip fails
“Customer stranded 15 min” | Active interactions lasted 5.5 min (until 13:01:03); ghost socket at 13:09:59
“3 cannot-answer occurrences” | 4 occurrences + 1 bare timeout variant
“80+ rooms affected” | 91 rooms confirmed

Recommendations

  1. Investigate the browser-side screenshot handler: Check the vuer_css frontend JS for the Socket.IO event listener that handles videochat:screenshot:remote. Verify that it exists, that it is registered, and that the event name matches.
  2. Add immediate rejection on disconnect: If the customer socket is disconnected, CSS should immediately return an error to OSS instead of waiting for the timeout.
  3. Increase the RPC timeout or add retry with backoff: for example, 3 retries with exponential backoff before failing.
  4. Fix post-close UX: Customer 46019 was never notified that the room had closed. The browser must redirect on the room-close event.
  5. Add monitoring: Alert on RPCCLIENT MESSAGE TIMEOUT rpc-transport-css. Even one occurrence means screenshots are broken.
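
The immediate-rejection behaviour in item 2 could look roughly like the sketch below. It is not the existing vuer_css handler: the function name, the reply callback, the error codes, and the 5-second acknowledgement timeout are all hypothetical. It also assumes the browser returns the image through a Socket.IO acknowledgement callback, whereas the real client may send a separate event instead.

```javascript
// Hypothetical CSS-side relay; all names here are illustrative.
// `socket` is the customer's Socket.IO socket, or undefined if none is tracked.
// `reply` sends the RPC response back to OSS.
function relayScreenshotRequest(socket, roomId, reply, ackTimeoutMs = 5000) {
  // Fail fast: no socket, or a disconnected one, can never answer.
  if (!socket || !socket.connected) {
    reply({ ok: false, error: "CLIENT_NOT_CONNECTED", roomId });
    return;
  }

  let settled = false;

  // Bound the wait for the browser so OSS gets an explicit error
  // instead of running into its own RPC timeout.
  const timer = setTimeout(() => {
    if (settled) return;
    settled = true;
    reply({ ok: false, error: "BROWSER_ACK_TIMEOUT", roomId });
  }, ackTimeoutMs);

  // Assumed: the browser passes the captured image to the ack callback.
  socket.emit("videochat:screenshot:remote", { roomId }, (image) => {
    if (settled) return;
    settled = true;
    clearTimeout(timer);
    reply({ ok: true, image, roomId });
  });
}

// Usage with stand-in objects instead of real sockets.
const replies = [];
relayScreenshotRequest(undefined, 62243, (r) => replies.push(r));
const fakeSocket = {
  connected: true,
  emit: (event, payload, ack) => ack("base64-image-data"),
};
relayScreenshotRequest(fakeSocket, 62243, (r) => replies.push(r));
console.log(replies.map((r) => r.error || "ok"));
```

With this shape, the four “cannot answer” cases would return to OSS immediately with a distinct error, and a silent browser would surface as its own error code rather than as a generic RPC timeout.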

Key Files to Investigate

File | Purpose
vuer_oss/server/transport/session/RoomTransportSession.js:70 | Sends the RPC request
vuer_oss/server/service/VideoChatService.js:272 | Orchestrates screenshot flow
vuer_oss/server/socket/events/videochat.js:584 | Socket event handler
vuer_css server-side | RPC handler for rpc-transport-css screenshot action
vuer_css frontend JS | Socket.IO listener for videochat:screenshot:remote