Room 62243 Screenshot Failure Investigation
All 3 screenshot attempts for room 62243 (customer 46019, NUSZ deployment) failed with RPCCLIENT MESSAGE TIMEOUT rpc-transport-css. This is NOT an isolated incident — the remote screenshot feature has a 100% failure rate across 31 days of logs (March 16 — April 16, 2026), affecting 91 unique rooms with 157 total failures and 0 successes.
Architecture Context
The screenshot flow traverses vuer_oss, vuer_css, and RabbitMQ RPC:
Screenshot RPC Flow
    sequenceDiagram
        participant Op as Operator (OSS UI)
        participant OSS as vuer_oss
        participant RMQ as RabbitMQ
        participant CSS as vuer_css
        participant Browser as Customer Browser
        Op->>OSS: Click "take photo"
        OSS->>OSS: VideoChatService.processRemoteScreenshot()
        OSS->>RMQ: RPC to rpc-transport-css
        RMQ->>CSS: Deliver RPC message
        CSS->>Browser: Socket.IO: videochat:screenshot:remote
        Note over Browser: Browser captures screenshot
        Browser-->>CSS: Screenshot data via Socket.IO
        CSS-->>RMQ: RPC reply with image
        RMQ-->>OSS: Deliver reply
        OSS-->>Op: Display screenshot
        Note over Browser,CSS: Steps 3-4 never complete.<br/>Browser never responds.
        style Op fill:#264653,color:#fff
        style OSS fill:#2d2d2d,color:#fff
        style CSS fill:#2d2d2d,color:#fff
        style Browser fill:#3d2020,color:#fff
1. Operator clicks “take photo” in OSS UI
2. OSS calls VideoChatService.processRemoteScreenshot() → RoomTransportSession.remoteScreenshot() → RPC to rpc-transport-css queue
3. CSS receives RPC, emits Socket.IO event videochat:screenshot:remote to customer’s browser
4. Browser captures screenshot, sends back via Socket.IO
5. CSS returns screenshot data via RPC reply to OSS

Steps 3-4 never complete: the browser never responds.
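
The round-trip in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the vuer_css implementation: a Node.js EventEmitter stands in for the Socket.IO connection, and `requestScreenshot`, the payload shape, and the acknowledgement callback are assumptions made for the example. Only the event name videochat:screenshot:remote comes from the logs. The point is that when no listener is registered under that exact name, the callback never runs and the caller is left waiting.

```javascript
const { EventEmitter } = require('node:events');

// Event name taken from the CSS logs.
const SCREENSHOT_EVENT = 'videochat:screenshot:remote';

// Hypothetical CSS-side relay: forwards the request to the "browser" and
// passes onResult along as an acknowledgement callback.
function requestScreenshot(browserSocket, roomId, onResult) {
  // EventEmitter#emit returns true only when at least one listener exists.
  return browserSocket.emit(SCREENSHOT_EVENT, { roomId }, onResult);
}

// Case 1: the browser registered a listener and answers through the callback.
const healthyBrowser = new EventEmitter();
healthyBrowser.on(SCREENSHOT_EVENT, (payload, ack) => {
  ack({ roomId: payload.roomId, image: 'stub-image-data' });
});

// Case 2: the listener is registered under a different (mismatched) name.
const brokenBrowser = new EventEmitter();
brokenBrowser.on('videochat:screenshot', () => {});

const results = [];
const deliveredHealthy = requestScreenshot(healthyBrowser, 62243, (r) => results.push(r));
const deliveredBroken = requestScreenshot(brokenBrowser, 62243, (r) => results.push(r));

console.log(deliveredHealthy, deliveredBroken, results.length); // prints: true false 1
```

Unlike this local stand-in, an emit across a real network connection cannot report whether the remote side has a listener, which is consistent with the failure surfacing only as a timeout.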
Timeline for Room 62243
| Time | Event | Source |
|---|---|---|
| 12:43:01 | Email sent to customer 46019 | vuer_oss |
| 12:46:37 | Customer enters waiting room | vuer_css |
| 12:47:04 | “Client not found for room #62243” (socket not connected) | vuer_css:172502 |
| 12:47:06—12:47:31 | 5 invite retries, all WRONG_PAGE (email_validation → videochat) | vuer_css:172505-172532 |
| 12:47:15 | Customer joins videochat room (but invite retries still fail — race condition) | vuer_css:172527 |
| 12:50:12.934 | Screenshot #1 TIMEOUT | vuer_oss:751742 |
| 12:50:19 | Customer reconnects | vuer_css:172739 |
| 12:50:20.769 | Ping timeout disconnect (“client did not send PONG”) | vuer_css:172740 |
| 12:50:22 | camera:get:values fails — client not found | vuer_css:172744 |
| 12:50:23 | Customer reconnects again | vuer_css:172759 |
| 12:50:42.452 | Screenshot #2 TIMEOUT | vuer_oss:751802 |
| 12:52:04—12:55:10 | 6 more room rejoin events (total 10 joins in session) | vuer_css |
| 12:54:18.250 | Screenshot #3 TIMEOUT | vuer_oss:752364 |
| 12:55:39.052 | Room closed by operator | vuer_css:172949 |
| 12:56:48 | Customer sends “elnezest itt van meg?” (Hungarian: “sorry, are you still there?”) to closed room | vuer_oss:752820 |
| 12:59:29—13:01:03 | Customer toggles high contrast, mute — all rejected (closed room) | vuer_oss |
| 13:09:59 | Ghost socket disconnect — zombie connection 14 min after close | vuer_css |
| 13:27:35 | Customer re-identified in room 62265 | vuer_css:174549 |
Root Cause Analysis
Root Cause
The rpc-transport-css RabbitMQ queue is functional: it delivers other operations such as videochat:camera:get:values and videochat:receivingPeer:start/stop (523 processed events in CSS logs). The failure is specific to videochat:screenshot:remote.

When CSS receives the screenshot RPC, it must relay a Socket.IO event to the customer’s browser and wait for the browser to capture and return the image. The browser never responds. This causes:
- CSS internal timeout fires (RPCServer response timeout rpc-transport-css, 81 occurrences)
- OSS timeout fires (RPCCLIENT MESSAGE TIMEOUT rpc-transport-css, 456 occurrences)
- Late replies arrive after OSS timeout (UNABLE TO MATCH RPC REPLY, 152 occurrences)

The root cause is in the browser-side JavaScript screenshot handler, with three candidate explanations:
- Either the handler is not registered or the event name is mismatched
- Or browser security or tab throttling prevents canvas capture
- Or the Socket.IO event is emitted but the client-side listener is broken
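
The timeout and late-reply log signatures fit a common request/reply pattern: the caller keeps a table of pending requests keyed by a correlation ID, discards an entry when its timer fires, and can then no longer match a reply that arrives afterwards. The sketch below illustrates that mechanism only. The class name, method names, and return values are hypothetical and are not taken from vuer_oss.

```javascript
// Hypothetical bookkeeping for an RPC client; all names are illustrative.
class PendingRpcTable {
  constructor() {
    this.pending = new Map(); // correlationId -> callback(err, payload)
  }

  // Register a request that is waiting for a reply.
  register(correlationId, callback) {
    this.pending.set(correlationId, callback);
  }

  // Called when the client-side timer fires: the entry is discarded and
  // the caller receives a timeout error.
  expire(correlationId) {
    const callback = this.pending.get(correlationId);
    if (!callback) return false;
    this.pending.delete(correlationId);
    callback(new Error('timeout'), null);
    return true;
  }

  // Called for every incoming reply. A reply whose entry has already been
  // discarded cannot be matched to a request.
  handleReply(correlationId, payload) {
    const callback = this.pending.get(correlationId);
    if (!callback) return 'unmatched';
    this.pending.delete(correlationId);
    callback(null, payload);
    return 'matched';
  }
}

const table = new PendingRpcTable();
const outcomes = [];
table.register('req-1', (err) => outcomes.push(err ? 'timeout' : 'ok'));
table.expire('req-1'); // the client gives up first
const late = table.handleReply('req-1', { image: null }); // reply arrives afterwards
console.log(outcomes, late);
```

In this model, every unmatched reply corresponds to a request that had already timed out on the calling side.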
Evidence
Evidence Chain
Hypothesis 1: RPC queue broken → REFUTED
Queue delivers camera:get:values and receivingPeer events. 523 processed events prove the queue works.

Hypothesis 2: CSS handler missing → REFUTED
Room 61418 on April 8 proves the handler exists: Failed to process videochat:screenshot:remote (roomId: 61418) Error: Client not found for room (#61418) (vuer_css:85010)

Hypothesis 3: Browser never responds to screenshot request → CONFIRMED
- 153 of 157 failures are TIMEOUT (152 RPCCLIENT MESSAGE TIMEOUT plus 1 bare timeout; CSS never responds because browser never responds)
- 4 are “cannot answer” (CSS received RPC but customer’s socket was already disconnected)
- The one CSS-side log entry for room 61418 shows the failure at the Socket.IO emission step (SocketService.js:50:15)
Systemic Scope
Systemic Impact
- 157 processRemoteScreenshot failures, 0 successes (100% failure rate)
- 91 unique rooms affected
- 31 days of continuous failure (Mar 16 — Apr 16)
- Every business day with a screenshot attempt recorded failures
- Feature has never worked in the entire log period
- April 15 was the worst day: 28 failures, including a concentrated burst of 9 RPCServer timeouts between 10:00 and 10:07
Failure type breakdown:
| Type | Count | Percentage |
|---|---|---|
| RPCCLIENT MESSAGE TIMEOUT | 152 | 96.8% |
| “cannot answer” | 4 | 2.5% |
| bare “timeout” | 1 | 0.6% |
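
The percentages follow directly from the counts. A short check, using only the numbers from the table (the variable names are illustrative):

```javascript
// Failure counts copied from the failure type breakdown.
const counts = {
  'RPCCLIENT MESSAGE TIMEOUT': 152,
  'cannot answer': 4,
  'bare timeout': 1,
};

// Total number of failures across all types.
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);

// Share of each failure type in percent, rounded to one decimal place.
const shares = Object.fromEntries(
  Object.entries(counts).map(([type, n]) => [type, ((n / total) * 100).toFixed(1)])
);

console.log(total, shares);
```

The total is 157, matching the overall failure count, and the rounded shares add up to 99.9% because each value is rounded independently.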
Additional Findings
- rpc-email-validation queue also has 31 RPCServer response timeouts, making it a second queue affected by response timeouts
- 4,763 “Client not found for room” events across 1,696 rooms in CSS, indicating systemic Socket.IO instability
- Room 62239 (customer 46014) had identical instability pattern at the same time — not customer-specific
- No CSS process crashes or restarts on April 15 (supervisor/Redis stable)
- Nginx has zero screenshot HTTP endpoints — flow is purely RPC+Socket.IO
Corrections from Investigation
| Original Claim | Correction |
|---|---|
| “TransportError at 12:47:02” | No such entry exists in either log |
| “CSS handler doesn’t exist” | Handler exists, proven by room 61418 |
| “rpc-transport-css queue broken” | Queue works for other operations; only screenshot browser round-trip fails |
| “Customer stranded 15 min” | Active interactions lasted 5.5 min (until 13:01:03); ghost socket at 13:09:59 |
| “3 cannot-answer occurrences” | 4 occurrences + 1 bare timeout variant |
| “80+ rooms affected” | 91 rooms confirmed |
Recommended Fix
- Investigate browser-side screenshot handler: Check vuer_css frontend JS for the Socket.IO event listener that handles videochat:screenshot:remote. Verify it exists, is registered, and the event name matches.
- Add immediate rejection on disconnect: If customer socket is disconnected, CSS should immediately return an error to OSS instead of waiting for timeout.
- Increase RPC timeout or add retry with backoff: 3 retries with exponential backoff before failing.
- Fix post-close UX: Customer 46019 was never notified the room closed. The browser must redirect on room-close event.
- Add monitoring: Alert on RPCCLIENT MESSAGE TIMEOUT rpc-transport-css, since even one occurrence means screenshots are broken.
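
The "immediate rejection on disconnect" recommendation could look roughly like the sketch below. It is an illustration under assumptions, not the vuer_css code: `relayScreenshot`, the `clients` map, the `connected` flag, the error codes, and the 5000 ms default are all hypothetical. Only the event name videochat:screenshot:remote comes from the logs.

```javascript
// Hypothetical CSS-side relay. `clients` maps roomId -> socket-like object
// with a `connected` flag and an `emit(event, payload, ack)` method.
function relayScreenshot(clients, roomId, reply, timeoutMs = 5000) {
  const socket = clients.get(roomId);

  // Fail fast: do not wait for a browser that is not connected.
  if (!socket || !socket.connected) {
    reply({ ok: false, error: 'CLIENT_NOT_CONNECTED' });
    return;
  }

  let settled = false;
  const finish = (result) => {
    if (settled) return; // ignore a late ack after the timeout, and vice versa
    settled = true;
    clearTimeout(timer);
    reply(result);
  };

  // Bounded wait for the browser's answer.
  const timer = setTimeout(
    () => finish({ ok: false, error: 'BROWSER_TIMEOUT' }),
    timeoutMs
  );

  socket.emit('videochat:screenshot:remote', { roomId }, (image) =>
    finish({ ok: true, image })
  );
}

// Usage with stub sockets (room IDs 1 to 3 are made up for the example).
const clients = new Map([
  [1, { connected: false, emit() {} }],
  [2, { connected: true, emit(_event, _payload, ack) { ack('stub-image'); } }],
]);
const replies = [];
relayScreenshot(clients, 1, (r) => replies.push(r)); // disconnected socket
relayScreenshot(clients, 2, (r) => replies.push(r)); // stub browser answers
relayScreenshot(clients, 3, (r) => replies.push(r)); // unknown room
console.log(replies);
```

With this shape, OSS would receive a distinct error for a disconnected customer right away, and a separate error code when a connected browser stays silent, which would make the two cases distinguishable in the logs.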
Key Files to Investigate
| File | Purpose |
|---|---|
| vuer_oss/server/transport/session/RoomTransportSession.js:70 | Sends the RPC request |
| vuer_oss/server/service/VideoChatService.js:272 | Orchestrates screenshot flow |
| vuer_oss/server/socket/events/videochat.js:584 | Socket event handler |
| vuer_css server-side | RPC handler for rpc-transport-css screenshot action |
| vuer_css frontend JS | Socket.IO listener for videochat:screenshot:remote |
Related
- vuer_oss
- vuer_css
- FaceKom
- SLANUSZ-25 — another NUSZ deployment investigation
- FKITDEV-8639 — another NUSZ investigation