ErrorWithStepStatus::log(status, message) returned from a flow step silently discards the message payload at the parent log site. Only a generic wrapper string + flow.step.status reaches Datadog. Diagnosed on develop after the merge of d4efd057 feat: error handling redesign (the BE-2272 / BE-2023 arc).
Impact
Every environment (local, dev, prod). Datadog log entries from Log-variant step failures contain the wrapper text “Finished running step with errors” / “Failed running step” plus status, but never the real error message. Operators cannot see why a step failed without reproducing it locally.
Symptom
Triggering a flow step that returns ErrorWithStepStatus::log(StepStatus::Error, "explicit human-readable diagnostic") produces logs of the form:
level=ERROR msg="Finished running step with errors" flow.step.status=Error flow.step.execution_time=...
The string "explicit human-readable diagnostic" is nowhere in the output — neither in the message, nor in error.source, nor in error.kind.
Root Cause
The Log variant of ErrorWithStepStatus carries a message: String, but every downstream consumer drops it.
1. Constructor (intent)
mando-lib/src/workflow/mod.rs:472-478 — ErrorWithStepStatus::log(status, message) builds:
Self::Log { status, message, logged_at_failure_site: false }

The logged_at_failure_site: false flag means: “I haven’t been logged yet; the parent must log me”.
2. Destructure drops message
StepResult::log() at mando-lib/src/workflow/mod.rs:150-229:
- Lines 155-159 destructure the Log arm: ErrorWithStepStatus::Log { status, logged_at_failure_site, .. } => { ... }. The .. rest pattern swallows message.
- Lines 196-225 then call tracing::error! (when logged_at_failure_site: false) or warn! (when true) with only:
  - the literal string "Finished running step with errors" / "Failed running step"
  - flow.step.status
  - flow.step.execution_time

  No message, no error.source, no error.kind is attached.
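The effect of the rest pattern can be reproduced in isolation. The following is a minimal std-only sketch, not the mando-lib code: the type shapes are simplified stand-ins, render_current is a hypothetical helper that returns the line instead of calling tracing, and a single wrapper string is used for both levels for brevity.

```rust
use std::fmt;

// Simplified stand-ins; the real definitions live in mando-lib and may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus {
    Error,
}

impl fmt::Display for StepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepStatus::Error => write!(f, "Error"),
        }
    }
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

// Hypothetical helper: renders roughly what the parent log site emits today.
pub fn render_current(err: &ErrorWithStepStatus) -> String {
    match err {
        // `..` skips `message`, so the arm body has no binding for it.
        ErrorWithStepStatus::Log { status, logged_at_failure_site, .. } => {
            let level = if *logged_at_failure_site { "WARN" } else { "ERROR" };
            format!(
                "level={level} msg=\"Finished running step with errors\" flow.step.status={status}"
            )
        }
    }
}
```

Because message is never bound inside the arm, no later change to the emit site can include it without first changing the pattern.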
3. Display impl also drops it
#[error("status: {status}")] on the Log variant — mando-lib/src/workflow/mod.rs:443. So every downstream stringification (format!("{}", err), anyhow chain, etc.) yields only "status: Error".
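To illustrate why every stringification collapses to "status: Error", here is a std-only sketch with a hand-written Display impl standing in for the thiserror attribute. The type shapes and the stringify_via_trait_object helper are assumptions for illustration; the real code derives Display through thiserror.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, Clone, Copy)]
pub enum StepStatus {
    Error,
}

impl fmt::Display for StepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepStatus::Error => write!(f, "Error"),
        }
    }
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

// Hand-written equivalent of `#[error("status: {status}")]` on the Log variant.
impl fmt::Display for ErrorWithStepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ErrorWithStepStatus::Log { status, .. } => write!(f, "status: {status}"),
        }
    }
}

impl Error for ErrorWithStepStatus {}

// Hypothetical helper: what a generic catch site sees once the concrete type is erased.
pub fn stringify_via_trait_object(err: ErrorWithStepStatus) -> String {
    let boxed: Box<dyn Error> = Box::new(err);
    boxed.to_string()
}
```

Any consumer that only holds a trait object or calls to_string() is limited to what Display writes, so the message has to be part of the format string to survive.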
4. status_or_error collapses Log to Ok
mando-lib/src/workflow/mod.rs:231-239 — Log is folded into Ok(status) for group propagation. Whatever message survived to this point is gone.
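The collapse can be sketched as follows. The real signature of status_or_error is not shown in this note, so the function below is an assumed shape only: a String stands in for anyhow::Error and the Log variant is reduced to two fields.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus {
    Success,
    Error,
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log { status: StepStatus, message: String },
    // Stand-in for the Error(anyhow::Error) arm.
    Error(String),
}

// Assumed shape: Log is folded into Ok(status); only the Error arm stays an error.
pub fn status_or_error(
    result: Result<StepStatus, ErrorWithStepStatus>,
) -> Result<StepStatus, String> {
    match result {
        Ok(status) => Ok(status),
        // `message` is not bound here, so it is dropped with the value.
        Err(ErrorWithStepStatus::Log { status, .. }) => Ok(status),
        Err(ErrorWithStepStatus::Error(e)) => Err(e),
    }
}
```

After this point the caller holds only a StepStatus, which has no room for the diagnostic text.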
5. Top-level flow log catches the empty husk
mando-bess/src/workflow/flow.rs:289 — error!(err, message = "Flow execution failed", ...) catches FlowTriggerError::FlowError(ErrorWithStepStatus). Because of (3), err.to_string() is "status: Error". That’s all that reaches Datadog at the flow level.
Important: the Error(anyhow::Error) arm is fine
ErrorWithStepStatus::Error(anyhow::Error) is correctly logged at mando-lib/src/workflow/mod.rs:160-171 via the error! macro — that path preserves error.source, error.kind, error.code. Only the ::Log arm is broken.
For Agents
Two failure modes in ErrorWithStepStatus: ::Error(anyhow) works correctly; ::Log { message, .. } drops message at every consumer. Diagnose by checking which constructor was used at the step site.
Why Tests Did Not Catch It
Tests at mando-lib/src/workflow/mod.rs:548-605 exist for StepResult::log() and assert on:
- log level (error vs warn)
- the generic wrapper string ("Finished running step with errors")
None of them capture the emitted event and assert that message (the actual payload passed into ErrorWithStepStatus::log(...)) appears in the output. That’s how the regression shipped.
Test fix pattern
Reuse the capture harness from BE-2272 (tracing::subscriber::with_default + custom MakeWriter over Mutex<Vec<u8>>, serde_json::from_slice the captured bytes). Add cases that pass a known-distinctive message into ErrorWithStepStatus::log(...) and assert that exact string appears at a documented field path in the JSON event.
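The shape of such a test can be sketched without the tracing stack. The block below is a std-only analog, not the BE-2272 harness: SharedBuf plays the role of the Mutex<Vec<u8>> sink, and emit_step_failure is a hypothetical stand-in for the code under test that writes one JSON-like line (no escaping is performed, so it is only suitable for plain marker strings).

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

// In-memory sink shared between the writer and the assertions.
#[derive(Clone, Default)]
pub struct SharedBuf(Arc<Mutex<Vec<u8>>>);

impl Write for SharedBuf {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl SharedBuf {
    // Returns everything captured so far as text.
    pub fn contents(&self) -> String {
        let bytes = self.0.lock().unwrap().clone();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

// Hypothetical emitter standing in for the logging path under test.
pub fn emit_step_failure(out: &mut impl Write, status: &str, message: &str) -> io::Result<()> {
    writeln!(
        out,
        "{{\"flow.step.status\":\"{status}\",\"flow.step.message\":\"{message}\"}}"
    )
}
```

A regression test then writes a distinctive marker through a clone of the buffer and asserts that the marker appears under the chosen field name in the captured output. In the real harness the same assertion would run against the parsed JSON event rather than a substring.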
Cross-reference: possibly already addressed
The historian’s LOG entry for 2026-05-04 lists follow-up commits on feature/BE-2272 that may already touch this code path:
- c945514e fix: log step errors at the failure site to preserve real error.kind instead of generic wrapper at the catch boundary
- 4c543cb1 fix: downgraded parent flow error logs to warn when the child step has already logged the error (dedupe DD noise)
- 567373a5, 63d6fa69, 8f9297e4
c945514e in particular sounds adjacent: “log at the failure site to preserve real error.kind”. Before patching, diff these commits against the current develop head to confirm whether the Log arm’s message is still being dropped or whether the failure-site logging path has been promoted to also emit message.
If those commits only fixed the Error(anyhow) arm and left Log { message } untouched, the bug stands.
Secondary Finding
mando-lib/src/app/dd_formatter.rs:122-124 — record_error uses value.to_string(), capturing only Display, no .source() chain. Niche path though: the error! macro emits errors as error.source = %__err which lands in record_str, not record_error. Not the cause of the message drop, but worth fixing alongside.
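For reference, the difference between Display-only capture and walking the source chain looks like this with std::error::Error alone. Outer, Inner, and render_chain are hypothetical names for illustration and are not taken from dd_formatter.rs.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
pub struct Inner;

impl fmt::Display for Inner {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connection refused")
    }
}

impl Error for Inner {}

#[derive(Debug)]
pub struct Outer {
    pub source: Inner,
}

impl fmt::Display for Outer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "step failed")
    }
}

impl Error for Outer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

// Joins the top-level Display output with every cause reachable via source().
pub fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Here Outer { source: Inner }.to_string() yields only "step failed", while render_chain yields "step failed: connection refused".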
Fix Sketch
Minimum change to stop the bleed:
- StepResult::log(): destructure message instead of ..; pass it as a structured field on the error!/warn! macro call:

      ErrorWithStepStatus::Log { status, message, logged_at_failure_site } => {
          if logged_at_failure_site {
              warn!(flow.step.status = %status, flow.step.message = %message, ...);
          } else {
              error!(flow.step.status = %status, flow.step.message = %message, ...);
          }
      }

- Display for the Log variant: promote to #[error("status: {status}: {message}")] so anyhow chains and top-level catches preserve it too.
- status_or_error: when collapsing Log to Ok(status) for group propagation, either keep a side-channel or log message before discarding.
- Add regression tests that assert message payload presence at a documented JSON path (see harness pattern above).
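The first two items can be checked together in a std-only sketch. As before, the types are simplified stand-ins, Display is hand-written in place of the thiserror attribute, and render_fixed is a hypothetical helper that returns the line rather than calling the tracing macros; flow.step.message is the field name proposed above, not a confirmed one.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus {
    Error,
}

impl fmt::Display for StepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepStatus::Error => write!(f, "Error"),
        }
    }
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

// Hand-written equivalent of `#[error("status: {status}: {message}")]`.
impl fmt::Display for ErrorWithStepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ErrorWithStepStatus::Log { status, message, .. } => {
                write!(f, "status: {status}: {message}")
            }
        }
    }
}

impl std::error::Error for ErrorWithStepStatus {}

// Hypothetical helper: binds `message` and emits it as its own field.
pub fn render_fixed(err: &ErrorWithStepStatus) -> String {
    match err {
        ErrorWithStepStatus::Log { status, message, logged_at_failure_site } => {
            // Already logged at the failure site: downgrade to WARN to avoid duplicates.
            let level = if *logged_at_failure_site { "WARN" } else { "ERROR" };
            format!("level={level} flow.step.status={status} flow.step.message={message:?}")
        }
    }
}
```

With both changes, the marker string is visible at the parent log site and in any to_string() taken further up the stack.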
How to Verify
- Pick any flow step that returns ErrorWithStepStatus::log(StepStatus::Error, "VERIFY_MARKER_<random>") (or temporarily wrap an existing step).
- Trigger the flow (locally, dev, or prod; choose appropriately).
- grep VERIFY_MARKER_ in the logs (stdout / Datadog / wherever).
- Expected (current broken state): zero matches; logs contain only "Finished running step with errors" + flow.step.status=Error.
- Expected (after fix): the marker string appears at flow.step.message (or whatever field name is chosen).
File:Line Index
| Concern | Location |
|---|---|
| Constructor ErrorWithStepStatus::log | mando-lib/src/workflow/mod.rs:472-478 |
| Display for Log variant | mando-lib/src/workflow/mod.rs:443 |
| StepResult::log() destructure drops message | mando-lib/src/workflow/mod.rs:155-159 |
| StepResult::log() emit site (no message) | mando-lib/src/workflow/mod.rs:196-225 |
| Error(anyhow) arm (works correctly) | mando-lib/src/workflow/mod.rs:160-171 |
| status_or_error drops message | mando-lib/src/workflow/mod.rs:231-239 |
| Top-level flow catch | mando-bess/src/workflow/flow.rs:289 |
| Insufficient tests | mando-lib/src/workflow/mod.rs:548-605 |
| Secondary record_error Display-only | mando-lib/src/app/dd_formatter.rs:122-124 |
Related
- BE-2272 — DD log field flattening, capture-harness pattern reusable for regression tests here
- BE-1842 Datadog Observability — observability arc this regression undermines
- mando-lib — crate containing the broken code path
- mando-bess — top-level flow catch site
- LOG: see 2026-05-04 entry for c945514e / 4c543cb1 cross-check commits
- Agent Context