An ErrorWithStepStatus::log(status, message) returned from a flow step has its message payload silently discarded at the parent log site. Only a generic wrapper string and flow.step.status reach Datadog. Diagnosed on develop after the merge of d4efd057 feat: error handling redesign (the BE-2272 / BE-2023 arc).

Impact

Every environment (local, dev, prod). Datadog log entries from Log-variant step failures contain the wrapper text “Finished running step with errors” / “Failed running step” plus the status, but never the real error message. Operators cannot see why a step failed without reproducing the failure locally.

Symptom

Triggering a flow step that returns

ErrorWithStepStatus::log(StepStatus::Error, "explicit human-readable diagnostic")

produces logs of the form

level=ERROR msg="Finished running step with errors" flow.step.status=Error flow.step.execution_time=...

The string "explicit human-readable diagnostic" is nowhere in the output — neither in the message, nor in error.source, nor in error.kind.

Root Cause

The Log variant of ErrorWithStepStatus carries a message: String, but every downstream consumer drops it.

1. Constructor (intent)

At mando-lib/src/workflow/mod.rs:472-478, ErrorWithStepStatus::log(status, message) builds:

Self::Log { status, message, logged_at_failure_site: false }

The logged_at_failure_site: false flag means: “I haven’t been logged yet — the parent must log me”.
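
To make the constructor's intent concrete, here is a minimal std-only sketch of the Log variant and its constructor. The single-variant StepStatus, the omission of the Error(anyhow::Error) arm, and the `impl Into<String>` parameter are assumptions made for illustration; the real definitions are the ones at the path cited above.

```rust
// Simplified stand-ins for the real types in mando-lib. The variant set of
// StepStatus and the exact parameter types are assumptions, not the real API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus {
    Error,
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

impl ErrorWithStepStatus {
    /// Stores the message and starts the flag at `false`, which signals
    /// that the parent is still expected to log this failure.
    pub fn log(status: StepStatus, message: impl Into<String>) -> Self {
        Self::Log {
            status,
            message: message.into(),
            logged_at_failure_site: false,
        }
    }
}
```

The point of the sketch is that the payload is present in the value at construction time; it only goes missing later, at the consumers.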

2. Destructure drops message

StepResult::log() at mando-lib/src/workflow/mod.rs:150-229:

  • Lines 155-159 destructure the Log arm:

    ErrorWithStepStatus::Log { status, logged_at_failure_site, .. } => { ... }

    The .. swallows message.

  • Lines 196-225 then call tracing::error! (when logged_at_failure_site: false) or warn! (true) with only:

    • the literal string "Finished running step with errors" / "Failed running step"
    • flow.step.status
    • flow.step.execution_time

    No message, no error.source, no error.kind is attached.
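
The effect of the `..` pattern can be reproduced in isolation. The sketch below is a simplified model, not the real StepResult::log(): the hypothetical helper render_step_log returns the would-be log line as a String instead of calling the tracing macros, and the type definitions are trimmed stand-ins.

```rust
// Simplified model: the real code calls tracing macros; this stand-in
// returns the log line as a String so the effect is easy to inspect.
#[derive(Debug, Clone, Copy)]
pub enum StepStatus {
    Error,
}

pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

/// Hypothetical helper modelling the emit site of StepResult::log().
pub fn render_step_log(err: &ErrorWithStepStatus) -> String {
    match err {
        // `..` matches `message` without binding it, so nothing in this
        // arm can reference the payload.
        ErrorWithStepStatus::Log { status, logged_at_failure_site, .. } => {
            // false: the parent logs at ERROR; true: already logged, so WARN.
            let level = if *logged_at_failure_site { "WARN" } else { "ERROR" };
            format!(
                "level={} msg=\"Finished running step with errors\" flow.step.status={:?}",
                level, status
            )
        }
    }
}
```

Whatever string is stored in `message`, the rendered line contains only the level, the wrapper text, and the status.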

3. Display impl also drops it

#[error("status: {status}")] on the Log variant — mando-lib/src/workflow/mod.rs:443. So every downstream stringification (format!("{}", err), anyhow chain, etc.) yields only "status: Error".
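
The following std-only sketch shows what that attribute amounts to. It hand-writes a Display impl equivalent to what thiserror derives from `#[error("status: {status}")]`; the trimmed type definitions and the Display impl for StepStatus are assumptions for illustration.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
pub enum StepStatus {
    Error,
}

impl fmt::Display for StepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepStatus::Error => write!(f, "Error"),
        }
    }
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

// Hand-written equivalent of `#[error("status: {status}")]`: only `status`
// is interpolated, so `message` never reaches the formatted output.
impl fmt::Display for ErrorWithStepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Log { status, .. } => write!(f, "status: {}", status),
        }
    }
}
```

Anything that relies on to_string() or `{}` formatting sees the same truncated text, which is why the loss propagates up to the top-level catch.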

4. status_or_error collapses Log to Ok

At mando-lib/src/workflow/mod.rs:231-239, Log is folded into Ok(status) for group propagation. Whatever message survived to this point is gone.
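
A rough model of that fold is sketched below. The signature of the real status_or_error is not shown in this note, so the free-function form, the Result-based input, and the use of String in place of anyhow::Error are all assumptions; only the "Log becomes Ok(status)" behavior comes from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepStatus {
    Error,
}

pub enum ErrorWithStepStatus {
    // String stands in for anyhow::Error to keep the sketch std-only.
    Error(String),
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

/// Hypothetical shape: folds the Log arm into Ok(status) and keeps the
/// Error arm as Err. The message is dropped with the rest of the variant.
pub fn status_or_error(
    result: Result<StepStatus, ErrorWithStepStatus>,
) -> Result<StepStatus, String> {
    match result {
        Ok(status) => Ok(status),
        Err(ErrorWithStepStatus::Log { status, .. }) => Ok(status),
        Err(ErrorWithStepStatus::Error(e)) => Err(e),
    }
}
```

After the fold, the caller holds a bare status, so there is no value left from which the message could be recovered.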

5. Top-level flow log catches the empty husk

At mando-bess/src/workflow/flow.rs:289, error!(err, message = "Flow execution failed", ...) catches FlowTriggerError::FlowError(ErrorWithStepStatus). Because of (3), err.to_string() is "status: Error". That’s all that reaches Datadog at the flow level.

Important: the Error(anyhow::Error) arm is fine

ErrorWithStepStatus::Error(anyhow::Error) is correctly logged at mando-lib/src/workflow/mod.rs:160-171 via the error! macro — that path preserves error.source, error.kind, error.code. Only the ::Log arm is broken.

For Agents

ErrorWithStepStatus has two failure arms: ::Error(anyhow) is logged correctly; ::Log { message, .. } drops message at every consumer. Diagnose by checking which constructor was used at the step site.

Why Tests Did Not Catch It

Tests at mando-lib/src/workflow/mod.rs:548-605 exist for StepResult::log() and assert on:

  • log level (error vs warn)
  • the generic wrapper string ("Finished running step with errors")

None of them capture the emitted event and assert that message (the actual payload passed into ErrorWithStepStatus::log(...)) appears in the output. That’s how the regression shipped.

Test fix pattern

Reuse the capture harness from BE-2272: install a subscriber with tracing::subscriber::with_default, give it a custom MakeWriter over Mutex<Vec<u8>>, then parse the captured bytes with serde_json::from_slice. Add cases that pass a known-distinctive message into ErrorWithStepStatus::log(...) and assert that exact string appears at a documented field path in the JSON event.
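
The buffer half of such a harness can be sketched with std alone. This is not the BE-2272 harness itself: SharedBuf and emit_step_failure are hypothetical names, the emitter is a stand-in for the code under test, and the wiring into tracing's MakeWriter and the serde_json parse are deliberately left out.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

/// Hypothetical capture buffer: clones share one Vec<u8>, so the test can
/// keep one handle and pass another to whatever writes the log output.
#[derive(Clone, Default)]
pub struct SharedBuf(Arc<Mutex<Vec<u8>>>);

impl Write for SharedBuf {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl SharedBuf {
    /// Returns everything written so far as text.
    pub fn contents(&self) -> String {
        let bytes = self.0.lock().unwrap();
        let text = String::from_utf8_lossy(&bytes).into_owned();
        text
    }
}

/// Stand-in for the code under test: writes one JSON-like line. It does no
/// JSON escaping, so it is only suitable for simple marker strings.
pub fn emit_step_failure(out: &mut impl Write, status: &str, message: &str) -> io::Result<()> {
    writeln!(
        out,
        "{{\"flow.step.status\":\"{}\",\"flow.step.message\":\"{}\"}}",
        status, message
    )
}
```

A regression test built on this shape writes a distinctive marker through the emitter and asserts that the marker appears in contents() under the expected field name. In the real harness the assertion would be made on the parsed JSON event rather than on a substring.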

Cross-reference: possibly already addressed

The historian’s LOG entry for 2026-05-04 lists follow-up commits on feature/BE-2272 that may already touch this code path:

  • c945514e fix: log step errors at the failure site to preserve real error.kind instead of generic wrapper at the catch boundary
  • 4c543cb1 fix: downgraded parent flow error logs to warn when the child step has already logged the error (dedupe DD noise)
  • 567373a5, 63d6fa69, 8f9297e4

c945514e in particular sounds adjacent: “log at the failure site to preserve real error.kind”. Before patching, diff these commits against the current develop head to confirm whether the Log arm’s message is still being dropped or whether the failure-site logging path has been promoted to also emit message.

If those commits only fixed the Error(anyhow) arm and left Log { message } untouched, the bug stands.

Secondary Finding

At mando-lib/src/app/dd_formatter.rs:122-124, record_error uses value.to_string(), which captures only the Display output and not the .source() chain. This is a niche path, though: the error! macro emits errors as error.source = %__err, which lands in record_str, not record_error. It is not the cause of the message drop, but it is worth fixing alongside.
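
The difference between Display-only capture and walking the source chain can be shown with std::error::Error alone. The error types and the render_chain helper below are hypothetical and exist only for this illustration; they are not taken from dd_formatter.rs.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical two-level error used only to demonstrate the chain.
#[derive(Debug)]
pub struct Inner;

impl fmt::Display for Inner {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connection refused")
    }
}

impl Error for Inner {}

#[derive(Debug)]
pub struct Outer {
    pub source: Inner,
}

impl fmt::Display for Outer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to load config")
    }
}

impl Error for Outer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Joins the top-level message with every cause reachable via `.source()`.
pub fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Calling to_string() on Outer yields only the top-level text, while render_chain also includes the cause. A record_error that wants the full chain would need something along these lines.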

Fix Sketch

Minimum change to stop the bleed:

  1. StepResult::log() — destructure message instead of ..; pass it as a structured field on the error! / warn! macro call:
    ErrorWithStepStatus::Log { status, message, logged_at_failure_site } => {
        if logged_at_failure_site {
            warn!(flow.step.status = %status, flow.step.message = %message, ...);
        } else {
            error!(flow.step.status = %status, flow.step.message = %message, ...);
        }
    }
  2. Display for the Log variant — promote to #[error("status: {status}: {message}")] so anyhow chains and top-level catches preserve it too.
  3. status_or_error — when collapsing Log to Ok(status) for group propagation, either keep a side-channel or log message before discarding.
  4. Add regression tests that assert message payload presence at a documented JSON path (see harness pattern above).
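
Steps 1 and 2 can be modelled together in a std-only sketch. As before, this is a simplified stand-in rather than the real code: the hypothetical render_step_log returns the log line as a String instead of calling the tracing macros, the Display impl is written by hand in place of the thiserror attribute, and flow.step.message is the field name proposed above, not a settled choice.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
pub enum StepStatus {
    Error,
}

impl fmt::Display for StepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepStatus::Error => write!(f, "Error"),
        }
    }
}

#[derive(Debug)]
pub enum ErrorWithStepStatus {
    Log {
        status: StepStatus,
        message: String,
        logged_at_failure_site: bool,
    },
}

/// Step 1 (modelled): bind `message` instead of `..` and emit it as a field.
pub fn render_step_log(err: &ErrorWithStepStatus) -> String {
    match err {
        ErrorWithStepStatus::Log { status, message, logged_at_failure_site } => {
            let level = if *logged_at_failure_site { "WARN" } else { "ERROR" };
            format!(
                "level={} msg=\"Finished running step with errors\" flow.step.status={} flow.step.message=\"{}\"",
                level, status, message
            )
        }
    }
}

// Step 2 (modelled): hand-written equivalent of
// `#[error("status: {status}: {message}")]`.
impl fmt::Display for ErrorWithStepStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Log { status, message, .. } => write!(f, "status: {}: {}", status, message),
        }
    }
}
```

With both changes in place, the marker string survives in the rendered log line and in every stringification of the error.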

How to Verify

  1. Pick any flow step that returns ErrorWithStepStatus::log(StepStatus::Error, "VERIFY_MARKER_<random>") (or temporarily wrap an existing step).
  2. Trigger the flow (locally, dev, or prod — choose appropriately).
  3. grep VERIFY_MARKER_ in the logs (stdout / Datadog / wherever).
  4. Expected (current broken state): zero matches; logs contain only "Finished running step with errors" + flow.step.status=Error.
  5. Expected (after fix): the marker string appears at flow.step.message (or whatever field name is chosen).

File:Line Index

| Concern | Location |
| --- | --- |
| Constructor ErrorWithStepStatus::log | mando-lib/src/workflow/mod.rs:472-478 |
| Display for Log variant | mando-lib/src/workflow/mod.rs:443 |
| StepResult::log() destructure drops message | mando-lib/src/workflow/mod.rs:155-159 |
| StepResult::log() emit site (no message) | mando-lib/src/workflow/mod.rs:196-225 |
| Error(anyhow) arm (works correctly) | mando-lib/src/workflow/mod.rs:160-171 |
| status_or_error drops message | mando-lib/src/workflow/mod.rs:231-239 |
| Top-level flow catch | mando-bess/src/workflow/flow.rs:289 |
| Insufficient tests | mando-lib/src/workflow/mod.rs:548-605 |
| Secondary record_error Display-only | mando-lib/src/app/dd_formatter.rs:122-124 |

Related

  • BE-2272 — DD log field flattening, capture-harness pattern reusable for regression tests here
  • BE-1842 Datadog Observability — observability arc this regression undermines
  • mando-lib — crate containing the broken code path
  • mando-bess — top-level flow catch site
  • LOG — see 2026-05-04 entry for c945514e / 4c543cb1 cross-check commits
  • Agent Context