Metrics and tracing
Surefire emits metrics via System.Diagnostics.Metrics and traces via System.Diagnostics.ActivitySource. Wire them into OpenTelemetry using SurefireDiagnostics.MeterName and SurefireDiagnostics.ActivitySourceName.
Install the OpenTelemetry packages:
```sh
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```

Wire up the meter and activity source in your host builder:
```csharp
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        // Subscribe to Surefire's meter and export its instruments over OTLP.
        metrics.AddMeter(SurefireDiagnostics.MeterName);
        metrics.AddOtlpExporter();
    })
    .WithTracing(tracing =>
    {
        // Subscribe to Surefire's activity source and export its spans over OTLP.
        tracing.AddSource(SurefireDiagnostics.ActivitySourceName);
        tracing.AddOtlpExporter();
    });

builder.Services.AddSurefire();
```

Instruments

| Instrument | Type | Unit | Tags | Description |
|---|---|---|---|---|
| surefire.runs.claimed | Counter | | surefire.job.name | Runs claimed by workers |
| surefire.runs.completed | Counter | | surefire.job.name | Runs completed successfully |
| surefire.runs.failed | Counter | | surefire.job.name, surefire.dead_letter.reason | Runs that reached the Failed terminal state. Reason is one of retries_exhausted, no_handler_registered, shutdown_interrupted, stale_recovery |
| surefire.runs.canceled | Counter | | surefire.job.name | Runs canceled |
| surefire.runs.duration.ms | Histogram | ms | surefire.job.name | Time from claim to terminal transition |
| surefire.scheduler.lag.ms | Histogram | ms | surefire.job.name | Time between a run’s NotBefore and when it was actually claimed. Growing values mean the cluster is undersized |
| surefire.store.operation.ms | Histogram | ms | surefire.store.operation | Store operation duration |
| surefire.store.operation.failed | Counter | | surefire.store.operation | Failed store operations |
| surefire.store.retries | Counter | | surefire.service | Transient store failure retries |
| surefire.loop.errors | Counter | | surefire.loop | Background loop tick failures (executor, maintenance, scheduler, retention) |
| surefire.log_entries.dropped | Counter | | surefire.drop.reason | Log entries dropped before store flush |
| surefire.durable.suspended | Counter | | surefire.job.name | Durable orchestrator attempts that yielded and parked in Suspended |
| surefire.durable.instant_resume | Counter | | surefire.job.name | Durable yields where every awaited entity was already terminal, so the store routed them straight back to Pending. A sustained rate points to a handler yielding without making progress |
| surefire.durable.stale_recovered | Counter | | surefire.job.name | Durable runs that were re-queued for replay after a host crashed mid-execution |
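
For a quick local check without an OTLP collector, the instruments can also be read in-process with `System.Diagnostics.Metrics.MeterListener`. The following is a minimal sketch, not Surefire's own tooling: it tallies `surefire.runs.failed` by the `surefire.dead_letter.reason` tag. The meter name `Surefire.Sketch` and the simulated emitter at the bottom are hypothetical stand-ins so the sample runs on its own (in a real host you would compare against `SurefireDiagnostics.MeterName` and drop the emitter), and it assumes the counter records `long` values, which the table does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for SurefireDiagnostics.MeterName.
const string meterName = "Surefire.Sketch";

// Running total of failed runs, keyed by dead-letter reason.
var failedByReason = new Dictionary<string, long>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to the failed-runs counter on the chosen meter.
    if (instrument.Meter.Name == meterName && instrument.Name == "surefire.runs.failed")
        l.EnableMeasurementEvents(instrument);
};

// Assumption: the counter is a Counter<long>. A callback registered for long
// only receives measurements from long-typed instruments.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    string reason = "unknown";
    foreach (var tag in tags)
    {
        if (tag.Key == "surefire.dead_letter.reason")
            reason = tag.Value?.ToString() ?? "unknown";
    }
    failedByReason[reason] = failedByReason.GetValueOrDefault(reason) + value;
});
listener.Start();

// Simulated emitter standing in for Surefire, only so the sketch is runnable.
using var meter = new Meter(meterName);
var failed = meter.CreateCounter<long>("surefire.runs.failed");
failed.Add(1,
    new KeyValuePair<string, object?>("surefire.job.name", "send-email"),
    new KeyValuePair<string, object?>("surefire.dead_letter.reason", "retries_exhausted"));
failed.Add(1,
    new KeyValuePair<string, object?>("surefire.job.name", "send-email"),
    new KeyValuePair<string, object?>("surefire.dead_letter.reason", "retries_exhausted"));
failed.Add(1,
    new KeyValuePair<string, object?>("surefire.job.name", "resize-image"),
    new KeyValuePair<string, object?>("surefire.dead_letter.reason", "stale_recovery"));

foreach (var (reason, count) in failedByReason)
    Console.WriteLine($"{reason}: {count}");
```

Counter measurements reach the listener synchronously on the emitting thread, so the callback should stay cheap. A dictionary shared across worker threads would need locking or a `ConcurrentDictionary`, which this single-threaded sketch omits.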
Traces
The activity source creates surefire.run.execute spans with these tags:

| Tag | Description |
|---|---|
| surefire.run.id | The run ID |
| surefire.run.job | The job name |
| surefire.run.attempt | Attempt number |
| surefire.run.parent | Parent run ID (if any) |
| surefire.job.timeout | true when the attempt was canceled by WithTimeout |
Failed runs set the span status to Error with the exception message.
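
The spans can likewise be inspected in-process with `System.Diagnostics.ActivityListener`, which is handy in tests. The sketch below collects failed `surefire.run.execute` spans along with their run ID and attempt. The source name `Surefire.Sketch` and the simulated span are hypothetical stand-ins (a real host would match on `SurefireDiagnostics.ActivitySourceName`), and the sketch assumes the error status is exposed through `Activity.Status` and `Activity.StatusDescription`, which the description above suggests but does not spell out.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for SurefireDiagnostics.ActivitySourceName.
const string sourceName = "Surefire.Sketch";

var failedRuns = new List<string>();

using var listener = new ActivityListener
{
    // Listen only to the chosen source and record every span it starts.
    ShouldListenTo = s => s.Name == sourceName,
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = stopped =>
    {
        if (stopped.OperationName != "surefire.run.execute") return;
        // Assumption: failures surface as ActivityStatusCode.Error.
        if (stopped.Status != ActivityStatusCode.Error) return;

        var runId = stopped.GetTagItem("surefire.run.id");
        var attempt = stopped.GetTagItem("surefire.run.attempt");
        failedRuns.Add($"{runId} attempt {attempt}: {stopped.StatusDescription}");
    }
};
ActivitySource.AddActivityListener(listener);

// Simulated span standing in for Surefire, only so the sketch is runnable.
using var emitter = new ActivitySource(sourceName);
using (var span = emitter.StartActivity("surefire.run.execute"))
{
    span?.SetTag("surefire.run.id", "run-1");
    span?.SetTag("surefire.run.job", "send-email");
    span?.SetTag("surefire.run.attempt", 2);
    span?.SetStatus(ActivityStatusCode.Error, "SMTP connection refused");
}

foreach (var line in failedRuns)
    Console.WriteLine(line); // prints "run-1 attempt 2: SMTP connection refused"
```

`StartActivity` returns null when no listener samples the span, which is why the emitter uses `?.` throughout. The same applies to any code that adds its own tags to the current activity inside a job handler.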