FOR SREs

Stop firefighting deployments.
Start preventing incidents.

64% of outages are caused by changes. Legible catches the unsafe ones before they reach production, using the production telemetry you're already collecting. Fewer 3 AM pages. More confident deploys.

64%
of outages from changes
Uptime Institute
$2-10M
annual change-related losses
Mid-market average
72hr
average MTTR for change incidents
Industry benchmark
3:47 AM
average page time
Your on-call knows
Incident Prevention

What Legible catches before you get paged

Every one of these scenarios has caused a real outage. Legible detects all of them before the deploy reaches production.

Missing Critical Workflow Step
Without Legible

A deploy removes the fraud-check call from the checkout flow. Tests pass because the test suite mocks the fraud service. Canary looks healthy because fraud checks don't affect latency metrics.

With Legible

Legible detects that a REQUIRED node (fraud-check) is absent from the post-deployment workflow graph. This is an invariant violation — no amount of healthy metrics can override it.

VIOLATION → ESCALATE
Prevented: $200K+ in fraudulent transactions per hour of exposure
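The rule is blunt by design. Here's a minimal sketch of the check, assuming a simple set-of-nodes graph model — the types, names, and verdict strings are illustrative, not Legible's actual API:

```python
# Minimal sketch of the REQUIRED-node invariant check.
# The node names, graph shape, and verdict strings are
# illustrative assumptions, not Legible's actual API.

REQUIRED_NODES = {"fraud-check"}  # marked REQUIRED in the baseline

def check_required_nodes(observed_nodes: set[str]) -> str:
    """Escalate if any REQUIRED node is absent from the post-deploy graph."""
    missing = REQUIRED_NODES - observed_nodes
    if missing:
        # Invariant violations are absolute: healthy latency and
        # error-rate metrics cannot override a missing REQUIRED node.
        return f"VIOLATION -> ESCALATE (missing: {sorted(missing)})"
    return "PASS"

# Post-deploy workflow graph observed from traces; fraud-check is gone.
print(check_required_nodes({"api-gw", "payment", "bank", "notification"}))
# -> VIOLATION -> ESCALATE (missing: ['fraud-check'])
```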
Production Truth

Behavioral baselines built from real production

No synthetic tests. No policy guesswork. Just observed reality from your existing OpenTelemetry traces.

Workflow Baseline: checkout-flow v47
Structure
7 nodes, 12 edges
api-gw → payment → bank
fraud-check (REQUIRED)
notification via event-bus
Distribution
payment-direct: 82%
payment-fallback: 18%
fraud-bypass: 0% (FORBIDDEN)
Stable ±3% over 30 days
Retry Profile
payment→bank: 1.2x avg
retry ceiling: 5
circuit-breaker at 3x
Retry rate stable ±0.1x
Latency
p50: 145ms
p95: 380ms
p99: 890ms
Hard max: 3000ms
Provenance: Promoted from classification vrf_7k2mP8 (MATCHED_INTENDED, HIGH confidence) · 48,000 traces · 14 consecutive stable windows
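To make the card concrete, here's how a drift check against this baseline might look, assuming a simple share-plus-tolerance schema. The numbers come from the card above; the field names and function are illustrative, not Legible's data model:

```python
# Illustrative drift check against the checkout-flow v47 baseline.
# The schema (share / tolerance / forbidden) is an assumption for
# this sketch, not Legible's actual baseline format.

BASELINE_PATHS = {
    "payment-direct":   {"share": 0.82, "tolerance": 0.03},
    "payment-fallback": {"share": 0.18, "tolerance": 0.03},
    "fraud-bypass":     {"share": 0.00, "forbidden": True},
}

def check_paths(observed: dict[str, float]) -> list[str]:
    """Compare observed path shares to the baseline card above."""
    findings = []
    for path, spec in BASELINE_PATHS.items():
        share = observed.get(path, 0.0)
        if spec.get("forbidden") and share > 0:
            # FORBIDDEN paths are invariants, not tolerance bands.
            findings.append(f"{path}: FORBIDDEN path observed at {share:.1%}")
        elif abs(share - spec["share"]) > spec.get("tolerance", 0.0):
            findings.append(f"{path}: {share:.1%} vs baseline {spec['share']:.0%}")
    return findings

# A deploy that shifts traffic and opens a forbidden path:
print(check_paths({"payment-direct": 0.74,
                   "payment-fallback": 0.25,
                   "fraud-bypass": 0.01}))
```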
On-Call Impact

What this means for your rotation

Without Legible
Paged at 3 AM because a deploy broke checkout
Spend 45 min figuring out which deploy caused it
Roll back, hope it fixes it
Post-incident: "we need more tests"
Next quarter: same thing happens again
With Legible
Deploy blocked at 2 PM with clear evidence why
Developer sees exactly which behavioral change is unexplained
Fix before it reaches production
Post-incident: "we caught it in the pipeline"
Baseline gets smarter with every deployment
↓ 70%
change-related incidents
↓ 85%
MTTR for change failures
↓ 60%
on-call pages from deploys

Projected impact based on design partner analysis. Results depend on deployment volume and system complexity.

Hotfix Governance

Emergency deploys stay governed

When you skip staging for a hotfix, Legible enters Restricted Authority Mode. Less evidence means less automated trust, not no governance; the restrictions below show what that means in practice, with a sketch after the list.

Restricted Authority Mode
🔒 Boundary confidence capped at LOW
🚫 Auto-promotion disabled
⏱️ Envelope TTL reduced 50%
👁️ All structural changes: mandatory review
📊 Monitoring window doubled
🔢 Hotfix frequency cap: 3 per team per 30 days
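One way to picture the mode is as configuration derived from normal operation. The type, field names, and normal-mode values below are assumptions for this sketch, not Legible's actual configuration surface:

```python
# Restricted Authority Mode as derived config. The AuthorityMode
# type, field names, and normal-mode values are assumptions for
# this sketch, not Legible's actual configuration.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AuthorityMode:
    confidence_cap: str           # max boundary confidence assignable
    auto_promotion: bool          # may baselines promote without review?
    envelope_ttl_hours: int       # how long a change envelope stays valid
    structural_review: bool       # do structural changes need a human?
    monitoring_window_hours: int  # post-deploy observation window
    hotfix_cap_per_30d: int       # per-team hotfix budget (0 = uncapped)

NORMAL = AuthorityMode("HIGH", True, 48, False, 24, 0)

def restricted(normal: AuthorityMode) -> AuthorityMode:
    """Derive hotfix-mode restrictions from the normal-mode config."""
    return replace(
        normal,
        confidence_cap="LOW",     # capped regardless of evidence
        auto_promotion=False,     # disabled outright
        envelope_ttl_hours=normal.envelope_ttl_hours // 2,           # halved
        structural_review=True,   # mandatory review
        monitoring_window_hours=normal.monitoring_window_hours * 2,  # doubled
        hotfix_cap_per_30d=3,     # frequency cap per team
    )
```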
Post-Hotfix Reconciliation

When the hotfix later goes through normal staging, Legible automatically reconciles the record (sketched after this list):

Original hotfix envelope updated with stage SBD
Unexplained changes from hotfix window reclassified
Unresolved changes flagged as HOTFIX_RESIDUAL
Governance debt tracked and surfaced
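Here's a sketch of what that reconciliation could look like, assuming dict-shaped envelopes and a list of change records. The shapes and status names are assumptions drawn from the steps above, not Legible's data model:

```python
# Illustrative post-hotfix reconciliation. The envelope/SBD shapes
# and status names are assumptions based on the steps above, not
# Legible's actual data model.

def reconcile(envelope: dict, stage_sbd: dict) -> dict:
    # 1. Update the original hotfix envelope with the stage SBD.
    envelope["stage_sbd"] = stage_sbd

    # 2-3. Reclassify changes from the hotfix window; anything staging
    # still can't explain is flagged as HOTFIX_RESIDUAL.
    explained = set(stage_sbd.get("explained_change_ids", []))
    for change in envelope.get("unexplained_changes", []):
        change["status"] = ("EXPLAINED" if change["id"] in explained
                            else "HOTFIX_RESIDUAL")

    # 4. Track and surface the remaining governance debt.
    envelope["governance_debt"] = [
        c["id"] for c in envelope.get("unexplained_changes", [])
        if c["status"] == "HOTFIX_RESIDUAL"
    ]
    return envelope
```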
Hotfix mode should feel heavier than staging, not lighter.

Your next outage is preventable.

If you're tired of being paged for problems that could have been caught in the pipeline, let's talk.

Talk to us →