flowchart LR
subgraph "one_for_one"
A1[A] ~~~ B1["B (failed)"] ~~~ C1[C]
B1 -->|restart| B1r["B (new)"]
end
Supervision and Reliability
Bus timeout, fire-and-forget, dead letter, and Erlang-style supervision trees.
Reading order: Threading Model -> You are here -> Bridging External Systems. For API reference, see Kernel API.
This page explains the reliability features added in spec 010. These were motivated by comparing the actor model (Pykka) against our reactive kernel. The conclusion: we don’t need actors, but we need the specific reliability patterns that make actor systems robust.
Bus reliability
Dead letter channel
When a bus event dispatch fails (subscriber error, no handler), the failure is recorded on a reserved event channel called __dead_letter__. Any component can subscribe to it:
@lifecycle.activate
def activate(self):
    self.rt.on("__dead_letter__", self._on_failure)

def _on_failure(self, event_type, data):
    self.rt.logger.error(
        "Dead letter: target=%s reason=%s error=%s",
        data["target"], data["reason"], data.get("error"),
    )

Each envelope contains event_type, data, reason, error, and timestamp. Reasons: no_handler, handler_error. See Kernel API: Dead Letter Channel for the full shape.
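To make the routing concrete, here is a minimal, self-contained sketch of how a bus could divert failed dispatches to __dead_letter__. MiniBus and its methods are hypothetical stand-ins, not the kernel's actual bus; only the channel name, the envelope fields, and the two reasons are taken from the description above.

```python
import time
from collections import defaultdict

DEAD_LETTER = "__dead_letter__"


class MiniBus:
    """Hypothetical toy bus for illustration; the real kernel bus may differ."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def emit(self, event_type, data):
        handlers = list(self._handlers.get(event_type, ()))
        if not handlers:
            self._dead_letter(event_type, data, "no_handler", None)
            return
        for handler in handlers:
            try:
                handler(event_type, data)
            except Exception as exc:  # one failing subscriber must not stop the others
                self._dead_letter(event_type, data, "handler_error", exc)

    def _dead_letter(self, event_type, data, reason, error):
        if event_type == DEAD_LETTER:
            return  # never dead-letter the dead-letter channel itself
        envelope = {
            "event_type": event_type,
            "data": data,
            "reason": reason,
            "error": repr(error) if error is not None else None,
            "timestamp": time.time(),
        }
        for handler in list(self._handlers.get(DEAD_LETTER, ())):
            handler(DEAD_LETTER, envelope)


bus = MiniBus()
failures = []
bus.on(DEAD_LETTER, lambda event_type, envelope: failures.append(envelope))


def broken_handler(event_type, data):
    raise RuntimeError("boom")


bus.on("order.paid", broken_handler)
bus.emit("order.created", {"id": 1})  # no subscriber -> reason "no_handler"
bus.emit("order.paid", {"id": 1})     # subscriber raises -> reason "handler_error"
print([f["reason"] for f in failures])
```

The guard in _dead_letter is a design choice of this sketch: without it, an unsubscribed dead-letter channel would feed failures back into itself.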
Supervision trees
The problem
When a child component fails to activate (missing config, broken dependency, network error), the kernel marks it ERRORED and moves on. Recovery requires manual intervention: fix the issue, then call kernel.retry_erroneous("name").
For a system that should self-heal, this is not enough. We need automatic restart with backoff, and we need a parent component to decide the restart strategy.
The idea: Erlang/OTP supervision
Erlang/OTP systems are famous for reliability. The key idea: organize components into trees where parent components supervise their children. When a child fails, the parent decides what to do based on a strategy.
We grafted this idea onto our reactive lifecycle without changing the concurrency model.
Declaring a supervisor
A supervisor is any component that:
- Spawns children via self.rt.spawn()
- Has a @lifecycle.supervision callback
@component("my-supervisor", version="1.0")
class MySupervisor:
    @lifecycle.activate
    async def activate(self):
        await self.rt.spawn("worker-a")
        await self.rt.spawn("worker-b")
        await self.rt.spawn("worker-c")

    @lifecycle.supervision(
        strategy="one_for_one",
        max_restarts=3,
        within_seconds=60,
        backoff="exponential",
        base_delay=2.0,
    )
    async def on_child_failure(self, child_name, error, attempt, context):
        self.rt.logger.warning(
            "Child %s failed (attempt %d): %s",
            child_name, attempt, error,
        )
        return True  # proceed with restart

The three strategies
one_for_one — restart only the failed child. A and C are untouched. Use this when children are independent.
flowchart LR
subgraph "one_for_all"
A2["A (restarted)"] ~~~ B2["B (failed, restarted)"] ~~~ C2["C (restarted)"]
end
one_for_all — restart ALL children. Use this when children are interdependent (e.g., they share state and a partial restart would be inconsistent).
flowchart LR
subgraph "rest_for_one"
A3[A] ~~~ B3["B (failed, restarted)"] ~~~ C3["C (restarted)"]
end
rest_for_one — restart the failed child and everything started after it. A is untouched. Use this when later children depend on earlier ones (e.g., C reads from B’s output).
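The three strategies differ only in which children get restarted. The sketch below expresses that selection as a plain function. children_to_restart is a hypothetical helper, not part of the kernel API, and it assumes that "started after" means the order in which the supervisor spawned its children.

```python
def children_to_restart(strategy, children, failed):
    """Return the children a supervisor would restart, in spawn order.

    `children` lists child names in the order they were spawned
    (an assumption of this sketch); `failed` is the child that failed.
    """
    index = children.index(failed)  # raises ValueError for an unknown child
    if strategy == "one_for_one":
        return [failed]
    if strategy == "one_for_all":
        return list(children)
    if strategy == "rest_for_one":
        return children[index:]
    raise ValueError(f"unknown strategy: {strategy!r}")


children = ["A", "B", "C"]
for strategy in ("one_for_one", "one_for_all", "rest_for_one"):
    print(strategy, children_to_restart(strategy, children, "B"))
# one_for_one ['B']
# one_for_all ['A', 'B', 'C']
# rest_for_one ['B', 'C']
```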
Lifecycle state machine with supervision
stateDiagram-v2
[*] --> DISCOVERED
DISCOVERED --> RESOLVED
RESOLVED --> ACTIVATING
ACTIVATING --> ACTIVE : success
ACTIVATING --> ERRORED : failure
ERRORED --> RESTARTING : supervisor decides to retry
RESTARTING --> RESOLVED : backoff delay complete
ERRORED --> RESOLVED : manual retry_erroneous()
ACTIVE --> DEACTIVATING
DEACTIVATING --> STOPPED
RESTARTING is a new state. It means “the supervisor has decided to restart this component, and we are waiting for the backoff delay.”
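The diagram can also be read as a transition table. The following sketch encodes exactly the edges shown above and checks a path against them. TRANSITIONS, can_transition, and walk are illustrative names, and the kernel's internal representation may look different.

```python
# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    "DISCOVERED": {"RESOLVED"},
    "RESOLVED": {"ACTIVATING"},
    "ACTIVATING": {"ACTIVE", "ERRORED"},
    "ERRORED": {"RESTARTING", "RESOLVED"},  # supervised retry or manual retry
    "RESTARTING": {"RESOLVED"},             # after the backoff delay
    "ACTIVE": {"DEACTIVATING"},
    "DEACTIVATING": {"STOPPED"},
    "STOPPED": set(),
}


def can_transition(current, target):
    return target in TRANSITIONS.get(current, set())


def walk(path):
    """Validate a sequence of states and return the final one."""
    for current, target in zip(path, path[1:]):
        if not can_transition(current, target):
            raise ValueError(f"illegal transition: {current} -> {target}")
    return path[-1]


# A component that fails once, is restarted by its supervisor, then activates.
print(walk([
    "DISCOVERED", "RESOLVED", "ACTIVATING", "ERRORED",
    "RESTARTING", "RESOLVED", "ACTIVATING", "ACTIVE",
]))
```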
Backoff strategies
| Strategy | Delay formula | Example (base=2s) |
|---|---|---|
| constant | base | 2s, 2s, 2s, … |
| linear | base * attempt | 2s, 4s, 6s, … |
| exponential | base * 2^(attempt-1) | 2s, 4s, 8s, 16s, … |
All delays are capped at 60 seconds.
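As a worked example, the formulas and the 60-second cap can be written as one small function. backoff_delay is a hypothetical helper that mirrors the table, not the kernel's implementation. It assumes attempts are 1-indexed, as in the supervision callback.

```python
MAX_DELAY_SECONDS = 60.0  # cap stated above


def backoff_delay(strategy, base_delay, attempt):
    """Delay in seconds before restart number `attempt` (1-indexed)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if strategy == "constant":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * attempt
    elif strategy == "exponential":
        delay = base_delay * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown backoff strategy: {strategy!r}")
    return min(delay, MAX_DELAY_SECONDS)


print([backoff_delay("exponential", 2.0, n) for n in range(1, 8)])
# [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With base_delay=2.0 the exponential strategy reaches the cap on the sixth attempt, since 2 * 2^5 = 64 exceeds 60.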
Restart window
Restarts are tracked in a sliding window. If a component is restarted max_restarts times within within_seconds, the supervisor gives up and escalates.
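One way to picture the sliding window is a queue of restart timestamps from which old entries expire. The sketch below is an assumption about the mechanics, not the kernel's code: RestartWindow is a hypothetical class, it allows at most max_restarts restarts per window, and it treats a restart that is exactly within_seconds old as expired.

```python
import time
from collections import deque


class RestartWindow:
    """Hypothetical sliding-window restart budget."""

    def __init__(self, max_restarts, within_seconds, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.within_seconds = within_seconds
        self._clock = clock          # injectable so the window can be tested
        self._restarts = deque()     # timestamps of recent restarts, oldest first

    def allow_restart(self):
        """Record a restart if the budget allows it; False means escalate."""
        now = self._clock()
        # Drop restarts that have slid out of the window.
        while self._restarts and now - self._restarts[0] >= self.within_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False
        self._restarts.append(now)
        return True


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
window = RestartWindow(max_restarts=3, within_seconds=60, clock=lambda: fake_now[0])
print([window.allow_restart() for _ in range(4)])  # [True, True, True, False]
fake_now[0] = 61.0
print(window.allow_restart())                      # True
```

The decorator arguments in the snippet that follows configure this same budget.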
@lifecycle.supervision(
    max_restarts=3,     # at most 3 restarts...
    within_seconds=60,  # ...within any 60-second window
)

Escalation
When a supervisor exceeds its restart limit, it marks itself as ERRORED and notifies its supervisor (if any). This creates a chain:
Root Supervisor
└── Data Pipeline Supervisor ← escalates here if workers keep failing
├── Fetcher Worker
├── Processor Worker ← keeps failing
└── Writer Worker
If the Root Supervisor also can’t handle it, the component stays ERRORED and requires manual intervention.
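The escalation chain can be sketched as a failure report that walks up the tree until some supervisor still has restart budget. SupervisorNode is a hypothetical class written for this illustration. It ignores the time window and backoff, and it leaves out whatever the parent does next (for example, restarting the escalating supervisor).

```python
class SupervisorNode:
    """Hypothetical node in a supervision tree; not the kernel's real class."""

    def __init__(self, name, parent=None, max_restarts=3):
        self.name = name
        self.parent = parent
        self.max_restarts = max_restarts
        self.restarts = 0
        self.state = "ACTIVE"

    def on_child_failure(self, child_name):
        """Return the name of the supervisor that handles the failure, or None."""
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return self.name  # this supervisor restarts the child
        self.state = "ERRORED"  # restart limit exceeded: give up
        if self.parent is None:
            return None  # top of the tree: manual intervention required
        return self.parent.on_child_failure(self.name)  # escalate


root = SupervisorNode("root", max_restarts=1)
pipeline = SupervisorNode("data-pipeline", parent=root, max_restarts=3)

# The Processor Worker keeps failing.
handled_by = [pipeline.on_child_failure("processor") for _ in range(5)]
print(handled_by)
# ['data-pipeline', 'data-pipeline', 'data-pipeline', 'root', None]
print(pipeline.state, root.state)  # ERRORED ERRORED
```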
SupervisionContext
The supervision callback receives a SupervisionContext that lets you inspect and influence the restart:
@lifecycle.supervision(strategy="one_for_one", max_restarts=5)
async def on_child_failure(self, child_name, error, attempt, context):
    # Inspect
    print(context.child_factory)       # factory name
    print(context.attempt)             # which attempt (1-indexed)
    print(context.restarts_in_window)  # total restarts in the window
    print(context.strategy)            # "one_for_one"

    # Influence: change child properties before restart
    if isinstance(error, ConnectionError) and attempt >= 2:
        context.update_properties({"endpoint": "fallback.example.com"})

    return True  # True = restart, False = give up

The supervisable trait
Components with @lifecycle.supervision automatically get the supervisable trait at L1. This shows up in kernel.status().
What supervision does NOT do
- No runtime error supervision. An @effect that throws at runtime is caught by the reactive engine and logged. It does not trigger supervision. Supervision handles activation/startup failures only.
- No automatic health-check restarts. @lifecycle.health reports status but does not trigger restarts.
- No process-level isolation. Everything is in one Python process. For OOM/segfault isolation, use external supervisors (systemd, K8s).