Error Recovery¶

Hibernator includes automatic retry and manual recovery mechanisms for handling execution failures.

Automatic Retries¶

When a runner Job fails, the controller automatically retries with exponential backoff:

Backoff formula: min(60s × 2^attempt, 30m)
Default retries: 3 (configurable via spec.behavior.retries)
Maximum retries: 10

Retry Schedule¶

Attempt	Delay
1	60 seconds
2	120 seconds
3	240 seconds
4	480 seconds
...	...
Max	30 minutes

Configure Retries¶

behavior:
  retries: 5    # 0 to disable, max 10

Error Classification¶

The controller classifies errors as:

Type	Behavior	Examples
Transient	Automatic retry	Network timeout, API throttling, temporary unavailability
Permanent	No retry, plan enters Error phase	Invalid credentials, missing resource, permission denied

Manual Recovery¶

When automatic retries are exhausted or a permanent error occurs, the plan enters the Error phase.

Retry via Annotation¶

Trigger a manual retry:

kubectl annotate hibernateplan dev-offhours -n hibernator-system \
  hibernator.ardikabs.com/retry-now=true

This resets the retry counter and immediately attempts the failed operation.

Check Error Details¶

# View the error message
kubectl get hibernateplan dev-offhours -n hibernator-system \
  -o jsonpath='{.status.errorMessage}'

# Check retry count
kubectl get hibernateplan dev-offhours -n hibernator-system \
  -o jsonpath='{.status.retryCount}'

# View per-target execution state
kubectl get hibernateplan dev-offhours -n hibernator-system \
  -o jsonpath='{.status.executions}' | jq '.[] | select(.state == "Failed")'

View Runner Logs¶

# Find the failed job
kubectl get jobs -n hibernator-system -l hibernator.ardikabs.com/plan=dev-offhours

# Check the pod logs
kubectl logs job/hibernate-runner-dev-offhours-dev-database -n hibernator-system

BestEffort Mode¶

With behavior.mode: BestEffort, the plan continues executing other targets even when some fail:

behavior:
  mode: BestEffort
  failFast: false
  retries: 3

Failed targets are retried independently
Downstream DAG dependents of failed targets are marked Aborted
The plan enters Error only if all retries are exhausted and the failure affects overall completion

Recovery Checklist¶

When a plan is stuck in Error phase:

Check the error message: kubectl get hibernateplan <name> -o jsonpath='{.status.errorMessage}'
Check runner logs: Look at the failed Job's pod logs
Fix the root cause: Credentials, permissions, resource state, etc.
Retry: Apply the retry-now annotation
Monitor: Watch the plan phase until it transitions back to a healthy state