Executors¶
Executors are the hands of the Hibernator operator. While the control plane (brain) decides when and in what order to act, executors know how to shut down and wake up a specific type of resource.
Executor Contract¶
Every executor implements three operations:
| Operation | Purpose |
|---|---|
| Validate | Verify parameters and connectivity before execution |
| Shutdown | Stop or scale-down the resource, capturing restore metadata |
| WakeUp | Restore the resource to its pre-hibernation state using saved metadata |
Executors own idempotency — calling Shutdown on an already-stopped resource or WakeUp on an already-running resource must succeed without side effects.
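The contract above can be sketched in a few lines. This is a minimal illustration, not Hibernator's actual interface: the `Executor` protocol, the `FakeInstanceExecutor` class, and the snake_case method names are hypothetical, and the "resource" is an in-memory flag rather than a real cloud API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Executor(Protocol):
    """Hypothetical Python rendering of the three-operation contract."""

    def validate(self) -> None: ...
    def shutdown(self) -> dict: ...
    def wake_up(self, restore: dict) -> None: ...


@dataclass
class FakeInstanceExecutor:
    """In-memory stand-in for one stoppable resource (illustrative only)."""

    name: str
    running: bool = True
    api_calls: list = field(default_factory=list)

    def validate(self) -> None:
        if not self.name:
            raise ValueError("name is required")

    def shutdown(self) -> dict:
        restore = {"wasRunning": self.running}
        if self.running:  # act only when there is something to stop
            self.api_calls.append("stop")
            self.running = False
        return restore

    def wake_up(self, restore: dict) -> None:
        # Start only if the resource was running before and is stopped now.
        if restore.get("wasRunning") and not self.running:
            self.api_calls.append("start")
            self.running = True


ex = FakeInstanceExecutor("app-server")
ex.validate()
saved = ex.shutdown()
ex.shutdown()        # already stopped: no second "stop" call
ex.wake_up(saved)
ex.wake_up(saved)    # already running: no second "start" call
print(ex.api_calls)  # ['stop', 'start']
```

Repeating either operation leaves the call log unchanged, which is the idempotency property the contract asks for.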
Intent Preservation Contract¶
Hibernator implements a first-capture-wins intent preservation strategy to handle retries and partial failures during shutdown operations.
Demanded State¶
Each executor defines a demanded state — the condition that qualifies a resource for hibernator management:
| Executor | Demanded State | Intent Field |
|---|---|---|
| EC2 | Instance state is `running` | `wasRunning` |
| EKS | Node group `desiredSize` > 0 | `wasScaled` |
| RDS | DB instance/cluster status is `available` | `wasRunning` |
| Karpenter | NodePool exists | (full spec captured) |
| WorkloadScaler | Workload replicas > 0 | `wasScaled` |
Only resources in their demanded state are captured and managed by hibernator. Resources not in demanded state are observed passively.
First-Capture-Wins Semantics¶
When a resource is first captured during a hibernation cycle:
- Intent is locked: The `wasRunning` or `wasScaled` value is preserved indefinitely
- Cycle tracking: A `managedByCycleIDs` map tracks which cycle first captured each resource
- Session preservation: On subsequent hibernation attempts with the same cycle ID (retries), the original intent is preserved even if the resource's current state has changed
- Fresh session: A different cycle ID starts fresh; only resources currently in demanded state are tracked
This ensures that:
- Retry safety: If shutdown fails and the user retries with the same cycle ID, the original intent is preserved
- Consistency: Once Hibernator decides a resource should be managed, that decision persists until successful wakeup (within the same session)
- Clean slate: New hibernation operations (different cycle ID) get fresh tracking without stale data
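The semantics above can be sketched as a small merge function. This is an illustration under stated assumptions, not the operator's code: the `capture` function and its arguments are hypothetical, resources use the EC2-style `wasRunning` field, and dropping an old entry when a new cycle sees the resource outside its demanded state is an assumed reading of "not tracked".

```python
def capture(restore, managed_by_cycle, cycle_id, observed):
    """Merge observed states into restore data; the first capture wins.

    restore:          resource id -> {"wasRunning": bool}
    managed_by_cycle: resource id -> cycle ID that first captured it
    observed:         resource id -> True if currently in demanded state
    """
    for rid, in_demanded_state in observed.items():
        if managed_by_cycle.get(rid) == cycle_id:
            continue  # retry of the same cycle: original intent stays locked
        if in_demanded_state:
            # First capture in this cycle (fresh or new cycle ID).
            restore[rid] = {"wasRunning": True}
            managed_by_cycle[rid] = cycle_id
        else:
            # Assumption: a new cycle does not track resources that are
            # outside their demanded state, so any old entry is dropped.
            restore.pop(rid, None)
            managed_by_cycle.pop(rid, None)
    return restore


restore, managed = {}, {}
capture(restore, managed, "hib-001", {"i-1": True})   # first capture
capture(restore, managed, "hib-001", {"i-1": False})  # retry: intent preserved
print(restore)  # {'i-1': {'wasRunning': True}}
capture(restore, managed, "hib-002", {"i-1": False})  # new cycle: not tracked
print(restore)  # {}
```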
Stale Resource Eviction¶
If a resource is not reported for 3 consecutive hibernation cycles:
- The resource is evicted from restore data
- The resource is removed from `managedByCycleIDs` tracking
- On the next hibernation, the resource can be freshly captured
This prevents permanently retaining data for deleted or unmanageable resources while allowing temporary absences (e.g., API failures) without data loss.
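A minimal sketch of the eviction rule, assuming a per-resource counter like the `staleCount` field shown in the restore data structure. The function name `end_of_cycle` and the way the counter is reset are assumptions made for illustration.

```python
STALE_LIMIT = 3  # from the text: evict after 3 consecutive unreported cycles


def end_of_cycle(restore, status, managed_by_cycle, reported):
    """Update stale counters after one hibernation cycle and evict stale entries."""
    for rid in list(restore):
        entry = status.setdefault(rid, {"staleCount": 0})
        if rid in reported:
            entry["staleCount"] = 0  # seen again: the absence was temporary
            continue
        entry["staleCount"] += 1
        if entry["staleCount"] >= STALE_LIMIT:
            del restore[rid]
            del status[rid]
            managed_by_cycle.pop(rid, None)
    return restore


restore = {"app-nodes": {"desired": 3}, "old-nodes": {"desired": 1}}
status = {}
managed = {"app-nodes": "cycle-001", "old-nodes": "cycle-001"}

for _ in range(2):
    end_of_cycle(restore, status, managed, reported={"app-nodes"})
print(sorted(restore))  # ['app-nodes', 'old-nodes']

end_of_cycle(restore, status, managed, reported={"app-nodes"})
print(sorted(restore))  # ['app-nodes']
```

Two missed cycles leave the entry in place; only the third consecutive miss removes it.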
Note
The managedByCycleIDs tracking is stored separately from resource state and is not visible in the resource data itself. It is used internally for idempotency and session management.
Edge Case Handling¶
Resource state changes between hibernation and wakeup:
- If a resource is manually stopped/deleted after hibernation, the executor skips it during wakeup
- The executor handles "resource not found" or "already in desired state" gracefully
- Hibernator's contract is: "restore to the captured intent, but tolerate reality"
Example EC2 flow:
Cycle "hib-001" (first attempt):
  Instance running → captured with wasRunning=true, tracked in managedByCycleIDs
Retry "hib-001" (user restarts same operation):
  Instance still running → wasRunning=true preserved (same cycle ID)
  Instance stopped → marker preserved, state unchanged (user responsibility)
New cycle "hib-002" (fresh hibernation):
  Instance running → fresh capture with new cycle ID
  Instance stopped → not tracked (different cycle ID, not in demanded state)
WakeUp: Instance restored based on captured intent
How Executors Run¶
Executors do not run inside the controller. Instead, the controller creates an isolated Runner Job for each target. The runner:
- Loads the executor matching the target's `type` field
- Calls `Validate` to verify parameters
- Calls `Shutdown` or `WakeUp` depending on the operation
- Streams logs and progress to the control plane via gRPC
- Persists restore metadata in a ConfigMap (`hibernator-restore-{plan-name}`)
Each runner gets an ephemeral ServiceAccount with the minimum permissions needed.
Restore Data¶
During shutdown, executors capture metadata about the resource's current state (e.g., replica counts, scaling configs, instance IDs). This metadata is stored as JSON in a ConfigMap and used during wakeup to restore the resource to its exact pre-hibernation configuration.
The restore data ConfigMap is named hibernator-restore-{plan-name} with keys formatted as {target-name}.json.
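As an illustration of the naming scheme stated in this section, the sketch below builds a ConfigMap manifest as a plain dictionary. The helper name `build_restore_configmap` and the `namespace` argument are assumptions; only the `hibernator-restore-{plan-name}` name and the `{target-name}.json` key format come from the text.

```python
import json


def build_restore_configmap(plan_name, namespace, targets):
    """Return a ConfigMap manifest holding one JSON document per target."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"hibernator-restore-{plan_name}",
            "namespace": namespace,  # assumption: not specified in the text
        },
        "data": {
            f"{target}.json": json.dumps(state, sort_keys=True)
            for target, state in targets.items()
        },
    }


cm = build_restore_configmap(
    "nightly",
    "hibernator-system",
    {"my-cluster": {"app-nodes": {"desired": 3, "min": 1, "max": 5}}},
)
print(cm["metadata"]["name"])  # hibernator-restore-nightly
print(list(cm["data"]))        # ['my-cluster.json']
```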
Restore Data Timestamps¶
Each restore point entry contains timestamps that track different phases of the capture process:
Captured At¶
- Meaning: The timestamp at which Hibernator initiated the save of the captured data to the ConfigMap
- Set when: When the accumulated data is ready to be persisted (just before the ConfigMap update)
- Granularity: Per-target (all resources in the target share the same CapturedAt)
- Use case: Historical tracking of when data was captured; audit and freshness checks from hibernator's perspective
Reported At (LastReportedAt)¶
- Meaning: The timestamp when a specific resource's state was reported by the executor via callback
- Set when: During `SaveState` when the executor reports each resource's state
- Granularity: Per-resource (each resource has its own ReportedAt)
- Use case: Track idempotency within a hibernation cycle; detect stale resources
Timestamp Flow Example¶
Hibernation Cycle "cycle-001":
Executor Shutdown:
├─ 10:00:00 → Discovers resource "app-server" → reports state
│ LastReportedAt["app-server"] = 10:00:00
├─ 10:00:05 → Discovers resource "worker-1" → reports state
│ LastReportedAt["worker-1"] = 10:00:05
└─ 10:00:10 → Flush to ConfigMap completes successfully
CapturedAt = 10:00:10 (for entire target)
On Retry (same cycle ID):
├─ 10:05:00 → "app-server" already reported → preserves original state
│ LastReportedAt["app-server"] stays 10:00:00
└─ 10:05:10 → New resources reported → updated LastReportedAt
CapturedAt = 10:05:10 (updated on successful save)
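The flow above can be reproduced with a small in-memory model. This is a sketch under assumptions: the `TargetRestoreData` class, its method names, and the replica-count states are hypothetical, and timestamps are plain strings to keep the example short.

```python
class TargetRestoreData:
    """Toy model of one target's restore data within a single cycle."""

    def __init__(self, cycle_id):
        self.cycle_id = cycle_id
        self.state = {}
        self.last_reported_at = {}  # per resource
        self.captured_at = None     # per target

    def save_state(self, resource, state, now):
        """Record a resource report; an earlier report in this cycle wins."""
        if resource in self.state:
            return False
        self.state[resource] = state
        self.last_reported_at[resource] = now
        return True

    def flush(self, now):
        """Mark a successful save of the whole target."""
        self.captured_at = now


data = TargetRestoreData("cycle-001")
data.save_state("app-server", {"replicas": 3}, "10:00:00")
data.save_state("worker-1", {"replicas": 2}, "10:00:05")
data.flush("10:00:10")

# Retry with the same cycle ID: the earlier report is kept.
data.save_state("app-server", {"replicas": 0}, "10:05:00")
data.flush("10:05:10")

print(data.last_reported_at["app-server"])  # 10:00:00
print(data.captured_at)                     # 10:05:10
```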
Data Structure¶
Each target's restore data includes:
{
"target": "my-cluster",
"executor": "eks",
"version": 1,
"isLive": true,
"cycleID": "cycle-001",
"createdAt": "2026-04-30T10:00:00Z",
"capturedAt": "2026-04-30T10:00:10Z",
"state": {
"app-nodes": { "desired": 3, "min": 1, "max": 5 }
},
"status": {
"app-nodes": {
"staleCount": 0,
"lastReportedAt": "2026-04-30T10:00:00Z"
}
}
}
Built-in Executors¶
| Executor | Resource | Provider | Connector | Status |
|---|---|---|---|---|
| `eks` | EKS Managed Node Groups | AWS | CloudProvider | |
| `karpenter` | Karpenter NodePools | Kubernetes | K8SCluster | |
| `ec2` | EC2 Instances | AWS | CloudProvider | |
| `rds` | RDS Instances & Clusters | AWS | CloudProvider | |
| `workloadscaler` | Kubernetes Workloads | Kubernetes | K8SCluster | |
| `noop` | None (testing) | - | Any | |
| `gke` | GKE Node Pools | GCP | K8SCluster | Under construction |
| `cloudsql` | Cloud SQL Instances | GCP | CloudProvider | Under construction |
EKS¶
Type: eks · Connector: CloudProvider (AWS)
Manages EKS Managed Node Groups by scaling them to zero during hibernation and restoring original scaling configuration on wakeup.
Note
This executor only handles Managed Node Groups via the AWS EKS API. For Karpenter-managed NodePools, use the separate karpenter executor.
Shutdown Flow¶
- Discover node groups: If `nodeGroups` is empty, lists all node groups in the cluster via `ListNodegroups`. Otherwise, uses the specified list.
- Capture state: For each node group, calls `DescribeNodegroup` to record the current `desiredSize`, `minSize`, and `maxSize`.
- Persist restore data: Saves the scaling configuration per node group to the restore ConfigMap.
- Scale to zero: Calls `UpdateNodegroupConfig` setting `minSize=0` and `desiredSize=0` (keeps `maxSize` unchanged).
- Await (optional): If `awaitCompletion` is enabled, polls until all nodes with label `eks.amazonaws.com/nodegroup={name}` are deleted.
Wakeup Flow¶
- Load restore data: Reads the saved scaling configuration from the ConfigMap.
- Restore scaling: For each node group, calls `UpdateNodegroupConfig` with the original `desiredSize`, `minSize`, and `maxSize`.
- Await (optional): Polls `DescribeNodegroup` until the node group status returns to `ACTIVE` and node counts match.
Restore Data Shape¶
Each node group is stored under its name:
{
"app-nodes": { "desired": 3, "min": 1, "max": 5 },
"worker-nodes": { "desired": 2, "min": 0, "max": 4 }
}
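To make the capture and restore steps concrete, the sketch below derives the restore entry and the `UpdateNodegroupConfig` arguments from a `DescribeNodegroup`-shaped response. The function names are hypothetical, and the dictionaries are only built here, not sent to AWS; the argument names follow the EKS API's `clusterName`, `nodegroupName`, and `scalingConfig` fields.

```python
def plan_shutdown(cluster, describe_response):
    """Return (restore entry, update arguments) for one node group."""
    nodegroup = describe_response["nodegroup"]
    cfg = nodegroup["scalingConfig"]
    restore = {
        "desired": cfg["desiredSize"],
        "min": cfg["minSize"],
        "max": cfg["maxSize"],
    }
    update = {
        "clusterName": cluster,
        "nodegroupName": nodegroup["nodegroupName"],
        # maxSize is left as-is; only min and desired drop to zero.
        "scalingConfig": {"minSize": 0, "desiredSize": 0, "maxSize": cfg["maxSize"]},
    }
    return restore, update


def plan_wakeup(cluster, name, restore):
    """Return update arguments that restore the saved scaling configuration."""
    return {
        "clusterName": cluster,
        "nodegroupName": name,
        "scalingConfig": {
            "minSize": restore["min"],
            "desiredSize": restore["desired"],
            "maxSize": restore["max"],
        },
    }


response = {
    "nodegroup": {
        "nodegroupName": "app-nodes",
        "scalingConfig": {"minSize": 1, "maxSize": 5, "desiredSize": 3},
    }
}
restore, update = plan_shutdown("my-cluster", response)
print(restore)  # {'desired': 3, 'min': 1, 'max': 5}
print(plan_wakeup("my-cluster", "app-nodes", restore)["scalingConfig"])
```

Capturing the entry before building the scale-down request mirrors the order of the shutdown flow: restore data is persisted first, so a failed update can still be undone.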
Prerequisites¶
| Requirement | Details |
|---|---|
| Connector | CloudProvider with type: aws |
| IAM Permissions | eks:ListNodegroups, eks:DescribeNodegroup, eks:UpdateNodegroupConfig |
| Await Timeout | Default: 10 minutes |
Limitations¶
- Does not drain nodes — relies on AWS default graceful termination behavior.
- The EKS cluster itself stays up; only node groups are scaled.
- Multi-AZ distribution is handled transparently by AWS.
Karpenter¶
Type: karpenter · Connector: K8SCluster
Manages Karpenter NodePools by deleting them during hibernation (which tells Karpenter to drain and remove all managed nodes) and recreating them with the original spec on wakeup.
Shutdown Flow¶
- Discover NodePools: If `nodePools` is empty, lists all NodePools via the `karpenter.sh/v1` API. Otherwise, uses the specified names.
- Capture state: For each NodePool, retrieves the full spec and labels using a `Get` call.
- Persist restore data: Saves the complete NodePool definition (name, spec, labels) to the restore ConfigMap.
- Delete NodePools: Calls `Delete` on each NodePool. Karpenter automatically evicts pods and terminates the underlying nodes.
- Await (optional): Polls until all nodes with label `karpenter.sh/nodepool={name}` are gone.
Wakeup Flow¶
- Load restore data: Reads saved NodePool definitions.
- Recreate NodePools: Reconstructs each NodePool object with the original spec, labels, and API version, then calls `Create`.
- Await (optional): Polls NodePool status until the `Ready` condition is `True`.
Restore Data Shape¶
Each NodePool is stored under its name with the full spec:
{
"default": {
"name": "default",
"spec": { "template": {}, "limits": {}, "disruption": {} },
"labels": { "team": "platform" }
}
}
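A sketch of how a live NodePool object might be reduced to the shape above and later rebuilt for `Create`. The helper names are hypothetical, and because the stored shape shows no API version field, the example assumes `karpenter.sh/v1` as stated under Limitations. Server-managed metadata such as `uid` and `resourceVersion` is deliberately not carried over.

```python
import copy

NODEPOOL_API_VERSION = "karpenter.sh/v1"  # assumption, see lead-in


def capture_nodepool(obj):
    """Keep only what is needed to recreate the NodePool later."""
    meta = obj["metadata"]
    return {
        "name": meta["name"],
        "spec": copy.deepcopy(obj["spec"]),
        "labels": dict(meta.get("labels", {})),
    }


def rebuild_nodepool(entry):
    """Build a fresh manifest suitable for a Create call."""
    return {
        "apiVersion": NODEPOOL_API_VERSION,
        "kind": "NodePool",
        "metadata": {"name": entry["name"], "labels": dict(entry["labels"])},
        "spec": copy.deepcopy(entry["spec"]),
    }


live = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {
        "name": "default",
        "labels": {"team": "platform"},
        "uid": "1234",
        "resourceVersion": "42",
    },
    "spec": {"template": {}, "limits": {}, "disruption": {}},
}
entry = capture_nodepool(live)
manifest = rebuild_nodepool(entry)
print(sorted(manifest["metadata"]))  # ['labels', 'name']
```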
Prerequisites¶
| Requirement | Details |
|---|---|
| Connector | K8SCluster with access to the target cluster |
| RBAC | karpenter.sh nodepools (get, list, delete, create), v1 nodes (list, get) |
| Await Timeout | Default: 5 minutes |
Limitations¶
- Assumes the `karpenter.sh/v1` API version. Earlier Karpenter versions using `v1beta1` may require adaptation.
- Karpenter respects Pod Disruption Budgets during eviction; the shutdown may not complete within the timeout if PDBs block.
- NodePool admission webhooks with side effects could interfere with deletion or recreation.
EC2¶
Type: ec2 · Connector: CloudProvider (AWS)
Manages EC2 instances by stopping running instances during hibernation and starting them back on wakeup. Automatically excludes instances managed by Auto Scaling Groups or Karpenter.
Shutdown Flow¶
- Discover instances: Calls `DescribeInstances` with server-side filters (`selector.tags` as AWS Filters or `selector.instanceIds` as explicit IDs). When `selector.tagSelector` is used, it applies as a client-side filter after fetching. Filters out terminated/shutting-down instances and those managed by ASGs or Karpenter.
- Capture state: Records each instance's ID and whether it was running (`wasRunning`).
- Persist restore data: Saves instance states to the restore ConfigMap.
- Stop instances: Calls `StopInstances` for all instances that were running. Already-stopped instances are skipped.
- Await (optional): Polls `DescribeInstances` until all instances reach the `stopped` state.
Wakeup Flow¶
- Load restore data: Reads saved instance states.
- Start instances: Calls `StartInstances` only for instances where `wasRunning=true`. Instances that were already stopped before hibernation remain stopped.
- Await (optional): Polls until all started instances reach the `running` state.
Restore Data Shape¶
Each instance is stored under its ID:
{
"i-0abc123def456789a": { "instanceId": "i-0abc123def456789a", "wasRunning": true },
"i-0def456789abc0123": { "instanceId": "i-0def456789abc0123", "wasRunning": false }
}
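The capture and selection logic can be sketched against a `DescribeInstances`-shaped response. The function names are hypothetical. Detecting managed instances through the `aws:autoscaling:groupName` and `karpenter.sh/nodepool` tags is an assumption about how the exclusion might work, not a statement of how the executor does it.

```python
SKIPPED_STATES = {"terminated", "shutting-down"}
ASG_TAG = "aws:autoscaling:groupName"    # tag AWS sets on ASG-launched instances
KARPENTER_TAG = "karpenter.sh/nodepool"  # assumed marker for Karpenter nodes


def capture_instances(describe_response):
    """Build restore data in the shape shown above."""
    restore = {}
    for reservation in describe_response["Reservations"]:
        for inst in reservation["Instances"]:
            state = inst["State"]["Name"]
            tags = {t["Key"] for t in inst.get("Tags", [])}
            if state in SKIPPED_STATES:
                continue
            if ASG_TAG in tags or KARPENTER_TAG in tags:
                continue  # left to the ASG or Karpenter to manage
            iid = inst["InstanceId"]
            restore[iid] = {"instanceId": iid, "wasRunning": state == "running"}
    return restore


def ids_to_start(restore):
    """Instance IDs that should be started again on wakeup."""
    return sorted(iid for iid, e in restore.items() if e["wasRunning"])


response = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0abc", "State": {"Name": "running"}},
    {"InstanceId": "i-0def", "State": {"Name": "stopped"}},
    {"InstanceId": "i-0asg", "State": {"Name": "running"},
     "Tags": [{"Key": ASG_TAG, "Value": "web"}]},
    {"InstanceId": "i-0old", "State": {"Name": "terminated"}},
]}]}
restore = capture_instances(response)
print(ids_to_start(restore))  # ['i-0abc']
```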
Prerequisites¶
| Requirement | Details |
|---|---|
| Connector | CloudProvider with type: aws |
| IAM Permissions | ec2:DescribeInstances, ec2:StopInstances, ec2:StartInstances |
| Await Timeout | Default: 5 minutes |
Limitations¶
- ASG-managed instances are excluded — instances owned by Auto Scaling Groups are skipped to avoid conflicts with ASG desired-count reconciliation.
- Karpenter-managed instances are excluded — same logic applies.
- Elastic IPs remain associated through stop/start cycles.
- EBS volumes are preserved; instance store data is lost on stop (standard EC2 behavior).
RDS¶
Type: rds · Connector: CloudProvider (AWS)
Manages RDS DB instances and Aurora clusters, optionally creating a snapshot before stopping. Resources are selected by tags, explicit IDs, or discovery mode (see Selector Modes).
Shutdown Flow¶
- Determine resource types: Based on the selector:
  - Explicit `instanceIds`/`clusterIds` → resource types inferred from which IDs are provided.
  - Tag-based or `includeAll` → requires `discoverInstances` and/or `discoverClusters` flags to be explicitly set.
- Discover resources: Calls `DescribeDBInstances` and/or `DescribeDBClusters` with appropriate filters.
- For each DB instance:
  - Checks status is `available` (skips if not stoppable).
  - If `snapshotBeforeStop=true`, creates a snapshot via `CreateDBSnapshot` and waits for it to complete (30-minute waiter).
  - Calls `StopDBInstance`.
  - Saves state: instance ID, previous status, snapshot ID if created.
- For each DB cluster:
  - Same logic via `StopDBCluster` and `CreateDBClusterSnapshot`.
- Await (optional): Polls until all resources reach the `stopped` status.
Wakeup Flow¶
- Load restore data: Reads saved instance/cluster states.
- Start resources: Calls `StartDBInstance` or `StartDBCluster` for each resource that was running before hibernation.
- Await (optional): Polls until all resources return to `available` status.
Restore Data Shape¶
Keys use a type prefix to distinguish instances from clusters:
{
"instance:production-db": {
"instanceId": "production-db",
"wasStopped": false,
"snapshotId": "production-db-hibernate-1711500000",
"instanceType": "db.r5.2xlarge"
},
"cluster:aurora-prod": {
"clusterId": "aurora-prod",
"wasStopped": false,
"snapshotId": "aurora-prod-hibernate-1711500000"
}
}
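A short sketch of how the prefixed keys could drive wakeup. The `plan_wakeup` function is hypothetical, and it assumes that `wasStopped: false` means the resource was running before hibernation and should therefore be started.

```python
START_CALLS = {"instance": "StartDBInstance", "cluster": "StartDBCluster"}


def plan_wakeup(restore):
    """Return (API call name, identifier) pairs for resources to start."""
    calls = []
    for key, entry in restore.items():
        kind, _, identifier = key.partition(":")
        if kind not in START_CALLS or not identifier:
            raise ValueError(f"unexpected restore key: {key!r}")
        if entry.get("wasStopped"):
            continue  # was already stopped before hibernation: leave it
        calls.append((START_CALLS[kind], identifier))
    return calls


restore = {
    "instance:production-db": {"instanceId": "production-db", "wasStopped": False},
    "cluster:aurora-prod": {"clusterId": "aurora-prod", "wasStopped": False},
    "instance:legacy-db": {"instanceId": "legacy-db", "wasStopped": True},
}
print(plan_wakeup(restore))
# [('StartDBInstance', 'production-db'), ('StartDBCluster', 'aurora-prod')]
```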
Selector Modes¶
The RDS executor supports three mutually exclusive selection methods:
| Mode | Fields | Discovery Flags Required? |
|---|---|---|
| Tag-based | `tags` or `excludeTags` | Yes: must set `discoverInstances` and/or `discoverClusters` |
| Explicit IDs | `instanceIds` and/or `clusterIds` | No: inferred from which IDs are provided |
| Discovery | `includeAll` | Yes: must set `discoverInstances` and/or `discoverClusters` |
Warning
Setting tags without discoverInstances or discoverClusters results in a no-op — nothing will be discovered.
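The selector rules can be expressed as a small resolver that returns which resource types will be discovered. The function name is hypothetical, and rejecting selectors that set zero or several methods is an assumption drawn from "mutually exclusive"; the real executor may validate differently.

```python
def resolve_resource_types(selector):
    """Return (discover_instances, discover_clusters) for a selector dict."""
    tag_mode = bool(selector.get("tags") or selector.get("excludeTags"))
    id_mode = bool(selector.get("instanceIds") or selector.get("clusterIds"))
    all_mode = bool(selector.get("includeAll"))
    if tag_mode + id_mode + all_mode != 1:
        raise ValueError("exactly one selection method must be set")
    if id_mode:
        # Explicit IDs: types are inferred from which lists are non-empty.
        return bool(selector.get("instanceIds")), bool(selector.get("clusterIds"))
    # Tag-based and includeAll rely on the explicit discovery flags.
    return (
        bool(selector.get("discoverInstances")),
        bool(selector.get("discoverClusters")),
    )


print(resolve_resource_types({"instanceIds": ["production-db"]}))  # (True, False)
print(resolve_resource_types({"tags": {"env": "dev"}}))            # (False, False)
print(resolve_resource_types({"tags": {"env": "dev"}, "discoverClusters": True}))
# (False, True)
```

The second call shows the no-op case from the warning: tags are set, but neither discovery flag is, so nothing is selected.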
Prerequisites¶
| Requirement | Details |
|---|---|
| Connector | CloudProvider with type: aws |
| IAM Permissions | rds:DescribeDBInstances, rds:DescribeDBClusters, rds:StopDBInstance, rds:StartDBInstance, rds:StopDBCluster, rds:StartDBCluster, rds:CreateDBSnapshot and rds:CreateDBClusterSnapshot (if snapshots enabled) |
| Await Timeout | Default: 15 minutes |
Limitations¶
- Read replicas are not managed — only primary instances and clusters.
- Aurora Serverless supports stop/start but auto-scaling behavior on wakeup may differ.
- RDS Proxy connections are not managed by this executor.
- The 7-day auto-restart limit imposed by AWS still applies — RDS automatically restarts instances that have been stopped for more than 7 days.
WorkloadScaler¶
Type: workloadscaler · Connector: K8SCluster
Manages Kubernetes workloads (Deployments, StatefulSets, ReplicaSets, or any CRD with a scale subresource) by scaling replicas to zero during hibernation and restoring original counts on wakeup.
Shutdown Flow¶
- Resolve target namespaces: Uses `namespace.literals` (explicit list) or `namespace.selector` (label-based discovery).
- Resolve workload kinds: Uses `includedGroups` (defaults to `["Deployment"]`). Custom CRDs use the format `group/version/resource` (e.g., `argoproj.io/v1alpha1/rollouts`).
- Discover workloads: Lists resources in each namespace, optionally filtered by `workloadSelector` labels.
- For each workload:
  - Reads the scale subresource via `GetScale()` to capture current replica count.
  - Saves state: namespace, kind, name, replica count, GVR.
  - Updates the scale subresource to `replicas: 0`.
- Await (optional): Polls until each workload's scale status reflects zero replicas.
Wakeup Flow¶
- Load restore data — Reads saved workload states.
- Restore replicas — For each workload, updates the scale subresource back to the original replica count.
- Await (optional) — Polls until replica counts match the desired state.
Restore Data Shape¶
Keys use a namespace/kind/name format:
{
"default/Deployment/api-server": {
"group": "apps", "version": "v1", "resource": "deployments",
"kind": "Deployment", "namespace": "default",
"name": "api-server", "replicas": 3
},
"default/Deployment/worker": {
"group": "apps", "version": "v1", "resource": "deployments",
"kind": "Deployment", "namespace": "default",
"name": "worker", "replicas": 2
}
}
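The key format and the `includedGroups` syntax can be illustrated with two small helpers. Both function names and the `BUILTIN_KINDS` table are hypothetical; the table lists only the three built-in kinds named in this section.

```python
BUILTIN_KINDS = {
    "Deployment": ("apps", "v1", "deployments"),
    "StatefulSet": ("apps", "v1", "statefulsets"),
    "ReplicaSet": ("apps", "v1", "replicasets"),
}


def resolve_group(entry):
    """Turn an includedGroups entry into a (group, version, resource) tuple."""
    if entry in BUILTIN_KINDS:
        return BUILTIN_KINDS[entry]
    parts = entry.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected group/version/resource, got {entry!r}")
    return tuple(parts)


def restore_key(namespace, kind, name):
    """Build the namespace/kind/name key used in restore data."""
    return f"{namespace}/{kind}/{name}"


print(resolve_group("Deployment"))
# ('apps', 'v1', 'deployments')
print(resolve_group("argoproj.io/v1alpha1/rollouts"))
# ('argoproj.io', 'v1alpha1', 'rollouts')
print(restore_key("default", "Deployment", "api-server"))
# default/Deployment/api-server
```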
Prerequisites¶
| Requirement | Details |
|---|---|
| Connector | K8SCluster with access to the target cluster |
| RBAC | apps deployments/scale, apps statefulsets/scale, apps replicasets/scale (get, update); v1 namespaces (list, get) for namespace discovery |
| Await Timeout | Default: 5 minutes |
Limitations¶
- Only works with resources that implement the Kubernetes scale subresource API.
- Namespace-scoped only — does not work with cluster-scoped resources.
- The executor does not check Pod readiness during wakeup; it relies on the workload controller's reconciliation.
- Custom CRDs require the `group/version/resource` format in `includedGroups`.
NoOp¶
Type: noop · Connector: CloudProvider or K8SCluster (either works)
A testing executor that simulates hibernation operations without touching any real resources. Useful for validating schedules, execution strategies, DAG dependencies, and error recovery flows.
Shutdown Flow¶
- Simulates work with a random delay between 0 and `randomDelaySeconds`.
- If `failureMode` is `"shutdown"` or `"both"`, returns a simulated error with the configured `failureMessage`.
- Otherwise, generates restore data (parameters, timestamp, UUID) and returns success.
Wakeup Flow¶
- Simulates work with the same random delay.
- If `failureMode` is `"wakeup"` or `"both"`, returns a simulated error.
- Otherwise, returns success.
Parameters¶
| Parameter | Default | Description |
|---|---|---|
| `randomDelaySeconds` | `1` | Maximum random delay (0–30 seconds) |
| `failureMode` | `"none"` | When to fail: `"none"`, `"shutdown"`, `"wakeup"`, `"both"` |
| `failureMessage` | (auto) | Custom error message for simulated failures |
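The behavior described above can be approximated in a few lines. This is a sketch, not the executor's code: `run_noop` and `SimulatedFailure` are hypothetical names, the default error text stands in for the "(auto)" message, and treating 0 to 30 as the allowed range for `randomDelaySeconds` is an interpretation of the table. The `sleep` argument defaults to a no-op so the example runs instantly; pass `time.sleep` for real delays.

```python
import random
import uuid
from datetime import datetime, timezone


class SimulatedFailure(Exception):
    """Raised when failureMode asks the operation to fail."""


def run_noop(operation, params, rng=None, sleep=lambda seconds: None):
    """Simulate 'shutdown' or 'wakeup' without touching any resource."""
    rng = rng or random.Random()
    max_delay = params.get("randomDelaySeconds", 1)
    if not 0 <= max_delay <= 30:
        raise ValueError("randomDelaySeconds must be between 0 and 30")
    sleep(rng.uniform(0, max_delay))

    mode = params.get("failureMode", "none")
    if mode in (operation, "both"):
        message = params.get("failureMessage") or f"simulated {operation} failure"
        raise SimulatedFailure(message)

    if operation == "shutdown":
        # Restore data: parameters, timestamp, and a UUID.
        return {
            "parameters": dict(params),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "id": str(uuid.uuid4()),
        }
    return None


delays = []
out = run_noop("shutdown", {"randomDelaySeconds": 5}, rng=random.Random(0), sleep=delays.append)
print(sorted(out))  # ['id', 'parameters', 'timestamp']

try:
    run_noop("wakeup", {"failureMode": "both", "failureMessage": "boom"})
except SimulatedFailure as err:
    print(err)  # boom
```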
Use Cases¶
- Test scheduling logic without cloud credentials
- Validate DAG dependency ordering
- Test execution strategies (Sequential, Parallel, DAG, Staged)
- Simulate error recovery and manual retry workflows
- CI/CD integration tests
GKE¶
Type: gke · Connector: K8SCluster
Under Construction
The GKE executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.
Planned behavior: Manage GKE node pool scaling via the GCP Container API, similar to how the EKS executor manages managed node groups.
Planned Parameters¶
| Parameter | Description |
|---|---|
| `nodePools` | List of GKE node pool names to hibernate (required) |
CloudSQL¶
Type: cloudsql · Connector: CloudProvider (GCP)
Under Construction
The Cloud SQL executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.
Planned behavior: Stop and start Cloud SQL instances via the Cloud SQL Admin API, similar to how the RDS executor manages database instances.
Planned Parameters¶
| Parameter | Description |
|---|---|
| `instanceName` | Cloud SQL instance name (required) |
| `project` | GCP project ID (required) |
Choosing an Executor¶
| I want to hibernate... | Use executor | Notes |
|---|---|---|
| EKS managed node groups | `eks` | Scales to zero; cluster stays up |
| Karpenter NodePools | `karpenter` | Deletes and recreates pools |
| Standalone EC2 instances | `ec2` | Stops/starts; excludes ASG-managed |
| RDS databases | `rds` | Supports instances, clusters, and pre-stop snapshots |
| Kubernetes Deployments/StatefulSets | `workloadscaler` | Scales replicas to zero |
| Argo Rollouts or other CRDs | `workloadscaler` | Use `group/version/resource` format in `includedGroups` |
| GKE node pools | `gke` | Not yet implemented |
| Cloud SQL instances | `cloudsql` | Not yet implemented |
For the full parameter schema of each executor, see the Executor Parameters Reference.
Operational Guides: