Executors¶

Executors are the hands of the Hibernator operator. While the control plane (brain) decides when and in what order to act, executors know how to shut down and wake up a specific type of resource.

Executor Contract¶

Every executor implements three operations:

Operation	Purpose
Validate	Verify parameters and connectivity before execution
Shutdown	Stop or scale-down the resource, capturing restore metadata
WakeUp	Restore the resource to its pre-hibernation state using saved metadata

Executors own idempotency — calling Shutdown on an already-stopped resource or WakeUp on an already-running resource must succeed without side effects.
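A minimal sketch of the contract, using a toy in-memory resource store. The class and method names (`Executor`, `InMemoryExecutor`, `wake_up`) are hypothetical and only illustrate the three operations and the idempotency requirement; they are not Hibernator's actual interface.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical base class mirroring the three contract operations."""

    @abstractmethod
    def validate(self, params: dict) -> None: ...

    @abstractmethod
    def shutdown(self, params: dict) -> dict: ...

    @abstractmethod
    def wake_up(self, params: dict, restore: dict) -> None: ...


class InMemoryExecutor(Executor):
    """Toy executor over a dict of resource name -> 'running' | 'stopped'."""

    def __init__(self, resources: dict) -> None:
        self.resources = resources

    def validate(self, params: dict) -> None:
        missing = [n for n in params.get("names", []) if n not in self.resources]
        if missing:
            raise ValueError(f"unknown resources: {missing}")

    def shutdown(self, params: dict) -> dict:
        restore = {}
        for name in params["names"]:
            # Capture restore metadata before changing anything.
            restore[name] = {"wasRunning": self.resources[name] == "running"}
            # Assigning 'stopped' to an already-stopped resource is a no-op.
            self.resources[name] = "stopped"
        return restore

    def wake_up(self, params: dict, restore: dict) -> None:
        for name, saved in restore.items():
            # Skip resources that vanished or were not running before.
            if saved["wasRunning"] and name in self.resources:
                self.resources[name] = "running"
```

Calling `wake_up` twice with the same restore data leaves the store unchanged after the first call. Note that a second `shutdown` in this toy version would record `wasRunning: false` for everything, which is the problem the Intent Preservation Contract addresses.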

Intent Preservation Contract¶

Hibernator implements a first-capture-wins intent preservation strategy to handle retries and partial failures during shutdown operations.

Demanded State¶

Each executor defines a demanded state — the condition that qualifies a resource for hibernator management:

Executor	Demanded State	Intent Field
EC2	Instance state is `running`	`wasRunning`
EKS	Node group `desiredSize > 0`	`wasScaled`
RDS	DB instance/cluster status is `available`	`wasRunning`
Karpenter	NodePool exists	(full spec captured)
WorkloadScaler	Workload `replicas > 0`	`wasScaled`

Only resources in their demanded state are captured and managed by hibernator. Resources not in demanded state are observed passively.

First-Capture-Wins Semantics¶

When a resource is first captured during a hibernation cycle:

Intent is locked: The first captured wasRunning or wasScaled value is kept for the rest of the session, even if the resource's state changes later
Cycle tracking: A managedByCycleIDs map tracks which cycle first captured each resource
Session preservation: On subsequent hibernation attempts with the same cycle ID (retries), the original intent is preserved even if the resource's current state has changed
Fresh session: A different cycle ID starts fresh — only resources currently in demanded state are tracked

This ensures that:
- Retry safety: If shutdown fails and the user retries with the same cycle ID, the original intent is preserved
- Consistency: Once Hibernator decides a resource should be managed, that decision persists until successful wakeup (within the same session)
- Clean slate: New hibernation operations (different cycle ID) get fresh tracking without stale data
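The semantics above can be sketched as a single capture function. This is an illustration under assumptions: `capture`, its arguments, and the choice to drop entries from an older cycle that are no longer in demanded state are hypothetical. Only the `managedByCycleIDs` idea and the `wasRunning` field come from the text.

```python
def capture(restore: dict, managed_by_cycle: dict, cycle_id: str, observed: dict) -> dict:
    """Apply first-capture-wins to one shutdown attempt.

    observed maps resource name -> True if it is in demanded state right now.
    """
    for name, in_demanded_state in observed.items():
        if managed_by_cycle.get(name) == cycle_id:
            # Retry of the same cycle: keep the intent captured the first time.
            continue
        if in_demanded_state:
            # First capture in this cycle: lock the intent and record the owner.
            restore[name] = {"wasRunning": True}
            managed_by_cycle[name] = cycle_id
        else:
            # Different cycle and not in demanded state: start fresh, untracked.
            restore.pop(name, None)
            managed_by_cycle.pop(name, None)
    return restore
```

With this shape, a retry of `hib-001` that sees the instance already stopped still keeps `wasRunning: true`, while `hib-002` only tracks what is running when it starts.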

Stale Resource Eviction¶

If a resource is not reported for 3 consecutive hibernation cycles:

The resource is evicted from restore data
The resource is removed from managedByCycleIDs tracking
On the next hibernation, the resource can be freshly captured

This prevents permanently retaining data for deleted or unmanageable resources while allowing temporary absences (e.g., API failures) without data loss.
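A minimal sketch of the eviction rule, assuming a per-resource `staleCount` counter that resets whenever the resource is reported again. The function name and the exact bookkeeping are hypothetical; only the threshold of 3 consecutive cycles is taken from the text.

```python
STALE_LIMIT = 3  # consecutive cycles without a report before eviction


def evict_stale(restore: dict, status: dict, managed_by_cycle: dict, reported: set) -> list:
    """Update stale counters after one cycle and evict resources at the limit."""
    evicted = []
    for name in list(status):
        if name in reported:
            # Seen this cycle: a temporary absence is forgiven.
            status[name]["staleCount"] = 0
            continue
        status[name]["staleCount"] += 1
        if status[name]["staleCount"] >= STALE_LIMIT:
            # Drop the restore data and the cycle tracking together.
            restore.pop(name, None)
            managed_by_cycle.pop(name, None)
            del status[name]
            evicted.append(name)
    return evicted
```

A resource that misses one or two cycles (for example during an API outage) keeps its restore data; it is only removed on the third consecutive miss.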

Note

The managedByCycleIDs tracking is stored separately from resource state and is not visible in the resource data itself. It is used internally for idempotency and session management.

Edge Case Handling¶

Resource state changes between hibernation and wakeup:
- If a resource is manually stopped/deleted after hibernation, the executor skips it during wakeup
- The executor handles "resource not found" or "already in desired state" gracefully
- Hibernator's contract is: "restore to the captured intent, but tolerate reality"

Example EC2 flow:

Cycle "hib-001" (first attempt):
  Instance running → captured with wasRunning=true, tracked in managedByCycleIDs

Retry "hib-001" (user restarts same operation):
  Instance still running → wasRunning=true preserved (same cycle ID)
  Instance stopped → marker preserved, state unchanged (user responsibility)

New cycle "hib-002" (fresh hibernation):
  Instance running → fresh capture with new cycle ID
  Instance stopped → not tracked (different cycle ID, not in demanded state)

WakeUp: Instance restored based on captured intent

How Executors Run¶

Executors do not run inside the controller. Instead, the controller creates an isolated Runner Job for each target. The runner:

Loads the executor matching the target's type field
Calls Validate to verify parameters
Calls Shutdown or WakeUp depending on the operation
Streams logs and progress to the control plane via gRPC
Persists restore metadata in a ConfigMap (restore-data-{plan-name})

Each runner gets an ephemeral ServiceAccount with the minimum permissions needed.

Restore Data¶

During shutdown, executors capture metadata about the resource's current state (e.g., replica counts, scaling configs, instance IDs). This metadata is stored as JSON in a ConfigMap and used during wakeup to restore the resource to its exact pre-hibernation configuration.

The restore data ConfigMap is named hibernator-restore-{plan-name} with keys formatted as {target-name}.json.

Restore Data Timestamps¶

Each restore point entry contains timestamps that track different phases of the capture process:

Captured At¶

Meaning: The timestamp when the hibernator captured and initiated the save operation to the ConfigMap
Set when: When the accumulated data is ready to be persisted (just before the ConfigMap update)
Granularity: Per-target (all resources in the target share the same CapturedAt)
Use case: Historical tracking of when data was captured; audit and freshness checks from hibernator's perspective

Reported At (LastReportedAt)¶

Meaning: The timestamp when a specific resource's state was reported by the executor via callback
Set when: During SaveState when the executor reports each resource's state
Granularity: Per-resource (each resource has its own LastReportedAt)
Use case: Track idempotency within a hibernation cycle; detect stale resources

Timestamp Flow Example¶

Hibernation Cycle "cycle-001":

  Executor Shutdown:
    ├─ 10:00:00 → Discovers resource "app-server" → reports state
    │             LastReportedAt["app-server"] = 10:00:00
    ├─ 10:00:05 → Discovers resource "worker-1" → reports state
    │             LastReportedAt["worker-1"] = 10:00:05
    └─ 10:00:10 → Flush to ConfigMap completes successfully
                  CapturedAt = 10:00:10 (for entire target)

  On Retry (same cycle ID):
    ├─ 10:05:00 → "app-server" already reported → preserves original state
    │             LastReportedAt["app-server"] stays 10:00:00
    └─ 10:05:10 → New resources reported → updated LastReportedAt
                  CapturedAt = 10:05:10 (updated on successful save)
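The flow above can be sketched with two small helpers. The function names and the in-memory entry are hypothetical; the field names follow the restore data structure (`capturedAt`, `lastReportedAt`, `staleCount`).

```python
def new_entry(target: str, executor: str, cycle_id: str, now: str) -> dict:
    """Create an empty restore entry for one target."""
    return {"target": target, "executor": executor, "cycleID": cycle_id,
            "createdAt": now, "capturedAt": None, "state": {}, "status": {}}


def report(entry: dict, name: str, state: dict, now: str) -> bool:
    """Record a resource's state; the first report within a cycle wins."""
    if name in entry["status"]:
        return False  # already reported: keep state and lastReportedAt
    entry["state"][name] = state
    entry["status"][name] = {"staleCount": 0, "lastReportedAt": now}
    return True


def flush(entry: dict, now: str) -> None:
    """Stamp the per-target capturedAt on every successful save."""
    entry["capturedAt"] = now
```

On a retry, `report` leaves `lastReportedAt` for `app-server` untouched, while `flush` moves `capturedAt` forward for the whole target.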

Data Structure¶

Each target's restore data includes:

{
  "target": "my-cluster",
  "executor": "eks",
  "version": 1,
  "isLive": true,
  "cycleID": "cycle-001",
  "createdAt": "2026-04-30T10:00:00Z",
  "capturedAt": "2026-04-30T10:00:10Z",
  "state": {
    "app-nodes": { "desired": 3, "min": 1, "max": 5 }
  },
  "status": {
    "app-nodes": {
      "staleCount": 0,
      "lastReportedAt": "2026-04-30T10:00:00Z"
    }
  }
}

Built-in Executors¶

Executor	Resource	Provider	Connector	Status
`eks`	EKS Managed Node Groups	AWS	CloudProvider	Implemented
`karpenter`	Karpenter NodePools	Kubernetes	K8SCluster	Implemented
`ec2`	EC2 Instances	AWS	CloudProvider	Implemented
`rds`	RDS Instances & Clusters	AWS	CloudProvider	Implemented
`workloadscaler`	Kubernetes Workloads	Kubernetes	K8SCluster	Implemented
`noop`	None (testing)	—	Any	Implemented
`gke`	GKE Node Pools	GCP	K8SCluster	Not Implemented
`cloudsql`	Cloud SQL Instances	GCP	CloudProvider	Not Implemented

EKS¶

Type: eks · Connector: CloudProvider (AWS)

Manages EKS Managed Node Groups by scaling them to zero during hibernation and restoring original scaling configuration on wakeup.

Note

This executor only handles Managed Node Groups via the AWS EKS API. For Karpenter-managed NodePools, use the separate karpenter executor.

Shutdown Flow¶

Discover node groups — If nodeGroups is empty, lists all node groups in the cluster via ListNodegroups. Otherwise, uses the specified list.
Capture state — For each node group, calls DescribeNodegroup to record the current desiredSize, minSize, and maxSize.
Persist restore data — Saves the scaling configuration per node group to the restore ConfigMap.
Scale to zero — Calls UpdateNodegroupConfig setting minSize=0 and desiredSize=0 (keeps maxSize unchanged).
Await (optional) — If awaitCompletion is enabled, polls until all nodes with label eks.amazonaws.com/nodegroup={name} are deleted.

Wakeup Flow¶

Load restore data — Reads the saved scaling configuration from the ConfigMap.
Restore scaling — For each node group, calls UpdateNodegroupConfig with the original desiredSize, minSize, and maxSize.
Await (optional) — Polls DescribeNodegroup until the node group status returns to ACTIVE and node counts match.
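The capture and scaling steps reduce to three small transformations between the `DescribeNodegroup` response and the `scalingConfig` sent to `UpdateNodegroupConfig`. The helper names below are hypothetical and no AWS calls are made; the sketch only shows which values are saved and which are sent.

```python
def capture_scaling(describe_response: dict) -> dict:
    """Extract restore data from a DescribeNodegroup-style response."""
    cfg = describe_response["nodegroup"]["scalingConfig"]
    return {"desired": cfg["desiredSize"], "min": cfg["minSize"], "max": cfg["maxSize"]}


def shutdown_scaling(saved: dict) -> dict:
    """scalingConfig for hibernation: min and desired go to 0, max is kept."""
    return {"minSize": 0, "maxSize": saved["max"], "desiredSize": 0}


def wakeup_scaling(saved: dict) -> dict:
    """scalingConfig for wakeup: restore all three original values."""
    return {"minSize": saved["min"], "maxSize": saved["max"], "desiredSize": saved["desired"]}
```

Keeping `maxSize` unchanged on shutdown means the wakeup request never has to raise the maximum before restoring `desiredSize`.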

Restore Data Shape¶

Each node group is stored under its name:

{
  "app-nodes": { "desired": 3, "min": 1, "max": 5 },
  "worker-nodes": { "desired": 2, "min": 0, "max": 4 }
}

Prerequisites¶

Requirement	Details
Connector	`CloudProvider` with `type: aws`
IAM Permissions	`eks:ListNodegroups`, `eks:DescribeNodegroup`, `eks:UpdateNodegroupConfig`
Await Timeout	Default: 10 minutes

Limitations¶

Does not drain nodes — relies on AWS default graceful termination behavior.
The EKS cluster itself stays up; only node groups are scaled.
Multi-AZ distribution is handled transparently by AWS.

Karpenter¶

Type: karpenter · Connector: K8SCluster

Manages Karpenter NodePools by deleting them during hibernation (which tells Karpenter to drain and remove all managed nodes) and recreating them with the original spec on wakeup.

Shutdown Flow¶

Discover NodePools — If nodePools is empty, lists all NodePools via the karpenter.sh/v1 API. Otherwise, uses the specified names.
Capture state — For each NodePool, retrieves the full spec and labels using a Get call.
Persist restore data — Saves the complete NodePool definition (name, spec, labels) to the restore ConfigMap.
Delete NodePools — Calls Delete on each NodePool. Karpenter automatically evicts pods and terminates the underlying nodes.
Await (optional) — Polls until all nodes with label karpenter.sh/nodepool={name} are gone.

Wakeup Flow¶

Load restore data — Reads saved NodePool definitions.
Recreate NodePools — Reconstructs each NodePool object with the original spec, labels, and API version, then calls Create.
Await (optional) — Polls NodePool status until the Ready condition is True.
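Recreating a NodePool means building a fresh object from the saved name, spec, and labels. A minimal sketch, assuming the restore entry has the `name`, `spec`, and `labels` fields described for this executor; `build_nodepool_manifest` is a hypothetical helper, not part of Hibernator.

```python
import copy


def build_nodepool_manifest(saved: dict, api_version: str = "karpenter.sh/v1") -> dict:
    """Build a NodePool manifest suitable for a Create call from restore data."""
    return {
        "apiVersion": api_version,
        "kind": "NodePool",
        # Only name and labels are set: server-populated metadata such as
        # resourceVersion or uid must not be sent on create.
        "metadata": {"name": saved["name"], "labels": dict(saved.get("labels", {}))},
        # Deep copy so later edits to the manifest do not alter the restore data.
        "spec": copy.deepcopy(saved["spec"]),
    }
```

Because the original object was deleted, the recreated NodePool gets a new UID; anything that referenced the old object by UID will not carry over.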

Restore Data Shape¶

Each NodePool is stored under its name with the full spec:

{
  "default": {
    "name": "default",
    "spec": { "template": {}, "limits": {}, "disruption": {} },
    "labels": { "team": "platform" }
  }
}

Prerequisites¶

Requirement	Details
Connector	`K8SCluster` with access to the target cluster
RBAC	`karpenter.sh nodepools` (get, list, delete, create), `v1 nodes` (list, get)
Await Timeout	Default: 5 minutes

Limitations¶

Assumes karpenter.sh/v1 API version. Earlier Karpenter versions using v1beta1 may require adaptation.
Karpenter respects Pod Disruption Budgets during eviction — the shutdown may not complete within the timeout if PDBs block.
NodePool admission webhooks with side effects could interfere with deletion or recreation.

EC2¶

Type: ec2 · Connector: CloudProvider (AWS)

Manages EC2 instances by stopping running instances during hibernation and starting them back on wakeup. Automatically excludes instances managed by Auto Scaling Groups or Karpenter.

Shutdown Flow¶

Discover instances — Calls DescribeInstances with server-side filters (selector.tags as AWS Filters or selector.instanceIds as explicit IDs). When selector.tagSelector is used, it applies as a client-side filter after fetching. Filters out terminated/shutting-down instances and those managed by ASGs or Karpenter.
Capture state — Records each instance's ID and whether it was running (wasRunning).
Persist restore data — Saves instance states to the restore ConfigMap.
Stop instances — Calls StopInstances for all instances that were running. Already-stopped instances are skipped.
Await (optional) — Polls DescribeInstances until all instances reach the stopped state.

Wakeup Flow¶

Load restore data — Reads saved instance states.
Start instances — Calls StartInstances only for instances where wasRunning=true. Instances that were already stopped before hibernation remain stopped.
Await (optional) — Polls until all started instances reach the running state.
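A sketch of the capture and wakeup selection over `DescribeInstances`-style instance records. The tag keys used to detect ASG or Karpenter ownership are assumptions made for illustration (the source does not say how ownership is detected), and the helper names are hypothetical.

```python
EXCLUDED_STATES = {"terminated", "shutting-down"}
# Assumption: ownership is detected via these tag keys.
MANAGED_TAG_KEYS = {"aws:autoscaling:groupName", "karpenter.sh/nodepool"}


def capture_state(instances: list) -> dict:
    """Build restore data, skipping excluded states and externally managed instances."""
    restore = {}
    for inst in instances:
        state = inst["State"]["Name"]
        if state in EXCLUDED_STATES:
            continue
        tag_keys = {t["Key"] for t in inst.get("Tags", [])}
        if tag_keys & MANAGED_TAG_KEYS:
            continue
        iid = inst["InstanceId"]
        restore[iid] = {"instanceId": iid, "wasRunning": state == "running"}
    return restore


def instances_to_stop(restore: dict) -> list:
    """IDs to pass to StopInstances: only instances that were running."""
    return sorted(i for i, s in restore.items() if s["wasRunning"])


def instances_to_start(restore: dict) -> list:
    """IDs to pass to StartInstances on wakeup: same wasRunning filter."""
    return sorted(i for i, s in restore.items() if s["wasRunning"])
```

An instance that was already stopped is still recorded with `wasRunning: false`, so wakeup leaves it stopped.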

Restore Data Shape¶

Each instance is stored under its ID:

{
  "i-0abc123def456789a": { "instanceId": "i-0abc123def456789a", "wasRunning": true },
  "i-0def456789abc0123": { "instanceId": "i-0def456789abc0123", "wasRunning": false }
}

Prerequisites¶

Requirement	Details
Connector	`CloudProvider` with `type: aws`
IAM Permissions	`ec2:DescribeInstances`, `ec2:StopInstances`, `ec2:StartInstances`
Await Timeout	Default: 5 minutes

Limitations¶

ASG-managed instances are excluded — instances owned by Auto Scaling Groups are skipped to avoid conflicts with ASG desired-count reconciliation.
Karpenter-managed instances are excluded — same logic applies.
Elastic IPs remain associated through stop/start cycles.
EBS volumes are preserved; instance store data is lost on stop (standard EC2 behavior).

RDS¶

Type: rds · Connector: CloudProvider (AWS)

Manages RDS DB instances and Aurora clusters, optionally creating a snapshot before stopping. The selector can target resources by tags, by explicit IDs, or through discovery mode.

Shutdown Flow¶

Determine resource types — Based on the selector:
- Explicit instanceIds/clusterIds → resource types inferred from which IDs are provided.
- Tag-based or includeAll → requires discoverInstances and/or discoverClusters flags to be explicitly set.
Discover resources — Calls DescribeDBInstances and/or DescribeDBClusters with appropriate filters.
For each DB instance:
- Checks status is available (skips if not stoppable).
- If snapshotBeforeStop=true, creates a snapshot via CreateDBSnapshot and waits for it to complete (30-minute waiter).
- Calls StopDBInstance.
- Saves state: instance ID, previous status, snapshot ID if created.
For each DB cluster:
- Same logic via StopDBCluster and CreateDBClusterSnapshot.
Await (optional) — Polls until all resources reach the stopped status.

Wakeup Flow¶

Load restore data — Reads saved instance/cluster states.
Start resources — Calls StartDBInstance or StartDBCluster for each resource that was running before hibernation.
Await (optional) — Polls until all resources return to available status.

Restore Data Shape¶

Keys use a type prefix to distinguish instances from clusters:

{
  "instance:production-db": {
    "instanceId": "production-db",
    "wasStopped": false,
    "snapshotId": "production-db-hibernate-1711500000",
    "instanceType": "db.r5.2xlarge"
  },
  "cluster:aurora-prod": {
    "clusterId": "aurora-prod",
    "wasStopped": false,
    "snapshotId": "aurora-prod-hibernate-1711500000"
  }
}
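A sketch of how the type-prefixed keys could drive wakeup. It assumes that `wasStopped: true` marks a resource that was already stopped before hibernation and should therefore stay stopped; that reading and the `plan_wakeup` helper are illustrative assumptions, not confirmed behavior.

```python
def plan_wakeup(restore: dict) -> list:
    """Return (operation, identifier) pairs for resources that should be started."""
    calls = []
    for key, saved in sorted(restore.items()):
        kind, sep, identifier = key.partition(":")
        if not sep or not identifier:
            raise ValueError(f"malformed restore key: {key!r}")
        if saved.get("wasStopped"):
            continue  # assumed: it was not running before hibernation
        if kind == "instance":
            calls.append(("StartDBInstance", identifier))
        elif kind == "cluster":
            calls.append(("StartDBCluster", identifier))
        else:
            raise ValueError(f"unknown resource type in key: {key!r}")
    return calls
```

The prefix keeps an instance and a cluster with the same identifier from colliding in the same restore entry.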

Selector Modes¶

The RDS executor supports three mutually exclusive selection methods:

Mode	Fields	Discovery Flags Required?
Tag-based	`tags` or `excludeTags`	Yes — must set `discoverInstances` and/or `discoverClusters`
Explicit IDs	`instanceIds` and/or `clusterIds`	No — inferred from which IDs are provided
Discovery	`includeAll`	Yes — must set `discoverInstances` and/or `discoverClusters`

Warning

Setting tags without discoverInstances or discoverClusters results in a no-op — nothing will be discovered.
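The selector rules can be sketched as a small resolver that returns which resource types will be queried. The function name and the error on combined modes are assumptions for illustration; the field names come from the table above.

```python
def resolve_resource_types(selector: dict) -> dict:
    """Return {'instances': bool, 'clusters': bool} for a selector."""
    modes = []
    if selector.get("tags") or selector.get("excludeTags"):
        modes.append("tags")
    if selector.get("instanceIds") or selector.get("clusterIds"):
        modes.append("ids")
    if selector.get("includeAll"):
        modes.append("includeAll")
    if len(modes) != 1:
        raise ValueError(f"exactly one selection mode is required, got {modes}")

    if modes[0] == "ids":
        # Explicit IDs: types are inferred from which ID lists are present.
        return {"instances": bool(selector.get("instanceIds")),
                "clusters": bool(selector.get("clusterIds"))}
    # Tag-based and includeAll: types come only from the discovery flags.
    return {"instances": bool(selector.get("discoverInstances")),
            "clusters": bool(selector.get("discoverClusters"))}
```

A tag selector without discovery flags resolves to both types disabled, which is the no-op case the warning describes.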

Prerequisites¶

Requirement	Details
Connector	`CloudProvider` with `type: aws`
IAM Permissions	`rds:DescribeDBInstances`, `rds:DescribeDBClusters`, `rds:StopDBInstance`, `rds:StartDBInstance`, `rds:StopDBCluster`, `rds:StartDBCluster`, `rds:CreateDBSnapshot` and `rds:CreateDBClusterSnapshot` (if snapshots enabled)
Await Timeout	Default: 15 minutes

Limitations¶

Read replicas are not managed — only primary instances and clusters.
Aurora Serverless supports stop/start but auto-scaling behavior on wakeup may differ.
RDS Proxy connections are not managed by this executor.
The 7-day auto-restart limit imposed by AWS still applies — RDS automatically restarts instances that have been stopped for more than 7 days.

WorkloadScaler¶

Type: workloadscaler · Connector: K8SCluster

Manages Kubernetes workloads (Deployments, StatefulSets, ReplicaSets, or any CRD with a scale subresource) by scaling replicas to zero during hibernation and restoring original counts on wakeup.

Shutdown Flow¶

Resolve target namespaces — Uses namespace.literals (explicit list) or namespace.selector (label-based discovery).
Resolve workload kinds — Uses includedGroups (defaults to ["Deployment"]). Custom CRDs use the format group/version/resource (e.g., argoproj.io/v1alpha1/rollouts).
Discover workloads — Lists resources in each namespace, optionally filtered by workloadSelector labels.
For each workload:
- Reads the scale subresource via GetScale() to capture current replica count.
- Saves state: namespace, kind, name, replica count, GVR.
- Updates the scale subresource to replicas: 0.
Await (optional) — Polls until each workload's scale status reflects zero replicas.
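A sketch of how `includedGroups` entries and restore keys could be resolved. The mapping of built-in kind names to group/version/resource and the helper names are assumptions; the `group/version/resource` format and the `namespace/kind/name` key come from this section.

```python
# Assumption: built-in kinds are given by name and map to apps/v1 resources.
BUILTIN_KINDS = {
    "Deployment": ("apps", "v1", "deployments"),
    "StatefulSet": ("apps", "v1", "statefulsets"),
    "ReplicaSet": ("apps", "v1", "replicasets"),
}


def resolve_gvr(entry: str) -> tuple:
    """Turn an includedGroups entry into a (group, version, resource) tuple."""
    if entry in BUILTIN_KINDS:
        return BUILTIN_KINDS[entry]
    parts = entry.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected group/version/resource, got {entry!r}")
    return tuple(parts)


def restore_entry(namespace: str, kind: str, name: str, gvr: tuple, replicas: int) -> tuple:
    """Build the restore key and value saved before scaling to zero."""
    group, version, resource = gvr
    key = f"{namespace}/{kind}/{name}"
    value = {"group": group, "version": version, "resource": resource,
             "kind": kind, "namespace": namespace, "name": name, "replicas": replicas}
    return key, value
```

Storing the full group/version/resource alongside the replica count lets wakeup address the scale subresource without repeating discovery.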

Wakeup Flow¶

Load restore data — Reads saved workload states.
Restore replicas — For each workload, updates the scale subresource back to the original replica count.
Await (optional) — Polls until replica counts match the desired state.

Restore Data Shape¶

Keys use a namespace/kind/name format:

{
  "default/Deployment/api-server": {
    "group": "apps", "version": "v1", "resource": "deployments",
    "kind": "Deployment", "namespace": "default",
    "name": "api-server", "replicas": 3
  },
  "default/Deployment/worker": {
    "group": "apps", "version": "v1", "resource": "deployments",
    "kind": "Deployment", "namespace": "default",
    "name": "worker", "replicas": 2
  }
}

Prerequisites¶

Requirement	Details
Connector	`K8SCluster` with access to the target cluster
RBAC	`apps deployments/scale`, `apps statefulsets/scale`, `apps replicasets/scale` (get, update); `v1 namespaces` (list, get) for namespace discovery
Await Timeout	Default: 5 minutes

Limitations¶

Only works with resources that implement the Kubernetes scale subresource API.
Namespace-scoped only — does not work with cluster-scoped resources.
The executor does not check Pod readiness during wakeup; it relies on the workload controller's reconciliation.
Custom CRDs require the group/version/resource format in includedGroups.

NoOp¶

Type: noop · Connector: CloudProvider or K8SCluster (either works)

A testing executor that simulates hibernation operations without touching any real resources. Useful for validating schedules, execution strategies, DAG dependencies, and error recovery flows.

Shutdown Flow¶

Simulates work with a random delay between 0 and randomDelaySeconds.
If failureMode is "shutdown" or "both", returns a simulated error with the configured failureMessage.
Otherwise, generates restore data (parameters, timestamp, UUID) and returns success.

Wakeup Flow¶

Simulates work with the same random delay.
If failureMode is "wakeup" or "both", returns a simulated error.
Otherwise, returns success.

Parameters¶

Parameter	Default	Description
`randomDelaySeconds`	1	Upper bound of the random delay, in seconds (allowed range: 0–30)
`failureMode`	`"none"`	When to fail: `"none"`, `"shutdown"`, `"wakeup"`, `"both"`
`failureMessage`	(auto)	Custom error message for simulated failures
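The behavior of both flows can be sketched in one function. This is an illustration only: `noop_run`, the default failure message, and the restore data field names are hypothetical, while the parameter names and failure modes are the ones from the table.

```python
import random
import time
import uuid
from datetime import datetime, timezone


def noop_run(operation: str, params: dict, sleep=time.sleep, rng=random.random):
    """Simulate a 'shutdown' or 'wakeup' without touching real resources."""
    if operation not in ("shutdown", "wakeup"):
        raise ValueError(f"unknown operation: {operation!r}")
    max_delay = params.get("randomDelaySeconds", 1)
    if not 0 <= max_delay <= 30:
        raise ValueError("randomDelaySeconds must be between 0 and 30")

    # Random delay in [0, max_delay) seconds to imitate real work.
    sleep(rng() * max_delay)

    mode = params.get("failureMode", "none")
    if mode in (operation, "both"):
        # Hypothetical default message when failureMessage is not set.
        raise RuntimeError(params.get("failureMessage", f"simulated {operation} failure"))

    if operation == "shutdown":
        # Restore data: the parameters, a timestamp, and a UUID.
        return {"parameters": dict(params),
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "id": str(uuid.uuid4())}
    return None
```

Passing `sleep=lambda s: None` skips the delay, which is convenient in unit tests.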

Use Cases¶

Test scheduling logic without cloud credentials
Validate DAG dependency ordering
Test execution strategies (Sequential, Parallel, DAG, Staged)
Simulate error recovery and manual retry workflows
CI/CD integration tests

GKE¶

Type: gke · Connector: K8SCluster

Under Construction

The GKE executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.

Planned behavior: Manage GKE node pool scaling via the GCP Container API, similar to how the EKS executor manages managed node groups.

Planned Parameters¶

Parameter	Description
`nodePools`	List of GKE node pool names to hibernate (required)

CloudSQL¶

Type: cloudsql · Connector: CloudProvider (GCP)

Under Construction

The Cloud SQL executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.

Planned behavior: Stop and start Cloud SQL instances via the Cloud SQL Admin API, similar to how the RDS executor manages database instances.

Planned Parameters¶

Parameter	Description
`instanceName`	Cloud SQL instance name (required)
`project`	GCP project ID (required)

Choosing an Executor¶

I want to hibernate...	Use executor	Notes
EKS managed node groups	`eks`	Scales to zero; cluster stays up
Karpenter NodePools	`karpenter`	Deletes and recreates pools
Standalone EC2 instances	`ec2`	Stops/starts; excludes ASG-managed
RDS databases	`rds`	Supports instances, clusters, and pre-stop snapshots
Kubernetes Deployments/StatefulSets	`workloadscaler`	Scales replicas to zero
Argo Rollouts or other CRDs	`workloadscaler`	Use `group/version/resource` format in `includedGroups`
GKE node pools	`gke`	Not yet implemented
Cloud SQL instances	`cloudsql`	Not yet implemented

For the full parameter schema of each executor, see the Executor Parameters Reference.

Operational Guides: