Executors

Executors are the hands of the Hibernator operator. While the control plane (brain) decides when and in what order to act, executors know how to shut down and wake up a specific type of resource.

Executor Contract

Every executor implements three operations:

| Operation | Purpose |
| --- | --- |
| Validate | Verify parameters and connectivity before execution |
| Shutdown | Stop or scale down the resource, capturing restore metadata |
| WakeUp | Restore the resource to its pre-hibernation state using saved metadata |

Executors own idempotency — calling Shutdown on an already-stopped resource or WakeUp on an already-running resource must succeed without side effects.
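The contract can be pictured with a small sketch. This is a hypothetical Python rendering, not the operator's actual interface: the method names, the `names` parameter, and the in-memory resource map are all assumptions made for illustration.

```python
from typing import Protocol


class Executor(Protocol):
    """Hypothetical shape of the three-operation contract."""

    def validate(self, params: dict) -> None: ...
    def shutdown(self, params: dict) -> dict: ...
    def wake_up(self, params: dict, restore: dict) -> None: ...


class InMemoryExecutor:
    """Toy executor over a dict of resource name -> 'running' or 'stopped'."""

    def __init__(self, resources: dict) -> None:
        self.resources = resources

    def validate(self, params: dict) -> None:
        if "names" not in params:
            raise ValueError("params.names is required")

    def shutdown(self, params: dict) -> dict:
        restore = {}
        for name in params["names"]:
            was_running = self.resources.get(name) == "running"
            restore[name] = {"wasRunning": was_running}
            if was_running:  # already-stopped resources are left untouched
                self.resources[name] = "stopped"
        return restore

    def wake_up(self, params: dict, restore: dict) -> None:
        for name, saved in restore.items():
            # Only start what was running before and is stopped now.
            if saved["wasRunning"] and self.resources.get(name) == "stopped":
                self.resources[name] = "running"
```

Calling `shutdown` or `wake_up` a second time leaves the resource map unchanged, which is the idempotency property the contract asks for.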

Intent Preservation Contract

Hibernator implements a first-capture-wins intent preservation strategy to handle retries and partial failures during shutdown operations.

Demanded State

Each executor defines a demanded state — the condition that qualifies a resource for hibernator management:

| Executor | Demanded State | Intent Field |
| --- | --- | --- |
| EC2 | Instance state is running | wasRunning |
| EKS | Node group desiredSize > 0 | wasScaled |
| RDS | DB instance/cluster status is available | wasRunning |
| Karpenter | NodePool exists | (full spec captured) |
| WorkloadScaler | Workload replicas > 0 | wasScaled |

Only resources in their demanded state are captured and managed by hibernator. Resources not in demanded state are observed passively.

First-Capture-Wins Semantics

When a resource is first captured during a hibernation cycle:

  1. Intent is locked: The wasRunning or wasScaled value is preserved for the rest of that cycle, including retries
  2. Cycle tracking: A managedByCycleIDs map tracks which cycle first captured each resource
  3. Session preservation: On subsequent hibernation attempts with the same cycle ID (retries), the original intent is preserved even if the resource's current state has changed
  4. Fresh session: A different cycle ID starts fresh — only resources currently in demanded state are tracked

This ensures:

  • Retry safety: If shutdown fails and the user retries with the same cycle ID, the original intent is preserved
  • Consistency: Once hibernator decides a resource should be managed, that decision persists until successful wakeup (within the same session)
  • Clean slate: New hibernation operations (different cycle ID) get fresh tracking without stale data
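The semantics above can be sketched as a single merge function. This is a minimal illustration under stated assumptions: the function name, its arguments, and the choice to drop a previously tracked resource that is no longer in demanded state under a new cycle ID are interpretations of the rules above, not the operator's code.

```python
def capture(restore, managed_by_cycle, cycle_id, observed):
    """Merge observed resources into restore data with first-capture-wins.

    restore:          resource name -> {"wasRunning": bool}
    managed_by_cycle: resource name -> cycle ID that first captured it
    observed:         resource name -> True if currently in demanded state
    """
    for name, in_demanded_state in observed.items():
        if managed_by_cycle.get(name) == cycle_id:
            continue  # retry of the same cycle: the first capture is kept
        if in_demanded_state:
            # New cycle (or first sighting): fresh capture under this cycle ID.
            restore[name] = {"wasRunning": True}
            managed_by_cycle[name] = cycle_id
        else:
            # Assumption: a new cycle does not carry over resources that are
            # no longer in demanded state.
            restore.pop(name, None)
            managed_by_cycle.pop(name, None)
    return restore
```

With cycle "hib-001", a running instance is captured once; a retry that observes it stopped leaves `wasRunning` at true, while a later "hib-002" that observes it stopped does not track it.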

Stale Resource Eviction

If a resource is not reported for 3 consecutive hibernation cycles:

  1. The resource is evicted from restore data
  2. The resource is removed from managedByCycleIDs tracking
  3. On the next hibernation, the resource can be freshly captured

This prevents permanently retaining data for deleted or unmanageable resources while allowing temporary absences (e.g., API failures) without data loss.
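A rough sketch of the eviction rule follows. The `staleCount` field name matches the restore data structure documented on this page; the function name and the in-memory dictionaries are hypothetical.

```python
STALE_LIMIT = 3  # consecutive cycles without a report, as described above


def apply_reports(restore, status, managed_by_cycle, reported):
    """Reset counters for reported resources; age and evict the rest."""
    for name in list(restore):
        entry = status.setdefault(name, {"staleCount": 0})
        if name in reported:
            entry["staleCount"] = 0  # a report clears the streak
            continue
        entry["staleCount"] += 1
        if entry["staleCount"] >= STALE_LIMIT:
            # Evict from restore data and from cycle tracking together.
            del restore[name]
            del status[name]
            managed_by_cycle.pop(name, None)
```

Because a single report resets the counter, a resource that is missing for one or two cycles (for example during an API outage) keeps its restore data.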

Note

The managedByCycleIDs tracking is stored separately from resource state and is not visible in the resource data itself. It is used internally for idempotency and session management.

Edge Case Handling

Resource state changes between hibernation and wakeup:

  • If a resource is manually stopped/deleted after hibernation, the executor skips it during wakeup
  • The executor handles "resource not found" or "already in desired state" gracefully
  • Hibernator's contract is: "restore to the captured intent, but tolerate reality"

Example EC2 flow:

Cycle "hib-001" (first attempt):
  Instance running → captured with wasRunning=true, tracked in managedByCycleIDs

Retry "hib-001" (user restarts same operation):
  Instance still running → wasRunning=true preserved (same cycle ID)
  Instance stopped → marker preserved, state unchanged (user responsibility)

New cycle "hib-002" (fresh hibernation):
  Instance running → fresh capture with new cycle ID
  Instance stopped → not tracked (different cycle ID, not in demanded state)

WakeUp: Instance restored based on captured intent

How Executors Run

Executors do not run inside the controller. Instead, the controller creates an isolated Runner Job for each target. The runner:

  1. Loads the executor matching the target's type field
  2. Calls Validate to verify parameters
  3. Calls Shutdown or WakeUp depending on the operation
  4. Streams logs and progress to the control plane via gRPC
  5. Persists restore metadata in a ConfigMap (restore-data-{plan-name})

Each runner gets an ephemeral ServiceAccount with the minimum permissions needed.
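As a rough picture of the runner steps listed above, the sketch below dispatches by type, validates, runs the operation, and stores restore data under a per-target key. It is illustrative only: `EchoExecutor`, `run_target`, the `parameters` field, and the dictionary standing in for the ConfigMap are assumptions, and gRPC log streaming is left out.

```python
class EchoExecutor:
    """Stand-in executor used only to make the sketch runnable."""

    def validate(self, params):
        if not isinstance(params, dict):
            raise ValueError("parameters must be a mapping")

    def shutdown(self, params):
        return {"demo": {"wasRunning": True}}

    def wake_up(self, params, restore):
        self.woken = sorted(restore)


def run_target(registry, target, operation, restore_store):
    """Run one operation for one target (steps 1, 2, 3 and 5 above)."""
    executor = registry[target["type"]]      # step 1: pick executor by type
    params = target.get("parameters", {})
    executor.validate(params)                # step 2: validate before acting
    key = f"{target['name']}.json"           # per-target key in restore data
    if operation == "shutdown":
        restore_store[key] = executor.shutdown(params)  # steps 3 and 5
    elif operation == "wakeup":
        executor.wake_up(params, restore_store[key])    # step 3
    else:
        raise ValueError(f"unknown operation: {operation}")
```

Validation runs before either operation, so a misconfigured target fails before any resource is touched.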

Restore Data

During shutdown, executors capture metadata about the resource's current state (e.g., replica counts, scaling configs, instance IDs). This metadata is stored as JSON in a ConfigMap and used during wakeup to restore the resource to its exact pre-hibernation configuration.

The restore data ConfigMap is named hibernator-restore-{plan-name} with keys formatted as {target-name}.json.

Restore Data Timestamps

Each restore point entry contains timestamps that track different phases of the capture process:

Captured At

  • Meaning: The timestamp when the hibernator captured and initiated the save operation to the ConfigMap
  • Set when: When the accumulated data is ready to be persisted (just before the ConfigMap update)
  • Granularity: Per-target (all resources in the target share the same CapturedAt)
  • Use case: Historical tracking of when data was captured; audit and freshness checks from hibernator's perspective

Reported At (LastReportedAt)

  • Meaning: The timestamp when a specific resource's state was reported by the executor via callback
  • Set when: During SaveState when the executor reports each resource's state
  • Granularity: Per-resource (each resource has its own LastReportedAt)
  • Use case: Track idempotency within a hibernation cycle; detect stale resources

Timestamp Flow Example

Hibernation Cycle "cycle-001":

  Executor Shutdown:
    ├─ 10:00:00 → Discovers resource "app-server" → reports state
    │             LastReportedAt["app-server"] = 10:00:00
    ├─ 10:00:05 → Discovers resource "worker-1" → reports state
    │             LastReportedAt["worker-1"] = 10:00:05
    └─ 10:00:10 → Flush to ConfigMap completes successfully
                  CapturedAt = 10:00:10 (for entire target)

  On Retry (same cycle ID):
    ├─ 10:05:00 → "app-server" already reported → preserves original state
    │             LastReportedAt["app-server"] stays 10:00:00
    └─ 10:05:10 → New resources reported → updated LastReportedAt
                  CapturedAt = 10:05:10 (updated on successful save)
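The flow above can be approximated in a few lines. This is a sketch under assumptions: `save_state` and `flush` are hypothetical helper names, and the entry dictionary only mimics the `state`, `status`, and `capturedAt` fields of the restore data.

```python
from datetime import datetime, timezone


def save_state(entry, name, state, now):
    """Record one resource report; the first report in a cycle wins."""
    if name in entry["state"]:
        return  # already reported in this cycle: keep state and timestamp
    entry["state"][name] = state
    entry["status"][name] = {"staleCount": 0, "lastReportedAt": now.isoformat()}


def flush(entry, now):
    """Stamp the per-target capture time when the data is persisted."""
    entry["capturedAt"] = now.isoformat()
    return entry


# First attempt at 10:00, retry of the same cycle at 10:05.
first = datetime(2026, 4, 30, 10, 0, 0, tzinfo=timezone.utc)
retry = datetime(2026, 4, 30, 10, 5, 0, tzinfo=timezone.utc)
entry = {"state": {}, "status": {}}
save_state(entry, "app-server", {"wasRunning": True}, first)
flush(entry, first)
save_state(entry, "app-server", {"wasRunning": False}, retry)  # ignored
save_state(entry, "worker-2", {"wasRunning": True}, retry)     # new resource
flush(entry, retry)
```

After the retry, "app-server" keeps its original state and report time, "worker-2" carries the retry time, and `capturedAt` moves forward for the whole target.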

Data Structure

Each target's restore data includes:

{
  "target": "my-cluster",
  "executor": "eks",
  "version": 1,
  "isLive": true,
  "cycleID": "cycle-001",
  "createdAt": "2026-04-30T10:00:00Z",
  "capturedAt": "2026-04-30T10:00:10Z",
  "state": {
    "app-nodes": { "desired": 3, "min": 1, "max": 5 }
  },
  "status": {
    "app-nodes": {
      "staleCount": 0,
      "lastReportedAt": "2026-04-30T10:00:00Z"
    }
  }
}

Built-in Executors

| Executor | Resource | Provider | Connector | Status |
| --- | --- | --- | --- | --- |
| eks | EKS Managed Node Groups | AWS | CloudProvider | ✅ Implemented |
| karpenter | Karpenter NodePools | Kubernetes | K8SCluster | ✅ Implemented |
| ec2 | EC2 Instances | AWS | CloudProvider | ✅ Implemented |
| rds | RDS Instances & Clusters | AWS | CloudProvider | ✅ Implemented |
| workloadscaler | Kubernetes Workloads | Kubernetes | K8SCluster | ✅ Implemented |
| noop | None (testing) | | Any | ✅ Implemented |
| gke | GKE Node Pools | GCP | K8SCluster | 🚧 Not Implemented |
| cloudsql | Cloud SQL Instances | GCP | CloudProvider | 🚧 Not Implemented |

EKS

Type: eks · Connector: CloudProvider (AWS)

Manages EKS Managed Node Groups by scaling them to zero during hibernation and restoring original scaling configuration on wakeup.

Note

This executor only handles Managed Node Groups via the AWS EKS API. For Karpenter-managed NodePools, use the separate karpenter executor.

Shutdown Flow

  1. Discover node groups — If nodeGroups is empty, lists all node groups in the cluster via ListNodegroups. Otherwise, uses the specified list.
  2. Capture state — For each node group, calls DescribeNodegroup to record the current desiredSize, minSize, and maxSize.
  3. Persist restore data — Saves the scaling configuration per node group to the restore ConfigMap.
  4. Scale to zero — Calls UpdateNodegroupConfig setting minSize=0 and desiredSize=0 (keeps maxSize unchanged).
  5. Await (optional) — If awaitCompletion is enabled, polls until all nodes with label eks.amazonaws.com/nodegroup={name} are deleted.

Wakeup Flow

  1. Load restore data — Reads the saved scaling configuration from the ConfigMap.
  2. Restore scaling — For each node group, calls UpdateNodegroupConfig with the original desiredSize, minSize, and maxSize.
  3. Await (optional) — Polls DescribeNodegroup until the node group status returns to ACTIVE and node counts match.

Restore Data Shape

Each node group is stored under its name:

{
  "app-nodes": { "desired": 3, "min": 1, "max": 5 },
  "worker-nodes": { "desired": 2, "min": 0, "max": 4 }
}
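To make the relationship between the saved values and the two API calls concrete, the sketch below builds the request arguments for shutdown and wakeup from one saved entry. The keyword names follow boto3's EKS client method `update_nodegroup_config`; using boto3 is an assumption for illustration, and the helper names are hypothetical.

```python
def scale_to_zero_request(cluster, nodegroup, saved):
    """Shutdown: minSize and desiredSize go to 0, maxSize is kept."""
    return {
        "clusterName": cluster,
        "nodegroupName": nodegroup,
        "scalingConfig": {
            "minSize": 0,
            "desiredSize": 0,
            "maxSize": saved["max"],
        },
    }


def restore_request(cluster, nodegroup, saved):
    """Wakeup: put back the sizes captured before hibernation."""
    return {
        "clusterName": cluster,
        "nodegroupName": nodegroup,
        "scalingConfig": {
            "minSize": saved["min"],
            "desiredSize": saved["desired"],
            "maxSize": saved["max"],
        },
    }
```

With a boto3 client these dictionaries could be passed as `eks.update_nodegroup_config(**request)`; no AWS call is made in the sketch itself.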

Prerequisites

| Requirement | Details |
| --- | --- |
| Connector | CloudProvider with type: aws |
| IAM Permissions | eks:ListNodegroups, eks:DescribeNodegroup, eks:UpdateNodegroupConfig |
| Await Timeout | Default: 10 minutes |

Limitations

  • Does not drain nodes — relies on AWS default graceful termination behavior.
  • The EKS cluster itself stays up; only node groups are scaled.
  • Multi-AZ distribution is handled transparently by AWS.

Karpenter

Type: karpenter · Connector: K8SCluster

Manages Karpenter NodePools by deleting them during hibernation (which tells Karpenter to drain and remove all managed nodes) and recreating them with the original spec on wakeup.

Shutdown Flow

  1. Discover NodePools — If nodePools is empty, lists all NodePools via the karpenter.sh/v1 API. Otherwise, uses the specified names.
  2. Capture state — For each NodePool, retrieves the full spec and labels using a Get call.
  3. Persist restore data — Saves the complete NodePool definition (name, spec, labels) to the restore ConfigMap.
  4. Delete NodePools — Calls Delete on each NodePool. Karpenter automatically evicts pods and terminates the underlying nodes.
  5. Await (optional) — Polls until all nodes with label karpenter.sh/nodepool={name} are gone.

Wakeup Flow

  1. Load restore data — Reads saved NodePool definitions.
  2. Recreate NodePools — Reconstructs each NodePool object with the original spec, labels, and API version, then calls Create.
  3. Await (optional) — Polls NodePool status until the Ready condition is True.

Restore Data Shape

Each NodePool is stored under its name with the full spec:

{
  "default": {
    "name": "default",
    "spec": { "template": {}, "limits": {}, "disruption": {} },
    "labels": { "team": "platform" }
  }
}
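The recreate step can be pictured as turning one saved entry back into a manifest. This is a sketch, not the executor's code: `rebuild_nodepool` is a hypothetical helper, and the manifest is a plain dictionary rather than a typed client object.

```python
import copy


def rebuild_nodepool(saved):
    """Build a creatable NodePool manifest from one restore data entry."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {
            "name": saved["name"],
            "labels": dict(saved.get("labels", {})),
            # Server-populated fields such as resourceVersion and uid are
            # deliberately absent: they must not be set on a create request.
        },
        "spec": copy.deepcopy(saved["spec"]),
    }
```

Only the name, labels, and spec are carried over, which matches the fields the shutdown flow persists.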

Prerequisites

| Requirement | Details |
| --- | --- |
| Connector | K8SCluster with access to the target cluster |
| RBAC | karpenter.sh nodepools (get, list, delete, create), v1 nodes (list, get) |
| Await Timeout | Default: 5 minutes |

Limitations

  • Assumes karpenter.sh/v1 API version. Earlier Karpenter versions using v1beta1 may require adaptation.
  • Karpenter respects Pod Disruption Budgets during eviction — the shutdown may not complete within the timeout if PDBs block.
  • NodePool admission webhooks with side effects could interfere with deletion or recreation.

EC2

Type: ec2 · Connector: CloudProvider (AWS)

Manages EC2 instances by stopping running instances during hibernation and starting them back on wakeup. Automatically excludes instances managed by Auto Scaling Groups or Karpenter.

Shutdown Flow

  1. Discover instances — Calls DescribeInstances with server-side filters (selector.tags as AWS Filters or selector.instanceIds as explicit IDs). When selector.tagSelector is used, it applies as a client-side filter after fetching. Filters out terminated/shutting-down instances and those managed by ASGs or Karpenter.
  2. Capture state — Records each instance's ID and whether it was running (wasRunning).
  3. Persist restore data — Saves instance states to the restore ConfigMap.
  4. Stop instances — Calls StopInstances for all instances that were running. Already-stopped instances are skipped.
  5. Await (optional) — Polls DescribeInstances until all instances reach the stopped state.

Wakeup Flow

  1. Load restore data — Reads saved instance states.
  2. Start instances — Calls StartInstances only for instances where wasRunning=true. Instances that were already stopped before hibernation remain stopped.
  3. Await (optional) — Polls until all started instances reach the running state.

Restore Data Shape

Each instance is stored under its ID:

{
  "i-0abc123def456789a": { "instanceId": "i-0abc123def456789a", "wasRunning": true },
  "i-0def456789abc0123": { "instanceId": "i-0def456789abc0123", "wasRunning": false }
}
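The capture and wakeup decisions can be sketched as two pure functions. The input mirrors the instance dictionaries returned by boto3's `describe_instances` (an assumption for illustration). Detecting ASG ownership through the `aws:autoscaling:groupName` tag is also an assumption about how the exclusion might work; Karpenter detection is omitted because this page does not describe it.

```python
ASG_TAG = "aws:autoscaling:groupName"  # tag AWS sets on ASG-launched instances
SKIPPED_STATES = ("terminated", "shutting-down")


def plan_shutdown(instances):
    """Return (restore data, instance IDs to stop) for discovered instances."""
    restore, to_stop = {}, []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        state = inst["State"]["Name"]
        if ASG_TAG in tags or state in SKIPPED_STATES:
            continue  # excluded from management entirely
        iid = inst["InstanceId"]
        was_running = state == "running"
        restore[iid] = {"instanceId": iid, "wasRunning": was_running}
        if was_running:
            to_stop.append(iid)  # already-stopped instances are skipped
    return restore, to_stop


def plan_wakeup(restore):
    """Only instances that were running before hibernation are started."""
    return [iid for iid, saved in restore.items() if saved["wasRunning"]]
```

An instance that was already stopped still appears in the restore data with `wasRunning` set to false, so wakeup leaves it stopped.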

Prerequisites

| Requirement | Details |
| --- | --- |
| Connector | CloudProvider with type: aws |
| IAM Permissions | ec2:DescribeInstances, ec2:StopInstances, ec2:StartInstances |
| Await Timeout | Default: 5 minutes |

Limitations

  • ASG-managed instances are excluded — instances owned by Auto Scaling Groups are skipped to avoid conflicts with ASG desired-count reconciliation.
  • Karpenter-managed instances are excluded — same logic applies.
  • Elastic IPs remain associated through stop/start cycles.
  • EBS volumes are preserved; instance store data is lost on stop (standard EC2 behavior).

RDS

Type: rds · Connector: CloudProvider (AWS)

Manages RDS DB instances and Aurora clusters with support for optional snapshot creation before stopping. Resources can be targeted by tags, by explicit IDs, or through discovery mode.

Shutdown Flow

  1. Determine resource types — Based on the selector:
    • Explicit instanceIds/clusterIds → resource types inferred from which IDs are provided.
    • Tag-based or includeAll → requires discoverInstances and/or discoverClusters flags to be explicitly set.
  2. Discover resources — Calls DescribeDBInstances and/or DescribeDBClusters with appropriate filters.
  3. For each DB instance:
    • Checks status is available (skips if not stoppable).
    • If snapshotBeforeStop=true, creates a snapshot via CreateDBSnapshot and waits for it to complete (30-minute waiter).
    • Calls StopDBInstance.
    • Saves state: instance ID, previous status, snapshot ID if created.
  4. For each DB cluster:
    • Same logic via StopDBCluster and CreateDBClusterSnapshot.
  5. Await (optional) — Polls until all resources reach the stopped status.

Wakeup Flow

  1. Load restore data — Reads saved instance/cluster states.
  2. Start resources — Calls StartDBInstance or StartDBCluster for each resource that was running before hibernation.
  3. Await (optional) — Polls until all resources return to available status.

Restore Data Shape

Keys use a type prefix to distinguish instances from clusters:

{
  "instance:production-db": {
    "instanceId": "production-db",
    "wasStopped": false,
    "snapshotId": "production-db-hibernate-1711500000",
    "instanceType": "db.r5.2xlarge"
  },
  "cluster:aurora-prod": {
    "clusterId": "aurora-prod",
    "wasStopped": false,
    "snapshotId": "aurora-prod-hibernate-1711500000"
  }
}

Selector Modes

The RDS executor supports three mutually exclusive selection methods:

| Mode | Fields | Discovery Flags Required? |
| --- | --- | --- |
| Tag-based | tags or excludeTags | Yes, must set discoverInstances and/or discoverClusters |
| Explicit IDs | instanceIds and/or clusterIds | No, inferred from which IDs are provided |
| Discovery | includeAll | Yes, must set discoverInstances and/or discoverClusters |

Warning

Setting tags without discoverInstances or discoverClusters results in a no-op — nothing will be discovered.
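The selector rules can be summarized in a short resolver. This is a sketch only: the function name is hypothetical, and it assumes the discovery flags sit alongside the selector fields in one mapping, which this page does not confirm.

```python
def resolve_resource_types(selector):
    """Return which RDS resource types a selector will touch."""
    explicit = bool(selector.get("instanceIds") or selector.get("clusterIds"))
    tagged = bool(selector.get("tags") or selector.get("excludeTags"))
    include_all = bool(selector.get("includeAll"))
    if sum([explicit, tagged, include_all]) > 1:
        raise ValueError("selector modes are mutually exclusive")

    types = set()
    if explicit:
        # Explicit IDs: types are inferred from which lists are present.
        if selector.get("instanceIds"):
            types.add("instance")
        if selector.get("clusterIds"):
            types.add("cluster")
        return types
    # Tag-based and includeAll modes rely on the discovery flags.
    if selector.get("discoverInstances"):
        types.add("instance")
    if selector.get("discoverClusters"):
        types.add("cluster")
    return types  # an empty set means nothing is discovered
```

A tag selector without either discovery flag resolves to an empty set, which is the no-op case described in the warning.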

Prerequisites

| Requirement | Details |
| --- | --- |
| Connector | CloudProvider with type: aws |
| IAM Permissions | rds:DescribeDBInstances, rds:DescribeDBClusters, rds:StopDBInstance, rds:StartDBInstance, rds:StopDBCluster, rds:StartDBCluster, plus rds:CreateDBSnapshot and rds:CreateDBClusterSnapshot (if snapshots enabled) |
| Await Timeout | Default: 15 minutes |

Limitations

  • Read replicas are not managed — only primary instances and clusters.
  • Aurora Serverless supports stop/start but auto-scaling behavior on wakeup may differ.
  • RDS Proxy connections are not managed by this executor.
  • The 7-day auto-restart limit imposed by AWS still applies — RDS automatically restarts instances that have been stopped for more than 7 days.

WorkloadScaler

Type: workloadscaler · Connector: K8SCluster

Manages Kubernetes workloads (Deployments, StatefulSets, ReplicaSets, or any CRD with a scale subresource) by scaling replicas to zero during hibernation and restoring original counts on wakeup.

Shutdown Flow

  1. Resolve target namespaces — Uses namespace.literals (explicit list) or namespace.selector (label-based discovery).
  2. Resolve workload kinds — Uses includedGroups (defaults to ["Deployment"]). Custom CRDs use the format group/version/resource (e.g., argoproj.io/v1alpha1/rollouts).
  3. Discover workloads — Lists resources in each namespace, optionally filtered by workloadSelector labels.
  4. For each workload:
    • Reads the scale subresource via GetScale() to capture current replica count.
    • Saves state: namespace, kind, name, replica count, GVR.
    • Updates the scale subresource to replicas: 0.
  5. Await (optional) — Polls until each workload's scale status reflects zero replicas.

Wakeup Flow

  1. Load restore data — Reads saved workload states.
  2. Restore replicas — For each workload, updates the scale subresource back to the original replica count.
  3. Await (optional) — Polls until replica counts match the desired state.

Restore Data Shape

Keys use a namespace/kind/name format:

{
  "default/Deployment/api-server": {
    "group": "apps", "version": "v1", "resource": "deployments",
    "kind": "Deployment", "namespace": "default",
    "name": "api-server", "replicas": 3
  },
  "default/Deployment/worker": {
    "group": "apps", "version": "v1", "resource": "deployments",
    "kind": "Deployment", "namespace": "default",
    "name": "worker", "replicas": 2
  }
}
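The sketch below shows how an includedGroups entry might be resolved to a group/version/resource triple and how the restore key is formed. The built-in mapping and both helper names are assumptions for illustration; only the group/version/resource format and the namespace/kind/name key come from this page.

```python
BUILTIN_KINDS = {
    "Deployment": ("apps", "v1", "deployments"),
    "StatefulSet": ("apps", "v1", "statefulsets"),
    "ReplicaSet": ("apps", "v1", "replicasets"),
}


def resolve_gvr(entry):
    """Resolve an includedGroups entry to (group, version, resource)."""
    if entry in BUILTIN_KINDS:
        return BUILTIN_KINDS[entry]
    parts = entry.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected group/version/resource, got {entry!r}")
    return tuple(parts)


def restore_key(namespace, kind, name):
    """Key under which one workload is stored in restore data."""
    return f"{namespace}/{kind}/{name}"
```

For example, `resolve_gvr("argoproj.io/v1alpha1/rollouts")` yields the triple needed to address the Rollout scale subresource, and `restore_key("default", "Deployment", "api-server")` reproduces the first key in the sample above.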

Prerequisites

| Requirement | Details |
| --- | --- |
| Connector | K8SCluster with access to the target cluster |
| RBAC | apps deployments/scale, apps statefulsets/scale, apps replicasets/scale (get, update); v1 namespaces (list, get) for namespace discovery |
| Await Timeout | Default: 5 minutes |

Limitations

  • Only works with resources that implement the Kubernetes scale subresource API.
  • Namespace-scoped only — does not work with cluster-scoped resources.
  • The executor does not check Pod readiness during wakeup; it relies on the workload controller's reconciliation.
  • Custom CRDs require the group/version/resource format in includedGroups.

NoOp

Type: noop · Connector: CloudProvider or K8SCluster (either works)

A testing executor that simulates hibernation operations without touching any real resources. Useful for validating schedules, execution strategies, DAG dependencies, and error recovery flows.

Shutdown Flow

  1. Simulates work with a random delay between 0 and randomDelaySeconds.
  2. If failureMode is "shutdown" or "both", returns a simulated error with the configured failureMessage.
  3. Otherwise, generates restore data (parameters, timestamp, UUID) and returns success.

Wakeup Flow

  1. Simulates work with the same random delay.
  2. If failureMode is "wakeup" or "both", returns a simulated error.
  3. Otherwise, returns success.

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| randomDelaySeconds | 1 | Maximum random delay (0–30 seconds) |
| failureMode | "none" | When to fail: "none", "shutdown", "wakeup", "both" |
| failureMessage | (auto) | Custom error message for simulated failures |
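The behavior described by these parameters can be approximated as follows. This is a sketch, not the executor's implementation: `noop_run` and `SimulatedFailure` are hypothetical names, clamping the delay to the 0 to 30 range is an interpretation of the table, and the default failure message is invented for the example.

```python
import random
import time
import uuid
from datetime import datetime, timezone


class SimulatedFailure(Exception):
    """Raised when failureMode asks the operation to fail."""


def noop_run(operation, params, sleep=time.sleep):
    """Simulate 'shutdown' or 'wakeup' without touching real resources."""
    # Assumption: out-of-range delays are clamped to 0..30 seconds.
    max_delay = min(max(float(params.get("randomDelaySeconds", 1)), 0.0), 30.0)
    sleep(random.uniform(0.0, max_delay))

    mode = params.get("failureMode", "none")
    if mode in (operation, "both"):
        message = params.get("failureMessage", f"simulated {operation} failure")
        raise SimulatedFailure(message)

    if operation == "shutdown":
        # Restore data: parameters, timestamp, and a UUID, as described above.
        return {
            "parameters": dict(params),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "id": str(uuid.uuid4()),
        }
    return None
```

Passing a no-op `sleep` function keeps tests fast, for example `noop_run("shutdown", {"failureMode": "wakeup"}, sleep=lambda s: None)` returns restore data immediately.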

Use Cases

  • Test scheduling logic without cloud credentials
  • Validate DAG dependency ordering
  • Test execution strategies (Sequential, Parallel, DAG, Staged)
  • Simulate error recovery and manual retry workflows
  • CI/CD integration tests

GKE

Type: gke · Connector: K8SCluster

Under Construction

The GKE executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.

Planned behavior: Manage GKE node pool scaling via the GCP Container API, similar to how the EKS executor manages managed node groups.

Planned Parameters

| Parameter | Description |
| --- | --- |
| nodePools | List of GKE node pool names to hibernate (required) |

CloudSQL

Type: cloudsql · Connector: CloudProvider (GCP)

Under Construction

The Cloud SQL executor is not yet implemented. The codebase contains a placeholder that validates parameters but does not make actual GCP API calls. Do not use in production.

Planned behavior: Stop and start Cloud SQL instances via the Cloud SQL Admin API, similar to how the RDS executor manages database instances.

Planned Parameters

| Parameter | Description |
| --- | --- |
| instanceName | Cloud SQL instance name (required) |
| project | GCP project ID (required) |

Choosing an Executor

| I want to hibernate... | Use executor | Notes |
| --- | --- | --- |
| EKS managed node groups | eks | Scales to zero; cluster stays up |
| Karpenter NodePools | karpenter | Deletes and recreates pools |
| Standalone EC2 instances | ec2 | Stops/starts; excludes ASG-managed |
| RDS databases | rds | Supports instances, clusters, and pre-stop snapshots |
| Kubernetes Deployments/StatefulSets | workloadscaler | Scales replicas to zero |
| Argo Rollouts or other CRDs | workloadscaler | Use group/version/resource format in includedGroups |
| GKE node pools | gke | 🚧 Not yet implemented |
| Cloud SQL instances | cloudsql | 🚧 Not yet implemented |

For the full parameter schema of each executor, see the Executor Parameters Reference.

Operational Guides: