Metrics Reference¶

Hibernator exposes Prometheus metrics via the controller's metrics endpoint (default: :8080/metrics). These metrics provide observability into reconciliation, execution, internal pipeline health, and notification delivery.

Quick Start: Verify Metrics¶

To quickly verify that metrics are being exposed from the controller, you can use curl from within the cluster or via a port-forward.

1. Port-forward the controller¶

kubectl port-forward -n hibernator-system deployment/hibernator-controller 8080:8080

2. Query the metrics endpoint¶

curl -s http://localhost:8080/metrics | grep hibernator_

3. Check for specific metrics¶

Verify execution metrics are flowing:

curl -s http://localhost:8080/metrics | grep hibernator_execution_total

Scraping Metrics¶

The controller exposes metrics at the path configured by --metrics-bind-address (default :8080). To scrape with Prometheus, add a ServiceMonitor or a scrape config targeting the controller pod.

Execution Metrics¶

Metrics for hibernation and wakeup operations against targets.

Metric	Type	Labels	Description
`hibernator_execution_total`	Counter	`plan`, `operation`, `target_type`, `status`	Total number of hibernation and wakeup operations
`hibernator_execution_duration_seconds`	Histogram	`plan`, `operation`, `target_type`, `status`	Duration of hibernation and wakeup operations. Buckets: 1 s to ~17 min (exponential)

Label values:

operation: Hibernate, WakeUp
target_type: executor type (e.g., eks, rds, ec2, karpenter, workloadscaler)
status: success, failure

Reconciliation Metrics¶

Metrics for the HibernatePlan reconciliation loop.

Metric	Type	Labels	Description
`hibernator_reconcile_total`	Counter	`plan`, `phase`, `result`	Total number of HibernatePlan reconciliations
`hibernator_reconcile_duration_seconds`	Histogram	`plan`, `phase`	Duration of HibernatePlan reconciliation
`hibernator_active_plans`	Gauge	`phase`	Number of active HibernatePlans by phase

Job Metrics¶

Metrics for runner Jobs created by the controller.

Metric	Type	Labels	Description
`hibernator_jobs_created_total`	Counter	`plan`, `target`	Total number of runner Jobs created
`hibernator_job_failures_total`	Counter	`plan`, `target`	Total number of runner Job failures

Label values:

plan: HibernatePlan name
target: Target name

Pipeline Metrics¶

Internal metrics for the async phase-driven reconciler pipeline (Coordinator, Workers, watchable subscriptions).

Metric	Type	Labels	Description
`hibernator_watchable_subscribe_total`	Counter	`runner`, `message`, `status`	Total watchable subscription handler invocations
`hibernator_watchable_subscribe_duration_seconds`	Histogram	`runner`, `message`	Duration of watchable subscription handler processing
`hibernator_worker_goroutines`	Gauge	—	Number of live plan Worker goroutines managed by the Coordinator
`hibernator_enqueue_drop_total`	Counter	`plan`	Plan requeue events dropped because the enqueue channel was full

Label values:

status (subscribe): success, error, panic

Note

A non-zero hibernator_enqueue_drop_total signals backpressure on the controller-runtime work queue. Affected plans are reconciled on the next natural trigger (schedule tick, annotation change), but the time-based requeue was silently skipped.

Status Writer Metrics¶

Metrics for the per-object status writer that batches and deduplicates API server writes.

Metric	Type	Labels	Description
`hibernator_status_writer_active_objects`	Gauge	`type`, `key`	Number of objects with an active status-writer goroutine
`hibernator_status_writer_updates_total`	Counter	`type`, `key`	Total status updates successfully written to the API server
`hibernator_status_writer_noop_total`	Counter	`type`, `key`	Status update attempts skipped due to unchanged status
`hibernator_status_writer_errors_total`	Counter	`type`, `key`, `event`	Errors during status write operations

Label values:

type: HibernatePlan, ScheduleException
key: namespace/name
event (errors): pre_hook, apply, post_hook

Notification Metrics¶

Metrics for the async notification dispatcher. See the Notifications guide for configuration details.

Metric	Type	Labels	Description
`hibernator_notification_sent_total`	Counter	`sink_type`, `event`	Successfully delivered notifications
`hibernator_notification_errors_total`	Counter	`sink_type`, `event`	Failed notification dispatch attempts
`hibernator_notification_latency_seconds`	Histogram	`sink_type`	End-to-end dispatch latency (Secret lookup + render + HTTP POST)
`hibernator_notification_drop_total`	Counter	`sink_type`, `event`	Notifications dropped (dispatcher shutdown or buffer full)

Label values:

sink_type: slack, telegram, webhook
event: Start, Success, Failure, Recovery, PhaseChange

Example Queries¶

# Notification failure rate over 5 minutes
rate(hibernator_notification_errors_total[5m])

# Average notification latency by sink type
rate(hibernator_notification_latency_seconds_sum[5m]) / rate(hibernator_notification_latency_seconds_count[5m])

# Dropped notifications (indicates dispatcher overload)
increase(hibernator_notification_drop_total[1h])

Alerting Examples¶

groups:
  - name: hibernator
    rules:
      - alert: HibernatorExecutionFailure
        expr: increase(hibernator_execution_total{status="failure"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hibernation execution failure for {{ $labels.plan }}"

      - alert: HibernatorNotificationErrors
        expr: rate(hibernator_notification_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Notification delivery failing for sink {{ $labels.sink_type }}"

      - alert: HibernatorEnqueueDrops
        expr: increase(hibernator_enqueue_drop_total[30m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Plan requeue events dropped — controller may be under pressure"