Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana

aggregator_unavailable_apiservice

Gauge of APIServices which are marked as unavailable broken down by APIService name.

apiserver_admission_webhook_admission_duration_seconds_bucket

Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).

apiserver_current_inflight_requests

Maximum number of in-flight requests in use on this API server, per request kind, over the last second.

apiserver_envelope_encryption_dek_cache_fill_percent

Percent of the cache slots currently occupied by cached DEKs.

apiserver_flowcontrol_current_executing_requests

Number of requests in initial (for a WATCH) or any (for a non-WATCH) execution stage in the API Priority and Fairness subsystem.

apiserver_flowcontrol_rejected_requests_total

Number of requests rejected by the API Priority and Fairness subsystem.

apiserver_flowcontrol_request_concurrency_limit

Nominal number of execution seats configured for each priority level.

apiserver_flowcontrol_request_execution_seconds_bucket

The bucketed histogram of duration of initial stage (for a WATCH) or any (for a non-WATCH) stage of request execution in the API Priority and Fairness subsystem.

apiserver_flowcontrol_request_queue_length_after_enqueue_count

The count of observations of queue length in the API Priority and Fairness subsystem, as seen by each request after it is enqueued.

apiserver_request

Indicates an API server request.

apiserver_requested_deprecated_apis

Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release.

apiserver_request_duration_seconds

Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.

apiserver_request_duration_seconds_bucket

The bucketed histogram of response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
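As an illustration of how a bucketed histogram such as this one is typically read, the sketch below estimates a latency quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. This is a simplified approximation of the idea, not the query the solution's dashboards use; the function name `histogram_quantile` and the sample bucket counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    the shape exposed by *_bucket series through their `le` label.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sample: 100 requests, 50 finished within 0.1 s, 90 within 0.5 s.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

The estimate is only as precise as the bucket boundaries allow: a quantile that lands in a wide bucket is interpolated, not measured.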

apiserver_request_slo_duration_seconds

The Service Level Objective (SLO) response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.

apiserver_request_terminations_total

Number of requests which apiserver terminated in self-defense.

apiserver_request_total

Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
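Because this counter is broken out by HTTP response code, it can be used to derive an error ratio. The following is a minimal sketch, assuming the per-code values have already been reduced to increases over some time window; the function name `error_ratio` and the sample numbers are hypothetical.

```python
def error_ratio(requests_by_code):
    """Return the fraction of requests that ended with a 5xx response code.

    `requests_by_code` maps an HTTP status code (as a string, matching the
    `code` label) to the number of requests seen in the window.
    """
    total = sum(requests_by_code.values())
    if total == 0:
        return 0.0  # no traffic in the window, so no errors either
    errors = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return errors / total


# Hypothetical window: 1000 requests, 20 of which returned a 5xx code.
window = {"200": 950, "404": 30, "500": 15, "503": 5}
ratio = error_ratio(window)
```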

container_cpu_usage_seconds_total

Cumulative CPU time consumed, in seconds.
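Cumulative counters like this one are usually more useful as a per-second rate than as a raw value. The sketch below derives an average rate from two samples and treats a decrease as a counter reset (for example, after a container restart). It is a simplification for illustration; the function name `counter_rate` and the sample values are hypothetical.

```python
def counter_rate(t0, v0, t1, v1):
    """Average per-second increase of a cumulative counter between two samples.

    t0, t1 are sample timestamps in seconds; v0, v1 are the counter values.
    """
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # The counter went backwards, so assume it was reset to zero in between.
        delta = v1
    return delta / (t1 - t0)


# Hypothetical samples 60 s apart: 30 CPU-seconds consumed, i.e. half a core on average.
cores_used = counter_rate(0, 100.0, 60, 130.0)
```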

container_fs_reads_bytes_total

Cumulative count of bytes read.

container_fs_reads_total

Cumulative count of reads completed.

container_fs_writes_bytes_total

Cumulative count of bytes written.

container_fs_writes_total

Cumulative count of writes completed.

container_memory_cache

Total page cache memory.

container_memory_rss

Size of RSS.

container_memory_swap

Container swap usage.

container_memory_working_set_bytes

Current working set.

container_network_receive_bytes_total

Cumulative count of bytes received.

container_network_receive_packets_dropped_total

Cumulative count of packets dropped while receiving.

container_network_receive_packets_total

Cumulative count of packets received.

container_network_transmit_bytes_total

Cumulative count of bytes transmitted.

container_network_transmit_packets_dropped_total

Cumulative count of packets dropped while transmitting.

container_network_transmit_packets_total

Cumulative count of packets transmitted.

etcd_request_duration_seconds_bucket

The bucketed histogram of etcd request latency in seconds for each operation and object type.

go_goroutines

Number of goroutines that currently exist.

go_threads

Number of OS threads created.

kubelet_cgroup_manager_duration_seconds_bucket

The bucketed histogram of duration in seconds for cgroup manager operations. Broken down by method.

kubelet_cgroup_manager_duration_seconds_count

The count of duration in seconds for cgroup manager operations. Broken down by method.

kubelet_node_config_error

This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise.

kubelet_node_name

The node's name. The count is always 1.

kubelet_pleg_relist_duration_seconds_bucket

The bucketed histogram of duration in seconds for relisting pods in PLEG.

kubelet_pleg_relist_duration_seconds_count

The count of duration in seconds for relisting pods in PLEG.

kubelet_pleg_relist_interval_seconds_bucket

The bucketed histogram of interval in seconds between relisting in PLEG.

kubelet_pod_start_duration_seconds_count

The count of duration in seconds from kubelet seeing a pod for the first time to the pod starting to run.

kubelet_pod_worker_duration_seconds_bucket

The bucketed histogram of duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync.

kubelet_pod_worker_duration_seconds_count

The count of duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync.

kubelet_running_containers

Number of containers currently running.

kubelet_running_pods

Number of pods that have a running pod sandbox.

kubelet_runtime_operations_duration_seconds_bucket

The bucketed histogram of duration in seconds of runtime operations. Broken down by operation type.

kubelet_runtime_operations_errors_total

Cumulative number of runtime operation errors by operation type.

kubelet_runtime_operations_total

Cumulative number of runtime operations by operation type.

kube_node_status_allocatable

The amount of resources allocatable for pods (after reserving some for system daemons).

kube_node_status_capacity

The total amount of resources available for a node.

kube_pod_container_resource_limits (CPU)

The CPU limit set for a container.

kube_pod_container_resource_limits (Memory)

The memory limit set for a container.

kube_pod_container_resource_requests (CPU)

The amount of CPU requested by a container.

kube_pod_container_resource_requests (Memory)

The amount of memory requested by a container.

kube_pod_owner

Information about the Pod's owner.

kube_resourcequota

Information about resource quotas, which enforce usage limits on resources such as CPU, memory, and storage within namespaces.

node_cpu

The CPU usage metrics for a node, including usage per core and total usage.

node_cpu_seconds_total

Seconds the CPUs spent in each mode.

node_disk_io_time_seconds

The cumulative amount of time spent performing I/O operations on disk by a node.

node_disk_io_time_seconds_total

The total amount of time spent performing I/O operations on disk by the node.

node_disk_read_bytes_total

The total number of bytes read from disk by the node.

node_disk_written_bytes_total

The total number of bytes written to disk by the node.

node_filesystem_avail_bytes

The amount of available space in bytes on the filesystem of a node in a Kubernetes cluster.

node_filesystem_size_bytes

The total size of the filesystem on the node.

node_load1

The 1-minute load average of the node.

node_load15

The 15-minute load average of the node.

node_load5

The 5-minute load average of the node.

node_memory_Buffers_bytes

The amount of memory used for buffer caching by the node's operating system.

node_memory_Cached_bytes

The amount of memory used for disk caching by the node's operating system.

node_memory_MemAvailable_bytes

The amount of memory available for use by applications and caches.

node_memory_MemFree_bytes

The amount of free memory available on the node.

node_memory_MemTotal_bytes

The total amount of physical memory available on the node.
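The node memory gauges are commonly combined into a single utilization figure. One possible formulation, shown here as a sketch and not necessarily the one this solution's dashboards use, treats everything that is not available to applications as used. The function name `memory_utilization` and the sample sizes are hypothetical.

```python
def memory_utilization(mem_available_bytes, mem_total_bytes):
    """Fraction of node memory in use, from MemAvailable and MemTotal gauges."""
    if mem_total_bytes <= 0:
        raise ValueError("total memory must be positive")
    return 1.0 - mem_available_bytes / mem_total_bytes


# Hypothetical node: 16 GiB total, 4 GiB still available.
GIB = 1024 ** 3
used_fraction = memory_utilization(4 * GIB, 16 * GIB)
```

Using MemAvailable in place of MemFree avoids counting reclaimable page cache as used memory.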

node_network_receive_bytes_total

The total number of bytes received over the network by the node.

node_network_transmit_bytes_total

The total number of bytes transmitted over the network by the node.

process_cpu_seconds_total

Total user and system CPU time spent in seconds.

process_resident_memory_bytes

Resident memory size in bytes.

rest_client_requests_total

Number of HTTP requests, partitioned by status code, method, and host.

rest_client_request_duration_seconds_bucket

The bucketed histogram of request latency in seconds. Broken down by verb and host.

storage_operation_duration_seconds_bucket

The bucketed histogram of duration of storage operations.

storage_operation_duration_seconds_count

The count of duration of storage operations.

storage_operation_errors_total

Cumulative number of errors during storage operations.

up

A metric indicating whether the monitored target (e.g., node) is up and running.

volume_manager_total_volumes

The total number of volumes managed by the volume manager.

workqueue_adds_total

Total number of adds handled by workqueue.

workqueue_depth

Current depth of workqueue.

workqueue_queue_duration_seconds_bucket

The bucketed histogram of how long in seconds an item stays in workqueue before being requested.

workqueue_work_duration_seconds_bucket

The bucketed histogram of how long in seconds processing an item from workqueue takes.