articles/redis/monitor-cache-reference.md

Lines changed: 20 additions & 60 deletions
@@ -38,66 +38,26 @@ The following table lists the metrics available for the Microsoft.Cache/redisEnt
The following list provides details and more information about the supported Azure Monitor metrics for [Microsoft.Cache/redisEnterprise](/azure/azure-monitor/reference/supported-metrics/microsoft-cache-redisenterprise-metrics).

- Cache Latency (preview)

  The latency of the cache calculated using the internode latency of the cache. This metric is measured in microseconds, and has three dimensions: `Avg`, `Min`, and `Max`. The dimensions represent the average, minimum, and maximum latency of the cache during the specified reporting interval.

- Cache Hits

  The number of successful key lookups during the specified reporting interval.

- Cache Misses

  The number of failed key lookups during the specified reporting interval. This number maps to `keyspace_misses` from the Redis INFO command. Cache misses don't necessarily mean there's an issue with the cache. For example, when using the cache-aside programming pattern, an application looks first in the cache for an item. If the item isn't there (cache miss), the item is retrieved from the database and added to the cache for next time. Cache misses are normal behavior for the cache-aside programming pattern. If the number of cache misses is higher than expected, examine the application logic that populates and reads from the cache. If items are being evicted from the cache because of memory pressure, then there might be some cache misses, but a better metric to monitor for memory pressure would be `Used Memory` or `Evicted Keys`.

- Cache Read

  The amount of data read from the cache in Megabytes per second (MB/s) during the specified reporting interval. This value is derived from the network interface cards that support the virtual machine that hosts the cache and isn't Redis specific. This value corresponds to the network bandwidth used by this cache. If you want to set up alerts for server-side network bandwidth limits, then create it using this `Cache Read` counter. See [this table](planning-faq.yml#how-can-i-measure-azure-managed-redis-performance-) for the observed bandwidth limits for various cache pricing tiers and sizes.

- Cache Write

  The amount of data written to the cache in Megabytes per second (MB/s) during the specified reporting interval. This value is derived from the network interface cards that support the virtual machine that hosts the cache and isn't Redis specific. This value corresponds to the network bandwidth of data sent to the cache from the client.

- Connected Clients

  The number of client connections to the cache during the specified reporting interval. This number maps to `connected_clients` from the Redis INFO command. Once the connection limit is reached, later attempts to connect to the cache fail. Even if there are no active client applications, there might still be a few instances of connected clients because of internal processes and connections.

- CPU

  The CPU utilization of the Azure Managed Redis server as a percentage during the specified reporting interval. This value maps to the operating system `\Processor(_Total)\% Processor Time` performance counter. Note that this metric can be noisy due to low priority background security processes running on the node, so we recommend monitoring the Server Load metric to track load on a Redis server.

- Evicted Keys

  The number of items evicted from the cache during the specified reporting interval because of the `maxmemory` limit.

  This number maps to `evicted_keys` from the Redis INFO command.

- Expired Keys

  The number of items expired from the cache during the specified reporting interval. This value maps to `expired_keys` from the Redis INFO command.

- Geo-replication metrics

  Geo-replication metrics are affected by monthly internal maintenance operations. The Azure Managed Redis service periodically patches caches with the latest platform features and improvements. During these updates, each cache node is taken offline, which temporarily disables the geo-replication link. If your geo-replication link is unhealthy, check to see if it was caused by a patching event on either the geo-primary or geo-secondary cache by using **Diagnose and Solve Problems** from the Resource menu in the portal. Depending on the amount of data in the cache, the downtime from patching can take anywhere from a few minutes to an hour. If the geo-replication link is unhealthy for over an hour, [file a support request](/azure/azure-portal/supportability/how-to-create-azure-support-request).

- Geo Replication Healthy

  Depicts the status of the geo-replication link between caches. There can be two possible states that the replication link can be in:

  0 – disconnected/unhealthy

  1 – healthy

  The metric is available in Enterprise and Enterprise Flash tier caches with geo-replication enabled.

  This metric might indicate a disconnected/unhealthy replication status for several reasons, including: monthly patching, host OS updates, network misconfiguration, or failed geo-replication link provisioning.

  A value of 0 doesn't mean that data on the geo-replica is lost. It just means that the link between geo-primary and geo-secondary is unhealthy.

  If the geo-replication link is unhealthy for over an hour, [file a support request](/azure/azure-portal/supportability/how-to-create-azure-support-request).

- Gets

  The number of get operations from the cache during the specified reporting interval. This value is the sum of the following values from the Redis INFO all command: `cmdstat_get`, `cmdstat_hget`, `cmdstat_hgetall`, `cmdstat_hmget`, `cmdstat_mget`, `cmdstat_getbit`, and `cmdstat_getrange`, and is equivalent to the sum of cache hits and misses during the reporting interval.

- Operations per Second

  The total number of commands processed per second by the cache server during the specified reporting interval. This value maps to `instantaneous_ops_per_sec` from the Redis INFO command.

- Server Load

  The percentage of CPU cycles in which the Redis server is busy processing and _not waiting idle_ for messages. If this counter reaches 100, the Redis server has hit a performance ceiling, and the CPU can't process work any faster. You can expect a large latency effect. If you're seeing sustained high Redis Server Load, consider scaling up the cache or partitioning data across multiple caches. When _Server Load_ is only moderately high, such as 50 to 80 percent, average latency usually remains low, and timeout exceptions could have other causes than high server latency.

  The _Server Load_ metric is sensitive to other processes on the machine using existing CPU cycles that reduce the Redis server's idle time. We recommend that you pay attention to other metrics such as operations, latency, and CPU, in addition to _Server Load_.

  > [!CAUTION]
  > The _Server Load_ metric can present incorrect data for Enterprise and Enterprise Flash tier caches. Sometimes _Server Load_ is represented as being over 100. We are investigating this issue. We recommend using the CPU metric instead in the meantime.

- Sets

  The number of set operations to the cache during the specified reporting interval. This value is the sum of the following values from the Redis INFO all command: `cmdstat_set`, `cmdstat_hset`, `cmdstat_hmset`, `cmdstat_hsetnx`, `cmdstat_lset`, `cmdstat_mset`, `cmdstat_msetnx`, `cmdstat_setbit`, `cmdstat_setex`, `cmdstat_setrange`, and `cmdstat_setnx`.

- Total Keys

  The maximum number of keys in the cache during the past reporting time period. This number maps to `keyspace` from the Redis INFO command.

  > [!IMPORTANT]
  > Because of a limitation in the underlying metrics system for caches with clustering enabled, Total Keys returns the maximum number of keys of the shard that had the maximum number of keys during the reporting interval.

- Total Operations

  The total number of commands processed by the cache server during the specified reporting interval. This value maps to `total_commands_processed` from the Redis INFO command. When Azure Managed Redis is used purely for pub/sub, there are no metrics for `Cache Hits`, `Cache Misses`, `Gets`, or `Sets`, but there are `Total Operations` metrics that reflect cache usage for pub/sub operations.

- Used Memory

  The amount of cache memory in MB that is used for key/value pairs in the cache during the specified reporting interval. This value maps to `used_memory` from the Redis INFO command. This value doesn't include metadata or fragmentation.

  On the Enterprise and Enterprise Flash tier, the Used Memory value includes the memory in both the primary and replica nodes. This can make the metric appear twice as large as expected.

- Used Memory Percentage

  The percent of total memory that is being used during the specified reporting interval. This value references the `used_memory` value from the Redis INFO command to calculate the percentage. This value doesn't include fragmentation.
| Metric | Details |
|--------|---------|
| Cache Latency (preview) | The average latency of requests handled by endpoints on the cache node during the specified reporting interval. This metric is measured in milliseconds and is sourced from the `node_avg_latency` Prometheus metric. This metric is only reported when there is active traffic on the cache. |
| Cache Hits | The number of successful key lookups during the specified reporting interval. This value is sourced from the `bdb_read_hits` Prometheus metric. |
| Cache Misses | The number of failed key lookups during the specified reporting interval. This value is sourced from the `bdb_read_misses_max` Prometheus metric. Cache misses don't necessarily mean there's an issue with the cache. For example, when using the cache-aside programming pattern, an application looks first in the cache for an item. If the item isn't there (cache miss), the item is retrieved from the database and added to the cache for next time. Cache misses are normal behavior for the cache-aside programming pattern. If the number of cache misses is higher than expected, examine the application logic that populates and reads from the cache. If items are being evicted from the cache because of memory pressure, then there might be some cache misses, but a better metric to monitor for memory pressure would be `Used Memory` or `Evicted Keys`. |
| Cache Read | The rate of incoming network traffic to the cache node in bytes per second during the specified reporting interval. This value is sourced from the `node_ingress_bytes_max` Prometheus metric. If you want to set up alerts for server-side network bandwidth limits, then create it using this `Cache Read` counter. See [this table](/azure/redis/planning-faq#how-can-i-measure-azure-managed-redis-performance-) for the observed bandwidth limits for various cache pricing tiers and sizes. |
| Cache Write | The rate of outgoing network traffic from the cache node in bytes per second during the specified reporting interval. This value is sourced from the `node_egress_bytes_max` Prometheus metric. |
| Connected Clients | The number of client connections to the cache during the specified reporting interval. This value is sourced from the `node_conns` Prometheus metric, which counts clients connected to endpoints on the node. Once the connection limit is reached, later attempts to connect to the cache fail. Even if there are no active client applications, there might still be a few instances of connected clients because of internal processes and connections. |
| CPU | The CPU utilization of the Azure Managed Redis server as a percentage during the specified reporting interval. This value is derived from the `node_cpu_idle_min` Prometheus metric, which represents the lowest CPU idle time portion observed during the interval, and is inverted to reflect CPU busy time. The CPU metric includes background processes such as anti-malware that aren't strictly Redis server processes, so it can sometimes spike independently of the Redis workload. We recommend using this metric over **Server Load** for monitoring, as it supports instance-level drill-down by splitting on Instance ID, providing more granularity into which node is under pressure. |
| Evicted Keys | The number of keys evicted from the cache during the specified reporting interval. This value is sourced from the `bdb_evicted_objects` Prometheus metric. |
| Expired Keys | The number of keys expired from the cache during the specified reporting interval. This value is sourced from the `bdb_expired_objects` Prometheus metric. |
| Geo Replication Healthy | Indicates the health of the geo-replication link between caches in an Active Geo-Replication group. The metric reports one of two values:<br/><br/>0 – disconnected/unhealthy<br/>1 – healthy<br/><br/>The metric is available on Memory Optimized, Balanced, and Compute Optimized tier caches with geo-replication enabled. A value of 0 doesn't mean that data on the geo-replica is lost. It just means that the link between geo-primary and geo-secondary is unhealthy.<br/><br/>This metric might indicate a disconnected/unhealthy replication status for several reasons, including: monthly patching, host OS updates, network misconfiguration, or failed geo-replication link provisioning. The Azure Managed Redis service periodically patches caches with the latest platform features and improvements. During these updates, each cache node is taken offline, which temporarily disables the geo-replication link. If your geo-replication link is unhealthy, check to see if it was caused by a patching event on either the geo-primary or geo-secondary cache by using **Diagnose and Solve Problems** from the Resource menu in the portal. Depending on the amount of data in the cache, the downtime from patching can take anywhere from a few minutes to an hour. If the geo-replication link is unhealthy for over an hour, [file a support request](/azure/azure-portal/supportability/how-to-create-azure-support-request). |
| Gets | The number of read requests to the cache during the specified reporting interval. This value is sourced from the `bdb_read_req` Prometheus metric, which represents the rate of all read requests on the database, and is equivalent to the sum of cache hits and misses during the reporting interval. |
| Operations per Second | The total number of requests handled per second by all shards of the cache during the specified reporting interval. This value is sourced from the `bdb_instantaneous_ops_per_sec` Prometheus metric. |
| Server Load | The *Server Load* metric reflects the Redis server's own assessment of overall load, and is similar to the **CPU** metric but measured at a cluster level rather than per instance. This value is derived from the `node_cpu_idle_min` Prometheus metric and inverted to reflect server busy time. If this counter reaches 100, the Redis server has hit a performance ceiling, and the CPU can't process work any faster. You can expect a large latency effect. If you're seeing sustained high Server Load, consider scaling up the cache or partitioning data across multiple caches. When *Server Load* is only moderately high, such as 50 to 80 percent, average latency usually remains low, and timeout exceptions could have other causes than high server latency.<br/><br/>Because *Server Load* is measured at the cluster level, it doesn't allow you to drill down to individual instances. We recommend using the **CPU** metric instead, as it supports splitting by Instance ID for instance-level analysis.<br/><br/>> [!CAUTION]<br/>> The *Server Load* metric can present incorrect data for Azure Managed Redis caches. Sometimes *Server Load* is represented as being over 100. We are investigating this issue. We recommend using the **CPU** metric instead. |
| Sets | The number of write requests to the cache during the specified reporting interval. This value is sourced from the `bdb_write_req` Prometheus metric, which represents the rate of all write requests on the database. |
| Total Keys | The number of keys in the cache during the specified reporting interval. This value is sourced from the `bdb_no_of_keys` Prometheus metric.<br/><br/>> [!IMPORTANT]<br/>> Because of a limitation in the underlying metrics system for caches with clustering enabled, Total Keys returns the maximum number of keys of the shard that had the maximum number of keys during the reporting interval. |
| Total Operations | The total number of requests processed by the cache during the specified reporting interval. This value is sourced from the `bdb_total_req` Prometheus metric. |
| Used Memory | The amount of cache memory in bytes used by the database during the specified reporting interval. This value is sourced from the `bdb_used_memory` Prometheus metric. On Flash Optimized tier caches, this value includes both RAM and flash memory usage. This value doesn't include fragmentation.<br/><br/>When High Availability is enabled, the Used Memory value includes the memory in both the primary and replica nodes. This can make the metric appear twice as large as expected. |
| Used Memory Percentage | The percent of the configured memory limit that is currently in use during the specified reporting interval. This value is calculated as the ratio of `bdb_used_memory` to `bdb_memory_limit` from the Redis Enterprise Prometheus metrics. This value doesn't include fragmentation. |
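The cache-aside flow described under **Cache Misses** can be sketched as follows. This is a minimal illustration only: `FakeCache`, `get_item`, and the dictionary standing in for the database are all hypothetical names, and the in-memory class is a stand-in for a real Redis client issuing `GET`/`SET` commands.

```python
class FakeCache:
    """In-memory stand-in for a Redis cache (illustrative only)."""

    def __init__(self):
        self.store = {}
        self.hits = 0    # successful lookups; would surface as Cache Hits
        self.misses = 0  # failed lookups; would surface as Cache Misses

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self.store[key] = value


def get_item(cache, database, key):
    """Cache-aside: look in the cache first, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:          # cache miss -- normal for cache-aside
        value = database[key]  # fetch from the system of record
        cache.set(key, value)  # populate the cache for next time
    return value


if __name__ == "__main__":
    db = {"user:1": "Ada"}
    cache = FakeCache()
    get_item(cache, db, "user:1")    # first read: miss, cache populated
    get_item(cache, db, "user:1")    # second read: hit
    print(cache.hits, cache.misses)  # -> 1 1
```

As the misses counter shows, the first read of every key is a miss by design, which is why a nonzero Cache Misses value alone doesn't indicate a problem.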
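The table describes two derived metrics: **CPU** as inverted idle time from `node_cpu_idle_min`, and **Used Memory Percentage** as the ratio of `bdb_used_memory` to `bdb_memory_limit`. A sketch of that arithmetic, assuming the idle metric is a 0.0-1.0 fraction and using made-up sample values (the exact scaling Azure Monitor applies isn't specified here):

```python
def cpu_busy_percent(node_cpu_idle_min: float) -> float:
    """Invert the minimum observed idle fraction (0.0-1.0) into a busy percentage."""
    return 100.0 * (1.0 - node_cpu_idle_min)


def used_memory_percent(bdb_used_memory: int, bdb_memory_limit: int) -> float:
    """Used Memory Percentage: used bytes relative to the configured limit."""
    return 100.0 * bdb_used_memory / bdb_memory_limit


if __name__ == "__main__":
    # Node was at least 75% idle during the interval => reported as 25% busy.
    print(cpu_busy_percent(0.75))  # -> 25.0
    # 3 GiB used out of a 12 GiB limit => 25% of memory in use.
    print(used_memory_percent(3 * 1024**3, 12 * 1024**3))  # -> 25.0
```

Because the CPU metric uses the *minimum* idle sample in the interval, it reports the busiest moment seen, not the average.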