Worker health - Temporal Cloud feature guide

This page is a guide to the following:

Minimal alerts

These alerts should be configured and understood first to gain intelligence into your application health and behaviors.

Create monitors and alerts for Schedule-To-Start latency SDK metrics (both Workflow Executions and Activity Executions). See Detect Task backlog section to explore sample queries and appropriate responses that accompany these values.

Alert at >200ms for your p99 value
Plot >100ms for your p95 value

Create a Grafana panel called Sync Match Rate. See the Sync Match Rate section to explore example queries and appropriate responses that accompany these values.

Alert at <95% for your p99 value
Plot <99% for your p95 value

Create a Grafana panel called Poll Success Rate. See the Detect greedy Workers section for example queries and appropriate responses that accompany these values.

Alert at <90% for your p99 value
Plot <95% for your p95 value

The following alerts build on the above to dive deeper into specific potential causes for Worker related issues you might be experiencing.

Create monitors and alerts for the temporal_worker_task_slots_available SDK metric. See the Detect misconfigured Workers section for appropriate responses based on the value.

Alert at 0 for your p99 value

Create monitors for the temporal_sticky_cache_size SDK metric. See the Configure Sticky Cache section for more details on this configuration.

Plot at {value} > {WorkflowCacheSize.Value}

Create monitors for the temporal_sticky_cache_total_forced_eviction SDK metric. This metric is available in the Go SDK, and the Java SDK only. See the Configure Sticky Cache section for more details and appropriate responses.

Alert at >{predetermined_high_number}

Detect Task Backlog

How to detect a backlog of Tasks.

Metrics to monitor:

SDK metric: workflow_task_schedule_to_start_latency
SDK metric: activity_schedule_to_start_latency
Temporal Cloud metric: temporal_cloud_v0_poll_success_count
Temporal Cloud metric: temporal_cloud_v0_poll_success_sync_count

Symptoms

If your Schedule-To-Start latency alert fires or appears high, cross-check this with the Sync Match Rate to determine if you need to change your Worker or fleet, or if you need to contact Temporal Cloud support. If your Sync Match Rate is low, you can contact Temporal Cloud support.

Schedule To Start latency

The Schedule-To-Start metric represents how long Tasks are staying, unprocessed, in the Task Queues. But differently, it is the time between when a Task is enqueued and when it is picked up by a Worker. This time being long (likely) means that your Workers can't keep up — either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.

The schedule_to_start_latency SDK metric for both Workflow Executions and Activity Executions should have alerts.

Prometheus query samples

Workflow Task Latency, 99th percentile

histogram_quantile(0.99, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))

Workflow Task Latency, average

sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)

Activity Task Latency, 99th percentile

histogram_quantile(0.99, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))

Activity Task Latency, average

sum(increase(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_activity_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)

Target

This latency should obviously be a very low value - close to zero. Anything else, indicates bottlenecking.

Sync Match Rate

The Sync Match Rate measures the rate of Tasks that can be delivered to Workers without having to be persisted (Workers are up and available to pick them up) to the rate of all delivered Tasks.

Calculate Sync Match Rate

temporal\_cloud\_v0\_poll\_success\_sync\_count / temporal\_cloud\_v0\_poll\_success\_count = N%

Prometheus sample

sync_match_rate query

sum by(temporal_namespace) (
    rate(
        temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
    )
)
/
sum by(temporal_namespace) (
    rate(
        temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
    )
)

Target

The Sync Match Rate should be at least >95%, but preferably >99%.

Interpretation

There is not enough or under-powered resources. If the Schedule-To-Start latency is high and the Sync Match Rate is high, the TaskQueue is experiencing a backlog of Tasks.

There are three typical causes for this:

There are not enough Workers to perform work
Each Worker is either under resourced, or is misconfigured, to handle enough work
There is congestion caused by the environment (eg., network) hosting the Worker(s) and Temporal Cloud.

Actions

Consider the following:

Increasing either the number of available Workers, OR
Verifying that your Worker hosts are appropriately resourced, OR
Increasing the Worker configuration value for concurrent pollers for Workers/Task executions (if your Worker resources can accommodate the increased load), OR
- TypeScript: Workflows
- Golang: Workflows
Doing some combination of the above

** Temporal Cloud bottleneck**

If the Schedule-To-Start latency is high and the Sync Match Rate is also low, Temporal Cloud could very well be the bottleneck and you should reach out via support channels for us to confirm.

Caveat

If you are setting the Schedule-To-Start Timeout value in your Activity Options, it will skew your observations if you are not following the guidance here. As such, we recommend that you avoid setting that value.

Detect greedy Worker resources

How to detect greedy Worker resources.

You can have too many Workers. If you see the Poll Success Rate showing low numbers, you might have too many resources polling Temporal Cloud.

Metrics to monitor:

Temporal Cloud metric: temporal_cloud_v0_poll_success_count Temporal Cloud metric: temporal_cloud_v0_poll_success_sync_count Temporal Cloud metric: temporal_cloud_v0_poll_timeout_count SDK metric: temporal_workflow_task_schedule_to_start_latency SDK metric: temporal_activity_schedule_to_start_latency

Calculate Poll Success Rate

(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count)(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count + temporal_cloud_v0_poll_timeout_count)

Target

Poll Success Rate should be >90% in most cases of systems with a steady load. For high volume and low latency, try to target >95%.

Interpretation

There may be too Many Workers.

If you see the followign at the same time then you might have too many Workers:

Low poll success rate, AND
Low schedule_to_start_latency, AND
Low Worker hosts resource utilization

Actions

Consider sizing down your Workers by either:

Reducing the number of Workers polling the impacted Task Queue, OR
Reducing the concurrent pollers per Worker, OR
Both of the above

Prometheus Sample

poll_success_rate query

(
    (
        sum by(temporal_namespace) (
          rate(
            temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
          )
        )
      +
        sum by(temporal_namespace) (
          rate(
            temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
          )
        )
    )
  /
    (
        (
            sum by(temporal_namespace) (
              rate(
                temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
              )
            )
          +
            sum by(temporal_namespace) (
              rate(
                temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
              )
            )
        )
      +
        sum by(temporal_namespace) (
          rate(
            temporal_cloud_v0_poll_timeout_count{temporal_namespace=~"$namespace"}[5m]
          )
        )
    )
)

Detect misconfigured Workers

How to detect misconfigured Workers.

The configuration of your Workers can negatively affect the efficiency of your Task processing as well. Please review this document.

Metrics to monitor:

SDK metric: temporal_worker_task_slots_available SDK metric: sticky_cache_size SDK metric: sticky_cache_total_forced_eviction

Execution Size Configuration

The maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize define the number of total available slots for the Worker. If this is set too low, the Worker would not be able to keep up processing Tasks.

Target

The temporal_worker_task_slots_available metric should always be >0.

Prometheus Samples

Over Time

avg_over_time(temporal_worker_task_slots_available{namespace="$namespace",worker_type="WorkflowWorker"}[10m])

Current Time

temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker", task_queue="$task_queue_name"}

Interpretation

You are likely experiencing a Task backlog if you are seeing inadequate slot counts frequently. The work is not getting processed as fast as it should/can.

Action

Increase the maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize values and keep an eye on your Worker resource metrics (CPU utilization, etc) to make sure you haven't created a new issue.

Configure Sticky Cache

How to configure your Workflow Sticky cache.

The WorkflowCacheSize should always be greater than the sticky_cache_size metric value. Additionally, you can watch sticky_cache_total_forced_eviction for unusually high numbers that are likely an indicator of inefficiency, since Workflows are being evicted from the cache.

Target

The sticky_cache_size should report less than or equal to your WorkflowCacheSize value. Also, sticky_cache_total_forced_eviction should not be reporting high numbers (relative).

Action

If you see a high eviction count, verify there are no other inefficiencies in your Worker configuration or resource provisioning (backlog). If you see the cache size metric exceed the WorkflowCacheSize, increase this value if your Worker resources can accommodate it or provision more Workers. Finally, take time to review this document and see if this document further addresses potential cache issues.

Prometheus Sample

Sticky Cache Size

max_over_time(temporal_sticky_cache_size{namespace="$namespace"}[10m])

Sticky Cache Evictions

rate(temporal_sticky_cache_total_forced_eviction_total{namespace="$namespace"}[5m]))

Minimal alerts​

Detect Task Backlog​

Schedule To Start latency​

Prometheus query samples​

Sync Match Rate​

Prometheus sample​

Detect greedy Worker resources​

Prometheus Sample​

Detect misconfigured Workers​

Prometheus Samples​

Configure Sticky Cache​

Prometheus Sample​

Minimal alerts

Detect Task Backlog

Schedule To Start latency

Prometheus query samples

Sync Match Rate

Prometheus sample

Detect greedy Worker resources

Prometheus Sample

Detect misconfigured Workers

Prometheus Samples

Configure Sticky Cache

Prometheus Sample