Dipowell/node readiness timing by diamondpowell · Pull Request #1208 · Azure/telescope

diamondpowell · 2026-06-02T20:30:51Z

Summary

Adds node_readiness_time as a separate metric in the open-source CRUD module to match internal repo behavior. The internal repo captures how long K8s nodes take to become Ready independently of when the ARM API completes. The open-source repo was missing this: it recorded only a combined duration.

Azure API says "done" when the control plane finishes, but nodes might not be schedulable yet. Capturing both timestamps separately enables regression analysis:

command_execution_time > node_readiness_time -> ARM layer is the bottleneck
node_readiness_time > command_execution_time -> K8s layer is the bottleneck

Changes

kubernetes_client.py

Added return_timestamp=False parameter to wait_for_nodes_ready()
When True, returns (ready_nodes, timestamp) tuple instead of just ready_nodes
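The optional return shape can be illustrated with a minimal standalone sketch. It is not the KubernetesClient implementation: `list_ready_nodes`, `timeout_seconds`, and `poll_interval` are hypothetical stand-ins for the client's node listing and timeout handling, and only the `return_timestamp` switch mirrors the change described above.

```python
import time


def wait_for_nodes_ready(list_ready_nodes, node_count, timeout_seconds,
                         return_timestamp=False, poll_interval=0.01):
    """Poll until node_count nodes are Ready (hypothetical standalone version).

    list_ready_nodes is an assumed zero-argument callable that returns the
    names of the nodes that are currently Ready.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        ready_nodes = list_ready_nodes()
        if len(ready_nodes) >= node_count:
            # Capture the moment readiness was observed, before returning.
            ready_timestamp = time.time()
            if return_timestamp:
                return ready_nodes, ready_timestamp
            return ready_nodes
        time.sleep(poll_interval)
    raise TimeoutError(f"Only {len(list_ready_nodes())} of {node_count} nodes became Ready")


# Default call keeps the original return shape.
nodes_only = wait_for_nodes_ready(lambda: ["node-a", "node-b"], 2, timeout_seconds=1)

# Opting in returns a (ready_nodes, timestamp) tuple.
nodes, ready_at = wait_for_nodes_ready(
    lambda: ["node-a", "node-b"], 2, timeout_seconds=1, return_timestamp=True
)
```

Keeping the default at False means existing callers that expect a plain list are unaffected.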

aks_client.py

Added _instrument_nodepool_provisioning() helper using ThreadPoolExecutor
Updated create_node_pool(), scale_node_pool(), and _progressive_scale() to capture concurrent timing
Enhanced timing logs to surface bottleneck layer and total elapsed time

Timing metadata stored via op.add_metadata():

node_readiness_time: seconds from start until K8s nodes were Ready
command_execution_time: seconds from start until ARM operation completed

Implementation notes

Uses ThreadPoolExecutor(max_workers=2) to run ARM polling and K8s readiness checks concurrently, avoiding implicit event loop requirements that asyncio.run() would introduce
Both tasks run to completion even if one fails, enabling partial diagnostics
Method internalizes the ARM call, start time capture, and label selector construction so callers only pass node pool name, cluster name, parameters, and node count
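The concurrency pattern in these notes can be sketched as follows. This is an illustration under stated assumptions, not the code in aks_client.py: `arm_wait` and `k8s_wait` are hypothetical zero-argument callables standing in for the ARM poller wait and the K8s readiness wait, and the fake workloads exist only to make the sketch runnable.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def instrument_provisioning(arm_wait, k8s_wait):
    """Run both waits concurrently and time each from a shared start.

    Returns (timings, errors): elapsed seconds for the tasks that succeeded
    and the exception for any task that failed.
    """
    start_time = time.time()

    def timed(fn):
        # The elapsed time is computed only after fn() has returned.
        result = fn()
        return result, time.time() - start_time

    with ThreadPoolExecutor(max_workers=2) as executor:
        arm_future = executor.submit(timed, arm_wait)
        k8s_future = executor.submit(timed, k8s_wait)
    # Leaving the with-block waits for both futures, so a failure in one
    # task does not prevent the other from running to completion.

    timings, errors = {}, {}
    for name, future in (("command_execution_time", arm_future),
                         ("node_readiness_time", k8s_future)):
        exc = future.exception()
        if exc is None:
            timings[name] = future.result()[1]
        else:
            errors[name] = exc
    return timings, errors


def fake_arm():
    time.sleep(0.05)  # pretend the ARM operation finishes first
    return "arm-done"


def fake_k8s():
    time.sleep(0.15)  # pretend nodes become Ready later
    return ["node-1"]


def failing_arm():
    raise RuntimeError("simulated ARM failure")


timings, errors = instrument_provisioning(fake_arm, fake_k8s)
partial_timings, partial_errors = instrument_provisioning(failing_arm, fake_k8s)
```

In the second call the ARM side fails, yet the readiness timing is still collected, which is the partial-diagnostics behavior the notes describe.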

Copilot

Pull request overview

This PR adds independent node-readiness timing to the Python AKS CRUD flow by extending the Kubernetes wait helper to optionally return a readiness timestamp, then running the ARM poller and the K8s readiness wait concurrently so both timings can be recorded for regression analysis.

Changes:

Extend KubernetesClient.wait_for_nodes_ready() with return_timestamp to optionally return (ready_nodes, ready_timestamp).
Add concurrent ARM + K8s readiness execution in AKSClient and record node_readiness_time / command_execution_time metadata.
Update AKS and Kubernetes client unit tests to cover the new return shape and timing metadata recording.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
modules/python/clients/kubernetes_client.py	Adds `return_timestamp` option so callers can capture when nodes became Ready.
modules/python/clients/aks_client.py	Runs ARM and readiness concurrently and stores separate timing metadata.
modules/python/tests/clients/test_kubernetes_client.py	Adds a unit test validating the timestamp-returning behavior.
modules/python/tests/clients/test_aks_client.py	Updates tests to expect timestamp return and to assert timing metadata is recorded.

liyu-ma · 2026-06-12T02:48:55Z


        return OperationContext

+    def _run_concurrent_arm_and_readiness(


I think _instrument_nodepool_provisioning is a better name for the method

Done, much clearer name for what this method actually does.

liyu-ma · 2026-06-12T02:49:38Z


+    def _run_concurrent_arm_and_readiness(
+        self,
+        poller,


I see that only begin_create_or_update is passed in, so there is no need for a separate poller param.

Moved the begin_create_or_update call inside the method. Callers now just pass node_pool_name, cluster_name, parameters, and node_count, which cleaned up all 3 call sites pretty nicely.

liyu-ma · 2026-06-12T02:53:49Z

+        poller,
+        node_count: int,
+        label_selector: str,
+        start_time: float


I think start_time can be captured within this function, no need to be passed from outside

Rolled this into the same change as the poller removal since both touched the method signature. start_time and label_selector are both derived internally now.

liyu-ma · 2026-06-25T12:24:27Z

                        gpu_instance_profile=gpu_instance_profile,
                        gpu_mig_strategy=gpu_mig_strategy,
                    )
+                    command_execution_time = time.time() - start_time


What is the reason for the addition here? Is instrumenting the GPU pool within this PR's scope?

liyu-ma · 2026-06-25T12:32:02Z

+
+        arm_exc = arm_future.exception()
+        k8s_exc = k8s_future.exception()
+        k8s_exc = k8s_future.exception()


Why the duplication for line 116 and 117?

liyu-ma · 2026-06-25T12:36:16Z

+
+        return arm_response, ready_nodes, node_readiness_time, command_execution_time
+
+    def _log_timing_metrics(self, op, node_pool_name, node_readiness_time, command_execution_time):


How about inlining this function within _instrument_nodepool_provisioning and calling it at the end?

liyu-ma · 2026-06-25T12:43:55Z

+            parameters=parameters,
+        )
+
+        def _poll_arm():


I don't see the need to have separate _poll_arm and _wait_k8s. How about simplifying it with lambdas, for example:

    with ThreadPoolExecutor(max_workers=2) as executor:
        arm_future = executor.submit(lambda: (poller.result(), time.time()))
        k8s_future = executor.submit(
            lambda: self.k8s_client.wait_for_nodes_ready(
                node_count=node_count,
                operation_timeout_in_minutes=self.operation_timeout_minutes,
                label_selector=label_selector,
                return_timestamp=True,
            )
        )

liyu-ma · 2026-06-25T13:07:01Z

                node_pool.count = node_count

                logger.info(f"Scaling node pool {node_pool_name} to {node_count} nodes")
-                self._begin_update_with_retry(


The deletion here is a behavior change. Your _instrument_nodepool_provisioning calls begin_create_or_update which does not do retry. I would suggest using _begin_update_with_retry within your instrumentation as it retries on transient errors - but this may skew your metrics.
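One way to keep retries without hiding their effect on the metric is to record the attempt count next to the elapsed time. The sketch below is a generic retry helper, not the repository's `_begin_update_with_retry`: `call_with_retry`, its parameters, and the choice of `ConnectionError` as the transient error type are all assumptions made for illustration.

```python
import time


def call_with_retry(operation, max_attempts=3, backoff_seconds=0.01,
                    transient=(ConnectionError,)):
    """Call operation(), retrying on transient errors (hypothetical helper).

    Returns (result, attempts, elapsed_seconds) so that a timing inflated by
    retries can be told apart from a slow single attempt.
    """
    start = time.time()
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
            return result, attempt, time.time() - start
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            time.sleep(backoff_seconds)


calls = {"count": 0}


def flaky_update():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "poller"


result, attempts, elapsed = call_with_retry(flaky_update)
```

Storing `attempts` alongside the timing lets an analysis exclude or flag runs where retries occurred, which addresses the skew concern without dropping retry support.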

liyu-ma · 2026-06-25T13:07:48Z

                        "cluster_info", self.get_cluster_data(cluster_name)
                    )
                    node_pool.count = step  # Update node count in the node pool object
-                    self._begin_update_with_retry(


Same as above. The new code loses the retry support.

liyu-ma · 2026-06-25T13:28:18Z

                raise Exception(f"Error deleting Node '{node_name}': {str(e)}") from e

-    def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None):
+    def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None, return_timestamp=False):


The new param makes this function complicated and more importantly - it is unnecessary. You should be able to capture the current time at call site, something like:

    k8s_future = executor.submit(
        lambda: (
            self.k8s_client.wait_for_nodes_ready( ... ),
            time.time(),
        )
    )

The time.time() function will only be executed when wait_for_nodes_ready() completes.
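The evaluation-order point can be checked with a small runnable sketch. Python evaluates tuple elements left to right, so the timestamp is taken after the wait returns. `slow_wait` is a hypothetical stand-in for wait_for_nodes_ready().

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_wait():
    # Stand-in for a readiness wait that takes a while to complete.
    time.sleep(0.1)
    return ["node-1", "node-2"]


submitted_at = time.time()
with ThreadPoolExecutor(max_workers=1) as executor:
    # slow_wait() is evaluated first, then time.time(), because tuple
    # elements are evaluated in order.
    future = executor.submit(lambda: (slow_wait(), time.time()))

ready_nodes, ready_timestamp = future.result()
elapsed = ready_timestamp - submitted_at
```

Here `elapsed` includes the full duration of the wait, so the callee does not need a timestamp-returning parameter.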

liyu-ma · 2026-06-25T13:40:19Z

+        op.add_metadata("node_readiness_time", node_readiness_time)
+        op.add_metadata("command_execution_time", command_execution_time)
+        delta = abs(command_execution_time - node_readiness_time)
+        bottleneck = "ARM" if command_execution_time > node_readiness_time else "K8s"


I don't like logging a 'bottleneck'; it is very subjective. Logging should just record the facts, e.g. timestamps and the delta. Your data analytics tool is responsible for deciding whether something is a bottleneck.
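A facts-only record along these lines might look like the sketch below. The two timing keys come from the PR; `timing_metadata` and the `readiness_minus_command_seconds` key are hypothetical names. The delta is kept signed so that downstream analysis can see which timestamp came later without the log asserting a conclusion.

```python
def timing_metadata(node_readiness_time, command_execution_time):
    """Build metadata that records measurements only, with no interpretation."""
    return {
        "node_readiness_time": node_readiness_time,
        "command_execution_time": command_execution_time,
        # Signed difference: positive means nodes became Ready after the
        # ARM operation completed, negative means before.
        "readiness_minus_command_seconds": node_readiness_time - command_execution_time,
    }


metadata = timing_metadata(node_readiness_time=12.5, command_execution_time=10.0)
```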

diamondpowell force-pushed the dipowell/node-readiness-timing branch 7 times, most recently from 791370e to 73ebe22 Compare June 8, 2026 21:13

diamondpowell force-pushed the dipowell/node-readiness-timing branch 3 times, most recently from 0b1cf21 to 4603b7b Compare June 10, 2026 00:31

diamondpowell marked this pull request as ready for review June 10, 2026 00:40

Copilot AI review requested due to automatic review settings June 10, 2026 00:40

diamondpowell requested review from LeonardCareer, alyssa1303, anson627, liyu-ma, sumanthreddy29, vittoriasalim, wonderyl and xinWeiWei24 as code owners June 10, 2026 00:40

Copilot started reviewing on behalf of diamondpowell June 10, 2026 00:41

Copilot AI reviewed Jun 10, 2026

Comment thread modules/python/clients/aks_client.py Outdated

diamondpowell closed this Jun 10, 2026

diamondpowell reopened this Jun 10, 2026

liyu-ma reviewed Jun 12, 2026

Comment thread modules/python/clients/aks_client.py Outdated

liyu-ma reviewed Jun 12, 2026

diamondpowell added 2 commits June 24, 2026 03:00

Add return_timestamp parameter to wait_for_nodes_ready

66ca3a5

Replace asyncio with ThreadPoolExecutor for concurrent timing

19bef92

Use ThreadPoolExecutor instead of asyncio.run() to avoid implicit requirement that callers must not be in an existing event loop. Keeps the same method signature and behavior.
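The event-loop constraint behind this commit can be reproduced in a few lines. The helper names below are hypothetical. The sketch shows that a synchronous helper built on asyncio.run() raises RuntimeError when its caller is already inside a running event loop, while a thread-pool helper works in both contexts.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def noop():
    return "ok"


def sync_helper_with_asyncio():
    coro = noop()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return f"RuntimeError: {exc}"


def sync_helper_with_threads():
    with ThreadPoolExecutor(max_workers=2) as executor:
        return executor.submit(lambda: "ok").result()


async def caller_inside_event_loop():
    # Both helpers are called while an event loop is already running.
    return sync_helper_with_asyncio(), sync_helper_with_threads()


from_loop = asyncio.run(caller_inside_event_loop())
from_sync = sync_helper_with_asyncio()
```

Called from plain synchronous code, the asyncio-based helper succeeds. Called from inside a running loop, it fails, and that is the implicit requirement the thread pool removes.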

diamondpowell added 3 commits June 24, 2026 03:02

Internalize ARM call and start_time in concurrent method

efc5dc2

Move begin_create_or_update call and start_time capture inside the method. Callers now pass node_pool_name, cluster_name, parameters, and node_count instead of a pre-created poller and external timestamp.

Rename _run_concurrent_arm_and_readiness to _instrument_nodepool_provisioning

e2d69ec

Enhance timing logs with bottleneck analysis and total elapsed

278e27d

Log now shows which layer (ARM vs K8s) was the bottleneck, the delta between them, and the total elapsed time for the concurrent operation.

diamondpowell force-pushed the dipowell/node-readiness-timing branch from 148a7c2 to 278e27d Compare June 24, 2026 07:12

Add pipeline test config (to be reverted before merge)

7c896e4

diamondpowell force-pushed the dipowell/node-readiness-timing branch from a083a19 to 7c896e4 Compare June 24, 2026 15:12

diamondpowell added 2 commits June 24, 2026 12:09

Extract _log_timing_metrics helper and trim _instrument_nodepool_provisioning

0d8e992

Revert "Add pipeline test config (to be reverted before merge)"

f788d4d

This reverts commit 7c896e4.

liyu-ma reviewed Jun 25, 2026

Address review: simplify timing, restore retry, remove return_timestamp

ddf8598

diamondpowell force-pushed the dipowell/node-readiness-timing branch from d799fa0 to ddf8598 Compare July 1, 2026 16:23



3 participants
