
Dipowell/node readiness timing #1208

Open
diamondpowell wants to merge 9 commits into main from dipowell/node-readiness-timing

@diamondpowell diamondpowell commented Jun 2, 2026

Contributor

Summary

Adds node_readiness_time as a separate metric in the open-source CRUD module to match internal repo behavior. The internal repo captures how long K8s nodes take to become Ready independently of when the ARM API call completes. The open-source repo lacked this and recorded only a combined duration.

Azure API says "done" when the control plane finishes, but nodes might not be schedulable yet. Capturing both timestamps separately enables regression analysis:

  • command_execution_time > node_readiness_time -> ARM layer is the bottleneck
  • node_readiness_time > command_execution_time -> K8s layer is the bottleneck

Changes

kubernetes_client.py

  • Added return_timestamp=False parameter to wait_for_nodes_ready()
  • When True, returns (ready_nodes, timestamp) tuple instead of just ready_nodes
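
A minimal sketch of the return shape described above, not the repository's implementation. The `list_ready_nodes` callable and the `poll_interval_seconds` parameter are hypothetical stand-ins introduced so the example runs without a cluster; only `node_count`, `operation_timeout_in_minutes`, `label_selector`, and `return_timestamp` come from the PR.

```python
import time
from typing import Callable, List, Optional, Tuple, Union


def wait_for_nodes_ready(
    list_ready_nodes: Callable[[Optional[str]], List[str]],  # hypothetical stand-in for the K8s query
    node_count: int,
    operation_timeout_in_minutes: float,
    label_selector: Optional[str] = None,
    return_timestamp: bool = False,
    poll_interval_seconds: float = 0.01,  # hypothetical; not part of the PR
) -> Union[List[str], Tuple[List[str], float]]:
    """Poll until at least node_count nodes are Ready, or raise on timeout."""
    deadline = time.time() + operation_timeout_in_minutes * 60
    ready_nodes: List[str] = []
    while time.time() < deadline:
        ready_nodes = list_ready_nodes(label_selector)
        if len(ready_nodes) >= node_count:
            # Capture the timestamp as soon as readiness is observed.
            ready_timestamp = time.time()
            if return_timestamp:
                return ready_nodes, ready_timestamp
            return ready_nodes
        time.sleep(poll_interval_seconds)
    raise TimeoutError(f"Only {len(ready_nodes)} of {node_count} nodes became Ready")
```

With the default `return_timestamp=False`, existing callers keep receiving a plain list, which is presumably why the flag was added as opt-in.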

aks_client.py

  • Added _instrument_nodepool_provisioning() helper using ThreadPoolExecutor
  • Updated create_node_pool(), scale_node_pool(), and _progressive_scale() to capture concurrent timing
  • Enhanced timing logs to surface bottleneck layer and total elapsed time

Timing metadata stored via op.add_metadata():

  • node_readiness_time: seconds from start until K8s nodes were Ready
  • command_execution_time: seconds from start until ARM operation completed

Implementation notes

  • Uses ThreadPoolExecutor(max_workers=2) to run ARM polling and K8s readiness checks concurrently, avoiding implicit event loop requirements that asyncio.run() would introduce
  • Both tasks run to completion even if one fails, enabling partial diagnostics
  • Method internalizes the ARM call, start time capture, and label selector construction so callers only pass node pool name, cluster name, parameters, and node count
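
The concurrency pattern in these notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in aks_client.py: `run_concurrently_with_timing`, `poll_arm`, and `wait_k8s` are hypothetical names, with the two callables standing in for the ARM poller wait and the K8s readiness wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_concurrently_with_timing(
    poll_arm: Callable[[], Any],  # hypothetical stand-in for waiting on the ARM poller
    wait_k8s: Callable[[], Any],  # hypothetical stand-in for the K8s readiness wait
) -> Dict[str, Dict[str, Any]]:
    """Run both waits in parallel and report each task's result, elapsed time, or error."""
    start_time = time.time()

    def timed(task: Callable[[], Any]):
        result = task()
        # Elapsed time is measured only after the task itself has returned.
        return result, time.time() - start_time

    with ThreadPoolExecutor(max_workers=2) as executor:
        arm_future = executor.submit(timed, poll_arm)
        k8s_future = executor.submit(timed, wait_k8s)
    # Leaving the with-block waits for both futures, so a failure in one task
    # does not cancel the other and partial diagnostics remain available.

    outcome: Dict[str, Dict[str, Any]] = {}
    for name, future in (("arm", arm_future), ("k8s", k8s_future)):
        error = future.exception()
        if error is None:
            result, elapsed = future.result()
            outcome[name] = {"result": result, "elapsed": elapsed, "error": None}
        else:
            outcome[name] = {"result": None, "elapsed": None, "error": error}
    return outcome
```

Because the work runs on plain threads, the caller does not need to care whether an event loop is already running, which is the concern the notes raise about asyncio.run().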

@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch 7 times, most recently from 791370e to 73ebe22 Compare June 8, 2026 21:13
@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch 3 times, most recently from 0b1cf21 to 4603b7b Compare June 10, 2026 00:31
@diamondpowell diamondpowell marked this pull request as ready for review June 10, 2026 00:40
Copilot AI review requested due to automatic review settings June 10, 2026 00:40

Copilot AI left a comment

Contributor

Pull request overview

This PR adds independent node-readiness timing to the Python AKS CRUD flow by extending the Kubernetes wait helper to optionally return a readiness timestamp, then running the ARM poller and the K8s readiness wait concurrently so both timings can be recorded for regression analysis.

Changes:

  • Extend KubernetesClient.wait_for_nodes_ready() with return_timestamp to optionally return (ready_nodes, ready_timestamp).
  • Add concurrent ARM + K8s readiness execution in AKSClient and record node_readiness_time / command_execution_time metadata.
  • Update AKS and Kubernetes client unit tests to cover the new return shape and timing metadata recording.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| modules/python/clients/kubernetes_client.py | Adds return_timestamp option so callers can capture when nodes became Ready. |
| modules/python/clients/aks_client.py | Runs ARM and readiness concurrently and stores separate timing metadata. |
| modules/python/tests/clients/test_kubernetes_client.py | Adds a unit test validating the timestamp-returning behavior. |
| modules/python/tests/clients/test_aks_client.py | Updates tests to expect timestamp return and to assert timing metadata is recorded. |

Comment thread modules/python/clients/aks_client.py Outdated

return OperationContext

def _run_concurrent_arm_and_readiness(

Collaborator

I think _instrument_nodepool_provisioning is a better name for the method

Contributor Author

Done, much clearer name for what this method actually does.

Comment thread modules/python/clients/aks_client.py Outdated

def _run_concurrent_arm_and_readiness(
self,
poller,

Collaborator

I see that only begin_create_or_update is passed in, so there is no need for a separate poller param.

Contributor Author

Moved the begin_create_or_update call inside the method. Callers now just pass node_pool_name, cluster_name, parameters, and node_count, which cleaned up all 3 call sites pretty nicely.

Comment thread modules/python/clients/aks_client.py Outdated
poller,
node_count: int,
label_selector: str,
start_time: float

Collaborator

I think start_time can be captured within this function; there is no need to pass it in from outside.

Contributor Author

Rolled this into the same change as the poller removal since both touched the method signature. start_time and label_selector are both derived internally now.

Use ThreadPoolExecutor instead of asyncio.run() to avoid implicit requirement that callers must not be in an existing event loop. Keeps the same method signature and behavior.
Move begin_create_or_update call and start_time capture inside the method. Callers now pass node_pool_name, cluster_name, parameters, and node_count instead of a pre-created poller and external timestamp.
Log now shows which layer (ARM vs K8s) was the bottleneck, the delta between them, and the total elapsed time for the concurrent operation.
@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch from 148a7c2 to 278e27d Compare June 24, 2026 07:12
@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch from a083a19 to 7c896e4 Compare June 24, 2026 15:12
Comment thread modules/python/clients/aks_client.py Outdated
gpu_instance_profile=gpu_instance_profile,
gpu_mig_strategy=gpu_mig_strategy,
)
command_execution_time = time.time() - start_time

@liyu-ma liyu-ma Jun 25, 2026

Collaborator

What is the reason for the addition here? Is instrumenting the GPU pool within this PR's scope?

Comment thread modules/python/clients/aks_client.py Outdated

arm_exc = arm_future.exception()
k8s_exc = k8s_future.exception()
k8s_exc = k8s_future.exception()

Collaborator

Why are lines 116 and 117 duplicated?

Comment thread modules/python/clients/aks_client.py Outdated

return arm_response, ready_nodes, node_readiness_time, command_execution_time

def _log_timing_metrics(self, op, node_pool_name, node_readiness_time, command_execution_time):

Collaborator

How about inlining this function into _instrument_nodepool_provisioning and calling it at the end?

Comment thread modules/python/clients/aks_client.py Outdated
parameters=parameters,
)

def _poll_arm():

@liyu-ma liyu-ma Jun 25, 2026

Collaborator

I don't see the need for separate _poll_arm and _wait_k8s functions. How about simplifying it with lambdas, for example:

        with ThreadPoolExecutor(max_workers=2) as executor:
            arm_future = executor.submit(lambda: (poller.result(), time.time()))
            k8s_future = executor.submit(
                lambda: self.k8s_client.wait_for_nodes_ready(
                    node_count=node_count,
                    operation_timeout_in_minutes=self.operation_timeout_minutes,
                    label_selector=label_selector,
                    return_timestamp=True,
                )
            )

node_pool.count = node_count

logger.info(f"Scaling node pool {node_pool_name} to {node_count} nodes")
self._begin_update_with_retry(

Collaborator

The deletion here is a behavior change. Your _instrument_nodepool_provisioning calls begin_create_or_update, which does not retry. I would suggest using _begin_update_with_retry within your instrumentation, since it retries on transient errors, but note that retries may skew your metrics.
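
One way to keep retries without hiding their effect on the timings is to record how many attempts were made and when the successful attempt started. The sketch below is illustrative only and is not the repository's `_begin_update_with_retry`: `begin_with_retry`, `TransientError`, `max_attempts`, and `backoff_seconds` are all hypothetical names.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def begin_with_retry(
    begin_operation: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.01,
) -> Tuple[T, int, float]:
    """Return (result, attempts, start time of the successful attempt)."""
    for attempt in range(1, max_attempts + 1):
        attempt_start = time.time()
        try:
            return begin_operation(), attempt, attempt_start
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            time.sleep(backoff_seconds)
    raise ValueError("max_attempts must be at least 1")
```

Storing the attempt count next to the durations would let later analysis filter or flag runs where retries inflated command_execution_time.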

"cluster_info", self.get_cluster_data(cluster_name)
)
node_pool.count = step # Update node count in the node pool object
self._begin_update_with_retry(

Collaborator

Same as above. The new code loses the retry support.

raise Exception(f"Error deleting Node '{node_name}': {str(e)}") from e

def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None):
def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None, return_timestamp=False):

Collaborator

The new param complicates this function and, more importantly, is unnecessary. You should be able to capture the current time at the call site, something like:

        k8s_future = executor.submit(
            lambda: (
                self.k8s_client.wait_for_nodes_ready( ... ),
                time.time(),
            )
        )

time.time() is evaluated only after wait_for_nodes_ready() returns, because the elements of the tuple are evaluated left to right.
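
The evaluation-order point can be checked with a small self-contained example. `slow_wait` is a hypothetical stand-in for wait_for_nodes_ready() that simply sleeps before returning; nothing here touches a real cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def slow_wait(delay_seconds: float) -> List[str]:
    """Hypothetical stand-in for wait_for_nodes_ready(): block, then return node names."""
    time.sleep(delay_seconds)
    return ["node-0"]


start_time = time.time()
with ThreadPoolExecutor(max_workers=1) as executor:
    # Tuple elements are evaluated left to right, so time.time() runs
    # only after slow_wait() has returned.
    future = executor.submit(lambda: (slow_wait(0.05), time.time()))
    ready_nodes, ready_timestamp = future.result()

# The measured readiness time therefore includes the full wait.
node_readiness_time = ready_timestamp - start_time
```

This keeps the timestamp capture at the call site, so the wait helper itself would not need an extra parameter.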

Comment thread modules/python/clients/aks_client.py Outdated
op.add_metadata("node_readiness_time", node_readiness_time)
op.add_metadata("command_execution_time", command_execution_time)
delta = abs(command_execution_time - node_readiness_time)
bottleneck = "ARM" if command_execution_time > node_readiness_time else "K8s"

@liyu-ma liyu-ma Jun 25, 2026

Collaborator

I don't like logging a 'bottleneck'; it is very subjective. Logging should just record the facts, e.g. timestamps and the delta. Your data analytics tool is responsible for deciding whether something is a bottleneck.
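
A facts-only version of the timing record might look like the sketch below. It is an assumption-laden illustration: `build_timing_facts` and the `readiness_minus_command_seconds` key are hypothetical names, while `node_readiness_time` and `command_execution_time` are the metadata keys from the PR description. The timestamps in the usage lines are made-up inputs.

```python
import logging
from typing import Dict

logger = logging.getLogger(__name__)


def build_timing_facts(
    start_time: float,
    arm_done_timestamp: float,
    nodes_ready_timestamp: float,
) -> Dict[str, float]:
    """Derive durations from raw timestamps without labelling either layer."""
    command_execution_time = arm_done_timestamp - start_time
    node_readiness_time = nodes_ready_timestamp - start_time
    return {
        "command_execution_time": command_execution_time,
        "node_readiness_time": node_readiness_time,
        # Signed delta (hypothetical key): positive means nodes became Ready after ARM completed.
        "readiness_minus_command_seconds": node_readiness_time - command_execution_time,
    }


# Example with made-up timestamps: ARM finishes at t=160, nodes are Ready at t=175.5.
facts = build_timing_facts(100.0, 160.0, 175.5)
logger.info("Node pool timing: %s", facts)
```

A signed delta preserves the ordering of the two events, which abs() would discard, and leaves any interpretation to the analysis side.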

@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch from d799fa0 to ddf8598 Compare July 1, 2026 16:23