Dipowell/node readiness timing#1208
Conversation
791370e to
73ebe22
Compare
0b1cf21 to
4603b7b
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds independent node-readiness timing to the Python AKS CRUD flow by extending the Kubernetes wait helper to optionally return a readiness timestamp, then running the ARM poller and the K8s readiness wait concurrently so both timings can be recorded for regression analysis.
Changes:
- Extend
KubernetesClient.wait_for_nodes_ready()withreturn_timestampto optionally return(ready_nodes, ready_timestamp). - Add concurrent ARM + K8s readiness execution in
AKSClientand recordnode_readiness_time/command_execution_timemetadata. - Update AKS and Kubernetes client unit tests to cover the new return shape and timing metadata recording.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| modules/python/clients/kubernetes_client.py | Adds return_timestamp option so callers can capture when nodes became Ready. |
| modules/python/clients/aks_client.py | Runs ARM and readiness concurrently and stores separate timing metadata. |
| modules/python/tests/clients/test_kubernetes_client.py | Adds a unit test validating the timestamp-returning behavior. |
| modules/python/tests/clients/test_aks_client.py | Updates tests to expect timestamp return and to assert timing metadata is recorded. |
|
|
||
| return OperationContext | ||
|
|
||
| def _run_concurrent_arm_and_readiness( |
There was a problem hiding this comment.
I think _instrument_nodepool_provisioning is a better name for the method
There was a problem hiding this comment.
Done, much clearer name for what this method actually does.
|
|
||
| def _run_concurrent_arm_and_readiness( | ||
| self, | ||
| poller, |
There was a problem hiding this comment.
I see only begin_create_or_update is passed in so no need to have a separate poller param
There was a problem hiding this comment.
Moved the begin_create_or_update call inside the method. Callers now just pass node_pool_name, cluster_name, parameters, and node_count, which cleaned up all 3 call sites pretty nicely.
| poller, | ||
| node_count: int, | ||
| label_selector: str, | ||
| start_time: float |
There was a problem hiding this comment.
I think start_time can be captured within this function, no need to be passed from outside
There was a problem hiding this comment.
Rolled this into the same change as the poller removal since both touched the method signature. start_time and label_selector are both derived internally now.
Use ThreadPoolExecutor instead of asyncio.run() to avoid implicit requirement that callers must not be in an existing event loop. Keeps the same method signature and behavior.
Move begin_create_or_update call and start_time capture inside the method. Callers now pass node_pool_name, cluster_name, parameters, and node_count instead of a pre-created poller and external timestamp.
Log now shows which layer (ARM vs K8s) was the bottleneck, the delta between them, and the total elapsed time for the concurrent operation.
148a7c2 to
278e27d
Compare
a083a19 to
7c896e4
Compare
| gpu_instance_profile=gpu_instance_profile, | ||
| gpu_mig_strategy=gpu_mig_strategy, | ||
| ) | ||
| command_execution_time = time.time() - start_time |
There was a problem hiding this comment.
What is the reason for the addition here? Is this your PR's scope to instrument GPU pool?
|
|
||
| arm_exc = arm_future.exception() | ||
| k8s_exc = k8s_future.exception() | ||
| k8s_exc = k8s_future.exception() |
There was a problem hiding this comment.
Why the duplication for line 116 and 117?
|
|
||
| return arm_response, ready_nodes, node_readiness_time, command_execution_time | ||
|
|
||
| def _log_timing_metrics(self, op, node_pool_name, node_readiness_time, command_execution_time): |
There was a problem hiding this comment.
How about inline this function within _instrument_nodepool_provisioning and call it at the end?
| parameters=parameters, | ||
| ) | ||
|
|
||
| def _poll_arm(): |
There was a problem hiding this comment.
I don't see the need to have separate _poll_arm and _wait_k8s. How about simplifying it with lamda, for example:
with ThreadPoolExecutor(max_workers=2) as executor:
arm_future = executor.submit(lambda: (poller.result(), time.time()))
k8s_future = executor.submit(
lambda: self.k8s_client.wait_for_nodes_ready(
node_count=node_count,
operation_timeout_in_minutes=self.operation_timeout_minutes,
label_selector=label_selector,
return_timestamp=True,
)
)
| node_pool.count = node_count | ||
|
|
||
| logger.info(f"Scaling node pool {node_pool_name} to {node_count} nodes") | ||
| self._begin_update_with_retry( |
There was a problem hiding this comment.
The deletion here is a behavior change. Your _instrument_nodepool_provisioning calls begin_create_or_update which does not do retry. I would suggest using _begin_update_with_retry within your instrumentation as it retries on transient errors - but this may skew your metrics.
| "cluster_info", self.get_cluster_data(cluster_name) | ||
| ) | ||
| node_pool.count = step # Update node count in the node pool object | ||
| self._begin_update_with_retry( |
There was a problem hiding this comment.
Same as above. Now the new codes lose the retry support.
| raise Exception(f"Error deleting Node '{node_name}': {str(e)}") from e | ||
|
|
||
| def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None): | ||
| def wait_for_nodes_ready(self, node_count, operation_timeout_in_minutes, label_selector=None, return_timestamp=False): |
There was a problem hiding this comment.
The new param makes this function complicated and more importantly - it is unnecessary. You should be able to capture the current time at call site, something like:
k8s_future = executor.submit(
lambda: (
self.k8s_client.wait_for_nodes_ready( ... ),
time.time(),
)
)
The time.time() function will only be executed when wait_for_nodes_ready() completes.
| op.add_metadata("node_readiness_time", node_readiness_time) | ||
| op.add_metadata("command_execution_time", command_execution_time) | ||
| delta = abs(command_execution_time - node_readiness_time) | ||
| bottleneck = "ARM" if command_execution_time > node_readiness_time else "K8s" |
There was a problem hiding this comment.
Don't like the way to log a 'bottleneck' - it is very subjective. Logging should better just record the facts e.g timestamps, delta. Your data analytics tool is responsible for decision making whether it is a bottleneck
d799fa0 to
ddf8598
Compare
Summary
Adds
node_readiness_timeas a separate metric in the open-source CRUD module to match internal repo behavior. The internal repo captures how long K8s nodes take to become Ready independently from when the ARM API completes. The open-source repo was missing this - it only had combined duration.Azure API says "done" when the control plane finishes, but nodes might not be schedulable yet. Capturing both timestamps separately enables regression analysis:
command_execution_time > node_readiness_time-> ARM layer is the bottlenecknode_readiness_time > command_execution_time-> K8s layer is the bottleneckChanges
kubernetes_client.py
return_timestamp=Falseparameter towait_for_nodes_ready()True, returns(ready_nodes, timestamp)tuple instead of justready_nodesaks_client.py
_instrument_nodepool_provisioning()helper usingThreadPoolExecutorcreate_node_pool(),scale_node_pool(), and_progressive_scale()to capture concurrent timingTiming metadata stored via
op.add_metadata():node_readiness_time: seconds from start until K8s nodes were Readycommand_execution_time: seconds from start until ARM operation completedImplementation notes
ThreadPoolExecutor(max_workers=2)to run ARM polling and K8s readiness checks concurrently, avoiding implicit event loop requirements thatasyncio.run()would introduce