RDKB-64847: Observed spike in load and CPU due to cpu_telemetry2_0#382
Open
tabbas651 wants to merge 4 commits into
Open
RDKB-64847: Observed spike in load and CPU due to cpu_telemetry2_0#382tabbas651 wants to merge 4 commits into
tabbas651 wants to merge 4 commits into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces a “CPU contention avoidance window” to prevent telemetry HTTP upload work (curl pool acquisitions) from running concurrently with DCA top/process marker sampling, aiming to reduce CPU spikes and resulting WAN timeouts/FD pressure on resource-constrained devices.
Changes:
- Added new public APIs to begin/end a sampling window in the HTTP connection pool interface.
- Implemented sampling-window logic in the curl pool to defer new handle acquisitions while DCA sampling is active (with bounded waiting).
- Integrated the sampling window into
processTopPattern()so curl work is deferred during top/process collection.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
source/protocol/http/multicurlinterface.h |
Adds public begin/end sampling window APIs for coordinating with DCA sampling. |
source/protocol/http/multicurlinterface.c |
Implements sampling-window flagging/drain logic and defers new curl handle acquisitions during the window. |
source/dcautil/dca.c |
Wraps DCA top/process marker collection with begin/end sampling window calls. |
Reason for change: When DCA top/process marker sampling (processTopPattern) runs concurrently with telemetry HTTP uploads (curl), both CPU-intensive paths compete for resources causing WAN timeouts, FD exhaustion, and failed report uploads on resource-constrained devices. Introduced a CPU contention avoidance window that defers new curl handle acquisitions while DCA top pattern collection is active, preventing two CPU-intensive telemetry paths from running in parallel. Test Procedure: performance testing Risks: Medium Priority: P0 Signed-off-by: Thamim Razith Abbas Ali <[email protected]>
ab04314 to
3856a73
Compare
Comment on lines
+145
to
+146
| elapsed_ms = (unsigned int)((current_time.tv_sec - start_time.tv_sec) * 1000U); | ||
| elapsed_ms += (unsigned int)((current_time.tv_nsec - start_time.tv_nsec) / 1000000L); |
Comment on lines
565
to
569
| if (active_requests > 0) | ||
| { | ||
| active_requests--; | ||
| } | ||
| T2Info("Released curl handle = %d, active_requests = %d\n", idx, active_requests); |
Comment on lines
323
to
+335
| for (; var < vCount; ++var) // Loop of marker list starts here | ||
| { | ||
| if(!isSamplingActive) | ||
| { | ||
| /* | ||
| * Enter sampling window so new curl work is deferred while | ||
| * top/process markers are being collected. | ||
| */ | ||
| if(http_pool_begin_sampling_window(DCA_SAMPLING_CURL_WAIT_TIMEOUT_MS) == T2ERROR_SUCCESS) | ||
| { | ||
| isSamplingActive = true; | ||
| } | ||
| } |
Comment on lines
78
to
81
| static bool pool_initialized = false; | ||
| static bool pool_shutting_down = false; | ||
| static bool sampling_window_active = false; | ||
|
|
Comment on lines
+118
to
+120
| pthread_mutex_lock(&pool_mutex); | ||
| sampling_window_active = true; | ||
| pthread_mutex_unlock(&pool_mutex); |
Comment on lines
+179
to
+181
| pthread_mutex_lock(&pool_mutex); | ||
| sampling_window_active = false; | ||
| pthread_mutex_unlock(&pool_mutex); |
Comment on lines
265
to
269
| T2Debug("%s ++in\n", __FUNCTION__); | ||
| bool isSamplingActive = false; | ||
| if(profileName == NULL || topMarkerList == NULL) | ||
| { | ||
| T2Error("Invalid arguments for %s\n", __FUNCTION__); |
…ocessTopPattern()
Comment on lines
323
to
+335
| for (; var < vCount; ++var) // Loop of marker list starts here | ||
| { | ||
| if(!isSamplingActive) | ||
| { | ||
| /* | ||
| * Enter sampling window so new curl work is deferred while | ||
| * top/process markers are being collected. | ||
| */ | ||
| if(http_pool_begin_sampling_window(DCA_SAMPLING_CURL_WAIT_TIMEOUT_MS) == T2ERROR_SUCCESS) | ||
| { | ||
| isSamplingActive = true; | ||
| } | ||
| } |
Comment on lines
+509
to
+527
| if (sampling_window_refcount > 0) | ||
| { | ||
| pthread_mutex_unlock(&pool_mutex); | ||
|
|
||
| if (clock_gettime(CLOCK_MONOTONIC, ¤t_time) != 0) | ||
| { | ||
| T2Error("clock_gettime failed for current_time: %s\n", strerror(errno)); | ||
| return T2ERROR_FAILURE; | ||
| } | ||
|
|
||
| if ((current_time.tv_sec - start_time.tv_sec) >= POOL_ACQUIRE_TIMEOUT_SEC) | ||
| { | ||
| T2Error("Timeout waiting for sampling window to complete, treating as upload failure\n"); | ||
| return T2ERROR_FAILURE; | ||
| } | ||
|
|
||
| usleep(POOL_ACQUIRE_RETRY_MS * 1000); | ||
| continue; | ||
| } |
Comment on lines
+107
to
+121
| T2ERROR http_pool_begin_sampling_window(unsigned int timeout_ms) | ||
| { | ||
| struct timespec start_time, current_time; | ||
| bool timeout_reached = false; | ||
|
|
||
| if (clock_gettime(CLOCK_MONOTONIC, &start_time) != 0) | ||
| { | ||
| T2Error("clock_gettime failed for start_time in %s: %s\n", __FUNCTION__, strerror(errno)); | ||
| return T2ERROR_FAILURE; | ||
| } | ||
|
|
||
| pthread_mutex_lock(&pool_mutex); | ||
| sampling_window_refcount++; | ||
| pthread_mutex_unlock(&pool_mutex); | ||
|
|
9bc96df to
9a51828
Compare
Comment on lines
+76
to
+77
| /* Maximum wait for in-flight requests to drain when DCA sampling starts. */ | ||
| #define SAMPLE_WINDOW_DRAIN_RETRY_MS 100 |
Comment on lines
+107
to
+111
| T2ERROR http_pool_begin_sampling_window(unsigned int timeout_ms) | ||
| { | ||
| struct timespec start_time, current_time; | ||
| bool timeout_reached = false; | ||
|
|
Comment on lines
35
to
39
| -I${top_srcdir}/source/utils \ | ||
| -I${top_srcdir}/source/bulkdata \ | ||
| -I${top_srcdir}/source/protocol/http \ | ||
| -I${PKG_CONFIG_SYSROOT_DIR}$(includedir)/ccsp | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reason for change: When DCA top/process marker sampling (processTopPattern) runs concurrently with telemetry HTTP uploads (curl), both CPU-intensive paths compete for resources causing WAN timeouts, FD exhaustion, and failed report uploads on resource-constrained devices. Introduced a CPU contention avoidance window that defers new curl handle acquisitions while DCA top pattern collection is active, preventing two CPU-intensive telemetry paths from running in parallel.
Test Procedure: performance testing
Risks: Medium
Priority: P0