OCPBUGS-88685: Fix metrics-proxy unbounded memory growth #8740
/jira refresh

@muraee: This pull request references Jira Issue OCPBUGS-88685, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
…transports

Every ScrapeAll call created a new http.Client with a new http.Transport that was never closed. Go's http.Transport retains idle connections until CloseIdleConnections is called. Since we never reused or closed these ephemeral transports, connections and their TLS buffers accumulated across scrape cycles (every 30-60s), growing memory usage from 40Mi to 2774Mi and triggering RequestServingNodesNeedUpscale alerts.

Set DisableKeepAlives on the Transport to prevent connection pooling on clients that are never reused, and defer CloseIdleConnections as a safety net.

Add scraper_test.go with a regression test verifying that connections are closed after ScrapeAll returns.

Refs: https://issues.redhat.com/browse/OCPBUGS-88685
Signed-off-by: Mulham Raee <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #8740 +/- ##
==========================================
+ Coverage 41.75% 41.76% +0.01%
==========================================
Files 758 758
Lines 93981 93983 +2
==========================================
+ Hits 39240 39252 +12
+ Misses 51988 51982 -6
+ Partials 2753 2749 -4
@muraee: all tests passed!
Summary
- Fix unbounded memory growth in `metrics-proxy` caused by leaked `http.Transport` connection pools
- Every `ScrapeAll` call created a new `http.Client`/`http.Transport` that was never closed. Idle connections and TLS buffers accumulated across scrape cycles (every 30-60s), growing memory usage from 40Mi to 2774Mi
- Set `DisableKeepAlives: true` on the ephemeral Transport and `defer client.CloseIdleConnections()` as a safety net
- Add `scraper_test.go` with 8 tests covering transport configuration, parallel scraping, error handling, and a connection-leak regression test

Fixes: https://issues.redhat.com/browse/OCPBUGS-88685

Root Cause
Go's `http.Transport` maintains an internal connection pool, and its documentation advises reusing Transports rather than creating them as needed.

`buildScrapeClient()` was called on every `ScrapeAll` invocation, creating a fresh Transport each time. Since the client was never reused and `CloseIdleConnections()` was never called, idle connections from each Transport accumulated indefinitely, each retaining TLS session state and read/write buffers.

Test plan
- `go test ./control-plane-operator/metrics-proxy/...` passes (8 new tests in `scraper_test.go`)
- `TestScrapeAll/When_scraping_completes,_it_should_not_leak_idle_connections`: regression test using `ConnState` tracking to verify that all connections are closed after `ScrapeAll` returns
- `TestBuildScrapeClient`: verifies that `DisableKeepAlives: true` is set on the Transport for both nil and non-nil TLS configs

🤖 Generated with Claude Code
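The `ConnState` tracking mentioned in the test plan can be sketched roughly as follows. This is an illustration of the technique, not the test from `scraper_test.go`: the `connTracker` type, the `leakedConns` helper, and the two-second polling deadline are all assumptions. The idea is to hook the server's `ConnState` callback, issue a request, and then check that every connection the server saw has reached `http.StateClosed`.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// connTracker records the most recent state of each server-side connection.
// Hypothetical helper for illustration.
type connTracker struct {
	mu     sync.Mutex
	states map[net.Conn]http.ConnState
}

func (t *connTracker) observe(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.states[c] = s
}

// counts returns how many connections were seen and how many are not closed.
func (t *connTracker) counts() (seen, open int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, s := range t.states {
		seen++
		if s != http.StateClosed {
			open++
		}
	}
	return seen, open
}

// leakedConns performs one GET against a local server and reports how many
// connections the server saw and how many were still open once the deadline
// passed (or earlier, as soon as all of them were closed).
func leakedConns(disableKeepAlives bool) (seen, open int, err error) {
	tracker := &connTracker{states: map[net.Conn]http.ConnState{}}
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "example_metric 1\n")
	}))
	srv.Config.ConnState = tracker.observe
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: disableKeepAlives}}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, 0, err
	}
	_, _ = io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	// Connection teardown is asynchronous, so poll briefly instead of
	// checking the state immediately.
	deadline := time.Now().Add(2 * time.Second)
	for {
		seen, open = tracker.counts()
		if (seen > 0 && open == 0) || time.Now().After(deadline) {
			return seen, open, nil
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	for _, disabled := range []bool{true, false} {
		seen, open, err := leakedConns(disabled)
		fmt.Printf("DisableKeepAlives=%v seen=%d open=%d err=%v\n", disabled, seen, open, err)
	}
}
```

With keep-alives left enabled, the connection is expected to stay idle rather than closed, which is the leak the regression test guards against.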