Skip to content

OCPBUGS-88685: Fix metrics-proxy unbounded memory growth#8740

Open
muraee wants to merge 1 commit into
openshift:mainfrom
muraee:OCPBUGS-88685-fix-metrics-proxy-memory-leak
Open

OCPBUGS-88685: Fix metrics-proxy unbounded memory growth#8740
muraee wants to merge 1 commit into
openshift:mainfrom
muraee:OCPBUGS-88685-fix-metrics-proxy-memory-leak

Conversation

@muraee

@muraee muraee commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix unbounded memory growth in metrics-proxy caused by leaked http.Transport connection pools
  • Every ScrapeAll call created a new http.Client/http.Transport that was never closed — idle connections and TLS buffers accumulated across scrape cycles (every 30-60s), growing from 40Mi to 2774Mi
  • Set DisableKeepAlives: true on the ephemeral Transport and defer client.CloseIdleConnections() as a safety net
  • Add scraper_test.go with 8 tests covering transport configuration, parallel scraping, error handling, and a connection-leak regression test

Fixes: https://issues.redhat.com/browse/OCPBUGS-88685

Root Cause

Go's http.Transport maintains an internal connection pool. The documentation states:

Transports should be reused instead of created as needed. Transports are safe for concurrent use by multiple goroutines.

buildScrapeClient() was called on every ScrapeAll invocation, creating a fresh Transport each time. Since the client was never reused and CloseIdleConnections() was never called, idle connections from each Transport accumulated indefinitely — each retaining TLS session state and read/write buffers.

Test plan

  • go test ./control-plane-operator/metrics-proxy/... passes (8 new tests in scraper_test.go)
  • TestScrapeAll/When_scraping_completes,_it_should_not_leak_idle_connections — regression test using ConnState tracking to verify all connections are closed after ScrapeAll returns
  • TestBuildScrapeClient — verifies DisableKeepAlives: true is set on Transport for both nil and non-nil TLS configs
  • Deploy to a cluster with metrics forwarding enabled and confirm memory usage remains stable over time

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved HTTP connection management in the metrics scraper to properly close idle connections and prevent socket leakage.
  • Tests

    • Added comprehensive unit tests for the metrics proxy scraper to verify client configuration and scraping behavior across various scenarios.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@muraee muraee changed the title Bug 88685: Fix metrics-proxy unbounded memory growth OCPBUGS-88685: Fix metrics-proxy unbounded memory growth Jun 16, 2026
@openshift-ci openshift-ci Bot requested review from csrwng and sjenning June 16, 2026 15:57
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@muraee: This pull request references Jira Issue OCPBUGS-88685, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Fix unbounded memory growth in metrics-proxy caused by leaked http.Transport connection pools
  • Every ScrapeAll call created a new http.Client/http.Transport that was never closed — idle connections and TLS buffers accumulated across scrape cycles (every 30-60s), growing from 40Mi to 2774Mi
  • Set DisableKeepAlives: true on the ephemeral Transport and defer client.CloseIdleConnections() as a safety net
  • Add scraper_test.go with 8 tests covering transport configuration, parallel scraping, error handling, and a connection-leak regression test

Fixes: https://issues.redhat.com/browse/OCPBUGS-88685

Root Cause

Go's http.Transport maintains an internal connection pool. The documentation states:

Transports should be reused instead of created as needed. Transports are safe for concurrent use by multiple goroutines.

buildScrapeClient() was called on every ScrapeAll invocation, creating a fresh Transport each time. Since the client was never reused and CloseIdleConnections() was never called, idle connections from each Transport accumulated indefinitely — each retaining TLS session state and read/write buffers.

Test plan

  • go test ./control-plane-operator/metrics-proxy/... passes (8 new tests in scraper_test.go)
  • TestScrapeAll/When_scraping_completes,_it_should_not_leak_idle_connections — regression test using ConnState tracking to verify all connections are closed after ScrapeAll returns
  • TestBuildScrapeClient — verifies DisableKeepAlives: true is set on Transport for both nil and non-nil TLS configs
  • Deploy to a cluster with metrics forwarding enabled and confirm memory usage remains stable over time

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: cde0c6a8-a108-436e-9257-2f1fd1a36900

📥 Commits

Reviewing files that changed from the base of the PR and between 392fd5a and 309e1de.

📒 Files selected for processing (2)
  • control-plane-operator/metrics-proxy/scraper.go
  • control-plane-operator/metrics-proxy/scraper_test.go

📝 Walkthrough

Walkthrough

The scrape HTTP client transport in scraper.go is updated to set DisableKeepAlives: true alongside the existing TLSClientConfig, preventing persistent keep-alive connections. A deferred call to CloseIdleConnections() is added at the start of scrapeAll to ensure any idle sockets are closed when the function returns. Two new test functions are added in scraper_test.go: TestBuildScrapeClient checks transport keep-alive configuration and TLS handling (defaults, server name, config cloning), and TestScrapeAll checks per-target results, partial error isolation, connection cleanup via http.ConnState, and non-200 status error messages.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly identifies the bug (OCPBUGS-88685) and concisely summarizes the main fix (unbounded memory growth in metrics-proxy), which aligns with the core objective of resolving a memory leak in the HTTP transport layer.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Tests use standard Go testing framework (not Ginkgo). All 8 test names in t.Run() are stable, deterministic, and contain no dynamic values like generated suffixes, timestamps, UUIDs, or changing id...
Test Structure And Quality ✅ Passed The custom check targets Ginkgo test code, but this PR contains standard Go tests with Gomega matchers. The check is not applicable to this pull request's testing approach.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only Go source files (scraper.go and scraper_test.go) for HTTP client memory leak fix. No deployment manifests, Kubernetes pod specs, or scheduling constraints are added or modified.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR adds standard Go unit tests (using testing.T and Gomega), not Ginkgo e2e tests. The custom check applies only to "new Ginkgo e2e tests," so it does not apply here.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms, custom crypto implementations, or unsafe secret comparisons found in the modified files; TLS is configured securely with version 1.2+.
Container-Privileges ✅ Passed PR contains only Go source code changes (scraper.go and scraper_test.go) with no Kubernetes manifests or container specifications, making the container-privileges check inapplicable.
No-Sensitive-Data-In-Logs ✅ Passed PR adds no new logging statements. Changes are limited to HTTP connection management (defer CloseIdleConnections, set DisableKeepAlives) and unit tests with no logging.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Jun 16, 2026
@muraee muraee force-pushed the OCPBUGS-88685-fix-metrics-proxy-memory-leak branch from 325c3d4 to e86fc21 Compare June 16, 2026 15:57
@muraee

muraee commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@muraee: This pull request references Jira Issue OCPBUGS-88685, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

…transports

Every ScrapeAll call created a new http.Client with a new http.Transport
that was never closed. Go's http.Transport retains idle connections until
CloseIdleConnections is called — since we never reused or closed these
ephemeral transports, connections and their TLS buffers accumulated
across scrape cycles (every 30-60s), growing from 40Mi to 2774Mi and
triggering RequestServingNodesNeedUpscale alerts.

Set DisableKeepAlives on the Transport to prevent connection pooling on
clients that are never reused, and defer CloseIdleConnections as a
safety net. Add scraper_test.go with regression test verifying
connections are closed after ScrapeAll returns.

Refs: https://issues.redhat.com/browse/OCPBUGS-88685
Signed-off-by: Mulham Raee <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
@muraee muraee force-pushed the OCPBUGS-88685-fix-metrics-proxy-memory-leak branch from e86fc21 to 309e1de Compare June 16, 2026 16:00
@openshift-ci-robot

Copy link
Copy Markdown

@muraee: This pull request references Jira Issue OCPBUGS-88685, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Fix unbounded memory growth in metrics-proxy caused by leaked http.Transport connection pools
  • Every ScrapeAll call created a new http.Client/http.Transport that was never closed — idle connections and TLS buffers accumulated across scrape cycles (every 30-60s), growing from 40Mi to 2774Mi
  • Set DisableKeepAlives: true on the ephemeral Transport and defer client.CloseIdleConnections() as a safety net
  • Add scraper_test.go with 8 tests covering transport configuration, parallel scraping, error handling, and a connection-leak regression test

Fixes: https://issues.redhat.com/browse/OCPBUGS-88685

Root Cause

Go's http.Transport maintains an internal connection pool. The documentation states:

Transports should be reused instead of created as needed. Transports are safe for concurrent use by multiple goroutines.

buildScrapeClient() was called on every ScrapeAll invocation, creating a fresh Transport each time. Since the client was never reused and CloseIdleConnections() was never called, idle connections from each Transport accumulated indefinitely — each retaining TLS session state and read/write buffers.

Test plan

  • go test ./control-plane-operator/metrics-proxy/... passes (8 new tests in scraper_test.go)
  • TestScrapeAll/When_scraping_completes,_it_should_not_leak_idle_connections — regression test using ConnState tracking to verify all connections are closed after ScrapeAll returns
  • TestBuildScrapeClient — verifies DisableKeepAlives: true is set on Transport for both nil and non-nil TLS configs
  • Deploy to a cluster with metrics forwarding enabled and confirm memory usage remains stable over time

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

  • Improved HTTP connection management in the metrics scraper to properly close idle connections and prevent socket leakage.

  • Tests

  • Added comprehensive unit tests for the metrics proxy scraper to verify client configuration and scraping behavior across various scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.76%. Comparing base (ee9099c) to head (309e1de).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8740      +/-   ##
==========================================
+ Coverage   41.75%   41.76%   +0.01%     
==========================================
  Files         758      758              
  Lines       93981    93983       +2     
==========================================
+ Hits        39240    39252      +12     
+ Misses      51988    51982       -6     
+ Partials     2753     2749       -4     
Files with missing lines Coverage Δ
control-plane-operator/metrics-proxy/scraper.go 78.46% <100.00%> (+16.55%) ⬆️
Flag Coverage Δ
cmd-support 35.02% <ø> (ø)
cpo-hostedcontrolplane 44.10% <ø> (ø)
cpo-other 43.54% <100.00%> (+0.09%) ⬆️
hypershift-operator 51.82% <ø> (ø)
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@muraee: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants