Fix quadratic performance and memory issues in duplicate-code checker #10881
Pierre-Sassoulas wants to merge 7 commits into pylint-dev:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #10881 +/- ##
==========================================
+ Coverage 96.19% 96.24% +0.05%
==========================================
Files 178 178
Lines 19683 19679 -4
==========================================
+ Hits 18934 18940 +6
+ Misses 749 739 -10
Force-pushed from 092ebd4 to 5ee074a
DanielNoord left a comment
Do we have an expert on this in our pool of maintainers? If the tests pass, this LGTM, as I really have no experience with this part of the codebase or these kinds of computations... 😅
But I'm not sure we have anybody who does have that experience.
The expert is hippo91; I'm going to ping him when it's ready (there's still a coverage issue, and I didn't run the primer locally yet because it takes a ton of time). Although I'm not sure there's going to be a lot of duplication in the primer, since maintainers tend to remove it. Also, we're going to have primer runs for all the duplicate-code related fixes next (in particular, the search for duplicate code in the same file is pretty bad in pylint itself).
jacobtylerwalls left a comment
I looked, and the changes smell right and are clearly described.
I don't know anything about "still lets remove_successive coalesce consecutive matches", so I would like just a little test coverage in this area. Can we have a test that mocks _HASH_BUCKET_PRODUCT_LIMIT to something very low and then runs a functional test over some code that triggers the other form of the algorithm?
```diff
  # Duplicate code takes too long and is relatively safe
  # We don't want to lint the test directory which are repetitive
- disables = ["--disable=duplicate-code", "--ignore=test"]
+ disables = ["--ignore=test"]
```
I guess we can trial this for now and revert it if it's too slow.
Three optimizations to the symilar checker:
- Rolling hash: compute the window hash incrementally (subtract the leaving line hash, add the entering one) instead of re-summing all k line hashes for every position.
- Cache hash_lineset results per lineset in _iter_sims: each file was being hashed once per pair (N-1 times) instead of once total.
- Remove the LinesChunk wrapper class and use plain int dict keys, so frozenset intersection and dict lookups use C-level hash/eq.

~25% faster on astroid (17.5s => 12.5s, 25k SLOC)
~70% faster on django (273s => 77s, 130k SLOC)
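For reference, a minimal sketch of the rolling-hash idea (function and variable names are illustrative, not pylint's actual hash_lineset code): keep a running sum of the k per-line hashes and update it as the window slides, instead of re-summing every window.

```python
from collections.abc import Sequence


def rolling_window_hashes(line_hashes: Sequence[int], k: int) -> list[int]:
    """Hash every k-line window in O(n) total instead of O(n*k).

    Each window hash is the sum of the per-line hashes it covers; sliding
    by one line subtracts the hash that leaves and adds the one that enters.
    """
    if len(line_hashes) < k:
        return []
    current = sum(line_hashes[:k])
    window_hashes = [current]
    for entering in range(k, len(line_hashes)):
        current += line_hashes[entering] - line_hashes[entering - k]
        window_hashes.append(current)
    return window_hashes
```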
…mmon
Replace the raw tuple[HashToIndex_T, IndexToLines_T] with a LineSetHashResult NamedTuple for clarity. Since _iter_sims always passes pre-computed hashes from its cache, make the parameters required and remove the unused fallback branches.
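A rough sketch of the shape described here; the alias definitions are simplified stand-ins and the field names are guesses, not the PR's actual code.

```python
from typing import NamedTuple

# Simplified stand-ins for the aliases referenced in the commit message.
HashToIndex_T = dict[int, list[int]]
IndexToLines_T = dict[int, tuple[int, int]]


class LineSetHashResult(NamedTuple):
    """Named result of hashing one lineset, replacing an anonymous 2-tuple."""

    hash_to_index: HashToIndex_T
    index_to_lines: IndexToLines_T
```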
process_module already receives the parsed nodes.Module from pylint's main pass, but stripped_lines was re-parsing every file from source text. Thread the existing AST through process_module → append_stream → LineSet → stripped_lines, falling back to astroid.parse() only when no tree is provided (standalone symilar CLI). The redundant parse dominated stripped_lines cost.

Per-file savings:
file size   time saved   memory saved
0 lines     0.14 ms      0.03 MB
924 lines   65 ms        2.1 MB
31k lines   2764 ms      101.6 MB

End-to-end on pylint's own codebase (179 files, ~49k SLOC):
before: median=6.6s, peak RSS=170 MB
after:  median=5.1s, peak RSS=149 MB
(1.5x faster, -12% memory)
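A hedged sketch of the fallback behaviour described above (simplified signature; the real logic lives in stripped_lines/LineSet inside symilar.py): reuse the module AST when the pylint run already produced one, and only parse the source in the standalone CLI case.

```python
import astroid
from astroid import nodes


def lines_to_strip(source_lines: list[str], tree: nodes.Module | None = None) -> set[int]:
    """Illustrative only: collect line numbers to ignore (imports, etc.)."""
    if tree is None:
        # Standalone symilar CLI: no pre-parsed AST was handed to us, so parse
        # the text ourselves. Inside a pylint run this parse is the redundant
        # work the commit removes, since process_module already has the Module.
        tree = astroid.parse("".join(source_lines))
    ignored: set[int] = set()
    for node in tree.body:
        if isinstance(node, (nodes.Import, nodes.ImportFrom)):
            ignored.add(node.lineno)
    return ignored
```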
When two files share a hash bucket with many indices (e.g. thousands of identical data lines), itertools.product explodes quadratically (4k x 22k = 88M pairs). Fall back to aligned zip pairing when the product exceeds 500 entries. The diagonal pairs are consecutive, so remove_successive still coalesces them into one correct block.

Benchmark on psf/black (316 files, 129k lines, including pathological profiling/ data files with 22k+ identical lines): hung forever -> 7.2s.
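A minimal sketch of the capped pairing strategy (the constant name and the 500 limit come from this PR; the function and its parameters are illustrative):

```python
import itertools
from collections.abc import Iterator

_HASH_BUCKET_PRODUCT_LIMIT = 500


def bucket_pairs(indices_1: list[int], indices_2: list[int]) -> Iterator[tuple[int, int]]:
    """Yield index pairs for one shared hash bucket without quadratic blow-up.

    Normal buckets get the exact cross product.  Pathological buckets
    (thousands of identical lines) fall back to aligned diagonal pairing:
    those pairs are consecutive, so a later coalescing pass such as
    remove_successive can still merge them into a single reported block.
    """
    if len(indices_1) * len(indices_2) <= _HASH_BUCKET_PRODUCT_LIMIT:
        yield from itertools.product(indices_1, indices_2)
    else:
        yield from zip(indices_1, indices_2)
```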
The duplicate-code checker was previously disabled in the stdlib primer because it was too slow. With the recent performance optimizations it might now complete in a reasonable time, so re-enable it.
Adds a functional test that mocks `_HASH_BUCKET_PRODUCT_LIMIT` to zero and runs symilar over repeated-block content so the aligned-zip fallback path is always exercised. Covers the previously untested branch flagged in the codecov report on pylint-dev#10881. Addresses Jacob's review request for test coverage on the "other form of the algorithm" introduced to cap quadratic behavior.
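Roughly what such a test could look like (the import path, the stream handling, and the _compute_sims call are assumptions about the current symilar API, not the PR's actual test):

```python
from io import StringIO
from unittest import mock

from pylint.checkers import symilar  # module location assumed


def test_zip_fallback_still_reports_duplicates() -> None:
    """Shrink the bucket-product limit to zero so the aligned-zip path runs."""
    repeated = "x = 1\ny = 2\nz = 3\nw = 4\n" * 10  # many identical windows
    with mock.patch.object(symilar, "_HASH_BUCKET_PRODUCT_LIMIT", 0):
        sim = symilar.Similar(min_lines=4)
        sim.append_stream("a.py", StringIO(repeated))
        sim.append_stream("b.py", StringIO(repeated))
        similarities = sim._compute_sims()  # private API; verify before relying on it
    assert similarities, "the duplicate block should survive the fallback path"
```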
Force-pushed from 3640ead to 1793928
Adds focused tests for `LineSetStartCouple.__eq__` NotImplemented branch, `LineSet.__str__` / `__getitem__` / non-LineSet `__eq__`, `append_stream` binary-without-encoding and UnicodeDecodeError paths, `report_similarities` table building, and the `process_module` deprecation warning when `linter.current_name` is None. Takes symilar.py from 96% to 99% coverage. The remaining gaps are an unreachable defensive `except KeyError` in `remove_successive` and the `if __name__ == "__main__"` guard.
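For instance, the NotImplemented branch could be hit with something like the following (constructor arguments and import path are assumptions, not the PR's actual test):

```python
from pylint.checkers.symilar import LineSetStartCouple  # import path assumed


def test_lineset_start_couple_eq_other_type() -> None:
    couple = LineSetStartCouple(1, 2)  # constructor shape assumed
    # __eq__ returns NotImplemented for foreign types; Python then resolves
    # the == comparison to False instead of raising.
    assert (couple == "not a couple") is False
    assert couple != "not a couple"
```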
Force-pushed from 1793928 to 1e003ed
Hey @hippo91, as the expert in optimizing duplicate-code, would you mind reviewing this? (It works better commit by commit.)
🤖 According to the primer, this change has no effect on the checked open source code. 🤖
🎉 This comment was generated for commit 1e003ed
Type of Changes
Description
Four optimizations to symilar:
Rolling hash: compute the window hash incrementally (subtract the leaving line hash, add the entering one) instead of re-summing all k line hashes for every position.
Cache hash_lineset results per lineset in _iter_sims: each file was being hashed once per pair (N-1 times) instead of once total.
Remove the LinesChunk wrapper class and use plain int dict keys, so frozenset intersection and dict lookups use C-level hash/eq (see the sketch after this list).
Above a certain threshold of similar lines, the pairing algorithm switches to a different strategy (this one is due to the psf/black profiling/ directory and its pathological content; good profiling data, black contributors, congrats).

One optimization to the duplicate-code checker:

Reuse the module AST that pylint already parsed in process_module instead of re-parsing every file from source (see the process_module commit message above).
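As mentioned in the list above, a toy illustration of why dropping the wrapper class helps (class and variable names are made up): intersecting frozensets of plain ints stays entirely in C, while a wrapper object forces a Python-level __hash__/__eq__ call per element.

```python
class LinesChunkLike:
    """Stand-in for the removed wrapper: hashing/equality run as Python bytecode."""

    def __init__(self, value: int) -> None:
        self._value = value

    def __hash__(self) -> int:
        return self._value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LinesChunkLike) and self._value == other._value


# Before (conceptually): every window hash wrapped in an object.
wrapped_1 = frozenset(LinesChunkLike(h) for h in range(50_000))
wrapped_2 = frozenset(LinesChunkLike(h) for h in range(25_000, 75_000))
slow_common = wrapped_1 & wrapped_2  # Python-level __hash__/__eq__ per element

# After: plain int keys, same result, C-level hash/eq throughout.
plain_1 = frozenset(range(50_000))
plain_2 = frozenset(range(25_000, 75_000))
fast_common = plain_1 & plain_2
```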
Benchmark: duplicate-code checker performance
Measured with:
Small projects are dominated by startup/parsing overhead, but the performance increase shows for bigger ones. Maybe we could even add duplicate-code back to the primer now.