Fix quadratic performance and memory issues in duplicate-code checker#10881

Open
Pierre-Sassoulas wants to merge 7 commits into pylint-dev:main from Pierre-Sassoulas:symilar-performance

Conversation

Member

@Pierre-Sassoulas Pierre-Sassoulas commented Mar 2, 2026

Type of Changes

Type
🐛 Bug fix
🔨 Refactoring

Description

Four optimizations to symilar:

  • Rolling hash: compute the window hash incrementally (subtract the leaving line hash, add the entering one) instead of re-summing all k line hashes for every position.

  • Cache hash_lineset results per lineset in _iter_sims: each file was being hashed once per pair (N-1 times) instead of once total.

  • Remove the LinesChunk wrapper class and use plain int dict keys, so frozenset intersection and dict lookups use C-level hash/eq.

  • Above a certain threshold of similar lines, switch to a different pairing algorithm (this one is due to psf/black's profiling/ directory and its pathological content; good profiling tool, black contributors, congrats).
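
The rolling-hash idea can be sketched as follows (a minimal illustration with a hypothetical helper name, not pylint's actual code): instead of re-summing all k line hashes at every window position (O(n·k) total), update the previous window's sum by subtracting the leaving line and adding the entering one (O(n) total).

```python
def window_hashes(line_hashes: list[int], k: int) -> list[int]:
    """Hash every k-line window by rolling the sum forward."""
    if len(line_hashes) < k:
        return []
    current = sum(line_hashes[:k])  # only the first window is summed fully
    result = [current]
    for i in range(k, len(line_hashes)):
        # subtract the line leaving the window, add the line entering it
        current += line_hashes[i] - line_hashes[i - k]
        result.append(current)
    return result
```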

One optimization to the duplicate-code checker:

  • Reuse pylint's parsed AST in the check instead of re-parsing; the re-parse was only useful when symilar was launched standalone. (This one also reduces memory use: I went from being OOM-killed on cpython to getting a result in 90s.)

Benchmark: duplicate-code checker performance

Measured with:

```python
"""Benchmark the duplicate-code checker on a given path.

Usage: python bench_symilar.py <path>
"""

import subprocess
import sys


def main() -> None:
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <path>")
        sys.exit(1)

    path = sys.argv[1]
    nfiles = int(
        subprocess.check_output(
            f"find {path} -name '*.py' | wc -l", shell=True
        ).strip()
    )
    nlines = int(
        subprocess.check_output(
            f"find {path} -name '*.py' -exec cat {{}} + | wc -l", shell=True
        ).strip()
    )

    result = subprocess.run(
        [
            "/usr/bin/time", "-v",
            sys.executable, "-m", "pylint",
            "--disable=all", "--enable=duplicate-code", path,
        ],
        capture_output=True,
        text=True,
    )

    time_s = mem_mb = "?"
    for line in result.stderr.splitlines():
        if "wall clock" in line:
            time_s = line.split()[-1]
        if "Maximum resident" in line:
            mem_mb = int(line.split()[-1]) // 1024

    print(f"files={nfiles}  lines={nlines}  time={time_s}  mem={mem_mb}MB")


if __name__ == "__main__":
    main()
```
| Project | Files | Lines | Time before | Time after | Speedup | Mem before | Mem after | Mem reduction |
|---------|------:|------:|------------:|-----------:|--------:|-----------:|----------:|--------------:|
| flask   | 83    | 18,399 | 4.72s | 2.87s | 1.6x | 123 MB | 108 MB | -12% |
| astroid | 96    | 30,352 | 6.55s | 2.54s | 2.6x | 128 MB | 107 MB | -16% |
| psycopg | 248   | 67,478 | 26.50s | 8.12s | 3.3x | 270 MB | 208 MB | -23% |
| black   | 316   | 129,230 | stuck | 6.69s | inf | ? | 294 MB | ? |
| django  | 902   | 162,352 | 3m45s | 14.52s | 15.5x | 393 MB | 328 MB | -17% |
| pandas  | 1,414 | 635,395 | 19m58s | 54.52s | 22.0x | 1,205 MB | 877 MB | -27% |
| sentry  | 7,609 | 1,443,533 | stuck | 2m51s | inf | ? | 1,854 MB | ? |

Small projects are dominated by startup/parsing overhead, but the performance increase shows for bigger ones. Maybe we could even add duplicate-code back to the primer now.

@Pierre-Sassoulas Pierre-Sassoulas added this to the 4.1.0 milestone Mar 2, 2026
@Pierre-Sassoulas Pierre-Sassoulas added performance duplicate-code Related to code duplication checker labels Mar 2, 2026
@codecov

codecov Bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.24%. Comparing base (b080a21) to head (1e003ed).

@@            Coverage Diff             @@
##             main   #10881      +/-   ##
==========================================
+ Coverage   96.19%   96.24%   +0.05%     
==========================================
  Files         178      178              
  Lines       19683    19679       -4     
==========================================
+ Hits        18934    18940       +6     
+ Misses        749      739      -10     
| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| pylint/checkers/symilar.py | 99.13% <100.00%> (+2.83%) ⬆️ |

@Pierre-Sassoulas Pierre-Sassoulas changed the title Speed up duplicate-code checker with rolling hash and caching Fix quadratic performance and memory issues in duplicate-code checker Mar 3, 2026
Collaborator

@DanielNoord DanielNoord left a comment

Do we have an expert on this in our pool of maintainers? If the tests pass this LGTM, as I really have no experience with this part of the codebase or these kinds of computations... 😅

But I'm not sure we have anybody who does have that experience.

@Pierre-Sassoulas
Member Author

The expert is hippo91; I'm going to ping him when it's ready (there's still a coverage issue, and I haven't run the primer locally yet because it takes a ton of time). Although I'm not sure there's going to be a lot of duplication in the primer, since maintainers tend to remove it. Also, we're going to have primers for all the duplicate-code related fixes next (in particular, the search for duplicate code in the same file is pretty bad in pylint itself).

Member

@jacobtylerwalls jacobtylerwalls left a comment


I looked, and the changes smell right and are clearly described.

I don't know anything about "still lets remove_successive coalesce consecutive matches.", so I would like just a little test coverage in this area. Can we have a test that mocks _HASH_BUCKET_PRODUCT_LIMIT to something very low, and then runs a functional test over some code that triggers the other form of the algorithm?

Comment on lines -62 to +63
```diff
-# Duplicate code takes too long and is relatively safe
 # We don't want to lint the test directory which are repetitive
-disables = ["--disable=duplicate-code", "--ignore=test"]
+disables = ["--ignore=test"]
```
Member
I guess we can trial this for now and revert it if it's too slow.

Three optimizations to the symilar checker:

- Rolling hash: compute the window hash incrementally (subtract the
  leaving line hash, add the entering one) instead of re-summing all
  k line hashes for every position.

- Cache hash_lineset results per lineset in _iter_sims: each file was
  being hashed once per pair (N-1 times) instead of once total.

- Remove the LinesChunk wrapper class and use plain int dict keys,
  so frozenset intersection and dict lookups use C-level hash/eq.

~=25% faster on astroid (17.5s => 12.5s, 25k SLOC)
~=70% faster on django (273s => 77s, 130k SLOC)
…mmon

Replace the raw tuple[HashToIndex_T, IndexToLines_T] with a
LineSetHashResult NamedTuple for clarity. Since _iter_sims always
passes pre-computed hashes from its cache, make the parameters
required and remove the unused fallback branches.
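
The NamedTuple replacement can be sketched like this (field names are guessed from the HashToIndex_T / IndexToLines_T type aliases, not necessarily the exact ones in the PR):

```python
from typing import NamedTuple


class LineSetHashResult(NamedTuple):
    """Result of hashing one lineset (replaces a bare 2-tuple)."""

    hash_to_index: dict  # window hash -> positions where it occurs
    index_to_lines: dict  # position -> the underlying lines
```

Because a NamedTuple is still a tuple, existing call sites that unpack the old raw pair keep working unchanged while new code can use the named fields.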
process_module already receives the parsed nodes.Module from pylint's
main pass, but stripped_lines was re-parsing every file from source text.
Thread the existing AST through process_module → append_stream → LineSet
→ stripped_lines, falling back to astroid.parse() only when no tree is
provided (standalone symilar CLI).

The redundant parse dominated stripped_lines cost.  Per-file savings:

  file size     time saved    memory saved
  0 lines          0.14 ms       0.03 MB
  924 lines       65    ms       2.1  MB
  31k lines     2764    ms     101.6  MB

End-to-end on pylint's own codebase (179 files, ~49k SLOC):

  before  median=6.6s  peak RSS=170 MB
  after   median=5.1s  peak RSS=149 MB  (1.5x faster, -12% memory)
When two files share a hash bucket with many indices (e.g. thousands of
identical data lines), itertools.product explodes quadratically (4k x 22k
= 88M pairs).  Fall back to aligned zip pairing when the product exceeds
500 entries.  The diagonal pairs are consecutive so remove_successive
still coalesces them into one correct block.

Benchmark on psf/black (316 files, 129k lines including pathological
profiling/ data files with 22k+ identical lines): hung forever -> 7.2s.
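
A sketch of that guard (the limit name and value come from the PR; the pairing helper itself is hypothetical):

```python
from itertools import product

_HASH_BUCKET_PRODUCT_LIMIT = 500  # name/value taken from this PR


def couple_lines(indices_a: list[int], indices_b: list[int]) -> list[tuple[int, int]]:
    """Pair up match positions from two hash buckets."""
    if len(indices_a) * len(indices_b) <= _HASH_BUCKET_PRODUCT_LIMIT:
        # Small bucket: enumerate every cross pair exactly.
        return list(product(indices_a, indices_b))
    # Pathological bucket (e.g. thousands of identical lines): pair the
    # positions diagonally instead. The diagonal pairs are consecutive,
    # so a later coalescing pass (remove_successive in the PR) can still
    # merge them into one block.
    return list(zip(indices_a, indices_b))
```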
The duplicate-code checker was previously disabled in the stdlib primer because it was too slow. With the recent performance optimizations it should complete in reasonable time, so re-enable it.
Adds a functional test that mocks `_HASH_BUCKET_PRODUCT_LIMIT` to zero and
runs symilar over repeated-block content so the aligned-zip fallback path
is always exercised. Covers the previously-untested branch flagged in the
codecov report on pylint-dev#10881.

Addresses Jacob's review request for test coverage on the "other form of
the algorithm" introduced to cap quadratic behavior.

Adds focused tests for `LineSetStartCouple.__eq__` NotImplemented branch,
`LineSet.__str__` / `__getitem__` / non-LineSet `__eq__`, `append_stream`
binary-without-encoding and UnicodeDecodeError paths, `report_similarities`
table building, and the `process_module` deprecation warning when
`linter.current_name` is None.

Takes symilar.py from 96% to 99% coverage. The remaining gaps are an
unreachable defensive `except KeyError` in `remove_successive` and the
`if __name__ == "__main__"` guard.
@Pierre-Sassoulas
Member Author

Hey @hippo91, as the expert in optimizing duplicate-code, would you mind reviewing this? (It works better commit by commit.)

@github-actions
Contributor

🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉

This comment was generated for commit 1e003ed
