
Commit dae5b83

Cap hash bucket Cartesian product to prevent quadratic blow-up
When two files share a hash bucket with many indices (e.g. thousands of identical data lines), itertools.product explodes quadratically (4k x 22k = 88M pairs). Fall back to aligned zip pairing when the product exceeds 500 entries. The diagonal pairs are consecutive, so remove_successive still coalesces them into one correct block.

Benchmark on psf/black (316 files, 129k lines, including pathological profiling/ data files with 22k+ identical lines): hung forever -> 7.2s.
1 parent 8027e6c commit dae5b83
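
As a rough illustration of the strategy described in the commit message, the sketch below shows the pairing cap in isolation. It is not the code from symilar.py; the names bucket_pairs and PRODUCT_LIMIT are made up for this example, with the limit mirroring the 500 chosen in this commit.

import itertools
from collections.abc import Iterable

# Cap mirroring _HASH_BUCKET_PRODUCT_LIMIT from this commit (illustrative).
PRODUCT_LIMIT = 500

def bucket_pairs(indices_1: list[int], indices_2: list[int]) -> Iterable[tuple[int, int]]:
    """Pair line indices from two files that landed in the same hash bucket."""
    if len(indices_1) * len(indices_2) > PRODUCT_LIMIT:
        # Aligned pairing: O(min(N, M)) pairs instead of N * M. The pairs
        # advance in lockstep (a "diagonal"), so consecutive matches can
        # still be coalesced into a single duplicate block downstream.
        return zip(indices_1, indices_2)
    # Small buckets: the exhaustive Cartesian product stays cheap.
    return itertools.product(indices_1, indices_2)

# 4,000 x 22,000 identical lines would otherwise produce 88,000,000 pairs;
# the aligned fallback produces only 4,000.
print(len(list(bucket_pairs(list(range(4_000)), list(range(22_000))))))  # 4000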

3 files changed

Lines changed: 50 additions & 21 deletions


doc/whatsnew/4/4.1/index.rst

Lines changed: 5 additions & 0 deletions
@@ -12,6 +12,11 @@
 Summary -- Release highlights
 =============================
 
+The duplicate-code checker and ``symilar`` received optimizations that
+result in considerable performance improvements and memory use reduction
+on larger codebases. For example, pandas analysis went from 20 min to
+55 s and pylint does not get OOM-killed when analyzing cpython anymore.
+
 The required ``astroid`` version is now 4.1.1. See the
 `astroid changelog <https://pylint.readthedocs.io/projects/astroid/en/latest/changelog.html#what-s-new-in-astroid-4-1-0>`_
 for additional fixes, features, and performance improvements applicable to pylint.
Lines changed: 12 additions & 3 deletions
@@ -1,5 +1,14 @@
-Speed up the ``duplicate-code`` checker by using C-based hash, a rolling hash window,
-and caching results across file pairs. Expect pylint to be ~25% faster on ~25k SLOC
-(astroid) and ~70% faster on ~130k SLOC (django) overall when duplicate-code is activated.
+Sped up the ``duplicate-code`` checker. When run inside pylint the
+checker now reuses the already-parsed AST instead of re-parsing every
+file like it has to do when launched via ``symilar``, and it uses a
+rolling hash window with caching across file pairs. Additionally, a
+quadratic blow-up in the hash-matching phase is avoided by switching
+algorithm at a threshold, which previously caused the checker to hang
+on files with many repeated lines.
+
+Speedup scales with codebase size from 1.5x on small projects
+(~10k lines), to 20x on large ones (500k+ lines). Memory usage also
+drops 12-27%. Codebases that previously hung or were OOM-killed could
+now complete.
 
 Refs #10881
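
The "rolling hash window" mentioned in this fragment can be illustrated with a generic polynomial rolling hash over per-line hashes: each window of consecutive lines gets a hash computed in O(1) from the previous window instead of rehashing the whole window. This is a minimal standalone sketch, not pylint's actual implementation; window_hashes and its constants are hypothetical.

def window_hashes(lines: list[str], window: int) -> list[int]:
    """Return one hash per window of `window` consecutive lines."""
    base = 1_000_003
    mod = (1 << 61) - 1  # large Mersenne prime modulus
    line_hashes = [hash(line) & 0xFFFFFFFF for line in lines]
    if len(line_hashes) < window:
        return []
    top = pow(base, window - 1, mod)  # weight of the oldest line in a window
    h = 0
    for value in line_hashes[:window]:
        h = (h * base + value) % mod
    hashes = [h]
    for i in range(window, len(line_hashes)):
        # Drop the oldest line, shift, and append the new line: O(1) per step.
        h = ((h - line_hashes[i - window] * top) * base + line_hashes[i]) % mod
        hashes.append(h)
    return hashes

Two files whose window hashes collide in the same bucket then become candidate duplicate regions, which is where the bucket pairing changed by this commit takes over.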

pylint/checkers/symilar.py

Lines changed: 33 additions & 18 deletions
@@ -57,6 +57,12 @@
 
 REGEX_FOR_LINES_WITH_CONTENT = re.compile(r".*\w+")
 
+# When two files share a hash bucket whose Cartesian product exceeds this
+# limit, fall back to aligned (zip) pairing instead of the full product.
+# This prevents quadratic blow-up on files with many identical lines (e.g.
+# auto-generated data). The result is a correct lower bound on duplicates.
+_HASH_BUCKET_PRODUCT_LIMIT: int = 500
+
 # Index defines a location in a LineSet stripped lines collection
 Index = NewType("Index", int)
 
@@ -475,11 +481,19 @@ def _find_common(
         for chunk_hash in sorted(
             common_hashes, key=lambda h: hashes1.hash_to_index[h][0]
         ):
-            for indices_in_linesets in itertools.product(
-                hashes1.hash_to_index[chunk_hash], hashes2.hash_to_index[chunk_hash]
-            ):
-                index_1 = indices_in_linesets[0]
-                index_2 = indices_in_linesets[1]
+            indices_1 = hashes1.hash_to_index[chunk_hash]
+            indices_2 = hashes2.hash_to_index[chunk_hash]
+
+            # When both buckets are large the Cartesian product becomes
+            # quadratic (e.g. 4000 x 22000 = 88M pairs for repeated data
+            # lines). Fall back to aligned pairing which is O(min(N, M))
+            # and still lets remove_successive coalesce consecutive matches.
+            if len(indices_1) * len(indices_2) > _HASH_BUCKET_PRODUCT_LIMIT:
+                pairs: Iterable[tuple[Index, Index]] = zip(indices_1, indices_2)
+            else:
+                pairs = itertools.product(indices_1, indices_2)
+
+            for index_1, index_2 in pairs:
                 all_couples[LineSetStartCouple(index_1, index_2)] = (
                     CplSuccessiveLinesLimits(
                         copy.copy(hashes1.index_to_lines[index_1]),
@@ -493,24 +507,25 @@ def _find_common(
         for cml_stripped_l, cmn_l in all_couples.items():
             start_index_1 = cml_stripped_l.fst_lineset_index
             start_index_2 = cml_stripped_l.snd_lineset_index
-            nb_common_lines = cmn_l.effective_cmn_lines_nb
-
-            com = Commonality(
-                cmn_lines_nb=nb_common_lines,
-                fst_lset=lineset1,
-                fst_file_start=cmn_l.first_file.start,
-                fst_file_end=cmn_l.first_file.end,
-                snd_lset=lineset2,
-                snd_file_start=cmn_l.second_file.start,
-                snd_file_end=cmn_l.second_file.end,
-            )
 
             eff_cmn_nb = filter_noncode_lines(
-                lineset1, start_index_1, lineset2, start_index_2, nb_common_lines
+                lineset1,
+                start_index_1,
+                lineset2,
+                start_index_2,
+                cmn_l.effective_cmn_lines_nb,
             )
 
             if eff_cmn_nb > self.namespace.min_similarity_lines:
-                yield com
+                yield Commonality(
+                    cmn_lines_nb=cmn_l.effective_cmn_lines_nb,
+                    fst_lset=lineset1,
+                    fst_file_start=cmn_l.first_file.start,
+                    fst_file_end=cmn_l.first_file.end,
+                    snd_lset=lineset2,
+                    snd_file_start=cmn_l.second_file.start,
+                    snd_file_end=cmn_l.second_file.end,
+                )
 
     def _iter_sims(self) -> Generator[Commonality]:
         """Iterate on similarities among all files, by making a Cartesian

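As an aside on why the zip fallback remains correct: the aligned couples lie on one diagonal, e.g. (10, 50), (11, 51), (12, 52), and successive couples describe overlapping windows of the same duplicate region. The sketch below (not pylint's remove_successive, just the same idea with hypothetical names) shows such a run collapsing into a single block.

def merge_successive(couples: list[tuple[int, int]], window: int) -> list[tuple[int, int, int]]:
    """Collapse diagonal runs of (index_1, index_2) couples into
    (index_1, index_2, common_lines) blocks."""
    blocks: list[tuple[int, int, int]] = []
    for i1, i2 in sorted(couples):
        if blocks:
            start_1, start_2, length = blocks[-1]
            # A successive couple sits exactly one line further down in both files.
            if (i1, i2) == (start_1 + length - window + 1, start_2 + length - window + 1):
                blocks[-1] = (start_1, start_2, length + 1)
                continue
        blocks.append((i1, i2, window))
    return blocks

print(merge_successive([(10, 50), (11, 51), (12, 52)], window=4))
# [(10, 50, 6)] -- one block of 6 common lines, not three separate reports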