
Commit dae5b83

Cap hash bucket Cartesian product to prevent quadratic blow-up
When two files share a hash bucket with many indices (e.g. thousands of identical data lines), itertools.product explodes quadratically (4k x 22k = 88M pairs). Fall back to aligned zip pairing when the product exceeds 500 entries. The diagonal pairs are consecutive, so remove_successive still coalesces them into one correct block.

Benchmark on psf/black (316 files, 129k lines, including pathological profiling/ data files with 22k+ identical lines): hung forever -> 7.2s.
1 parent 8027e6c commit dae5b83
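
As a rough illustration of the strategy described in the commit message, the sketch below shows the pairing cap in isolation. It is not the code from symilar.py; the names bucket_pairs and PRODUCT_LIMIT are made up for this example, with the limit mirroring the 500 chosen in this commit.

import itertools
from collections.abc import Iterable

# Cap mirroring _HASH_BUCKET_PRODUCT_LIMIT from this commit (illustrative).
PRODUCT_LIMIT = 500

def bucket_pairs(indices_1: list[int], indices_2: list[int]) -> Iterable[tuple[int, int]]:
    """Pair line indices from two files that landed in the same hash bucket."""
    if len(indices_1) * len(indices_2) > PRODUCT_LIMIT:
        # Aligned pairing: O(min(N, M)) pairs instead of N * M. The pairs
        # advance in lockstep (a "diagonal"), so consecutive matches can
        # still be coalesced into a single duplicate block downstream.
        return zip(indices_1, indices_2)
    # Small buckets: the exhaustive Cartesian product stays cheap.
    return itertools.product(indices_1, indices_2)

# 4,000 x 22,000 identical lines would otherwise produce 88,000,000 pairs;
# the aligned fallback produces only 4,000.
print(len(list(bucket_pairs(list(range(4_000)), list(range(22_000))))))  # 4000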

3 files changed

Lines changed: 50 additions & 21 deletions


doc/whatsnew/4/4.1/index.rst

Lines changed: 5 additions & 0 deletions
@@ -12,6 +12,11 @@
 Summary -- Release highlights
 =============================
 
+The duplicate-code checker and ``symilar`` received optimizations that
+result in considerable performance improvements and memory use reduction
+on larger codebases. For example, pandas analysis went from 20 min to
+55 s and pylint does not get OOM-killed when analyzing cpython anymore.
+
 The required ``astroid`` version is now 4.1.1. See the
 `astroid changelog <https://pylint.readthedocs.io/projects/astroid/en/latest/changelog.html#what-s-new-in-astroid-4-1-0>`_
 for additional fixes, features, and performance improvements applicable to pylint.
Lines changed: 12 additions & 3 deletions
@@ -1,5 +1,14 @@
-Speed up the ``duplicate-code`` checker by using C-based hash, a rolling hash window,
-and caching results across file pairs. Expect pylint to be ~25% faster on ~25k SLOC
-(astroid) and ~70% faster on ~130k SLOC (django) overall when duplicate-code is activated.
+Sped up the ``duplicate-code`` checker. When run inside pylint the
+checker now reuses the already-parsed AST instead of re-parsing every
+file like it has to do when launched via ``symilar``, and it uses a
+rolling hash window with caching across file pairs. Additionally, a
+quadratic blow-up in the hash-matching phase is avoided by switching
+algorithm at a threshold, which previously caused the checker to hang
+on files with many repeated lines.
+
+Speedup scales with codebase size from 1.5x on small projects
+(~10k lines), to 20x on large ones (500k+ lines). Memory usage also
+drops 12-27%. Codebases that previously hung or were OOM-killed could
+now complete.
 
 Refs #10881
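
The "rolling hash window" mentioned in this fragment can be illustrated with a generic polynomial rolling hash over per-line hashes: each window of consecutive lines gets a hash computed in O(1) from the previous window instead of rehashing the whole window. This is a minimal standalone sketch, not pylint's actual implementation; window_hashes and its constants are hypothetical.

def window_hashes(lines: list[str], window: int) -> list[int]:
    """Return one hash per window of `window` consecutive lines."""
    base = 1_000_003
    mod = (1 << 61) - 1  # large Mersenne prime modulus
    line_hashes = [hash(line) & 0xFFFFFFFF for line in lines]
    if len(line_hashes) < window:
        return []
    top = pow(base, window - 1, mod)  # weight of the oldest line in a window
    h = 0
    for value in line_hashes[:window]:
        h = (h * base + value) % mod
    hashes = [h]
    for i in range(window, len(line_hashes)):
        # Drop the oldest line, shift, and append the new line: O(1) per step.
        h = ((h - line_hashes[i - window] * top) * base + line_hashes[i]) % mod
        hashes.append(h)
    return hashes

Two files whose window hashes collide in the same bucket then become candidate duplicate regions, which is where the bucket pairing changed by this commit takes over.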

pylint/checkers/symilar.py

Lines changed: 33 additions & 18 deletions
@@ -57,6 +57,12 @@
 
 REGEX_FOR_LINES_WITH_CONTENT = re.compile(r".*\w+")
 
+# When two files share a hash bucket whose Cartesian product exceeds this
+# limit, fall back to aligned (zip) pairing instead of the full product.
+# This prevents quadratic blow-up on files with many identical lines (e.g.
+# auto-generated data). The result is a correct lower bound on duplicates.
+_HASH_BUCKET_PRODUCT_LIMIT: int = 500
+
 # Index defines a location in a LineSet stripped lines collection
 Index = NewType("Index", int)
 
@@ -475,11 +481,19 @@ def _find_common(
         for chunk_hash in sorted(
             common_hashes, key=lambda h: hashes1.hash_to_index[h][0]
         ):
-            for indices_in_linesets in itertools.product(
-                hashes1.hash_to_index[chunk_hash], hashes2.hash_to_index[chunk_hash]
-            ):
-                index_1 = indices_in_linesets[0]
-                index_2 = indices_in_linesets[1]
+            indices_1 = hashes1.hash_to_index[chunk_hash]
+            indices_2 = hashes2.hash_to_index[chunk_hash]
+
+            # When both buckets are large the Cartesian product becomes
+            # quadratic (e.g. 4000 x 22000 = 88M pairs for repeated data
+            # lines). Fall back to aligned pairing which is O(min(N, M))
+            # and still lets remove_successive coalesce consecutive matches.
+            if len(indices_1) * len(indices_2) > _HASH_BUCKET_PRODUCT_LIMIT:
+                pairs: Iterable[tuple[Index, Index]] = zip(indices_1, indices_2)
+            else:
+                pairs = itertools.product(indices_1, indices_2)
+
+            for index_1, index_2 in pairs:
                 all_couples[LineSetStartCouple(index_1, index_2)] = (
                     CplSuccessiveLinesLimits(
                         copy.copy(hashes1.index_to_lines[index_1]),
@@ -493,24 +507,25 @@ def _find_common(
         for cml_stripped_l, cmn_l in all_couples.items():
             start_index_1 = cml_stripped_l.fst_lineset_index
             start_index_2 = cml_stripped_l.snd_lineset_index
-            nb_common_lines = cmn_l.effective_cmn_lines_nb
-
-            com = Commonality(
-                cmn_lines_nb=nb_common_lines,
-                fst_lset=lineset1,
-                fst_file_start=cmn_l.first_file.start,
-                fst_file_end=cmn_l.first_file.end,
-                snd_lset=lineset2,
-                snd_file_start=cmn_l.second_file.start,
-                snd_file_end=cmn_l.second_file.end,
-            )
 
             eff_cmn_nb = filter_noncode_lines(
-                lineset1, start_index_1, lineset2, start_index_2, nb_common_lines
+                lineset1,
+                start_index_1,
+                lineset2,
+                start_index_2,
+                cmn_l.effective_cmn_lines_nb,
             )
 
             if eff_cmn_nb > self.namespace.min_similarity_lines:
-                yield com
+                yield Commonality(
+                    cmn_lines_nb=cmn_l.effective_cmn_lines_nb,
+                    fst_lset=lineset1,
+                    fst_file_start=cmn_l.first_file.start,
+                    fst_file_end=cmn_l.first_file.end,
+                    snd_lset=lineset2,
+                    snd_file_start=cmn_l.second_file.start,
+                    snd_file_end=cmn_l.second_file.end,
+                )
 
     def _iter_sims(self) -> Generator[Commonality]:
         """Iterate on similarities among all files, by making a Cartesian

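As an aside on why the zip fallback remains correct: the aligned couples lie on one diagonal, e.g. (10, 50), (11, 51), (12, 52), and successive couples describe overlapping windows of the same duplicate region. The sketch below (not pylint's remove_successive, just the same idea with hypothetical names) shows such a run collapsing into a single block.

def merge_successive(couples: list[tuple[int, int]], window: int) -> list[tuple[int, int, int]]:
    """Collapse diagonal runs of (index_1, index_2) couples into
    (index_1, index_2, common_lines) blocks."""
    blocks: list[tuple[int, int, int]] = []
    for i1, i2 in sorted(couples):
        if blocks:
            start_1, start_2, length = blocks[-1]
            # A successive couple sits exactly one line further down in both files.
            if (i1, i2) == (start_1 + length - window + 1, start_2 + length - window + 1):
                blocks[-1] = (start_1, start_2, length + 1)
                continue
        blocks.append((i1, i2, window))
    return blocks

print(merge_successive([(10, 50), (11, 51), (12, 52)], window=4))
# [(10, 50, 6)] -- one block of 6 common lines, not three separate reports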