Fix quadratic performance and memory issues in duplicate-code checker #10881
Pierre-Sassoulas wants to merge 7 commits into pylint-dev:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #10881 +/- ##
==========================================
+ Coverage 96.19% 96.24% +0.05%
==========================================
Files 178 178
Lines 19683 19679 -4
==========================================
+ Hits 18934 18940 +6
+ Misses 749 739 -10
Force-pushed from 092ebd4 to 5ee074a
DanielNoord left a comment
Do we have an expert on this in our pool of maintainers? If the tests pass, this LGTM, as I really have no experience with this part of the codebase or these kinds of computations... 😅
But I'm not sure we have anybody who does have that experience.
The expert is hippo91; I'm going to ping him when it's ready (there's still a coverage issue, and I didn't run the primer locally yet because it takes a ton of time). Although I'm not sure there's going to be a lot of duplication in the primer, since maintainers tend to remove it. Also, we're going to have primer runs for all the duplicate-code related fixes next (in particular, the search for duplicate code in the same file is pretty bad in pylint itself).
jacobtylerwalls left a comment
I looked, and the changes smell right and are clearly described.
I don't know anything about "still lets remove_successive coalesce consecutive matches", so I would like just a little test coverage in this area. Can we have a test that mocks _HASH_BUCKET_PRODUCT_LIMIT to something very low and then runs a functional test over some code that triggers the other form of the algorithm?
```diff
  # Duplicate code takes too long and is relatively safe
  # We don't want to lint the test directory which are repetitive
- disables = ["--disable=duplicate-code", "--ignore=test"]
+ disables = ["--ignore=test"]
```
I guess we can trial this for now and revert it if it's too slow.
Three optimizations to the symilar checker:
- Rolling hash: compute the window hash incrementally (subtract the leaving line hash, add the entering one) instead of re-summing all k line hashes for every position.
- Cache hash_lineset results per lineset in _iter_sims: each file was being hashed once per pair (N-1 times) instead of once total.
- Remove the LinesChunk wrapper class and use plain int dict keys, so frozenset intersection and dict lookups use C-level hash/eq.

~25% faster on astroid (17.5s => 12.5s, 25k SLOC)
~70% faster on django (273s => 77s, 130k SLOC)
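For reference, a minimal sketch of the rolling-hash idea (function and variable names are illustrative, not pylint's actual hash_lineset code): keep a running sum of the k per-line hashes and update it as the window slides, instead of re-summing every window.

```python
from collections.abc import Sequence


def rolling_window_hashes(line_hashes: Sequence[int], k: int) -> list[int]:
    """Hash every k-line window in O(n) total instead of O(n*k).

    Each window hash is the sum of the per-line hashes it covers; sliding
    by one line subtracts the hash that leaves and adds the one that enters.
    """
    if len(line_hashes) < k:
        return []
    current = sum(line_hashes[:k])
    window_hashes = [current]
    for entering in range(k, len(line_hashes)):
        current += line_hashes[entering] - line_hashes[entering - k]
        window_hashes.append(current)
    return window_hashes
```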
…mmon
Replace the raw tuple[HashToIndex_T, IndexToLines_T] with a LineSetHashResult NamedTuple for clarity. Since _iter_sims always passes pre-computed hashes from its cache, make the parameters required and remove the unused fallback branches.
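A rough sketch of the shape described here; the alias definitions are simplified stand-ins and the field names are guesses, not the PR's actual code.

```python
from typing import NamedTuple

# Simplified stand-ins for the aliases referenced in the commit message.
HashToIndex_T = dict[int, list[int]]
IndexToLines_T = dict[int, tuple[int, int]]


class LineSetHashResult(NamedTuple):
    """Named result of hashing one lineset, replacing an anonymous 2-tuple."""

    hash_to_index: HashToIndex_T
    index_to_lines: IndexToLines_T
```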
process_module already receives the parsed nodes.Module from pylint's main pass, but stripped_lines was re-parsing every file from source text. Thread the existing AST through process_module → append_stream → LineSet → stripped_lines, falling back to astroid.parse() only when no tree is provided (standalone symilar CLI). The redundant parse dominated stripped_lines cost.

Per-file savings:
file size   time saved   memory saved
0 lines     0.14 ms      0.03 MB
924 lines   65 ms        2.1 MB
31k lines   2764 ms      101.6 MB

End-to-end on pylint's own codebase (179 files, ~49k SLOC):
before: median=6.6s, peak RSS=170 MB
after:  median=5.1s, peak RSS=149 MB
(1.5x faster, -12% memory)
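A hedged sketch of the fallback behaviour described above (simplified signature; the real logic lives in stripped_lines/LineSet inside symilar.py): reuse the module AST when the pylint run already produced one, and only parse the source in the standalone CLI case.

```python
import astroid
from astroid import nodes


def lines_to_strip(source_lines: list[str], tree: nodes.Module | None = None) -> set[int]:
    """Illustrative only: collect line numbers to ignore (imports, etc.)."""
    if tree is None:
        # Standalone symilar CLI: no pre-parsed AST was handed to us, so parse
        # the text ourselves. Inside a pylint run this parse is the redundant
        # work the commit removes, since process_module already has the Module.
        tree = astroid.parse("".join(source_lines))
    ignored: set[int] = set()
    for node in tree.body:
        if isinstance(node, (nodes.Import, nodes.ImportFrom)):
            ignored.add(node.lineno)
    return ignored
```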
When two files share a hash bucket with many indices (e.g. thousands of identical data lines), itertools.product explodes quadratically (4k x 22k = 88M pairs). Fall back to aligned zip pairing when the product exceeds 500 entries. The diagonal pairs are consecutive, so remove_successive still coalesces them into one correct block.

Benchmark on psf/black (316 files, 129k lines, including pathological profiling/ data files with 22k+ identical lines): hung forever -> 7.2s.
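A minimal sketch of the capped pairing strategy (the constant name and the 500 limit come from this PR; the function and its parameters are illustrative):

```python
import itertools
from collections.abc import Iterator

_HASH_BUCKET_PRODUCT_LIMIT = 500


def bucket_pairs(indices_1: list[int], indices_2: list[int]) -> Iterator[tuple[int, int]]:
    """Yield index pairs for one shared hash bucket without quadratic blow-up.

    Normal buckets get the exact cross product.  Pathological buckets
    (thousands of identical lines) fall back to aligned diagonal pairing:
    those pairs are consecutive, so a later coalescing pass such as
    remove_successive can still merge them into a single reported block.
    """
    if len(indices_1) * len(indices_2) <= _HASH_BUCKET_PRODUCT_LIMIT:
        yield from itertools.product(indices_1, indices_2)
    else:
        yield from zip(indices_1, indices_2)
```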
The duplicate-code checker was previously disabled in the stdlib primer because it was too slow. With the recent performance optimizations it might now complete in a reasonable time, so re-enable it.
Adds a functional test that mocks `_HASH_BUCKET_PRODUCT_LIMIT` to zero and runs symilar over repeated-block content so the aligned-zip fallback path is always exercised. Covers the previously untested branch flagged in the codecov report on pylint-dev#10881. Addresses Jacob's review request for test coverage on the "other form of the algorithm" introduced to cap quadratic behavior.
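Roughly what such a test could look like (the import path, the stream handling, and the _compute_sims call are assumptions about the current symilar API, not the PR's actual test):

```python
from io import StringIO
from unittest import mock

from pylint.checkers import symilar  # module location assumed


def test_zip_fallback_still_reports_duplicates() -> None:
    """Shrink the bucket-product limit to zero so the aligned-zip path runs."""
    repeated = "x = 1\ny = 2\nz = 3\nw = 4\n" * 10  # many identical windows
    with mock.patch.object(symilar, "_HASH_BUCKET_PRODUCT_LIMIT", 0):
        sim = symilar.Similar(min_lines=4)
        sim.append_stream("a.py", StringIO(repeated))
        sim.append_stream("b.py", StringIO(repeated))
        similarities = sim._compute_sims()  # private API; verify before relying on it
    assert similarities, "the duplicate block should survive the fallback path"
```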
Force-pushed from 3640ead to 1793928
Adds focused tests for `LineSetStartCouple.__eq__` NotImplemented branch, `LineSet.__str__` / `__getitem__` / non-LineSet `__eq__`, `append_stream` binary-without-encoding and UnicodeDecodeError paths, `report_similarities` table building, and the `process_module` deprecation warning when `linter.current_name` is None. Takes symilar.py from 96% to 99% coverage. The remaining gaps are an unreachable defensive `except KeyError` in `remove_successive` and the `if __name__ == "__main__"` guard.
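For instance, the NotImplemented branch could be hit with something like the following (constructor arguments and import path are assumptions, not the PR's actual test):

```python
from pylint.checkers.symilar import LineSetStartCouple  # import path assumed


def test_lineset_start_couple_eq_other_type() -> None:
    couple = LineSetStartCouple(1, 2)  # constructor shape assumed
    # __eq__ returns NotImplemented for foreign types; Python then resolves
    # the == comparison to False instead of raising.
    assert (couple == "not a couple") is False
    assert couple != "not a couple"
```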
Force-pushed from 1793928 to 1e003ed
Hey @hippo91, as the expert in optimizing duplicate-code, would you mind reviewing this? (It works better commit by commit.)
🤖 According to the primer, this change has no effect on the checked open source code. 🤖
🎉 This comment was generated for commit 1e003ed
Type of Changes
Description
Four optimizations to symilar:
Rolling hash: compute the window hash incrementally (subtract the leaving line hash, add the entering one) instead of re-summing all k line hashes for every position.
Cache hash_lineset results per lineset in _iter_sims: each file was being hashed once per pair (N-1 times) instead of once total.
Remove the LinesChunk wrapper class and use plain int dict keys, so frozenset intersection and dict lookups use C-level hash/eq (see the sketch after this list).
Above a certain threshold of similar lines, the pairing algorithm switches to a different strategy (this one is due to the psf/black profiling/ directory and its pathological content; good profiling data, black contributors, congrats).

One optimization to the duplicate-code checker:

Reuse the module AST that pylint already parsed in process_module instead of re-parsing every file from source (see the process_module commit message above).
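As mentioned in the list above, a toy illustration of why dropping the wrapper class helps (class and variable names are made up): intersecting frozensets of plain ints stays entirely in C, while a wrapper object forces a Python-level __hash__/__eq__ call per element.

```python
class LinesChunkLike:
    """Stand-in for the removed wrapper: hashing/equality run as Python bytecode."""

    def __init__(self, value: int) -> None:
        self._value = value

    def __hash__(self) -> int:
        return self._value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LinesChunkLike) and self._value == other._value


# Before (conceptually): every window hash wrapped in an object.
wrapped_1 = frozenset(LinesChunkLike(h) for h in range(50_000))
wrapped_2 = frozenset(LinesChunkLike(h) for h in range(25_000, 75_000))
slow_common = wrapped_1 & wrapped_2  # Python-level __hash__/__eq__ per element

# After: plain int keys, same result, C-level hash/eq throughout.
plain_1 = frozenset(range(50_000))
plain_2 = frozenset(range(25_000, 75_000))
fast_common = plain_1 & plain_2
```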
Benchmark: duplicate-code checker performance
Measured with:
Small projects are dominated by startup/parsing overhead, but the performance increase shows for bigger ones. Maybe we could even add duplicate-code back to the primer now.