Improve ZIP store performance and filesystem listing by kulvait · Pull Request #70 · zarr-developers/zarr-java

kulvait · 2026-04-01T16:43:18Z

Motivation for this work was that on large zip stores, the indexing was very slow and then opening arrays as well.
So I did replaced Apache Commons ZipArchiveInputStream by random access java class.
Change in FilesystemStore.java is more cosmetic but also slightly increase performance.

Replace stream-based ZIP access with random-access ZipFile for faster and more efficient reads
Optimize ZIP store initialization by reading the ZIP index only, avoiding full stream traversal
Introduce improved caching of directory structure and file sizes using synchronized maps
Simplify internal logic and remove unnecessary dependencies
Improve filesystem listing performance

Details:

Removed dependency on Apache Commons ZipArchiveInputStream and related classes in ReadOnlyZipStore
Added efficient caching using maps for directories and file sizes, async friendly
Normalized entry names for consistent lookup
Reduced redundant computations (e.g. cached entry sizes)
Simplified stream handling and chunk calculation

These changes significantly improve performance, reduce complexity, and make the ZIP store implementation more maintainable.

- Replace stream-based ZIP access with random-access `ZipFile` for faster and more efficient reads - Optimize ZIP store initialization by reading the ZIP index only, avoiding full stream traversal - Introduce improved caching of directory structure and file sizes using synchronized maps - Simplify internal logic and remove unnecessary dependencies - Improve filesystem listing performance Details: - Removed dependency on Apache Commons ZipArchiveInputStream and related classes in ReadOnlyZipStore - Added efficient caching using maps for directories and file sizes, async friendly - Normalized entry names for consistent lookup - Reduced redundant computations (e.g. cached entry sizes) - Simplified stream handling and chunk calculation These changes significantly improve performance, reduce complexity, and make the ZIP store implementation more maintainable.

… and directory existence - Problem: * Some tests failed because ReadOnlyZipStore could not locate ZIP entries when the store was created by simply zipping a directory. Tools differ: some produce entry names with a leading slash, others without. This caused getInputStream(), read(), and getSize() to return null. * testExists() failed for root keys because exists() included directories, but by design it should only test file existence. - Solution: * Added resolvePathWithLeadingSlashFromKeys() to try a secondary lookup with a leading slash. Primary lookup uses the standard key without leading slash; secondary is only for compatibility. * Modified exists(String[] keys) to check only fileSizeIndex, ignoring directories, to match the intended design. - Effect: * ReadOnlyZipStoreTest passes regardless of whether ZIP entries have leading slashes or not. * Test logic now clearly distinguishes between file existence and directory entries.

kulvait · 2026-04-16T14:54:06Z

I have updated the PR so it fixes failed workflow and tests. Can you please run it again?

This reverts a previous attempt to simplify the FilesystemStore#list implementation using a parallel stream and String-based path splitting: Files.walk(rootPath) .filter(Files::isRegularFile) .parallel() .map(path -> rootPath.relativize(path) .toString() .split(File.separator)); While this approach showed performance improvements in some scenarios (likely due to .parallel() decoupling filesystem traversal from mapping), it introduced a platform-specific issue: String.split() expects a regex, and File.separator ("\" on Windows) leads to a PatternSyntaxException. A correct fix would require quoting the separator, e.g.: split((java.util.regex.Pattern.quote(File.separator)) However, this change is outside the scope of the current PR, which focuses on ZipStore improvements rather than altering core FilesystemStore behavior. Revert to the upstream commit c9a5ee1 for correctness and consistency. As a follow-up, it may be worth evaluating the use of .parallel() separately, as it appears to provide measurable benefits in some workloads. Change of ReadOnlyZipStore.java is formatting only

kulvait · 2026-04-17T19:49:25Z

Due to remaining Windows test failures, I reverted FilesystemStore to the upstream implementation, see commit message. This change should restore cross-platform compatibility and allow all tests to pass. Please approve workflow.

…resource leaks Both get(...) and getInputStream(...) now delegate to a shared helper method that reads ZIP entry data into a byte[]. This refactor removes duplicated ZIP handling logic (entry lookup, range validation, skip/read loop) and ensures consistent behavior across both APIs. Previous implementation of getInputStream() returned an InputStream backed by a ZipFile entry. If callers failed to close the stream, as is a case in one of the tests in StoreTest.java, on Windows this caused file locking issues, wchich prevented deletion of the underlying ZIP file during tests.

- Clean up debugging and temporary logging statements - Rename ensureCacheNew to buildZipIndex for clarity - Replace synchronized HashSet wrappers with ConcurrentHashMap.newKeySet() to improve performance and reduce locking overhead during indexing

kulvait · 2026-04-23T13:32:42Z

I hope this PR will be considered for merge, as I put a lot of effort into addressing issues that were triggering test failures in CI. The changes also meaningfully improve the performance and efficiency of the ReadOnlyZipStore implementation.

normanrz

Thanks for the PR!
I think this should be a new class, because this implementation with ZipFile only works on the filesystem (FilesystemStore), whereas the existing ReadOnlyZipStore also works with remote stores (e.g. HttpStore, S3Store).

Thank you for the review, these suggestions integrate cleanly. I will review remaining suggestions and comment or integrate them. Co-authored-by: Norman Rzepka <[email protected]>

Renamed normalizeEntryName to stripLeadingAndTrailingSlashes for better clarity. Enhanced the normalization logic to handle multiple consecutive leading or trailing slashes. Now, any degenerate entries with multiple slashes are properly stripped to ensure consistent and safe entry paths. Removed redundant checks for leading or trailing slashes after normalization. -> Based on Norman Rzepka's suggestion. Since the normalization process now handles all cases, there's no need for further slash-stripping. Updated zip entry processing logic to: Continue using the isDirectory() flag and directoryToChildrenDirectoriesIndex for directory entries. Unified root directory handling, creating indices, no recursion.

kulvait · 2026-05-04T08:32:00Z

Thank you, Norman, for your comments and suggestions!

Personally, I don't have a strong preference on whether this should be implemented as a new class or not. From my perspective, in my plugin, reading the ZIP-backed Zarr is becoming more responsible than just the on-disk representation. It would be nice to support e.g. S3Store via HttpRange requests, but at the moment, this would be lot of work.

Feel free to rename the class.

normanrz

Thanks for the edits. This needs to be a new class with clear instructions for the users that it only works for filesystem stores. Maybe call it ReadOnlyFilesystemStore.

kulvait added 2 commits April 1, 2026 18:34

kulvait added 2 commits April 23, 2026 13:34

normanrz reviewed Apr 24, 2026

View reviewed changes

kulvait and others added 2 commits May 4, 2026 08:58

Apply suggestions from code review of Norman Rzepka

6761eb5

Thank you for the review, these suggestions integrate cleanly. I will review remaining suggestions and comment or integrate them. Co-authored-by: Norman Rzepka <[email protected]>

normanrz reviewed May 4, 2026

View reviewed changes

Comment thread src/main/java/dev/zarr/zarrjava/store/ReadOnlyZipStore.java

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve ZIP store performance and filesystem listing#70

Improve ZIP store performance and filesystem listing#70
kulvait wants to merge 7 commits intozarr-developers:mainfrom
kulvait:zip-clean

kulvait commented Apr 1, 2026

Uh oh!

kulvait commented Apr 16, 2026

Uh oh!

kulvait commented Apr 17, 2026

Uh oh!

kulvait commented Apr 23, 2026 •

edited

Loading

Uh oh!

normanrz left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kulvait commented May 4, 2026

Uh oh!

normanrz left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kulvait commented Apr 1, 2026

Uh oh!

kulvait commented Apr 16, 2026

Uh oh!

kulvait commented Apr 17, 2026

Uh oh!

kulvait commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

normanrz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kulvait commented May 4, 2026

Uh oh!

normanrz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kulvait commented Apr 23, 2026 •

edited

Loading