What would you like to be added?
Here, the cache refers to the file written to the .downloaded folder after a download.
Currently, the caching system does not handle concurrent read and write access.
If an eodag instance forces the regeneration of a cache while another instance is serving it, the shared file can cause the code to freeze or, worse, serve corrupted data.
To prevent this, a new cache must be written off to the side and swapped in atomically, only when no one is reading the old file.
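The "write elsewhere, then swap" idea can be sketched as follows. This is a minimal illustration, not eodag's actual code; the function name `write_cache_atomically` is hypothetical. It relies on `os.replace`, which atomically substitutes the destination file on both POSIX and Windows, so readers either see the complete old file or the complete new one, never a half-written mix.

```python
import os
import tempfile


def write_cache_atomically(data: bytes, dest_path: str) -> None:
    """Write data to a unique temporary file, then atomically swap it in.

    Readers that already hold an open handle on the old file keep reading
    the old content; readers opening the path afterwards see the new file.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temporary file must live on the same filesystem as the
    # destination for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure bytes hit disk before the swap
        os.replace(tmp_path, dest_path)  # atomic substitution
    except BaseException:
        # Never leave a stray .part file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the temporary name is unique per call, two concurrent writers cannot clobber each other's partial output; the last `os.replace` simply wins.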
Why is this needed?
With an eodag server, or multiple instances of the eodag library, concurrent cache management is necessary.
The goal is to prevent multiple file operations from occurring simultaneously and corrupting the file or failing the download.
(read vs write) When serving a file as a stream, the read is spread out over time, and nothing guarantees that the underlying file stays stable under the open handle. If the file is overwritten or deleted while it is being served, the client receives corrupted or truncated data without any apparent error.
(write vs write) The same applies if two simultaneous downloads of the same streamed resource are requested: chunks from the two overlapping streams are written to the same file. The file pointer is not protected, and without an application-level mutex there is no guard against either scenario.
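Since both scenarios involve separate processes (an eodag server plus library instances), an in-process mutex is not enough; a cross-process lock on the cache file itself is needed. A minimal sketch, assuming a POSIX system, uses `fcntl.flock` advisory locks: readers take a shared lock so they can overlap with each other, while a writer takes an exclusive lock that waits out all readers. The helper names are hypothetical.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def shared_read_lock(path):
    # Readers take a shared (LOCK_SH) lock: many readers may coexist,
    # but a writer holding LOCK_EX blocks until they are all done.
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


@contextmanager
def exclusive_write_lock(path):
    # Writers take an exclusive (LOCK_EX) lock: it is granted only once
    # every shared reader has released, and it excludes other writers.
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Note that `flock` locks are advisory: they only protect against processes that also take the lock, which is why every eodag code path touching the cache would need to go through such helpers. A portable implementation would likely use a cross-platform locking library instead.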
How should it be implemented?
To achieve this:
- A new cache file must be generated in a temporary file, unique to each operation, rather than written directly to its final destination.
- eodag must maintain an internal state file (shared between instances and itself accessed concurrently) indicating whether a cache is being read, so that a pending replacement waits until the new download, staged in a temporary file, can be swapped in.
- Ideally, cache update requests should be queued centrally so that duplicates can be coalesced: if two requests to update the same cache have both been waiting and fire at the same moment, only one of them (usually the most recent) is applied.
- If request volume grows significantly, a DMA-like behavior should be considered: instead of simply blocking writes until there are no more reads, reads and writes are placed on a queue; contiguous reads execute in parallel, and as soon as a write is dequeued, execution switches to sequential mode so that new read requests cannot run concurrently with the write.
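The coalescing queue from the third point above can be sketched in a few lines. This is an in-process illustration only (a real multi-instance deployment would need the queue backed by a shared medium); the class name `CacheUpdateQueue` is hypothetical. The key property is that a newer request for the same cache silently replaces an older pending one, so only one update runs.

```python
import threading


class CacheUpdateQueue:
    """Coalesce concurrent update requests targeting the same cache key.

    If several requests for one key arrive while an earlier one is still
    pending, only the most recent request is kept; the rest are dropped.
    """

    def __init__(self):
        self._pending = {}  # cache key -> latest request payload
        self._cond = threading.Condition()

    def request_update(self, key, payload):
        with self._cond:
            # A newer request for the same key overwrites the pending one,
            # so duplicate regenerations never reach the worker.
            self._pending[key] = payload
            self._cond.notify()

    def pop_next(self):
        with self._cond:
            while not self._pending:
                self._cond.wait()
            # Pop the oldest pending key (dict preserves insertion order).
            key = next(iter(self._pending))
            return key, self._pending.pop(key)
```

A worker loop would call `pop_next()` and perform the temporary-file write and atomic swap described above for each `(key, payload)` it receives.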