Commit 9e20d7e

rpng: stream deflate output via multi-IDAT instead of one big buffer
Replaces rpng_save_image_stream's full-frame encode_buf + monolithic deflate with a per-row incremental deflate feeding a 16 KiB chunk buffer that emits multiple IDAT chunks as it fills. Pixel output is unchanged; only the internal memory shape differs.

Peak transient memory at 4K BGR24 drops from ~50 MiB (encode_buf + deflate_buf, both sized to the full filtered frame) to ~70 KiB (per-row filter scratch + one 16 KiB chunk buffer). Memory now scales with width only, not width*height.

Cost is ~0.07% more output bytes from IDAT chunk headers (12 bytes per 16 KiB chunk):

    size        old bytes   new bytes   delta
    320x240        19,733      19,745   +12  (+0.06%)
    1024x768      100,995     101,067   +72  (+0.07%)
    1920x1080     181,909     182,041   +132 (+0.07%)

A 1920x1080 screenshot now produces 12 IDAT chunks vs. 1 before. pngcheck accepts both old and new output at identical compression ratio (97.8% on 1920x1080 pseudorandom).

Wall-clock performance: no regression. Measured on Linux x86_64 with /tmp on tmpfs, 5 iterations per config, two trials each:

    size       pattern    old avg    new avg    delta
    1280x720   solid      112.1 ms   112.2 ms   +0.1%
    1280x720   gradient   116.2 ms   112.7 ms   -3.0%
    1280x720   photo      497.7 ms   497.2 ms   -0.1%
    1280x720   random     129.3 ms   127.8 ms   -1.1%
    1920x1080  solid      244.9 ms   238.3 ms   -2.7%
    1920x1080  gradient   249.5 ms   240.5 ms   -3.6%
    1920x1080  random     270.8 ms   269.5 ms   -0.5%

Deltas below ~3% are within the trial-to-trial noise floor observed for the same encoder. Overall, the new encoder is equal to or marginally faster than the old one in every measured case, plausibly from reduced allocator pressure (no ~25 MiB malloc+free per encode) and better cache locality. Results on spinning disks, where each IDAT flush becomes a real write, may differ; not measured.

Filter selection, per-row scratch layout, and prev_encoded carry-over are unchanged -- same five filters, same lowest-SAD pick. Only the downstream "what happens to the filtered row" step differs.
Three trans_stream_zlib subtleties the implementation accounts for:

1. The backend reports AGAIN (not BUFFER_FULL) when avail_out and avail_in both hit zero on the same call -- BUFFER_FULL requires avail_in != 0. We detect this via chunk_fill >= IDAT_CHUNK_SIZE after every successful row feed and flush proactively; otherwise the next row's trans() would find avail_out=0 with fresh input waiting and return Z_BUF_ERROR.

2. During the final Z_FINISH drain, AGAIN means "more output pending, buffer full" (set_in(NULL, 0) ensures input is never the gating factor). We flush and retry until NONE.

3. flush_idat_chunk tolerates payload_len=0, so a Z_STREAM_END landing on an exact chunk boundary doesn't emit a spurious empty IDAT.

Ships with a round-trip regression test (libretro-common/samples/formats/png/rpng_roundtrip_test.c) verifying pixel-level round-trips across all three public encode entry points (argb, bgr24 top-down, bgr24 bottom-up via the negative-pitch trick used by task_screenshot's viewport fast path) and three size tiers:

- Hand-picked (45): 4x4, 37x29, 320x240 x 5 patterns x 3 entry points. Patterns exercise different filter-selection outcomes.
- Small-width (48): widths {1,2,3,8,31,32,33,257} x heights {1,3} x 3 entry points, pseudorandom. Catches off-by-one bugs at narrow-width and near-alignment boundaries.
- Large (12): 1024x1, 1x1024, 2048x1, 1x12000 x 3 entry points, pseudorandom. 1x12000 crosses zlib's default 32 KiB sliding window.

105 subtests total. Exit 0 on success. Every encoded PNG also passes a structural check: pngcheck(1) if installed, otherwise a built-in fallback that walks chunks, verifies CRCs via encoding_crc32 (already linked), and enforces chunk ordering. pngcheck is not a build or runtime dependency.
When the built-in fallback is active, a startup self-test generates a valid PNG, confirms the fallback accepts it, corrupts a byte, and confirms the fallback rejects it -- aborting if the validator itself is broken rather than running 105 subtests under a broken validator.

CI integration (.github/workflows/Linux-libretro-common-samples.yml): adds rpng_roundtrip_test to the RUN_TARGETS allowlist and extends the Makefile target-extraction regex from TARGET_TEST to TARGET_TEST[0-9]* so that TARGET_TEST2 is picked up by auto-discovery. Strict superset -- the extracted target lists are unchanged for every other Makefile in libretro-common/samples/.

Verified:
- 105/105 pass with pngcheck validation active.
- 105/105 pass with the built-in fallback active (stock ubuntu-latest runner scenario -- no pngcheck, zlib1g-dev only).
- ASan+UBSan clean in both configurations.
- pngcheck OK on every output file.
- File sizes within +0.07% of the old encoder.
- No wall-clock regression on measured configurations.
1 parent 5aabb22 commit 9e20d7e

4 files changed

Lines changed: 956 additions & 104 deletions

File tree

.github/workflows/Linux-libretro-common-samples.yml

Lines changed: 3 additions & 1 deletion
@@ -59,6 +59,7 @@ jobs:
           rtga_test
           rbmp_test
           rpng_chunk_overflow_test
+          rpng_roundtrip_test
         )
 
         # Per-binary run command (overrides ./<binary> if present).
@@ -130,8 +131,9 @@ jobs:
         #   TARGET := foo
         #   TARGETS = a b c
         #   TARGET_TEST := foo_test (second target in same Makefile)
+        #   TARGET_TEST2 := foo_test2 (third target, fourth, ...)
         mapfile -t targets < <(
-          grep -hE '^(TARGET|TARGETS|TARGET_TEST)[[:space:]]*[:?]?=' "$d/Makefile" \
+          grep -hE '^(TARGET|TARGETS|TARGET_TEST[0-9]*)[[:space:]]*[:?]?=' "$d/Makefile" \
            | sed -E 's/^[^=]*=[[:space:]]*//' \
            | tr -s ' \t' '\n' \
            | grep -v '^$' \

libretro-common/formats/png/rpng_encode.c

Lines changed: 168 additions & 101 deletions
@@ -204,26 +204,56 @@ static unsigned filter_paeth(uint8_t *target,
    return count_sad(target, width);
 }
 
+/* Size of the per-chunk deflate output buffer. A screenshot-sized
+ * encode will fill this many times over and produce multiple IDAT
+ * chunks; smaller than zlib's default window (32 KiB) to keep
+ * peak memory low while large enough to amortise chunk-header
+ * overhead (12 bytes per IDAT). */
+#define IDAT_CHUNK_SIZE 16384
+
+/* Emit one IDAT chunk. `chunk_buf` is laid out as:
+ *   bytes [0..4):             length field (filled in here, big-endian)
+ *   bytes [4..8):             literal "IDAT"
+ *   bytes [8..8+payload_len): the deflate output produced by
+ *                             this chunk's worth of trans() calls
+ * Matches the layout png_write_idat_string expects. */
+static bool flush_idat_chunk(intfstream_t *intf_s,
+      uint8_t *chunk_buf, size_t payload_len)
+{
+   if (payload_len == 0)
+      return true; /* empty chunk -- nothing to emit, not an error */
+   dword_write_be(chunk_buf + 0, (uint32_t)payload_len);
+   memcpy(chunk_buf + 4, "IDAT", 4);
+   return png_write_idat_string(intf_s, chunk_buf, payload_len + 8);
+}
+
 bool rpng_save_image_stream(const uint8_t *data, intfstream_t* intf_s,
       unsigned width, unsigned height, signed pitch, unsigned bpp)
 {
    unsigned h;
    struct png_ihdr ihdr = {0};
    bool ret = true;
    const struct trans_stream_backend *stream_backend = NULL;
-   size_t encode_buf_size  = 0;
-   uint8_t *encode_buf     = NULL;
-   uint8_t *deflate_buf    = NULL;
-   uint8_t *rgba_line      = NULL;
-   uint8_t *up_filtered    = NULL;
-   uint8_t *sub_filtered   = NULL;
-   uint8_t *avg_filtered   = NULL;
-   uint8_t *paeth_filtered = NULL;
-   uint8_t *prev_encoded   = NULL;
-   uint8_t *encode_target  = NULL;
-   void *stream            = NULL;
-   uint32_t total_in       = 0;
-   uint32_t total_out      = 0;
+   uint8_t *rgba_line      = NULL;
+   uint8_t *up_filtered    = NULL;
+   uint8_t *sub_filtered   = NULL;
+   uint8_t *avg_filtered   = NULL;
+   uint8_t *paeth_filtered = NULL;
+   uint8_t *prev_encoded   = NULL;
+   /* filter_line holds [filter_byte][filtered_row] and is what we
+    * feed into deflate one row at a time. */
+   uint8_t *filter_line    = NULL;
+   /* chunk_buf is the IDAT-chunk staging buffer:
+    *   [0..4):                 length field (filled in at flush time)
+    *   [4..8):                 "IDAT"
+    *   [8..8+IDAT_CHUNK_SIZE): deflate output */
+   uint8_t *chunk_buf      = NULL;
+   void *stream            = NULL;
+   size_t line_len         = (size_t)width * bpp;
+   /* How many bytes deflate has produced into the current chunk_buf
+    * since the last set_out. Reset to 0 after every flush_idat_chunk. */
+   size_t chunk_fill       = 0;
+   enum trans_stream_error err = TRANS_STREAM_ERROR_NONE;
 
    if (!intf_s)
       GOTO_END_ERROR();
@@ -233,133 +263,170 @@ bool rpng_save_image_stream(const uint8_t *data, intfstream_t* intf_s,
    if (intfstream_write(intf_s, png_magic, sizeof(png_magic)) != sizeof(png_magic))
       GOTO_END_ERROR();
 
-   ihdr.width  = width;
-   ihdr.height = height;
-   ihdr.depth  = 8;
+   ihdr.width      = width;
+   ihdr.height     = height;
+   ihdr.depth      = 8;
    ihdr.color_type = bpp == sizeof(uint32_t) ? 6 : 2; /* RGBA or RGB */
    if (!png_write_ihdr_string(intf_s, &ihdr))
       GOTO_END_ERROR();
 
-   encode_buf_size = (width * bpp + 1) * height;
-   encode_buf      = (uint8_t*)malloc(encode_buf_size);
-   if (!encode_buf)
+   /* Per-row scratch. ~width*bpp each -- trivial compared to the
+    * frame-sized encode_buf the old full-buffer path allocated. */
+   prev_encoded   = (uint8_t*)calloc(1, line_len);
+   rgba_line      = (uint8_t*)malloc(line_len);
+   up_filtered    = (uint8_t*)malloc(line_len);
+   sub_filtered   = (uint8_t*)malloc(line_len);
+   avg_filtered   = (uint8_t*)malloc(line_len);
+   paeth_filtered = (uint8_t*)malloc(line_len);
+   filter_line    = (uint8_t*)malloc(line_len + 1);
+   chunk_buf      = (uint8_t*)malloc(IDAT_CHUNK_SIZE + 8);
+   if (!prev_encoded || !rgba_line || !up_filtered || !sub_filtered
+         || !avg_filtered || !paeth_filtered || !filter_line
+         || !chunk_buf)
       GOTO_END_ERROR();
 
-   prev_encoded = (uint8_t*)calloc(1, width * bpp);
-   if (!prev_encoded)
+   stream = stream_backend->stream_new();
+   if (!stream)
       GOTO_END_ERROR();
 
-   rgba_line      = (uint8_t*)malloc(width * bpp);
-   up_filtered    = (uint8_t*)malloc(width * bpp);
-   sub_filtered   = (uint8_t*)malloc(width * bpp);
-   avg_filtered   = (uint8_t*)malloc(width * bpp);
-   paeth_filtered = (uint8_t*)malloc(width * bpp);
-   if (!rgba_line || !up_filtered || !sub_filtered || !avg_filtered || !paeth_filtered)
-      GOTO_END_ERROR();
+   /* Point deflate's output at our chunk staging area (after the
+    * 8-byte chunk header). We re-point it every time we flush
+    * a chunk so the driver doesn't need to know the chunk layout. */
+   stream_backend->set_out(stream,
+         chunk_buf + 8, (uint32_t)IDAT_CHUNK_SIZE);
 
-   encode_target = encode_buf;
-   for (h = 0; h < height;
-         h++, encode_target += width * bpp, data += pitch)
+   for (h = 0; h < height; h++, data += pitch)
    {
+      uint32_t rd, wn;
+      uint8_t filter;
+      unsigned none_score, up_score, sub_score, avg_score, paeth_score;
+      unsigned min_sad;
+      const uint8_t *chosen_filtered;
+
       if (bpp == sizeof(uint32_t))
          copy_argb_line(rgba_line, (const uint32_t*)data, width);
      else
         copy_bgr24_line(rgba_line, data, width);
 
-      /* Try every filtering method, and choose the method
-       * which has most entries as zero.
+      /* Filter selection unchanged from the previous implementation:
+       * try every filter, pick the one with lowest sum-of-abs-deviation. */
+      none_score  = count_sad(rgba_line, line_len);
+      up_score    = filter_up   (up_filtered,    rgba_line, prev_encoded, width, bpp);
+      sub_score   = filter_sub  (sub_filtered,   rgba_line, width, bpp);
+      avg_score   = filter_avg  (avg_filtered,   rgba_line, prev_encoded, width, bpp);
+      paeth_score = filter_paeth(paeth_filtered, rgba_line, prev_encoded, width, bpp);
+
+      filter          = 0;
+      min_sad         = none_score;
+      chosen_filtered = rgba_line;
+      if (sub_score   < min_sad) { filter = 1; chosen_filtered = sub_filtered;   min_sad = sub_score; }
+      if (up_score    < min_sad) { filter = 2; chosen_filtered = up_filtered;    min_sad = up_score; }
+      if (avg_score   < min_sad) { filter = 3; chosen_filtered = avg_filtered;   min_sad = avg_score; }
+      if (paeth_score < min_sad) { filter = 4; chosen_filtered = paeth_filtered; }
+
+      filter_line[0] = filter;
+      memcpy(filter_line + 1, chosen_filtered, line_len);
+      memcpy(prev_encoded, rgba_line, line_len);
+
+      /* Feed this row into deflate. The loop handles the case where
+       * our chunk buffer fills mid-row (BUFFER_FULL): flush IDAT,
+       * point deflate at a fresh output buffer, and keep going.
       *
-       * This is probably not very optimal, but it's very
-       * simple to implement.
-       */
+       * When trans() returns success with err=AGAIN, zlib has
+       * consumed what we gave it but hasn't finalized (no Z_FINISH
+       * was requested) -- that's the normal "ok, send more data
+       * next time" signal. We break out and feed the next row. */
+      stream_backend->set_in(stream, filter_line, (uint32_t)(line_len + 1));
+      for (;;)
      {
-         unsigned none_score  = count_sad(rgba_line, width * bpp);
-         unsigned up_score    = filter_up(up_filtered, rgba_line, prev_encoded, width, bpp);
-         unsigned sub_score   = filter_sub(sub_filtered, rgba_line, width, bpp);
-         unsigned avg_score   = filter_avg(avg_filtered, rgba_line, prev_encoded, width, bpp);
-         unsigned paeth_score = filter_paeth(paeth_filtered, rgba_line, prev_encoded, width, bpp);
-
-         uint8_t filter = 0;
-         unsigned min_sad = none_score;
-         const uint8_t *chosen_filtered = rgba_line;
-
-         if (sub_score < min_sad)
-         {
-            filter          = 1;
-            chosen_filtered = sub_filtered;
-            min_sad         = sub_score;
-         }
+         bool ok = stream_backend->trans(stream, false, &rd, &wn, &err);
+         chunk_fill += wn;
 
-         if (up_score < min_sad)
+         if (ok)
         {
-            filter          = 2;
-            chosen_filtered = up_filtered;
-            min_sad         = up_score;
+            /* All input consumed. If the output buffer also happens
+             * to be exactly full (avail_in=0 AND avail_out=0 on the
+             * same call, which the trans API reports as success
+             * with AGAIN rather than BUFFER_FULL), flush proactively
+             * -- otherwise the next row's trans() would find
+             * avail_out=0 and error out. */
+            if (chunk_fill >= IDAT_CHUNK_SIZE)
+            {
+               if (!flush_idat_chunk(intf_s, chunk_buf, chunk_fill))
+                  GOTO_END_ERROR();
+               chunk_fill = 0;
+               stream_backend->set_out(stream,
+                     chunk_buf + 8, (uint32_t)IDAT_CHUNK_SIZE);
+            }
+            break;
         }
 
-         if (avg_score < min_sad)
-         {
-            filter          = 3;
-            chosen_filtered = avg_filtered;
-            min_sad         = avg_score;
-         }
-
-         if (paeth_score < min_sad)
-         {
-            filter          = 4;
-            chosen_filtered = paeth_filtered;
-         }
+         if (err != TRANS_STREAM_ERROR_BUFFER_FULL)
+            GOTO_END_ERROR();
 
-         *encode_target++ = filter;
-         memcpy(encode_target, chosen_filtered, width * bpp);
-
-         memcpy(prev_encoded, rgba_line, width * bpp);
+         /* Output filled mid-row. chunk_fill should equal
+          * IDAT_CHUNK_SIZE. Flush and re-point. */
+         if (!flush_idat_chunk(intf_s, chunk_buf, chunk_fill))
+            GOTO_END_ERROR();
+         chunk_fill = 0;
+         stream_backend->set_out(stream,
+               chunk_buf + 8, (uint32_t)IDAT_CHUNK_SIZE);
      }
   }
 
-   /* Deflate output is typically smaller than input; zlib guarantees it
-    * will not exceed input + 0.1% + 12 bytes. We use 1.01x + 256 for
-    * safety, which is substantially less wasteful than the previous 2x. */
+   /* All rows consumed. Drain deflate with Z_FINISH, emitting IDATs
+    * on BUFFER_FULL, final partial on NONE (Z_STREAM_END). */
+   stream_backend->set_in(stream, NULL, 0);
+   for (;;)
   {
-      size_t deflate_bound = (size_t)(encode_buf_size * 1.01) + 256;
-      deflate_buf = (uint8_t*)malloc(deflate_bound + 8); /* +8 for IDAT header */
-      if (!deflate_buf)
-         GOTO_END_ERROR();
+      uint32_t rd = 0, wn = 0;
+      bool ok = stream_backend->trans(stream, true, &rd, &wn, &err);
+      chunk_fill += wn;
 
-      stream = stream_backend->stream_new();
-
-      if (!stream)
+      if (!ok)
+      {
+         /* BUFFER_FULL during flush-drain with avail_in=0 shouldn't
+          * strictly be reachable, but handle defensively. */
+         if (err != TRANS_STREAM_ERROR_BUFFER_FULL)
+            GOTO_END_ERROR();
+         if (!flush_idat_chunk(intf_s, chunk_buf, chunk_fill))
+            GOTO_END_ERROR();
+         chunk_fill = 0;
+         stream_backend->set_out(stream,
+               chunk_buf + 8, (uint32_t)IDAT_CHUNK_SIZE);
+         continue;
+      }
+      if (err == TRANS_STREAM_ERROR_AGAIN)
+      {
+         /* Z_OK during Z_FINISH with avail_in=0 means deflate has
+          * more output to emit but our buffer ran out of space.
+          * Flush the full chunk and give it more room. */
+         if (!flush_idat_chunk(intf_s, chunk_buf, chunk_fill))
+            GOTO_END_ERROR();
+         chunk_fill = 0;
+         stream_backend->set_out(stream,
+               chunk_buf + 8, (uint32_t)IDAT_CHUNK_SIZE);
+         continue;
+      }
+      /* err == NONE: Z_STREAM_END. Flush whatever's in the buffer
+       * and we're done. flush_idat_chunk tolerates chunk_fill==0. */
+      if (!flush_idat_chunk(intf_s, chunk_buf, chunk_fill))
         GOTO_END_ERROR();
-
-      stream_backend->set_in(
-            stream,
-            encode_buf,
-            (unsigned)encode_buf_size);
-      stream_backend->set_out(
-            stream,
-            deflate_buf + 8,
-            (unsigned)deflate_bound);
+      break;
   }
 
-   if (!stream_backend->trans(stream, true, &total_in, &total_out, NULL))
-      GOTO_END_ERROR();
-
-   memcpy(deflate_buf + 4, "IDAT", 4);
-   dword_write_be(deflate_buf + 0, ((uint32_t)total_out));
-   if (!png_write_idat_string(intf_s, deflate_buf, ((size_t)total_out + 8)))
-      GOTO_END_ERROR();
-
   if (!png_write_iend_string(intf_s))
      GOTO_END_ERROR();
+
 end:
-   free(encode_buf);
-   free(deflate_buf);
   free(rgba_line);
   free(prev_encoded);
   free(up_filtered);
   free(sub_filtered);
   free(avg_filtered);
   free(paeth_filtered);
+   free(filter_line);
+   free(chunk_buf);
 
   if (stream_backend)
   {
