Skip to content

Commit 6d07a2a

Browse files
committed
metal: memory and correctness cleanup in gfx/drivers/metal.m
Five related fixes in the Metal driver, grouped together because they're all narrow, independent, and touch the same file: 1. Fix byte offset in MetalRaster.updateGlyph didModifyRange: The managed-storage buffer invalidation for incremental glyph uploads was passing the row index as the byte offset rather than row * stride. Length was correctly in bytes (height * _stride), so the invalidated range described bytes [row_index .. row_index + height*_stride), which almost never overlapped the actually-modified rows. On managed-storage devices this can leave recently-drawn glyphs invisible to the GPU until the atlas is invalidated by some other path. Every other didModifyRange: call site in this file uses a byte range; aligning this one matches the convention. No behavioural change on shared-storage / Cocoa Touch. 2. Stream screenshot read-back one row at a time Context.readBackBuffer: malloced a full-frame BGRA copy of the whole backbuffer, getBytes:'d into it, then converted BGRA->BGR into the caller's buffer row by row. For a 4K capture that is ~32 MiB of transient heap per screenshot. Restructure to getBytes: one row at a time directly into a small scratch buffer (stack up to 16K-wide, heap fallback beyond), then convert in place. Peak transient footprint drops from ~32 MiB to ~16 KiB. Also narrows the getBytes: source region to the viewport Y range instead of reading the whole backbuffer and discarding rows above and below. 3. Unify font atlas upload paths MetalRaster init had two branches: a "fast path" using newBufferWithBytes:length:options: when stride matched atlas width, and a row memcpy loop when it did not. Both copied the atlas exactly once, and the fast path carried a workaround comment noting that newBufferWithBytes: does not correctly invalidate the buffer on macOS, forcing a manual didModifyRange: anyway. That made the two paths behaviourally identical. Collapse both to a single newBufferWithLength: + .contents fill, with a whole-buffer memcpy when stride matches width and a row memcpy loop otherwise. One code path, one invalidation site, no change in allocated memory or copies. 4. Bound BufferChain memory and clear stale per-node allocated BufferChain grew monotonically: allocRange: appended a new node whenever a request exceeded the current node's remaining space, but discard only reset the head pointer and offset. Backing nodes were never freed, so a single oversized allocation (heavy shader pass, one-off geometry spike, content switch to a larger resolution or shader chain) kept its node alive for the lifetime of the driver, retained across all CHAIN_LENGTH chains. Steady-state retention was therefore the all-time high-water mark * CHAIN_LENGTH. Trim the tail at discard: find the last node with allocated > 0 and drop nodes after it. Nodes are appended in alloc order and allocRange: only advances forward, so a trailing unused node means the whole tail is unused and safe to drop. Interior unused nodes are kept (waste bounded by _blockLen per node; they will be reused by smaller allocs on the next cycle). Only trims when the chain was actually used this cycle (_allocated > 0) so a quiescent frame doesn't drop the chain and force reallocation on the next use. Also reset n.allocated on every node at discard. commitRanges walks all nodes with allocated > 0 and didModifyRange:'s them, so without this reset a node that was filled in cycle N but partially refilled in cycle N+1 would get a stale (larger) range committed. Semantically wrong and a bandwidth waste on macOS managed storage. 5. Fix bytesPerRow in TexturedView BGRA upload path TexturedView.updateFrame: (the menu pixel framebuffer upload, called from MetalMenu.updateFrame: -> set_texture_frame) was passing (4 * pitch) as bytesPerRow to replaceRegion: for BGRA8Unorm/BGRX8Unorm sources. pitch is already the source row stride in bytes (libretro convention, matched by the MetalMenu caller which computes it as RPixelFormatToBPP(format) * width, and matched by the adjacent else-branch). Multiplying by 4 told Metal to step 4x the source stride between rows, so row 0 read correct pixels but rows 1..height-1 read beyond the caller's buffer. Most likely to surface on the 32-bit RGUI path (rgb32=true -> RPixelFormatBGRA8Unorm); the 16-bit path (BGRA4Unorm, rgb32=false) goes through the conversion branch and is unaffected. Tested with RGUI, shaders and regular core
1 parent 04731c7 commit 6d07a2a

1 file changed

Lines changed: 120 additions & 47 deletions

File tree

gfx/drivers/metal.m

Lines changed: 120 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -691,10 +691,22 @@ - (bool)captureEnabled
691691

692692
- (bool)readBackBuffer:(uint8_t *)buffer
693693
{
694-
size_t x, y;
695-
NSUInteger dstStride, srcStride;
696-
uint8_t const *src;
697-
uint8_t *dst, *tmp;
694+
/* Read back the viewport region BGRA -> BGR and flip vertically.
695+
*
696+
* We stream one row at a time from Metal into a small scratch
697+
* buffer, converting as we go. Previously this allocated a
698+
* full-frame BGRA copy (width * height * 4 bytes) via malloc(),
699+
* which for a 4K capture is ~32 MiB of transient allocation
700+
* pressure per screenshot. One row is typically a few KiB and
701+
* fits comfortably on the stack (up to 16K width here; beyond
702+
* that we fall back to heap for safety). */
703+
size_t y;
704+
NSUInteger rowBytes, dstStride;
705+
uint8_t *dst;
706+
uint8_t stackRow[16 * 1024];
707+
uint8_t *row = stackRow;
708+
uint8_t *heapRow = NULL;
709+
698710
if (!_captureEnabled || _backBuffer == nil)
699711
return NO;
700712

@@ -704,30 +716,36 @@ - (bool)readBackBuffer:(uint8_t *)buffer
704716
return NO;
705717
}
706718

707-
tmp = malloc(_backBuffer.width * _backBuffer.height * 4);
708-
709-
[_backBuffer getBytes:tmp
710-
bytesPerRow:4 * _backBuffer.width
711-
fromRegion:MTLRegionMake2D(0, 0, _backBuffer.width, _backBuffer.height)
712-
mipmapLevel:0];
713-
714-
srcStride = _backBuffer.width * 4;
715-
src = tmp + (_viewport.y * srcStride);
719+
rowBytes = _backBuffer.width * 4;
720+
if (rowBytes > sizeof(stackRow))
721+
{
722+
heapRow = (uint8_t *)malloc(rowBytes);
723+
if (!heapRow)
724+
return NO;
725+
row = heapRow;
726+
}
716727

717728
dstStride = _viewport.width * 3;
718729
dst = buffer + (_viewport.height - 1) * dstStride;
719730

720-
for (y = 0; y < _viewport.height; y++, src += srcStride, dst -= dstStride)
731+
for (y = 0; y < _viewport.height; y++, dst -= dstStride)
721732
{
733+
size_t x;
734+
[_backBuffer getBytes:row
735+
bytesPerRow:rowBytes
736+
fromRegion:MTLRegionMake2D(0, (NSUInteger)_viewport.y + y,
737+
_backBuffer.width, 1)
738+
mipmapLevel:0];
739+
722740
for (x = 0; x < _viewport.width; x++)
723741
{
724-
dst[3 * x + 0] = src[4 * (_viewport.x + x) + 0];
725-
dst[3 * x + 1] = src[4 * (_viewport.x + x) + 1];
726-
dst[3 * x + 2] = src[4 * (_viewport.x + x) + 2];
742+
dst[3 * x + 0] = row[4 * (_viewport.x + x) + 0];
743+
dst[3 * x + 1] = row[4 * (_viewport.x + x) + 1];
744+
dst[3 * x + 2] = row[4 * (_viewport.x + x) + 2];
727745
}
728746
}
729747

730-
free(tmp);
748+
free(heapRow);
731749

732750
return YES;
733751
}
@@ -962,6 +980,48 @@ - (void)commitRanges
962980

963981
- (void)discard
964982
{
983+
/* Trim the tail: any node that wasn't touched during this
984+
* chain's previous use (allocated == 0) is dropped. Nodes are
985+
* appended in alloc order, so once we see the first trailing
986+
* unused node the whole tail is unused. We only trim when the
987+
* chain was actually used (_allocated > 0) so that a single
988+
* quiescent frame doesn't drop the entire chain and force
989+
* reallocation on the next use.
990+
*
991+
* This bounds retained memory to the recent high-water mark
992+
* rather than the all-time high-water mark, which previously
993+
* grew monotonically: any one-off large allocation (e.g. a
994+
* heavy shader pass or a brief geometry spike) kept its
995+
* oversized backing node alive for the lifetime of the driver,
996+
* across all CHAIN_LENGTH chains. */
997+
if (_head && _allocated > 0)
998+
{
999+
BufferNode *keep = _head;
1000+
BufferNode *n;
1001+
for (n = _head; n != nil; n = n.next)
1002+
{
1003+
if (n.allocated > 0)
1004+
keep = n;
1005+
}
1006+
if (keep.next)
1007+
{
1008+
NSUInteger dropped = 0;
1009+
for (n = keep.next; n != nil; n = n.next)
1010+
dropped += n.src.length;
1011+
_length -= dropped;
1012+
keep.next = nil;
1013+
}
1014+
}
1015+
1016+
/* Reset per-node allocated so commitRanges on the next use of
1017+
* this chain does not didModifyRange: a stale range from this
1018+
* cycle into a node that gets partially refilled. */
1019+
{
1020+
BufferNode *n;
1021+
for (n = _head; n != nil; n = n.next)
1022+
n.allocated = 0;
1023+
}
1024+
9651025
_current = _head;
9661026
_offset = 0;
9671027
_allocated = 0;
@@ -1471,17 +1531,23 @@ - (void)drawWithEncoder:(id<MTLRenderCommandEncoder>)rce
14711531

14721532
- (void)updateFrame:(void const *)src pitch:(NSUInteger)pitch
14731533
{
1534+
/* pitch is the source row stride in bytes (libretro convention,
1535+
* matched by the MetalMenu caller which passes BPP * width).
1536+
* Pass it straight through to Metal: multiplying by 4 here told
1537+
* the driver to walk 4x the source memory between rows, reading
1538+
* past the source allocation on every row after the first. The
1539+
* else-branch already passes pitch straight through. */
14741540
if (_format == RPixelFormatBGRA8Unorm || _format == RPixelFormatBGRX8Unorm)
14751541
{
14761542
[_texture replaceRegion:MTLRegionMake2D(0, 0, (NSUInteger)_size.width, (NSUInteger)_size.height)
14771543
mipmapLevel:0 withBytes:src
1478-
bytesPerRow:(NSUInteger)(4 * pitch)];
1544+
bytesPerRow:pitch];
14791545
}
14801546
else
14811547
{
14821548
[_src replaceRegion:MTLRegionMake2D(0, 0, (NSUInteger)_size.width, (NSUInteger)_size.height)
14831549
mipmapLevel:0 withBytes:src
1484-
bytesPerRow:(NSUInteger)(pitch)];
1550+
bytesPerRow:pitch];
14851551
_srcDirty = YES;
14861552
}
14871553
}
@@ -1650,37 +1716,37 @@ - (instancetype)initWithDriver:(MetalDriver *)driver fontPath:(const char *)font
16501716
_uniforms.projectionMatrix = matrix_proj_ortho(0, 1, 0, 1);
16511717
_atlas = _font_driver->get_atlas(_font_data);
16521718
_stride = MTL_ALIGN_BUFFER(_atlas->width);
1653-
if (_stride == _atlas->width)
1654-
{
1655-
_buffer = [_context.device newBufferWithBytes:_atlas->buffer
1656-
length:(NSUInteger)(_stride * _atlas->height)
1657-
options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
1658-
1659-
/* Even though newBufferWithBytes will copy the initial contents
1660-
* from our atlas, it doesn't seem to invalidate the buffer when
1661-
* doing so, causing corrupted text rendering if we hit this code
1662-
* path. To work around it we manually invalidate the buffer. */
1663-
#if !defined(HAVE_COCOATOUCH)
1664-
[_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
1665-
#endif
1666-
}
1667-
else
1719+
1720+
/* Allocate an uninitialized managed buffer and fill it through
1721+
* .contents. This collapses two previous branches (fast path
1722+
* via newBufferWithBytes:, slow path via row memcpy loop) into
1723+
* one: row memcpy handles both the aligned and padded cases
1724+
* and avoids the newBufferWithBytes: workaround (which had to
1725+
* manually didModifyRange: the whole buffer anyway because
1726+
* the initial copy was not correctly invalidated on macOS). */
1727+
_buffer = [_context.device newBufferWithLength:(NSUInteger)(_stride * _atlas->height)
1728+
options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
16681729
{
16691730
size_t i;
1670-
_buffer = [_context.device newBufferWithLength:(NSUInteger)(_stride * _atlas->height)
1671-
options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
1672-
void *dst = _buffer.contents;
1673-
void *src = _atlas->buffer;
1674-
for (i = 0; i < _atlas->height; i++)
1731+
uint8_t *dst = (uint8_t *)_buffer.contents;
1732+
const uint8_t *src = (const uint8_t *)_atlas->buffer;
1733+
if (_stride == _atlas->width)
1734+
{
1735+
memcpy(dst, src, (size_t)_stride * _atlas->height);
1736+
}
1737+
else
16751738
{
1676-
memcpy(dst, src, _atlas->width);
1677-
dst += _stride;
1678-
src += _atlas->width;
1739+
for (i = 0; i < _atlas->height; i++)
1740+
{
1741+
memcpy(dst, src, _atlas->width);
1742+
dst += _stride;
1743+
src += _atlas->width;
1744+
}
16791745
}
1746+
}
16801747
#if !defined(HAVE_COCOATOUCH)
1681-
[_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
1748+
[_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
16821749
#endif
1683-
}
16841750

16851751
MTLTextureDescriptor *td = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatR8Unorm
16861752
width:_atlas->width
@@ -1756,8 +1822,15 @@ - (void)updateGlyph:(const struct font_glyph *)glyph
17561822
}
17571823

17581824
#if !defined(HAVE_COCOATOUCH)
1759-
NSUInteger offset = glyph->atlas_offset_y;
1760-
NSUInteger len = glyph->height * _stride;
1825+
/* didModifyRange takes a BYTE range, not a row index.
1826+
* Every other call site in this file (lines 958, 1664, 1681,
1827+
* 3082, 3550) passes bytes. Previously offset was the row
1828+
* index, which meant the invalidated range almost never
1829+
* overlapped the actually-modified rows on managed-storage
1830+
* devices, producing stale/garbled glyphs until the entire
1831+
* atlas was invalidated by some other path. */
1832+
NSUInteger offset = (NSUInteger)glyph->atlas_offset_y * _stride;
1833+
NSUInteger len = (NSUInteger)glyph->height * _stride;
17611834
[_buffer didModifyRange:NSMakeRange(offset, len)];
17621835
#endif
17631836

0 commit comments

Comments
 (0)