metal: memory and correctness cleanup in gfx/drivers/metal.m

LibretroAdmin · LibretroAdmin · commit 6d07a2a97edd · 2026-04-22T15:32:51.000+02:00
Five related fixes in the Metal driver, grouped together because
they're all narrow, independent, and touch the same file:

1. Fix byte offset in MetalRaster.updateGlyph didModifyRange:

   The managed-storage buffer invalidation for incremental glyph
   uploads was passing the row index as the byte offset rather than
   row * stride. Length was correctly in bytes (height * _stride),
   so the invalidated range described bytes
   [row_index .. row_index + height*_stride), which almost never
   overlapped the actually-modified rows. On managed-storage devices
   this can leave recently-drawn glyphs invisible to the GPU until
   the atlas is invalidated by some other path.

   Every other didModifyRange: call site in this file uses a byte
   range; aligning this one matches the convention. No behavioural
   change on shared-storage / Cocoa Touch.

2. Stream screenshot read-back one row at a time

   Context.readBackBuffer: malloced a full-frame BGRA copy of the
   whole backbuffer, getBytes:'d into it, then converted BGRA->BGR
   into the caller's buffer row by row. For a 4K capture that is
   ~32 MiB of transient heap per screenshot.

   Restructure to getBytes: one row at a time into a small scratch
   buffer (16 KiB on the stack, enough for rows up to 4096 pixels
   wide; heap fallback beyond), then convert into the caller's
   buffer. Peak transient footprint drops from ~32 MiB to
   ~16 KiB. Also narrows the getBytes: source region to
   the viewport Y range instead of reading the whole backbuffer
   and discarding rows above and below.
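
   The streaming pattern can be sketched in plain C. This is an
   illustration under assumptions, not the driver code: fetch_row
   is a hypothetical stand-in for the Metal getBytes: call, and
   read_back_bgr and its parameters are invented names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the texture read: copy one BGRA row
 * of the source image into `out`. */
static void fetch_row(const uint8_t *src, size_t row_bytes,
      size_t row, uint8_t *out)
{
   memcpy(out, src + row * row_bytes, row_bytes);
}

/* Convert the viewport of a BGRA image to vertically flipped BGR,
 * holding only one source row at a time.
 * Returns 1 on success, 0 if the heap fallback fails. */
static int read_back_bgr(const uint8_t *src, size_t src_w,
      size_t vp_x, size_t vp_y, size_t vp_w, size_t vp_h,
      uint8_t *out)
{
   uint8_t  stack_row[16 * 1024];
   uint8_t *row       = stack_row;
   uint8_t *heap_row  = NULL;
   size_t   row_bytes = src_w * 4;
   size_t   x, y;

   /* Rows wider than 4096 pixels do not fit the stack scratch. */
   if (row_bytes > sizeof(stack_row))
   {
      heap_row = (uint8_t *)malloc(row_bytes);
      if (!heap_row)
         return 0;
      row = heap_row;
   }

   for (y = 0; y < vp_h; y++)
   {
      /* Source row y lands on destination row vp_h - 1 - y. */
      uint8_t *dst = out + (vp_h - 1 - y) * vp_w * 3;

      fetch_row(src, row_bytes, vp_y + y, row);
      for (x = 0; x < vp_w; x++)
      {
         dst[3 * x + 0] = row[4 * (vp_x + x) + 0];
         dst[3 * x + 1] = row[4 * (vp_x + x) + 1];
         dst[3 * x + 2] = row[4 * (vp_x + x) + 2];
      }
   }

   free(heap_row); /* free(NULL) is a no-op on the stack path */
   return 1;
}
```

   The sketch computes the destination row inside the loop rather
   than decrementing a pointer, so no pointer ever steps below the
   start of the output buffer.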

3. Unify font atlas upload paths

   MetalRaster init had two branches: a "fast path" using
   newBufferWithBytes:length:options: when stride matched atlas
   width, and a row memcpy loop when it did not. Both copied the
   atlas exactly once, and the fast path carried a workaround
   comment noting that newBufferWithBytes: does not correctly
   invalidate the buffer on macOS, forcing a manual
   didModifyRange: anyway. That made the two paths behaviourally
   identical.

   Collapse both to a single newBufferWithLength: + .contents
   fill, with a whole-buffer memcpy when stride matches width
   and a row memcpy loop otherwise. One code path, one
   invalidation site, no change in allocated memory or copies.
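
   A plain-C sketch of the unified fill, assuming a hypothetical
   copy_atlas helper (not a function in metal.m) and a destination
   that already holds stride * height bytes:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a tightly packed width x height atlas into a destination
 * whose rows are `stride` bytes apart (stride >= width). */
static void copy_atlas(uint8_t *dst, const uint8_t *src,
      size_t width, size_t height, size_t stride)
{
   size_t i;

   if (stride == width)
   {
      /* No padding: source and destination layouts are identical. */
      memcpy(dst, src, stride * height);
      return;
   }

   /* Padded rows: copy width bytes, then skip the padding. */
   for (i = 0; i < height; i++)
   {
      memcpy(dst, src, width);
      dst += stride;
      src += width;
   }
}
```

   Padding bytes are left untouched by the row loop, so they keep
   whatever the destination held before the copy.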

4. Bound BufferChain memory and clear stale per-node allocated

   BufferChain grew monotonically: allocRange: appended a new node
   whenever a request exceeded the current node's remaining space,
   but discard only reset the head pointer and offset. Backing
   nodes were never freed, so a single oversized allocation (heavy
   shader pass, one-off geometry spike, content switch to a larger
   resolution or shader chain) kept its node alive for the lifetime
   of the driver, retained across all CHAIN_LENGTH chains.
   Steady-state retention was therefore the all-time high-water
   mark * CHAIN_LENGTH.

   Trim the tail at discard: find the last node with allocated > 0
   and drop nodes after it. Nodes are appended in alloc order and
   allocRange: only advances forward, so a trailing unused node
   means the whole tail is unused and safe to drop. Interior
   unused nodes are kept (waste bounded by _blockLen per node;
   they will be reused by smaller allocs on the next cycle).
   Only trims when the chain was actually used this cycle
   (_allocated > 0) so a quiescent frame doesn't drop the chain
   and force reallocation on the next use.

   Also reset n.allocated on every node at discard. commitRanges
   walks all nodes with allocated > 0 and didModifyRange:'s them,
   so without this reset a node that was filled in cycle N but
   partially refilled in cycle N+1 would get a stale (larger)
   range committed. Semantically wrong and a bandwidth waste on
   macOS managed storage.
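
   The trim-and-reset logic can be sketched with a plain C linked
   list. The node and chain types and chain_discard below are
   hypothetical stand-ins for BufferNode, BufferChain and discard;
   the sketch frees nodes by hand where the driver drops them by
   unlinking.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct node
{
   struct node *next;
   size_t       length;    /* capacity of this node in bytes */
   size_t       allocated; /* bytes handed out this cycle */
} node;

typedef struct
{
   node  *head;
   size_t length;    /* total capacity of all nodes */
   size_t allocated; /* total bytes handed out this cycle */
} chain;

static void chain_discard(chain *c)
{
   node *n;

   /* Trim only if the chain was used this cycle. */
   if (c->head && c->allocated > 0)
   {
      node *keep = c->head;

      /* Find the last node that was touched. */
      for (n = c->head; n; n = n->next)
         if (n->allocated > 0)
            keep = n;

      /* Drop everything after it. */
      n          = keep->next;
      keep->next = NULL;
      while (n)
      {
         node *next = n->next;
         c->length -= n->length;
         free(n);
         n = next;
      }
   }

   /* Clear per-node counters so the next cycle starts clean. */
   for (n = c->head; n; n = n->next)
      n->allocated = 0;
   c->allocated = 0;
}
```

   A chain of three nodes where only the first was used shrinks to
   one node; an unused chain is left as it is.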

5. Fix bytesPerRow in TexturedView BGRA upload path

   TexturedView.updateFrame: (the menu pixel framebuffer upload,
   called from MetalMenu.updateFrame: -> set_texture_frame) was
   passing (4 * pitch) as bytesPerRow to replaceRegion: for
   BGRA8Unorm/BGRX8Unorm sources. pitch is already the source row
   stride in bytes (libretro convention, matched by the MetalMenu
   caller which computes it as RPixelFormatToBPP(format) * width,
   and matched by the adjacent else-branch).

   Multiplying by 4 told Metal to step 4x the source stride
   between rows, so row 0 read correct pixels, the next rows read
   every fourth source row, and rows from about height/4 onward
   read beyond the caller's buffer. Most likely to surface on the
   32-bit RGUI path (rgb32=true -> RPixelFormatBGRA8Unorm); the
   16-bit path (BGRA4Unorm, rgb32=false) goes through the
   conversion branch and is unaffected.
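
   To make the overrun concrete, here is a small plain-C sketch. It
   assumes the source buffer holds exactly height * pitch bytes and
   that each row read touches row_len bytes; first_oob_row is a
   hypothetical helper, not driver code.

```c
#include <stddef.h>

/* Index of the first row whose read would run past a source
 * buffer of height * pitch bytes when rows are stepped by
 * bytes_per_row. Returns height if every row stays in bounds. */
static size_t first_oob_row(size_t height, size_t pitch,
      size_t bytes_per_row, size_t row_len)
{
   size_t k;
   size_t total = height * pitch;

   for (k = 0; k < height; k++)
      if (k * bytes_per_row + row_len > total)
         return k;
   return height;
}
```

   For a 320x240 BGRA frame (pitch 1280), stepping by pitch keeps
   all 240 rows in bounds, while stepping by 4 * pitch leaves the
   buffer at row 60.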

Tested with RGUI, shaders, and a regular core.
diff --git a/gfx/drivers/metal.m b/gfx/drivers/metal.m
@@ -691,10 +691,22 @@ - (bool)captureEnabled
 
 - (bool)readBackBuffer:(uint8_t *)buffer
 {
-   size_t x, y;
-   NSUInteger dstStride, srcStride;
-   uint8_t const *src;
-   uint8_t *dst, *tmp;
+   /* Read back the viewport region BGRA -> BGR and flip vertically.
+    *
+    * We stream one row at a time from Metal into a small scratch
+    * buffer, converting as we go. Previously this allocated a
+    * full-frame BGRA copy (width * height * 4 bytes) via malloc(),
+    * which for a 4K capture is ~32 MiB of transient allocation
+    * pressure per screenshot. One row is typically a few KiB and
+    * fits comfortably on the stack (up to 16 KiB, i.e. 4096 pixels
+    * wide, here; beyond that we fall back to heap for safety). */
+   size_t y;
+   NSUInteger rowBytes, dstStride;
+   uint8_t *dst;
+   uint8_t  stackRow[16 * 1024];
+   uint8_t *row        = stackRow;
+   uint8_t *heapRow    = NULL;
+
    if (!_captureEnabled || _backBuffer == nil)
       return NO;
 
@@ -704,30 +716,36 @@ - (bool)readBackBuffer:(uint8_t *)buffer
       return NO;
    }
 
-   tmp = malloc(_backBuffer.width * _backBuffer.height * 4);
-
-   [_backBuffer getBytes:tmp
-             bytesPerRow:4 * _backBuffer.width
-              fromRegion:MTLRegionMake2D(0, 0, _backBuffer.width, _backBuffer.height)
-             mipmapLevel:0];
-
-   srcStride = _backBuffer.width * 4;
-   src       = tmp + (_viewport.y * srcStride);
+   rowBytes  = _backBuffer.width * 4;
+   if (rowBytes > sizeof(stackRow))
+   {
+      heapRow = (uint8_t *)malloc(rowBytes);
+      if (!heapRow)
+         return NO;
+      row     = heapRow;
+   }
 
    dstStride = _viewport.width * 3;
    dst       = buffer + (_viewport.height - 1) * dstStride;
 
-   for (y = 0; y < _viewport.height; y++, src += srcStride, dst -= dstStride)
+   for (y = 0; y < _viewport.height; y++, dst -= dstStride)
    {
+      size_t x;
+      [_backBuffer getBytes:row
+                bytesPerRow:rowBytes
+                 fromRegion:MTLRegionMake2D(0, (NSUInteger)_viewport.y + y,
+                                            _backBuffer.width, 1)
+                mipmapLevel:0];
+
       for (x = 0; x < _viewport.width; x++)
       {
-         dst[3 * x + 0] = src[4 * (_viewport.x + x) + 0];
-         dst[3 * x + 1] = src[4 * (_viewport.x + x) + 1];
-         dst[3 * x + 2] = src[4 * (_viewport.x + x) + 2];
+         dst[3 * x + 0] = row[4 * (_viewport.x + x) + 0];
+         dst[3 * x + 1] = row[4 * (_viewport.x + x) + 1];
+         dst[3 * x + 2] = row[4 * (_viewport.x + x) + 2];
       }
    }
 
-   free(tmp);
+   free(heapRow);
 
    return YES;
 }
@@ -962,6 +980,48 @@ - (void)commitRanges
 
 - (void)discard
 {
+   /* Trim the tail: trailing nodes that weren't touched during
+    * this chain's previous use (allocated == 0) are dropped. Nodes are
+    * appended in alloc order, so once we see the first trailing
+    * unused node the whole tail is unused. We only trim when the
+    * chain was actually used (_allocated > 0) so that a single
+    * quiescent frame doesn't drop the entire chain and force
+    * reallocation on the next use.
+    *
+    * This bounds retained memory to the recent high-water mark
+    * rather than the all-time high-water mark, which previously
+    * grew monotonically: any one-off large allocation (e.g. a
+    * heavy shader pass or a brief geometry spike) kept its
+    * oversized backing node alive for the lifetime of the driver,
+    * across all CHAIN_LENGTH chains. */
+   if (_head && _allocated > 0)
+   {
+      BufferNode *keep = _head;
+      BufferNode *n;
+      for (n = _head; n != nil; n = n.next)
+      {
+         if (n.allocated > 0)
+            keep = n;
+      }
+      if (keep.next)
+      {
+         NSUInteger dropped = 0;
+         for (n = keep.next; n != nil; n = n.next)
+            dropped += n.src.length;
+         _length -= dropped;
+         keep.next = nil;
+      }
+   }
+
+   /* Reset per-node allocated so commitRanges on the next use of
+    * this chain does not didModifyRange: a stale range from this
+    * cycle into a node that gets partially refilled. */
+   {
+      BufferNode *n;
+      for (n = _head; n != nil; n = n.next)
+         n.allocated = 0;
+   }
+
    _current   = _head;
    _offset    = 0;
    _allocated = 0;
@@ -1471,17 +1531,23 @@ - (void)drawWithEncoder:(id<MTLRenderCommandEncoder>)rce
 
 - (void)updateFrame:(void const *)src pitch:(NSUInteger)pitch
 {
+   /* pitch is the source row stride in bytes (libretro convention,
+    * matched by the MetalMenu caller which passes BPP * width).
+    * Pass it straight through to Metal: multiplying by 4 here made
+    * Metal step 4x the source stride between rows, so it sampled
+    * every fourth source row and read past the source allocation
+    * from about row height/4 on. The else-branch passes pitch as-is. */
    if (_format == RPixelFormatBGRA8Unorm || _format == RPixelFormatBGRX8Unorm)
    {
       [_texture replaceRegion:MTLRegionMake2D(0, 0, (NSUInteger)_size.width, (NSUInteger)_size.height)
                   mipmapLevel:0 withBytes:src
-                  bytesPerRow:(NSUInteger)(4 * pitch)];
+                  bytesPerRow:pitch];
    }
    else
    {
       [_src replaceRegion:MTLRegionMake2D(0, 0, (NSUInteger)_size.width, (NSUInteger)_size.height)
               mipmapLevel:0 withBytes:src
-              bytesPerRow:(NSUInteger)(pitch)];
+              bytesPerRow:pitch];
       _srcDirty = YES;
    }
 }
@@ -1650,37 +1716,37 @@ - (instancetype)initWithDriver:(MetalDriver *)driver fontPath:(const char *)font
       _uniforms.projectionMatrix = matrix_proj_ortho(0, 1, 0, 1);
       _atlas  = _font_driver->get_atlas(_font_data);
       _stride = MTL_ALIGN_BUFFER(_atlas->width);
-      if (_stride == _atlas->width)
-      {
-         _buffer = [_context.device newBufferWithBytes:_atlas->buffer
-                                                length:(NSUInteger)(_stride * _atlas->height)
-                                               options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
-
-         /* Even though newBufferWithBytes will copy the initial contents
-          * from our atlas, it doesn't seem to invalidate the buffer when
-          * doing so, causing corrupted text rendering if we hit this code
-          * path. To work around it we manually invalidate the buffer. */
-#if !defined(HAVE_COCOATOUCH)
-         [_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
-#endif
-      }
-      else
+
+      /* Allocate the buffer with newBufferWithLength: and fill it
+       * through .contents. This collapses two previous branches (fast
+       * path via newBufferWithBytes:, slow path via row memcpy loop)
+       * into one: a whole-buffer memcpy when stride equals width, a
+       * row memcpy loop otherwise. It also drops the
+       * newBufferWithBytes: workaround, which had to didModifyRange:
+       * the whole buffer manually anyway on macOS. */
+      _buffer = [_context.device newBufferWithLength:(NSUInteger)(_stride * _atlas->height)
+                                             options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
       {
          size_t i;
-         _buffer   = [_context.device newBufferWithLength:(NSUInteger)(_stride * _atlas->height)
-                                                options:PLATFORM_METAL_RESOURCE_STORAGE_MODE];
-         void *dst = _buffer.contents;
-         void *src = _atlas->buffer;
-         for (i = 0; i < _atlas->height; i++)
+         uint8_t       *dst = (uint8_t *)_buffer.contents;
+         const uint8_t *src = (const uint8_t *)_atlas->buffer;
+         if (_stride == _atlas->width)
+         {
+            memcpy(dst, src, (size_t)_stride * _atlas->height);
+         }
+         else
          {
-            memcpy(dst, src, _atlas->width);
-            dst += _stride;
-            src += _atlas->width;
+            for (i = 0; i < _atlas->height; i++)
+            {
+               memcpy(dst, src, _atlas->width);
+               dst += _stride;
+               src += _atlas->width;
+            }
          }
+      }
 #if !defined(HAVE_COCOATOUCH)
-          [_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
+      [_buffer didModifyRange:NSMakeRange(0, _buffer.length)];
 #endif
-      }
 
       MTLTextureDescriptor *td = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatR8Unorm
                                                                                     width:_atlas->width
@@ -1756,8 +1822,15 @@ - (void)updateGlyph:(const struct font_glyph *)glyph
       }
 
 #if !defined(HAVE_COCOATOUCH)
-      NSUInteger offset = glyph->atlas_offset_y;
-      NSUInteger len    = glyph->height * _stride;
+      /* didModifyRange takes a BYTE range, not a row index.
+       * Every other call site in this file (lines 958, 1664, 1681,
+       * 3082, 3550) passes bytes. Previously offset was the row
+       * index, which meant the invalidated range almost never
+       * overlapped the actually-modified rows on managed-storage
+       * devices, producing stale/garbled glyphs until the entire
+       * atlas was invalidated by some other path. */
+      NSUInteger offset = (NSUInteger)glyph->atlas_offset_y * _stride;
+      NSUInteger len    = (NSUInteger)glyph->height         * _stride;
       [_buffer didModifyRange:NSMakeRange(offset, len)];
 #endif