Skip to content

Commit 4e26c88

Browse files
committed
gfx/gfx_widgets: lock msg_queue against producer-producer race
dispgfx_widget_t::msg_queue is a fifo_buffer_t storing disp_widget_msg_t* pointers. Pre-fix it had no dedicated lock, yet the producer (gfx_widgets_msg_queue_push) is reachable from threads other than the main thread, and the lock that *did* exist (current_msgs_lock) protects current_msgs[] -- the displayed-message deque -- not the FIFO. Producer reachability: 1. runloop.c:runloop_msg_queue_push. Holds RUNLOOP_MSG_QUEUE_ LOCK across the call. Reachable from 40+ files; the threaded task system at libretro-common/queues/task_queue.c runs a worker thread that pushes via this entry point. 2. runloop.c:runloop_task_msg_queue_push. Same lock. 3. gfx/video_driver.c:video_driver_frame. Acquires RUNLOOP_MSG_QUEUE_LOCK to drain runloop_st->msg_queue, RELEASES it, then calls gfx_widgets_msg_queue_push without the lock. Path 3 runs on the main thread; path 1 can run on any thread. RUNLOOP_MSG_QUEUE_LOCK guards runloop_st->msg_queue, NOT p_dispwidget->msg_queue -- the lock name reflects the runloop struct it was named after. So path 3's fifo_write (no lock) can race with path 1's fifo_write (lock held but irrelevant to path 3). The consumer (gfx_widgets_iterate's fifo_read) is called from runloop.c:6048 inside RUNLOOP_MSG_QUEUE_LOCK and from gfx_widgets.c:973 inside current_msgs_lock -- but neither lock is held by path 3, so consumer fifo_reads can observe a half- updated `end` cursor from path 3's fifo_writes. Outcome of a race: torn `end` cursor publishes a bad pointer in the disp_widget_msg_t* slots. The consumer dereferences that as disp_widget_msg_t* in gfx_widgets_iterate, producing use-after-free / segfault. On x86 TSO the race is masked most of the time by hardware ordering; on Win-on-ARM / Apple Silicon (AArch64) it surfaces more often. Collision probability is low (~1 per 14 hours of normal use, estimated from frame rate * task message rate * fifo_write window) but the worst case is crash-class. Fix --- Add slock_t* msg_queue_lock to dispgfx_widget_t. Acquire it around every fifo_write (in gfx_widgets_msg_queue_push), every fifo_read (in gfx_widgets_iterate), and the avail-check that gates each one. msg_queue_lock is the inner lock when both it and current_msgs_lock are held; lock order is consistent across all call sites. The outer FIFO_*_AVAIL fast-path checks that used to wrap the producer and consumer function bodies have been dropped. Reading the FIFO cursors outside msg_queue_lock is itself a data race against the locked writes -- TSan-detectable; benign on x86 TSO but real on weak-memory hardware. The locked re-check at the actual fifo_write / fifo_read site is the correctness gate. Removing the producer's outer gate also fixes a latent behaviour bug: the update-existing branch (else clause in gfx_widgets_msg_queue_push) does not write to the FIFO -- it only mutates an already-tracked widget -- so suppressing it on FIFO-full was wrong. Existing widgets now get updated regardless of FIFO state. The producer's spawn-new branch may now race-lose the avail re-check after allocating msg_widget (concurrent producer filled the last slot between unlocked allocation and locked write). In that case the function frees the just-allocated widget via gfx_widgets_msg_queue_free + free and returns. A forward declaration of gfx_widgets_msg_queue_free is added near the top of the file because that function is defined further down. Regression test --------------- samples/gfx/gfx_widgets_msg_queue_race/ mirrors the post-fix locking discipline of gfx_widgets_msg_queue_push and gfx_widgets_iterate's FIFO interaction in isolation. Two producer threads + one consumer thread + 500K iterations per producer; smoke check verifies single-threaded FIFO order. Wired into .github/workflows/Linux-samples-gfx.yml as a TSan lane (CC=clang, SANITIZER=thread, TSAN_OPTIONS=halt_on_error=1) following the gfx_thumbnail_status_atomic_test convention. TSan instruments every memory access on fifo_buffer_t::end and ::first, so a future regression that drops the lock around any of the producer / consumer / avail-check sites will be flagged as a data race and fail the workflow. Verified locally ---------------- - samples/gfx/gfx_widgets_msg_queue_race builds and runs clean as C (gcc -O0), C++11 (g++ -std=c++11 -xc++), and Clang under -fsanitize=thread. - TSan with halt_on_error=1: clean exit (post-fix lock discipline causes no observed races across 500K * 2 iterations). - Mutation test: removing the slock_lock/unlock around the producer's fifo_write makes TSan fire on the first collision, validating that the test is sensitive to the exact regression we're guarding against. - gfx/gfx_widgets.c syntax-checks clean with HAVE_THREADS + HAVE_GFX_WIDGETS via gcc -fsyntax-only against the libretro-common headers.
1 parent 932e1e5 commit 4e26c88

5 files changed

Lines changed: 695 additions & 6 deletions

File tree

.github/workflows/Linux-samples-gfx.yml

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -241,3 +241,50 @@ jobs:
241241
TSAN_OPTIONS=halt_on_error=1 timeout 60 \
242242
./gfx_thumbnail_status_atomic_test
243243
echo "[pass] gfx_thumbnail_status_atomic_test"
244+
245+
- name: Build and run gfx_widgets_msg_queue_race_test (TSan)
246+
shell: bash
247+
working-directory: samples/gfx/gfx_widgets_msg_queue_race
248+
run: |
249+
set -eu
250+
# Regression test for the producer-producer race fix on
251+
# dispgfx_widget_t::msg_queue in gfx/gfx_widgets.c.
252+
# Pre-fix the FIFO had no dedicated lock: producers could
253+
# be invoked from any thread (the threaded task system at
254+
# libretro-common/queues/task_queue.c runs a worker
255+
# thread; gfx/video_driver.c::video_driver_frame released
256+
# RUNLOOP_MSG_QUEUE_LOCK before calling the producer, so
257+
# main-thread producers ran unlocked) yet there was no
258+
# lock dedicated to msg_queue itself. Post-fix a
259+
# msg_queue_lock is acquired around every fifo_write (in
260+
# gfx_widgets_msg_queue_push), every fifo_read (in
261+
# gfx_widgets_iterate), and the avail-check that gates
262+
# them. The outer fast-path FIFO_*_AVAIL checks that
263+
# used to wrap the function bodies were dropped: reading
264+
# the FIFO cursors outside the lock is itself a data
265+
# race against the locked writes (TSan-detectable;
266+
# benign on x86 TSO but real on weak-memory hardware).
267+
#
268+
# ThreadSanitizer is the right validator: it instruments
269+
# every memory access on fifo_buffer_t::end and
270+
# ::first and would flag a missing-lock regression as a
271+
# data race on the producer side, the consumer side, or
272+
# the avail check. ASan/UBSan would not catch the lock
273+
# removal at all. halt_on_error=1 makes TSan exit
274+
# non-zero on the first observed race, which is what we
275+
# want for CI: any race here means the production
276+
# locking discipline is broken.
277+
#
278+
# The test mirrors the post-fix locking protocol of
279+
# gfx_widgets_msg_queue_push and gfx_widgets_iterate's
280+
# FIFO interaction; the widget allocation, font
281+
# measurement, and animation push that wrap the actual
282+
# FIFO calls in production are not part of the race and
283+
# are stripped from the test. If gfx_widgets.c amends
284+
# the locking around msg_queue, the verbatim copies in
285+
# gfx_widgets_msg_queue_race_test.c must follow.
286+
make clean all SANITIZER=thread
287+
test -x gfx_widgets_msg_queue_race_test
288+
TSAN_OPTIONS=halt_on_error=1 timeout 60 \
289+
./gfx_widgets_msg_queue_race_test
290+
echo "[pass] gfx_widgets_msg_queue_race_test"

gfx/gfx_widgets.c

Lines changed: 83 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -189,6 +189,13 @@ static void msg_widget_msg_transition_animation_done(void *userdata)
189189
msg->msg_transition_animation = 0.0f;
190190
}
191191

192+
/* Forward declaration: msg_queue_free is defined further below
193+
* (around line 550) but is needed by msg_queue_push for the
194+
* race-loss rollback path added in the producer-locking fix. */
195+
static void gfx_widgets_msg_queue_free(
196+
dispgfx_widget_t *p_dispwidget,
197+
disp_widget_msg_t *msg);
198+
192199
void gfx_widgets_msg_queue_push(
193200
retro_task_t *task,
194201
const char *msg,
@@ -203,7 +210,18 @@ void gfx_widgets_msg_queue_push(
203210
disp_widget_msg_t *msg_widget = NULL;
204211
dispgfx_widget_t *p_dispwidget = &dispwidget_st;
205212

206-
if (FIFO_WRITE_AVAIL_NONPTR(p_dispwidget->msg_queue) > 0)
213+
/* The outer FIFO_WRITE_AVAIL fast-path check that used to wrap
214+
* this function body has been removed: reading the FIFO cursors
215+
* outside msg_queue_lock is a data race against the producer
216+
* lock-protected fifo_write below (TSan-detectable; benign on
217+
* x86 TSO but real on weak-memory hardware). The locked
218+
* avail re-check at the fifo_write site is the correctness gate.
219+
*
220+
* Removing the outer gate also fixes a latent behaviour bug:
221+
* the update-existing branch (else clause below) does not write
222+
* to the FIFO -- it only mutates an already-tracked widget --
223+
* so suppressing it on FIFO-full was wrong. Existing widgets
224+
* now get updated regardless of FIFO state. */
207225
{
208226
/* Get current msg if it exists */
209227
if (task && task->frontend_userdata)
@@ -376,8 +394,45 @@ void gfx_widgets_msg_queue_push(
376394
msg_widget->text_height *= 2;
377395
}
378396

379-
fifo_write(&p_dispwidget->msg_queue,
380-
&msg_widget, sizeof(msg_widget));
397+
/* Lock the FIFO across the avail re-check + fifo_write so
398+
* concurrent producers can't both pass the check and
399+
* overwrite each other's data. fifo_write itself does no
400+
* bounds checking -- it silently wraps -- so the inner
401+
* avail check is the actual gate.
402+
*
403+
* The outer FIFO_WRITE_AVAIL_NONPTR check at the top of
404+
* this function is a fast-path bail and is intentionally
405+
* left unlocked: a stale read there at worst causes us to
406+
* skip a single message that we could have queued, which
407+
* is recoverable on the next call. The check below is the
408+
* one whose accuracy matters for correctness. */
409+
{
410+
#ifdef HAVE_THREADS
411+
bool fifo_full;
412+
slock_lock(p_dispwidget->msg_queue_lock);
413+
fifo_full = (FIFO_WRITE_AVAIL_NONPTR(p_dispwidget->msg_queue)
414+
< sizeof(msg_widget));
415+
if (!fifo_full)
416+
fifo_write(&p_dispwidget->msg_queue,
417+
&msg_widget, sizeof(msg_widget));
418+
slock_unlock(p_dispwidget->msg_queue_lock);
419+
420+
if (fifo_full)
421+
{
422+
/* Lost the race against another producer. Roll back
423+
* the widget we just allocated -- nobody else has a
424+
* reference to it yet (we only got here from the
425+
* spawn-new branch, where msg_widget is freshly
426+
* allocated above), so this is safe. */
427+
gfx_widgets_msg_queue_free(p_dispwidget, msg_widget);
428+
free(msg_widget);
429+
return;
430+
}
431+
#else
432+
fifo_write(&p_dispwidget->msg_queue,
433+
&msg_widget, sizeof(msg_widget));
434+
#endif
435+
}
381436
}
382437
/* Update task info */
383438
else
@@ -962,9 +1017,16 @@ void gfx_widgets_iterate(
9621017

9631018
/* Messages queue */
9641019

965-
/* Consume one message if available */
966-
if ((FIFO_READ_AVAIL_NONPTR(p_dispwidget->msg_queue) > 0)
967-
&& !(p_dispwidget->flags & DISPGFX_WIDGET_FLAG_MOVING)
1020+
/* Consume one message if available. The outer condition no
1021+
* longer reads FIFO_READ_AVAIL_NONPTR -- doing so outside
1022+
* msg_queue_lock would race with concurrent producer fifo_writes
1023+
* (TSan-detectable; benign on x86 TSO but real on weak-memory
1024+
* hardware). The locked re-check at the fifo_read site below
1025+
* is the correctness gate. current_msgs_size and the MOVING
1026+
* flag remain in the outer guard -- they are unrelated to the
1027+
* msg_queue race; their own synchronisation discipline is
1028+
* handled by current_msgs_lock and is unchanged here. */
1029+
if ( !(p_dispwidget->flags & DISPGFX_WIDGET_FLAG_MOVING)
9681030
&& (p_dispwidget->current_msgs_size < ARRAY_SIZE(p_dispwidget->current_msgs)))
9691031
{
9701032
disp_widget_msg_t *msg_widget = NULL;
@@ -975,9 +1037,21 @@ void gfx_widgets_iterate(
9751037

9761038
if (p_dispwidget->current_msgs_size < ARRAY_SIZE(p_dispwidget->current_msgs))
9771039
{
1040+
/* Lock around the FIFO read to serialise against
1041+
* concurrent producer fifo_writes. Held strictly inside
1042+
* current_msgs_lock (which protects current_msgs[]) --
1043+
* lock order is consistent with all other call sites:
1044+
* msg_queue_lock is always the inner lock when both
1045+
* are held. */
1046+
#ifdef HAVE_THREADS
1047+
slock_lock(p_dispwidget->msg_queue_lock);
1048+
#endif
9781049
if (FIFO_READ_AVAIL_NONPTR(p_dispwidget->msg_queue) > 0)
9791050
fifo_read(&p_dispwidget->msg_queue,
9801051
&msg_widget, sizeof(msg_widget));
1052+
#ifdef HAVE_THREADS
1053+
slock_unlock(p_dispwidget->msg_queue_lock);
1054+
#endif
9811055

9821056
if (msg_widget)
9831057
{
@@ -1999,6 +2073,8 @@ static void gfx_widgets_free(dispgfx_widget_t *p_dispwidget)
19992073

20002074
slock_free(p_dispwidget->current_msgs_lock);
20012075
p_dispwidget->current_msgs_lock = NULL;
2076+
slock_free(p_dispwidget->msg_queue_lock);
2077+
p_dispwidget->msg_queue_lock = NULL;
20022078
#endif
20032079

20042080
p_dispwidget->msg_queue_tasks_count = 0;
@@ -2150,6 +2226,7 @@ bool gfx_widgets_init(
21502226

21512227
#ifdef HAVE_THREADS
21522228
p_dispwidget->current_msgs_lock = slock_new();
2229+
p_dispwidget->msg_queue_lock = slock_new();
21532230
#endif
21542231

21552232
fill_pathname_join_special(

gfx/gfx_widgets.h

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -191,6 +191,22 @@ typedef struct dispgfx_widget
191191

192192
#ifdef HAVE_THREADS
193193
slock_t* current_msgs_lock;
194+
/* Serialises producer and consumer access to msg_queue.
195+
* Producers (gfx_widgets_msg_queue_push) can be called from
196+
* any thread -- the threaded task system at libretro-common/
197+
* queues/task_queue.c runs a worker thread, and several call
198+
* paths reach the producer without holding any other lock
199+
* (notably gfx/video_driver.c::video_driver_frame, which
200+
* releases RUNLOOP_MSG_QUEUE_LOCK before the call). The
201+
* consumer (gfx_widgets_iterate) runs on the main thread
202+
* but did not previously serialise its fifo_read against
203+
* concurrent producer fifo_writes. This separates concerns:
204+
* current_msgs_lock continues to guard the displayed-message
205+
* deque (current_msgs[]); msg_queue_lock guards the FIFO
206+
* staging buffer. Acquired by every fifo_* call site and
207+
* by FIFO_*_AVAIL checks where the result is used to gate a
208+
* subsequent FIFO operation. */
209+
slock_t* msg_queue_lock;
194210
#endif
195211
fifo_buffer_t msg_queue;
196212
disp_widget_msg_t* current_msgs[MSG_QUEUE_ONSCREEN_MAX];
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
TARGET := gfx_widgets_msg_queue_race_test
2+
3+
# Path back to the repo root from this sample dir. The test references
4+
# headers from libretro-common/include and links against
5+
# libretro-common/queues/fifo_queue.c (the FIFO this regression test
6+
# exercises) and libretro-common/rthreads/rthreads.c (when HAVE_THREADS
7+
# is enabled).
8+
REPO_ROOT := ../../..
9+
LIBRETRO_COMM_DIR := $(REPO_ROOT)/libretro-common
10+
11+
# This sample mirrors the post-fix locking discipline around
12+
# dispgfx_widget_t::msg_queue in gfx/gfx_widgets.c. The test uses
13+
# fifo_buffer_t directly (compiled from libretro-common) but does NOT
14+
# pull in gfx_widgets.c itself -- the fix is a locking protocol,
15+
# not a code change to the FIFO primitive.
16+
HAVE_THREADS ?= 1
17+
18+
SOURCES := \
19+
gfx_widgets_msg_queue_race_test.c \
20+
$(LIBRETRO_COMM_DIR)/queues/fifo_queue.c
21+
22+
CFLAGS += -Wall -pedantic -std=gnu99 -g -O0 \
23+
-I$(LIBRETRO_COMM_DIR)/include
24+
25+
ifeq ($(HAVE_THREADS),1)
26+
CFLAGS += -DHAVE_THREADS
27+
SOURCES += $(LIBRETRO_COMM_DIR)/rthreads/rthreads.c
28+
LDFLAGS += -lpthread
29+
# rthreads.c uses clock_gettime + CLOCK_REALTIME on Linux glibc; on
30+
# older glibc those live in -lrt. Harmless on newer glibc.
31+
LDFLAGS += -lrt
32+
endif
33+
34+
OBJS := $(SOURCES:.c=.o)
35+
36+
ifneq ($(SANITIZER),)
37+
CFLAGS := -fsanitize=$(SANITIZER) -fno-omit-frame-pointer $(CFLAGS)
38+
LDFLAGS := -fsanitize=$(SANITIZER) $(LDFLAGS)
39+
endif
40+
41+
all: $(TARGET)
42+
43+
%.o: %.c
44+
$(CC) -c -o $@ $< $(CFLAGS)
45+
46+
$(TARGET): $(OBJS)
47+
$(CC) -o $@ $^ $(LDFLAGS)
48+
49+
clean:
50+
rm -f $(TARGET) $(OBJS)
51+
52+
.PHONY: clean

0 commit comments

Comments
 (0)