gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle by spraveenio · Pull Request #67 · ROCm/gpu-agent

spraveenio · 2026-05-28T13:19:18Z

Summary

Introduces smi_session RAII class that wraps amdsmi_init/amdsmi_shut_down with a recursive_mutex + depth counter for safe nested use on the same thread
Adds AGA_SMI_SESSION_GUARD() macro to all public smi_api.cc entry points — opens a per-request session in lazy mode, no-op in persistent mode
AGA_SMI_LAZY_INIT=1 env var selects on-demand init/shutdown (releases /dev/gim-smi0 between calls); unset keeps the existing persistent behavior
Startup log always prints which mode is active
gimamdsmi/smi_state.cc gets a proper teardown() matching the amdsmi backend — stops watcher before amdsmi_shut_down() in persistent mode, no-op in lazy mode
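The nested-use behavior described in the summary can be pictured with a minimal sketch. It is not the PR's code: the `demo_*` names are hypothetical, the counters stand in for `amdsmi_init`/`amdsmi_shut_down`, and treating only the exact value `1` as "lazy" is an assumption about how the env var is parsed.

```cpp
#include <cstdlib>
#include <cstring>
#include <mutex>

// Hypothetical counters standing in for amdsmi_init()/amdsmi_shut_down().
static int g_demo_init_calls = 0;
static int g_demo_shutdown_calls = 0;

// Assumption: lazy mode is on only when AGA_SMI_LAZY_INIT is exactly "1".
bool demo_lazy_mode_from_env() {
    const char *value = std::getenv("AGA_SMI_LAZY_INIT");
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// RAII guard: a no-op in persistent mode, a depth-counted session in lazy mode.
class demo_nested_session {
public:
    explicit demo_nested_session(bool lazy) : lazy_(lazy) {
        if (!lazy_) {
            return;  // persistent mode: nothing to open
        }
        session_mutex().lock();  // recursive, so the owning thread may re-enter
        if (depth()++ == 0) {
            ++g_demo_init_calls;  // outermost session: stands in for amdsmi_init()
        }
    }

    ~demo_nested_session() {
        if (!lazy_) {
            return;
        }
        if (--depth() == 0) {
            ++g_demo_shutdown_calls;  // last session closed: amdsmi_shut_down()
        }
        session_mutex().unlock();
    }

    demo_nested_session(const demo_nested_session &) = delete;
    demo_nested_session &operator=(const demo_nested_session &) = delete;

private:
    static std::recursive_mutex &session_mutex() {
        static std::recursive_mutex m;
        return m;
    }
    static int &depth() {
        static int d = 0;
        return d;
    }
    const bool lazy_;
};

// Hypothetical entry point whose helper opens a second session on the same
// thread; returns how many sessions are open while both guards are alive.
int demo_open_sessions_during_nested_call(bool lazy) {
    demo_nested_session outer(lazy);
    demo_nested_session inner(lazy);
    return g_demo_init_calls - g_demo_shutdown_calls;
}
```

In this sketch the recursive mutex is held for the whole session, so sessions on different threads are serialized; only re-entry on the same thread is cheap.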

Test results

| Scenario | Before | After |
|---|---|---|
| `AGA_SMI_LAZY_INIT=1`, correct devices mounted | gpuagent stuck on futex, sock never created, logs 0 bytes | Daemon starts, sock created, exporter connects |
| Persistent mode (default) | Unchanged | Unchanged |

Files changed

Only gimamdsmi/ backend files and shared headers are touched — amdsmi, rocmsmi, main.cc, init.cc, and all common SMI plumbing are unmodified.

gimamdsmi/smi_session.hpp / smi_session.cc — new RAII session class
gimamdsmi/smi_api.cc — AGA_SMI_SESSION_GUARD() on all public entry points
gimamdsmi/smi_state.cc — lazy init logic, startup mode log, teardown() implementation
gimamdsmi/amd_smi_mock_impl.cc — init/shutdown counters for test assertions
smi_state.hpp — lazy_init_ member + accessor, all existing fields preserved
smi_api_mock_impl.hpp — counter APIs for test assertions

🤖 Generated with Claude Code

rsrikanth86 · 2026-06-02T19:27:06Z

 namespace aga {

+uint32_t
+smi_mock_get_init_count (void)


Are these needed? I don't see them used

rsrikanth86 · 2026-06-02T19:36:02Z

+}
+
+sdk_ret_t
+smi_session::processor_handles (


doesn't seem to be used

rsrikanth86 · 2026-06-02T19:38:11Z

You will need to make sure the new gpu handles are used here

rsrikanth86 · 2026-06-02T19:39:03Z

+
+    // use an smi_session for the one-shot discovery regardless of mode;
+    // in persistent mode we immediately re-init after the session ends
+    {


why do this regardless? Can't this just be the else case of the if (!lazy_init_) check below?
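One way to read this suggestion is as a plain if/else on the mode flag: persistent mode does init and discovery and stays initialized, while lazy mode does the same and then shuts down. The sketch below only records the call order; `demo_state_init` and the `demo_backend_*` functions are hypothetical stand-ins, not the functions in smi_state.cc.

```cpp
#include <string>
#include <vector>

// Records the order of the hypothetical backend calls.
static std::vector<std::string> g_demo_trace;

static void demo_backend_init() { g_demo_trace.push_back("init"); }
static void demo_backend_discover() { g_demo_trace.push_back("discover"); }
static void demo_backend_shutdown() { g_demo_trace.push_back("shutdown"); }

// Startup sequence split by mode, with no session opened "regardless".
void demo_state_init(bool lazy_init) {
    if (!lazy_init) {
        // Persistent mode: initialize once and keep the library open.
        demo_backend_init();
        demo_backend_discover();
    } else {
        // Lazy mode: one-shot discovery, then release the device until
        // the first request opens its own session.
        demo_backend_init();
        demo_backend_discover();
        demo_backend_shutdown();
    }
}
```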

rsrikanth86 · 2026-06-03T00:00:36Z

 sdk_ret_t
-smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle, aga_gpu_spec_t *spec)
+smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle,
+                   const aga_obj_key_t *uuid,


Rename uuid -> gpu_key and remove the (void)uuid calls.

Make this change for all the functions in this file.

rsrikanth86 · 2026-06-03T00:10:34Z

                    aga_gpu_handle_t first_partition_handle,
                    aga_gpu_stats_t *stats)
 {
+    AGA_SMI_SESSION_GUARD(uuid, gpu_handle_in);


We can move this to after all the variable declarations, with an empty line in between.

Follow this for all functions in this file.

rsrikanth86 · 2026-06-03T00:13:11Z

 sdk_ret_t
 smi_discover_gpus (uint32_t *num_gpu, aga_gpu_profile_t *gpu)
 {
+    // discovery has no UUID to refresh against; open the session manually


Again, move this to after the variable declarations, with an empty line in between.

rsrikanth86 · 2026-06-03T00:15:57Z

+    : ok_(false),
+      amdsmi_ret_(AMDSMI_STATUS_SUCCESS), ret_(SDK_RET_OK)
+{
+    std::lock_guard<std::mutex> lk(init_mutex_());


There will still be a case where multiple threads have called amdsmi_init without calling amdsmi_shut_down. This might be an issue.

Moves the gim amdsmi backend from a persistent (init-once-at-startup, never-shut-down) model to an optional per-request session model that releases /dev/gim-smi0 between calls, freeing the device for other processes (e.g. the GIM daemon). Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing persistent behavior is preserved. The active mode is logged at startup.

Concurrency

smi_session uses std::mutex + refcount. The lock is held ONLY around refcount transitions and the amdsmi_init / amdsmi_shut_down calls themselves. While sessions are open, amdsmi API calls run UNLOCKED so concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the watcher thread all run in parallel on the same refcount-gated init. This is the same shared-init concurrency model used by persistent mode today.

Invariants (under init_mutex_):

refcount_() == 0 <=> amdsmi is NOT initialized
refcount_() > 0 <=> amdsmi IS initialized; exactly ONE successful amdsmi_init has occurred without a matching amdsmi_shut_down
amdsmi_init runs ONLY on the 0 -> 1 transition
amdsmi_shut_down runs ONLY on the 1 -> 0 transition

Handle lifecycle

amdsmi handles can change across init/shutdown cycles, so cached handles are unsafe in lazy mode. Public smi_api entry points take a new const aga_obj_key_t *gpu_key trailing parameter; the gim AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and re-resolves a local gpu_handle via amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the per-request session. The watcher walks gpu_db() per tick and resolves each entry's handle the same way, with no cached UUID/handle state. The amdsmi backend takes the gpu_key for signature parity and ignores it (handles are stable there).

smi_state::init() is restructured into a clean if/else: persistent mode does init+discover only; lazy mode does init+discover+shutdown. UUID->string conversion reuses the canonical aga_obj_key_t::str() helper from api/include/base.hpp; no new helper is added.
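The refcount invariants in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in smi_session.cc: `demo_refcounted_session` and the counters are hypothetical, the counters stand in for `amdsmi_init`/`amdsmi_shut_down`, and init failure handling is left out.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical counters standing in for amdsmi_init()/amdsmi_shut_down().
static int g_rc_init_calls = 0;
static int g_rc_shutdown_calls = 0;

class demo_refcounted_session {
public:
    demo_refcounted_session() {
        // The lock covers only the refcount transition and the init call.
        std::lock_guard<std::mutex> lk(init_mutex());
        if (refcount() == 0) {
            ++g_rc_init_calls;  // 0 -> 1: the only place "init" runs
        }
        ++refcount();
    }

    ~demo_refcounted_session() {
        std::lock_guard<std::mutex> lk(init_mutex());
        if (--refcount() == 0) {
            ++g_rc_shutdown_calls;  // 1 -> 0: the only place "shutdown" runs
        }
    }

    demo_refcounted_session(const demo_refcounted_session &) = delete;
    demo_refcounted_session &operator=(const demo_refcounted_session &) = delete;

private:
    static std::mutex &init_mutex() {
        static std::mutex m;
        return m;
    }
    static int &refcount() {
        static int count = 0;
        return count;
    }
};

// Opens two overlapping (not nested) sessions and returns how many
// init and shutdown calls that caused, as {inits, shutdowns}.
std::pair<int, int> demo_overlapping_sessions() {
    const int init_before = g_rc_init_calls;
    const int shutdown_before = g_rc_shutdown_calls;
    auto a = std::make_unique<demo_refcounted_session>();  // 0 -> 1
    auto b = std::make_unique<demo_refcounted_session>();  // 1 -> 2
    a.reset();                                             // 2 -> 1
    b.reset();                                             // 1 -> 0
    return {g_rc_init_calls - init_before,
            g_rc_shutdown_calls - shutdown_before};
}
```

Between construction and destruction no lock is held, so work done inside a session is not serialized by this class; the session only guarantees that the library stays initialized for as long as any session is open.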
Test results ([email protected] device-metrics-exporter):

AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd is released between calls.
Persistent mode (default): unchanged behavior, fd held for life.
amdsmi backend untouched in behavior; new gpu_key param is unused there.

Co-Authored-By: Claude Sonnet 4 (1M context) <[email protected]>
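The handle-lifecycle point in the commit message (handles may change across init/shutdown cycles, so each request re-resolves its handle by UUID) can be illustrated with a toy model. Everything here is hypothetical: `demo_handle_from_uuid` merely stands in for `amdsmi_get_processor_handle_from_uuid`, and the "epoch" is an invented way to make handles differ between cycles.

```cpp
#include <cstdint>
#include <map>
#include <string>

using demo_handle_t = std::uintptr_t;

// Bumped on every fake init, so handles differ between init cycles.
static std::uint64_t g_demo_epoch = 0;

// Hypothetical stand-in for a UUID -> processor handle lookup.
static bool demo_handle_from_uuid(const std::string &uuid, demo_handle_t *out) {
    static const std::map<std::string, unsigned> index = {
        {"gpu-uuid-0", 0}, {"gpu-uuid-1", 1}};
    const auto it = index.find(uuid);
    if (it == index.end()) {
        return false;
    }
    *out = static_cast<demo_handle_t>(g_demo_epoch * 100 + it->second);
    return true;
}

// Per-request session: "initializes", then resolves a fresh local handle.
class demo_resolving_session {
public:
    explicit demo_resolving_session(const std::string &uuid) {
        ++g_demo_epoch;  // stands in for a new init cycle
        ok_ = demo_handle_from_uuid(uuid, &handle_);
    }
    bool ok() const { return ok_; }
    demo_handle_t handle() const { return handle_; }

private:
    bool ok_ = false;
    demo_handle_t handle_ = 0;
};

// Hypothetical entry point: takes the GPU key, never a cached handle.
bool demo_query_handle(const std::string &gpu_uuid, demo_handle_t *out) {
    demo_resolving_session session(gpu_uuid);
    if (!session.ok()) {
        return false;  // unknown UUID: fail the request
    }
    *out = session.handle();
    return true;
}
```

In this model two requests for the same UUID see different handle values, which is the situation that makes a handle cached from an earlier cycle unsafe to reuse.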

spraveenio requested review from rsrikanth86 and sarat-k May 28, 2026 13:40

rsrikanth86 reviewed Jun 1, 2026

Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated

rsrikanth86 reviewed Jun 2, 2026

spraveenio force-pushed the feature/gimamdsmi-smi-session branch from a9657c6 to 565fabb Compare June 2, 2026 22:28

rsrikanth86 reviewed Jun 3, 2026

spraveenio force-pushed the feature/gimamdsmi-smi-session branch from 565fabb to cdf7990 Compare June 3, 2026 12:49

spraveenio wants to merge 1 commit into ROCm:main from spraveenio:feature/gimamdsmi-smi-session
