gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle #67
spraveenio wants to merge 1 commit into
namespace aga {

uint32_t
smi_mock_get_init_count (void)
Are these needed? I don't see them used
}

sdk_ret_t
smi_session::processor_handles (
doesn't seem to be used
You will need to make sure the new gpu handles are used here
// use an smi_session for the one-shot discovery regardless of mode;
// in persistent mode we immediately re-init after the session ends
{
why do this regardless? Can't this just be the else case of the if (!lazy_init_) check below?
 sdk_ret_t
-smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle, aga_gpu_spec_t *spec)
+smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle,
+                   const aga_obj_key_t *uuid,
Rename uuid to gpu_key and remove the (void)uuid calls.
Make this change for all the functions in this file.
                   aga_gpu_handle_t first_partition_handle,
                   aga_gpu_stats_t *stats)
{
    AGA_SMI_SESSION_GUARD(uuid, gpu_handle_in);
We can move this to after all the variable declarations, with an empty line in between.
Follow this for all functions in this file.
sdk_ret_t
smi_discover_gpus (uint32_t *num_gpu, aga_gpu_profile_t *gpu)
{
    // discovery has no UUID to refresh against; open the session manually
Again, move this to after the variable declarations, with an empty line in between.
    : ok_(false),
      amdsmi_ret_(AMDSMI_STATUS_SUCCESS), ret_(SDK_RET_OK)
{
    std::lock_guard<std::mutex> lk(init_mutex_());
There will still be a case where multiple threads have called amdsmi_init without calling amdsmi_shut_down. This might be an issue.
Moves the gim amdsmi backend from a persistent (init-once-at-startup,
never-shut-down) model to an optional per-request session model that
releases /dev/gim-smi0 between calls, freeing the device for other
processes (e.g. the GIM daemon).
Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing
persistent behavior is preserved. The active mode is logged at startup.
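
A minimal sketch of how the mode switch described above could be read at startup. The helper names (`lazy_init_enabled`, `lazy_init_from_env`, `mode_name`) are hypothetical and not taken from this change, and treating any value other than exactly `1` as persistent mode is an assumption; the commit message only specifies the `=1` and unset cases.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: lazy mode is on only when the value is exactly "1".
// Handling of other values (e.g. "0", "true") is an assumption of this sketch.
bool lazy_init_enabled(const char *value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the environment variable named in the commit message.
bool lazy_init_from_env() {
    return lazy_init_enabled(std::getenv("AGA_SMI_LAZY_INIT"));
}

// Name suitable for the one-line startup log of the active mode.
const char *mode_name(bool lazy) {
    return lazy ? "lazy" : "persistent";
}
```

Splitting the string check from the `getenv` call keeps the decision testable without touching the process environment.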
Concurrency
smi_session uses std::mutex + refcount. The lock is held ONLY around
refcount transitions and the amdsmi_init / amdsmi_shut_down calls
themselves. While sessions are open, amdsmi API calls run UNLOCKED so
concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the
watcher thread all run in parallel on the same refcount-gated init.
This is the same shared-init concurrency model used by persistent
mode today.
Invariants (under init_mutex_):
  refcount_() == 0  <=>  amdsmi is NOT initialized
  refcount_() >  0  <=>  amdsmi IS initialized; exactly ONE successful
                         amdsmi_init has occurred without a matching
                         amdsmi_shut_down
  amdsmi_init      runs ONLY on the 0 -> 1 transition
  amdsmi_shut_down runs ONLY on the 1 -> 0 transition
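
The invariants above can be illustrated with a small refcount-gated RAII class. This is a sketch under stated assumptions, not the `smi_session` implementation from this change: `session_sketch`, `fake_smi_init` and `fake_smi_shut_down` are hypothetical names, and the two fake functions only count calls so the example builds without the real library.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for amdsmi_init / amdsmi_shut_down; they only
// count calls so the transitions can be observed.
int g_init_calls = 0;
int g_shutdown_calls = 0;
bool fake_smi_init() { ++g_init_calls; return true; }
void fake_smi_shut_down() { ++g_shutdown_calls; }

class session_sketch {
public:
    session_sketch() : ok_(false) {
        std::lock_guard<std::mutex> lk(mutex_());
        // init runs only on the 0 -> 1 transition
        if (refcount_() == 0 && !fake_smi_init()) {
            return;  // init failed: refcount stays 0, session is not ok
        }
        ++refcount_();
        ok_ = true;
    }

    ~session_sketch() {
        if (!ok_) {
            return;  // never counted, so nothing to release
        }
        std::lock_guard<std::mutex> lk(mutex_());
        assert(refcount_() > 0);
        // shutdown runs only on the 1 -> 0 transition
        if (--refcount_() == 0) {
            fake_smi_shut_down();
        }
    }

    session_sketch(const session_sketch &) = delete;
    session_sketch &operator=(const session_sketch &) = delete;

    bool ok() const { return ok_; }

    // Number of currently open sessions, read under the lock.
    static int open_sessions() {
        std::lock_guard<std::mutex> lk(mutex_());
        return refcount_();
    }

private:
    // Function-local statics mirror the init_mutex_() / refcount_() accessors.
    static std::mutex &mutex_() { static std::mutex m; return m; }
    static int &refcount_() { static int n = 0; return n; }

    bool ok_;
};
```

The lock is taken only in the constructor and destructor, so work done while a session object is alive runs unlocked, matching the concurrency description above.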
Handle lifecycle
amdsmi handles can change across init/shutdown cycles, so cached
handles are unsafe in lazy mode. Public smi_api entry points take a
new const aga_obj_key_t *gpu_key trailing parameter; the gim
AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and
re-resolves a local gpu_handle via
amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the
per-request session. The watcher walks gpu_db() per tick and resolves
each entry's handle the same way - no cached UUID/handle state.
The amdsmi backend takes the gpu_key for signature parity and ignores
it (handles are stable there).
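
To illustrate why a cached handle is unsafe across init/shutdown cycles, here is a toy model in which handles are renumbered on every re-init. Everything in it is hypothetical (`fake_handle_t`, `fake_reinit`, `fake_handle_from_uuid`, `fake_handle_is_valid`, `query_gpu`); it stands in for a UUID lookup such as `amdsmi_get_processor_handle_from_uuid` and makes no claim about how the real library assigns handles.

```cpp
#include <map>
#include <string>

// Hypothetical model: a handle encodes the init epoch it was issued in,
// so it goes stale as soon as the library is re-initialized.
using fake_handle_t = int;
int g_epoch = 0;
const std::map<std::string, int> g_uuid_to_index = {
    {"uuid-a", 0}, {"uuid-b", 1}};

// Models one shutdown + init cycle.
void fake_reinit() { ++g_epoch; }

// Stand-in for a UUID -> handle lookup inside the current session.
bool fake_handle_from_uuid(const std::string &uuid, fake_handle_t *out) {
    auto it = g_uuid_to_index.find(uuid);
    if (it == g_uuid_to_index.end()) {
        return false;  // unknown UUID
    }
    *out = g_epoch * 100 + it->second;
    return true;
}

// A handle is usable only in the epoch that issued it.
bool fake_handle_is_valid(fake_handle_t h) { return h / 100 == g_epoch; }

// Entry-point shape: take the stable key, resolve a local handle per request.
bool query_gpu(const std::string &gpu_key) {
    fake_handle_t handle;
    if (!fake_handle_from_uuid(gpu_key, &handle)) {
        return false;
    }
    return fake_handle_is_valid(handle);
}
```

In this model a handle resolved before `fake_reinit()` fails validation afterwards, while `query_gpu` keeps working because it carries only the UUID across calls.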
smi_state::init() is restructured into a clean if/else: persistent mode
does init+discover only; lazy mode does init+discover+shutdown.
UUID->string conversion reuses the canonical aga_obj_key_t::str() helper
from api/include/base.hpp - no new helper added.
Test results ([email protected] device-metrics-exporter):
- AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy
  mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd
  is released between calls.
- Persistent mode (default): unchanged behavior, fd held for life.
- amdsmi backend untouched in behavior; new gpu_key param is unused
  there.
Co-Authored-By: Claude Sonnet 4 (1M context) <[email protected]>
Summary

- Adds `smi_session` RAII class that wraps `amdsmi_init` / `amdsmi_shut_down` with a `recursive_mutex` + depth counter for safe nested use on the same thread
- Adds `AGA_SMI_SESSION_GUARD()` macro to all public `smi_api.cc` entry points: opens a per-request session in lazy mode, no-op in persistent mode
- `AGA_SMI_LAZY_INIT=1` env var selects on-demand init/shutdown (releases `/dev/gim-smi0` between calls); unset keeps the existing persistent behavior
- `gimamdsmi/smi_state.cc` gets a proper `teardown()` matching the amdsmi backend: stops watcher before `amdsmi_shut_down()` in persistent mode, no-op in lazy mode

Test results

- `AGA_SMI_LAZY_INIT=1`, correct devices mounted

Files changed

Only `gimamdsmi/` backend files and shared headers are touched; `amdsmi`, `rocmsmi`, `main.cc`, `init.cc`, and all common SMI plumbing are unmodified.

- `gimamdsmi/smi_session.hpp` / `smi_session.cc`: new RAII session class
- `gimamdsmi/smi_api.cc`: `AGA_SMI_SESSION_GUARD()` on all public entry points
- `gimamdsmi/smi_state.cc`: lazy init logic, startup mode log, `teardown()` implementation
- `gimamdsmi/amd_smi_mock_impl.cc`: init/shutdown counters for test assertions
- `smi_state.hpp`: `lazy_init_` member + accessor, all existing fields preserved
- `smi_api_mock_impl.hpp`: counter APIs for test assertions

🤖 Generated with Claude Code