Skip to content

gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle#67

Open
spraveenio wants to merge 1 commit into
ROCm:mainfrom
spraveenio:feature/gimamdsmi-smi-session
Open

gimamdsmi: add smi_session RAII helper for on-demand amdsmi lifecycle#67
spraveenio wants to merge 1 commit into
ROCm:mainfrom
spraveenio:feature/gimamdsmi-smi-session

Conversation

@spraveenio
Copy link
Copy Markdown
Contributor

@spraveenio spraveenio commented May 28, 2026

Summary

  • Introduces smi_session RAII class that wraps amdsmi_init/amdsmi_shut_down with a recursive_mutex + depth counter for safe nested use on the same thread
  • Adds AGA_SMI_SESSION_GUARD() macro to all public smi_api.cc entry points — opens a per-request session in lazy mode, no-op in persistent mode
  • AGA_SMI_LAZY_INIT=1 env var selects on-demand init/shutdown (releases /dev/gim-smi0 between calls); unset keeps the existing persistent behavior
  • Startup log always prints which mode is active
  • gimamdsmi/smi_state.cc gets a proper teardown() matching the amdsmi backend — stops watcher before amdsmi_shut_down() in persistent mode, no-op in lazy mode

Test results

Scenario Before After
AGA_SMI_LAZY_INIT=1, correct devices mounted gpuagent stuck on futex, sock never created, logs 0 bytes Daemon starts, sock created, exporter connects
Persistent mode (default) Unchanged Unchanged

Files changed

Only gimamdsmi/ backend files and shared headers are touched — amdsmi, rocmsmi, main.cc, init.cc, and all common SMI plumbing are unmodified.

  • gimamdsmi/smi_session.hpp / smi_session.cc — new RAII session class
  • gimamdsmi/smi_api.ccAGA_SMI_SESSION_GUARD() on all public entry points
  • gimamdsmi/smi_state.cc — lazy init logic, startup mode log, teardown() implementation
  • gimamdsmi/amd_smi_mock_impl.cc — init/shutdown counters for test assertions
  • smi_state.hpplazy_init_ member + accessor, all existing fields preserved
  • smi_api_mock_impl.hpp — counter APIs for test assertions

🤖 Generated with Claude Code

@spraveenio spraveenio requested review from rsrikanth86 and sarat-k May 28, 2026 13:40
Comment thread sw/nic/gpuagent/api/smi/gimamdsmi/smi_api.cc Outdated
namespace aga {

uint32_t
smi_mock_get_init_count (void)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these needed? I don't see them used

}

sdk_ret_t
smi_session::processor_handles (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem to be used

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to make sure the new gpu handles are used here


// use an smi_session for the one-shot discovery regardless of mode;
// in persistent mode we immediately re-init after the session ends
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do this regardless? Can't this just be the else case of the if (!lazy_init_) check below?

@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch from a9657c6 to 565fabb Compare June 2, 2026 22:28
sdk_ret_t
smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle, aga_gpu_spec_t *spec)
smi_gpu_fill_spec (aga_gpu_handle_t gpu_handle,
const aga_obj_key_t *uuid,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid->gpu_key and remove the (void)uuid calls

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAke this change for all the functions in this file

aga_gpu_handle_t first_partition_handle,
aga_gpu_stats_t *stats)
{
AGA_SMI_SESSION_GUARD(uuid, gpu_handle_in);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this to after all the variable declarations with an empty line inbetween

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow this for all functions in this file

sdk_ret_t
smi_discover_gpus (uint32_t *num_gpu, aga_gpu_profile_t *gpu)
{
// discovery has no UUID to refresh against; open the session manually
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again move this to after the variable declarations with an empty line inbetween

: ok_(false),
amdsmi_ret_(AMDSMI_STATUS_SUCCESS), ret_(SDK_RET_OK)
{
std::lock_guard<std::mutex> lk(init_mutex_());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will still be a case where mutliple threads have called amdsmi_init without calling amdsmi_shutdown. This might be an issue

Moves the gim amdsmi backend from a persistent (init-once-at-startup,
never-shut-down) model to an optional per-request session model that
releases /dev/gim-smi0 between calls, freeing the device for other
processes (e.g. the GIM daemon).

Behavior is controlled by AGA_SMI_LAZY_INIT=1; when unset the existing
persistent behavior is preserved. The active mode is logged at startup.

Concurrency

  smi_session uses std::mutex + refcount. The lock is held ONLY around
  refcount transitions and the amdsmi_init / amdsmi_shut_down calls
  themselves. While sessions are open, amdsmi API calls run UNLOCKED so
  concurrent gRPC handlers (up to AGA_MAX_GRPC_THREADS = 256) and the
  watcher thread all run in parallel on the same refcount-gated init.
  This is the same shared-init concurrency model used by persistent
  mode today.

  Invariants (under init_mutex_):
    refcount_() == 0  <=>  amdsmi is NOT initialized
    refcount_() >  0  <=>  amdsmi IS initialized; exactly ONE successful
                            amdsmi_init has occurred without a matching
                            amdsmi_shut_down
    amdsmi_init runs ONLY on the 0 -> 1 transition
    amdsmi_shut_down runs ONLY on the 1 -> 0 transition

Handle lifecycle

  amdsmi handles can change across init/shutdown cycles, so cached
  handles are unsafe in lazy mode. Public smi_api entry points take a
  new const aga_obj_key_t *gpu_key trailing parameter; the gim
  AGA_SMI_SESSION_GUARD(gpu_key, handle) macro opens the session and
  re-resolves a local gpu_handle via
  amdsmi_get_processor_handle_from_uuid(gpu_key->str(), ...) inside the
  per-request session. The watcher walks gpu_db() per tick and resolves
  each entry's handle the same way - no cached UUID/handle state.
  The amdsmi backend takes the gpu_key for signature parity and ignores
  it (handles are stable there).

smi_state::init() is restructured into a clean if/else: persistent mode
does init+discover only; lazy mode does init+discover+shutdown.

UUID->string conversion reuses the canonical aga_obj_key_t::str() helper
from api/include/base.hpp - no new helper added.

Test results ([email protected] device-metrics-exporter):
  AGA_SMI_LAZY_INIT=1: daemon starts cleanly, startup log shows lazy
  mode, gpuagent.sock is created, exporter connects, /dev/gim-smi0 fd
  is released between calls.
  Persistent mode (default): unchanged behavior, fd held for life.
  amdsmi backend untouched in behavior; new gpu_key param is unused
  there.

Co-Authored-By: Claude Sonnet 4 (1M context) <[email protected]>
@spraveenio spraveenio force-pushed the feature/gimamdsmi-smi-session branch from 565fabb to cdf7990 Compare June 3, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants