Skip to content

mi: Introduce asynchronous event message handling#967

Merged
igaw merged 1 commit intolinux-nvme:masterfrom
chorkin:user/chorkin/aem_work_upstream
May 7, 2025
Merged

mi: Introduce asynchronous event message handling#967
igaw merged 1 commit intolinux-nvme:masterfrom
chorkin:user/chorkin/aem_work_upstream

Conversation

@chorkin
Copy link
Copy Markdown
Contributor

@chorkin chorkin commented Mar 4, 2025

Added new functionality to mi.c and mi-mctp.c to handle AEMs. Included new example mi-mctp-ae.c for usage.

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Mar 4, 2025

@jk-ozlabs, there's something funky going on with the existing mi-mctp tests. I was hoping you might shed some light here. Also looking for guidance on how we can add new tests to handle these changes.

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Mar 4, 2025

@igaw, it seems we have build checks that have been added since the last time I upstreamed something. But they're all failing for me here. Any guidance you can provide?

@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch from ac53516 to 90ad63d Compare March 4, 2025 00:23
Comment thread src/nvme/mi.c Outdated
Comment thread src/nvme/mi.c Outdated
Comment thread src/libnvme-mi.map Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi.c
@jk-ozlabs
Copy link
Copy Markdown
Collaborator

Neat, I'll take a look!

@jk-ozlabs
Copy link
Copy Markdown
Collaborator

I think the overall concept here is sound, but we are missing a few implementation details for an overall feasibility check too. There are some style things, but we can address those once the general framework is settled. I have added some comments in a review.

I've been thinking about how we should be handling the multiple-endpoint cases, as we can only have one socket listening on incoming NVMe-MI message types. A couple of options there:

  • a shared socket for all AEM events on all endpoints (eg, on the nvme_mi_root_t), where a call to fetch AEM events on any enpoint will receive all events, and queue those that are not for the current. A subsequent call for any other endpoints would first check that queue before receiving more directly from the socket

or:

  • we add socket syscall semantics for AF_MCTP, to specify the remote MCTP EID on the listening socket, and stick with one per nvme_mi_ep_t. this would require kernel changes, but I think that's probably a common use-case that would be good to support.

I figure the latter is probably best, so will take a look at what's required for that.

Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi.h Outdated
Comment thread src/nvme/mi.h Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread examples/mi-mctp-ae.c Outdated
@jk-ozlabs
Copy link
Copy Markdown
Collaborator

(BTW, I would suggest converting to a draft while still working out details)

@igaw
Copy link
Copy Markdown
Collaborator

igaw commented Mar 4, 2025

@igaw, it seems we have build checks that have been added since the last time I upstreamed something. But they're all failing for me here. Any guidance you can provide?

The build container cleanup task eagerly removed containers.

@chorkin chorkin marked this pull request as draft March 4, 2025 16:31
@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Mar 4, 2025

Collaborator

I think the overall concept here is sound, but we are missing a few implementation details for an overall feasibility check too. There are some style things, but we can address those once the general framework is settled. I have added some comments in a review.

I've been thinking about how we should be handling the multiple-endpoint cases, as we can only have one socket listening on incoming NVMe-MI message types. A couple of options there:

  • a shared socket for all AEM events on all endpoints (eg, on the nvme_mi_root_t), where a call to fetch AEM events on any enpoint will receive all events, and queue those that are not for the current. A subsequent call for any other endpoints would first check that queue before receiving more directly from the socket

or:

  • we add socket syscall semantics for AF_MCTP, to specify the remote MCTP EID on the listening socket, and stick with one per nvme_mi_ep_t. this would require kernel changes, but I think that's probably a common use-case that would be good to support.

I figure the latter is probably best, so will take a look at what's required for that.

@jk-ozlabs , I wasn't aware of this limitation. I think the kernel changes are probably the best solution here, but the case we are talking about is probably not going to be a primary use case. I was wondering if we can come up with a good strategy here to solve this multi-endpoint AEM issue without holding up the initial PR. Some thought's I had were:

  1. Provide this as a known limitation for the current PR and work on kernel changes in background. Add a follow up to remove this limitation.
  2. Implement the shared socket for a short term solution in this PR so there will be no limitation. Follow up with kernel changes later to simplify the code and memory usage of the application. I'll see if I can prototype this if I have time.

@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch 8 times, most recently from 9213c29 to a0155f1 Compare March 4, 2025 19:35
@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Mar 4, 2025

@igaw , I've addressed most of the checkpatch issues, but I can't resolve the packed ones. When I add __packed the structures aren't actually getting packed correctly. I also see plenty of usage in the codebase in the style I'm using.

I also see kdoc failure. What's the best way for me to dig into what's going on here? Is there a way to build the documentation locally?

@jk-ozlabs , I believe the mi-mctp tests are failing due to issue I flagged for you earlier.

@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch from a0155f1 to 9025c6f Compare March 4, 2025 22:00
Comment thread src/libnvme-mi.map
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread src/nvme/mi.c
Comment thread src/libnvme-mi.map Outdated
@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch 3 times, most recently from e9bd028 to a7826d5 Compare April 24, 2025 15:54
@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch from a7826d5 to 638045b Compare April 26, 2025 00:42
@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Apr 26, 2025

Sounds like we're going to be enabled to build kernel changes so I will test this

super!

Do I just need to copy your example code as is to do it or are there other things I'll need to do?

You'll just need to:

  1. incorporate that mctp-peer-bind branch in to your kernel tree.
  2. Add the diff I included above to libnvme

for (1), easiest approach (if you already have a custom kernel) would be to grab commit CodeConstruct/linux@2e9c0d5 and onwards, and apply those to your custom kernel. Or, just build that mctp-peer-bind branch directly if you have no other changes that you need

for (2) feel free to incorporate that into your change, or I can provide as an actual commit if you prefer.

Alrighty! I was able to build our QEMU today and test with your commits on top. I did two tests:

#1) Without your changes, ran with the new peer-bind code in libnvme-mi, and verified an error is returned in the example app since this is not supported yet
#2) With your changes, did the same test. Verified no error and verified I can trigger AEs which show up in my example app.

What I'm not able to test is multiple endpoints with AEs because that's just not something this QEMU setup will allow at this point, but if you are confident in your kernel changes, or can test that bit on your end, I think we're good to go!

Cheers and have a great weekend!

@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch 2 times, most recently from ae38ff9 to dd1bd0f Compare April 26, 2025 00:51
@jk-ozlabs
Copy link
Copy Markdown
Collaborator

nice one!

The multiple-endpoints functionality (at the MCTP level) is something we have tests for, so I'm fine to cover that on our side.

[We have a facility for multiple NVMe-MI-device emulation under qemu, but the device-side doesn't support AEs at present. I'll give this a go once that is implemented though]

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented Apr 28, 2025

@jk-ozlabs are we good to complete this, then?

Copy link
Copy Markdown
Collaborator

@jk-ozlabs jk-ozlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

I have done a bit of a nit-picky review here; most is not super important.

However: we will want to make sure we have the mi.h changes correct, as they'll become very difficult to modify once merged. If nothing else take a look through those and let me know if you disagree on any points there.

Thanks for all the work on this!

Comment thread examples/mi-mctp-ae.c Outdated
Comment thread examples/mi-mctp-ae.c Outdated
Comment thread examples/mi-mctp-ae.c Outdated
Comment thread examples/mi-mctp-ae.c Outdated
Comment thread src/nvme/mi-mctp.c Outdated
Comment thread test/mi.c Outdated
Comment thread test/mi-mctp.c Outdated
Comment thread src/nvme/mi.h Outdated
Comment thread src/nvme/mi.h Outdated
Comment thread src/nvme/mi.h Outdated
@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch 3 times, most recently from 114aa22 to 945c5f6 Compare May 1, 2025 17:41
Copy link
Copy Markdown
Collaborator

@jk-ozlabs jk-ozlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API all looks good to me!

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented May 5, 2025

@igaw , this is ready to merge.

@chorkin chorkin requested a review from igaw May 6, 2025 16:14
@igaw
Copy link
Copy Markdown
Collaborator

igaw commented May 7, 2025

Oh, it's a non trivial rebase. @chorkin could you rebase it against the current master, I think I'll break a thing or two if I do it myself. Sorry.

@chorkin chorkin force-pushed the user/chorkin/aem_work_upstream branch from 945c5f6 to cfdff5d Compare May 7, 2025 15:55
@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented May 7, 2025

I need to do one final check, @igaw . Will post here when done.

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented May 7, 2025

I need to do one final check, @igaw . Will post here when done.

@igaw, rebased and ready to merge.

Added new functionality to mi.c and mi-mctp.c to handle AEMs.  Included
new example mi-mctp-ae.c for usage.  Added tests for mi-mctp.

Signed-off-by: Chuck Horkin <[email protected]>
@igaw igaw force-pushed the user/chorkin/aem_work_upstream branch from cfdff5d to e9bd17c Compare May 7, 2025 16:47
@igaw
Copy link
Copy Markdown
Collaborator

igaw commented May 7, 2025

just updated the map file, sorted the lines alphabetically.

@igaw igaw merged commit 1c5c394 into linux-nvme:master May 7, 2025
12 checks passed
@igaw
Copy link
Copy Markdown
Collaborator

igaw commented May 7, 2025

Thanks!

@chorkin chorkin deleted the user/chorkin/aem_work_upstream branch May 7, 2025 18:17
@jk-ozlabs
Copy link
Copy Markdown
Collaborator

🥳

@chorkin
Copy link
Copy Markdown
Contributor Author

chorkin commented May 8, 2025

Thanks everyone! This is such a cool new feature; I'm glad we could get it in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants