Fix ProcessGroupGloo crash when setting unsupported backend options #324

Open
wanna-01 wants to merge 1 commit into meta-pytorch:main from wanna-01:fix/gloo-unsupported-backend-options

wanna-01 commented May 6, 2026

Title

Fix ProcessGroupGloo crash when setting unsupported backend options

Summary

This MR fixes a runtime crash in ProcessGroupGloo._create_pg() caused by unconditional assignment to Gloo backend options that are not present in PyTorch's ProcessGroupGloo._Options.

Why this change is needed

The current code in ProcessGroupGloo._create_pg() assigns unconditionally:

  • backend_class.options.global_ranks_in_group = ...
  • backend_class.options.group_name = ...

For Gloo, backend_class.options is torch._C._distributed_c10d._Options, which does not expose these attributes (it typically only has backend). Entering quorum on Gloo can therefore raise AttributeError and fail immediately.

The NCCL path is not affected because the NCCL Options class does provide these fields.
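The failure mode can be illustrated without PyTorch. This is a minimal sketch, assuming the Gloo options object rejects unknown attributes the way a Python class with `__slots__` does. `FakeGlooOptions`, `FakeNcclOptions`, and `configure_unconditionally` are hypothetical stand-ins for illustration, not PyTorch or torchft names.

```python
class FakeGlooOptions:
    """Hypothetical stand-in: exposes only `backend`, rejects other attributes."""

    __slots__ = ("backend",)

    def __init__(self) -> None:
        self.backend = "gloo"


class FakeNcclOptions:
    """Hypothetical stand-in: also exposes the two fields being assigned."""

    __slots__ = ("backend", "global_ranks_in_group", "group_name")

    def __init__(self) -> None:
        self.backend = "nccl"
        self.global_ranks_in_group = []
        self.group_name = ""


def configure_unconditionally(options, ranks, name) -> None:
    # Mirrors the pattern described above: both fields are set with no guard.
    options.global_ranks_in_group = ranks
    options.group_name = name


# The NCCL-like object accepts both assignments.
nccl_opts = FakeNcclOptions()
configure_unconditionally(nccl_opts, [0, 1], "example_group")

# The Gloo-like object raises on the first unsupported attribute.
gloo_error = None
try:
    configure_unconditionally(FakeGlooOptions(), [0, 1], "example_group")
except AttributeError as exc:
    gloo_error = exc
```

In this sketch the exception is raised by the first assignment, so the second one never runs.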

Root cause

The code assumes Gloo options support the same attributes as NCCL options.

What this MR changes

In torchft/process_group.py (ProcessGroupGloo._create_pg):

  • wrap assignment of global_ranks_in_group in try/except AttributeError
  • wrap assignment of group_name in try/except AttributeError

Unsupported optional fields are therefore skipped on Gloo instead of raising AttributeError.
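The guarded assignment described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this change: the description says each assignment is wrapped directly in try/except, whereas `set_if_supported` is a hypothetical helper that factors the same pattern out, and `FakeGlooOptions` is a hypothetical stand-in for an options object that only exposes `backend`.

```python
from typing import Any


class FakeGlooOptions:
    """Hypothetical stand-in: exposes only `backend`, rejects other attributes."""

    __slots__ = ("backend",)

    def __init__(self) -> None:
        self.backend = "gloo"


def set_if_supported(options: Any, name: str, value: Any) -> bool:
    """Try to set an optional field; return False if the object rejects it."""
    try:
        setattr(options, name, value)
    except AttributeError:
        # The options type does not expose this field, so skip it.
        return False
    return True


options = FakeGlooOptions()
# Each field is attempted independently, so one unsupported field
# does not prevent the other from being set.
applied = {
    name: set_if_supported(options, name, value)
    for name, value in (
        ("global_ranks_in_group", [0, 1]),
        ("group_name", "example_group"),
    )
}
```

Catching only `AttributeError` keeps other failures visible, such as a `TypeError` from a value of the wrong type.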

Behavioral impact

  • prevents Gloo quorum initialization from hard-failing with AttributeError
  • no behavior change for the NCCL path
  • keeps compatibility across PyTorch versions where the Gloo options surface differs

meta-cla Bot commented May 6, 2026

Hi @wanna-01!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

meta-cla Bot commented May 6, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla Bot added the CLA Signed label May 6, 2026