TileXR (eXtreme Rendezvous for Asynchronous Tile Communication) is a data-centric asynchronous communication runtime for Huawei Ascend NPUs. It moves communication control from coarse BSP-style kernel phases toward tile-level, AICore-driven rendezvous: data readiness, transport choice, and synchronization become explicit runtime state instead of a fixed all-ranks barrier.
The project currently contains a core communication library, an optional TileXR collectives library, MC2 fused collective operators, a registered-memory UDMA prototype for A5 / Ascend950 hardware, an opt-in on-card SDMA copy transport, and simulator/test infrastructure for Ascend C kernels.
TileXR is designed around three ideas from the current architecture deck:
- Tile as the unit of progress: split large BSP communication phases into smaller data tiles that can be produced, transferred, synchronized, and consumed independently.
- AICore-driven asynchronous rendezvous: let device code observe data readiness and runtime state, then advance communication without repeatedly returning to host scheduling.
- Dynamic communication semantics: choose among IPC/MTE, direct-drive UDMA/RDMA-style paths, notify/data-as-flag synchronization, and future offload paths according to data size, link state, peer readiness, and resource pressure.
The current codebase implements the base communication runtime, flag-based synchronization, MC2 examples, and an A5 UDMA registered-memory path. Broader dynamic scheduling, CMO best-effort scheduling, and CCU offload are design targets and should be treated as roadmap unless a specific implementation file says otherwise.
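As a rough illustration of the "tile as the unit of progress" idea, the sketch below splits one large transfer into independently trackable tiles. The `Tile` struct and `SplitIntoTiles` function are hypothetical names introduced for this example; they are not part of the TileXR API, and the real runtime may choose tile boundaries differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one tile of a larger communication buffer.
struct Tile {
    std::size_t offset;  // byte offset of the tile within the buffer
    std::size_t bytes;   // tile length; the last tile may be shorter
};

// Split `totalBytes` into consecutive tiles of at most `tileBytes` each.
// Each tile can then be produced, transferred, and consumed on its own.
std::vector<Tile> SplitIntoTiles(std::size_t totalBytes, std::size_t tileBytes) {
    assert(tileBytes > 0 && "tile size must be non-zero");
    std::vector<Tile> tiles;
    for (std::size_t off = 0; off < totalBytes; off += tileBytes) {
        tiles.push_back({off, std::min(tileBytes, totalBytes - off)});
    }
    return tiles;
}
```

With a 10-byte buffer and 4-byte tiles this yields three tiles, the last one covering the 2-byte remainder.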
- Core communication runtime: `libtile-comm.so` initializes ranks, shared buffers, peer memory mappings, socket exchange, device `CommArgs`, and DFX state. It builds only against CANN runtime/ACL/driver APIs and TileXR-owned types; it does not include or link hcomm, HCCL, shmem, or ops-transformer.
- Optional TileXR collectives: `libtilexr-collectives.so`, built only when `TILEXR_BUILD_COLLECTIVES=ON`, layers standalone `TileXRAllGather` and equal-size `TileXRAllToAll` APIs on top of `libtile-comm.so`.
- Tile-level synchronization: device-side flag regions and magic values support reusable fine-grained synchronization rounds.
- MC2 fused operators: AllGather+Add and AllGather+MatMul examples under `src/mc2/`.
- Registered-memory UDMA path: host code registers ordinary `aclrtMalloc` device memory with `TileXRUDMARegister`; device kernels use `tilexr_udma.h` wrappers for put/get/signal.
- On-card SDMA transport: an opt-in (`TILEXR_ENABLE_SDMA=1`) local GM-to-GM copy path. Host code queries it with `TileXRSDMAAvailable` / `TileXRGetSDMAWorkspaceDev`; device kernels use `tilexr_sdma.h` (`SDMACopyNbi`, `SDMAWait`). It is separate from UDMA: SDMA is local to one device, whereas UDMA targets registered remote memory.
- Operator simulator: `op-simulator/` supports functional/performance simulation for selected AICore kernels without physical hardware.
- OS: Ubuntu 20.04 LTS
- User: root access is typically required for NPU device operations
- NPU driver: 25.5.0 or later; check with `npu-smi info`
- CANN: current build scripts and CMake are aligned to CANN 9.1.0
- Core supported chips: Ascend 910B, 910A5, 310P3
- UDMA runtime validation target: A5 / Ascend950 / 950 only
UDMA builds or smoke tests on 910B, 310P, or other non-A5 devices are not valid UDMA data-plane validation.
```bash
apt install -y build-essential git git-lfs rdma-core kmod net-tools \
    libssl-dev libz-dev libeigen3-dev python3 python3-pip
```

```bash
git clone --recursive https://gitcode.com/LingquLab/TileXR.git
cd TileXR
```

If the repository was cloned without submodules:

```bash
git submodule update --init --recursive
```

```bash
source scripts/common_env.sh
```

`scripts/common_env.sh` sets `TILEXR_HOME`, `TILEXR_CANN_HOME`, `TILEXR_TEMP_HOME`, architecture, SOC name, and CANN paths.
For first-time setup of local utilities and optional operator dependencies:
```bash
bash scripts/prepare.sh
```

For the full optional MC2/operator stack, also build hcomm and ops-transformer:

```bash
bash scripts/cann_download_install.sh
bash scripts/hcomm_build_install.sh
bash scripts/ops_build_run.sh
```

Building only `libtile-comm.so` does not require `hcomm_build_install.sh` or `ops_build_run.sh`.
```bash
source scripts/common_env.sh
cmake -S . -B build -DCMAKE_INSTALL_PREFIX="$PWD/install"
cmake --build build -j"$(nproc)"
cmake --install build
```

Expected output:

```text
install/lib*/libtile-comm.so
```
To build the optional TileXR collectives library and its tests/tools:
```bash
source scripts/common_env.sh
cmake -S . -B build-collectives \
    -DTILEXR_BUILD_COLLECTIVES=ON \
    -DTILEXR_BUILD_TESTS=ON \
    -DBUILD_TESTING=OFF \
    -DCMAKE_INSTALL_PREFIX="$PWD/install"
cmake --build build-collectives -j"$(nproc)"
cmake --install build-collectives
```

Additional expected output:

```text
install/lib*/libtilexr-collectives.so
install/include/tilexr_collectives.h
```
```bash
bash scripts/test_build.sh
bash scripts/test_allreduce.sh
bash scripts/plog_grep.sh ERROR
```

```text
TileXR/
|-- src/
|   |-- comm/                  # Core communication runtime
|   |   |-- udma/              # TileXR-owned HCCP/RA UDMA transport
|   |   `-- sdma/              # On-card PTO SDMA local copy transport
|   |-- collectives/           # Optional TileXR collectives library
|   |-- include/               # Public C/C++ and device headers
|   `-- mc2/                   # Fused collective operators
|       |-- all_gather_add/
|       |-- all_gather_matmul/
|       `-- common/
|-- op-simulator/              # Ascend C kernel simulation
|-- tests/                     # Host, communication, integration, and UDMA tests
|   |-- collectives/           # Collectives source/unit checks and manual runners
|   |-- comm/
|   |-- udma/
|   `-- sdma/                  # SDMA unit tests, integration test, and data-plane demo
|-- scripts/                   # Build, setup, test, and utility scripts
|-- 3rdparty/                  # spdlog plus optional hcomm, ops-transformer, shmem
`-- docs/                      # Design, migration, and validation notes
```
`src/comm/` builds `libtile-comm.so` and exposes the public API in `src/include/tilexr_api.h`. This library is intentionally independent of hcomm, HCCL, shmem, and ops-transformer. It uses CANN runtime/ACL/driver APIs plus TileXR-owned communication metadata and datatypes.
Important host-side entry points, grouped by role:
- Lifecycle: `TileXRGetUniqueId`, `TileXRCommInitRankLocal`, `TileXRCommInitRank`, `TileXRCommInitRankWithDomain`, `TileXRCommDestroy`.
- CommArgs access: `TileXRGetCommArgsHost` (host view), `TileXRGetCommArgsDev` (device pointer for kernels).
- Synchronization rounds: `TileXRCommNextMagic` hands out a fresh magic value so callers can reuse flag memory across rounds; the optional collectives library uses it to schedule per-launch synchronization.
The runtime allocates shared IPC buffers, exchanges peer mappings, uploads `CommArgs` to device memory, and records topology/capability flags in `CommArgs::extraFlag`.
`src/include/tilexr_sync.h` provides device-side flag synchronization. Flags use magic values so multiple rounds can reuse the same flag memory without a full reset.
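The magic-value scheme can be modeled on the host with a plain atomic. This is a minimal sketch, not the device code in `tilexr_sync.h`: `FlagSlot`, `SignalFlag`, `FlagReady`, and `MagicCounter` are hypothetical names, and the equality comparison is an assumption about how a round is matched.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model of one flag slot; real TileXR flags live in device memory.
struct FlagSlot {
    std::atomic<std::uint64_t> value{0};
};

// Producer side: publish that the data for round `magic` is ready.
inline void SignalFlag(FlagSlot& slot, std::uint64_t magic) {
    slot.value.store(magic, std::memory_order_release);
}

// Consumer side: non-blocking check whether round `magic` has been signaled.
inline bool FlagReady(const FlagSlot& slot, std::uint64_t magic) {
    return slot.value.load(std::memory_order_acquire) == magic;
}

// Hypothetical stand-in for a per-communicator magic counter.
struct MagicCounter {
    std::uint64_t next = 1;  // 0 is reserved to mean "never signaled"
    std::uint64_t Next() { return next++; }
};
```

Because each round waits for its own magic value, a stale value left over from the previous round never looks "ready", so the slot does not have to be cleared between rounds.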
`src/collectives/` builds `libtilexr-collectives.so` when `TILEXR_BUILD_COLLECTIVES=ON`. The split is intentional:
- `libtile-comm.so` owns communicator setup, peer memory, `CommArgs`, UDMA metadata, and the infra public API in `tilexr_api.h`.
- `libtilexr-collectives.so` owns collectives host validation, launch, embedded CCE kernel registration, and the public collectives API in `tilexr_collectives.h`.
- Installing only the default core runtime does not install `tilexr_collectives.h`.
Initial collectives APIs:
- `TileXRAllGather`
- `TileXRAllToAll` for equal per-peer counts
`TileXRAllGather` supports the validated multi-rank path. Multi-rank `TileXRAllToAll` is currently enabled only when the communicator reports the supported `TOPO_910_93` topology; unsupported topologies return a parameter-check error instead of launching an invalid kernel path. Single-rank loopback is supported for both APIs.
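To make "equal per-peer counts" concrete, the following host-only reference model shows the usual AllToAll data layout. It is a sketch of the standard collective semantics on in-memory vectors, not the TileXR kernel; `ReferenceAllToAll` is a hypothetical helper, and the buffer layout used by `TileXRAllToAll` is assumed to follow the same convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (host-only) equal-size AllToAll over in-memory "ranks".
// send[r] holds nRanks chunks of `count` elements; chunk p is addressed to rank p.
// In the result, recv[r] chunk p is the chunk that rank p addressed to rank r.
std::vector<std::vector<int>> ReferenceAllToAll(
        const std::vector<std::vector<int>>& send, std::size_t count) {
    const std::size_t nRanks = send.size();
    std::vector<std::vector<int>> recv(nRanks, std::vector<int>(nRanks * count));
    for (std::size_t src = 0; src < nRanks; ++src) {
        assert(send[src].size() == nRanks * count && "equal per-peer counts required");
        for (std::size_t dst = 0; dst < nRanks; ++dst) {
            for (std::size_t i = 0; i < count; ++i) {
                recv[dst][src * count + i] = send[src][dst * count + i];
            }
        }
    }
    return recv;
}
```

A model like this can serve as the expected-value generator when checking a multi-rank run, since every rank sends and receives exactly `count` elements per peer.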
The current UDMA path is TileXR-owned:
- `TileXRComm::InitUDMA()` tries to initialize UDMA for multi-rank communicators.
- `src/comm/udma/tilexr_hccp_loader.*` dynamically loads CANN HCCP/RA runtime libraries such as `libra.so` and `libtsdclient.so`.
- `src/comm/udma/tilexr_udma_transport.*` creates contexts, queues, route metadata, and a device-side `TileXR::UDMAInfo` image.
- `TileXRUDMARegister` registers ordinary device memory and exchanges remote region metadata.
- `CommArgs::udmaInfoPtr` and `CommArgs::udmaRegistryPtr` make queue and registered-memory metadata visible to kernels.
- `src/include/tilexr_udma.h` provides `UDMAPutNbi`, `UDMAGetNbi`, `UDMAPutSignalNbi`, and `UDMAQuiet`.
If UDMA is unavailable, communicator initialization continues without setting `ExtraFlag::UDMA`. UDMA-specific registration or demo paths then report that UDMA is unavailable.
SDMA is a first-class local on-card GM-to-GM copy path, separate from UDMA. It is disabled by default and enabled with `TILEXR_ENABLE_SDMA=1`.
- `TileXRComm::InitSDMA()` owns a `TileXRSDMATransport` beside the UDMA transport. When enabled, it creates a PTO `pto::comm::sdma::SdmaWorkspaceManager`, stores its device workspace address in `CommArgs::sdmaWorkspacePtr`, and sets `ExtraFlag::SDMA`.
- Host queries: `TileXRSDMAAvailable(comm, &available)` and `TileXRGetSDMAWorkspaceDev(comm, &workspace)`. The workspace pointer is owned by `TileXRComm` and must not be freed.
- Device API: `src/include/tilexr_sdma.h` provides `TileXR::SDMACopyNbi` and `TileXR::SDMAWait`, accepting raw same-device GM pointers. It does not register memory or validate buffer ownership.
- PTO SDMA header differences across CANN 9.0.0 / 9.1.0 are isolated in `src/include/tilexr_sdma_compat.h`.
When SDMA is enabled, initialization is best-effort: if PTO SDMA headers or runtime resources are unavailable, communicator initialization continues without setting `ExtraFlag::SDMA`, and `SDMACopyNbi` returns event handle 0 while `SDMAWait` reports completion. See `docs/SDMA_TRANSPORT.md` for the full transport guide.
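Because both UDMA and SDMA are best-effort, callers generally need to branch on the capability flags before picking a copy path. The sketch below shows one possible policy under stated assumptions: the bit values in `CapabilityBits` are placeholders (the real `ExtraFlag` values are defined by TileXR), and `ChooseCopyPath` and `CopyPath::Default` are hypothetical names for "whatever base-runtime path the caller already uses".

```cpp
#include <cassert>
#include <cstdint>

// Placeholder capability bits; the real ExtraFlag values are defined by TileXR.
enum CapabilityBits : std::uint32_t {
    kCapUDMA = 1u << 0,  // assumed to mirror ExtraFlag::UDMA
    kCapSDMA = 1u << 1,  // assumed to mirror ExtraFlag::SDMA
};

enum class CopyPath { Default, UDMA, SDMA };

// Pick a copy path for one transfer.
// SDMA is only considered for same-device copies, UDMA only for remote peers;
// anything else falls back to the caller's default path.
inline CopyPath ChooseCopyPath(std::uint32_t caps, bool sameDevice) {
    if (sameDevice && (caps & kCapSDMA)) {
        return CopyPath::SDMA;
    }
    if (!sameDevice && (caps & kCapUDMA)) {
        return CopyPath::UDMA;
    }
    return CopyPath::Default;
}
```

Keeping the decision in one function makes the fallback explicit, so a communicator that came up without either flag still takes a well-defined path.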
`src/mc2/` contains fused communication+compute examples following the ops-transformer host/tiling/kernel split:
- `all_gather_add`: example AllGather plus element-wise Add, with fixed shape and rank-size constraints.
- `all_gather_matmul`: AllGather plus MatMul with aclnn API, graph integration, and tests.
- `common`: shared MC2 tiling, topology, HCCL, and matrix multiplication utilities.
| Component | Version / Source | Purpose |
|---|---|---|
| CANN | 9.1.0 | Required for libtile-comm.so and optional libtilexr-collectives.so: Ascend ACL/runtime/driver headers and libraries |
| spdlog | submodule | Header-only optional backend for TileXR logging; src/comm/tilexr_log.h falls back to direct stdout/stderr logging when unavailable |
Optional components:
| Component | Version / Source | Used by | Notes |
|---|---|---|---|
| hcomm / HCCL | submodule / CANN communication stack | MC2 fused-operator examples and HCCL tests | Not included or linked by src/comm / libtile-comm.so |
| ops-transformer | submodule | src/mc2 operator build, packaging, and run scripts | Not needed when only compiling libtile-comm.so |
| shmem | submodule, reference/optional | Historical UDMA experiments and comparison examples | Not included or linked by current src/comm |
Use the dedicated UDMA guides when validating A5 / Ascend950 / 950 hardware:
```bash
cd tests/udma
bash build.sh
./install/bin/test_tilexr_no_shmem_dependency
./install/bin/test_tilexr_udma_transport_layout
./install/bin/test_tilexr_udma_registry
```

Run data-plane demos only on A5 / Ascend950 / 950:

```bash
bash demo/run_tilexr_udma_demo.sh 0 2 16 2 0
bash demo/run_tilexr_udma_demo.sh 1 2 16 2 0
```
Build and run the SDMA unit tests against a selected CANN install, then run the data-plane demo on a device:
```bash
bash tests/sdma/build.sh /path/to/cann
bash tests/sdma/run_tests.sh /path/to/cann
bash tests/sdma/demo/run_tilexr_sdma_demo.sh /path/to/cann 0 64 4096 1048576
```

Expected demo success line:

```text
PASS TileXR SDMA copied <bytes> bytes correctly
```

The unit tests are hardware-free; the demo requires a usable driver HAL/device runtime and resolves `libascend_hal.so` from `/usr/local/Ascend/driver/lib64/driver`. See `docs/SDMA_TRANSPORT.md` for enablement, the host/device API, CANN 9.0.0 / 9.1.0 acceptance steps, and current validation status.
Configure the collectives build as shown in Quick Start §3, then run the hardware-free source and CLI smoke checks registered with CTest:
```bash
ctest --test-dir build-collectives --output-on-failure
```

These checks verify headers, the library split, scripts, docs, and tool wiring without an NPU. Physical multi-NPU runs are manual.
Manual multi-NPU correctness and performance tools live under tests/collectives/:
```bash
cd tests/collectives
TILEXR_INSTALL="$PWD/../../install"
TILEXR_LIBDIR="$(find "$TILEXR_INSTALL" -maxdepth 1 -type d -name 'lib*' | head -n 1)"

LD_LIBRARY_PATH="$TILEXR_LIBDIR:${LD_LIBRARY_PATH:-}" \
    ./run_collectives_correctness.sh 2 16 0 ../../install/bin allgather

LD_LIBRARY_PATH="$TILEXR_LIBDIR:${LD_LIBRARY_PATH:-}" \
    ./run_collective_perf.sh 2 0 ../../install/bin \
    --op allgather --min-bytes 4 --max-bytes 4096 \
    --step-factor 2 --iters 20 --warmup-iters 5 \
    --datatype int32 --check 1
```

The perf tool prints nccl-tests-style latency, algorithm bandwidth, bus bandwidth, and error counts, with optional CSV output. See `tests/collectives/README.md` for script arguments, skip behavior, timeout handling, and topology limitations.
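For readers unfamiliar with the nccl-tests metrics, the sketch below shows how algorithm bandwidth and AllGather bus bandwidth are conventionally derived from a measured time. The function names are hypothetical, and it is an assumption that the TileXR perf tool applies exactly the nccl-tests definitions; treat the tool's own README as authoritative.

```cpp
#include <cassert>
#include <cstddef>

// Algorithm bandwidth in GB/s: bytes per operation divided by elapsed time.
inline double AlgBandwidthGBs(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Bus bandwidth for AllGather as defined by nccl-tests: algbw * (n - 1) / n.
// The factor discounts the share of data that each rank already holds locally.
inline double AllGatherBusBandwidthGBs(double algBwGBs, int nRanks) {
    assert(nRanks > 0);
    return algBwGBs * static_cast<double>(nRanks - 1) / static_cast<double>(nRanks);
}
```

For example, an algorithm bandwidth of 8 GB/s on 4 ranks corresponds to a bus bandwidth of 6 GB/s under this definition.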
```bash
cd op-simulator
bash compile_and_run.sh
```

Use `op-simulator/src/base_test.cpp` and `op-simulator/test_template.cpp` as templates for new operator simulations.
```bash
source scripts/common_env.sh
bash scripts/ops_only_run.sh
bash scripts/device_connect.sh
bash scripts/watch.sh
bash scripts/plog_grep.sh "search_term"
bash scripts/driver_fix.sh
```

- scripts/README.md: script reference and workflows
- docs/BUILD_VERIFICATION.md: current build and verification checklist
- docs/UDMA_INTEGRATION_SUMMARY.md: current UDMA architecture summary
- docs/SDMA_TRANSPORT.md: on-card SDMA transport guide, enablement, and validation
- docs/SHMEM_INTEGRATION.md: shmem status and historical notes
- docs/CANN_VERSION_MIGRATION.md: CANN 9.1.0 migration notes
- tests/collectives/README.md: optional collectives correctness and performance tools
- CLAUDE.md: repository guidance for AI coding agents
Driver or device issues:
```bash
bash scripts/driver_fix.sh
npu-smi info
```

Build failures:
- Run `git submodule update --init --recursive`.
- Run `source scripts/common_env.sh` before CMake or scripts.
- Check `ASCEND_HOME_PATH`, `TILEXR_CANN_VER`, and the CANN 9.1.0 include/library layout.
- Confirm `install/lib/libtile-comm.so` links only to the expected CANN runtime/driver libraries and does not require hcomm, HCCL, shmem, or ops-transformer.
- Do not put `${ASCEND_HOME_PATH}/${ARCH}-linux/devlib` into runtime RPATH/RUNPATH. That path is a link-time fallback and may contain stub libraries such as `libascend_hal.so`; the runtime should resolve the real driver HAL from `/usr/local/Ascend/driver/lib64/driver`.
Log analysis:
```bash
bash scripts/plog_grep.sh ERROR
bash scripts/plog_grep.sh WARNING
```

Copyright (c) 2025 Huawei Technologies Co., Ltd.
This program is free software. You may redistribute it and/or modify it under the terms and conditions of CANN Open Software License Agreement Version 2.0.
See the repository license notice for details.