PyTorch bindings for Elixir — Python-free. Load and execute serialized PyTorch models (.pt2 AOTInductor, .pt2 torch.export) directly from the BEAM. No Python process, no ports, no subprocess pool, no RPC queue, no ONNX conversion, no model rewrite. Just a library you call.
Bit-for-bit identical outputs to Python (verified across 11 tested models). Inference performance comparable to Python on the equivalent path — see Benchmarks for details.
ExTorch is a library you embed in your BEAM application to run PyTorch models you've already serialized elsewhere. It gives you the execution primitive (load + forward) and a customization surface built on normal Elixir/OTP tools, not a packaged serving product.
- Not a serving platform. There is no built-in dynamic batching, request coalescing, multi-model scheduler, or HTTP layer. Bring Phoenix, Bandit, Broadway, or roll your own around `forward/2`.
- Not a replacement for Triton or TorchServe. If you want a closed-box high-throughput inference server with batching as a config knob, those are the right tools.
- Not faster than PyTorch for apples-to-apples compiled paths. We match Python AOTI (1.03x); we don't beat it.
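Since batching is left to the application, here is a minimal sketch of what a request coalescer could look like. `MicroBatcher`, `batch_fun`, `max_batch`, and `max_wait_ms` are hypothetical names, not part of ExTorch. In a real application `batch_fun` would presumably concatenate the inputs along the batch dimension, make one forward call, and split the result. Here it is any function from a list of inputs to a list of outputs in the same order.

```elixir
defmodule MicroBatcher do
  @moduledoc "Hypothetical request coalescer; not part of ExTorch."
  use GenServer

  # opts: :batch_fun (list of inputs -> list of outputs, same order),
  #       :max_batch (flush when this many requests are queued),
  #       :max_wait_ms (flush after this long even if the batch is not full)
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def predict(server, input), do: GenServer.call(server, {:predict, input})

  @impl true
  def init(opts) do
    {:ok,
     %{
       batch_fun: Keyword.fetch!(opts, :batch_fun),
       max_batch: Keyword.get(opts, :max_batch, 8),
       max_wait_ms: Keyword.get(opts, :max_wait_ms, 5),
       pending: [],
       timer: nil
     }}
  end

  @impl true
  def handle_call({:predict, input}, from, state) do
    # Queue the caller instead of replying; the reply is sent from flush/1.
    state = %{state | pending: [{from, input} | state.pending]}

    cond do
      length(state.pending) >= state.max_batch ->
        {:noreply, flush(state)}

      state.timer == nil ->
        timer = Process.send_after(self(), :flush, state.max_wait_ms)
        {:noreply, %{state | timer: timer}}

      true ->
        {:noreply, state}
    end
  end

  @impl true
  def handle_info(:flush, state), do: {:noreply, flush(state)}

  defp flush(%{pending: []} = state), do: %{state | timer: nil}

  defp flush(state) do
    if state.timer, do: Process.cancel_timer(state.timer)
    {callers, inputs} = state.pending |> Enum.reverse() |> Enum.unzip()
    outputs = state.batch_fun.(inputs)

    callers
    |> Enum.zip(outputs)
    |> Enum.each(fn {from, out} -> GenServer.reply(from, out) end)

    %{state | pending: [], timer: nil}
  end
end
```

Callers block in `predict/2` until their batch is flushed, so the latency cost of coalescing is bounded by `max_wait_ms` plus the time of one batched call.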
Run PyTorch models from Elixir, directly. If you're already on BEAM and want to add inference, you no longer need a second runtime, an RPC hop to Python, or an ONNX conversion step. Load the .pt2 your ML team produced and call it.
Customization surface. Because it's a library, you wrap forward/2 in whatever your app already does — Phoenix controllers, Broadway pipelines, GenServers with custom supervision, telemetry handlers, feature flags, fallback to a second model on error, ensembling, preprocessing in Nx. This is the flipside of not being Triton: batching is your problem, but everything around the model is trivially customizable.
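As one illustration of that surface, falling back to a second model on error can be plain Elixir. This is a sketch under two assumptions: `FallbackSketch` is a hypothetical helper, and a failed inference call is assumed to surface as a raised exception. The two functions passed in would wrap whatever calls your application makes, for example `fn inputs -> ExTorch.Export.forward(big_model, inputs) end`.

```elixir
defmodule FallbackSketch do
  @moduledoc "Hypothetical helper; not part of ExTorch."

  # `primary` and `fallback` are 1-arity functions that run inference.
  # The returned tag records which one produced the result.
  def predict(inputs, primary, fallback) do
    {:primary, primary.(inputs)}
  rescue
    # Only exceptions are caught here; exits and throws still propagate.
    _error -> {:fallback, fallback.(inputs)}
  end
end
```

Returning the tag alongside the output makes it easy to count fallbacks in whatever telemetry the application already emits.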
Distribution is whatever you already use. ExTorch.Export.Server and ExTorch.AOTI.Server are plain GenServers. If you want multiple replicas, start several under a Supervisor. Multi-node replicas? Node.connect + :global, Horde, or Swarm. Pooling? poolboy, NimblePool, or a DynamicSupervisor. Rolling model updates with drain? Custom terminate/2. There is no ExTorch-specific clustering, routing, or pool primitive — the integration point is a process, so the BEAM tools you already use work as-is.
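A sketch of the "several replicas under a Supervisor" idea follows. To keep it runnable without a model file, `StubServer` is a hypothetical stand-in for a model server process. In a real application each child would presumably be `ExTorch.Export.Server` started with `path:`, `device:`, and `name:` options. The replica names and the random `pick/0` strategy are illustrative choices.

```elixir
defmodule ReplicaSketch do
  # Stand-in for a model server; it echoes its own name and the inputs.
  defmodule StubServer do
    use GenServer

    def start_link(opts) do
      GenServer.start_link(__MODULE__, opts, name: Keyword.fetch!(opts, :name))
    end

    def predict(server, inputs), do: GenServer.call(server, {:predict, inputs})

    @impl true
    def init(opts), do: {:ok, Keyword.fetch!(opts, :name)}

    @impl true
    def handle_call({:predict, inputs}, _from, name) do
      {:reply, {:ok, {name, inputs}}, name}
    end
  end

  @names [:replica_0, :replica_1, :replica_2]

  def start_link do
    # One child per replica; each needs a unique id within the supervisor.
    children =
      for name <- @names do
        Supervisor.child_spec({StubServer, [name: name]}, id: name)
      end

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  # Simplest possible routing: pick a replica at random.
  def pick, do: Enum.random(@names)
end
```

With `:one_for_one`, a crashed replica is restarted on its own while the others keep serving.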
Two supported formats, both from modern PyTorch export paths:
| Format | Loaded via | Notes |
|---|---|---|
| AOTInductor `.pt2` | `ExTorch.AOTI.load/2` | Fused compiled kernels; matches Python AOTI |
| torch.export `.pt2` | `ExTorch.Export.load/2` | Four execution paths: interpreter, native, compiled graph, or extract as an Elixir DSL |
Zero-copy with Nx. Share tensor memory with Nx/Torchx via raw pointer exchange. Preprocessing in Nx composes with inference in ExTorch without copying.
Extensible op set. The c10::Dispatcher bridge lets Elixir packages register new ops without C++ code. ExTorch.Vision adds torchvision ops (NMS, ROI Align, deformable conv, image I/O) this way.
- torch.export Inference -- Load `.pt2` files from `torch.export.save` and run inference through a compiled C++ graph executor (89+ ATen ops). Tested with AlexNet, ResNet, VGG, MobileNet, ViT, EfficientNet, DeepLab, DistilBERT, Whisper, LSTM, and more.
- AOTI Compiled Models -- Load AOTInductor `.pt2` packages for optimized inference with fused kernels.
- Generic c10 Dispatcher -- Call any PyTorch op by name through `dispatch_op/3`. Load external op libraries (torchvision, torchaudio) via `load_torch_library/1`.
- Op Extension System -- `ExTorch.Export.OpHandler` behaviour + `OpRegistry` for registering custom ops from external packages.
- Neural Network DSL -- Define PyTorch-compatible layers in Elixir with `deflayer`, backed by libtorch's C++ nn modules (35 layer types).
- Zero-Copy Tensor Exchange -- Share tensor memory with Nx/Torchx via `data_ptr`/`from_blob`.
- Telemetry hooks -- `:telemetry` events for load/inference, optional ETS-backed counters, optional LiveDashboard page.
- Tensor Operations -- 200+ wrapped libtorch ops for creation, manipulation, math, comparison, reduction, and indexing.
Requirements:

- Elixir >= 1.16
- Rust (stable toolchain)
- libtorch (automatically downloaded, or use a local PyTorch installation)
- CMake (for ExTorch.Vision)
- CUDA toolkit (optional, for GPU support)
Add extorch to your dependencies in mix.exs:

```elixir
def deps do
  [
    {:extorch, "~> 0.4.0"}
  ]
end
```

ExTorch downloads libtorch automatically on first compile. To use a local installation:
```elixir
config :extorch, libtorch: [
  version: :local,
  folder: :python # or an absolute path to libtorch
]
```

```python
# Python: export your model
import torch
import torchvision

# `weights=` replaces the deprecated `pretrained=True` flag (same ImageNet weights)
weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1
model = torchvision.models.resnet50(weights=weights).eval()
exported = torch.export.export(model, (torch.randn(1, 3, 224, 224),))
torch.export.save(exported, "resnet50.pt2")
```

```elixir
# Elixir: load and call
model = ExTorch.Export.load("resnet50.pt2", device: :cuda)
input = ExTorch.Tensor.to(ExTorch.randn({1, 3, 224, 224}), device: :cuda)

# Fastest Export path: pre-compiled graph, zero per-op overhead
output = ExTorch.Export.forward_compiled(model, [input])

# Or use AOTI for maximum throughput (requires pre-compilation in Python)
aoti_model = ExTorch.AOTI.load("resnet50_aoti.pt2", device_index: 0)
[output] = ExTorch.AOTI.forward(aoti_model, [input])
```

Export with a symbolic dim and run at any batch size that fits the constraint, with no re-export needed:
```python
# Python: export once with dynamic batch
from torch.export import Dim

batch = Dim("batch", min=1, max=64)
exported = torch.export.export(
    model,
    (torch.randn(2, 3, 224, 224),),    # example input
    dynamic_shapes={"x": {0: batch}},  # 0th dim is dynamic
)
torch.export.save(exported, "resnet.pt2")
```

```elixir
# Elixir: load once, call at any batch
model = ExTorch.Export.load("resnet.pt2")
ExTorch.Export.forward(model, [ExTorch.randn({1, 3, 224, 224})])   # bs=1
ExTorch.Export.forward(model, [ExTorch.randn({4, 3, 224, 224})])   # bs=4
ExTorch.Export.forward(model, [ExTorch.randn({16, 3, 224, 224})])  # bs=16
```

This works across all three Export inference paths (`forward/2`, `forward_native/2`, `forward_compiled/2`) and with any dimension that torch.export's tracer can express symbolically: batch size, variable H/W on convolutional models, variable sequence length on transformer classifiers. See test/export/dynamic_batch_test.exs for verified coverage (MLP, ConvNet, ResNet18 at multiple batch sizes).
Limitation — data-dependent shapes. Ops whose output shape depends
on input values (e.g. nonzero, torchvision::nms) are not yet
supported when the graph performs downstream arithmetic on the
variable-length output. This affects detection models like Mask R-CNN.
Non-data-dependent dynamic dims (batch, H/W) work today.
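Because the export example above constrains the batch dimension to the range 1 to 64, a caller with more samples than that has to split them before calling the model. A minimal sketch, assuming the samples arrive as a plain list; `BatchChunker` is a hypothetical helper, and the default of 64 simply mirrors the `max=64` used in the export example.

```elixir
defmodule BatchChunker do
  # Upper bound taken from Dim("batch", min=1, max=64) in the export example.
  @max_batch 64

  # Split samples into consecutive batches of at most `max_batch` items.
  # Every batch has between 1 and `max_batch` items, so each one satisfies
  # the exported constraint. An empty input yields no batches.
  def chunk(samples, max_batch \\ @max_batch)
      when is_list(samples) and is_integer(max_batch) and max_batch >= 1 do
    Enum.chunk_every(samples, max_batch)
  end
end
```

Each resulting batch would then be turned into one input tensor and passed to a single forward call.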
Because there's no built-in server, you compose inference with whatever your app already uses. A minimal supervised model behind a mailbox:
```elixir
{:ok, _} = ExTorch.Export.Server.start_link(
  path: "resnet50.pt2",
  device: :cuda,
  name: :resnet
)
{:ok, output} = ExTorch.Export.Server.predict(:resnet, [input])
```

…but the point is you don't have to use that wrapper. A Phoenix controller that does preprocessing in Nx, zero-copies to ExTorch, runs inference, and falls back to a smaller model on OOM is all ordinary Elixir code. See examples/serving/.
```elixir
ExTorch.Metrics.setup()
ExTorch.Metrics.get("resnet50.pt2")
# => %{inference_count: 1500, min_duration_ms: 4.9, max_duration_ms: 12.1, ...}
```

```elixir
# Swap models without dropping in-flight requests
# See examples/serving/hot_reload.exs for the full pattern
GenServer.cast(:resnet, {:reload, "resnet50_v2.pt2"})
```

```elixir
# Load torchvision ops (NMS, ROI Align, etc.)
ExTorch.Native.load_torch_library("/path/to/libtorchvision.so")

# Call any registered op by name
keep = ExTorch.Native.dispatch_op("torchvision::nms", "", [
  {:tensor, boxes}, {:tensor, scores}, {:float, 0.5}
])

# Or use ExTorch.Vision for a clean API
ExTorch.Vision.nms(boxes, scores, 0.5)
ExTorch.Vision.roi_align(features, rois, 1.0, 7, 7)
```

```elixir
# ExTorch → Nx (via Torchx): share memory, no copy
blob = ExTorch.Tensor.Blob.to_blob(tensor)
# => %Blob{ptr: 140234567890, shape: {3, 224, 224}, dtype: :float, ...}

# Nx → ExTorch: wrap foreign memory
view = ExTorch.Tensor.Blob.from_blob(
  %{ptr: torchx_ptr, shape: {3, 224, 224}, dtype: :float32},
  owner: nx_tensor
)
```

```elixir
ExTorch.Native.cuda_is_available()   # => true
ExTorch.Native.cuda_device_count()   # => 2
model = ExTorch.Export.load("model.pt2", device: :cuda)
ExTorch.Native.cuda_memory_allocated(0)  # bytes on GPU 0
```

Four inference paths on torch.export models, all producing bit-for-bit identical outputs to Python. RTX 3060, ViT-B/16, median of 30 runs:
| Path | When to use | Latency |
|---|---|---|
| `ExTorch.Export.forward/2` | Op-by-op interpreter, useful for debugging | 54.9ms |
| `ExTorch.Export.forward_native/2` | Native graph execution, one NIF call | 11.9ms |
| `ExTorch.Export.forward_compiled/2` | Pre-compiled graph, no per-op overhead | 9.5ms |
| `ExTorch.AOTI.forward/2` | AOTInductor kernels | 8.8ms |
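The latencies in the table are medians over repeated runs. For readers who want to take a comparable measurement on their own hardware, a rough sketch follows. `LatencySketch` is a hypothetical helper and is not the harness used to produce the numbers above. It times any zero-arity function with `:timer.tc/1`, so the function passed in could wrap a call such as `ExTorch.Export.forward_compiled(model, [input])`.

```elixir
defmodule LatencySketch do
  # Median of a non-empty list of numbers.
  def median(values) when values != [] do
    sorted = Enum.sort(values)
    n = length(sorted)
    mid = div(n, 2)

    if rem(n, 2) == 1 do
      Enum.at(sorted, mid)
    else
      (Enum.at(sorted, mid - 1) + Enum.at(sorted, mid)) / 2
    end
  end

  # Run `fun` `runs` times and return the median wall-clock time in ms.
  def median_ms(fun, runs \\ 30) when is_function(fun, 0) and runs > 0 do
    times =
      for _ <- 1..runs do
        # :timer.tc/1 returns {elapsed_microseconds, result}
        {micros, _result} = :timer.tc(fun)
        micros / 1000
      end

    median(times)
  end
end
```

A few untimed warm-up calls before measuring are usually advisable, since the first calls can include one-off setup cost.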
Against Python on the same hardware:
- AOTI: 1.03x of Python AOTI — effectively tied. Both sides are compiled shared objects, so neither has interpreter overhead in the hot path.
- Compiled Export: 1.37x of Python's `torch.export.export().module()` (geomean across 12 models). The gap is Python's FX interpreter walking the graph per-call; ExTorch pre-resolves the graph to c10 dispatcher handles at load time.
The 1.37x number is the symptom, not the point. The architectural fact is that there is no Python interpreter in ExTorch's hot path — on paths where Python also skips its interpreter (AOTI), we match.
Full results for 12 models in examples/models.
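The cross-model figure quoted above is a geometric mean of per-model ratios, which is the exponential of the mean of their logarithms. A small sketch of that arithmetic follows. `GeomeanSketch` is a hypothetical helper, and the ratios in the usage note are made up for illustration. They are not results from this project.

```elixir
defmodule GeomeanSketch do
  # Geometric mean of a non-empty list of positive ratios:
  # exp((log r1 + log r2 + ... + log rn) / n)
  def geomean(ratios) when ratios != [] do
    ratios
    |> Enum.map(&:math.log/1)
    |> Enum.sum()
    |> Kernel./(length(ratios))
    |> :math.exp()
  end
end
```

For example, `GeomeanSketch.geomean([1.0, 4.0])` is approximately 2.0, whereas the arithmetic mean of the same two made-up ratios would be 2.5. The geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel out.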
See examples/serving/ for integration patterns:
- basic_inference.exs -- Three inference paths side-by-side with benchmarks
- genserver_pool.exs -- Model pool with concurrent inference and p50/p95/p99 reporting
- hot_reload.exs -- Swapping models without dropping in-flight requests
- telemetry_dashboard.exs -- Metrics via LiveDashboard
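The hot-reload pattern listed above can be sketched with a plain GenServer. This is not the code in hot_reload.exs. `ReloadableModel`, `loader`, and `runner` are hypothetical names: `loader` stands in for something like `&ExTorch.Export.load/1`, and `runner` for something like `&ExTorch.Export.forward/2`. The `{:reload, path}` message shape follows the `GenServer.cast/2` call shown earlier.

```elixir
defmodule ReloadableModel do
  @moduledoc "Hypothetical GenServer that swaps its model on request."
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, Keyword.take(opts, [:name]))
  end

  def predict(server, inputs), do: GenServer.call(server, {:predict, inputs})
  def reload(server, path), do: GenServer.cast(server, {:reload, path})

  @impl true
  def init(opts) do
    loader = Keyword.fetch!(opts, :loader)
    runner = Keyword.fetch!(opts, :runner)
    model = loader.(Keyword.fetch!(opts, :path))
    {:ok, %{loader: loader, runner: runner, model: model}}
  end

  @impl true
  def handle_call({:predict, inputs}, _from, state) do
    {:reply, {:ok, state.runner.(state.model, inputs)}, state}
  end

  @impl true
  def handle_cast({:reload, path}, state) do
    # A GenServer handles one message at a time, so a predict call that is
    # already running finishes on the old model before this swap happens.
    {:noreply, %{state | model: state.loader.(path)}}
  end
end
```

In this simple version, requests that arrive during a reload wait in the mailbox until loading finishes. They are delayed rather than dropped. Loading in a separate process and swapping afterwards would avoid that pause.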
See examples/models/ for end-to-end model usage:
- 8 models: CLIP, DistilBERT, MobileNetV3, EfficientNet, ResNet50, ViT-B/16, DeepLabV3, Whisper
- Export script, multi-model benchmark, full image classification pipeline
Three-layer design: C++ (libtorch wrapper) → Rust (cxx bridge + Rustler NIFs) → Elixir (macro-generated API).
- C++ sources: `native/extorch/src/csrc/*.cc` + `native/extorch/include/*.h`
- Rust bridge: `native/extorch/src/native/*.rs.in` (Tera templates rendered by `build.rs`)
- Rust NIFs: `native/extorch/src/nifs/*.rs`
- Elixir API: `lib/extorch/`
The generic c10::Dispatcher NIF bridge (dispatch_op, execute_graph, compile_graph) enables calling any PyTorch op without per-op C++ wrappers, and the OpHandler behaviour allows external packages to extend the Export interpreter.
License: MIT