Distributed Inference iOS App

This repository contains the source code for the Reused Mobile Devices iOS Node Application. THe app allows you to connect iOS devices as compute nodes to the rmcluster backend.

Build & Run Guide

Phones are leaf/worker nodes only. The Mac coordinator handles tokenization and sampling (CPU); all tensor compute runs on the phone(s) via GGML RPC.

1. iOS app (RPC server on phone)

Build and run the Xcode project normally. The app starts a GGML RPC server on port 50052 (configurable in the UI). The phone's Wi-Fi IP and the llama-cli command are displayed in the RPC Worker panel.

Prerequisites: run the xcframework build script first so Frameworks/ is populated.

cd scripts
./build-ggml-ios.sh

Then build & run the app from Xcode onto a physical device.

2. Mac coordinator binary (no Metal, RPC only)

Build a dedicated llama-cli that has Metal disabled. Without Metal, the only GPU backend is RPC, so all layers go to the phone(s).

Configure (one-time)

Apple Silicon note: CMake's SVE probe hangs indefinitely on M-series chips. Use the provided initial-cache file (rpc-only-init.cmake) which pre-seeds the result and skips the hang.

cmake -C rpc-only-init.cmake \
  -B build-mac-rpc-only \
  -DCMAKE_BUILD_TYPE=Release \
  ../llama.cpp-rpc

rpc-only-init.cmake sets GGML_METAL=OFF, GGML_RPC=ON, and pre-caches the ARM SVE feature-detection result.

Build

cmake --build build-mac-rpc-only --target llama-cli -j$(sysctl -n hw.logicalcpu)

Binary lands at build-mac-rpc-only/bin/llama-cli.

Note: build-mac/ is the original Metal-enabled build (kept for single-device testing). build-mac-rpc-only/ is for distributed runs where phones are the only workers.

3. Run distributed inference

Start the iOS app on the phone first, then on the Mac:

./build-mac-rpc-only/bin/llama-cli \
  --rpc <phone-ip>:50052 \
  -m ./Models/<model>.gguf \
  -ngl 99 \
  -p "Your prompt here" \
  -no-cnv

Multiple phones:

./build-mac-rpc-only/bin/llama-cli \
  --rpc <phone1-ip>:50052,<phone2-ip>:50052 \
  -m ./Models/<model>.gguf \
  -ngl 99 \
  -p "Your prompt here" \
  -no-cnv

Key flags

Flag	Purpose
`--rpc <ip:port>`	RPC worker address(es), comma-separated
`-ngl 99`	Offload all layers to GPU backends (phones via RPC)
`-no-cnv`	Disable chat-template wrapping (use for bare prompts)
`-sys "..."`	Set system prompt when using chat mode

What to expect in output

load_tensors: RPC[<ip>:50052] model buffer size = ~98 MiB
load_tensors:        CPU model buffer size = ~28 MiB   ← coordinator CPU only

Metal lines should be absent. graph splits count equals number of active backends.

4. Device registry server (optional)

Tracks which phones are online and generates the llama-cli command automatically.

pip install -r server/requirements.txt
python server/inference_server.py \
  --model /path/to/model.gguf \
  --llama-cli ./build-mac-rpc-only/bin/llama-cli

Endpoints: POST /register, GET /devices, POST /keepalive/{id}, DELETE /deregister/{id}, POST /run-inference.

5. llama.cpp version

Pinned to tag b5076 in scripts/build-ggml-ios.sh. llama_batch_add was removed at this tag; an inline helper llama_batch_add_token is defined in Bridge/LlamaBridge.mm.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
distributed-ml-ggml-client-ios.xcodeproj		distributed-ml-ggml-client-ios.xcodeproj
distributed-ml-ggml-client-ios		distributed-ml-ggml-client-ios
distributed-ml-ggml-client-iosTests		distributed-ml-ggml-client-iosTests
distributed-ml-ggml-client-iosUITests		distributed-ml-ggml-client-iosUITests
scripts		scripts
server		server
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
rpc-only-init.cmake		rpc-only-init.cmake

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Distributed Inference iOS App

Build & Run Guide

1. iOS app (RPC server on phone)

2. Mac coordinator binary (no Metal, RPC only)

Configure (one-time)

Build

3. Run distributed inference

Key flags

What to expect in output

4. Device registry server (optional)

5. llama.cpp version

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Distributed Inference iOS App

Build & Run Guide

1. iOS app (RPC server on phone)

2. Mac coordinator binary (no Metal, RPC only)

Configure (one-time)

Build

3. Run distributed inference

Key flags

What to expect in output

4. Device registry server (optional)

5. llama.cpp version

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages