This repository contains the source code for the Reused Mobile Devices iOS Node Application. The app allows you to connect iOS devices as compute nodes to the rmcluster backend.
Phones are leaf/worker nodes only. The Mac coordinator handles tokenization and sampling (CPU); all tensor compute runs on the phone(s) via GGML RPC.
Build and run the Xcode project normally. The app starts a GGML RPC server on port 50052 (configurable in the UI). The phone's Wi-Fi IP and the llama-cli command are displayed in the RPC Worker panel.
Prerequisites: run the xcframework build script first so Frameworks/ is populated.
cd scripts
./build-ggml-ios.sh

Then build and run the app from Xcode onto a physical device.
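Before moving on to the Mac side, it can help to confirm that the phone's RPC port is reachable over the network. The following is a minimal sketch, not part of this repository: the helper name `rpc_port_open` is hypothetical, and it only checks that a TCP connection succeeds on the default port 50052. It does not speak the GGML RPC protocol.

```python
import socket


def rpc_port_open(host: str, port: int = 50052, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This is only a reachability check (hypothetical helper); it does not
    verify that the listener is actually a GGML RPC server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Call it with the Wi-Fi IP shown in the RPC Worker panel, for example `rpc_port_open("<phone-ip>")`. A `False` result usually means the app is not running, the port was changed in the UI, or the two machines are on different networks.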
Build a dedicated llama-cli that has Metal disabled. Without Metal, the only GPU backend is RPC, so all layers go to the phone(s).
Apple Silicon note: CMake's SVE probe hangs indefinitely on M-series chips. Use the provided initial-cache file (rpc-only-init.cmake), which pre-seeds the result and skips the hang.
cmake -C rpc-only-init.cmake \
-B build-mac-rpc-only \
-DCMAKE_BUILD_TYPE=Release \
../llama.cpp-rpc

rpc-only-init.cmake sets GGML_METAL=OFF, GGML_RPC=ON, and pre-caches the ARM SVE feature-detection result.
cmake --build build-mac-rpc-only --target llama-cli -j$(sysctl -n hw.logicalcpu)

The binary lands at build-mac-rpc-only/bin/llama-cli.
Note: build-mac/ is the original Metal-enabled build (kept for single-device testing). build-mac-rpc-only/ is for distributed runs where phones are the only workers.
Start the iOS app on the phone first, then on the Mac:
./build-mac-rpc-only/bin/llama-cli \
--rpc <phone-ip>:50052 \
-m ./Models/<model>.gguf \
-ngl 99 \
-p "Your prompt here" \
-no-cnv

Multiple phones:
./build-mac-rpc-only/bin/llama-cli \
--rpc <phone1-ip>:50052,<phone2-ip>:50052 \
-m ./Models/<model>.gguf \
-ngl 99 \
-p "Your prompt here" \
-no-cnv

| Flag | Purpose |
|---|---|
| --rpc <ip:port> | RPC worker address(es), comma-separated |
| -ngl 99 | Offload all layers to GPU backends (phones via RPC) |
| -no-cnv | Disable chat-template wrapping (use for bare prompts) |
| -sys "..." | Set system prompt when using chat mode |
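The single-phone and multi-phone invocations differ only in the comma-separated --rpc value, so the command is easy to assemble programmatically. The sketch below is illustrative and is not the code used by this repository: the function name `build_llama_cli_command`, its parameters, and the sample IP addresses are assumptions. It uses only the flags listed in the table above and handles IPv4 addresses or hostnames only.

```python
import shlex


def build_llama_cli_command(workers, model_path, prompt,
                            llama_cli="./build-mac-rpc-only/bin/llama-cli",
                            port=50052, bare_prompt=True):
    """Assemble the llama-cli argument list for one or more RPC workers.

    Hypothetical helper: `workers` is a list of IPv4 addresses or hostnames,
    optionally already suffixed with ":port".
    """
    if not workers:
        raise ValueError("at least one worker address is required")
    # Append the default port to bare addresses; keep explicit host:port as given.
    endpoints = [w if ":" in w else f"{w}:{port}" for w in workers]
    args = [llama_cli,
            "--rpc", ",".join(endpoints),
            "-m", model_path,
            "-ngl", "99",
            "-p", prompt]
    if bare_prompt:
        # -no-cnv disables chat-template wrapping for bare prompts.
        args.append("-no-cnv")
    return args


# Example addresses are placeholders, not real devices.
cmd = build_llama_cli_command(["192.168.1.20", "192.168.1.21:50053"],
                              "./Models/model.gguf", "Your prompt here")
print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns; `shlex.join` is used only to print a copy-pasteable form.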
load_tensors: RPC[<ip>:50052] model buffer size = ~98 MiB
load_tensors: CPU model buffer size = ~28 MiB ← coordinator CPU only
No Metal lines should appear in the output. The graph splits count should equal the number of active backends.
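This check can be automated by scanning the captured llama-cli output. The following is a rough sketch that assumes the log lines look like the two load_tensors lines shown above; the exact format may vary between llama.cpp versions, and the function name `check_load_log` and the sample values are illustrative only.

```python
import re

# Matches lines such as:
#   load_tensors: RPC[192.168.1.20:50052] model buffer size = 98.00 MiB
# The format is inferred from the sample output above and is an assumption.
RPC_LINE = re.compile(
    r"load_tensors:\s+RPC\[([^\]]+)\]\s+model buffer size\s*=\s*~?\s*([\d.]+)\s*MiB"
)


def check_load_log(log_text):
    """Return (rpc_buffers, metal_lines) extracted from llama-cli load output.

    rpc_buffers maps each RPC endpoint to its model buffer size in MiB.
    metal_lines collects any line mentioning Metal, which should be empty
    for an RPC-only build.
    """
    rpc_buffers = {}
    metal_lines = []
    for line in log_text.splitlines():
        match = RPC_LINE.search(line)
        if match:
            rpc_buffers[match.group(1)] = float(match.group(2))
        elif "metal" in line.lower():
            metal_lines.append(line.strip())
    return rpc_buffers, metal_lines


# Illustrative sample, not captured from a real run.
sample = """\
load_tensors: RPC[192.168.1.20:50052] model buffer size = 98.00 MiB
load_tensors:          CPU model buffer size = 28.00 MiB
"""
buffers, metal = check_load_log(sample)
print(buffers, metal)
```

An empty `metal` list together with one `buffers` entry per phone suggests the layers were distributed as intended.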
The inference server (server/inference_server.py) tracks which phones are online and generates the llama-cli command automatically.
pip install -r server/requirements.txt
python server/inference_server.py \
--model /path/to/model.gguf \
--llama-cli ./build-mac-rpc-only/bin/llama-cli

Endpoints: POST /register, GET /devices, POST /keepalive/{id}, DELETE /deregister/{id}, POST /run-inference.
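For client code, the endpoint list can be captured in a small lookup table. This is a sketch under stated assumptions: the methods and paths come from the list above, but the base URL, the helper name `endpoint`, and the device ID are hypothetical. Request and response bodies are not documented here, so the sketch only builds the method and URL.

```python
from urllib.parse import quote

# Methods and paths as listed above; nothing here defines request bodies.
ENDPOINTS = {
    "register": ("POST", "/register"),
    "devices": ("GET", "/devices"),
    "keepalive": ("POST", "/keepalive/{id}"),
    "deregister": ("DELETE", "/deregister/{id}"),
    "run_inference": ("POST", "/run-inference"),
}


def endpoint(base_url, name, device_id=None):
    """Return (method, url) for a named endpoint (hypothetical helper)."""
    method, template = ENDPOINTS[name]
    if "{id}" in template:
        if device_id is None:
            raise ValueError(f"endpoint '{name}' requires a device_id")
        # Percent-encode the ID so it is safe to embed in the path.
        template = template.replace("{id}", quote(str(device_id), safe=""))
    return method, base_url.rstrip("/") + template


# The base URL and device ID below are placeholders.
print(endpoint("http://localhost:8000", "keepalive", "phone-1"))
```

The returned pair can be handed to any HTTP client, for example `urllib.request.Request(url, method=method)`.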
llama.cpp is pinned to tag b5076 in scripts/build-ggml-ios.sh. llama_batch_add was removed at this tag, so an inline helper, llama_batch_add_token, is defined in Bridge/LlamaBridge.mm.