Refactor GPU backend planner by orionpapadakis · Pull Request #117 · beehive-lab/GPULlama3.java

orionpapadakis · 2026-05-28T12:59:06Z

This PR reorganizes TornadoVM execution planning around three variant axes:

model family
quantization
forward execution mode

The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of
combinations each model/quantization pair may need to support.

This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure
instead of growing separate master-plan dispatch logic.

More specifically, the GPU inference path is now organized around four collaborating abstractions:

Layouts (*ForwardTaskGraphLayout) encode the index arithmetic for a given graph topology — for example, which integer index corresponds to the activation graph, the N layer graphs, or the logits graph. They
eliminate magic numbers and make index-dependent code self-documenting.
Components (*ForwardPlanComponents) are model-family + quantization-specific factories. Each implementation constructs the concrete TornadoVM TaskGraph objects for its model (e.g., LlamaFP16PlanComponents produces LlamaFP16FFNLayers, LogitsFP16Layer, etc.). The three component interfaces form a capability hierarchy — SingleTokenForwardPlanComponents → PrefillDecodeForwardPlanComponents → BatchPrefillDecodeForwardPlanComponents — so that Llama, which supports all three execution modes, implements one object that satisfies all three contracts.
ForwardPlans (Single/PrefillDecode/BatchPrefillDecodeForwardPlan) assemble components into an ordered ImmutableTaskGraph list and a GridScheduler. Each plan encodes the graph topology for one execution mode, where N is the number of transformer layers: N+2 graphs for single-token, N+2 for prefill-decode, and 2N+3 for batch-prefill-decode. ForwardPlanFactory selects the right combination of components and plan based on quantization type, model family, and execution mode.
MasterPlans (TornadoVMMasterPlan*) own the TornadoVM execution lifecycle: they create the TornadoExecutionPlan from the ForwardPlan's graph list, handle warmup and CUDA-graph configuration, and expose the forward-pass entry points (tornadoVMForwardDecode, tornadoVMForwardPrefill, tornadoVMForwardBatchPrefill) used by the inference core. They are model-agnostic: all model-specific knowledge lives in the components layer below them.
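
To illustrate the layout idea, here is a minimal sketch of the index arithmetic for the single-token topology (N+2 graphs). The record name, its method names, and the ordering (activation first, then the N layer graphs, then logits) are assumptions made for illustration; they are not taken from the PR's *ForwardTaskGraphLayout classes.

```java
// Hypothetical sketch: type name, method names, and graph ordering are
// assumptions for illustration, not the PR's actual layout classes.
record SingleTokenLayoutSketch(int numLayers) {

    SingleTokenLayoutSketch {
        if (numLayers <= 0) {
            throw new IllegalArgumentException("numLayers must be positive");
        }
    }

    // Assumed: the activation graph is the first graph in the plan.
    int activationIndex() {
        return 0;
    }

    // Assumed: the N layer graphs follow the activation graph.
    int layerIndex(int layer) {
        if (layer < 0 || layer >= numLayers) {
            throw new IndexOutOfBoundsException("layer " + layer);
        }
        return 1 + layer;
    }

    // Assumed: the logits graph is the last graph in the plan.
    int logitsIndex() {
        return numLayers + 1;
    }

    // Total graph count for the single-token topology: N + 2.
    int graphCount() {
        return numLayers + 2;
    }
}
```

With a 16-layer model, for example, this sketch yields 18 graphs, with the logits graph at index 17. Callers ask the layout for an index by role rather than hard-coding numbers such as `numLayers + 1`.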

Notes

Adds Llama Q8_0 prefill-decode support, which also demonstrates why this refactor is needed.
Renames task-graph abstractions for clearer roles.
Moves scheduling helpers into a dedicated TornadoVM scheduling package.
Keeps graph topology and execution behavior unchanged outside the new prefill-decode path.

Verification

Use Java 21 or 25
Set up TornadoVM
mvn clean install
llama fp16 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama fp16 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama fp16 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
llama q8_0 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama q8_0 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama q8_0 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should still pass with the single-token configuration but should fail for any prefill-decode configuration with a message like the following:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsupportedOperationException: BATCH_PREFILL_DECODE not yet supported for QWEN_3 + F16
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createQwen3FP16Plan(ForwardPlanFactory.java:174)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createFP16Plan(ForwardPlanFactory.java:90)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.create(ForwardPlanFactory.java:74)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createBatchPrefillDecode(ForwardPlanFactory.java:65)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.createExecutionPlan(TornadoVMMasterPlanBatchPrefillDecode.java:70)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.<init>(TornadoVMMasterPlanBatchPrefillDecode.java:51)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlan.initializeTornadoVMPlan(TornadoVMMasterPlan.java:59)
  at org.beehive.gpullama3.model.Model.runInstructOnce(Model.java:205)
  at org.beehive.gpullama3.LlamaApp.runSingleInstruction(LlamaApp.java:18)
  at org.beehive.gpullama3.LlamaApp.main(LlamaApp.java:44)
Error: Command failed with return code 1
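
The failure above is what one would expect from a factory that dispatches on all three axes and rejects unsupported combinations early. The following is a minimal sketch of that pattern only: the class name, the enum types, and every enum constant other than BATCH_PREFILL_DECODE, QWEN_3, and F16 (which appear in the message above) are assumptions, and the real ForwardPlanFactory returns plan objects rather than strings.

```java
// Hypothetical sketch of three-axis dispatch. All type names and most
// enum constants are assumptions; only the message format is taken
// from the stack trace above.
enum ModelFamily { LLAMA, MISTRAL, QWEN_3 }

enum Quantization { F16, Q8_0 }

enum ForwardMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

final class ForwardPlanFactorySketch {

    private ForwardPlanFactorySketch() {
    }

    // Returns a description of the selected plan, or throws if the
    // combination is unsupported. A real factory would build the plan.
    static String create(ModelFamily family, Quantization quant, ForwardMode mode) {
        // Per the verification notes: every model runs single-token,
        // while only Llama runs the prefill-decode modes.
        boolean supported = mode == ForwardMode.SINGLE_TOKEN
                || family == ModelFamily.LLAMA;
        if (!supported) {
            throw new UnsupportedOperationException(
                    mode + " not yet supported for " + family + " + " + quant);
        }
        return family + "/" + quant + "/" + mode;
    }
}
```

Failing at plan-creation time, as the stack trace shows happening inside the master plan constructor, surfaces an unsupported combination before any model execution starts.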

mikepapadim

LGTM, some minor changes needed.

…e-token plan

…for consistency

# Conflicts: # src/main/java/org/beehive/gpullama3/inference/state/State.java # src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanBatchPrefillDecode.java

…tate` class, removing redundancies in model-specific implementations

…pulations for batch-prefill and single-token state initialization.

…gle-token, prefill-decode, and batch-prefill inference plans

…planning

orionpapadakis added 6 commits May 28, 2026 15:36

[prf/dec]Implement prefill-decode for Llama Q8_0

45204f1

Reorganize TornadoVM execution planning and improve naming conventions

8ebf91f

Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.

Update naming from ActivationGraph to ActivationTaskGraph across …

4e4478a

…TornadoVM components

Rename AbstractFFNLayers to AbstractTransformerLayerTaskGraphs an…

ea478f8

…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.

Refactor FFN layer comments to transformer-layer task graphs, align…

e20ebc5

…ing with updated naming conventions.

[ci] Add workflows for Llama-3.2-1B-Instruct Q8_0 inference with pref…

c7522d1

…ill-decode and CUDA-graph variants

orionpapadakis requested review from mairooni, mikepapadim and stratika May 28, 2026 12:59

orionpapadakis added the enhancement, refactoring, and prefill-decode labels May 28, 2026