Refactor GPU backend planner #117
Merged
Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.
…TornadoVM components
…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.
…ing with updated naming conventions.
…ill-decode and CUDA-graph variants
mikepapadim
reviewed
May 30, 2026
mikepapadim
reviewed
May 30, 2026
mikepapadim
reviewed
May 30, 2026
mikepapadim
left a comment
Member
LGTM, some minor changes needed.
# Conflicts: # src/main/java/org/beehive/gpullama3/inference/state/State.java # src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlanBatchPrefillDecode.java
…tate` class, removing redundancies in model-specific implementations
…pulations for batch-prefill and single-token state initialization.
…gle-token, prefill-decode, and batch-prefill inference plans
mikepapadim
reviewed
Jun 7, 2026
stratika
approved these changes
Jun 9, 2026
This PR reorganizes TornadoVM execution planning around three variant axes: model family, quantization, and execution mode.
The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of combinations each model/quantization pair may need to support.
This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure instead of growing separate master-plan dispatch logic.
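To picture what selecting on three axes can look like, here is a minimal Java sketch. Every name in it (`ModelFamily`, `Quantization`, `ExecutionMode`, `PlanSelectionSketch`) is hypothetical and not taken from the PR. The support table follows the Verification steps in this description, where Llama is exercised in all three modes and other families are expected to run single-token only. The exception text is invented for the sketch and is not the message the real code prints.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout: this is not the PR's API, only a picture of
// selecting on (model family, quantization, execution mode).
enum ModelFamily { LLAMA, MISTRAL, QWEN3 }
enum Quantization { FP16, Q8_0 }
enum ExecutionMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

final class PlanSelectionSketch {

    // Modes each family is assumed to support in this sketch:
    // Llama runs all three, every other family runs single-token only.
    static Set<ExecutionMode> supportedModes(ModelFamily family) {
        return family == ModelFamily.LLAMA
                ? EnumSet.allOf(ExecutionMode.class)
                : EnumSet.of(ExecutionMode.SINGLE_TOKEN);
    }

    // Returns a label for the selected combination, or throws when the
    // family does not support the requested execution mode.
    static String select(ModelFamily family, Quantization quant, ExecutionMode mode) {
        if (!supportedModes(family).contains(mode)) {
            throw new UnsupportedOperationException(
                    family + " does not support " + mode + " in this sketch");
        }
        return family + "/" + quant + "/" + mode;
    }

    public static void main(String[] args) {
        System.out.println(select(ModelFamily.LLAMA, Quantization.Q8_0,
                ExecutionMode.BATCH_PREFILL_DECODE));
        System.out.println(select(ModelFamily.QWEN3, Quantization.FP16,
                ExecutionMode.SINGLE_TOKEN));
    }
}
```

Keeping the mode check in one place is the point of the illustration: adding a fourth execution mode would touch the enum and the support table rather than a separate dispatch path per model.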
More specifically, the GPU inference path is now organized around four collaborating abstractions:
- Layouts (`*ForwardTaskGraphLayout`) encode the index arithmetic for a given graph topology: for example, which integer index corresponds to the activation graph, the N layer graphs, or the logits graph. They eliminate magic numbers and make index-dependent code self-documenting.
- Components (`*ForwardPlanComponents`) are model-family + quantization-specific factories. Each implementation constructs the concrete TornadoVM TaskGraph objects for its model (e.g., `LlamaFP16PlanComponents` produces `LlamaFP16FFNLayers`, `LogitsFP16Layer`, etc.). The three component interfaces form a capability hierarchy, `SingleTokenForwardPlanComponents` → `PrefillDecodeForwardPlanComponents` → `BatchPrefillDecodeForwardPlanComponents`, so that Llama, which supports all three execution modes, implements one object that satisfies all three contracts.
- ForwardPlans (`Single/PrefillDecode/BatchPrefillDecodeForwardPlan`) assemble components into an ordered `ImmutableTaskGraph` list and a `GridScheduler`. Each plan encodes the graph topology for one execution mode: `N+2` graphs for single-token, `N+2` for prefill-decode, `2N+3` for batch-prefill/decode. `ForwardPlanFactory` selects the right combination of components and plan based on quantization type, model family, and execution mode.
- MasterPlans (`TornadoVMMasterPlan*`) own the TornadoVM execution lifecycle: they create the `TornadoExecutionPlan` from the ForwardPlan's graph list, handle warmup and CUDA-graph configuration, and expose the forward-pass entry points (`tornadoVMForwardDecode`, `tornadoVMForwardPrefill`, `TornadoVMForwardBatchPrefill`) used by the inference core. They are model-agnostic: all model-specific knowledge lives in the components layer below them.

Notes
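As an illustration of the layout idea described above, the following Java sketch shows index arithmetic for a single-token topology of `N+2` graphs. The class and method names are hypothetical, and the ordering (activation graph first, then the N layer graphs, then logits) is an assumption made for the example rather than something this description states. The two static helpers only restate the totals quoted above for the other modes and do not model how those graphs are ordered.

```java
// Hypothetical sketch: names are illustrative, and the ordering
// (activation first, logits last) is an assumption, not the PR's layout.
final class SingleTokenLayoutSketch {
    private final int numLayers;

    SingleTokenLayoutSketch(int numLayers) {
        if (numLayers <= 0) {
            throw new IllegalArgumentException("numLayers must be positive");
        }
        this.numLayers = numLayers;
    }

    // Index of the activation graph in the ordered graph list.
    int activationIndex() {
        return 0;
    }

    // Index of the graph for transformer layer `layer` (0-based).
    int layerIndex(int layer) {
        if (layer < 0 || layer >= numLayers) {
            throw new IndexOutOfBoundsException("layer " + layer);
        }
        return 1 + layer;
    }

    // Index of the logits graph, placed after all layer graphs.
    int logitsIndex() {
        return numLayers + 1;
    }

    // Total number of graphs: activation + N layers + logits.
    int graphCount() {
        return numLayers + 2;
    }

    // Totals quoted in the description for the other execution modes.
    static int prefillDecodeGraphCount(int n) {
        return n + 2;
    }

    static int batchPrefillDecodeGraphCount(int n) {
        return 2 * n + 3;
    }

    public static void main(String[] args) {
        SingleTokenLayoutSketch layout = new SingleTokenLayoutSketch(16);
        System.out.println("logits index: " + layout.logitsIndex());
        System.out.println("graph count:  " + layout.graphCount());
        System.out.println("batch total:  " + batchPrefillDecodeGraphCount(16));
    }
}
```

With named accessors like these, calling code asks for `layout.logitsIndex()` instead of hard-coding `numLayers + 1`, which is the "no magic numbers" benefit the Layouts item describes.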
Verification
Use Java 21 or 25
Set up TornadoVM
mvn clean install
llama fp16 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama fp16 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama fp16 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
llama q8_0 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama q8_0 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama q8_0 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
Any other model (mistral, qwen3, etc.) should also pass with the single-token config but should fail for any prefill-decode config with the following message: