feat: account for spatial-overflow rounds in symbolic latency model

Hi,
While working with hardware configurations that have fixed fanout limits, I noticed an edge case in the symbolic latency model.

Currently, ComputeStats.combine_spatial in _stats.py uses max_nonzero() to determine max_latency. This is correct when the spatial bound is within the hardware fanout. However, when the bound exceeds the fanout (e.g., bound=512, fanout=240), the hardware must perform multiple sequential rounds to complete the operation. In this case, the latency should be scaled by ⌈bound/fanout⌉, which is ⌈512/240⌉ = 3 rounds in this example.
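
To make the scaling factor concrete, here is a minimal sketch of the round count. The helper name `overflow_rounds` is hypothetical and not part of the tool; it uses integer arithmetic for the ceiling so that large bounds do not go through floating point.

```python
def overflow_rounds(bound: int, fanout: int) -> int:
    """Sequential rounds needed to cover `bound` spatial instances
    when at most `fanout` of them can run in parallel.

    Hypothetical helper for illustration; not an existing function in _stats.py.
    """
    if bound <= 0 or fanout <= 0:
        raise ValueError("bound and fanout must be positive")
    # Integer ceiling division: equivalent to ceil(bound / fanout).
    return -(-bound // fanout)


print(overflow_rounds(512, 240))  # bound exceeds fanout: 3 rounds
print(overflow_rounds(240, 240))  # bound equals fanout: 1 round
print(overflow_rounds(128, 240))  # bound within fanout: 1 round
```

When the bound fits within the fanout the factor is 1, so the existing behavior would be unchanged in that case.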

At the moment, AF models this as a single parallel batch (1 round), which leads to an underestimation of the total latency, even though the total_ops (MAC count) remains correct.

I've also noticed that attempting to model this manually by wrapping a Spatial loop in a Temporal loop on the same rank variable causes a crash in the mapping validator during tensor-view reordering.

I believe updating the symbolic engine to naturally handle these overflow rounds would be a great addition to the tool's accuracy. Would you be open to a change in _stats.py to account for this?


Suggested change (conceptual):
# Determine the number of sequential rounds required from the spatial bound
# and the hardware fanout.
rounds = math.ceil(current_spatial_bound / hardware_fanout)
# Scale the parallel-batch latency by the number of rounds.
self.max_latency = rounds * max_nonzero(self.max_latency, other.max_latency)
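
For discussion, here is a self-contained sketch of how this could behave with plain integers. Everything in it is an assumption made for illustration: `ComputeStatsSketch` is a stand-in rather than the real `ComputeStats`, the `max_nonzero` shown is a simplified guess at its semantics, the `spatial_bound` and `fanout` parameters are hypothetical, and summing `total_ops` is assumed. It also assumes the scaling is applied once per spatial loop; if `combine_spatial` is called repeatedly for the same loop, the factor would compound and would need to be applied elsewhere.

```python
import math
from dataclasses import dataclass


def max_nonzero(a: int, b: int) -> int:
    # Simplified stand-in: for non-negative latencies, taking the maximum
    # already ignores zero entries. The real helper may differ.
    return max(a, b)


@dataclass
class ComputeStatsSketch:
    # Stand-in for ComputeStats with only the two fields discussed here.
    total_ops: int = 0
    max_latency: int = 0

    def combine_spatial(self, other: "ComputeStatsSketch",
                        spatial_bound: int, fanout: int) -> None:
        # Hypothetical signature: the real method may not receive these values.
        rounds = math.ceil(spatial_bound / fanout)
        # Assumed: the operation count is additive and independent of rounds.
        self.total_ops += other.total_ops
        # Latency of one parallel batch, repeated once per sequential round.
        self.max_latency = rounds * max_nonzero(self.max_latency,
                                                other.max_latency)


a = ComputeStatsSketch(total_ops=100, max_latency=10)
b = ComputeStatsSketch(total_ops=100, max_latency=7)
a.combine_spatial(b, spatial_bound=512, fanout=240)
print(a.total_ops, a.max_latency)
```

With these inputs the single-batch latency is 10 and the bound needs 3 rounds, so the combined latency becomes 30 while the operation count is unaffected by the round factor. If the bound or fanout are symbolic expressions rather than integers, `math.ceil` would not apply directly and a symbolic ceiling would be needed instead.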
