feat: account for spatial-overflow rounds in symbolic latency model #51

Description

@okaikov

Hi,
While working with some hardware configurations with specific fanout limits, I noticed an interesting edge case in the symbolic latency model.

Currently, ComputeStats.combine_spatial in _stats.py uses max_nonzero() to determine max_latency. This is correct when the spatial bound fits within the hardware fanout. However, when the bound exceeds the fanout (e.g., bound=512, fanout=240), the hardware must perform multiple sequential rounds to complete the operation. In this case, the latency should be scaled by ⌈bound/fanout⌉, which is ⌈512/240⌉ = 3 for this example.

At the moment, AF models this as a single parallel batch (1 round), which leads to an underestimation of the total latency, even though the total_ops (MAC count) remains correct.
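To make the gap concrete, here is a small sketch of the round count for the example above. The helper name `overflow_rounds` is hypothetical and not part of the tool; it only illustrates the ⌈bound/fanout⌉ arithmetic.

```python
def overflow_rounds(bound: int, fanout: int) -> int:
    """Sequential rounds needed to cover `bound` iterations with `fanout` parallel units.

    Hypothetical helper for illustration; not an existing function in the tool.
    """
    if bound <= 0 or fanout <= 0:
        raise ValueError("bound and fanout must be positive")
    # Integer ceiling division, which avoids float rounding for large bounds.
    return -(-bound // fanout)


print(overflow_rounds(512, 240))  # 3: the bound exceeds the fanout
print(overflow_rounds(240, 240))  # 1: the bound fits exactly
```

With bound=512 and fanout=240, a single-round model reports one third of the latency that three sequential rounds would take.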

I've also noticed that attempting to model this manually by wrapping a Spatial loop in a Temporal loop on the same rank variable causes a crash in the mapping validator during tensor-view reordering.

I believe updating the symbolic engine to naturally handle these overflow rounds would be a great addition to the tool's accuracy. Would you be open to a change in _stats.py to account for this?

Suggested change (conceptual):
# Determine the number of sequential rounds required, based on the spatial
# bound and the hardware fanout (requires `import math`).
rounds = math.ceil(current_spatial_bound / hardware_fanout)
self.max_latency = rounds * max_nonzero(self.max_latency, other.max_latency)
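A slightly fuller sketch of the same idea, using plain numbers instead of symbolic expressions. Everything here is an assumption made for illustration: `StatsSketch` is a stand-in for ComputeStats, `max_nonzero` is approximated by the built-in `max`, and `apply_overflow_rounds` is a hypothetical method name. The real code in _stats.py may look quite different.

```python
import math
from dataclasses import dataclass


def max_nonzero(a: float, b: float) -> float:
    """Stand-in for the real max_nonzero(); assumed here to behave like max()."""
    return max(a, b)


@dataclass
class StatsSketch:
    """Hypothetical, simplified stand-in for ComputeStats."""

    max_latency: float = 0.0

    def combine_spatial(self, other: "StatsSketch") -> None:
        # Existing behaviour as described in this issue: parallel instances
        # contribute the maximum of their latencies.
        self.max_latency = max_nonzero(self.max_latency, other.max_latency)

    def apply_overflow_rounds(self, spatial_bound: int, hardware_fanout: int) -> None:
        # Proposed addition: scale by the number of sequential rounds
        # needed when the spatial bound exceeds the fanout.
        rounds = math.ceil(spatial_bound / hardware_fanout)
        self.max_latency *= rounds


a = StatsSketch(max_latency=10.0)
b = StatsSketch(max_latency=7.0)
a.combine_spatial(b)
a.apply_overflow_rounds(spatial_bound=512, hardware_fanout=240)
print(a.max_latency)  # 30.0
```

The sketch keeps the scaling in a separate step so the factor is applied once per spatial loop. If combine_spatial happens to be called repeatedly while merging instances, multiplying inside each call could compound the factor; whether that applies depends on how the method is actually invoked.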
