Hi,
While working with some hardware configurations with specific fanout limits, I noticed an interesting edge case in the symbolic latency model.
Currently, ComputeStats.combine_spatial in _stats.py uses max_nonzero() to determine max_latency. This is correct when the spatial bound fits within the hardware fanout. However, when the bound exceeds the fanout (e.g., bound=512, fanout=240), the hardware must perform multiple sequential rounds to complete the operation, so the latency should be scaled by ⌈bound/fanout⌉ (here ⌈512/240⌉ = 3).
At the moment, AF models this as a single parallel batch (1 round), which underestimates the total latency even though total_ops (MAC count) remains correct.
I've also noticed that attempting to model this manually by wrapping a Spatial loop in a Temporal loop on the same rank variable causes a crash in the mapping validator during tensor-view reordering.
I believe updating the symbolic engine to naturally handle these overflow rounds would be a great addition to the tool's accuracy. Would you be open to a change in _stats.py to account for this?
Suggested change (conceptual):
# Number of sequential rounds needed when the spatial bound exceeds the
# hardware fanout (1 round when the bound fits within the fanout).
rounds = math.ceil(current_spatial_bound / hardware_fanout)
# Scale the combined latency by the number of rounds.
self.max_latency = rounds * max_nonzero(self.max_latency, other.max_latency)
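To make the proposal concrete, here is a minimal standalone sketch of the scaling with plain integers. It is not the actual _stats.py code: overflow_rounds and combine_spatial_latency are hypothetical names, and max_nonzero below is only a stand-in, since I'm assuming the real helper behaves like max() for non-negative latencies. If the bound or fanout is symbolic in the real engine, a symbolic ceiling would be needed instead of integer arithmetic.

```python
def max_nonzero(a, b):
    # Hypothetical stand-in for the helper used in _stats.py; assumed to
    # behave like max() for non-negative latencies.
    return max(a, b)


def overflow_rounds(spatial_bound: int, fanout: int) -> int:
    """Sequential rounds needed to cover spatial_bound with fanout lanes."""
    if spatial_bound <= 0 or fanout <= 0:
        raise ValueError("spatial_bound and fanout must be positive")
    # Integer ceiling division; avoids float rounding for large bounds.
    return -(-spatial_bound // fanout)


def combine_spatial_latency(lat_a, lat_b, spatial_bound, fanout):
    """Latency of a spatial combine, scaled by the number of overflow rounds."""
    return overflow_rounds(spatial_bound, fanout) * max_nonzero(lat_a, lat_b)


if __name__ == "__main__":
    # The case from this issue: bound=512 on a fanout of 240.
    print(overflow_rounds(512, 240))
    # A bound that fits within the fanout keeps the current behaviour.
    print(combine_spatial_latency(10, 7, 200, 240))
    print(combine_spatial_latency(10, 7, 512, 240))
```

When the bound fits within the fanout the round count is 1, so existing results would not change; only the overflow case gets scaled.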