In Section 4 of the program bench paper, the two metrics are defined as:
- % Resolved (primary): fraction of instances where all tests pass
- % Tests Passed (secondary): average fraction of tests passing across instances
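For concreteness, here is a minimal sketch of the two definitions, assuming each instance reports a `(n_resolved, n_tests)` pair; the function names are illustrative, not taken from the codebase:

```python
def pct_resolved(instances: list[tuple[int, int]]) -> float:
    """% Resolved: fraction of instances where every test passes."""
    if not instances:
        return 0.0
    return sum(1 for n_resolved, n_tests in instances if n_resolved == n_tests) / len(instances)

def pct_tests_passed(instances: list[tuple[int, int]]) -> float:
    """% Tests Passed: per-instance pass rate, averaged across instances."""
    if not instances:
        return 0.0
    return sum(n_resolved / n_tests for n_resolved, n_tests in instances) / len(instances)

# Two instances: one fully resolved (10/10), one partial (5/10)
assert pct_resolved([(10, 10), (5, 10)]) == 0.5        # % Resolved
assert pct_tests_passed([(10, 10), (5, 10)]) == 0.75   # % Tests Passed
```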
But the `BatchEvalSummary.resolve_ratio` property (eval_batch.py:83) seems to compute the second one:
```python
@computed_field  # type: ignore[prop-decorator]
@property
def resolve_ratio(self) -> float:
    if not self.summaries:
        return 0.0
    return sum(s.score for s in self.summaries) / len(self.summaries)
```
where `s.score` is `n_resolved / n_tests` per instance. So this is the mean test pass rate (% Tests Passed), not the fraction of fully resolved instances (% Resolved).
Is this intentional? If % Resolved is what `resolve_ratio` is supposed to represent, the correct way to compute it would be:

```python
return sum(1 for s in self.summaries if s.score == 1.0) / len(self.summaries)
```
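As a sanity check, here is a self-contained comparison of the two computations on hypothetical data (`Summary` below is a stand-in with just a `score` field; the real model presumably has more):

```python
from dataclasses import dataclass

@dataclass
class Summary:
    score: float  # n_resolved / n_tests for one instance

summaries = [Summary(1.0), Summary(0.5)]  # one fully resolved, one at 5/10

current = sum(s.score for s in summaries) / len(summaries)               # 0.75
proposed = sum(1 for s in summaries if s.score == 1.0) / len(summaries)  # 0.5

print(current, proposed)  # 0.75 0.5 -- mean pass rate vs. fraction fully resolved
```

One caveat with the proposed version: `s.score == 1.0` is a float equality check. It should be exact here, since a fully resolved instance scores `n_tests / n_tests`, which is exactly `1.0` in IEEE arithmetic, but comparing the integer counts directly (e.g. `n_resolved == n_tests`, if those fields are exposed on the summary) would be more obviously robust.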