idea: agent performance benchmark

It's important to know that your tool reduces token usage without degrading agent performance. One idea is to reuse an existing benchmark like this one:
https://github.com/SWE-rebench/SWE-rebench-V2

I've been using it on a subset of 50 tasks to benchmark Qwen models locally in Pi. It's relatively cheap to run, or free in my case.
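
To make the comparison concrete, here is a rough sketch of how two runs over the same task subset (one without the tool, one with it) could be compared. Everything in it is an assumption for illustration: `TaskResult`, `summarize`, and `compare` are hypothetical names, not part of SWE-rebench or Pi, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str    # hypothetical task identifier
    resolved: bool  # did the agent's patch pass the task's tests?
    tokens: int     # total tokens the agent used on this task


def summarize(results):
    """Return (resolve_rate, total_tokens) for one benchmark run."""
    if not results:
        raise ValueError("no results to summarize")
    resolved = sum(r.resolved for r in results)
    tokens = sum(r.tokens for r in results)
    return resolved / len(results), tokens


def compare(baseline, with_tool):
    """Compare two runs that cover the same set of tasks."""
    if {r.task_id for r in baseline} != {r.task_id for r in with_tool}:
        raise ValueError("runs must cover the same tasks")
    base_rate, base_tokens = summarize(baseline)
    tool_rate, tool_tokens = summarize(with_tool)
    return {
        # negative means the tool made the agent solve fewer tasks
        "resolve_rate_delta": tool_rate - base_rate,
        # fraction of baseline tokens saved by the tool
        "token_reduction": 1 - tool_tokens / base_tokens,
    }


# Made-up numbers, only to show the shape of the data.
baseline = [
    TaskResult("task-a", True, 1000),
    TaskResult("task-b", False, 2000),
    TaskResult("task-c", True, 1000),
]
with_tool = [
    TaskResult("task-a", True, 500),
    TaskResult("task-b", False, 1500),
    TaskResult("task-c", True, 1000),
]
report = compare(baseline, with_tool)
print(report)
```

With only 50 tasks, one task flipping between resolved and unresolved moves the resolve rate by 2 percentage points, so small deltas are probably noise and repeating each run a few times would help.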