# Candidate Benchmark Programs

This directory contains the candidate programs for the benchmark suite. They
are candidates, not officially part of the suite yet, because we [intend][rfc]
to record various metrics about the programs and then run a principal component
analysis to find a representative subset of candidates that doesn't contain
effectively duplicate workloads.

[rfc]: https://github.com/bytecodealliance/rfcs/pull/4

## Building

Build an individual benchmark program via:

```
$ ./build.sh path/to/benchmark/dir/
```

Build all benchmark programs by running:

```
$ ./build-all.sh
```

## Minimal Technical Requirements

For the benchmark runner to successfully execute a Wasm program and record its
execution, the program must:

* Export a `_start` function of type `[] -> []`.

* Import `bench.start` and `bench.end` functions, both of type `[] -> []`.

* Call `bench.start` exactly once during the execution of its `_start`
  function. This is when the benchmark runner will start recording execution
  time and performance counters.

* Call `bench.end` exactly once during execution of its `_start` function,
  after `bench.start` has already been called. This is when the benchmark
  runner will stop recording execution time and performance counters. (See the
  sketch after this list.)

* Provide reproducible builds via Docker (see [`build.sh`](./build.sh)).

* Be located in a `sightglass/benchmarks/$BENCHMARK_NAME` directory. Typically
  the benchmark is named `benchmark.wasm`, but benchmarks with multiple files
  should use names like `<benchmark name>-<subtest name>.wasm` (e.g.,
  `libsodium-chacha20.wasm`).

* Input workloads must be files that live in the same directory as the `.wasm`
  benchmark program. The benchmark program is run within the directory where it
  lives on the filesystem, with that directory pre-opened in WASI. The workload
  must be read via a relative file path.

  If, for example, the benchmark processes JSON input, then its input workload
  should live at `sightglass/benchmarks/$BENCHMARK_NAME/input.json`, and it
  should open that file as `"./input.json"`.

* Define the expected `stdout` output in a `./<benchmark name>.stdout.expected`
  sibling file located next to the `benchmark.wasm` file (e.g.,
  `benchmark.stdout.expected`). The runner will assert that the actual
  execution's output matches the expectation.

* Define the expected `stderr` output in a `./<benchmark name>.stderr.expected`
  sibling file located next to the `benchmark.wasm` file. The runner will
  assert that the actual execution's output matches the expectation.

Many of the above requirements can be checked by running the `.wasm` file
through the `validate` command:

```
$ cargo run -- validate path/to/benchmark.wasm
```

## Compatibility Requirements for Native Execution

Sightglass can also measure the performance of a subset of benchmarks compiled
to native code (i.e., not WebAssembly). Compiling these benchmarks without
changing their source code involves a delicate interface with the [native
engine] and imposes some additional requirements beyond the [Minimal Technical
Requirements] noted above:

[native engine]: ../engines/native
[Minimal Technical Requirements]: #minimal-technical-requirements

* Generate an ELF shared library linked against the [native engine] shared
  library, which provides the definitions of `bench_start` and `bench_end`.

* Rename the `main` function to `native_entry`. For C- and C++-based source
  this can be done with a simple define directive passed to `cc` (e.g.,
  `-Dmain=native_entry`); see the sketch after this list.

* Provide reproducible builds via a `Dockerfile.native` file (see
  [`build-native.sh`](./build-native.sh)).

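As a hedged sketch (the `__wasm__` macro test and the comments here illustrate
the idea, not the repository's precise build scheme), a single C source file
can serve both the Wasm and native builds:

```c
#ifdef __wasm__
/* Wasm build: the benchmark runner supplies the `bench` module's imports. */
__attribute__((import_module("bench"), import_name("start")))
void bench_start(void);
__attribute__((import_module("bench"), import_name("end")))
void bench_end(void);
#else
/* Native build: these symbols resolve against the native engine's shared
   library when the benchmark shared library is linked and loaded. */
void bench_start(void);
void bench_end(void);
#endif

/* Built natively with something like `cc -shared -fPIC -Dmain=native_entry`,
   the preprocessor renames this definition to `native_entry`, which the
   native engine then looks up in the resulting ELF shared library. */
int main(void) {
    bench_start();
    /* ... measured work ... */
    bench_end();
    return 0;
}
```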

Note that support for native execution is optional: adding a WebAssembly
benchmark does not imply the need to support its native equivalent, and CI
will not fail if it is not included.

## Additional Requirements

> Note: these requirements are lifted directly from [the benchmarking
> RFC][rfc].

Beyond the minimal technical requirements, a benchmark program that is useful
to Wasmtime and Cranelift developers should also meet the following
requirements:

* Candidates should be real, widely used programs, or at least extracted
  kernels of such programs. These programs are ideally taken from domains
  where Wasmtime and Cranelift are currently used, or domains where they are
  intended to be a good fit (e.g., serverless compute, game plugins, client
  Web applications, server Web applications, audio plugins, etc.).

* A candidate program must be deterministic (modulo Wasm nondeterminism like
  `memory.grow` failure).

* A candidate program must have two associated input workloads: one small and
  one large. The small workload may be used by developers locally to get
  quick, ballpark numbers for whether further investment in an optimization is
  worth it, without waiting for the full, thorough benchmark suite to
  complete.

* Each workload must have an expected result, so that we can validate
  executions and avoid accepting "fast" but incorrect results.

* Compiling and instantiating the candidate program and then executing its
  workload should take *roughly* one to six seconds total.

  > Napkin math: We want the full benchmark to run in a reasonable amount of
  > time, say twenty to thirty minutes, and we want somewhere around ten to
  > twenty programs altogether in the benchmark suite to balance diversity,
  > simplicity, and time spent in execution versus compilation and
  > instantiation. Additionally, for good statistical analyses, we need *at
  > least* 30 samples (ideally more like 100) from each benchmark program.
  > That leaves an average of about one to six seconds for each benchmark
  > program to compile, instantiate, and execute the workload. (See the worked
  > arithmetic after this list.)

* Inputs should be given through I/O and results reported through I/O. This
  ensures that the compiler cannot optimize the benchmark program away.

* Candidate programs should only import WASI functions. They should not depend
  on any other non-standard imports, hooks, or runtime environment.

* Candidate programs must be open source under a license that allows
  redistributing, modifying, and redistributing modified versions. This makes
  distributing the benchmark easy, allows us to rebuild Wasm binaries as new
  versions are released, and lets us do source-level analysis of benchmark
  programs when necessary.

* Repeated executions of a candidate program must yield independent samples
  (ignoring priming Wasmtime's code cache). If the execution times keep
  getting longer and longer, or exhibit harmonics, the samples are not
  independent, and that can invalidate any statistical analyses we perform on
  the results. We can easily check for this property with either [the
  chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) or
  [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test).

* The corpus of candidates should include programs that use a variety of
  languages, compilers, and toolchains.
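
To make the napkin math above concrete, here is the worked arithmetic, using
only the time budgets, program counts, and sample counts quoted in the note:

$$
\frac{30 \times 60\ \text{s}}{10\ \text{programs} \times 30\ \text{samples}} = 6\ \text{s per execution},
\qquad
\frac{20 \times 60\ \text{s}}{20\ \text{programs} \times 100\ \text{samples}} = 0.6\ \text{s per execution},
$$

so each compile-instantiate-execute cycle gets a budget on the order of one to
six seconds, depending on where in those ranges the suite lands.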