A benchmark for evaluating the capabilities of computer-use and browser-use agents on macOS. This is a specialized fork of the OSWorld repository, enhanced with tasks from macOSWorld and extended with additional custom evaluation scenarios.
Updates (June 2026)
- Our work has been accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond @ ICML 2026 (Seoul, South Korea).
- We have publicly released the virtual machine image for anyone interested in testing their agents on macOS. You can download the VM image from Hugging Face and import it into UTM.
https://huggingface.co/buckets/macpaw-research/MacArena
Before setting up MacArena, ensure you have the following installed on your machine:
- macOS
- Homebrew package manager
- Conda or Miniconda
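The prerequisites above can be checked before going further. The following is a minimal sketch, not part of the MacArena repository: the helper names `missing_prerequisites` and `is_macos` are hypothetical, and the sketch assumes that Homebrew and Conda expose the `brew` and `conda` executables on `PATH`.

```python
import platform
import shutil


def missing_prerequisites(tools=("brew", "conda")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def is_macos():
    """Return True when the host operating system reports itself as Darwin."""
    return platform.system() == "Darwin"


if __name__ == "__main__":
    if not is_macos():
        print("Warning: this host does not appear to be macOS")
    missing = missing_prerequisites()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed command-line tools were found on PATH")
```

An empty list from `missing_prerequisites()` only means the executables were found; it does not verify their versions.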
brew install --cask utm
Download the VM image from Hugging Face and import it into UTM.
# Create and activate conda environment
conda env create --name macarena --file environment.yml
conda activate macarena
Run the benchmark with default settings:
python3 macos_test.py --dir ./evaluation_examples/osworld/
To run the benchmark on a VM in manual mode, use the following command:
python3 macos_test.py [OPTIONS]
To run the benchmark automatically, use:
python3 -m runners.run_general \
--path_to_vm "macarena" \
--max_steps 15 \
--max_trajectory_length 15 \
--history_n 15 \
--test_all_meta_path "evaluation_examples/test_macarena_second_half.json" \
--base_url "<url_to_model>" \
--model "ByteDance-Seed/UI-TARS-1.5-7B" \
--sleep_after_execution 3.0 \
--max_pixels 2352000 ; # 3000 * 28 * 28
Also, create a .env file in the root directory with the following content:
API_KEY=<your_api_key_to_access_model_if_needed>

| Parameter | Type | Default | Description |
|---|---|---|---|
| --vm_name | string | osworld | Name/path of the VM to use. Use osworld for OSWorld/macOSWorld tasks and MacArena for MacArena tasks. |
| --id | integer | 0 | ID of the first example to run. |
| --dir | string | None | Directory containing evaluation examples. You must specify it. |
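To make the parameter table concrete, here is a minimal sketch of an `argparse` parser with the same three options and defaults. It is an illustration only and is not taken from `macos_test.py`; the function names `build_parser` and `parse_options` are hypothetical, and the real script may define or validate its options differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three documented options."""
    parser = argparse.ArgumentParser(prog="macos_test.py")
    parser.add_argument("--vm_name", type=str, default="osworld",
                        help="Name/path of the VM to use")
    parser.add_argument("--id", type=int, default=0,
                        help="ID of the first example to run")
    parser.add_argument("--dir", type=str, default=None,
                        help="Directory containing evaluation examples")
    return parser


def parse_options(argv):
    """Parse argv and reject a missing --dir, which the table marks as mandatory."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.dir is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--dir must be specified")
    return args
```

For example, `parse_options(["--dir", "./evaluation_examples/osworld/"])` yields `vm_name == "osworld"` and `id == 0`, matching the defaults in the table.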
More options can be found by running:
python3 macos_test.py --help
python3 -m runners.run_general --help
# Run all examples from macosworld
python3 macos_test.py --dir ./evaluation_examples/macosworld/
# Start from a specific example ID
python3 macos_test.py --id 5 --dir ./evaluation_examples/macosworld/multi_apps/
# Use a different evaluation directory
python3 macos_test.py --dir ./evaluation_examples/macosworld/productivity/
# Combine multiple options
python3 macos_test.py --vm_name macarena --id 10 --dir ./evaluation_examples/macosworld/multi_apps/
The benchmark includes several pre-configured evaluation categories:
- ./evaluation_examples/osworld/: Tasks from original OSWorld benchmark that have been transferred to macOS
- ./evaluation_examples/macosworld/: Tasks from macosworld benchmark
- ./evaluation_examples/macarena/: Custom tasks created specifically for MacArena benchmark
osworld: 221
- multi_apps: 58
- libreoffice_calc: 45
- libreoffice_writer: 23
- chrome: 32
- gimp: 24
- vs_code: 19
- os: 12
- thunderbird: 8
macosworld: 151
- sys_apps: 32
- productivity: 30
- file_management: 27
- sys_and_interface: 25
- multi_apps: 20
- media: 10
- advanced: 7
MacArena tasks:
- advanced_apps: 9
- file_management: 6
- productivity: 14
- system_and_interface: 8
- system_apps: 12
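The counts above can be compared against a local checkout. The sketch below assumes that each task is stored as one `.json` file directly inside its category subdirectory (for example `./evaluation_examples/macosworld/media/`); that layout is an assumption and should be checked against the repository. The names `DOCUMENTED_COUNTS`, `count_tasks`, and `diff_counts` are hypothetical helpers, not part of MacArena.

```python
from pathlib import Path

# Per-category task counts as listed in this README.
DOCUMENTED_COUNTS = {
    "osworld": {
        "multi_apps": 58, "libreoffice_calc": 45, "libreoffice_writer": 23,
        "chrome": 32, "gimp": 24, "vs_code": 19, "os": 12, "thunderbird": 8,
    },
    "macosworld": {
        "sys_apps": 32, "productivity": 30, "file_management": 27,
        "sys_and_interface": 25, "multi_apps": 20, "media": 10, "advanced": 7,
    },
    "macarena": {
        "advanced_apps": 9, "file_management": 6, "productivity": 14,
        "system_and_interface": 8, "system_apps": 12,
    },
}


def count_tasks(benchmark_dir):
    """Count *.json files in each immediate subdirectory (assumed layout)."""
    root = Path(benchmark_dir)
    return {
        sub.name: sum(1 for _ in sub.glob("*.json"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def diff_counts(found, expected):
    """Return {category: (found, expected)} for every category that differs."""
    categories = set(found) | set(expected)
    return {
        name: (found.get(name, 0), expected.get(name, 0))
        for name in sorted(categories)
        if found.get(name, 0) != expected.get(name, 0)
    }
```

If the assumed layout holds, `diff_counts(count_tasks("./evaluation_examples/macosworld/"), DOCUMENTED_COUNTS["macosworld"])` returns an empty dictionary when the checkout matches the listed counts.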
This benchmark is intended for research and evaluation purposes on personal computers. By using it, you agree to the following:
- macOS EULA: This benchmark runs macOS inside a virtual machine. Use of macOS is subject to the applicable Apple Software License Agreement. You are responsible for ensuring you have a valid license to run macOS on your hardware.
- UTM: The virtual machine environment uses UTM, which is licensed under the Apache License 2.0. Download and use of UTM is governed by its own license terms. This benchmark does not redistribute UTM; users must download it independently from https://github.com/utmapp/UTM.
- Third-party applications: Some benchmark tasks involve automating third-party macOS applications. You must hold valid licenses for any such applications on your own device. This benchmark does not provide or endorse any unlicensed software.
- Non-commercial tasks: Tasks sourced from macOSWorld are licensed under CC BY-NC 4.0 and may not be used for commercial purposes.
MacPaw Way Ltd. provides this benchmark "as is" and shall not be held liable for any misuse, licensing violations, or damages arising from the use of this tool.
If you use MacArena in your research, please cite the following paper:
@misc{muryn-etal-2026-aiwild-macarena,
author = {Victor Muryn and Maksym Shamrai and Sofiia Mazepa and Yehor Khodysko},
title = {MacArena: Benchmarking Computer Use Agents on an Online macOS Environment},
month = {June},
year = {2026},
eprint = {2606.06560},
eprinttype = {arxiv},
eprintclass = {cs.LG},
url = {https://arxiv.org/abs/2606.06560},
urldate = {2026-06-08},
note = {\emph{Accepted to AIWILD @ ICML 2026}},
}