🖥️ MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

A benchmark for evaluating the capabilities of computer-use and browser-use agents on macOS. It is a specialized fork of the OSWorld repository, enhanced with tasks from macOSWorld and extended with additional custom evaluation scenarios.

🔔 Updates (June 2026)

🎉 Our work has been accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond @ ICML 2026 (Seoul, South Korea).

We have made the virtual machines publicly available for anyone who wants to test their agents on macOS. You can download the VM image from Hugging Face and import it into UTM.
👉 https://huggingface.co/buckets/macpaw-research/MacArena

Local Machine Setup

🔧 Prerequisites

Before setting up MacArena, ensure you have the following installed on your machine:

macOS
Homebrew package manager
Conda or Miniconda
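
The prerequisites above can be checked before installation with a few lines of Python. This is a minimal sketch and not part of the repository: the function name `missing_prerequisites` is hypothetical, and it assumes Homebrew and Conda expose the executables `brew` and `conda` on your `PATH`.

```python
import platform
import shutil
from typing import Callable, List, Optional


def missing_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
    system: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # platform.system() reports "Darwin" on macOS.
    if (system or platform.system()) != "Darwin":
        problems.append("macOS is required")
    # Assumed executable names for Homebrew and Conda/Miniconda.
    for tool in ("brew", "conda"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `which` and `system` parameters exist only so the check can be exercised without touching the real machine.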

🚀 Installation

1. Install Required Tools

brew install --cask utm

2. Download VM Image

Download the VM image from Hugging Face and import it into UTM.

3. Setup Python Environment

# Create and activate conda environment
conda env create --name macarena --file environment.yml
conda activate macarena

⚡ Quick Start

Run the benchmark with default settings:

python3 macos_test.py --dir ./evaluation_examples/osworld/

📖 Usage

Basic Command

To run the benchmark on the VM in manual mode, use the following command:

python3 macos_test.py [OPTIONS]

To run the benchmark automatically, use:

python3 -m runners.run_general \
    --path_to_vm "macarena" \
    --max_steps 15 \
    --max_trajectory_length 15 \
    --history_n 15 \
    --test_all_meta_path "evaluation_examples/test_macarena_second_half.json" \
    --base_url "<url_to_model>" \
    --model "ByteDance-Seed/UI-TARS-1.5-7B" \
    --sleep_after_execution 3.0 \
    --max_pixels 2352000 ; # 3000 * 28 * 28
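
The `--max_pixels` value in the command is the product given in its comment, 3000 * 28 * 28. The sketch below reproduces that arithmetic and shows how a screenshot could be compared against such a budget. How the runner actually applies `max_pixels` is not described here, so `downscale_factor` and the constant names are hypothetical illustrations only.

```python
import math

BLOCK_SIDE = 28    # side length of one square block, taken from the comment
MAX_BLOCKS = 3000  # number of blocks, taken from the comment
MAX_PIXELS = MAX_BLOCKS * BLOCK_SIDE * BLOCK_SIDE  # 2352000


def downscale_factor(width: int, height: int, max_pixels: int = MAX_PIXELS) -> float:
    """Return a uniform scale (at most 1.0) that keeps width * height within max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    # Scaling both sides by s scales the area by s squared.
    return math.sqrt(max_pixels / pixels)
```

Under this assumption a 1920x1080 screenshot (2,073,600 pixels) fits the budget unchanged, while a 2560x1600 screenshot (4,096,000 pixels) would need to be scaled down.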

Also, create a .env file in the root directory with the following content:

API_KEY=<your_api_key_to_access_model_if_needed>
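
For reference, a `.env` file in this format can be read with the standard library alone. The sketch below is an assumption about usage, not the loader used by the runners (which may rely on a dedicated package): `parse_env_text` and `load_api_key` are hypothetical helper names.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path: str = ".env") -> Optional[str]:
    """Return API_KEY from the process environment, else from the .env file."""
    if "API_KEY" in os.environ:
        return os.environ["API_KEY"]
    path = Path(env_path)
    if not path.is_file():
        return None
    return parse_env_text(path.read_text(encoding="utf-8")).get("API_KEY")
```

Checking the process environment first lets a key exported in the shell take precedence over the file.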

Command Line Options

Parameter	Type	Default	Description
`--vm_name`	`string`	`osworld`	Name/path of the VM to use. Use osworld for OSWorld/macOSWorld tasks and MacArena for macarena tasks
`--id`	`integer`	`0`	ID of the first example to run
`--dir`	`string`	`None`	Directory containing evaluation examples. Required

More options can be found by running:

python3 macos_test.py --help
python3 -m runners.run_general --help
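
To make the three documented options concrete, here is how they could be declared with `argparse`. This is an illustrative sketch based only on the table above, not the parser in `macos_test.py`; `build_parser` and `parse_options` are hypothetical names.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options listed in the table, with the same defaults."""
    parser = argparse.ArgumentParser(description="Illustrative MacArena-style CLI")
    parser.add_argument("--vm_name", type=str, default="osworld",
                        help="Name/path of the VM to use")
    parser.add_argument("--id", type=int, default=0,
                        help="ID of the first example to run")
    parser.add_argument("--dir", type=str, default=None,
                        help="Directory containing evaluation examples")
    return parser


def parse_options(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --dir defaults to None but must be supplied, so reject a missing value.
    if args.dir is None:
        parser.error("--dir is required")
    return args
```

For example, `parse_options(["--dir", "./evaluation_examples/macosworld/", "--id", "5"])` yields a namespace with `id` set to 5 and `vm_name` left at `osworld`.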

Examples

# Run all examples from macosworld
python3 macos_test.py --dir ./evaluation_examples/macosworld/

# Start from a specific example ID
python3 macos_test.py --id 5 --dir ./evaluation_examples/macosworld/multi_apps/

# Use a different evaluation directory
python3 macos_test.py --dir ./evaluation_examples/macosworld/productivity/

# Combine multiple options
python3 macos_test.py --vm_name macarena --id 10 --dir ./evaluation_examples/macosworld/multi_apps/
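
The same invocations can be scripted when several directories need to be evaluated in a row. The sketch below is a hypothetical wrapper, assuming it is run from the repository root and that consecutive runs against the same VM are acceptable for your setup; `build_command` and `run_all` are not part of the repository.

```python
import subprocess
import sys
from typing import Iterable, List


def build_command(directory: str, vm_name: str = "osworld", start_id: int = 0) -> List[str]:
    """Build the macos_test.py invocation using the documented options."""
    # sys.executable keeps the run inside the active conda environment.
    return [
        sys.executable, "macos_test.py",
        "--vm_name", vm_name,
        "--id", str(start_id),
        "--dir", directory,
    ]


def run_all(directories: Iterable[str], vm_name: str = "osworld") -> None:
    """Run each directory in turn, stopping at the first failing run."""
    for directory in directories:
        subprocess.run(build_command(directory, vm_name), check=True)
```

Calling `run_all(["./evaluation_examples/macosworld/productivity/", "./evaluation_examples/macosworld/multi_apps/"])` would launch the two runs one after the other.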

🔧 Configuration

Evaluation Directories

The benchmark includes several pre-configured evaluation categories:

./evaluation_examples/osworld/: Tasks from the original OSWorld benchmark, ported to macOS
./evaluation_examples/macosworld/: Tasks from the macOSWorld benchmark
./evaluation_examples/macarena/: Custom tasks created specifically for the MacArena benchmark

Distribution of Examples

osworld: 221
  - multi_apps: 58
  - libreoffice_calc: 45
  - libreoffice_writer: 23
  - chrome: 32
  - gimp: 24
  - vs_code: 19
  - os: 12
  - thunderbird: 8

macosworld: 151
  - sys_apps: 32
  - productivity: 30
  - file_management: 27
  - sys_and_interface: 25
  - multi_apps: 20
  - media: 10
  - advanced: 7

macarena: 49
  - advanced_apps: 9
  - file_management: 6
  - productivity: 14
  - system_and_interface: 8
  - system_apps: 12
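
The per-category counts above can be summed to check the suite totals and the overall size of the benchmark. The snippet is a small sketch that copies the numbers from this section; the dictionary layout and the name `suite_totals` are illustrative choices.

```python
from typing import Dict

DISTRIBUTION: Dict[str, Dict[str, int]] = {
    "osworld": {
        "multi_apps": 58, "libreoffice_calc": 45, "libreoffice_writer": 23,
        "chrome": 32, "gimp": 24, "vs_code": 19, "os": 12, "thunderbird": 8,
    },
    "macosworld": {
        "sys_apps": 32, "productivity": 30, "file_management": 27,
        "sys_and_interface": 25, "multi_apps": 20, "media": 10, "advanced": 7,
    },
    "macarena": {
        "advanced_apps": 9, "file_management": 6, "productivity": 14,
        "system_and_interface": 8, "system_apps": 12,
    },
}


def suite_totals(distribution: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the category counts of each suite."""
    return {suite: sum(categories.values()) for suite, categories in distribution.items()}


totals = suite_totals(DISTRIBUTION)
print(totals)                # {'osworld': 221, 'macosworld': 151, 'macarena': 49}
print(sum(totals.values()))  # 421
```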

Legal Disclaimer

This benchmark is intended for research and evaluation purposes on personal computers. By using it, you agree to the following:

macOS EULA: This benchmark runs macOS inside a virtual machine. Use of macOS is subject to the applicable Apple Software License Agreement. You are responsible for ensuring you have a valid license to run macOS on your hardware.
UTM: The virtual machine environment uses UTM, which is licensed under the Apache License 2.0. Download and use of UTM is governed by its own license terms. This benchmark does not redistribute UTM — users must download it independently from https://github.com/utmapp/UTM.
Third-party applications: Some benchmark tasks involve automating third-party macOS applications. You must hold valid licenses for any such applications on your own device. This benchmark does not provide or endorse any unlicensed software.
Non-commercial tasks: Tasks sourced from macOSWorld are licensed under CC BY-NC 4.0 and may not be used for commercial purposes.

MacPaw Way Ltd. provides this benchmark "as is" and shall not be held liable for any misuse, licensing violations, or damages arising from the use of this tool.

📚 Citation

If you use MacArena in your research, please cite the following paper:

@misc{muryn-etal-2026-aiwild-macarena,
  author      = {Victor Muryn and Maksym Shamrai and Sofiia Mazepa and Yehor Khodysko},
  title       = {MacArena: Benchmarking Computer Use Agents on an Online macOS Environment},
  month       = {June},
  year        = {2026},
  eprint      = {2606.06560},
  eprinttype  = {arxiv},
  eprintclass = {cs.LG},
  url         = {https://arxiv.org/abs/2606.06560},
  urldate     = {2026-06-08},
  note        = {\emph{Accepted to AIWILD @ ICML 2026}},
}
