Python package for music source separation.
If you want the CUDA build of PyTorch, install it first:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
```

For CLI and Python API usage, install:

```bash
pip install pymss
```

If you need API or WebUI, install this instead:

```bash
pip install "pymss[server]"
```

Development requires Git, Python 3.10 or later, and uv. WebUI development also requires Node.js and npm.
Clone the Python package repository and install development dependencies:

```bash
git clone https://github.com/pymss-project/pymss
cd pymss
uv sync --group dev
```

If you need to develop or locally serve the WebUI, the WebUI source lives in a separate repository and must be built with Node.js:

```bash
git clone https://github.com/pymss-project/pymss-webui
cd pymss-webui
npm ci
npm run build
```

Copy the built WebUI assets into the Python package checkout:

```bash
cp -R dist/. ../pymss/server/webui_static/
```

Build source and wheel distributions from the Python package checkout:

```bash
cd ..
uv build
```

The test suite uses pytest. The migrated integration tests live in `test/` and are parameterized through `test/test_all.py`. They require local model weights, configs, and input audio; tests whose assets are missing are skipped automatically.

```bash
uv run pytest test -q
```

Run inference by catalog model name. If the model, config, or auxiliary files are missing locally, the CLI downloads them automatically before inference.

```bash
pymss infer bs_roformer_voc_hyperacev2 \
  -i path/to/input_file_or_folder \
  -o results \
  --device auto \
  --format wav
```

`--device auto` uses CUDA first when an NVIDIA GPU is available. On Apple Silicon it uses the MLX backend by default. Use `--device mlx` to force MLX, or `--device mps` to force PyTorch MPS.
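
The selection order described above can be sketched in plain Python. This is only an illustration under stated assumptions: `resolve_device` and its boolean arguments are hypothetical names, not part of the pymss API, and the final CPU fallback is an assumption that the text does not confirm.

```python
def resolve_device(requested: str, has_cuda: bool, is_apple_silicon: bool) -> str:
    """Hypothetical helper mirroring the documented --device behaviour."""
    # An explicit choice such as "mlx", "mps", "cuda" or "cpu" is used as given.
    if requested != "auto":
        return requested
    # "auto" prefers CUDA when an NVIDIA GPU is available.
    if has_cuda:
        return "cuda"
    # On Apple Silicon, "auto" picks the MLX backend by default.
    if is_apple_silicon:
        return "mlx"
    # Assumption: anything else falls back to the CPU.
    return "cpu"


if __name__ == "__main__":
    print(resolve_device("auto", has_cuda=False, is_apple_silicon=True))
    print(resolve_device("mps", has_cuda=False, is_apple_silicon=True))
```

Passing an explicit value bypasses the probing entirely, which is why `--device mps` can force PyTorch MPS on a machine where `auto` would choose MLX.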
The default download source is ModelScope. You can choose another source or model directory:

```bash
pymss --model-dir /path/to/models infer bs_roformer_voc_hyperacev2 \
  --source hf-mirror \
  -i path/to/input_file_or_folder \
  -o results
```

When running from a source checkout without installation, use `python -m pymss.cli` instead of `pymss`.

```bash
pymss ensemble path/to/model_a_vocals.wav path/to/model_b_vocals.wav \
  --algorithm avg_wave \
  --weights 1 0.8 \
  -o results/ensemble_vocals.wav
```

Available algorithms are `avg_wave`, `median_wave`, `min_wave`, `max_wave`, `avg_fft`, `median_fft`, `min_fft`, and `max_fft`. Input files must use the same sample rate and channel count. Files with different lengths are truncated to the shortest input. If `--weights` is omitted, every input uses weight 1.
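
To make the truncation and weighting rules concrete, here is a minimal pure-Python sketch of a weighted waveform average. It is not the pymss implementation: the function name `avg_wave` is borrowed from the algorithm name for illustration, the inputs are plain lists of mono samples, and dividing by the sum of the weights is an assumption.

```python
from typing import Optional, Sequence


def avg_wave(waves: Sequence[Sequence[float]],
             weights: Optional[Sequence[float]] = None) -> list:
    """Weighted average of mono sample sequences (illustrative only)."""
    if not waves:
        raise ValueError("at least one input is required")
    # Omitted weights mean every input gets weight 1.
    if weights is None:
        weights = [1.0] * len(waves)
    if len(weights) != len(waves):
        raise ValueError("need exactly one weight per input")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Inputs of different lengths are truncated to the shortest one.
    n = min(len(w) for w in waves)
    # Assumption: the weighted sum is normalised by the sum of the weights.
    return [
        sum(wt * wave[i] for wt, wave in zip(weights, waves)) / total
        for i in range(n)
    ]


if __name__ == "__main__":
    a = [1.0, 1.0, 1.0]
    b = [0.0, 0.0]
    print(avg_wave([a, b], weights=[3, 1]))
```

The third sample of `a` is dropped because `b` has only two samples, matching the truncation rule stated above.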
Install the optional server dependencies to run an HTTP server with dynamic model loading, catalog browsing, model downloads, and an optional browser WebUI:

```bash
pip install "pymss[server]"
pymss serve --webui
```

See server CLI docs, server API docs, and server error docs for details.
Use a catalog model name directly. You do not need to pass `model_type`, `model_path`, or `config_path`.

```python
from pymss import MSSeparator

separator = MSSeparator.from_model_name(
    "bs_roformer_voc_hyperacev2",
    download=True,
    device="auto",
    output_format="wav",
    store_dirs="results",
)
separator.process_folder("path/to/input_file_or_folder")
```

`download=True` downloads missing model files before loading. Omit it for strict local-only loading.
`MSSeparator` can also be used as a context manager. Leaving the `with` block automatically calls `separator.close()`, which releases model references and clears backend caches where possible.

```python
from pymss import MSSeparator

with MSSeparator.from_model_name(
    "bs_roformer_voc_hyperacev2",
    download=True,
    device="auto",
    output_format="wav",
    store_dirs="results",
) as separator:
    separator.process_folder("path/to/input_file_or_folder")
```

Use the full constructor for custom weights that are not in the model catalog.

```python
from pymss import MSSeparator, get_separation_logger

# Initialize the separator with explicit model and config files.
separator = MSSeparator(
    model_type='htdemucs',
    model_path='path/to/model',
    config_path='path/to/config',
    device='cuda',
    device_ids=[0],
    output_format='wav',
    use_tta=True,
    store_dirs={
        # Keys must match the stem names in the config file.
        "vocals": "./output/vocals",
        # A stem set to None (or left out) produces no output file. Here only
        # the vocals stem is written and the other (instrumental) stem is skipped.
        "other": None,
    },
    audio_params={
        "wav_bit_depth": "FLOAT",
        "flac_bit_depth": "PCM_24",
        "mp3_bit_rate": "320k",
        "m4a_bit_rate": "192k",
        "m4a_aac_at_quality": 2,
    },  # Can be omitted
    logger=get_separation_logger(),  # Can be omitted
    debug=False,  # Can be omitted
    inference_params={
        "batch_size": 4,
        "overlap_size": 512,
        "chunk_size": 1024,
        "standardize": True,
        "normalize": False,
    },  # Can be omitted
)

# Process all audio files in the folder.
separator.process_folder('path/to/input_folder')
```

For a detailed explanation of every `MSSeparator` argument, see the MSSeparator parameter guide.

- model_type: The type of model, e.g., 'htdemucs'. Must be one of ['bs_roformer', 'mel_band_roformer', 'htdemucs', 'mdx23c', 'bandit', 'bandit_v2', 'scnet', 'apollo', 'vr']
- model_path: The path to the model file.
- config_path: The path to the configuration file.
- device: The type of device, default is 'auto'. Must be one of ['auto', 'cuda', 'mps', 'cpu']
- device_ids: List of device IDs, default is [0].
- output_format: The output audio format, default is 'wav'. Must be one of ['wav', 'flac', 'mp3', 'm4a']
- use_tta: Whether to use test-time augmentation (TTA), default is False. TTA triples the processing time for a small improvement in quality.
- store_dirs: Storage directories, can be a single folder path or a dictionary with instrument keys.
- audio_params: Audio parameters including wav_bit_depth, flac_bit_depth, mp3_bit_rate, m4a_bit_rate, and m4a_aac_at_quality. Default is {"wav_bit_depth": "FLOAT", "flac_bit_depth": "PCM_24", "mp3_bit_rate": "320k", "m4a_bit_rate": "192k", "m4a_aac_at_quality": 2}.
- logger: Logger instance. Default is pymss.get_separation_logger()
- debug: Whether to enable debug mode, default is False.
- inference_params: Inference parameters including `batch_size`, `overlap_size`, `chunk_size`, `standardize`, `normalize`, and `cuda_attention_backend`. `standardize` controls model input standardization and defaults to the model config's `inference.normalize` value, or `False` when missing. `normalize` controls linked output peak normalization for all returned stems. For `model_type='vr'`, supported keys are `batch_size`, `window_size`, `aggression`, `enable_tta`, `enable_post_process`, `post_process_threshold`, `high_end_process`, and output `normalize`.
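
The two accepted shapes of `store_dirs` can be pictured with a small sketch. `resolve_store_dirs` is a hypothetical helper written for this illustration, not a pymss function, and it assumes that a single folder path applies to every stem the model produces.

```python
from typing import Dict, Mapping, Optional, Sequence, Union


def resolve_store_dirs(
    store_dirs: Union[str, Mapping[str, Optional[str]]],
    stems: Sequence[str],
) -> Dict[str, str]:
    """Map each stem that should be written to its output folder."""
    # Assumption: a single folder path applies to every stem.
    if isinstance(store_dirs, str):
        return {stem: store_dirs for stem in stems}
    # With a dictionary, stems mapped to None or left out are not written.
    return {
        stem: store_dirs[stem]
        for stem in stems
        if store_dirs.get(stem) is not None
    }


if __name__ == "__main__":
    stems = ["vocals", "other"]
    print(resolve_store_dirs("results", stems))
    print(resolve_store_dirs({"vocals": "./output/vocals", "other": None}, stems))
```

In the dictionary form only the stems with a real path survive, which is how a single stem can be exported while the rest are discarded.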
RoFormer-family models default to cuDNN attention on CUDA when the installed PyTorch build exposes it, otherwise they use PyTorch's default SDPA path. Override with inference_params={"cuda_attention_backend": "auto"} if you want fallback probing. Valid values are auto, default, flash, cudnn, efficient, math, and xformers. auto tries cuDNN attention first, then PyTorch memory-efficient SDPA, then PyTorch default SDPA. xformers is optional and only used if installed locally; it is not a required dependency.
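
The `auto` probing order can be expressed as a short fallback loop. The sketch below is an illustration only: `pick_backend` and the `probes` mapping are hypothetical, and real availability checks would need to query the installed PyTorch build, which is not shown here.

```python
from typing import Callable, Dict, Sequence

# Probe order that the text gives for "auto".
AUTO_ORDER = ("cudnn", "efficient", "default")


def pick_backend(requested: str,
                 probes: Dict[str, Callable[[], bool]],
                 order: Sequence[str] = AUTO_ORDER) -> str:
    """Return the backend to use; `probes` maps a name to an availability check."""
    # Any explicit value other than "auto" is taken as-is.
    if requested != "auto":
        return requested
    for name in order:
        probe = probes.get(name)
        # A missing probe, a False result or an exception all count as unavailable.
        try:
            if probe is not None and probe():
                return name
        except Exception:
            continue
    # The last entry (PyTorch's default SDPA path) is the final fallback.
    return order[-1]


if __name__ == "__main__":
    probes = {"cudnn": lambda: False, "efficient": lambda: True}
    print(pick_backend("auto", probes))
    print(pick_backend("flash", probes))
```

Treating a raised exception the same as an unavailable backend keeps the probe loop from aborting inference setup when one backend is only partially supported.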
Use `device='mlx'` to run the Apple Silicon MLX backend:

```python
separator = MSSeparator.from_model_name(
    "bs_roformer_voc_hyperacev2",
    download=True,
    device="mlx",
    output_format="wav",
    store_dirs="results",
)
```

On Apple Silicon, `pyproject.toml` installs `mlx>=0.31.0` for this backend. If MLX is missing or a non-VR backend fails, the model records `_pymss_mlx_full_backend_error` and falls back to Torch MPS. Advanced users can still override `mps_model_backend` and `mps_model_compute_dtype` through `inference_params`.
HTDemucs checkpoints whose config uses model: htdemucs and htdemucs.cac: true are supported through model_type='htdemucs'.
Legacy Demucs/TasNet .th weights can use model_type='legacy_demucs' or model_type='legacy_tasnet' without an MSST YAML config. The dependency-free legacy loader supports classic Demucs, v3 time-domain Demucs, ConvTasNet, CaC HDemucs, package-style HTDemucs, multi-frequency CaC HDemucs, and simple Demucs bag YAML files. DiffQ-quantized checkpoints and non-CaC/Wiener HDemucs still need a dedicated legacy loader.
UVR VR support is available for the supported UVR/VR series .pth weights. Use the catalog model name in the same CLI/API paths as other models. The output stems are read from the built-in VR model list, for example Vocals, Instrumental, No Echo, or Echo.

```bash
pymss infer 1_HP-UVR \
  -i path/to/input_folder \
  -o results \
  --device auto \
  --param batch_size=2 \
  --param window_size=512 \
  --param aggression=5
```

```python
separator = MSSeparator.from_model_name(
    "1_HP-UVR",
    download=True,
    device="auto",
    output_format="wav",
    store_dirs="results",
    inference_params={
        "batch_size": 2,
        "window_size": 512,
        "aggression": 5,
    },
)
separator.process_folder("path/to/input_folder")
```

Some model configs downloaded from Hugging Face or MSST-WebUI use `inference.num_overlap`. This optimized pymss path uses `inference.overlap_size` instead. If the config only has `num_overlap`, add an explicit `overlap_size` or pass it through `inference_params`; otherwise pymss falls back to 50% overlap and inference will be much slower.
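
A rough way to see why the overlap matters is to count how many chunks the model has to process. The sketch below is an estimate under stated assumptions, not the pymss chunking code: `count_windows` is a hypothetical helper, consecutive chunks are assumed to start `chunk_size - overlap_size` samples apart, and the input is assumed to be one hour of 44.1 kHz audio.

```python
import math


def count_windows(total_samples: int, chunk_size: int, overlap_size: int) -> int:
    """Number of chunks needed to cover the audio (illustrative only)."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    if total_samples <= chunk_size:
        return 1
    # Assumption: consecutive chunks start chunk_size - overlap_size samples apart.
    hop = chunk_size - overlap_size
    return math.ceil((total_samples - chunk_size) / hop) + 1


if __name__ == "__main__":
    one_hour = 3600 * 44100  # assumed 44.1 kHz input
    chunk = 480000
    half = count_windows(one_hour, chunk, chunk // 2)  # 50% overlap
    small = count_windows(one_hour, chunk, 24000)      # 5% overlap
    print(half, small, round(half / small, 2))
```

Under these assumptions the 50% fallback needs 661 chunks against 349 at 5% overlap, so the model runs about 1.9 times as many forward passes for the same audio.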
Recommended fast setting:

```yaml
audio:
  chunk_size: 480000
inference:
  batch_size: 2
  overlap_size: 24000 # 5% of chunk_size
```

Measured on an NVIDIA GeForce RTX 5090 with PyTorch 2.9.1+cu128, CUDA 12.8, no TTA, one warmup and three measured runs.

| model | type | RTFx | 1-hour audio |
|---|---|---|---|
| BS-Roformer-HyperACE_v2_voc | bs_roformer | 231.83x | 15.5s |
| model_bs_roformer_ep_368_sdr_12.9628 | bs_roformer | 109.06x | 33.0s |
| logic_bs_roformer | bs_roformer | 159.71x | 22.5s |
| mel-band-roformer-deux | mel_band_roformer | 169.93x | 21.2s |
| Mel-Band-Roformer-big | mel_band_roformer | 194.05x | 18.6s |
| model_vocals_mdx23c_sdr_10.17 | mdx23c | 209.41x | 17.2s |
| HTDemucs4 | htdemucs | 200.52x | 18.0s |
| scnet_checkpoint_musdb18 | scnet | 356.85x | 10.1s |
| model_bandit_plus_dnr_sdr_11.47 | bandit | 122.76x | 29.3s |
| checkpoint-multi_state_dict | bandit_v2 | 112.33x | 32.0s |
| Apollo_LQ_MP3_restoration | apollo | 100.62x | 35.8s |

VR models were measured with batch_size=2, window_size=512, aggression=5, TTA off, post-processing off.

| VR model | RTFx | 1-hour audio |
|---|---|---|
| UVR-DeNoise-Lite | 243.62x | 14.8s |
| Harmonic_Noise_Separation_yxlllc | 221.22x | 16.3s |
| MGM_HIGHEND_v4 | 217.39x | 16.6s |
| MGM_LOWEND_A_v4 | 133.67x | 26.9s |
| MGM_MAIN_v4 | 118.56x | 30.4s |
| 11_SP-UVR-2B-32000-2 | 109.73x | 32.8s |
| 10_SP-UVR-2B-32000-1 | 109.03x | 33.0s |
| 12_SP-UVR-3B-44100 | 104.67x | 34.4s |
| MGM_LOWEND_B_v4 | 100.64x | 35.8s |
| 15_SP-UVR-MID-44100-1 | 99.00x | 36.4s |
| 16_SP-UVR-MID-44100-2 | 98.76x | 36.5s |
| 13_SP-UVR-4B-44100-1 | 97.78x | 36.8s |
| 14_SP-UVR-4B-44100-2 | 94.97x | 37.9s |
| 5_HP-Karaoke-UVR | 94.72x | 38.0s |
| 2_HP-UVR | 93.94x | 38.3s |
| UVR-De-Echo-Aggressive | 90.99x | 39.6s |
| UVR-DeNoise | 90.39x | 39.8s |
| UVR-De-Echo-Normal | 87.25x | 41.3s |
| UVR-DeReverb-aufr33-jarredou_4band_v4_ms_fullband | 86.70x | 41.5s |
| UVR-DeEcho-DeReverb | 86.58x | 41.6s |
| 3_HP-Vocal-UVR | 85.15x | 42.3s |
| 4_HP-Vocal-UVR | 84.23x | 42.7s |
| 1_HP-UVR | 84.06x | 42.8s |
| 17_HP-Wind_Inst-UVR | 82.92x | 43.4s |
| 6_HP-Karaoke-UVR | 81.81x | 44.0s |
| UVR-BVE-4B_SN-44100-1 | 81.54x | 44.2s |
| 9_HP2-UVR | 58.48x | 61.6s |
| 8_HP2-UVR | 57.23x | 62.9s |
| 7_HP2-UVR | 56.10x | 64.2s |
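
The "1-hour audio" column in both tables is consistent with reading RTFx as audio duration divided by processing time. That reading is an assumption inferred from the numbers, and `seconds_for` is a hypothetical helper for applying it to other durations.

```python
def seconds_for(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by a real-time factor (audio time / compute time)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    # RTFx values taken from the tables above.
    for name, rtfx in [("scnet_checkpoint_musdb18", 356.85), ("7_HP2-UVR", 56.10)]:
        print(f"{name}: {seconds_for(3600, rtfx):.1f}s")
```

The same function gives a quick estimate for other inputs, for example a four-minute track at a given RTFx, provided the hardware and settings match the benchmark conditions.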