AudioDrama is a full-stack application that transforms novel text into an immersive audio drama experience. It uses a Large Language Model (LLM) to analyze the text and identify the characters, then assigns a distinct voice to each character, with support for multiple languages.
- AI-Powered Text Analysis: Uses the ZhipuAI (GLM-4) API to parse novel text, separating dialogue from narration and identifying the speaking character.
- Intelligent Voice Assignment: Automatically detects the language of the dialogue for each character and assigns a suitable voice from the system's installed TTS voices. It tries to assign a unique voice to each character.
- Multi-Language Support: Dynamically assigns voices based on the detected language of the text (e.g., English, Chinese).
- Robust TTS Generation: Generates audio files in a separate, isolated process to ensure stability and prevent server hangs.
- Modern Web Interface: A clean and simple frontend built with React and Vite to input text and play the generated audio drama.
- Automatic Cleanup: Clears old audio files before generating new ones.
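
As an illustration of the isolated-process idea behind "Robust TTS Generation", here is a minimal sketch of running one TTS job in a child process with a timeout. It is not the project's actual code: the function name `run_tts_worker` and the `--text`, `--voice`, and `--out` flags are hypothetical, and the real worker script may take its arguments differently.

```python
import subprocess
from pathlib import Path


def run_tts_worker(worker_cmd, text, voice, out_path, timeout_s=60.0):
    """Run one TTS job in a child process; return True if it produced a file.

    worker_cmd is the command prefix for the worker, for example
    ["python3", "tts_worker.py"]. The flags below are assumed, not confirmed.
    """
    cmd = [*worker_cmd, "--text", text, "--voice", voice, "--out", str(out_path)]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout, so a stuck TTS
        # engine cannot block the server process.
        return False
    # Success means a clean exit code and an audio file on disk.
    return result.returncode == 0 and Path(out_path).exists()
```

Because the TTS engine runs in its own process, a crash or hang there surfaces as a `False` return value for that one segment rather than taking the server down.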
The project is a monorepo composed of two main parts:
- audio-drama-backend: A Python server built with FastAPI that handles text processing, LLM interaction, and TTS audio generation.
- audio-drama-frontend: A modern web application built with React (using Vite) that provides the user interface.
- Text Submission: The user pastes novel text into the frontend and submits it to the backend.
- Clear Audio Cache: The backend first deletes all previously generated audio files.
- LLM Analysis: The FastAPI server sends the text to the GLM-4 model with a detailed prompt to be structured into a list of segments, each containing the character and their dialogue. The prompt is optimized to distinguish between narration and dialogue.
- Voice Pre-assignment: The backend aggregates all dialogue for each unique character and performs a one-time, high-accuracy language detection on the large text block.
- Voice Selection: Based on the detected language for each character, the system assigns a suitable voice from the appropriate language-specific voice pool, trying to ensure each character gets a unique voice.
- Audio Generation: The server processes each text segment individually, calling a dedicated Python script (tts_worker.py) to generate an .aiff audio file using the pre-assigned voice. This ensures maximum stability.
- Playback: The frontend receives the list of segments with their corresponding audio URLs and plays them back in sequence, creating the audio drama experience.
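
The voice pre-assignment and voice selection steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the segment keys `character` and `text`, the pool layout, and the voice names are hypothetical, and `detect_language` is a toy stand-in for whatever language detector the backend actually uses.

```python
from collections import defaultdict


def aggregate_dialogue(segments):
    """Join every line spoken by a character into one text block per character."""
    blocks = defaultdict(list)
    for seg in segments:
        blocks[seg["character"]].append(seg["text"])
    return {name: " ".join(lines) for name, lines in blocks.items()}


def detect_language(text):
    """Toy detector: 'zh' if most letters are CJK ideographs, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if letters and cjk / len(letters) > 0.5 else "en"


def assign_voices(segments, voice_pools):
    """Pick a voice per character from the pool for that character's language.

    Voices are handed out in order, so characters only share a voice
    once the pool for their language has been used up.
    """
    next_index = defaultdict(int)
    assignment = {}
    for name, block in aggregate_dialogue(segments).items():
        lang = detect_language(block)  # one detection per character
        pool = voice_pools[lang]
        assignment[name] = pool[next_index[lang] % len(pool)]
        next_index[lang] += 1
    return assignment
```

Detecting the language once on each character's combined dialogue, rather than on every short line, gives the detector more text to work with, which matches the one-time detection described in the workflow.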
- Python 3.9+
- Node.js and npm
- An API key for ZhipuAI (GLM-4)
- Navigate to the backend directory:
    cd audio-drama-backend
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
- Install Python dependencies:
    pip install -r requirements.txt
- Configure your API Key:
  - Create a file named .env in the audio-drama-backend directory.
  - Add your ZhipuAI API key to it:
      ZHIPUAI_API_KEY=your_zhipuai_api_key_here
- (Optional) List Available Voices:
  - To see a list of all TTS voices available on your system, run the utility script:
      python3 list_voices.py
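
For reference, here is a minimal standard-library sketch of how a server could read the `ZHIPUAI_API_KEY` value configured during backend setup. The helper names `load_env_file` and `get_api_key` are hypothetical, and the actual backend may load the file differently (for example through a dotenv library); treat this as an illustration of the expected `.env` format only.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path=".env"):
    """Return ZHIPUAI_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("ZHIPUAI_API_KEY")
    if not key and Path(env_path).exists():
        key = load_env_file(env_path).get("ZHIPUAI_API_KEY")
    if not key:
        raise RuntimeError("ZHIPUAI_API_KEY is not set; add it to .env")
    return key
```

Failing fast with a clear error when the key is missing is usually easier to debug than an authentication failure on the first LLM request.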
- Navigate to the frontend directory:
    cd ../audio-drama-frontend
- Install Node.js dependencies:
    npm install
- Start the Backend Server:
  - In a terminal, from the audio-drama-backend directory (with the virtual environment activated):
      uvicorn main:app --host 0.0.0.0 --port 8000
- Start the Frontend Development Server:
  - In a separate terminal, from the audio-drama-frontend directory:
      npm run dev
- Access the Application:
  - Open your web browser and navigate to the URL provided by the Vite development server (usually http://localhost:5173).