JARVIS is an advanced AI-powered voice assistant built on the ESP32-S3 microcontroller. It combines voice recording, speech-to-text conversion, AI response generation, and text-to-speech capabilities to create a complete conversational AI system. The device can also control external hardware and monitor environmental conditions.
- Voice Recording: High-quality audio recording using I2S microphone
- Speech-to-Text: Real-time transcription using Deepgram API
- AI Responses: Intelligent conversations powered by Google Gemini AI
- Text-to-Speech: Natural voice synthesis using Deepgram TTS
- Environmental Monitoring: Temperature and humidity sensing with DHT22
- Hardware Control: Interface with external devices (ATmega32)
- Local Storage: Audio file management on SD card
- Push Button Interface: Physical controls for all major functions
- Main microcontroller with dual-core processing
- Built-in WiFi for cloud API access
- Multiple I2S ports for audio processing
- I2S Microphone: Digital microphone for voice input
- MAX98357A Amplifier: I2S DAC and amplifier for audio output
- Speaker: Connected through MAX98357A for voice playback
- MicroSD Card: Local storage for audio recordings
- DHT22 Sensor: Temperature and humidity monitoring
- Push Buttons: Physical controls for recording, playback, and AI interaction
- Serial Interface: Debug output and manual commands
- ATmega32 Interface: Digital outputs for controlling external devices
I2S microphone:
- WS (Word Select): GPIO 4
- SCK (Bit Clock): GPIO 5
- SD (Serial Data): GPIO 6
MAX98357A amplifier:
- LRC (Left/Right Clock): GPIO 16
- BCLK (Bit Clock): GPIO 15
- DIN (Data Input): GPIO 7
- SD (Shutdown): Connected to 3.3V (always enabled)
MicroSD card (SPI):
- CS (Chip Select): GPIO 10
- MOSI: GPIO 11
- SCK: GPIO 12
- MISO: GPIO 13
Push buttons:
- Start Recording: GPIO 2
- Stop Recording: GPIO 14
- Play Latest: GPIO 3
- TTS/AI Response: GPIO 8
Sensor and ATmega32 control:
- DHT22 Data: GPIO 35
- ATmega Control Pin 1: GPIO 36
- ATmega Control Pin 2: GPIO 37
- Arduino Framework: Main development framework
- I2S Driver: Low-level audio interface
- WiFiClientSecure: Secure HTTPS communication
- ArduinoJson: JSON parsing for API responses
- DHT Library: Temperature/humidity sensor interface
- I2S Microphone Setup: Configure the digital microphone for a 16 kHz sample rate
- Buffer Management: Use 512-sample buffers for real-time processing
- WAV File Creation: Generate standard WAV headers for compatibility
- SD Card Storage: Save recordings with timestamp-based filenames
- Audio Level Monitoring: Real-time RMS calculation for input level display
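The audio level step above can be sketched as follows. This is a minimal illustration, not the firmware's actual code: the names rmsLevel and levelBarWidth are hypothetical, and only the 512-sample buffer size comes from this document.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Buffer length taken from this document (512 samples per read).
constexpr std::size_t kBufferSamples = 512;

// Root-mean-square level of a block of signed 16-bit PCM samples.
// The result lies in [0, 32768]; 0 means silence.
double rmsLevel(const std::int16_t* samples, std::size_t count) {
    assert(samples != nullptr || count == 0);
    if (count == 0) return 0.0;
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        const double s = static_cast<double>(samples[i]);
        sumSquares += s * s;
    }
    return std::sqrt(sumSquares / static_cast<double>(count));
}

// Hypothetical helper: scale an RMS value to a 0..width bar for a
// text-based level display on the serial monitor.
int levelBarWidth(double rms, int width) {
    int bars = static_cast<int>(rms / 32768.0 * width + 0.5);
    if (bars > width) bars = width;
    return bars;
}
```

Accumulating in double avoids overflow: 512 squared 16-bit samples can exceed the range of a 32-bit integer.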
- File Reading: Stream audio data from SD card
- I2S Output: Send audio to MAX98357A amplifier
- Format Handling: Support for 16-bit mono WAV files
- Buffer Management: Continuous streaming without gaps
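Before streaming a file to the amplifier, the player needs to confirm that the file really is 16-bit mono PCM. The sketch below shows one way to check that, assuming the canonical 44-byte WAV header with no extra chunks; isSupportedWav is a hypothetical name, not a function from the firmware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read a little-endian 16-bit field from a byte buffer.
static std::uint16_t readLE16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Returns true when the first 44 bytes describe 16-bit mono PCM audio.
// Assumes the canonical layout: RIFF, WAVE, "fmt " at fixed offsets.
bool isSupportedWav(const std::uint8_t* header, std::size_t length) {
    assert(header != nullptr);
    if (length < 44) return false;
    if (std::memcmp(header, "RIFF", 4) != 0) return false;
    if (std::memcmp(header + 8, "WAVE", 4) != 0) return false;
    if (std::memcmp(header + 12, "fmt ", 4) != 0) return false;
    const bool isPcm = readLE16(header + 20) == 1;      // audio format 1 = PCM
    const bool isMono = readLE16(header + 22) == 1;     // channel count
    const bool is16Bit = readLE16(header + 34) == 16;   // bits per sample
    return isPcm && isMono && is16Bit;
}
```

If the check passes, playback can skip the 44 header bytes and send the remaining data to I2S in fixed-size chunks.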
Audio File → Deepgram API → JSON Response → Text Extraction
- API Endpoint: api.deepgram.com/v1/listen
- Model: Nova-2-general with smart formatting
- Audio Format: 16 kHz WAV files
- Features: Automatic punctuation, number formatting
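As a rough sketch of the upload step, the request head for the transcription call could be assembled as shown below. The function name buildListenRequestHead is hypothetical, and the query-parameter spelling (model=nova-2-general, smart_format=true) is an assumption about how the firmware selects the model and smart formatting named above. The WAV bytes would be streamed from the SD card after this head.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds the HTTP/1.1 request head for posting a WAV file to Deepgram.
// The body (the WAV bytes) is sent separately so the whole file never
// has to sit in RAM.
std::string buildListenRequestHead(const std::string& apiKey, std::size_t wavBytes) {
    assert(!apiKey.empty());
    std::string head;
    head += "POST /v1/listen?model=nova-2-general&smart_format=true HTTP/1.1\r\n";
    head += "Host: api.deepgram.com\r\n";
    head += "Authorization: Token " + apiKey + "\r\n";
    head += "Content-Type: audio/wav\r\n";
    head += "Content-Length: " + std::to_string(wavBytes) + "\r\n";
    head += "Connection: close\r\n\r\n";
    return head;
}
```

On the device, this string would be written to a WiFiClientSecure connection, followed by the file contents in buffer-sized chunks.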
Transcript → Gemini API → AI Response → Text Processing
- API Endpoint: generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent
- Model: Gemini 2.0 Flash for fast responses
- Processing: Clean text output without special characters
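The exact cleanup rules are not spelled out here, so the following is only one plausible sketch of the text-processing step. The function cleanForSpeech is hypothetical. It keeps ASCII letters, digits, and basic punctuation, drops everything else (for example Markdown markers such as * and #, and also any non-ASCII bytes), and collapses runs of whitespace into single spaces.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Reduce a model reply to plain text that a TTS engine can read aloud.
std::string cleanForSpeech(const std::string& text) {
    std::string out;
    bool pendingSpace = false;  // a space is owed before the next kept character
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            pendingSpace = !out.empty();  // never emit leading whitespace
            continue;
        }
        const bool keep = std::isalnum(c) ||
                          (c != 0 && std::strchr(".,?!'-:;%", c) != nullptr);
        if (!keep) continue;
        if (pendingSpace) {
            out += ' ';
            pendingSpace = false;
        }
        out += static_cast<char>(c);
    }
    return out;
}
```

For example, the input "**Hello**, world!" followed by a blank line and "# Title" becomes "Hello, world! Title".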
AI Response → Deepgram TTS → Audio Stream → Local Playback
- API Endpoint: api.deepgram.com/v1/speak
- Voice Model: Aura Asteria (English)
- Audio Format: 16 kHz linear PCM
- Storage: Save TTS files for replay functionality
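A TTS request needs the reply text wrapped in a small JSON body, with quotes and control characters escaped. The sketch below illustrates that step. The names kSpeakPath, jsonEscape, and buildSpeakBody are hypothetical, and the query parameters are an assumption about how the voice and the 16 kHz linear PCM format listed above are requested.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Assumed request path: Aura Asteria voice, 16-bit linear PCM at 16 kHz.
const char* const kSpeakPath =
    "/v1/speak?model=aura-asteria-en&encoding=linear16&sample_rate=16000";

// Escape a string so it can be placed inside a JSON string literal.
std::string jsonEscape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// JSON body for the TTS request: {"text":"..."}
std::string buildSpeakBody(const std::string& text) {
    assert(!text.empty());
    return "{\"text\":\"" + jsonEscape(text) + "\"}";
}
```

Escaping matters here because AI replies frequently contain quotation marks and line breaks, either of which would otherwise produce invalid JSON.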
The system recognizes specific voice commands:
- "on": Activates external device (ATmega pin HIGH)
- "off": Deactivates external device (ATmega pin LOW)
- General queries: Processed by Gemini AI for conversational responses
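How the firmware matches these commands is not described, so the following is a sketch under stated assumptions: whole-word, case-insensitive matching, with the first matching word winning. The names Intent, classifyTranscript, and controlPinLevel are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class Intent { DeviceOn, DeviceOff, AiQuery };

// Scan the transcript word by word; "on" or "off" as a whole word selects
// a device command, anything else is passed on as an AI query.
Intent classifyTranscript(const std::string& transcript) {
    std::string word;
    // The appended space flushes the final word through the same check.
    for (unsigned char c : transcript + " ") {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
            continue;
        }
        if (word == "on") return Intent::DeviceOn;
        if (word == "off") return Intent::DeviceOff;
        word.clear();
    }
    return Intent::AiQuery;
}

// Logic level for the ATmega control pin; only defined for device intents.
bool controlPinLevel(Intent intent) {
    assert(intent != Intent::AiQuery);
    return intent == Intent::DeviceOn;
}
```

Whole-word matching keeps "one" or "upon" from triggering the device, but a question such as "what is going on" would still count as a command. A stricter phrase, for example "turn on", may be preferable in practice.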
Manual control through serial interface:
- s/S: Start recording
- x/X: Stop recording
- p/P: Play latest recording
- c/C: Transcribe and get AI response
- l/L: List SD card files
- t/T: Play test tone
- d/D: Delete all audio files
- v/V: Replay last TTS audio
- y/Y: Read temperature/humidity
- b/B: Manual ATmega control (HIGH)
- n/N: Manual ATmega control (LOW)
- Measurement: Temperature (°C) and humidity (%)
- Stability: At least 2 seconds between readings, matching the DHT22's minimum sampling interval
- Error Handling: Automatic retry on failed readings
- Voice Format: "Temperature is X point Y degree celsius. Humidity is A point B percent."
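The spoken format above can be produced from the two sensor values as in the sketch below. The function names are hypothetical. Values are rounded to one decimal place using integer tenths, and negative temperatures are spoken with a leading "minus", which is an assumption since the document does not cover that case.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>

// Render a value as "<whole> point <tenths>", e.g. 23.5 -> "23 point 5".
static std::string spokenDecimal(float value) {
    const long tenths = std::lround(value * 10.0f);  // round to one decimal
    const long magnitude = std::labs(tenths);
    std::string out = tenths < 0 ? "minus " : "";
    out += std::to_string(magnitude / 10) + " point " + std::to_string(magnitude % 10);
    return out;
}

// Build the sentence handed to TTS. Callers should retry the sensor read
// instead of calling this with NaN, which signals a failed reading.
std::string formatReading(float tempC, float humidityPct) {
    assert(!std::isnan(tempC) && !std::isnan(humidityPct));
    return "Temperature is " + spokenDecimal(tempC) +
           " degree celsius. Humidity is " + spokenDecimal(humidityPct) +
           " percent.";
}
```

Spelling out "point" avoids relying on the TTS engine to read decimal separators consistently.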
- Deepgram Speech-to-Text: DEEPGRAM_API_KEY1
- Deepgram Text-to-Speech: DEEPGRAM_API_KEY2
- Google Gemini AI: GEMINI_API_KEY
- SSID: Network name
- Password: Network password
- Security: WPA2/WPA3 supported
- Voice Input: User speaks → I2S Microphone → ESP32-S3 → SD Card (WAV file)
- Speech Recognition: WAV file → Deepgram API → Text transcript
- Command Processing: Text analysis → Voice commands OR AI query
- Response Generation: AI query → Gemini API → Text response
- Voice Output: Text response → Deepgram TTS → Audio stream → Speaker
- Press Start Recording button
- System creates timestamped WAV file
- Records for up to 10 seconds (configurable)
- Recording ends when Stop Recording is pressed or the timeout elapses
- WAV header written with audio data size
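The final step, writing the header once the data size is known, can be illustrated with the sketch below. It builds the standard 44-byte PCM WAV header for the 16 kHz, 16-bit, mono format used here. The function names are hypothetical and the firmware's own implementation may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t kSampleRate = 16000;
constexpr std::uint16_t kBitsPerSample = 16;
constexpr std::uint16_t kChannels = 1;
constexpr std::size_t kWavHeaderSize = 44;

// WAV fields are little-endian regardless of the host CPU.
static void putLE16(std::uint8_t* p, std::uint16_t v) {
    p[0] = static_cast<std::uint8_t>(v & 0xFF);
    p[1] = static_cast<std::uint8_t>(v >> 8);
}
static void putLE32(std::uint8_t* p, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<std::uint8_t>((v >> (8 * i)) & 0xFF);
}

// Number of PCM bytes produced by a recording of the given length.
constexpr std::uint32_t pcmBytesForSeconds(std::uint32_t seconds) {
    return seconds * kSampleRate * kChannels * (kBitsPerSample / 8);
}

// Canonical 44-byte header for uncompressed PCM with dataBytes of audio.
std::array<std::uint8_t, kWavHeaderSize> makeWavHeader(std::uint32_t dataBytes) {
    assert(dataBytes <= 0xFFFFFFFFu - 36u);  // RIFF chunk size must fit in 32 bits
    std::array<std::uint8_t, kWavHeaderSize> h{};
    const std::uint16_t blockAlign = static_cast<std::uint16_t>(kChannels * kBitsPerSample / 8);
    std::memcpy(h.data(), "RIFF", 4);
    putLE32(h.data() + 4, 36u + dataBytes);        // file size minus 8 bytes
    std::memcpy(h.data() + 8, "WAVE", 4);
    std::memcpy(h.data() + 12, "fmt ", 4);
    putLE32(h.data() + 16, 16);                    // fmt chunk length
    putLE16(h.data() + 20, 1);                     // format 1 = PCM
    putLE16(h.data() + 22, kChannels);
    putLE32(h.data() + 24, kSampleRate);
    putLE32(h.data() + 28, kSampleRate * blockAlign);  // byte rate
    putLE16(h.data() + 32, blockAlign);
    putLE16(h.data() + 34, kBitsPerSample);
    std::memcpy(h.data() + 36, "data", 4);
    putLE32(h.data() + 40, dataBytes);
    return h;
}
```

A full 10-second recording yields 320,000 bytes of audio, so the file is 320,044 bytes including the header. One common approach, assumed here, is to write a placeholder header first, stream the samples, then seek back to offset 0 and write the final header.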
- Press Play Latest button
- System locates most recent recording
- Streams audio through I2S to speaker
- Real-time playback with stop capability
- Press TTS/AI Response button
- System transcribes latest recording
- Processes transcript for commands or AI query
- Generates and plays voice response
- Sample Rate: 16 kHz
- Bit Depth: 16-bit
- Channels: Mono
- Buffer Size: 512 samples
- Recording Time: 10 seconds (configurable)
- File Format: WAV with standard headers
- Recording Latency: <100ms startup
- Transcription Time: 2-5 seconds (network dependent)
- AI Response Time: 1-3 seconds (network dependent)
- TTS Generation: 2-4 seconds (network dependent)
- Total Cycle Time: 6-15 seconds end-to-end
- Global Buffers: Allocated outside stack for stability
- Buffer Sizes: Optimized for ESP32-S3 memory constraints
- File Handling: Streaming I/O to minimize RAM usage
- JSON Processing: Dynamic allocation with cleanup
- Connect I2S microphone to specified GPIO pins
- Wire MAX98357A amplifier for audio output
- Install MicroSD card and connect SPI interface
- Connect DHT22 sensor to GPIO 35
- Wire push buttons with appropriate pull-up/pull-down resistors
- Connect ATmega32 control interface if using external devices
- Install PlatformIO with ESP32-S3 support
- Configure WiFi credentials in source code
- Add API keys for Deepgram and Gemini services
- Upload firmware using PlatformIO
- Format MicroSD card (FAT32 recommended)
- Monitor serial output for system initialization
- Test audio recording with 's' and 'x' commands
- Verify SD card file creation with 'l' command
- Test playback functionality with 'p' command
- Verify WiFi connection and API access with 'c' command
- SD Card Failures: Check wiring, format, and card compatibility
- Audio Quality: Verify I2S timing and buffer sizes
- Network Issues: Confirm WiFi credentials and API keys
- Sensor Errors: Allow proper DHT22 stabilization time
- Memory Issues: Monitor heap usage and buffer allocations
- Serial Monitoring: Comprehensive logging of all operations
- Audio Level Display: Real-time input level visualization
- API Response Logging: Full HTTP response debugging
- File System Status: SD card space and file management
- Wake Word Detection: Always-listening mode with keyword activation
- Multi-language Support: Extended language models and TTS voices
- Local AI Processing: Edge AI for reduced latency and offline operation
- Advanced Audio: Noise cancellation and echo reduction
- IoT Integration: MQTT and smart home protocol support
- Mobile App: Companion application for remote control and monitoring
- Multiple Device Network: Distributed JARVIS instances
- Cloud Integration: Enhanced cloud services and data analytics
- Custom Training: Personalized AI models and voice recognition
- Hardware Expansion: Additional sensors and actuators
This project leverages several open-source libraries and cloud services:
- Arduino Framework: Arduino community
- ESP-IDF: Espressif Systems
- ArduinoJson: Benoit Blanchon
- DHT Sensor Library: Adafruit Industries
- Deepgram API: Deepgram Inc.
- Google Gemini: Google LLC
Contributions are welcome! Please consider:
- Hardware Optimizations: Improved audio quality and processing
- Software Features: Additional AI capabilities and integrations
- Documentation: Usage examples and setup guides
- Testing: Platform compatibility and edge case handling
JARVIS represents a complete, production-ready AI voice assistant implementation showcasing the capabilities of modern embedded systems combined with cloud AI services.