Sayjad21/JARVIS

JARVIS - ESP32-S3 AI Voice Assistant

Overview

JARVIS is an advanced AI-powered voice assistant built on the ESP32-S3 microcontroller. It combines voice recording, speech-to-text conversion, AI response generation, and text-to-speech capabilities to create a complete conversational AI system. The device can also control external hardware and monitor environmental conditions.

Features

  • Voice Recording: High-quality audio recording using I2S microphone
  • Speech-to-Text: Transcription of recorded audio using the Deepgram API
  • AI Responses: Intelligent conversations powered by Google Gemini AI
  • Text-to-Speech: Natural voice synthesis using Deepgram TTS
  • Environmental Monitoring: Temperature and humidity sensing with DHT22
  • Hardware Control: Interface with external devices (ATmega32)
  • Local Storage: Audio file management on SD card
  • Push Button Interface: Physical controls for all major functions

Hardware Components

ESP32-S3-DevKitC-1

  • Main microcontroller with dual-core processing
  • Built-in WiFi for cloud API access
  • Two I2S peripherals for audio input and output

Audio System

  • I2S Microphone: Digital microphone for voice input
  • MAX98357A Amplifier: I2S DAC and amplifier for audio output
  • Speaker: Connected through MAX98357A for voice playback

Storage & Sensors

  • MicroSD Card: Local storage for audio recordings
  • DHT22 Sensor: Temperature and humidity monitoring

User Interface

  • Push Buttons: Physical controls for recording, playback, and AI interaction
  • Serial Interface: Debug output and manual commands

External Control

  • ATmega32 Interface: Digital outputs for controlling external devices

Pin Configuration

I2S Microphone

  • WS (Word Select): GPIO 4
  • SCK (Bit Clock): GPIO 5
  • SD (Serial Data): GPIO 6

MAX98357A Amplifier

  • LRC (Left/Right Clock): GPIO 16
  • BCLK (Bit Clock): GPIO 15
  • DIN (Data Input): GPIO 7
  • SD (Shutdown): Connected to 3.3V (always enabled)

SD Card (SPI)

  • CS (Chip Select): GPIO 10
  • MOSI: GPIO 11
  • SCK: GPIO 12
  • MISO: GPIO 13

Push Buttons

  • Start Recording: GPIO 2
  • Stop Recording: GPIO 14
  • Play Latest: GPIO 3
  • TTS/AI Response: GPIO 8

Sensors & Control

  • DHT22 Data: GPIO 35
  • ATmega Control Pin 1: GPIO 36
  • ATmega Control Pin 2: GPIO 37

Software Architecture

Core Libraries

  • Arduino Framework: Main development framework
  • I2S Driver: Low-level audio interface
  • WiFiClientSecure: Secure HTTPS communication
  • ArduinoJson: JSON parsing for API responses
  • DHT Library: Temperature/humidity sensor interface

Audio Processing Pipeline

Recording Process

  1. I2S Microphone Setup: Configure digital microphone with 16 kHz sample rate
  2. Buffer Management: Use 512-sample buffers for real-time processing
  3. WAV File Creation: Generate standard WAV headers for compatibility
  4. SD Card Storage: Save recordings with timestamp-based filenames
  5. Audio Level Monitoring: Real-time RMS calculation for input level display
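
The audio level monitoring step can be illustrated with a small RMS routine over one buffer of 16-bit samples. This is a sketch under assumptions: the function name `rmsLevel` and the normalization to full scale are not taken from the project's code, and the firmware may compute or display the level differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Root-mean-square level of one buffer of signed 16-bit samples,
// normalized so that a full-scale square wave gives 1.0.
double rmsLevel(const std::int16_t* samples, std::size_t count) {
  assert(samples != nullptr || count == 0);
  if (count == 0) return 0.0;
  double sumSquares = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    const double s = samples[i] / 32768.0;  // scale to [-1.0, 1.0)
    sumSquares += s * s;
  }
  return std::sqrt(sumSquares / static_cast<double>(count));
}
```

Called once per 512-sample buffer, this yields a new level reading every 32 ms at 16 kHz, which is frequent enough for a live input meter on the serial console.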

Playback Process

  1. File Reading: Stream audio data from SD card
  2. I2S Output: Send audio to MAX98357A amplifier
  3. Format Handling: Support for 16-bit mono WAV files
  4. Buffer Management: Continuous streaming without gaps

AI Integration

Speech-to-Text (Deepgram)

Audio File → Deepgram API → JSON Response → Text Extraction
  • API Endpoint: api.deepgram.com/v1/listen
  • Model: Nova-2-general with smart formatting
  • Audio Format: 16 kHz WAV files
  • Features: Automatic punctuation, number formatting
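
To make the request side of this step concrete, here is a sketch of the HTTP request head that could precede the WAV bytes on the TLS connection. The function name and the exact header set are assumptions rather than the project's code; `model` and `smart_format` are Deepgram query parameters, and Deepgram expects the key in an `Authorization: Token ...` header.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds the HTTP/1.1 request head for POSTing a WAV file to the
// Deepgram transcription endpoint. The WAV bytes are sent right after it.
std::string buildListenRequestHead(const std::string& apiKey, std::size_t wavBytes) {
  assert(!apiKey.empty());
  std::string head;
  head += "POST /v1/listen?model=nova-2-general&smart_format=true HTTP/1.1\r\n";
  head += "Host: api.deepgram.com\r\n";
  head += "Authorization: Token " + apiKey + "\r\n";
  head += "Content-Type: audio/wav\r\n";
  head += "Content-Length: " + std::to_string(wavBytes) + "\r\n";
  head += "Connection: close\r\n\r\n";
  return head;
}
```

In Deepgram's JSON response the text is found at `results.channels[0].alternatives[0].transcript`, which is the value the text extraction step would read with ArduinoJson.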

AI Response (Google Gemini)

Transcript → Gemini API → AI Response → Text Processing
  • API Endpoint: generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent
  • Model: Gemini 2.0 Flash for fast responses
  • Processing: Clean text output without special characters
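
The request body and the text cleanup for this step might look like the sketch below. The helper names are hypothetical, and the set of characters stripped by `cleanForSpeech` is an assumption about what "special characters" means here (Markdown markers that a TTS voice would otherwise read aloud). The body shape with `contents`, `parts`, and `text` follows the `generateContent` request format.

```cpp
#include <cassert>
#include <string>

// Escapes characters that must not appear raw inside a JSON string.
// Control characters other than \n, \r and \t are dropped.
std::string jsonEscape(const std::string& in) {
  std::string out;
  for (char c : in) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      default:
        if (static_cast<unsigned char>(c) >= 0x20) out += c;
    }
  }
  return out;
}

// generateContent request body carrying the transcript as one text part.
std::string buildGeminiBody(const std::string& transcript) {
  return "{\"contents\":[{\"parts\":[{\"text\":\"" + jsonEscape(transcript) + "\"}]}]}";
}

// Removes Markdown-style markers and turns line breaks into spaces
// (assumed character set, chosen for illustration).
std::string cleanForSpeech(const std::string& reply) {
  std::string out;
  for (char c : reply) {
    if (c == '*' || c == '#' || c == '`' || c == '_') continue;
    out += (c == '\n' || c == '\r') ? ' ' : c;
  }
  return out;
}
```

Escaping matters because transcripts routinely contain quotes and apostrophes; an unescaped quote would produce an invalid request body.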

Text-to-Speech (Deepgram)

AI Response → Deepgram TTS → Audio Stream → Local Playback
  • API Endpoint: api.deepgram.com/v1/speak
  • Voice Model: Aura Asteria (English)
  • Audio Format: 16 kHz linear PCM
  • Storage: Save TTS files for replay functionality
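
As a sketch of how the voice model and audio format listed above map onto the request, the path could be assembled as follows. The function name is hypothetical; `model`, `encoding`, and `sample_rate` are Deepgram TTS query parameters, and `aura-asteria-en` is the identifier of the Aura Asteria voice. The text to speak is sent as a JSON body of the form `{"text": "..."}`.

```cpp
#include <cassert>
#include <string>

// Request path for the Deepgram speak endpoint, asking for 16-bit linear PCM.
std::string buildSpeakPath(const std::string& voiceModel, int sampleRateHz) {
  assert(sampleRateHz > 0);
  return "/v1/speak?model=" + voiceModel +
         "&encoding=linear16&sample_rate=" + std::to_string(sampleRateHz);
}
```

Requesting the same 16 kHz, 16-bit format that the recorder uses means the playback path can likely be shared between recordings and TTS audio.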

Command Processing

Voice Commands

The system recognizes specific voice commands:

  • "on": Activates external device (ATmega pin HIGH)
  • "off": Deactivates external device (ATmega pin LOW)
  • General queries: Processed by Gemini AI for conversational responses
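
One way to route a transcript between the two commands and the AI path is whole-word matching after normalization. This is a sketch under that assumption; the README does not say how the firmware matches "on" and "off", and the `Action` enum and `classifyTranscript` name are illustrative.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

enum class Action { DeviceOn, DeviceOff, AskAI };

// Lowercases the transcript, treats every non-letter as a separator,
// and returns the first command word found. Anything else goes to the AI.
Action classifyTranscript(const std::string& transcript) {
  std::string normalized;
  for (unsigned char c : transcript)
    normalized += std::isalpha(c) ? static_cast<char>(std::tolower(c)) : ' ';

  std::istringstream words(normalized);
  std::string word;
  while (words >> word) {
    if (word == "on") return Action::DeviceOn;
    if (word == "off") return Action::DeviceOff;
  }
  return Action::AskAI;
}
```

Whole-word matching keeps "onions" from switching the device on, but a sentence such as "what is going on" would still match, so a stricter rule (for example, requiring a short utterance) may be preferable in practice.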

Serial Commands

Manual control through serial interface:

  • s/S: Start recording
  • x/X: Stop recording
  • p/P: Play latest recording
  • c/C: Transcribe and get AI response
  • l/L: List SD card files
  • t/T: Play test tone
  • d/D: Delete all audio files
  • v/V: Replay last TTS audio
  • y/Y: Read temperature/humidity
  • b/B: Manual ATmega control (HIGH)
  • n/N: Manual ATmega control (LOW)
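
The case-insensitive command table above maps naturally onto a single `switch`. The sketch below uses hypothetical enum and function names; only the letters and their meanings come from the list.

```cpp
enum class Command {
  StartRecording, StopRecording, PlayLatest, TranscribeAndAsk, ListFiles,
  TestTone, DeleteAll, ReplayTts, ReadClimate, AtmegaHigh, AtmegaLow, Unknown
};

// Maps one character read from the serial port to a command.
// Upper- and lowercase letters are treated the same.
constexpr Command parseSerialCommand(char c) {
  if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
  switch (c) {
    case 's': return Command::StartRecording;
    case 'x': return Command::StopRecording;
    case 'p': return Command::PlayLatest;
    case 'c': return Command::TranscribeAndAsk;
    case 'l': return Command::ListFiles;
    case 't': return Command::TestTone;
    case 'd': return Command::DeleteAll;
    case 'v': return Command::ReplayTts;
    case 'y': return Command::ReadClimate;
    case 'b': return Command::AtmegaHigh;
    case 'n': return Command::AtmegaLow;
    default:  return Command::Unknown;
  }
}
```

Returning `Unknown` for everything else lets the caller ignore stray characters such as the newline that most serial monitors append.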

Environmental Monitoring

DHT22 Sensor Integration

  • Measurement: Temperature (°C) and humidity (%)
  • Stability: Waits 2 seconds between reads, the DHT22's minimum sampling interval, for accurate readings
  • Error Handling: Automatic retry on failed readings
  • Voice Format: "Temperature is X point Y degree celsius. Humidity is A point B percent."
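
The retry behaviour and the spoken format can be sketched as follows. All names are hypothetical, the sensor read is passed in as a callable so the logic can be exercised without hardware, and the handling of negative temperatures ("minus") is an assumption that the README does not cover. On the device, each retry would also wait out the 2-second sampling interval.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>

// Calls `read` up to `attempts` times and returns the first reading that
// is not NaN. Returns NaN if every attempt fails.
template <typename ReadFn>
double readWithRetry(ReadFn read, int attempts) {
  assert(attempts > 0);
  double value = std::nan("");
  for (int i = 0; i < attempts && std::isnan(value); ++i) value = read();
  return value;
}

// Renders a value rounded to one decimal as "<whole> point <tenths>".
std::string spokenNumber(double value) {
  const long tenths = std::lround(value * 10.0);
  const long magnitude = std::labs(tenths);
  return std::string(tenths < 0 ? "minus " : "") +
         std::to_string(magnitude / 10) + " point " + std::to_string(magnitude % 10);
}

// Builds the sentence in the voice format quoted above.
std::string climateSentence(double tempC, double humidityPercent) {
  return "Temperature is " + spokenNumber(tempC) + " degree celsius. Humidity is " +
         spokenNumber(humidityPercent) + " percent.";
}
```

Spelling out the decimal point as a word avoids relying on the TTS voice to pronounce "24.5" consistently.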

API Configuration

Required API Keys

  1. Deepgram Speech-to-Text: DEEPGRAM_API_KEY1
  2. Deepgram Text-to-Speech: DEEPGRAM_API_KEY2
  3. Google Gemini AI: GEMINI_API_KEY

WiFi Configuration

  • SSID: Network name
  • Password: Network password
  • Security: WPA2/WPA3 supported

Operation Flow

Complete Voice Interaction Cycle

  1. Voice Input

    User speaks → I2S Microphone → ESP32-S3 → SD Card (WAV file)
    
  2. Speech Recognition

    WAV file → Deepgram API → Text transcript
    
  3. Command Processing

    Text analysis → Voice commands OR AI query
    
  4. Response Generation

    AI query → Gemini API → Text response
    
  5. Voice Output

    Text response → Deepgram TTS → Audio stream → Speaker
    

Push Button Operations

Recording Workflow

  1. Press Start Recording button
  2. System creates timestamped WAV file
  3. Recording runs for up to 10 seconds (configurable)
  4. Recording ends when Stop Recording is pressed or the timeout elapses
  5. WAV header is written with the final audio data size
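
The last step depends on the data size, which is only known once recording has ended, so a common approach is to reserve 44 bytes at the start of the file and fill them in afterwards. The sketch below builds the canonical 44-byte PCM header for the 16 kHz, 16-bit, mono format described in this README; the function names are illustrative and the firmware's own layout code may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Writes `bytes` bytes of `value` at `dst` in little-endian order.
static void putLE(std::uint8_t* dst, std::uint32_t value, int bytes) {
  for (int i = 0; i < bytes; ++i) dst[i] = static_cast<std::uint8_t>(value >> (8 * i));
}

// Canonical 44-byte RIFF/WAVE header for 16 kHz, 16-bit, mono PCM.
std::array<std::uint8_t, 44> makeWavHeader(std::uint32_t dataBytes) {
  assert(dataBytes <= 0xFFFFFFFFu - 36u);  // RIFF size field must not overflow
  const std::uint32_t sampleRate = 16000;
  const std::uint32_t channels = 1;
  const std::uint32_t bitsPerSample = 16;
  const std::uint32_t blockAlign = channels * bitsPerSample / 8;

  std::array<std::uint8_t, 44> h{};
  std::memcpy(&h[0], "RIFF", 4);
  putLE(&h[4], 36 + dataBytes, 4);             // file size minus 8
  std::memcpy(&h[8], "WAVEfmt ", 8);
  putLE(&h[16], 16, 4);                        // fmt chunk size
  putLE(&h[20], 1, 2);                         // audio format 1 = PCM
  putLE(&h[22], channels, 2);
  putLE(&h[24], sampleRate, 4);
  putLE(&h[28], sampleRate * blockAlign, 4);   // byte rate
  putLE(&h[32], blockAlign, 2);
  putLE(&h[34], bitsPerSample, 2);
  std::memcpy(&h[36], "data", 4);
  putLE(&h[40], dataBytes, 4);
  return h;
}
```

For a full 10-second recording, `makeWavHeader(320000)` describes 320,000 bytes of sample data, and the finished file is 320,044 bytes long.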

Playback Workflow

  1. Press Play Latest button
  2. System locates most recent recording
  3. Streams audio through I2S to speaker
  4. Real-time playback with stop capability

AI Interaction Workflow

  1. Press TTS/AI Response button
  2. System transcribes latest recording
  3. Processes transcript for commands or AI query
  4. Generates and plays voice response

Technical Specifications

Audio Specifications

  • Sample Rate: 16 kHz
  • Bit Depth: 16-bit
  • Channels: Mono
  • Buffer Size: 512 samples
  • Recording Time: 10 seconds (configurable)
  • File Format: WAV with standard headers
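
The figures above determine the buffer and file sizes directly. The following sketch works through the arithmetic as compile-time constants; the constant names are illustrative and not taken from the firmware.

```cpp
namespace audio {
constexpr unsigned SAMPLE_RATE_HZ = 16000;
constexpr unsigned BYTES_PER_SAMPLE = 2;   // 16-bit
constexpr unsigned CHANNELS = 1;           // mono
constexpr unsigned BUFFER_SAMPLES = 512;
constexpr unsigned RECORD_SECONDS = 10;

constexpr unsigned BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;
constexpr unsigned BUFFER_BYTES = BUFFER_SAMPLES * BYTES_PER_SAMPLE * CHANNELS;
constexpr unsigned BUFFER_MS = BUFFER_SAMPLES * 1000 / SAMPLE_RATE_HZ;
constexpr unsigned DATA_BYTES = BYTES_PER_SECOND * RECORD_SECONDS;
// Number of buffer reads needed to cover one recording, rounded up.
constexpr unsigned BUFFERS_PER_RECORDING = (DATA_BYTES + BUFFER_BYTES - 1) / BUFFER_BYTES;
}  // namespace audio

static_assert(audio::BYTES_PER_SECOND == 32000, "32 kB of audio per second");
static_assert(audio::BUFFER_BYTES == 1024, "one buffer is 1 KiB");
static_assert(audio::BUFFER_MS == 32, "one buffer holds 32 ms of audio");
static_assert(audio::DATA_BYTES == 320000, "10 s of audio is 320,000 bytes");
static_assert(audio::BUFFERS_PER_RECORDING == 313, "312.5 buffers, rounded up");
```

Each 512-sample buffer therefore has to be read and written to the SD card within 32 ms for recording to keep up with the microphone.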

Performance Characteristics

  • Recording Latency: <100ms startup
  • Transcription Time: 2-5 seconds (network dependent)
  • AI Response Time: 1-3 seconds (network dependent)
  • TTS Generation: 2-4 seconds (network dependent)
  • Total Cycle Time: 6-15 seconds end-to-end

Memory Management

  • Global Buffers: Allocated outside stack for stability
  • Buffer Sizes: Optimized for ESP32-S3 memory constraints
  • File Handling: Streaming I/O to minimize RAM usage
  • JSON Processing: Dynamic allocation with cleanup

Setup Instructions

Hardware Assembly

  1. Connect I2S microphone to specified GPIO pins
  2. Wire MAX98357A amplifier for audio output
  3. Install MicroSD card and connect SPI interface
  4. Connect DHT22 sensor to GPIO 35
  5. Wire push buttons with appropriate pull-up/pull-down resistors
  6. Connect ATmega32 control interface if using external devices

Software Configuration

  1. Install PlatformIO with ESP32-S3 support
  2. Configure WiFi credentials in source code
  3. Add API keys for Deepgram and Gemini services
  4. Upload firmware using PlatformIO
  5. Format MicroSD card (FAT32 recommended)

Initial Testing

  1. Monitor serial output for system initialization
  2. Test audio recording with 's' and 'x' commands
  3. Verify SD card file creation with 'l' command
  4. Test playback functionality with 'p' command
  5. Verify WiFi connection and API access with 'c' command

Troubleshooting

Common Issues

  • SD Card Failures: Check wiring, format, and card compatibility
  • Audio Quality: Verify I2S timing and buffer sizes
  • Network Issues: Confirm WiFi credentials and API keys
  • Sensor Errors: Allow proper DHT22 stabilization time
  • Memory Issues: Monitor heap usage and buffer allocations

Debug Features

  • Serial Monitoring: Comprehensive logging of all operations
  • Audio Level Display: Real-time input level visualization
  • API Response Logging: Full HTTP response debugging
  • File System Status: SD card space and file management

Future Enhancements

Planned Features

  • Wake Word Detection: Always-listening mode with keyword activation
  • Multi-language Support: Extended language models and TTS voices
  • Local AI Processing: Edge AI for reduced latency and offline operation
  • Advanced Audio: Noise cancellation and echo reduction
  • IoT Integration: MQTT and smart home protocol support
  • Mobile App: Companion application for remote control and monitoring

Scalability Options

  • Multiple Device Network: Distributed JARVIS instances
  • Cloud Integration: Enhanced cloud services and data analytics
  • Custom Training: Personalized AI models and voice recognition
  • Hardware Expansion: Additional sensors and actuators

License and Credits

This project leverages several open-source libraries and cloud services:

  • Arduino Framework: Arduino community
  • ESP-IDF: Espressif Systems
  • ArduinoJson: Benoit Blanchon
  • DHT Sensor Library: Adafruit Industries
  • Deepgram API: Deepgram Inc.
  • Google Gemini: Google LLC

Contributing

Contributions are welcome! Please consider:

  • Hardware Optimizations: Improved audio quality and processing
  • Software Features: Additional AI capabilities and integrations
  • Documentation: Usage examples and setup guides
  • Testing: Platform compatibility and edge case handling

JARVIS is a complete, end-to-end AI voice assistant implementation that shows what modern embedded systems can do when combined with cloud AI services.
