A unified async FastAPI proxy for accessing NVIDIA AI models through a single API endpoint. This project provides a clean interface for routing requests to various NVIDIA NIM (NVIDIA Inference Microservices) models, with automatic request adjustment and response normalization.
Status: WORK IN PROGRESS
- Core architecture is functional and can be used locally
- Public API endpoint (api.hyena-industries.com) is NOT YET LIVE
- You can clone this repo and run it locally right now
- Full production deployment coming soon
- Async/Await Support: Fully asynchronous FastAPI endpoints for high concurrency
- 12+ Models Supported: DeepSeek, Mistral, Gemma, GLM, Kimi, and more
- Streaming Responses: Real-time token streaming via Server-Sent Events
- Automatic Request Adjustment: Model-specific parameter optimization (temperature defaults, thinking models, etc.)
- Unified Response Format: Consistent response structure across all models
- Thinking Models: Support for models with reasoning/thinking blocks (DeepSeek, GLM, Kimi)
- Error Handling: Comprehensive error handling with detailed error messages
- Model Registry: Easy model discovery and configuration
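
To make the "Automatic Request Adjustment" feature concrete, here is a minimal sketch of what such a step might look like. It is not the project's actual handler: the function name `adjust_request`, the default temperature of 0.7, and the rule of dropping `enable_thinking` for non-thinking models are all assumptions made for illustration.

```python
from copy import deepcopy

# Hypothetical values for illustration only. The real configuration lives in
# app/core/model_registry.py and may differ.
THINKING_MODELS = {"deepseek-v3.2", "kimi-k2-thinking", "glm4.7"}
DEFAULT_TEMPERATURE = 0.7  # assumed default, not taken from the project


def adjust_request(payload: dict) -> dict:
    """Return a copy of the payload with model-specific adjustments applied."""
    adjusted = deepcopy(payload)  # never mutate the caller's request
    # Fill in a temperature only when the caller did not provide one.
    adjusted.setdefault("temperature", DEFAULT_TEMPERATURE)
    # Drop the thinking flag for models that do not support reasoning blocks.
    if adjusted["model"] not in THINKING_MODELS:
        adjusted.pop("enable_thinking", None)
    return adjusted


request = {
    "model": "mistral-large-3",
    "messages": [{"role": "user", "content": "Hello"}],
    "enable_thinking": True,
}
adjusted = adjust_request(request)
print(adjusted)
```

Working on a copy keeps the original request available for logging or error messages.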
The proxy is organized into three main layers:
- API Layer (app/api/) - FastAPI routes and endpoints
- Handler Layer (app/core/) - Request routing and response transformation
- Model Layer (app/base_models/) - Pydantic data models for validation
See ARCHITECTURE.md for detailed component documentation.
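
As an illustration of the response transformation done in the handler layer, the sketch below maps an upstream payload onto one flat structure. It assumes the upstream response is OpenAI-style (`choices[0].message`) and that reasoning text arrives in a `reasoning_content` field. Both are assumptions, as are the function name `normalize_response` and the output keys; see ARCHITECTURE.md for the real response models.

```python
from typing import Optional


def normalize_response(model_id: str, raw: dict) -> dict:
    """Map an upstream chat-completion payload onto one flat structure."""
    message = raw["choices"][0]["message"]
    # "reasoning_content" is an assumed field name for thinking output.
    thinking: Optional[str] = message.get("reasoning_content")
    return {
        "model": model_id,
        "content": message.get("content") or "",
        "thinking": thinking,
        "usage": raw.get("usage", {}),
    }


# Hand-written sample payload, not captured from a real model.
raw_example = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "x = 4",
                "reasoning_content": "Subtract 5, then divide by 2.",
            }
        }
    ],
    "usage": {"total_tokens": 42},
}
unified = normalize_response("glm4.7", raw_example)
print(unified)
```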
| Model | ID | Type | Thinking |
|---|---|---|---|
| DeepSeek V3.2 | deepseek-v3.2 | Standard | Yes |
| DeepSeek V4 Flash | deepseek-v4-flash | Standard | No |
| DeepSeek V4 Pro | deepseek-v4-pro | Standard | No |
| Mistral Large 3 | mistral-large-3 | Standard | No |
| Devstral 2 (123B) | devstral-2-123b | Standard | No |
| Google Gemma 3 | gemma-3-27b | Standard | No |
| Kimi K2 Thinking | kimi-k2-thinking | Thinking | Yes |
| GLM 4.7 | glm4.7 | Thinking | Yes |
| MiniMax M2.7 | minimax-m2.7 | Standard | No |
| Step 3.5 Flash | step-3.5-flash | Standard | No |
| Nemotron Content Safety | nemotron-3-content-safety | Safety | No |
| Nemotron Safety Reasoning | nemotron-content-safety-reasoning | Safety | Yes |
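
The table can be thought of as a small lookup structure. The sketch below encodes four of its rows to show how a registry lookup might behave. The `ModelInfo` class, the `REGISTRY` dict, and the `get_model` helper are hypothetical names chosen for this example; the project's own registry is in app/core/model_registry.py and may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    model_id: str
    name: str
    kind: str       # "Standard", "Thinking", or "Safety"
    thinking: bool  # whether the model can emit reasoning blocks


# A subset of the rows from the table, keyed by model ID.
REGISTRY = {
    info.model_id: info
    for info in (
        ModelInfo("deepseek-v3.2", "DeepSeek V3.2", "Standard", True),
        ModelInfo("mistral-large-3", "Mistral Large 3", "Standard", False),
        ModelInfo("glm4.7", "GLM 4.7", "Thinking", True),
        ModelInfo("nemotron-3-content-safety", "Nemotron Content Safety", "Safety", False),
    )
}


def get_model(model_id: str) -> ModelInfo:
    """Look up a model, raising a descriptive error for unknown IDs."""
    try:
        return REGISTRY[model_id]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown model '{model_id}'. Known models: {known}") from None


thinking_ids = sorted(m.model_id for m in REGISTRY.values() if m.thinking)
print(thinking_ids)
```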
- Python 3.11+
- uv package manager
- NVIDIA API key from api.nvidia.com
# Clone the repository
git clone https://github.com/hyena-industries/Hyena-PAM-Proxy.git
cd Hyena-PAM-Proxy
# Install dependencies with uv
uv pip install -e .

Create a .env file:
NVIDIA_API_KEY=your-nvidia-api-key-here
SERVER_HOST=0.0.0.0
SERVER_PORT=8000

Running locally:
# Option 1: Direct uvicorn (recommended for development)
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload
# Option 2: Via main.py
uv run python main.py
# Option 3: Custom port
uv run uvicorn app.api:app --host 0.0.0.0 --port 8082 --reload

The server will be available at http://localhost:8000 (or your custom port).
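
The three variables from the .env file (`NVIDIA_API_KEY`, `SERVER_HOST`, `SERVER_PORT`) could be read along these lines. This is a sketch under assumptions, not the project's app/config.py: the `Settings` class and `load_settings` function are hypothetical, and the values are assumed to be present in the process environment already, since `os.environ` does not read a .env file by itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    nvidia_api_key: str
    host: str = "0.0.0.0"
    port: int = 8000


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, failing fast on a missing key."""
    env = os.environ if env is None else env
    api_key = env.get("NVIDIA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("NVIDIA_API_KEY is not set")
    return Settings(
        nvidia_api_key=api_key,
        host=env.get("SERVER_HOST", "0.0.0.0"),
        port=int(env.get("SERVER_PORT", "8000")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings(
    {"NVIDIA_API_KEY": "your-nvidia-api-key-here", "SERVER_PORT": "8082"}
)
print(settings.host, settings.port)
```

Failing at startup when the key is missing gives a clearer error than a rejected upstream request later on.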
# Set your NVIDIA API key (PowerShell syntax)
$env:NVIDIA_API_KEY="your-nvidia-api-key-here"
# Then run the server
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

You can use this API locally right now!
Server will be available at http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
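
The chat completion endpoint can also be called from Python. The sketch below uses only the standard library and assumes the server is running at http://localhost:8000 and accepts the same JSON body as this README's curl examples. The helper name `build_chat_request` is made up for this example. The request is only constructed here; the commented lines at the end show how it could be sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the default local port


def build_chat_request(model: str, messages: list, **params) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "mistral-large-3",
    [{"role": "user", "content": "Explain quantum computing in simple terms"}],
    temperature=0.7,
    max_tokens=1024,
)
print(req.get_method(), req.full_url)

# To actually send it (requires the local server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```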
# Health check
curl http://localhost:8000/api/v1/health

# List available models
curl http://localhost:8000/api/v1/models

# Get details for one model
curl http://localhost:8000/api/v1/models/mistral-large-3

# Chat completion
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.95
  }'

# Streaming chat completion
curl -X POST http://localhost:8000/api/v1/chat/completions/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3.2",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true,
    "max_tokens": 512
  }'

# Chat completion with a thinking model
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4.7",
    "messages": [
      {"role": "user", "content": "Solve: 2x + 5 = 13"}
    ],
    "enable_thinking": true,
    "max_tokens": 2048
  }'

Public Deployment (Coming Soon)
The API will soon be available publicly at https://api.hyena-industries.com for production use.
- [PENDING] Setting up public API endpoint
- [PLANNED] Railway.app or Render.com deployment
- [PLANNED] Custom domain setup and DNS configuration
- [PLANNED] Rate limiting and authentication
Until the public API is ready, you can:
- Clone this repo and run it locally on your machine
- Deploy yourself using Railway.app or Render.com following the deployment guide (see below)
- Use Docker to containerize and run anywhere
For local-to-local networking:
- Run on your machine at http://localhost:8000
- Access from other machines on your network: http://your-machine-ip:8000
If you want to host it yourself before the official public API is ready:
- Push to GitHub
- Sign up at railway.app with GitHub
- Create new project from your repo
- Add environment variable: NVIDIA_API_KEY
- Add custom domain: your-subdomain[...]

- [...] (-2.0 to 2.0): Presence penalty (default: 0.0)
- seed (int): Random seed for reproducibility
- stream (boolean): Enable streaming response (default: false)
- enable_thinking (boolean): Enable reasoning blocks (thinking models only)
- reasoning_effort (string): Reasoning effort level (some models)
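
The documented ranges and defaults for these parameters lend themselves to a simple validation step. The sketch below is hypothetical: `validate_params` is not a function from this project, and the field name `presence_penalty` is an assumption, because the parameter name is cut off in the list above. The project itself validates requests with Pydantic models.

```python
def validate_params(params: dict) -> dict:
    """Apply documented defaults and reject out-of-range or mistyped values."""
    # "presence_penalty" is an assumed name for the -2.0 to 2.0 parameter.
    cleaned = {"presence_penalty": 0.0, "stream": False, **params}

    penalty = cleaned["presence_penalty"]
    if isinstance(penalty, bool) or not isinstance(penalty, (int, float)):
        raise ValueError("presence_penalty must be a number")
    if not -2.0 <= penalty <= 2.0:
        raise ValueError("presence_penalty must be between -2.0 and 2.0")

    # bool is a subclass of int in Python, so exclude it explicitly.
    seed = cleaned.get("seed")
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
        raise ValueError("seed must be an integer")

    for flag in ("stream", "enable_thinking"):
        if flag in cleaned and not isinstance(cleaned[flag], bool):
            raise ValueError(f"{flag} must be a boolean")

    if "reasoning_effort" in cleaned and not isinstance(cleaned["reasoning_effort"], str):
        raise ValueError("reasoning_effort must be a string")

    return cleaned


validated = validate_params({"seed": 42, "enable_thinking": True})
print(validated)
```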
Ready to Use Locally:
- Base models and Pydantic schemas
- Model registry with 12+ models
- Request handler with model routing
- Async FastAPI endpoints
- Streaming support
- Thinking model support
- Error handling and validation
- API documentation (Swagger/ReDoc)
- Architecture documentation
- Local server setup and running
- README with usage examples
- Comprehensive test suite
- Rate limiting
- Production server configuration
- Public API deployment (api.hyena-industries.com)
- Docker containerization
- Load testing and optimization
- Request/response caching
- Analytics and logging

│   │   ├── [...]                # Request models (Pydantic)
│   │   └── response.py          # Response models (Pydantic)
│   ├── core/
│   │   ├── __init__.py          # Core package exports
│   │   ├── handler.py           # Request handler & routing logic
│   │   ├── model_registry.py    # Model configuration registry
│   │   └── model_calls/         # Original model example files
│   ├── __init__.py              # App package exports
│   ├── config.py                # Configuration
│   └── settings.py              # Settings definitions
├── main.py                      # Server entry point
├── pyproject.toml               # Project dependencies
├── README.md                    # This file
├── ARCHITECTURE.md              # Detailed architecture guide
└── DEPLOYMENT.md                # Deployment instructions

## Development Status
### COMPLETED - Ready to Use Locally
- [x] Base models and Pydantic schemas
- [x] Model registry with 12+ models
- [x] Request handler with model routing
- [x] Async FastAPI endpoints
- [x] Streaming support
- [x] Thinking model support
- [x] Error handling and validation
- [x] API documentation (Swagger/ReDoc)
- [x] Architecture documentation
### IN PROGRESS / PLANNED
- [ ] Comprehensive test suite
- [ ] Rate limiting
- [ ] Request/response caching
- [ ] Analytics and logging
- [ ] Docker containerization
- [ ] Kubernetes deployment
- [ ] WebSocket support for persistent connections
- [ ] Authentication/API keys
- [ ] Cost tracking per model
- [ ] Load balancing across providers
## What's Next?
1. **Using Locally** - You can do this now:
   ```bash
   git clone https://github.com/hyena-industries/Hyena-PAM-Proxy.git
   cd Hyena-PAM-Proxy
   uv run uvicorn app.api:app --host 0.0.0.0 --port 8000
   ```
2. **Self-Host Publicly** - Instructions in DEPLOYMENT.md
3. **Official Public API** - Coming soon at https://api.hyena-industries.com

Last Updated: April 26, 2026

Current Status:
- LOCALLY FUNCTIONAL - Pull and use it right now
- BEFORE PUBLIC RELEASE - Test suite and deployment config
- PUBLIC API - Coming soon

Contributions are welcome! Areas needing work:
- Tests and test coverage
- Additional models
- Performance optimizations
- Deployment examples
- Documentation improvements
MIT License - See LICENSE file for details
For issues, questions, or suggestions:
- Check ARCHITECTURE.md for technical details
- Review existing model configurations in app/core/model_registry.py
- Check NVIDIA API documentation at api.nvidia.com

- Built for Hyena Industries
- Powered by NVIDIA NIM (NVIDIA Inference Microservices)
- Thanks to FastAPI and Pydantic for excellent frameworks