CosmonautCode/Hyena-PAM-Proxy

Hyena PAM Proxy

A unified async FastAPI proxy for accessing NVIDIA AI models through a single API endpoint. It routes requests to NVIDIA NIM (NVIDIA Inference Microservices) models, adjusts each request for the target model, and normalizes the responses into one format.

Status: WORK IN PROGRESS

  • Core architecture is functional and can be used locally
  • Public API endpoint (api.hyena-industries.com) is NOT YET LIVE
  • You can clone this repo and run it locally right now
  • Full production deployment coming soon

Features

  • Async/Await Support: Fully asynchronous FastAPI endpoints for high concurrency
  • 12+ Models Supported: DeepSeek, Mistral, Gemma, GLM, Kimi, and more
  • Streaming Responses: Real-time token streaming via Server-Sent Events
  • Automatic Request Adjustment: Model-specific parameter optimization (temperature defaults, thinking models, etc.)
  • Unified Response Format: Consistent response structure across all models
  • Thinking Models: Support for models with reasoning/thinking blocks (DeepSeek, GLM, Kimi)
  • Error Handling: Comprehensive error handling with detailed error messages
  • Model Registry: Easy model discovery and configuration
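As a rough illustration of the "Automatic Request Adjustment" feature, the sketch below applies two model-specific rules to an incoming payload. The names `THINKING_MODELS`, `DEFAULT_TEMPERATURE`, and `adjust_request`, and the rules themselves, are assumptions made for this example; the proxy's real logic lives in app/core/ and may differ.

```python
from __future__ import annotations

from typing import Any

# Assumption: model IDs marked "Thinking: Yes" in the Supported Models table.
THINKING_MODELS = {
    "deepseek-v3.2",
    "kimi-k2-thinking",
    "glm4.7",
    "nemotron-content-safety-reasoning",
}

# Assumption: a single fallback value; real per-model defaults are not documented here.
DEFAULT_TEMPERATURE = 0.7


def adjust_request(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the payload with model-specific adjustments applied."""
    adjusted = dict(payload)
    # Fill in a temperature only when the caller did not provide one.
    adjusted.setdefault("temperature", DEFAULT_TEMPERATURE)
    # Drop the thinking flag for models that do not produce reasoning blocks.
    if adjusted["model"] not in THINKING_MODELS:
        adjusted.pop("enable_thinking", None)
    return adjusted
```

Returning a copy rather than mutating the input keeps the caller's original request available for logging or error messages.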

Architecture

The proxy is organized into three main layers:

  1. API Layer (app/api/) - FastAPI routes and endpoints
  2. Handler Layer (app/core/) - Request routing and response transformation
  3. Model Layer (app/base_models/) - Pydantic data models for validation

See ARCHITECTURE.md for detailed component documentation.

Supported Models

| Model | ID | Type | Thinking |
|---|---|---|---|
| DeepSeek V3.2 | deepseek-v3.2 | Standard | Yes |
| DeepSeek V4 Flash | deepseek-v4-flash | Standard | No |
| DeepSeek V4 Pro | deepseek-v4-pro | Standard | No |
| Mistral Large 3 | mistral-large-3 | Standard | No |
| Devstral 2 (123B) | devstral-2-123b | Standard | No |
| Google Gemma 3 | gemma-3-27b | Standard | No |
| Kimi K2 Thinking | kimi-k2-thinking | Thinking | Yes |
| GLM 4.7 | glm4.7 | Thinking | Yes |
| MiniMax M2.7 | minimax-m2.7 | Standard | No |
| Step 3.5 Flash | step-3.5-flash | Standard | No |
| Nemotron Content Safety | nemotron-3-content-safety | Safety | No |
| Nemotron Safety Reasoning | nemotron-content-safety-reasoning | Safety | Yes |
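To show how the table above might translate into a lookup structure, here is a minimal registry sketch. `ModelInfo`, `MODEL_REGISTRY`, `get_model`, and `thinking_model_ids` are hypothetical names chosen for this example and are not taken from app/core/model_registry.py; only a few rows are copied in.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    model_id: str
    type: str
    thinking: bool


# A few rows copied from the Supported Models table; the rest follow the same shape.
_ROWS = [
    ModelInfo("DeepSeek V3.2", "deepseek-v3.2", "Standard", True),
    ModelInfo("Mistral Large 3", "mistral-large-3", "Standard", False),
    ModelInfo("Kimi K2 Thinking", "kimi-k2-thinking", "Thinking", True),
    ModelInfo("GLM 4.7", "glm4.7", "Thinking", True),
    ModelInfo("Nemotron Content Safety", "nemotron-3-content-safety", "Safety", False),
]

MODEL_REGISTRY = {row.model_id: row for row in _ROWS}


def get_model(model_id: str) -> ModelInfo:
    """Look up a model by ID, raising a readable error for unknown IDs."""
    try:
        return MODEL_REGISTRY[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id!r}") from None


def thinking_model_ids() -> list[str]:
    """Return the IDs of all registered models that support thinking, sorted."""
    return sorted(m.model_id for m in MODEL_REGISTRY.values() if m.thinking)
```

Keying the registry by model ID means the `model` field of a request can be validated with a single dictionary lookup.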

Quick Start - Run Locally

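Prerequisites

  • uv, which the install and run commands below rely on
  • An NVIDIA API key, supplied as NVIDIA_API_KEY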

Installation

# Clone the repository
git clone https://github.com/hyena-industries/Hyena-PAM-Proxy.git
cd Hyena-PAM-Proxy

# Install dependencies with uv
uv pip install -e .

Environment Setup

Create a .env file:

NVIDIA_API_KEY=your-nvidia-api-key-here
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
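The sketch below shows one way these three variables could be read and validated at startup. It assumes the values have already been loaded into the process environment; the `Settings` class and `load_settings` function are illustrative names, not the contents of app/settings.py.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    nvidia_api_key: str
    server_host: str = "0.0.0.0"
    server_port: int = 8000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing early if the key is missing."""
    api_key = env.get("NVIDIA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("NVIDIA_API_KEY is not set")
    return Settings(
        nvidia_api_key=api_key,
        server_host=env.get("SERVER_HOST", "0.0.0.0"),
        # Environment values are strings, so the port is converted explicitly.
        server_port=int(env.get("SERVER_PORT", "8000")),
    )
```

Passing the environment in as a mapping makes the loader easy to exercise with a plain dictionary instead of real environment variables.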

Running the Server

# Option 1: Direct uvicorn (recommended for development)
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

# Option 2: Via main.py
uv run python main.py

# Option 3: Custom port
uv run uvicorn app.api:app --host 0.0.0.0 --port 8082 --reload

Server will be available at http://localhost:8000 (or your custom port)

First-Time Setup

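# Set your NVIDIA API key (PowerShell syntax)
$env:NVIDIA_API_KEY="your-nvidia-api-key-here"
# On bash or zsh, use instead: export NVIDIA_API_KEY="your-nvidia-api-key-here"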

# Then run the server
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

You can use this API locally right now! Server will be available at http://localhost:8000

API Usage

Interactive Documentation

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Endpoints

Health Check

curl http://localhost:8000/api/v1/health

List Models

curl http://localhost:8000/api/v1/models

Get Model Info

curl http://localhost:8000/api/v1/models/mistral-large-3

Synchronous Completion

curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.95
  }'
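The same request can be sent from Python using only the standard library. This is a minimal sketch: `BASE_URL`, `build_chat_request`, and `chat` are names invented for the example, and the response is returned as a plain dictionary because its exact fields are not documented in this README.

```python
from __future__ import annotations

import json
import urllib.request

# Assumption: the server is running locally on the default port.
BASE_URL = "http://localhost:8000/api/v1"


def build_chat_request(model, user_prompt, *, system_prompt=None, **params):
    """Build a POST request for /chat/completions without sending it."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    # Extra keyword arguments (temperature, max_tokens, ...) are passed through as-is.
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_prompt, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_chat_request(model, user_prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the server running, `chat("mistral-large-3", "Explain quantum computing in simple terms", temperature=0.7, max_tokens=1024)` would issue the equivalent of the curl call.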

Streaming Completion

curl -X POST http://localhost:8000/api/v1/chat/completions/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3.2",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true,
    "max_tokens": 512
  }'
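On the client side, a Server-Sent Events stream has to be split into individual events. The sketch below assumes each event arrives as a `data: <json>` line and that the stream ends with a `data: [DONE]` sentinel, as in OpenAI-style APIs; this README does not confirm either detail, and the `token` field in the usage note is purely illustrative.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until the end sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        yield json.loads(data)
```

For instance, feeding it `['data: {"token": "Hel"}', '', 'data: {"token": "lo"}', 'data: [DONE]']` yields two dictionaries. Against a live server, the lines would come from the HTTP response body, decoded from bytes one line at a time.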

With Thinking (for thinking models)

curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4.7",
    "messages": [
      {"role": "user", "content": "Solve: 2x + 5 = 13"}
    ],
    "enable_thinking": true,
    "max_tokens": 2048
  }'

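Request Parameters

Required

The example requests above all send these two fields:

  • model: a model ID from the Supported Models table
  • messages: a list of message objects, each with a role and content

Public Deployment (Coming Soon)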

The API will soon be available publicly at https://api.hyena-industries.com for production use.

Current Deployment Status

  • [PENDING] Setting up public API endpoint
  • [PLANNED] Railway.app or Render.com deployment
  • [PLANNED] Custom domain setup and DNS configuration
  • [PLANNED] Rate limiting and authentication

For Now: Use Locally

Until the public API is ready, you can:

  1. Clone this repo and run it locally on your machine
  2. Deploy yourself using Railway.app or Render.com following the deployment guide (see below)
  3. Use Docker to containerize and run anywhere

For access over your local network:

  • On the host machine: http://localhost:8000
  • From other machines on your network: http://your-machine-ip:8000 (reachable because the server binds to 0.0.0.0)

Self-Deployment with Railway.app

If you want to host it yourself before the official public API is ready:

  1. Push to GitHub
  2. Sign up at railway.app with GitHub
  3. Create new project from your repo
  4. Add environment variable: NVIDIA_API_KEY
  5. Add custom domain: your-subdomain...

Optional Request Parameters

  • presence_penalty (float, -2.0 to 2.0): Presence penalty (default: 0.0)
  • seed (int): Random seed for reproducibility
  • stream (boolean): Enable streaming response (default: false)
  • enable_thinking (boolean): Enable reasoning blocks (thinking models only)
  • reasoning_effort (string): Reasoning effort level (some models)
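The types and ranges in the parameter list above can be checked before a request is forwarded. The sketch below is one way to do that in plain Python; `validate_params` is a hypothetical helper, the name `presence_penalty` is assumed from the "Presence penalty" description, and the actual proxy validates requests with its Pydantic models instead.

```python
from __future__ import annotations


def validate_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the params look valid."""
    errors = []

    penalty = params.get("presence_penalty", 0.0)
    is_number = isinstance(penalty, (int, float)) and not isinstance(penalty, bool)
    if not is_number or not -2.0 <= penalty <= 2.0:
        errors.append("presence_penalty must be a number between -2.0 and 2.0")

    seed = params.get("seed")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
        errors.append("seed must be an integer")

    for flag in ("stream", "enable_thinking"):
        if flag in params and not isinstance(params[flag], bool):
            errors.append(f"{flag} must be a boolean")

    effort = params.get("reasoning_effort")
    if effort is not None and not isinstance(effort, str):
        errors.append("reasoning_effort must be a string")

    return errors
```

Collecting every problem in one pass, rather than stopping at the first, lets an API return all validation errors in a single response.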

Development Status

COMPLETED - Ready to Use Locally

  • Base models and Pydantic schemas
  • Model registry with 12+ models
  • Request handler with model routing
  • Async FastAPI endpoints
  • Streaming support
  • Thinking model support
  • Error handling and validation
  • API documentation (Swagger/ReDoc)
  • Architecture documentation
  • Local server setup and running
  • README with usage examples

IN PROGRESS - Before Public Release

  • Comprehensive test suite
  • Rate limiting
  • Production server configuration
  • Public API deployment (api.hyena-industries.com)
  • Docker containerization
  • Load testing and optimization

PLANNED - Future Features

  • Request/response caching
  • Analytics and logging

Project Structure

...
│   │   ├── ...                 # Request models (Pydantic)
│   │   └── response.py         # Response models (Pydantic)
│   ├── core/
│   │   ├── __init__.py         # Core package exports
│   │   ├── handler.py          # Request handler & routing logic
│   │   ├── model_registry.py   # Model configuration registry
│   │   └── model_calls/        # Original model example files
│   ├── __init__.py             # App package exports
│   ├── config.py               # Configuration
│   └── settings.py             # Settings definitions
├── main.py                     # Server entry point
├── pyproject.toml              # Project dependencies
├── README.md                   # This file
├── ARCHITECTURE.md             # Detailed architecture guide
└── DEPLOYMENT.md               # Deployment instructions

Also planned:

  • Kubernetes deployment
  • WebSocket support for persistent connections
  • Authentication/API keys
  • Cost tracking per model
  • Load balancing across providers

What's Next?

  1. Using Locally - You can do this now:

```bash
git clone <repo>
cd Hyena-PAM-Proxy
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000
```

  2. Self-Host Publicly - Instructions in DEPLOYMENT.md
  3. Official Public API - Coming soon at https://api.hyena-industries.com


Last Updated: April 26, 2026

Current Status:

  • LOCALLY FUNCTIONAL - Pull and use it right now
  • BEFORE PUBLIC RELEASE - Test suite and deployment config
  • PUBLIC API - Coming soon

Contributing

Contributions are welcome! Areas needing work:

  • Tests and test coverage
  • Additional models
  • Performance optimizations
  • Deployment examples
  • Documentation improvements

License

MIT License - See LICENSE file for details

Support

For issues, questions, or suggestions:

  1. Check ARCHITECTURE.md for technical details
  2. Review existing model configurations in app/core/model_registry.py
  3. Check NVIDIA API documentation at api.nvidia.com

Acknowledgments

  • Built for Hyena Industries
  • Powered by NVIDIA NIM (NVIDIA Inference Microservices)
  • Built with FastAPI and Pydantic
