Hyena PAM Proxy

A unified async FastAPI proxy for accessing NVIDIA AI models through a single API endpoint. This project provides a clean interface to route requests to various NVIDIA NIM (NVIDIA Inference Microservices) models with automatic request adjustment and response normalization.

Status: WORK IN PROGRESS

Core architecture is functional and can be used locally

Public API endpoint (api.hyena-industries.com) is NOT YET LIVE

You can clone this repo and run it locally right now

Full production deployment coming soon

Features

Async/Await Support: Fully asynchronous FastAPI endpoints for high concurrency
12+ Models Supported: DeepSeek, Mistral, Gemma, GLM, Kimi, and more
Streaming Responses: Real-time token streaming via Server-Sent Events
Automatic Request Adjustment: Model-specific parameter optimization (temperature defaults, thinking models, etc.)
Unified Response Format: Consistent response structure across all models
Thinking Models: Support for models with reasoning/thinking blocks (DeepSeek, GLM, Kimi)
Error Handling: Comprehensive error handling with detailed error messages
Model Registry: Easy model discovery and configuration

Architecture

The proxy is organized into three main layers:

API Layer (app/api/) - FastAPI routes and endpoints
Handler Layer (app/core/) - Request routing and response transformation
Model Layer (app/base_models/) - Pydantic data models for validation

See ARCHITECTURE.md for detailed component documentation.
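As a rough illustration of how the three layers could fit together, here is a minimal sketch using only the standard library. It is not the project's actual code: `ChatRequest`, `DEFAULT_TEMPERATURE`, `adjust_request`, `normalize_response`, and `handle_chat` are hypothetical names, the dataclass stands in for the real Pydantic models, and the upstream response shape (OpenAI-style `choices`) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Model layer: a dataclass standing in for the project's Pydantic request model.
@dataclass
class ChatRequest:
    model: str
    messages: list
    temperature: Optional[float] = None  # None means "let the handler choose"


# Handler layer: hypothetical per-model defaults (illustrative values only).
DEFAULT_TEMPERATURE = {"mistral-large-3": 0.7}


def adjust_request(req: ChatRequest) -> dict:
    """Fill in a model-specific default and build the upstream payload."""
    temperature = req.temperature
    if temperature is None:
        temperature = DEFAULT_TEMPERATURE.get(req.model, 1.0)
    return {"model": req.model, "messages": req.messages, "temperature": temperature}


def normalize_response(upstream: dict) -> dict:
    """Reduce an assumed OpenAI-style response to one consistent shape."""
    choice = upstream["choices"][0]
    return {"model": upstream.get("model"), "content": choice["message"]["content"]}


# API layer: a plain function standing in for a FastAPI route.
def handle_chat(req: ChatRequest, call_upstream: Callable[[dict], dict]) -> dict:
    return normalize_response(call_upstream(adjust_request(req)))
```

Passing the upstream call in as a function keeps the sketch testable without network access; the real routing logic is described in ARCHITECTURE.md.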

Supported Models

| Model | ID | Type | Thinking |
|---|---|---|---|
| DeepSeek V3.2 | `deepseek-v3.2` | Standard | Yes |
| DeepSeek V4 Flash | `deepseek-v4-flash` | Standard | No |
| DeepSeek V4 Pro | `deepseek-v4-pro` | Standard | No |
| Mistral Large 3 | `mistral-large-3` | Standard | No |
| Devstral 2 (123B) | `devstral-2-123b` | Standard | No |
| Google Gemma 3 | `gemma-3-27b` | Standard | No |
| Kimi K2 Thinking | `kimi-k2-thinking` | Thinking | Yes |
| GLM 4.7 | `glm4.7` | Thinking | Yes |
| MiniMax M2.7 | `minimax-m2.7` | Standard | No |
| Step 3.5 Flash | `step-3.5-flash` | Standard | No |
| Nemotron Content Safety | `nemotron-3-content-safety` | Safety | No |
| Nemotron Safety Reasoning | `nemotron-content-safety-reasoning` | Safety | Yes |
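To make the table concrete, the sketch below shows one way a lookup over these IDs could work. It is an assumption about shape only: `ModelInfo`, `REGISTRY`, `get_model`, and `thinking_models` are hypothetical names, and the real configuration lives in app/core/model_registry.py. Only a few rows from the table are copied in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    model_type: str  # "Standard", "Thinking", or "Safety", as in the table
    thinking: bool   # the "Thinking" column


# A handful of rows copied from the Supported Models table.
REGISTRY = {
    "deepseek-v3.2": ModelInfo("DeepSeek V3.2", "Standard", True),
    "mistral-large-3": ModelInfo("Mistral Large 3", "Standard", False),
    "kimi-k2-thinking": ModelInfo("Kimi K2 Thinking", "Thinking", True),
    "glm4.7": ModelInfo("GLM 4.7", "Thinking", True),
    "nemotron-3-content-safety": ModelInfo("Nemotron Content Safety", "Safety", False),
}


def get_model(model_id: str) -> ModelInfo:
    """Look up a model by ID, raising a clear error for unknown IDs."""
    try:
        return REGISTRY[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id!r}") from None


def thinking_models() -> list[str]:
    """Return the IDs whose Thinking column is Yes, sorted alphabetically."""
    return sorted(mid for mid, info in REGISTRY.items() if info.thinking)
```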

Quick Start - Run Locally

Prerequisites

Python 3.11+
uv package manager (see the uv documentation for installation instructions)
NVIDIA API key from api.nvidia.com

Installation

# Clone the repository
git clone https://github.com/hyena-industries/Hyena-PAM-Proxy.git
cd Hyena-PAM-Proxy

# Install dependencies with uv
uv pip install -e .

Environment Setup

Create a .env file:

NVIDIA_API_KEY=your-nvidia-api-key-here
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
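For illustration, these three variables could be read in Python roughly as follows. This is a sketch, not the project's app/settings.py: `Settings` and `load_settings` are hypothetical names, and it assumes the .env values have already been loaded into the process environment (by your shell or a dotenv loader), since `os.environ` does not read .env files on its own.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    nvidia_api_key: str
    server_host: str
    server_port: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("NVIDIA_API_KEY", "")
    if not api_key:
        # Fail early instead of sending unauthenticated requests upstream.
        raise RuntimeError("NVIDIA_API_KEY is not set")
    return Settings(
        nvidia_api_key=api_key,
        server_host=env.get("SERVER_HOST", "0.0.0.0"),
        server_port=int(env.get("SERVER_PORT", "8000")),
    )
```

Accepting the mapping as an argument makes the function easy to exercise with a plain dict instead of the real environment.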

Run the Server

# Option 1: Direct uvicorn (recommended for development)
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

# Option 2: Via main.py
uv run python main.py

# Option 3: Custom port
uv run uvicorn app.api:app --host 0.0.0.0 --port 8082 --reload

Server will be available at http://localhost:8000 (or your custom port)

First-Time Setup

# Set your NVIDIA API key (PowerShell syntax)
$env:NVIDIA_API_KEY="your-nvidia-api-key-here"

# On bash or zsh, use this instead:
# export NVIDIA_API_KEY="your-nvidia-api-key-here"

# Then run the server
uv run uvicorn app.api:app --host 0.0.0.0 --port 8000 --reload

The server will then be available at http://localhost:8000.

API Usage

Interactive Documentation

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Endpoints

Health Check

curl http://localhost:8000/api/v1/health

List Models

curl http://localhost:8000/api/v1/models

Get Model Info

curl http://localhost:8000/api/v1/models/mistral-large-3

Synchronous Completion

curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.95
  }'
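The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_completion_request` and `chat` are hypothetical helper names, and the shape of the JSON that comes back is not documented here, so `chat` simply returns it as decoded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local server from the Quick Start


def build_completion_request(model, messages, **params):
    """Build the POST request for /api/v1/chat/completions without sending it."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, **params):
    """Send the request and return the decoded JSON body (needs a running server)."""
    req = build_completion_request(model, messages, **params)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

With the server running, a call might look like `chat("mistral-large-3", [{"role": "user", "content": "Hello"}], max_tokens=64)`.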

Streaming Completion

curl -X POST http://localhost:8000/api/v1/chat/completions/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3.2",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true,
    "max_tokens": 512
  }'
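On the client side, a Server-Sent Events stream arrives as lines beginning with `data:`. The sketch below shows one way to collect the streamed text. The chunk format is an assumption: it presumes OpenAI-style chunks (`choices[0].delta.content`) and a final `data: [DONE]` marker, neither of which this README confirms, and the helper names are hypothetical. It also treats each `data:` line as a complete event, which is a simplification of the SSE format.

```python
import json


def iter_sse_data(lines):
    """Yield the payload of each `data:` line in a Server-Sent Events stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        yield line[len("data:"):].strip()


def collect_text(lines):
    """Join streamed text, assuming OpenAI-style chunks (an assumption)."""
    parts = []
    for data in iter_sse_data(lines):
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

Because both functions accept any iterable of lines, they work on an open HTTP response as well as on a list of captured lines.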

With Thinking (for thinking models)

curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4.7",
    "messages": [
      {"role": "user", "content": "Solve: 2x + 5 = 13"}
    ],
    "enable_thinking": true,
    "max_tokens": 2048
  }'
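Since `enable_thinking` only applies to thinking models, a client may want to guard against setting it elsewhere. The sketch below is a client-side check built from the Thinking column of the Supported Models table. `THINKING_MODELS` and `build_payload` are hypothetical names, and how the proxy itself reacts to the flag on a non-thinking model is not stated in this README.

```python
# IDs marked "Thinking: Yes" in the Supported Models table.
THINKING_MODELS = {
    "deepseek-v3.2",
    "kimi-k2-thinking",
    "glm4.7",
    "nemotron-content-safety-reasoning",
}


def build_payload(model, messages, enable_thinking=False, **params):
    """Build a request body, rejecting enable_thinking on non-thinking models."""
    if enable_thinking and model not in THINKING_MODELS:
        raise ValueError(f"{model!r} does not support enable_thinking")
    payload = {"model": model, "messages": messages, **params}
    if enable_thinking:
        payload["enable_thinking"] = True
    return payload
```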

Request Parameters

Required

Every example request above sends model (the model ID) and messages (a list of objects with role and content).

Optional

…
presence_penalty (float, -2.0 to 2.0): Presence penalty (default: 0.0)
seed (int): Random seed for reproducibility
stream (boolean): Enable streaming response (default: false)
enable_thinking (boolean): Enable reasoning blocks (thinking models only)
reasoning_effort (string): Reasoning effort level (some models)

Public Deployment (Coming Soon)

The API will soon be available publicly at https://api.hyena-industries.com for production use.

Current Deployment Status

[PENDING] Setting up public API endpoint
[PLANNED] Railway.app or Render.com deployment
[PLANNED] Custom domain setup and DNS configuration
[PLANNED] Rate limiting and authentication

For Now: Use Locally

Until the public API is ready, you can:

Clone this repo and run it locally on your machine
Deploy it yourself using Railway.app or Render.com, following the deployment guide (see below)
Use Docker to containerize and run anywhere

For access over your local network:

Run on your machine at http://localhost:8000
Access from other machines on your network: http://your-machine-ip:8000

Self-Deployment with Railway.app

If you want to host it yourself before the official public API is ready:

Push to GitHub
Sign up at railway.app with GitHub
Create new project from your repo
Add environment variable: NVIDIA_API_KEY
Add custom domain: your-subdomain…

Project Structure

…                                 # Request models (Pydantic)
│   │   └── response.py           # Response models (Pydantic)
│   ├── core/
│   │   ├── __init__.py           # Core package exports
│   │   ├── handler.py            # Request handler & routing logic
│   │   ├── model_registry.py     # Model configuration registry
│   │   └── model_calls/          # Original model example files
│   ├── __init__.py               # App package exports
│   ├── config.py                 # Configuration
│   └── settings.py               # Settings definitions
├── main.py                       # Server entry point
├── pyproject.toml                # Project dependencies
├── README.md                     # This file
├── ARCHITECTURE.md               # Detailed architecture guide
└── DEPLOYMENT.md                 # Deployment instructions

Development Status

COMPLETED - Ready to Use Locally

Base models and Pydantic schemas
Model registry with 12+ models
Request handler with model routing
Async FastAPI endpoints
Streaming support
Thinking model support
Error handling and validation
API documentation (Swagger/ReDoc)
Architecture documentation

IN PROGRESS - Before Public Release

Comprehensive test suite
Rate limiting
Production server configuration
Public API deployment (api.hyena-industries.com)
Docker containerization
Load testing and optimization

PLANNED - Future Features

Request/response caching
Analytics and logging
Kubernetes deployment
WebSocket support for persistent connections
Authentication/API keys
Cost tracking per model
Load balancing across providers

What's Next?

1. Using Locally - You can do this now:
   ```bash
   git clone https://github.com/hyena-industries/Hyena-PAM-Proxy.git
   cd Hyena-PAM-Proxy
   uv run uvicorn app.api:app --host 0.0.0.0 --port 8000
   ```
2. Self-Host Publicly - Instructions in DEPLOYMENT.md
3. Official Public API - Coming soon at https://api.hyena-industries.com

Last Updated: April 26, 2026

Current Status:

LOCALLY FUNCTIONAL - Pull and use it right now
BEFORE PUBLIC RELEASE - Test suite and deployment config
PUBLIC API - Coming soon

Contributions are welcome! Areas needing work:

Tests and test coverage
Additional models
Performance optimizations
Deployment examples
Documentation improvements

License

MIT License - See LICENSE file for details

Support

For issues, questions, or suggestions:

Check ARCHITECTURE.md for technical details
Review existing model configurations in app/core/model_registry.py
Check NVIDIA API documentation at api.nvidia.com

Acknowledgments

Built for Hyena Industries
Powered by NVIDIA NIM (NVIDIA Inference Microservices)
Built on the FastAPI and Pydantic frameworks

Last Updated: April 26, 2026
Status: Work in Progress
