# Dedicated GPU
Dedicated GPU deployments are for models that need pinned capacity, gated weights, custom quantization, or predictable throughput.
## Create a deployment

```bash
curl https://api.parel.cloud/v1/deployments \
  -H "Authorization: Bearer pk-dev-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-llama",
    "huggingface_id": "meta-llama/Llama-3-8B-Instruct",
    "gpu_tier": "NVIDIA A40",
    "quantization": "awq",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 50.0,
    "max_model_len": 8192
  }'
```

## Manage deployments
| Endpoint | Description |
|---|---|
| `GET /v1/gpu-tiers` | Available GPU tiers with cached pricing |
| `GET /v1/gpu-tiers/live` | Fresh GPU pricing |
| `GET /v1/deployment-templates` | Recommended deployment templates |
| `GET /v1/deployments/preview` | Estimate compatibility, time, and hourly cost |
| `POST /v1/hf/validate` | Validate a Hugging Face model |
| `POST /v1/deployments` | Create a deployment |
| `GET /v1/deployments` | List deployments |
| `GET /v1/deployments/{id}` | Deployment detail |
| `POST /v1/deployments/{id}/start` | Start or wake |
| `POST /v1/deployments/{id}/stop` | Stop |
| `DELETE /v1/deployments/{id}` | Terminate |
| `GET /v1/deployments/{id}/events` | State transition history |
| `GET /v1/deployments/{id}/metrics` | Throughput and latency metrics |
| `GET /v1/deployments/{id}/billing` | Hourly burn |
| `POST /v1/deployments/{id}/chat` | Deployment-scoped chat |
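A start call returns before the deployment is actually serving, so a client typically polls the detail endpoint until the deployment is ready. A minimal polling sketch, assuming the deployment detail response exposes a `status` field with values like `running` and `failed` (these field names and values are illustrative, not confirmed by this page):

```python
import time

def wait_until_running(get_status, timeout_s=600, interval_s=5):
    """Poll a status-returning callable until the deployment reports 'running'.

    get_status: zero-argument callable returning the deployment's current
    status string, e.g. one that calls GET /v1/deployments/{id} and reads
    a "status" field (that response shape is an assumption, not documented).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "running":
            return True
        if status == "failed":
            raise RuntimeError("deployment failed to start")
        time.sleep(interval_s)
    raise TimeoutError(f"deployment not running after {timeout_s}s")
```

The same helper works for the initial cold start after `POST /v1/deployments` and for waking a stopped deployment via the `/start` endpoint.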
## Use a running deployment

```python
response = client.chat.completions.create(
    model="byom-DEPLOYMENT_ID",
    messages=[{"role": "user", "content": "Hello"}],
)
```
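Because a running deployment bills hourly against the `budget_limit_usd` set at creation, it can be useful to estimate how long it can keep running from the hourly burn reported by the `/billing` endpoint. A small arithmetic sketch; the spend and burn inputs here are assumptions about what you would read from that response, not a documented shape:

```python
def hours_remaining(budget_limit_usd, spent_usd, hourly_burn_usd):
    """Estimate billable hours left before the budget cap is hit.

    Returns 0.0 if the budget is already exhausted.
    """
    if hourly_burn_usd <= 0:
        raise ValueError("hourly burn must be positive")
    remaining = budget_limit_usd - spent_usd
    return max(remaining / hourly_burn_usd, 0.0)

# With the create example's budget_limit_usd of 50.0, $12.50 spent,
# and a $1.25/hour burn: hours_remaining(50.0, 12.5, 1.25) → 30.0
```

Pairing a check like this with `idle_timeout_minutes` keeps a forgotten deployment from silently consuming the remaining budget.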