
Dedicated GPU

Dedicated GPU deployments are for models that need pinned capacity, gated weights, custom quantization, or predictable throughput.

curl https://api.parel.cloud/v1/deployments \
  -H "Authorization: Bearer pk-dev-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-llama",
    "huggingface_id": "meta-llama/Llama-3-8B-Instruct",
    "gpu_tier": "NVIDIA A40",
    "quantization": "awq",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 50.0,
    "max_model_len": 8192
  }'
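The same create-deployment call can be sketched in Python with only the standard library. The endpoint, headers, and fields mirror the curl example above; the helper name is hypothetical.

```python
import json

API_BASE = "https://api.parel.cloud/v1"

def build_deployment_request(api_key, spec):
    """Assemble URL, headers, and JSON body for POST /v1/deployments."""
    return {
        "url": f"{API_BASE}/deployments",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(spec),
    }

req = build_deployment_request("pk-dev-YOUR_KEY", {
    "name": "my-llama",
    "huggingface_id": "meta-llama/Llama-3-8B-Instruct",
    "gpu_tier": "NVIDIA A40",
    "quantization": "awq",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 50.0,
    "max_model_len": 8192,
})
# Send with, e.g., requests:
#   requests.post(req["url"], headers=req["headers"], data=req["body"])
```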
| Endpoint | Description |
| --- | --- |
| GET /v1/gpu-tiers | Available GPU tiers with cached pricing |
| GET /v1/gpu-tiers/live | Fresh GPU pricing |
| GET /v1/deployment-templates | Recommended deployment templates |
| GET /v1/deployments/preview | Estimate compatibility, time, and hourly cost |
| POST /v1/hf/validate | Validate a Hugging Face model |
| POST /v1/deployments | Create a deployment |
| GET /v1/deployments | List deployments |
| GET /v1/deployments/{id} | Deployment detail |
| POST /v1/deployments/{id}/start | Start or wake |
| POST /v1/deployments/{id}/stop | Stop |
| DELETE /v1/deployments/{id} | Terminate |
| GET /v1/deployments/{id}/events | State transition history |
| GET /v1/deployments/{id}/metrics | Throughput and latency metrics |
| GET /v1/deployments/{id}/billing | Hourly burn |
| POST /v1/deployments/{id}/chat | Deployment-scoped chat |
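Several of the endpoints above combine naturally into a provisioning loop: start (or wake) a deployment, then poll the detail route until it comes up. A minimal sketch, assuming the detail response carries a "status" field with a "running" value (the exact state names are an assumption):

```python
import time

def wait_until_running(fetch_deployment, deployment_id,
                       timeout_s=600, poll_s=5.0):
    """Poll GET /v1/deployments/{id} until the deployment is up.

    fetch_deployment(deployment_id) should return the parsed JSON detail;
    the "running" status value is an assumption about the API's states.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        detail = fetch_deployment(deployment_id)
        if detail.get("status") == "running":
            return detail
        time.sleep(poll_s)
    raise TimeoutError(
        f"deployment {deployment_id} not running after {timeout_s}s")
```

Passing the fetch function in keeps the polling logic independent of the HTTP client you use.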
Once a deployment is running, call it through its byom- model alias:

# Assumes an OpenAI-compatible SDK client pointed at the parel.cloud base URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.parel.cloud/v1",
    api_key="pk-dev-YOUR_KEY",
)

response = client.chat.completions.create(
    model="byom-DEPLOYMENT_ID",
    messages=[{"role": "user", "content": "Hello"}],
)
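The deployment-scoped route (POST /v1/deployments/{id}/chat) can also be called without an SDK. A sketch using the standard library; the body shape is assumed to match the messages format above:

```python
import json
import urllib.request

def build_chat_request(api_key, deployment_id, messages,
                       base="https://api.parel.cloud/v1"):
    """Build a POST /v1/deployments/{id}/chat request (body shape assumed)."""
    return urllib.request.Request(
        f"{base}/deployments/{deployment_id}/chat",
        data=json.dumps({"messages": messages}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("pk-dev-YOUR_KEY", "DEPLOYMENT_ID",
                         [{"role": "user", "content": "Hello"}])
# Send with: urllib.request.urlopen(req), then json.loads(resp.read())
```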