# Dedicated GPU
Dedicated GPU deployments are for models that need pinned capacity, gated weights, custom quantization, or predictable throughput.
## Create a deployment

```bash
curl https://api.parel.cloud/v1/deployments \
  -H "Authorization: Bearer pk-dev-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-llama",
    "huggingface_id": "meta-llama/Llama-3-8B-Instruct",
    "gpu_tier": "NVIDIA A40",
    "quantization": "awq",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 50.0,
    "max_model_len": 8192
  }'
```

## Manage deployments
| Endpoint | Description |
|---|---|
| `GET /v1/gpu-tiers` | Available GPU tiers with cached pricing |
| `GET /v1/gpu-tiers/live` | Fresh GPU pricing |
| `GET /v1/deployment-templates` | Recommended deployment templates |
| `GET /v1/deployments/preview` | Estimate compatibility, time, and hourly cost |
| `POST /v1/hf/validate` | Validate a Hugging Face model |
| `POST /v1/deployments` | Create a deployment |
| `GET /v1/deployments` | List deployments |
| `GET /v1/deployments/{id}` | Deployment detail |
| `POST /v1/deployments/{id}/start` | Start or wake |
| `POST /v1/deployments/{id}/stop` | Stop |
| `DELETE /v1/deployments/{id}` | Terminate |
| `GET /v1/deployments/{id}/events` | State transition history |
| `GET /v1/deployments/{id}/metrics` | Throughput and latency metrics |
| `GET /v1/deployments/{id}/billing` | Hourly burn |
| `POST /v1/deployments/{id}/chat` | Deployment-scoped chat |
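A start call returns before the deployment is actually serving, so a client typically polls the detail endpoint until the deployment is ready. A minimal polling sketch, assuming the deployment detail response exposes a `status` field with values like `running` and `failed` (these field names and values are illustrative, not confirmed by this page):

```python
import time

def wait_until_running(get_status, timeout_s=600, interval_s=5):
    """Poll a status-returning callable until the deployment reports 'running'.

    get_status: zero-argument callable returning the deployment's current
    status string, e.g. one that calls GET /v1/deployments/{id} and reads
    a "status" field (that response shape is an assumption, not documented).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "running":
            return True
        if status == "failed":
            raise RuntimeError("deployment failed to start")
        time.sleep(interval_s)
    raise TimeoutError(f"deployment not running after {timeout_s}s")
```

The same helper works for the initial cold start after `POST /v1/deployments` and for waking a stopped deployment via the `/start` endpoint.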
## Use a running deployment

```python
response = client.chat.completions.create(
    model="byom-DEPLOYMENT_ID",
    messages=[{"role": "user", "content": "Hello"}],
)
```
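Because a running deployment bills hourly against the `budget_limit_usd` set at creation, it can be useful to estimate how long it can keep running from the hourly burn reported by the `/billing` endpoint. A small arithmetic sketch; the spend and burn inputs here are assumptions about what you would read from that response, not a documented shape:

```python
def hours_remaining(budget_limit_usd, spent_usd, hourly_burn_usd):
    """Estimate billable hours left before the budget cap is hit.

    Returns 0.0 if the budget is already exhausted.
    """
    if hourly_burn_usd <= 0:
        raise ValueError("hourly burn must be positive")
    remaining = budget_limit_usd - spent_usd
    return max(remaining / hourly_burn_usd, 0.0)

# With the create example's budget_limit_usd of 50.0, $12.50 spent,
# and a $1.25/hour burn: hours_remaining(50.0, 12.5, 1.25) → 30.0
```

Pairing a check like this with `idle_timeout_minutes` keeps a forgotten deployment from silently consuming the remaining budget.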