# Model Inference Guide
This document provides comprehensive guidance for using XPULink's model inference services.
## Overview
XPULink's inference platform provides high-performance, scalable model serving powered by the vLLM framework. Our platform supports various state-of-the-art language models and provides OpenAI-compatible APIs for easy integration.
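Because the API is OpenAI-compatible, you can also point the official `openai` Python client at the platform. A minimal sketch, assuming the base path is `https://www.xpulink.ai/v1` (the prefix of the chat completions URL used in the Quick Start below):

```python
import os
from openai import OpenAI

# Assumes the OpenAI-compatible base path https://www.xpulink.ai/v1
client = OpenAI(
    api_key=os.getenv("XPULINK_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

response = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```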
## Supported Models
### Large Language Models
- Qwen3-32B: Advanced Chinese-English bilingual model with excellent reasoning capabilities
- LLaMA Series: Various sizes available for different use cases
- ChatGLM: Chinese-optimized conversational models
- Baichuan: High-performance Chinese language models
### Embedding Models
- BGE-M3: State-of-the-art multilingual embedding model
- text-embedding-ada-002: OpenAI-compatible embedding model
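If the platform also exposes the OpenAI-compatible `/v1/embeddings` route, an embedding request would look like the sketch below. Both the route and the `bge-m3` model identifier are assumptions here; confirm the exact path and model name in the XPULink console.

```python
import os
import requests

# Hypothetical: assumes an OpenAI-compatible /v1/embeddings endpoint
# and "bge-m3" as the model identifier; verify both in the console.
resp = requests.post(
    "https://www.xpulink.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.getenv('XPULINK_API_KEY')}"},
    json={"model": "bge-m3", "input": ["XPULink model inference"]},
)
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # embedding dimensionality
```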
## Quick Start
### 1. Authentication
First, obtain your API key from the XPULink console and set it as an environment variable:
```bash
export XPULINK_API_KEY="your_api_key_here"
```
### 2. Basic Inference Request
Here's a simple example using Python:
```python
import requests
import os

# API configuration
API_KEY = os.getenv("XPULINK_API_KEY")
BASE_URL = "https://www.xpulink.ai/v1/chat/completions"

# Request headers
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Request payload
payload = {
    "model": "qwen3-32b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}

# Send request
response = requests.post(BASE_URL, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])
```
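For reference, a successful response follows the OpenAI chat-completion schema, which is what the final line above indexes into; the `usage` block is useful for tracking token consumption and cost:

```python
# Rough shape of `result` (OpenAI-compatible schema):
# {
#   "choices": [
#     {"index": 0,
#      "message": {"role": "assistant", "content": "..."},
#      "finish_reason": "stop"}
#   ],
#   "usage": {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
# }
print(result["usage"])  # token counts for cost and latency tracking
```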
## Advanced Features
### Streaming Response
Enable streaming for real-time token generation:
```python
import json

payload = {
    "model": "qwen3-32b",
    "messages": messages,
    "stream": True
}

# Each event line arrives as "data: {json}" in the OpenAI-compatible
# SSE format; the stream ends with "data: [DONE]".
response = requests.post(BASE_URL, headers=headers, json=payload, stream=True)
for raw in response.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8").removeprefix("data: ")
    if line == "[DONE]":
        break
    chunk = json.loads(line)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```
### Batch Processing
For processing multiple requests efficiently:
```python
batch_requests = [
    {"model": "qwen3-32b", "messages": [...], "max_tokens": 100},
    {"model": "qwen3-32b", "messages": [...], "max_tokens": 100},
    # More requests...
]

# Process batch sequentially (implementation depends on your needs)
results = []
for request in batch_requests:
    response = requests.post(BASE_URL, headers=headers, json=request)
    results.append(response.json())
```
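The sequential loop above is simple but slow for large batches. Below is a minimal client-side concurrency sketch using a thread pool, reusing `BASE_URL` and `headers` from the Quick Start; `max_workers=4` is an illustrative value, keep it below your account's rate limit:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def send(request):
    response = requests.post(BASE_URL, headers=headers, json=request, timeout=30)
    response.raise_for_status()
    return response.json()

# pool.map preserves order: results line up with batch_requests.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(send, batch_requests))
```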
## Performance Optimization
### 1. Use Appropriate Parameters
- temperature: Lower values (0.1-0.3) for deterministic outputs, higher (0.7-0.9) for creativity
- max_tokens: Set reasonable limits to control response length and cost
- top_p: Use nucleus sampling for better quality (typically 0.9-0.95)
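For example, two illustrative presets for the same model (starting points, not platform requirements):

```python
# Deterministic: extraction, classification, structured output
deterministic = {"model": "qwen3-32b", "temperature": 0.2, "top_p": 0.9, "max_tokens": 200}

# Creative: brainstorming, open-ended writing
creative = {"model": "qwen3-32b", "temperature": 0.8, "top_p": 0.95, "max_tokens": 500}
```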
### 2. Caching Strategies
Implement caching for frequently requested prompts to reduce latency and cost.
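A minimal in-process sketch, keyed on the canonicalized request payload and reusing `BASE_URL` and `headers` from the Quick Start (a production setup would more likely use Redis or similar with a TTL; caching pays off mainly with deterministic settings such as low temperature):

```python
import hashlib
import json
import requests

_cache = {}

def cached_completion(payload):
    # Identical payloads (model, messages, parameters) hit the cache.
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]
```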
### 3. Connection Pooling
Use persistent connections for multiple requests:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# status_forcelist makes HTTP 429/5xx responses retryable, not just
# connection errors; backoff_factor spaces the retries out.
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)

# Reuse the pooled session for all subsequent requests
response = session.post(BASE_URL, headers=headers, json=payload, timeout=30)
```
## Error Handling
Implement robust error handling:
```python
try:
    response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    # Index fields inside the try so the KeyError handler below is reachable
    content = result["choices"][0]["message"]["content"]
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except KeyError as e:
    print(f"Unexpected response format: {e}")
```
## Best Practices
- Rate Limiting: Respect API rate limits and implement backoff strategies (see the sketch after this list)
- Monitoring: Log requests and responses for debugging and optimization
- Security: Never expose API keys in client-side code
- Testing: Thoroughly test with various input types and edge cases
- Documentation: Keep your integration documentation updated
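A minimal backoff sketch for the rate-limiting point above, retrying on HTTP 429 with exponentially increasing waits. The attempt count and cap are illustrative, and a production version would also honor a `Retry-After` header if the server sends one; `BASE_URL`, `headers`, and `payload` are reused from the Quick Start:

```python
import time
import requests

def post_with_backoff(url, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    response.raise_for_status()  # surface the final 429

response = post_with_backoff(BASE_URL, headers=headers, json=payload, timeout=30)
```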
## Troubleshooting
### Common Issues and Solutions
| Issue | Possible Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key | Verify API key in environment variables |
| 429 Too Many Requests | Rate limit exceeded | Implement backoff and retry logic |
| 500 Internal Server Error | Server issue | Contact support if persistent |
| Timeout errors | Network or server load | Increase timeout, retry with backoff |
## Support
For additional help:
- Documentation: https://docs.xpulink.net
- Email: support@xpulink.ai
- Community Forum: https://community.xpulink.ai