Model Inference Guide

This document provides comprehensive guidance for using XPULink's model inference services.

Overview

XPULink's inference platform provides high-performance, scalable model serving powered by the vLLM framework. The platform supports a range of state-of-the-art language models and exposes OpenAI-compatible APIs for easy integration.

Supported Models

Large Language Models

  • Qwen3-32B: Advanced Chinese-English bilingual model with excellent reasoning capabilities
  • LLaMA Series: Various sizes available for different use cases
  • ChatGLM: Chinese-optimized conversational models
  • Baichuan: High-performance Chinese language models

Embedding Models

  • BGE-M3: State-of-the-art multilingual embedding model
  • text-embedding-ada-002: OpenAI-compatible embedding model
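
Both models are assumed to be served through an OpenAI-compatible /v1/embeddings route; the exact path and the model identifier ("bge-m3" below) are assumptions based on the compatibility claim above, so confirm them in the console. A minimal sketch:

import os
import requests

headers = {
    "Authorization": f"Bearer {os.getenv('XPULINK_API_KEY')}",
    "Content-Type": "application/json"
}

# /v1/embeddings and the model id are assumed from the OpenAI-compatible API
response = requests.post(
    "https://www.xpulink.ai/v1/embeddings",
    headers=headers,
    json={"model": "bge-m3", "input": ["XPULink model inference"]}
)
print(response.json()["data"][0]["embedding"][:8])  # first few dimensions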

Quick Start

1. Authentication

First, obtain your API key from the XPULink console and set it as an environment variable:

export XPULINK_API_KEY="your_api_key_here"

2. Basic Inference Request

Here's a simple example using Python:

import requests
import os

# API configuration; BASE_URL is the full chat-completions endpoint
API_KEY = os.getenv("XPULINK_API_KEY")
BASE_URL = "https://www.xpulink.ai/v1/chat/completions"

# Request headers
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Request payload
payload = {
    "model": "qwen3-32b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}

# Send request
response = requests.post(BASE_URL, headers=headers, json=payload)
result = response.json()
print(result["choices"][0]["message"]["content"])
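
A successful response follows the OpenAI chat-completion schema that the platform emulates; abridged, with illustrative values, it looks roughly like this:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "qwen3-32b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Quantum computing is..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 180, "total_tokens": 205}
}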

Advanced Features

Streaming Response

Enable streaming to receive tokens as they are generated. The response arrives as server-sent events in the OpenAI-compatible format: each data: line carries a JSON chunk, and a [DONE] sentinel closes the stream:

import json

payload = {
    "model": "qwen3-32b",
    "messages": messages,
    "stream": True
}

# Read the server-sent event stream; each "data:" line carries a JSON chunk
response = requests.post(BASE_URL, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
    if not line:
        continue
    decoded = line.decode("utf-8")
    if decoded.startswith("data: "):
        decoded = decoded[len("data: "):]
    if decoded == "[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(decoded)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
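
Because the API is OpenAI-compatible, the official openai Python SDK can also consume the stream. The base_url below is inferred from the chat-completions URL used above and should be confirmed in the console:

import os
from openai import OpenAI

# base_url is an assumption derived from the endpoint above
client = OpenAI(
    api_key=os.getenv("XPULINK_API_KEY"),
    base_url="https://www.xpulink.ai/v1"
)

stream = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)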

Batch Processing

To process multiple requests, the simplest approach is a sequential loop; a concurrent sketch follows below:

batch_requests = [
    {"model": "qwen3-32b", "messages": [...], "max_tokens": 100},
    {"model": "qwen3-32b", "messages": [...], "max_tokens": 100},
    # More requests...
]

# Process the batch sequentially: simple, but only one request in flight
results = []
for request in batch_requests:
    response = requests.post(BASE_URL, headers=headers, json=request)
    results.append(response.json())

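Sequential submission leaves throughput on the table. A minimal concurrent sketch using Python's standard-library thread pool, reusing BASE_URL and headers from the Quick Start (the pool size of 4 is illustrative; keep it below your account's rate limit):

from concurrent.futures import ThreadPoolExecutor

def send_request(payload):
    # One worker per in-flight request
    response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()

# max_workers is illustrative; size it against your rate limit
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(send_request, batch_requests))
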
Performance Optimization

1. Use Appropriate Parameters

  • temperature: Lower values (0.1-0.3) for more deterministic outputs, higher values (0.7-0.9) for creative ones
  • max_tokens: Set reasonable limits to control response length and cost
  • top_p: Use nucleus sampling for better quality (typically 0.9-0.95)
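
For example, a payload tuned for reproducible, factual output might combine these parameters (the values are illustrative starting points, not platform recommendations):

payload = {
    "model": "qwen3-32b",
    "messages": messages,
    "temperature": 0.2,   # low temperature for near-deterministic output
    "top_p": 0.9,         # nucleus sampling cutoff
    "max_tokens": 256     # cap length to bound latency and cost
}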

2. Caching Strategies

Implement caching for frequently repeated prompts to reduce latency and cost; this is most useful with deterministic settings (low temperature), where repeat calls would produce similar output anyway.
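
A minimal in-memory sketch, keyed on the exact request payload and reusing BASE_URL and headers from the Quick Start (no expiry or size bound shown; a production setup would typically use Redis or similar):

import hashlib
import json

_cache = {}

def cached_completion(payload):
    # Key on the serialized payload: any change in messages or parameters is a miss
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]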

3. Connection Pooling

Use persistent connections for multiple requests:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # modern import path; requests.packages is deprecated

# Reuse TCP connections across requests and retry failed connection attempts
session = requests.Session()
retry = Retry(total=3, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retry)
session.mount('https://', adapter)

# Use the session in place of the module-level requests functions
response = session.post(BASE_URL, headers=headers, json=payload)

Error Handling

Implement robust error handling:

try:
    response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    print(result["choices"][0]["message"]["content"])
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except (KeyError, IndexError) as e:
    # Raised when the response body does not match the expected schema
    print(f"Unexpected response format: {e}")

Best Practices

  1. Rate Limiting: Respect API rate limits and implement backoff strategies (see the sketch after this list)
  2. Monitoring: Log requests and responses for debugging and optimization
  3. Security: Never expose API keys in client-side code
  4. Testing: Thoroughly test with various input types and edge cases
  5. Documentation: Keep your integration documentation updated
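
A minimal exponential-backoff sketch for item 1, reusing BASE_URL and headers from the Quick Start (the retry count and base delay are illustrative):

import time

def post_with_backoff(payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(BASE_URL, headers=headers, json=payload, timeout=30)
        if response.status_code != 429:
            return response
        # Wait 1s, 2s, 4s, ... before retrying a rate-limited request
        time.sleep(2 ** attempt)
    raise RuntimeError("Rate limit persisted after retries")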

Troubleshooting

Common Issues and Solutions

Issue                       Possible Cause            Solution
401 Unauthorized            Invalid API key           Verify the API key in your environment variables
429 Too Many Requests       Rate limit exceeded       Implement backoff and retry logic
500 Internal Server Error   Server issue              Contact support if the problem persists
Timeout errors              Network or server load    Increase the timeout and retry with backoff

Support

For additional help:

  • Documentation: https://docs.xpulink.net
  • Email: support@xpulink.ai
  • Community Forum: https://community.xpulink.ai