Evaluating XPULink Models with OpenBench

This guide shows you how to use the OpenBench framework to evaluate and test AI models hosted on the XPULink platform. OpenBench is a model evaluation tool that supports multiple standard benchmarks and evaluation metrics.

What is OpenBench?

OpenBench is an open-source AI model evaluation framework that supports:

  • Multiple standard evaluation benchmarks (MMLU, GSM8K, HellaSwag, etc.)
  • Custom evaluation tasks
  • An OpenAI-compatible API interface
  • Detailed performance reports and analysis

Requirements

  • Python 3.8+
  • XPULink API Key (obtain from www.xpulink.ai)
  • OpenBench framework

Installation Steps

1. Install OpenBench

# Install OpenBench using pip
pip install openbench

# Or install from source
git clone https://github.com/OpenBMB/OpenBench.git
cd OpenBench
pip install -e .

2. Configure Environment Variables

Create a .env file or set the following environment variables in your system:

# XPULink API Key
export XPU_API_KEY=your_api_key_here

# XPULink API Base URL
export OPENAI_API_BASE=https://www.xpulink.ai/v1

Or create a .env file in your project directory:

XPU_API_KEY=your_api_key_here
OPENAI_API_BASE=https://www.xpulink.ai/v1

Basic Configuration Example

Create a configuration file xpulink_config.yaml:

# XPULink model configuration
model:
  type: openai  # Use OpenAI compatible interface
  name: qwen3-32b  # Model name on XPULink
  api_key: ${XPU_API_KEY}  # Read from environment variable
  base_url: https://www.xpulink.ai/v1  # XPULink API base URL

# Evaluation configuration
evaluation:
  benchmarks:
    - mmlu  # Multi-task language understanding
    - gsm8k  # Mathematical reasoning
    - hellaswag  # Common sense reasoning

  # Generation parameters
  generation:
    temperature: 0.0  # Deterministic output
    max_tokens: 2048
    top_p: 1.0

Python Code Example

import os
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Test connection
def test_xpulink_model():
    """Test whether the XPULink model is accessible"""
    try:
        response = client.chat.completions.create(
            model="qwen3-32b",
            messages=[
                {"role": "user", "content": "Please explain artificial intelligence in one sentence."}
            ],
            max_tokens=100,
            temperature=0.7,
        )
        print("Model responded successfully!")
        print("Response content:", response.choices[0].message.content)
        return True
    except Exception as e:
        print(f"Connection failed: {e}")
        return False

if __name__ == "__main__":
    test_xpulink_model()

Running Evaluation

Method 1: Using Command Line

# Run single benchmark test
openbench evaluate \
  --model-type openai \
  --model-name qwen3-32b \
  --api-key $XPU_API_KEY \
  --base-url https://www.xpulink.ai/v1 \
  --benchmark mmlu

# Run multiple benchmark tests
openbench evaluate \
  --config xpulink_config.yaml \
  --output results/xpulink_evaluation.json

Method 2: Using Python Script

Create run_evaluation.py:

import os
from dotenv import load_dotenv
from openbench import Evaluator

# Load environment variables
load_dotenv()

# Configure evaluator
evaluator = Evaluator(
    model_type="openai",
    model_name="qwen3-32b",
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1"
)

# Run evaluation
results = evaluator.run_benchmarks([
    "mmlu",      # Multi-task language understanding
    "gsm8k",     # Mathematical reasoning
    "hellaswag"  # Common sense reasoning
])

# Save results
evaluator.save_results(results, "results/xpulink_evaluation.json")

# Print summary
print("\nEvaluation Results Summary:")
for benchmark, scores in results.items():
    print(f"{benchmark}: {scores['accuracy']:.2%}")

Run the script:

python run_evaluation.py

Advanced Configuration

Custom Evaluation Tasks

Create custom_evaluation.py:

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

def evaluate_custom_task(questions, model="qwen3-32b"):
    """
    Custom evaluation task

    Args:
        questions: List of questions
        model: Model name

    Returns:
        Evaluation results
    """
    results = []

    for q in questions:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a professional AI assistant."},
                {"role": "user", "content": q["question"]}
            ],
            max_tokens=512,
            temperature=0.0,
        )

        answer = response.choices[0].message.content
        results.append({
            "question": q["question"],
            "model_answer": answer,
            "expected_answer": q.get("expected_answer"),
            # Exact string matching is strict: free-form answers rarely match
            # verbatim, so consider a keyword- or similarity-based check in practice.
            "correct": answer.strip() == q.get("expected_answer", "").strip()
        })

    return results

# Sample question set
questions = [
    {
        "question": "What is machine learning?",
        "expected_answer": "Machine learning is a branch of artificial intelligence that allows computers to automatically improve performance through data and experience."
    },
    {
        "question": "What type of programming language is Python?",
        "expected_answer": "Python is a high-level, interpreted, object-oriented programming language."
    }
]

# Run evaluation
results = evaluate_custom_task(questions)

# Calculate accuracy
accuracy = sum(1 for r in results if r["correct"]) / len(results)
print(f"Accuracy: {accuracy:.2%}")

Batch Testing Multiple Models

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configure an OpenAI-compatible client pointed at XPULink
client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# Define models to test
models_to_test = [
    "qwen3-32b",
    "qwen3-14b",
    "llama3-70b"
]

def benchmark_models(models, test_prompt):
    """Compare the performance of multiple models on the same prompt"""
    results = {}

    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
                max_tokens=100,
                temperature=0.0,
            )
            results[model] = {
                "success": True,
                "response": response.choices[0].message.content,
                "tokens": response.usage.total_tokens
            }
        except Exception as e:
            results[model] = {
                "success": False,
                "error": str(e)
            }

    return results

# Run comparison test
test_prompt = "Please explain what deep learning is, in no more than 50 words."
comparison = benchmark_models(models_to_test, test_prompt)

# Output results
for model, result in comparison.items():
    print(f"\nModel: {model}")
    if result["success"]:
        print(f"Response: {result['response']}")
        print(f"Token Usage: {result['tokens']}")
    else:
        print(f"Error: {result['error']}")

Supported Evaluation Benchmarks

OpenBench supports the following standard evaluation benchmarks for testing XPULink models:

Benchmark  | Description                              | Evaluates
-----------|------------------------------------------|------------------------------------------------
MMLU       | Massive Multitask Language Understanding | Knowledge breadth, domain expertise
GSM8K      | Grade School Math                        | Mathematical reasoning, problem-solving
HellaSwag  | Common sense reasoning                   | Common sense understanding, context completion
TruthfulQA | Truthful question answering              | Factual accuracy, honesty
HumanEval  | Code generation                          | Programming capability, code understanding
MBPP       | Python programming benchmark             | Basic programming skills
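
If you only want to exercise a subset of these benchmarks, you can pass just those names to the same Evaluator interface used in run_evaluation.py above. This is a minimal sketch that mirrors the earlier example; the benchmark identifiers (humaneval, mbpp) are assumed to match your OpenBench version's naming.

import os
from dotenv import load_dotenv
from openbench import Evaluator

load_dotenv()

# Reuse the Evaluator configuration from run_evaluation.py
evaluator = Evaluator(
    model_type="openai",
    model_name="qwen3-32b",
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1"
)

# Run only the code-oriented benchmarks from the table above
results = evaluator.run_benchmarks(["humaneval", "mbpp"])
evaluator.save_results(results, "results/xpulink_code_benchmarks.json")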

Result Analysis

After evaluation completes, OpenBench generates detailed reports including:

  • Accuracy: Proportion of correctly answered questions
  • F1 Score: Harmonic mean of precision and recall
  • Inference Time: Average response time per task
  • Token Usage: Token consumption of API calls
  • Cost Estimation: Cost estimates based on token usage

View results example:

import json

# Load evaluation results
with open("results/xpulink_evaluation.json", "r") as f:
    results = json.load(f)

# Print detailed results
print(json.dumps(results, indent=2, ensure_ascii=False))
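
To turn the raw JSON into a quick per-benchmark summary, you can iterate over it the same way run_evaluation.py does. This is a minimal sketch; the key names (accuracy, total_tokens) are assumptions based on the metrics listed above and may need adjusting to match the report format your OpenBench version writes.

import json

# Load the saved evaluation results
with open("results/xpulink_evaluation.json", "r") as f:
    results = json.load(f)

# Print a compact one-line summary per benchmark; the field names are
# assumptions and may differ in your results file.
for benchmark, scores in results.items():
    accuracy = scores.get("accuracy")
    tokens = scores.get("total_tokens")
    summary = f"accuracy={accuracy:.2%}" if accuracy is not None else "accuracy=n/a"
    if tokens is not None:
        summary += f", tokens={tokens}"
    print(f"{benchmark}: {summary}")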

FAQ

Q: How to obtain XPU_API_KEY?

A: Visit www.xpulink.ai to register an account, then create and obtain your API Key from the API key management page in the console.

Q: What if API timeout occurs during evaluation?

A: You can increase the request timeout when constructing the client:

client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
    timeout=60,  # request timeout in seconds
)

Q: How to view detailed evaluation logs?

A: Enable verbose logging mode:

openbench evaluate --config config.yaml --verbose --log-file evaluation.log

Q: Which models can be evaluated?

A: OpenBench supports all models on the XPULink platform that are compatible with the OpenAI API format. Common ones include:

  • qwen3-32b
  • qwen3-14b
  • llama3-70b
  • deepseek-chat

Please visit the XPULink official documentation for the complete model list.
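
Because the platform exposes an OpenAI-compatible interface, you can also query the available models programmatically. This is a minimal sketch, assuming XPULink implements the standard /v1/models endpoint:

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    api_key=os.getenv("XPU_API_KEY"),
    base_url="https://www.xpulink.ai/v1",
)

# List the model IDs available to your account; this only works if the
# platform exposes the /v1/models endpoint.
for model in client.models.list():
    print(model.id)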

Q: How to control evaluation costs?

A: We recommend the following measures:

  1. Test on a small dataset first
  2. Use max_tokens to limit generation length
  3. Set temperature=0.0 for deterministic output, avoiding the need to repeat tests
  4. Use a caching mechanism to avoid duplicate calls (a minimal caching sketch follows below)
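
Below is a minimal sketch of the caching idea from point 4: responses are keyed by model and prompt and stored in a local JSON file, so re-running the same test set does not trigger new API calls. The cache path and helper function are illustrative, not part of OpenBench.

import hashlib
import json
import os

CACHE_PATH = "results/response_cache.json"  # illustrative location

def cached_chat(client, model, prompt, **kwargs):
    """Return a cached response if this (model, prompt) pair was already queried.

    `client` is an OpenAI(...) client configured for XPULink, as in the earlier examples.
    """
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "r") as f:
            cache = json.load(f)

    # Key the cache on model name + prompt text
    key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    cache[key] = response.choices[0].message.content

    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)
    return cache[key]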

Best Practices

  1. API Key Security:
     • Always use environment variables to store API keys
     • Add .env files to .gitignore
     • Don't hardcode keys in code

  2. Evaluation Strategy:
     • Start with small-scale datasets
     • Gradually increase evaluation task complexity
     • Save intermediate results regularly

  3. Error Handling:
     • Implement a retry mechanism to handle network fluctuations (see the sketch after this list)
     • Log failed test cases
     • Monitor API quota usage

  4. Result Comparison:
     • Save historical evaluation results for comparison
     • Use the same random seed to ensure reproducibility
     • Record the model version and configuration used for each evaluation
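
The retry point under Error Handling can be as simple as a small wrapper with exponential backoff around the chat call. This is a minimal sketch using only the standard library; the retry count and delay values are arbitrary defaults, not OpenBench settings.

import time

def chat_with_retry(client, model, messages, max_retries=3, base_delay=2.0, **kwargs):
    """Call the chat endpoint, retrying with exponential backoff on transient errors.

    `client` is an OpenAI(...) client configured for XPULink, as in the earlier examples.
    """
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs,
            )
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            wait = base_delay * (2 ** attempt)
            print(f"Request failed ({e}); retrying in {wait:.0f}s...")
            time.sleep(wait)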

Example Project Structure

Evaluation/
├── README.md                    # This document
├── config/
│   ├── xpulink_config.yaml     # XPULink configuration
│   └── benchmarks.yaml          # Benchmark test configuration
├── scripts/
│   ├── test_connection.py       # Test connection script
│   ├── run_evaluation.py        # Run evaluation script
│   └── custom_evaluation.py     # Custom evaluation
└── results/
    └── xpulink_evaluation.json  # Evaluation results

Technical Support

For questions or suggestions, please:

  1. Visit the XPULink Official Website
  2. Check the OpenBench project's Issue page
  3. Submit an Issue in this project


Note: Use your API quota sensibly to avoid unnecessary costs. It is recommended to estimate costs before running large-scale evaluations.