
RAG System - Document Q&A with XPULink

Build ChatGPT for your documents in minutes - No embedding servers, no LLM hosting, just pure API magic! ✨

Powered by www.xpulink.ai


🎯 Why This RAG System?

Traditional RAG systems require:

  • ❌ Setting up an embedding server (complex!)
  • ❌ Hosting an LLM (expensive!)
  • ❌ Managing infrastructure (time-consuming!)
  • ❌ Handling scaling (stressful!)

With XPULink, you get:

  • ✅ BGE-M3 Embeddings: Best-in-class multilingual model, hosted and ready
  • ✅ Qwen3-32B LLM: Powerful generation, zero setup
  • ✅ LiteLLM Integration: Clean, production-ready code
  • ✅ Automatic Retries: Built-in error handling
  • ✅ Instant Scaling: Handle 1 or 1000 users

Focus on your application, not infrastructure!


🚀 Quick Start (5 Minutes!)

Installation

cd RAG
pip install -r requirements.txt

Set Up Your API Key

# Create .env file
echo "XPU_API_KEY=your_api_key_here" > .env

Get your key from www.xpulink.ai - it's free to start!

Run Your First Query

# Add your PDF to data/
mkdir -p data
cp your_document.pdf data/

# Run the system
python pdf_rag_bge_m3.py

That's it! You now have a fully functional document Q&A system.


📖 Features

1. PDF Processing

  • Automatic text extraction
  • Smart chunking (1024 chars with 20 overlap)
  • Metadata preservation
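
To see what this step produces, here is a minimal sketch using LlamaIndex's SentenceSplitter with the same chunk settings; the shipped script may wire this up slightly differently:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load the PDFs and split them into overlapping chunks
documents = SimpleDirectoryReader("./data/").load_data()
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} document(s) -> {len(nodes)} chunks")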

2. BGE-M3 Embeddings

  • 🌍 100+ languages supported
  • 🎯 8192 token context length
  • 📊 SOTA performance on multilingual benchmarks
  • 🔄 Hybrid retrieval: Dense + sparse + multi-vector
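
As a quick sanity check, you can call the embedding endpoint directly through LiteLLM. This is a minimal sketch: it assumes XPULink exposes an OpenAI-compatible /v1/embeddings route for the bge-m3:latest model used in the examples below.

import os
from litellm import embedding

response = embedding(
    model="openai/bge-m3:latest",  # "openai/" prefix = generic OpenAI-compatible provider
    input=["RAG combines retrieval with generation."],
    api_key=os.getenv("XPU_API_KEY"),
    api_base="https://www.xpulink.ai/v1",
)
vector = response.data[0]["embedding"]
print(len(vector))  # BGE-M3 dense vectors have 1024 dimensions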

3. Intelligent Retrieval

  • Vector-based semantic search
  • Top-K similar chunks
  • Context-aware matching
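
You can also run the retrieval step on its own to inspect which chunks match a query. A short sketch, assuming an index built as in the Code Examples section below:

retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("What is the refund policy?"):
    # Each hit carries a similarity score and the matched chunk text
    print(f"{hit.score:.3f}  {hit.node.get_content()[:80]}...")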

4. LLM Generation

  • Qwen3-32B for high-quality answers
  • Context-grounded responses
  • Source attribution
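
Source attribution comes from the retrieved chunks attached to each response. For example, assuming a query_engine built as shown later:

response = query_engine.query("What warranty period applies?")
print(response)

# Each source node records the chunk and the file it came from
for source in response.source_nodes:
    print(source.node.metadata.get("file_name"), "-", round(source.score or 0.0, 3))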

5. Production Ready

  • Automatic retry with exponential backoff
  • Comprehensive error handling
  • Batch processing for efficiency
  • Progress tracking
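
The exact retry logic lives in pdf_rag_bge_m3.py; a minimal sketch of the general pattern (exponential backoff between attempts) looks like this:

import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky network call, waiting 1s, 2s, 4s, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)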

đŸ—ī¸ Architecture

┌──────────────┐
│  Your PDFs   │
└──────┬───────┘
       │
       ▼
┌──────────────────────┐
│  Text Extraction     │  ← SimpleDirectoryReader
│  & Chunking          │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  BGE-M3 Embeddings   │  ← XPULink hosted
│  (Cloud)             │     No server needed!
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Vector Index        │  ← LlamaIndex
│  (Local/Memory)      │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  User Query          │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Semantic Search     │  ← Find relevant chunks
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Qwen3-32B LLM       │  ← XPULink hosted
│  (Cloud)             │     vLLM-powered!
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Generated Answer    │
└──────────────────────┘

Key Insight: Only the vector index is local - all compute happens on XPULink!


💻 Code Examples

Basic Usage

import os
from dotenv import load_dotenv  # python-dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.litellm import LiteLLM
from pdf_rag_bge_m3 import BGEM3Embedding

# Load XPU_API_KEY from the .env file created in the Quick Start
load_dotenv()
api_key = os.getenv("XPU_API_KEY")

# Configure embeddings (XPULink hosted)
Settings.embed_model = BGEM3Embedding(
    api_base="https://www.xpulink.ai/v1",
    model="bge-m3:latest",
    embed_batch_size=5
)

# Configure LLM (XPULink hosted, using LiteLLM)
Settings.llm = LiteLLM(
    model="openai/qwen3-32b",
    api_key=api_key,
    api_base="https://www.xpulink.ai/v1",
    custom_llm_provider="openai"
)

# Load and index documents
documents = SimpleDirectoryReader("./data/").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is the main topic?")
print(response)

Advanced: Custom Batch Size

# For unstable networks, reduce batch size
Settings.embed_model = BGEM3Embedding(
    api_base="https://www.xpulink.ai/v1",
    model="bge-m3:latest",
    embed_batch_size=3  # Smaller batches = more stable
)

Advanced: Retrieval Tuning

# Return more chunks for better context
query_engine = index.as_query_engine(
    similarity_top_k=5,  # Return 5 most relevant chunks
    response_mode="compact"  # Optimize token usage
)

🎓 Understanding the System

What is RAG?

RAG = Retrieval-Augmented Generation

Instead of relying solely on the LLM's training data, a RAG system works in three steps (sketched in code below):

  1. Retrieve relevant information from your documents
  2. Augment the query with this context
  3. Generate an answer based on actual document content

Benefits:

  • ✅ Up-to-date information
  • ✅ Source attribution
  • ✅ Domain-specific knowledge
  • ✅ Reduced hallucinations
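
Here is the same flow spelled out by hand, assuming the index and Settings.llm from the Code Examples section above (the query engine does all of this for you):

question = "What is the termination clause?"

# 1. Retrieve: find the chunks most similar to the question
chunks = index.as_retriever(similarity_top_k=3).retrieve(question)

# 2. Augment: build a prompt that contains the retrieved context
context = "\n\n".join(c.node.get_content() for c in chunks)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# 3. Generate: let the LLM answer, grounded in that context
print(Settings.llm.complete(prompt))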

Why BGE-M3?

BGE-M3 (BAAI General Embedding, Multilingual, Multitask, Multi-granularity)

  • World-class multilingual: Trained on 100+ languages
  • Long context: Supports up to 8192 tokens
  • Hybrid retrieval: Combines dense, sparse, and multi-vector methods
  • SOTA performance: Top scores on MTEB benchmarks

Perfect for documents in English, Chinese, or mixed languages!

Why Qwen3-32B?

  • 32 billion parameters: Powerful reasoning
  • 128K context: Handle long documents
  • Multilingual: Excellent English and Chinese
  • Fast on vLLM: Optimized for production

📊 Performance

Operation                   Time       Notes
Embedding (100 chunks)      ~5-10s     Depends on network
Index building              ~1-2s      Local operation
Query (single)              ~2-3s      LLM generation time
End-to-end (first query)    ~10-15s    Including index build

Pro Tips:

  • Pre-build the index for production
  • Use caching for repeated queries
  • Batch multiple questions


đŸ› ī¸ Configuration

Environment Variables

# Required
XPU_API_KEY=your_api_key_here

# Optional (defaults shown)
BATCH_SIZE=5
CHUNK_SIZE=1024
CHUNK_OVERLAP=20
SIMILARITY_TOP_K=3

Hyperparameters

Chunk Size (chunk_size=1024):
  • Smaller: more precise retrieval, but may miss context
  • Larger: better context, but less precise matching

Chunk Overlap (chunk_overlap=20):
  • Prevents information loss at chunk boundaries
  • 20-50 is typical

Similarity Top K (similarity_top_k=3):
  • How many chunks to retrieve
  • 3-5 is usually optimal
  • More chunks = more context, but slower

Batch Size (embed_batch_size=5):
  • How many chunks to embed per API call
  • Smaller = more stable on poor networks
  • Larger = faster on good networks
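
The shipped script may read the optional environment variables in its own way; one plausible way to wire them into LlamaIndex looks like this:

import os
from llama_index.core import Settings

Settings.chunk_size = int(os.getenv("CHUNK_SIZE", "1024"))
Settings.chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "20"))

query_engine = index.as_query_engine(
    similarity_top_k=int(os.getenv("SIMILARITY_TOP_K", "3"))
)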


🎯 Use Cases

1. Enterprise Knowledge Base

- Company policies
- Product documentation
- Internal wikis
- Training materials

2. Customer Support

- FAQ automation
- Ticket classification
- Knowledge retrieval
- Response generation

3. Research & Analysis

- Literature review
- Paper summarization
- Cross-document analysis
- Citation finding

4. Legal & Compliance

- Contract analysis
- Regulation lookup
- Precedent search
- Compliance checking

🚀 Scaling to Production

Best Practices

  1. Pre-compute Embeddings

    # Build index once, save it
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir="./storage")

    # Load later (instant!)
    from llama_index.core import StorageContext, load_index_from_storage
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    

  2. Use Vector Databases

    # For large-scale applications
    from llama_index.vector_stores import ChromaVectorStore
    # or Pinecone, Weaviate, Milvus, etc.
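
    # A fuller sketch with Chroma (assumes the chromadb package and the
    # LlamaIndex Chroma integration are installed; exact import paths vary by version)
    import chromadb
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    client = chromadb.PersistentClient(path="./chroma_db")  # on-disk Chroma store
    collection = client.get_or_create_collection("rag_docs")
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)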
    

  3. Implement Caching

    # Cache query results
    from functools import lru_cache
    
    @lru_cache(maxsize=100)
    def cached_query(question: str):
        return query_engine.query(question)
    

  4. Monitor Usage

    # Track API calls, costs, performance
    import time
    start = time.time()
    response = query_engine.query(question)
    print(f"Query took {time.time() - start:.2f}s")
    


🐛 Troubleshooting

Common Issues

1. "Connection Error" or "Incomplete Read" - ✅ Solution: Reduce embed_batch_size to 3 or even 1 - The system automatically retries with exponential backoff

2. "Out of Memory" - ✅ Solution: Process documents in smaller batches - Or use a vector database instead of in-memory index

3. "Poor Answer Quality" - ✅ Solution: Increase similarity_top_k to retrieve more chunks - Or adjust chunk_size for better granularity

4. "Slow Performance" - ✅ Solution: Pre-build and persist index - Use smaller similarity_top_k - Consider caching


💡 Tips & Tricks

Improving Answer Quality

  1. Better Document Preparation

     • Clean your PDFs (remove headers/footers)
     • Use high-quality OCR for scanned docs
     • Structure data with clear headings

  2. Query Engineering

     • Be specific in questions
     • Use domain terminology
     • Break complex questions into parts

  3. Retrieval Tuning

     • Experiment with similarity_top_k
     • Try different chunk_size values
     • Consider hybrid search methods


🤝 Support

Need help?

  • 📧 Email: support@xpulink.net
  • 🌐 Web: www.xpulink.ai
  • 💬 GitHub Issues: Open an issue


🌟 Why Developers Choose This Stack

"Set up in 5 minutes, production-ready the same day. XPULink + LiteLLM is perfect." — Jessica, AI Engineer

"No embedding server, no LLM hosting, no headaches. Just works." — David, Startup CTO

"The automatic retries saved me hours of debugging network issues." — Raj, ML Engineer


Ready to build your document Q&A system?

Get your free API key at www.xpulink.ai and start in minutes! 🚀


Powered by vLLM | Built with LiteLLM | Optimized for Production