
RAG System - Document Q&A with XPULink

Build ChatGPT for your documents in minutes - No embedding servers, no LLM hosting, just pure API magic! ✨

Powered by www.xpulink.ai


🎯 Why This RAG System?

Traditional RAG systems require:

  • ❌ Setting up an embedding server (complex!)
  • ❌ Hosting an LLM (expensive!)
  • ❌ Managing infrastructure (time-consuming!)
  • ❌ Handling scaling (stressful!)

With XPULink, you get:

  • ✅ BGE-M3 Embeddings: Best-in-class multilingual model, hosted and ready
  • ✅ Qwen3-32B LLM: Powerful generation, zero setup
  • ✅ LiteLLM Integration: Clean, production-ready code
  • ✅ Automatic Retries: Built-in error handling
  • ✅ Instant Scaling: Handle 1 or 1000 users

Focus on your application, not infrastructure!


🚀 Quick Start (5 Minutes!)

Installation

cd RAG
pip install -r requirements.txt

Set Up Your API Key

# Create .env file
echo "XPU_API_KEY=your_api_key_here" > .env

Get your key from www.xpulink.ai - it's free to start!

Run Your First Query

# Add your PDF to data/
mkdir -p data
cp your_document.pdf data/

# Run the system
python pdf_rag_bge_m3.py

That's it! You now have a fully functional document Q&A system.


📖 Features

1. PDF Processing

  • Automatic text extraction
  • Smart chunking (1024 chars with 20 overlap)
  • Metadata preservation
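
To see what this step produces, here is a minimal sketch using LlamaIndex's SentenceSplitter with the same chunk settings; the shipped script may wire this up slightly differently:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load the PDFs and split them into overlapping chunks
documents = SimpleDirectoryReader("./data/").load_data()
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} document(s) -> {len(nodes)} chunks")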

2. BGE-M3 Embeddings

  • 🌍 100+ languages supported
  • 🎯 8192 token context length
  • 📊 SOTA performance on multilingual benchmarks
  • 🔄 Hybrid retrieval: Dense + sparse + multi-vector
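
As a quick sanity check, you can call the embedding endpoint directly through LiteLLM. This is a minimal sketch: it assumes XPULink exposes an OpenAI-compatible /v1/embeddings route for the bge-m3:latest model used in the examples below.

import os
from litellm import embedding

response = embedding(
    model="openai/bge-m3:latest",  # "openai/" prefix = generic OpenAI-compatible provider
    input=["RAG combines retrieval with generation."],
    api_key=os.getenv("XPU_API_KEY"),
    api_base="https://www.xpulink.ai/v1",
)
vector = response.data[0]["embedding"]
print(len(vector))  # BGE-M3 dense vectors have 1024 dimensions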

3. Intelligent Retrieval

  • Vector-based semantic search
  • Top-K similar chunks
  • Context-aware matching
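
You can also run the retrieval step on its own to inspect which chunks match a query. A short sketch, assuming an index built as in the Code Examples section below:

retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("What is the refund policy?"):
    # Each hit carries a similarity score and the matched chunk text
    print(f"{hit.score:.3f}  {hit.node.get_content()[:80]}...")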

4. LLM Generation

  • Qwen3-32B for high-quality answers
  • Context-grounded responses
  • Source attribution
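
Source attribution comes from the retrieved chunks attached to each response. For example, assuming a query_engine built as shown later:

response = query_engine.query("What warranty period applies?")
print(response)

# Each source node records the chunk and the file it came from
for source in response.source_nodes:
    print(source.node.metadata.get("file_name"), "-", round(source.score or 0.0, 3))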

5. Production Ready

  • Automatic retry with exponential backoff
  • Comprehensive error handling
  • Batch processing for efficiency
  • Progress tracking
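
The exact retry logic lives in pdf_rag_bge_m3.py; a minimal sketch of the general pattern (exponential backoff between attempts) looks like this:

import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky network call, waiting 1s, 2s, 4s, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)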

đŸ—ī¸ Architecture

┌──────────────┐
│  Your PDFs   │
└──────┬───────┘
       │
       ▼
┌──────────────────────┐
│  Text Extraction     │  ← SimpleDirectoryReader
│  & Chunking          │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  BGE-M3 Embeddings   │  ← XPULink hosted
│  (Cloud)             │     No server needed!
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Vector Index        │  ← LlamaIndex
│  (Local/Memory)      │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  User Query          │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Semantic Search     │  ← Find relevant chunks
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Qwen3-32B LLM       │  ← XPULink hosted
│  (Cloud)             │     vLLM-powered!
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│  Generated Answer    │
└──────────────────────┘

Key Insight: Only the vector index is local - all compute happens on XPULink!


💻 Code Examples

Basic Usage

import os
from dotenv import load_dotenv  # python-dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.litellm import LiteLLM
from pdf_rag_bge_m3 import BGEM3Embedding

# Load XPU_API_KEY from the .env file created in the Quick Start
load_dotenv()
api_key = os.getenv("XPU_API_KEY")

# Configure embeddings (XPULink hosted)
Settings.embed_model = BGEM3Embedding(
    api_base="https://www.xpulink.ai/v1",
    model="bge-m3:latest",
    embed_batch_size=5
)

# Configure LLM (XPULink hosted, using LiteLLM)
Settings.llm = LiteLLM(
    model="openai/qwen3-32b",
    api_key=api_key,
    api_base="https://www.xpulink.ai/v1",
    custom_llm_provider="openai"
)

# Load and index documents
documents = SimpleDirectoryReader("./data/").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is the main topic?")
print(response)

Advanced: Custom Batch Size

# For unstable networks, reduce batch size
Settings.embed_model = BGEM3Embedding(
    api_base="https://www.xpulink.ai/v1",
    model="bge-m3:latest",
    embed_batch_size=3  # Smaller batches = more stable
)

Advanced: Retrieval Tuning

# Return more chunks for better context
query_engine = index.as_query_engine(
    similarity_top_k=5,  # Return 5 most relevant chunks
    response_mode="compact"  # Optimize token usage
)

🎓 Understanding the System

What is RAG?

RAG = Retrieval-Augmented Generation

Instead of relying solely on the LLM's training data, a RAG system works in three steps (sketched in code below):

  1. Retrieve relevant information from your documents
  2. Augment the query with this context
  3. Generate an answer based on actual document content

Benefits:

  • ✅ Up-to-date information
  • ✅ Source attribution
  • ✅ Domain-specific knowledge
  • ✅ Reduced hallucinations
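
Here is the same flow spelled out by hand, assuming the index and Settings.llm from the Code Examples section above (the query engine does all of this for you):

question = "What is the termination clause?"

# 1. Retrieve: find the chunks most similar to the question
chunks = index.as_retriever(similarity_top_k=3).retrieve(question)

# 2. Augment: build a prompt that contains the retrieved context
context = "\n\n".join(c.node.get_content() for c in chunks)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# 3. Generate: let the LLM answer, grounded in that context
print(Settings.llm.complete(prompt))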

Why BGE-M3?

BGE-M3 (BAAI General Embedding, Multilingual, Multitask, Multi-granularity)

  • World-class multilingual: Trained on 100+ languages
  • Long context: Supports up to 8192 tokens
  • Hybrid retrieval: Combines dense, sparse, and multi-vector methods
  • SOTA performance: Top scores on MTEB benchmarks

Perfect for documents in English, Chinese, or mixed languages!

Why Qwen3-32B?

  • 32 billion parameters: Powerful reasoning
  • 128K context: Handle long documents
  • Multilingual: Excellent English and Chinese
  • Fast on vLLM: Optimized for production

📊 Performance

Operation                   Time       Notes
Embedding (100 chunks)      ~5-10s     Depends on network
Index building              ~1-2s      Local operation
Query (single)              ~2-3s      LLM generation time
End-to-end (first query)    ~10-15s    Including index build

Pro Tips:

  • Pre-build the index for production
  • Use caching for repeated queries
  • Batch multiple questions


đŸ› ī¸ Configuration

Environment Variables

# Required
XPU_API_KEY=your_api_key_here

# Optional (defaults shown)
BATCH_SIZE=5
CHUNK_SIZE=1024
CHUNK_OVERLAP=20
SIMILARITY_TOP_K=3

Hyperparameters

Chunk Size (chunk_size=1024):
  • Smaller: more precise retrieval, but may miss context
  • Larger: better context, but less precise matching

Chunk Overlap (chunk_overlap=20):
  • Prevents information loss at chunk boundaries
  • 20-50 is typical

Similarity Top K (similarity_top_k=3):
  • How many chunks to retrieve
  • 3-5 is usually optimal
  • More chunks = more context, but slower

Batch Size (embed_batch_size=5):
  • How many chunks to embed per API call
  • Smaller = more stable on poor networks
  • Larger = faster on good networks
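
The shipped script may read the optional environment variables in its own way; one plausible way to wire them into LlamaIndex looks like this:

import os
from llama_index.core import Settings

Settings.chunk_size = int(os.getenv("CHUNK_SIZE", "1024"))
Settings.chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "20"))

query_engine = index.as_query_engine(
    similarity_top_k=int(os.getenv("SIMILARITY_TOP_K", "3"))
)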


🎯 Use Cases

1. Enterprise Knowledge Base

- Company policies
- Product documentation
- Internal wikis
- Training materials

2. Customer Support

- FAQ automation
- Ticket classification
- Knowledge retrieval
- Response generation

3. Research & Analysis

- Literature review
- Paper summarization
- Cross-document analysis
- Citation finding

4. Legal & Compliance

- Contract analysis
- Regulation lookup
- Precedent search
- Compliance checking

🚀 Scaling to Production

Best Practices

  1. Pre-compute Embeddings

    # Build index once, save it
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir="./storage")

    # Load later (instant!)
    from llama_index.core import StorageContext, load_index_from_storage
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    

  2. Use Vector Databases

    # For large-scale applications
    from llama_index.vector_stores import ChromaVectorStore
    # or Pinecone, Weaviate, Milvus, etc.
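
    # A fuller sketch with Chroma (assumes the chromadb package and the
    # LlamaIndex Chroma integration are installed; exact import paths vary by version)
    import chromadb
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    client = chromadb.PersistentClient(path="./chroma_db")  # on-disk Chroma store
    collection = client.get_or_create_collection("rag_docs")
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)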
    

  3. Implement Caching

    # Cache query results
    from functools import lru_cache
    
    @lru_cache(maxsize=100)
    def cached_query(question: str):
        return query_engine.query(question)
    

  4. Monitor Usage

    # Track API calls, costs, performance
    import time
    start = time.time()
    response = query_engine.query(question)
    print(f"Query took {time.time() - start:.2f}s")
    


🐛 Troubleshooting

Common Issues

1. "Connection Error" or "Incomplete Read" - ✅ Solution: Reduce embed_batch_size to 3 or even 1 - The system automatically retries with exponential backoff

2. "Out of Memory" - ✅ Solution: Process documents in smaller batches - Or use a vector database instead of in-memory index

3. "Poor Answer Quality" - ✅ Solution: Increase similarity_top_k to retrieve more chunks - Or adjust chunk_size for better granularity

4. "Slow Performance" - ✅ Solution: Pre-build and persist index - Use smaller similarity_top_k - Consider caching


💡 Tips & Tricks

Improving Answer Quality

  1. Better Document Preparation

     • Clean your PDFs (remove headers/footers)
     • Use high-quality OCR for scanned docs
     • Structure data with clear headings

  2. Query Engineering

     • Be specific in questions
     • Use domain terminology
     • Break complex questions into parts

  3. Retrieval Tuning

     • Experiment with similarity_top_k
     • Try different chunk_size values
     • Consider hybrid search methods


🤝 Support

Need help?

  • 📧 Email: support@xpulink.net
  • 🌐 Web: www.xpulink.ai
  • 💬 GitHub Issues: Open an issue


🌟 Why Developers Choose This Stack

"Set up in 5 minutes, production-ready the same day. XPULink + LiteLLM is perfect." — Jessica, AI Engineer

"No embedding server, no LLM hosting, no headaches. Just works." — David, Startup CTO

"The automatic retries saved me hours of debugging network issues." — Raj, ML Engineer


Ready to build your document Q&A system?

Get your free API key at www.xpulink.ai and start in minutes! 🚀


Powered by vLLM | Built with LiteLLM | Optimized for Production