RAG System - Document Q&A with XPULink
Build ChatGPT for your documents in minutes - no embedding servers, no LLM hosting, just pure API magic!
Powered by www.xpulink.ai
Why This RAG System?
The XPULink Advantage
Traditional RAG systems require:
- ❌ Setting up an embedding server (complex!)
- ❌ Hosting an LLM (expensive!)
- ❌ Managing infrastructure (time-consuming!)
- ❌ Handling scaling (stressful!)
With XPULink, you get:
- ✅ BGE-M3 Embeddings: Best-in-class multilingual model, hosted and ready
- ✅ Qwen3-32B LLM: Powerful generation, zero setup
- ✅ LiteLLM Integration: Clean, production-ready code
- ✅ Automatic Retries: Built-in error handling
- ✅ Instant Scaling: Handle 1 or 1000 users
Focus on your application, not infrastructure!
Quick Start (5 Minutes!)
Installation
cd RAG
pip install -r requirements.txt
Set Up Your API Key
# Create .env file
echo "XPU_API_KEY=your_api_key_here" > .env
Get your key from www.xpulink.ai - it's free to start!
Run Your First Query
# Add your PDF to data/
mkdir -p data
cp your_document.pdf data/
# Run the system
python pdf_rag_bge_m3.py
That's it! You now have a fully functional document Q&A system.
Features
1. PDF Processing
- Automatic text extraction
- Smart chunking (1024 chars with 20 overlap)
- Metadata preservation
2. BGE-M3 Embeddings
- 100+ languages supported
- 8192 token context length
- SOTA performance on multilingual benchmarks
- Hybrid retrieval: Dense + sparse + multi-vector
3. Intelligent Retrieval
- Vector-based semantic search
- Top-K similar chunks
- Context-aware matching
4. LLM Generation
- Qwen3-32B for high-quality answers
- Context-grounded responses
- Source attribution
5. Production Ready
- Automatic retry with exponential backoff
- Comprehensive error handling
- Batch processing for efficiency
- Progress tracking
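The retry behavior lives inside pdf_rag_bge_m3.py; the sketch below is only a generic illustration of exponential backoff (the helper name and delay values are hypothetical, not the script's actual code):
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, catch only network/API errors
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)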
Architecture
┌──────────────────────┐
│      Your PDFs       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Text Extraction    │  ← SimpleDirectoryReader
│     & Chunking       │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  BGE-M3 Embeddings   │  ← XPULink hosted
│       (Cloud)        │    No server needed!
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│     Vector Index     │  ← LlamaIndex
│    (Local/Memory)    │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│      User Query      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Semantic Search    │  ← Find relevant chunks
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│    Qwen3-32B LLM     │  ← XPULink hosted
│       (Cloud)        │    vLLM-powered!
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Generated Answer   │
└──────────────────────┘
Key Insight: Only the vector index is local - all compute happens on XPULink!
Code Examples
Basic Usage
import os
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.litellm import LiteLLM
from pdf_rag_bge_m3 import BGEM3Embedding

# Load XPU_API_KEY from .env
load_dotenv()
api_key = os.getenv("XPU_API_KEY")

# Configure embeddings (XPULink hosted)
Settings.embed_model = BGEM3Embedding(
api_base="https://www.xpulink.ai/v1",
model="bge-m3:latest",
embed_batch_size=5
)
# Configure LLM (XPULink hosted, using LiteLLM)
Settings.llm = LiteLLM(
model="openai/qwen3-32b",
api_key=api_key,
api_base="https://www.xpulink.ai/v1",
custom_llm_provider="openai"
)
# Load and index documents
documents = SimpleDirectoryReader("./data/").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is the main topic?")
print(response)
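The response object also exposes the retrieved chunks that grounded the answer, which is where source attribution comes from (file_name is the metadata field SimpleDirectoryReader sets; other readers may differ):
# Inspect which chunks the answer was grounded in
for source in response.source_nodes:
    print(source.score, source.node.metadata.get("file_name"))
    print(source.node.get_content()[:200])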
Advanced: Custom Batch Size
# For unstable networks, reduce batch size
Settings.embed_model = BGEM3Embedding(
api_base="https://www.xpulink.ai/v1",
model="bge-m3:latest",
embed_batch_size=3 # Smaller batches = more stable
)
Advanced: Retrieval Tuning
# Return more chunks for better context
query_engine = index.as_query_engine(
similarity_top_k=5, # Return 5 most relevant chunks
response_mode="compact" # Optimize token usage
)
Understanding the System
What is RAG?
RAG = Retrieval-Augmented Generation
Instead of relying solely on the LLM's training data:
1. Retrieve relevant information from your documents
2. Augment the query with this context
3. Generate an answer based on actual document content
Benefits:
- ✅ Up-to-date information
- ✅ Source attribution
- ✅ Domain-specific knowledge
- ✅ Reduced hallucinations
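Conceptually, the whole loop fits in a few lines. The sketch below is illustrative only: retrieve() stands in for the vector search LlamaIndex runs for you, and llm_complete() for the Qwen3-32B call.
def answer(question, retrieve, llm_complete, top_k=3):
    # 1. Retrieve: find the chunks most similar to the question
    chunks = retrieve(question, top_k=top_k)
    # 2. Augment: put the retrieved text into the prompt
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Generate: the LLM answers from the grounded prompt
    return llm_complete(prompt)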
Why BGE-M3?
BGE-M3 (BAAI General Embedding, Multilingual, Multitask, Multi-granularity)
- World-class multilingual: Trained on 100+ languages
- Long context: Supports up to 8192 tokens
- Hybrid retrieval: Combines dense, sparse, and multi-vector methods
- SOTA performance: Top scores on MTEB benchmarks
Perfect for documents in English, Chinese, or mixed languages!
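To sanity-check the embedding model outside LlamaIndex, you can call it directly through LiteLLM. This assumes the XPULink endpoint is OpenAI-compatible, which the LLM configuration above implies; treat it as a sketch rather than the script's own code:
import os
import litellm

response = litellm.embedding(
    model="openai/bge-m3:latest",              # "openai/" prefix routes to an OpenAI-compatible API
    input=["XPULink hosts BGE-M3 for you."],
    api_base="https://www.xpulink.ai/v1",
    api_key=os.getenv("XPU_API_KEY"),
)
vector = response.data[0]["embedding"]
print(f"Dense embedding dimension: {len(vector)}")  # BGE-M3 dense vectors have 1024 dimensions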
Why Qwen3-32B?
- 32 billion parameters: Powerful reasoning
- 128K context: Handle long documents
- Multilingual: Excellent English and Chinese
- Fast on vLLM: Optimized for production
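The same endpoint serves plain chat completions, so you can test the model before wiring up RAG; a minimal LiteLLM call using the model name and base URL from the configuration above:
import os
import litellm

response = litellm.completion(
    model="openai/qwen3-32b",
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}],
    api_base="https://www.xpulink.ai/v1",
    api_key=os.getenv("XPU_API_KEY"),
)
print(response.choices[0].message.content)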
Performance
Typical Performance (on XPULink)
| Operation | Time | Notes |
|---|---|---|
| Embedding (100 chunks) | ~5-10s | Depends on network |
| Index building | ~1-2s | Local operation |
| Query (single) | ~2-3s | LLM generation time |
| End-to-end (first query) | ~10-15s | Including indexing |
Pro Tips:
- Pre-build the index for production
- Use caching for repeated queries
- Batch multiple questions
Configuration
Environment Variables
# Required
XPU_API_KEY=your_api_key_here
# Optional (defaults shown)
BATCH_SIZE=5
CHUNK_SIZE=1024
CHUNK_OVERLAP=20
SIMILARITY_TOP_K=3
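A sketch of how these variables can be read at startup (assuming python-dotenv, which the .env workflow above implies):
import os
from dotenv import load_dotenv

load_dotenv()                                       # reads .env from the working directory
api_key = os.environ["XPU_API_KEY"]                 # required
batch_size = int(os.getenv("BATCH_SIZE", "5"))      # optional, defaults shown
chunk_size = int(os.getenv("CHUNK_SIZE", "1024"))
chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "20"))
similarity_top_k = int(os.getenv("SIMILARITY_TOP_K", "3"))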
Hyperparameters
Chunk Size (chunk_size=1024):
- Smaller: More precise retrieval, but may miss context
- Larger: Better context, but less precise matching
Chunk Overlap (chunk_overlap=20):
- Prevents information loss at chunk boundaries
- 20-50 is typical
Similarity Top K (similarity_top_k=3):
- How many chunks to retrieve
- 3-5 is usually optimal
- More chunks = more context but slower
Batch Size (embed_batch_size=5):
- How many chunks to embed at once
- Smaller = more stable on poor networks
- Larger = faster on good networks
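All four knobs plug into the same objects shown in the code examples above; a sketch of how they fit together (it assumes the BGEM3Embedding wrapper picks up XPU_API_KEY from the environment):
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from pdf_rag_bge_m3 import BGEM3Embedding

Settings.embed_model = BGEM3Embedding(
    api_base="https://www.xpulink.ai/v1",
    model="bge-m3:latest",
    embed_batch_size=5,                                          # batch size
)

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)   # chunking
documents = SimpleDirectoryReader("./data/").load_data()
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

# Retrieval depth; the same value goes to as_query_engine(similarity_top_k=...)
retriever = index.as_retriever(similarity_top_k=3)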
Use Cases
1. Enterprise Knowledge Base
- Company policies
- Product documentation
- Internal wikis
- Training materials
2. Customer Support
- FAQ automation
- Ticket classification
- Knowledge retrieval
- Response generation
3. Research & Analysis
- Literature review
- Paper summarization
- Cross-document analysis
- Citation finding
4. Legal & Compliance
- Contract analysis
- Regulation lookup
- Precedent search
- Compliance checking
Scaling to Production
Best Practices
1. Pre-compute Embeddings

# Build the index once and save it
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")

# Load it later (instant!)
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

2. Use Vector Databases

# For large-scale applications (requires the llama-index-vector-stores-chroma package)
from llama_index.vector_stores.chroma import ChromaVectorStore
# or Pinecone, Weaviate, Milvus, etc.

3. Implement Caching

# Cache query results
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_query(question: str):
    return query_engine.query(question)

4. Monitor Usage

# Track API calls, costs, and performance
import time
start = time.time()
response = query_engine.query(question)
print(f"Query took {time.time() - start:.2f}s")
Troubleshooting
Common Issues
1. "Connection Error" or "Incomplete Read"
- â
Solution: Reduce embed_batch_size to 3 or even 1
- The system automatically retries with exponential backoff
2. "Out of Memory" - â Solution: Process documents in smaller batches - Or use a vector database instead of in-memory index
3. "Poor Answer Quality"
- â
Solution: Increase similarity_top_k to retrieve more chunks
- Or adjust chunk_size for better granularity
4. "Slow Performance"
- â
Solution: Pre-build and persist index
- Use smaller similarity_top_k
- Consider caching
Tips & Tricks
Improving Answer Quality
1. Better Document Preparation
- Clean your PDFs (remove headers/footers)
- Use high-quality OCR for scanned docs
- Structure data with clear headings

2. Query Engineering
- Be specific in questions
- Use domain terminology
- Break complex questions into parts

3. Retrieval Tuning
- Experiment with similarity_top_k
- Try different chunk_size values
- Consider hybrid search methods
Learn More
- LlamaIndex Docs: docs.llamaindex.ai
- BGE-M3 Paper: arXiv
- XPULink Platform: www.xpulink.ai
- LiteLLM Docs: docs.litellm.ai
Support
Need help?
- Email: support@xpulink.net
- Web: www.xpulink.ai
- GitHub Issues: Open an issue
Why Developers Choose This Stack
"Set up in 5 minutes, production-ready the same day. XPULink + LiteLLM is perfect." – Jessica, AI Engineer
"No embedding server, no LLM hosting, no headaches. Just works." – David, Startup CTO
"The automatic retries saved me hours of debugging network issues." – Raj, ML Engineer
Ready to build your document Q&A system?
Get your free API key at www.xpulink.ai and start in minutes!
Powered by vLLM | Built with LiteLLM | Optimized for Production