Building Production-Ready RAG Systems with LangChain
Retrieval-Augmented Generation (RAG) has become the go-to architecture for building AI applications that need to work with private data. In this post, I'll share our experience building production RAG systems at Nordic Oculus, including practical tips and code examples.
What is RAG?
RAG combines the power of large language models with your own data by:
- Retrieving the documents most relevant to a user's query from your knowledge base
- Augmenting the prompt with that retrieved context
- Generating an answer grounded in the retrieved material
This approach solves key limitations of LLMs: outdated training data, hallucinations, and lack of domain-specific knowledge.
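In code form, the whole pattern is simply retrieve, then generate. The sketch below is purely illustrative; the retriever and llm placeholders stand in for the real components built later in this post:
def answer(question, retriever, llm):
    # 1. Retrieve the chunks most relevant to the question
    docs = retriever.get_relevant_documents(question)
    # 2. Augment the prompt with the retrieved context
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate a grounded answer
    return llm.predict(prompt)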
Architecture Overview
A typical RAG architecture we implement is built from the following components:
- Client Layer: Web applications, mobile apps, and API clients interact with your RAG system
- API Gateway: Handles authentication, rate limiting, and load balancing
- Spring Boot Server: Orchestrates requests and manages business logic
- LangChain Layer: Handles document processing, chunking, and chain management
- Vector Database (Qdrant): Stores and retrieves high-dimensional embeddings
- LLM Integration: Multiple model options including Claude 3 Opus/Sonnet and Gemini 2.0 Pro
- Caching Layer: Redis for performance optimization (see the caching sketch just after this list)
- Data Sources: Integration with various databases and storage systems
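As a concrete example of the caching layer, LangChain can transparently cache LLM responses in Redis. A minimal sketch, assuming a Redis instance reachable at localhost (swap in your own connection URL):
import langchain
import redis
from langchain.cache import RedisCache

# Identical LLM calls are served from Redis instead of hitting the model again.
langchain.llm_cache = RedisCache(redis.Redis.from_url("redis://localhost:6379"))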
All of the examples below are built on LangChain and share the following imports:
import os

import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import AzureOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
1. Document Processing Pipeline
The first step is ingesting and processing your documents:
def process_documents(file_paths):
    documents = []
    for path in file_paths:
        loader = PyPDFLoader(path)
        documents.extend(loader.load())

    # Smart chunking is crucial
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ".", " "]
    )
    chunks = text_splitter.split_documents(documents)
    return chunks
2. Embedding and Vector Storage
Next, we convert text chunks into vector embeddings:
def create_vector_store(chunks):
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002"
    )

    # Initialize Pinecone
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment="us-east-1"
    )

    # Create or update index
    vector_store = Pinecone.from_documents(
        chunks,
        embeddings,
        index_name="production-rag"
    )
    return vector_store
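The architecture above lists Qdrant as the vector database, while this example uses Pinecone; in LangChain the swap is nearly a drop-in. A rough sketch, assuming a Qdrant instance at a placeholder URL and collection name:
from langchain.vectorstores import Qdrant

def create_qdrant_store(chunks, embeddings):
    # Qdrant equivalent of the Pinecone setup above; the URL and collection
    # name are placeholders for your own deployment.
    return Qdrant.from_documents(
        chunks,
        embeddings,
        url="http://localhost:6333",
        collection_name="production-rag"
    )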
3. Retrieval Chain
The retrieval chain combines everything:
def create_rag_chain(vector_store):
    llm = AzureOpenAI(
        deployment_name="gpt-4",
        temperature=0.1
    )

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 5}
        ),
        return_source_documents=True
    )
    return qa_chain
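Querying the chain then looks like this; because return_source_documents=True, the result includes the chunks the answer was grounded in (the example question is just illustrative):
qa_chain = create_rag_chain(create_vector_store(chunks))

result = qa_chain({"query": "What does our security policy say about data retention?"})
print(result["result"])                    # the generated answer
for doc in result["source_documents"]:     # the supporting chunks
    print(doc.metadata.get("source"))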
Production Considerations
1. Chunking Strategy
The way you split documents dramatically impacts retrieval quality: chunks that are too small lose surrounding context, while chunks that are too large dilute relevance and crowd the context window. Match chunk size, overlap, and separators to the structure of your content.
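One common refinement is splitting by token count rather than character count, so chunk sizes line up with what the embedding model actually sees. A sketch, with sizes that are assumptions to tune per corpus:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Token-based splitting keeps chunk sizes aligned with embedding/LLM limits.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=400,     # tokens per chunk -- tune for your corpus
    chunk_overlap=50    # token overlap to preserve context across boundaries
)
token_chunks = token_splitter.split_documents(documents)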
2. Hybrid Search
Combine semantic and keyword search for better results:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

def create_hybrid_retriever(documents, vector_store):
    # Keyword-based retriever
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = 3

    # Semantic retriever
    semantic_retriever = vector_store.as_retriever(
        search_kwargs={"k": 3}
    )

    # Combine both
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, semantic_retriever],
        weights=[0.3, 0.7]
    )
    return ensemble_retriever
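The ensemble retriever drops straight into the same RetrievalQA chain as before; a brief sketch, assuming the llm and chunks from the earlier examples:
hybrid_retriever = create_hybrid_retriever(chunks, vector_store)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=hybrid_retriever,    # hybrid retrieval instead of pure similarity search
    return_source_documents=True
)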
Common Pitfalls and Solutions
1. Context Window Limits
Problem: Retrieved documents exceed LLM context window.
Solution: Implement intelligent truncation:
def fit_to_context_window(chunks, max_tokens=3000):
    selected_chunks = []
    current_tokens = 0
    for chunk in chunks:
        # Rough estimate: whitespace-separated words as a proxy for tokens
        chunk_tokens = len(chunk.page_content.split())
        if current_tokens + chunk_tokens <= max_tokens:
            selected_chunks.append(chunk)
            current_tokens += chunk_tokens
        else:
            break
    return selected_chunks
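Counting whitespace-separated words is only a rough proxy for tokens; if the budget is tight, tiktoken gives exact counts. A small sketch, assuming the cl100k_base encoding used by recent OpenAI models:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    # Exact token count for the chosen encoding, instead of a word-count estimate
    return len(encoding.encode(text))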
2. Hallucinations
Problem: LLM generates information not in the context.
Solution: Add explicit instructions and citations:
qa_prompt_template = """
Use ONLY the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."
Context: {context}
Question: {question}
Answer with citations [1], [2], etc:
"""
Conclusion
Building production RAG systems requires careful attention to:
- Document processing and chunking strategy
- Retrieval quality, including hybrid semantic and keyword search
- Context window management
- Grounding answers in retrieved sources to limit hallucinations
- Caching and infrastructure for performance at scale
At Nordic Oculus, we've successfully deployed RAG systems processing millions of documents daily. The key is starting simple and iterating based on real-world performance data.
Want to learn more about implementing RAG for your organization? Contact us for a consultation.
About the Author
The Nordic Oculus team brings together skills in modern software development, cloud architecture, and emerging technologies.
Learn more about our team