
Building Production-Ready RAG Systems with LangChain

Technical Walkthrough
Nordic Oculus Team
AI · RAG · LangChain · Tutorial

Retrieval-Augmented Generation (RAG) has become the go-to architecture for building AI applications that need to work with private data. In this post, we'll share our experience building production RAG systems at Nordic Oculus, including practical tips and code examples.

What is RAG?

RAG combines the power of large language models with your own data by:

  • **Retrieving** relevant documents based on a query
  • **Augmenting** the LLM prompt with this context
  • **Generating** accurate, grounded responses

This approach solves key limitations of LLMs: outdated training data, hallucinations, and lack of domain-specific knowledge.
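
To make the flow concrete, here is a minimal, framework-agnostic sketch of those three steps. The search_index and llm objects are placeholders standing in for your vector store and model client; the similarity_search and predict calls mirror LangChain-style methods but are assumptions for illustration, not a specific library API.

python
def answer_with_rag(question, search_index, llm, k=5):
    # 1. Retrieve: find the k chunks most similar to the question
    context_chunks = search_index.similarity_search(question, k=k)

    # 2. Augment: inject the retrieved text into the prompt
    context = "\n\n".join(chunk.page_content for chunk in context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: let the LLM produce a grounded answer
    return llm.predict(prompt)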

    Architecture Overview

    Here's a typical RAG architecture we implement:

    [Architecture diagram: Production RAG System Architecture — client applications (web, mobile, API clients) → API gateway (authentication, rate limiting, load balancing) → Spring Boot server (business logic, request orchestration, response formatting, monitoring via Prometheus/Grafana/ELK) → LangChain orchestration layer (document loaders for PDF/DOCX/TXT, semantic text splitting and chunking, RetrievalQA chain management) → embedding models (OpenAI Ada-002, Cohere Embed, custom BERT) → Qdrant vector DB (high-performance similarity search, metadata filtering), with a Redis cache (query, embedding, and session caches), multiple LLM backends (Claude 3 Opus, Claude 3.5 Sonnet, Gemini 2.0 Pro, GPT-4 Turbo), and data sources (PostgreSQL, MongoDB, S3, SharePoint) fed in via an ETL pipeline.]

    Key Components Explained:

    1. Client Layer: Web applications, mobile apps, and API clients interact with your RAG system
    2. API Gateway: Handles authentication, rate limiting, and load balancing
    3. Spring Boot Server: Orchestrates requests and manages business logic
    4. LangChain Layer: Handles document processing, chunking, and chain management
    5. Vector Database (Qdrant): Stores and retrieves high-dimensional embeddings
    6. LLM Integration: Multiple model options including Claude 3 Opus/Sonnet and Gemini 2.0 Pro
    7. Caching Layer: Redis for performance optimization
    8. Data Sources: Integration with various databases and storage systems
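
    A pattern we rely on for component 7 is caching answers (and optionally embeddings) in Redis so repeated questions skip retrieval and generation entirely. Below is a minimal sketch using the redis-py client; the key scheme, TTL, and connection settings are illustrative assumptions, and qa_chain refers to the RetrievalQA chain built later in this post.

    python
    import hashlib
    import json

    import redis

    cache = redis.Redis(host="localhost", port=6379, db=0)

    def cached_answer(question, qa_chain, ttl_seconds=3600):
        # Key the cache on a hash of the normalized question
        key = "rag:answer:" + hashlib.sha256(
            question.strip().lower().encode()
        ).hexdigest()

        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)

        # Cache miss: run the full RAG pipeline and store the answer
        result = qa_chain({"query": question})
        payload = {"answer": result["result"]}
        cache.set(key, json.dumps(payload), ex=ttl_seconds)
        return payload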

    Here's how we implement this architecture with LangChain, starting with the imports used throughout the examples:

    python
    import os

    import pinecone
    from langchain.chains import RetrievalQA
    from langchain.document_loaders import PyPDFLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import AzureOpenAI
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Pinecone

    1. Document Processing Pipeline

    The first step is ingesting and processing your documents:

    python
    def process_documents(file_paths):
        documents = []
        
        for path in file_paths:
            loader = PyPDFLoader(path)
            documents.extend(loader.load())
        
        # Smart chunking is crucial
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ".", " "]
        )
        
        chunks = text_splitter.split_documents(documents)
        return chunks
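
    The architecture above also ingests DOCX and TXT files; a straightforward extension is to pick a loader based on the file extension. This is a sketch using LangChain's Docx2txtLoader and TextLoader (the DOCX loader requires the docx2txt package):

    python
    from langchain.document_loaders import Docx2txtLoader, TextLoader

    def load_any(path):
        # Choose a loader based on the file extension
        if path.lower().endswith(".pdf"):
            return PyPDFLoader(path).load()
        if path.lower().endswith(".docx"):
            return Docx2txtLoader(path).load()
        # Fall back to plain text for everything else
        return TextLoader(path, encoding="utf-8").load()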

    2. Embedding and Vector Storage

    Next, we convert text chunks into vector embeddings:

    python
    def create_vector_store(chunks):
        embeddings = OpenAIEmbeddings(
            model="text-embedding-ada-002"
        )
        
        # Initialize Pinecone
        pinecone.init(
            api_key=os.environ["PINECONE_API_KEY"],
            environment="us-east-1"
        )
        
        # Create or update index
        vector_store = Pinecone.from_documents(
            chunks,
            embeddings,
            index_name="production-rag"
        )
        
        return vector_store

    3. Retrieval Chain

    The retrieval chain combines everything:

    python
    def create_rag_chain(vector_store):
        llm = AzureOpenAI(
            deployment_name="gpt-4",
            temperature=0.1
        )
        
        qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=vector_store.as_retriever(
                search_type="similarity",
                search_kwargs={"k": 5}
            ),
            return_source_documents=True
        )
        
        return qa_chain
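
    Because the chain is created with return_source_documents=True, each response includes the chunks the answer was grounded in. A quick usage sketch (the file path and question are illustrative):

    python
    chunks = process_documents(["./docs/handbook.pdf"])  # illustrative path
    qa_chain = create_rag_chain(create_vector_store(chunks))

    result = qa_chain({"query": "What is our refund policy?"})
    print(result["result"])                 # the generated answer
    for doc in result["source_documents"]:  # the retrieved chunks behind it
        print(doc.metadata.get("source"), doc.metadata.get("page"))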

    Production Considerations

    1. Chunking Strategy

    The way you split documents dramatically impacts retrieval quality:

  • **Semantic chunking**: Split on meaningful boundaries
  • **Overlap**: Include context from adjacent chunks
  • **Size optimization**: Balance between context and precision
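
    One way to keep chunk sizes aligned with the model's tokenizer (rather than raw character counts) is LangChain's tiktoken-aware splitter. The sizes below are illustrative starting points, not tuned values, and the tiktoken package must be installed:

    python
    token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # tokenizer family used by recent OpenAI models
        chunk_size=512,               # measured in tokens, not characters
        chunk_overlap=64,
    )

    token_chunks = token_splitter.split_documents(documents)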

    2. Hybrid Search

    Combine semantic and keyword search for better results:

    python
    from langchain.retrievers import EnsembleRetriever
    from langchain.retrievers import BM25Retriever
    
    def create_hybrid_retriever(documents, vector_store):
        # Keyword-based retriever
        bm25_retriever = BM25Retriever.from_documents(documents)
        bm25_retriever.k = 3
        
        # Semantic retriever
        semantic_retriever = vector_store.as_retriever(
            search_kwargs={"k": 3}
        )
        
        # Combine both
        ensemble_retriever = EnsembleRetriever(
            retrievers=[bm25_retriever, semantic_retriever],
            weights=[0.3, 0.7]
        )
        
        return ensemble_retriever
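
    The ensemble retriever drops straight into the same RetrievalQA setup from earlier; a sketch, assuming the llm, documents, and vector_store objects from the previous snippets (BM25Retriever also requires the rank_bm25 package):

    python
    hybrid_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=create_hybrid_retriever(documents, vector_store),
        return_source_documents=True
    )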

    Common Pitfalls and Solutions

    1. Context Window Limits

    Problem: Retrieved documents exceed LLM context window.

    Solution: Implement intelligent truncation:

    python
    def fit_to_context_window(chunks, max_tokens=3000):
        selected_chunks = []
        current_tokens = 0
        
        for chunk in chunks:
            chunk_tokens = len(chunk.page_content.split())
            if current_tokens + chunk_tokens <= max_tokens:
                selected_chunks.append(chunk)
                current_tokens += chunk_tokens
            else:
                break
        
        return selected_chunks
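
    The snippet above approximates token counts by splitting on whitespace, which undercounts for most tokenizers. If you need an exact budget, tiktoken gives real token counts; here's a small sketch assuming the cl100k_base encoding:

    python
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text):
        # Exact token count under the chosen encoding
        return len(encoding.encode(text))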

    2. Hallucinations

    Problem: LLM generates information not in the context.

    Solution: Add explicit instructions and citations:

    python
    qa_prompt_template = """
    Use ONLY the following context to answer the question.
    If the answer is not in the context, say "I don't have enough information."
    
    Context: {context}
    
    Question: {question}
    
    Answer with citations [1], [2], etc:
    """

    Conclusion

    Building production RAG systems requires careful attention to:

  • Document processing and chunking strategies
  • Retrieval optimization and hybrid search
  • Prompt engineering and hallucination prevention
  • Performance optimization and caching
  • Monitoring and continuous improvement

    At Nordic Oculus, we've successfully deployed RAG systems processing millions of documents daily. The key is starting simple and iterating based on real-world performance data.

    Want to learn more about implementing RAG for your organization? Contact us for a consultation.

    About the Author

    The Nordic Oculus team brings together skills in modern software development, cloud architecture, and emerging technologies.

    Learn more about our team