
Building Production-Ready RAG Systems with LangChain

Technical Walkthrough
Nordic Oculus Team
AI · RAG · LangChain · Tutorial

Retrieval-Augmented Generation (RAG) has become the go-to architecture for building AI applications that need to work with private data. In this post, we'll share our experience building production RAG systems at Nordic Oculus, including practical tips and code examples.

What is RAG?

RAG combines the power of large language models with your own data by:

  • **Retrieving** relevant documents based on a query
  • **Augmenting** the LLM prompt with this context
  • **Generating** accurate, grounded responses

This approach solves key limitations of LLMs: outdated training data, hallucinations, and lack of domain-specific knowledge.
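
To make the flow concrete, here is a minimal, framework-agnostic sketch of those three steps. The search_index and llm objects are placeholders standing in for your vector store and model client; the similarity_search and predict calls mirror LangChain-style methods but are assumptions for illustration, not a specific library API.

python
def answer_with_rag(question, search_index, llm, k=5):
    # 1. Retrieve: find the k chunks most similar to the question
    context_chunks = search_index.similarity_search(question, k=k)

    # 2. Augment: inject the retrieved text into the prompt
    context = "\n\n".join(chunk.page_content for chunk in context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: let the LLM produce a grounded answer
    return llm.predict(prompt)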

    Architecture Overview

    Here's a typical RAG architecture we implement:

    [Architecture diagram: Production RAG System Architecture — client applications (web, mobile, API clients) → API gateway (authentication, rate limiting, load balancing) → Spring Boot server (business logic, request orchestration, response formatting, monitoring via Prometheus/Grafana/ELK) → LangChain orchestration layer (document loaders for PDF/DOCX/TXT, semantic text splitting and chunking, RetrievalQA chain management) → embedding models (OpenAI Ada-002, Cohere Embed, custom BERT) → Qdrant vector DB (high-performance similarity search, metadata filtering), with a Redis cache (query, embedding, and session caches), multiple LLM backends (Claude 3 Opus, Claude 3.5 Sonnet, Gemini 2.0 Pro, GPT-4 Turbo), and data sources (PostgreSQL, MongoDB, S3, SharePoint) fed in via an ETL pipeline.]

    Key Components Explained:

    1. Client Layer: Web applications, mobile apps, and API clients interact with your RAG system
    2. API Gateway: Handles authentication, rate limiting, and load balancing
    3. Spring Boot Server: Orchestrates requests and manages business logic
    4. LangChain Layer: Handles document processing, chunking, and chain management
    5. Vector Database (Qdrant): Stores and retrieves high-dimensional embeddings
    6. LLM Integration: Multiple model options including Claude 3 Opus/Sonnet and Gemini 2.0 Pro
    7. Caching Layer: Redis for performance optimization
    8. Data Sources: Integration with various databases and storage systems
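
    A pattern we rely on for component 7 is caching answers (and optionally embeddings) in Redis so repeated questions skip retrieval and generation entirely. Below is a minimal sketch using the redis-py client; the key scheme, TTL, and connection settings are illustrative assumptions, and qa_chain refers to the RetrievalQA chain built later in this post.

    python
    import hashlib
    import json

    import redis

    cache = redis.Redis(host="localhost", port=6379, db=0)

    def cached_answer(question, qa_chain, ttl_seconds=3600):
        # Key the cache on a hash of the normalized question
        key = "rag:answer:" + hashlib.sha256(
            question.strip().lower().encode()
        ).hexdigest()

        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)

        # Cache miss: run the full RAG pipeline and store the answer
        result = qa_chain({"query": question})
        payload = {"answer": result["result"]}
        cache.set(key, json.dumps(payload), ex=ttl_seconds)
        return payload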

    Here's how we implement this architecture with LangChain, starting with the imports used throughout the examples:

    python
    import os

    import pinecone
    from langchain.chains import RetrievalQA
    from langchain.document_loaders import PyPDFLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import AzureOpenAI
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Pinecone

    1. Document Processing Pipeline

    The first step is ingesting and processing your documents:

    python
    def process_documents(file_paths):
        documents = []
        
        for path in file_paths:
            loader = PyPDFLoader(path)
            documents.extend(loader.load())
        
        # Smart chunking is crucial
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ".", " "]
        )
        
        chunks = text_splitter.split_documents(documents)
        return chunks
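
    The architecture above also ingests DOCX and TXT files; a straightforward extension is to pick a loader based on the file extension. This is a sketch using LangChain's Docx2txtLoader and TextLoader (the DOCX loader requires the docx2txt package):

    python
    from langchain.document_loaders import Docx2txtLoader, TextLoader

    def load_any(path):
        # Choose a loader based on the file extension
        if path.lower().endswith(".pdf"):
            return PyPDFLoader(path).load()
        if path.lower().endswith(".docx"):
            return Docx2txtLoader(path).load()
        # Fall back to plain text for everything else
        return TextLoader(path, encoding="utf-8").load()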

    2. Embedding and Vector Storage

    Next, we convert text chunks into vector embeddings:

    python
    def create_vector_store(chunks):
        embeddings = OpenAIEmbeddings(
            model="text-embedding-ada-002"
        )
        
        # Initialize Pinecone
        pinecone.init(
            api_key=os.environ["PINECONE_API_KEY"],
            environment="us-east-1"
        )
        
        # Create or update index
        vector_store = Pinecone.from_documents(
            chunks,
            embeddings,
            index_name="production-rag"
        )
        
        return vector_store

    3. Retrieval Chain

    The retrieval chain combines everything:

    python
    def create_rag_chain(vector_store):
        llm = AzureOpenAI(
            deployment_name="gpt-4",
            temperature=0.1
        )
        
        qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=vector_store.as_retriever(
                search_type="similarity",
                search_kwargs={"k": 5}
            ),
            return_source_documents=True
        )
        
        return qa_chain
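
    Because the chain is created with return_source_documents=True, each response includes the chunks the answer was grounded in. A quick usage sketch (the file path and question are illustrative):

    python
    chunks = process_documents(["./docs/handbook.pdf"])  # illustrative path
    qa_chain = create_rag_chain(create_vector_store(chunks))

    result = qa_chain({"query": "What is our refund policy?"})
    print(result["result"])                 # the generated answer
    for doc in result["source_documents"]:  # the retrieved chunks behind it
        print(doc.metadata.get("source"), doc.metadata.get("page"))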

    Production Considerations

    1. Chunking Strategy

    The way you split documents dramatically impacts retrieval quality:

  • **Semantic chunking**: Split on meaningful boundaries
  • **Overlap**: Include context from adjacent chunks
  • **Size optimization**: Balance between context and precision
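
    One way to keep chunk sizes aligned with the model's tokenizer (rather than raw character counts) is LangChain's tiktoken-aware splitter. The sizes below are illustrative starting points, not tuned values, and the tiktoken package must be installed:

    python
    token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # tokenizer family used by recent OpenAI models
        chunk_size=512,               # measured in tokens, not characters
        chunk_overlap=64,
    )

    token_chunks = token_splitter.split_documents(documents)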

    2. Hybrid Search

    Combine semantic and keyword search for better results:

    python
    from langchain.retrievers import EnsembleRetriever
    from langchain.retrievers import BM25Retriever
    
    def create_hybrid_retriever(documents, vector_store):
        # Keyword-based retriever
        bm25_retriever = BM25Retriever.from_documents(documents)
        bm25_retriever.k = 3
        
        # Semantic retriever
        semantic_retriever = vector_store.as_retriever(
            search_kwargs={"k": 3}
        )
        
        # Combine both
        ensemble_retriever = EnsembleRetriever(
            retrievers=[bm25_retriever, semantic_retriever],
            weights=[0.3, 0.7]
        )
        
        return ensemble_retriever
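
    The ensemble retriever drops straight into the same RetrievalQA setup from earlier; a sketch, assuming the llm, documents, and vector_store objects from the previous snippets (BM25Retriever also requires the rank_bm25 package):

    python
    hybrid_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=create_hybrid_retriever(documents, vector_store),
        return_source_documents=True
    )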

    Common Pitfalls and Solutions

    1. Context Window Limits

    Problem: Retrieved documents exceed LLM context window.

    Solution: Implement intelligent truncation:

    python
    def fit_to_context_window(chunks, max_tokens=3000):
        selected_chunks = []
        current_tokens = 0
        
        for chunk in chunks:
            chunk_tokens = len(chunk.page_content.split())
            if current_tokens + chunk_tokens <= max_tokens:
                selected_chunks.append(chunk)
                current_tokens += chunk_tokens
            else:
                break
        
        return selected_chunks
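
    The snippet above approximates token counts by splitting on whitespace, which undercounts for most tokenizers. If you need an exact budget, tiktoken gives real token counts; here's a small sketch assuming the cl100k_base encoding:

    python
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text):
        # Exact token count under the chosen encoding
        return len(encoding.encode(text))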

    2. Hallucinations

    Problem: LLM generates information not in the context.

    Solution: Add explicit instructions and citations:

    python
    qa_prompt_template = """
    Use ONLY the following context to answer the question.
    If the answer is not in the context, say "I don't have enough information."
    
    Context: {context}
    
    Question: {question}
    
    Answer with citations [1], [2], etc:
    """

    Conclusion

    Building production RAG systems requires careful attention to:

  • Document processing and chunking strategies
  • Retrieval optimization and hybrid search
  • Prompt engineering and hallucination prevention
  • Performance optimization and caching
  • Monitoring and continuous improvement

    At Nordic Oculus, we've successfully deployed RAG systems processing millions of documents daily. The key is starting simple and iterating based on real-world performance data.

    Want to learn more about implementing RAG for your organization? Contact us for a consultation.

    About the Author

    The Nordic Oculus team brings together skills in modern software development, cloud architecture, and emerging technologies.

    Learn more about our team