Chunking a PDF in LangChain.js for a RAG pipeline involves three main steps: loading the document, splitting the text into manageable chunks, and then embedding and storing the chunks. [1, 2]
Step 1: Install Necessary Packages [3]
npm install langchain @langchain/community @langchain/textsplitters pdf-parse
The PDFLoader used in the next step depends on the pdf-parse package, so it is installed alongside the LangChain packages.
Step 2: Load the PDF Document
Use the PDFLoader from @langchain/community to read the content of your PDF file. By default, this loader extracts the text and returns one Document per page. [2, 7]
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// Define the file path to your PDF document
const filePath = "path/to/your/document.pdf";
// Create a PDF loader
const loader = new PDFLoader(filePath);
// Load the documents
const rawDocs = await loader.load();
console.log(`Loaded ${rawDocs.length} pages/documents`);
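If you prefer to chunk the PDF as one continuous text rather than page by page, the loader accepts a splitPages option in its constructor config. A minimal sketch:
// Load the entire PDF as a single Document instead of one per page
const singleDocLoader = new PDFLoader(filePath, { splitPages: false });
const singleDocs = await singleDocLoader.load(); // array containing one Document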
Step 3: Split the Documents into Chunks
The per-page documents may still be too large to fit into an LLM's context window. Use a text splitter, such as the RecursiveCharacterTextSplitter, to break the text into smaller, contextually relevant chunks. This splitter tries a hierarchy of separators in order (paragraph breaks, then line breaks, then spaces, then individual characters) to keep semantically related text together. [2, 8, 9, 10]
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Initialize the text splitter with a specified chunk size and overlap
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Maximum size of each chunk in characters
  chunkOverlap: 200, // Characters shared between adjacent chunks to preserve context
});

// Split the loaded documents into smaller chunks
const splitDocs = await textSplitter.splitDocuments(rawDocs);
console.log(`Split into ${splitDocs.length} chunks`);
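If the default hierarchy does not match your documents, the separators can be overridden. A short sketch; the values shown here are the library's defaults:
// Override the separator hierarchy (the values below are the defaults)
const customSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ["\n\n", "\n", " ", ""], // paragraphs, then lines, then words, then characters
});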
Step 4: Embed and Store the Chunks
The resulting splitDocs are an array of Document objects, each holding a manageable chunk of text (pageContent) together with its metadata, such as the source file and page number. These documents are ready to be converted into embeddings and stored in a vector database for use in a RAG pipeline. [2, 11, 12]
// Example of accessing a chunk's content and metadata
console.log("Example chunk content:", splitDocs[0].pageContent);
console.log("Example chunk metadata:", splitDocs[0].metadata);
// These chunks can now be embedded and stored in a vector store
// (e.g., Chroma, Pinecone, etc.) as the next step in your RAG pipeline.
/*
// Requires the @langchain/openai package and an OPENAI_API_KEY environment variable
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  new OpenAIEmbeddings()
);
*/
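Once the chunks are stored, retrieval is a similarity search over the embeddings. A minimal sketch, assuming the vectorStore from the commented example above and a hypothetical user question:
// Retrieve the chunks most similar to a user question
const question = "What does the document say about pricing?"; // hypothetical query
const relevantChunks = await vectorStore.similaritySearch(question, 4); // top 4 matches
relevantChunks.forEach((doc, i) =>
  console.log(`Match ${i + 1}:`, doc.pageContent.slice(0, 100))
);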
By following these steps, you can effectively process PDF files in LangChain.js and prepare the data for an efficient and accurate RAG system. Experiment with different chunkSize and chunkOverlap values to find the optimal configuration for your specific documents. [2]
[5] https://generativeai.pub/context-is-everything-how-rag-transforms-llm-response-quality-568afd20a011
[9] https://medium.com/the-advanced-school-of-ai/document-loaders-and-text-splitters-part-5-d2c6a17f209a