Thursday, December 25, 2025

How to chunk PDF files for RAG using LangChain.js?

Chunking a PDF in LangChain.js for a RAG pipeline involves three main steps: loading the document, splitting the text into manageable chunks, and then embedding and storing those chunks.


Step 1: Install Necessary Packages

You will need the core LangChain packages and a PDF parser. The PDFLoader used below relies on the pdf-parse package under the hood, so install it as well.

npm install langchain @langchain/community @langchain/textsplitters pdf-parse


Step 2: Load the PDF Document

Use the PDFLoader to read the content of your PDF file. By default, this loader extracts the text and produces one document per page.

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Define the file path to your PDF document
const filePath = "path/to/your/document.pdf";

// Create a PDF loader
const loader = new PDFLoader(filePath);

// Load the documents
const rawDocs = await loader.load();
console.log(`Loaded ${rawDocs.length} pages/documents`);
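Each loaded page is a Document with a pageContent string and a metadata object (including the source path and, for this loader, the page number under loc.pageNumber). As a rough sketch, with plain objects standing in for real Document instances, you can filter out empty pages (common with scanned or image-only PDFs) before splitting:

```javascript
// Mock documents in the shape PDFLoader returns; plain objects are
// used here so the sketch runs without a real PDF on disk.
const rawDocs = [
  { pageContent: "Chapter 1: Introduction ...", metadata: { source: "doc.pdf", loc: { pageNumber: 1 } } },
  { pageContent: "", metadata: { source: "doc.pdf", loc: { pageNumber: 2 } } },
];

// Scanned or image-only pages often yield empty text; dropping them
// keeps meaningless chunks out of the vector store.
const nonEmpty = rawDocs.filter((d) => d.pageContent.trim().length > 0);
console.log(nonEmpty.length); // 1
```

The same filter works unchanged on the real array returned by loader.load(), since those Documents expose the same pageContent and metadata fields.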


Step 3: Split the Documents into Chunks

The initial pages may still be too large to fit into an LLM's context window. Use a text splitter, such as the RecursiveCharacterTextSplitter, to break the text into smaller, contextually relevant chunks. This splitter tries separators in order (paragraphs, then sentences, then words) to keep semantically related text together.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Initialize the text splitter with a specified chunk size and overlap
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Maximum size of each chunk in characters
  chunkOverlap: 200, // Number of characters to overlap between adjacent chunks to preserve context
});

// Split the loaded documents into smaller chunks
const splitDocs = await textSplitter.splitDocuments(rawDocs);
console.log(`Split into ${splitDocs.length} chunks`);
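To build intuition for what chunkSize and chunkOverlap control, here is a deliberately naive, plain-JavaScript sliding-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers paragraph, sentence, and word boundaries), but it shows how the two parameters interact:

```javascript
// Naive fixed-size chunking with overlap. Each window advances by
// chunkSize - chunkOverlap characters, so adjacent chunks share a
// margin of text that preserves context across chunk boundaries.
function naiveChunk(text, chunkSize, chunkOverlap) {
  const chunks = [];
  const step = Math.max(1, chunkSize - chunkOverlap); // guard against overlap >= size
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = naiveChunk("a".repeat(25), 10, 3);
console.log(chunks.length);    // 4
console.log(chunks[0].length); // 10
```

A sentence that straddles a chunk boundary is fully contained in at least one of the two overlapping chunks, which is why a non-zero overlap tends to improve retrieval quality.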


Step 4: Embed and Store the Chunks


The resulting splitDocs are an array of Document objects, each holding a manageable chunk of text along with its metadata (such as the source file and page number). These documents are ready to be converted into embeddings and stored in a vector database for use in a RAG pipeline.

// Example of accessing a chunk's content and metadata
console.log("Example chunk content:", splitDocs[0].pageContent);
console.log("Example chunk metadata:", splitDocs[0].metadata);

// These chunks can now be embedded and stored in a vector store
// (e.g., Chroma, Pinecone, etc.) as the next step in your RAG pipeline.
/*
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  new OpenAIEmbeddings()
);
*/
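If you later need to cite retrieved chunks back to their source, it can help to stamp each chunk's metadata with a stable identifier before embedding. A minimal sketch, using plain objects in place of LangChain Document instances (the chunkId field is a made-up convention for this example, not part of LangChain):

```javascript
// Mock chunks in the shape produced by splitDocuments().
const splitDocs = [
  { pageContent: "First chunk", metadata: { source: "doc.pdf", loc: { pageNumber: 1 } } },
  { pageContent: "Second chunk", metadata: { source: "doc.pdf", loc: { pageNumber: 2 } } },
];

// Stamp each chunk with a stable id derived from its source and index,
// so a retrieved chunk can be traced back to where it came from.
const stamped = splitDocs.map((doc, i) => ({
  ...doc,
  metadata: { ...doc.metadata, chunkId: `${doc.metadata.source}#${i}` },
}));

console.log(stamped[1].metadata.chunkId); // "doc.pdf#1"
```

Vector stores generally persist this metadata alongside the embedding, so the id comes back with every similarity-search hit.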

By following these steps, you can effectively process PDF files in LangChain.js and prepare the data for an efficient and accurate RAG system. Experiment with different chunkSize and chunkOverlap values to find the optimal configuration for your specific documents.



