How to build a RAG pipeline with LangChain and Pinecone
Build a retrieval-augmented generation system using LangChain and Pinecone. This guide targets Python 3.10, LangChain 0.2.x, and Pinecone SDK 4.x on Ubuntu 24.04. You will create a vector index, load local PDF files, and query the system to retrieve relevant context.
Prerequisites
- Ubuntu 24.04 LTS or a compatible Linux distribution.
- Python 3.10 or higher installed and active.
- Virtual environment activated (e.g., venv or conda).
- An active Pinecone account and API key.
- At least 8 GB of free RAM for embedding models.
- A directory containing PDF documents to load into the vector store.
Step 1: Install the required Python packages
Create a virtual environment and install the necessary libraries. This includes the LangChain core, the Pinecone integration, and the embedding model client.
python3 -m venv rag-env
source rag-env/bin/activate
pip install langchain langchain-community langchain-pinecone pinecone-client sentence-transformers pypdf langchain-openai
Ensure all packages install without errors. The sentence-transformers library will download the default embedding model, which requires an internet connection.
Step 2: Create a Pinecone index and configure the client
Initialize the Pinecone client using your API key. Create a new index named my-rag-index with a dimension of 384, which matches the output size of the all-MiniLM-L6-v2 embedding model. Set the metric to cosine for semantic search.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

if "my-rag-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-rag-index",
        dimension=384,  # output size of all-MiniLM-L6-v2
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    print("Created index: my-rag-index")
else:
    print("Index my-rag-index already exists.")
Wait for the index to finish initializing. The create_index call returns quickly, but the server-side setup takes a few seconds. Check readiness by calling pc.describe_index("my-rag-index") and inspecting status["ready"] until it reports True.
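A small polling loop makes the wait explicit. This is a minimal sketch: `check_ready` is a stand-in for a call such as `pc.describe_index("my-rag-index").status["ready"]`.

```python
import time

def wait_until_ready(check_ready, timeout=60.0, interval=2.0):
    """Poll check_ready() until it returns True or the timeout expires.

    Returns True if the resource became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_ready():
            return True
        time.sleep(interval)
    return False

# With the real client, you would call:
# wait_until_ready(lambda: pc.describe_index("my-rag-index").status["ready"])
```

The timeout prevents the script from hanging indefinitely if index creation fails silently.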
Step 3: Load documents and create embeddings
Use LangChain's DirectoryLoader to read PDF files from a local folder. Split the text into chunks of 500 characters with an overlap of 50 to preserve context. Create an embedding instance using the all-MiniLM-L6-v2 model, which runs locally without an API key.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
# Load documents from a local directory
loader = DirectoryLoader("./data", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = text_splitter.split_documents(documents)
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Create the Pinecone vector store (reads the PINECONE_API_KEY environment variable)
vectorstore = PineconeVectorStore(
index_name="my-rag-index",
embedding=embeddings
)
# Upsert the documents into the vector store
vectorstore.add_documents(documents=splits)
print(f"Added {len(splits)} chunks to the vector store.")
The add_documents method sends the text chunks to Pinecone. Each chunk is converted into a 384-dimensional vector. The call returns the list of IDs assigned to the inserted records.
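For intuition about chunk_size and chunk_overlap, the splitting can be sketched with a simplified fixed-size splitter. This is an illustration only: the real RecursiveCharacterTextSplitter also prefers to break on separators such as paragraphs and sentences rather than at exact character offsets.

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_text("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))     # 3 chunks for 1200 characters
print(len(chunks[0]))  # 500
```

The overlap means the end of one chunk repeats at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk.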
Step 4: Create a retriever and an LLM chain
Create a retriever object that queries the vector store. Configure the top k value to 3 to retrieve the most relevant chunks for each query. Combine the retriever with an LLM to form a chain that answers questions based on the retrieved context.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Initialize the LLM (requires the OPENAI_API_KEY environment variable)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Create the retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Define the prompt
prompt = ChatPromptTemplate.from_template(
"Use the following context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n\nContext: {context}\nQuestion: {question}"
)
# Create the chain
rag_chain = prompt | llm | StrOutputParser()
# Test the chain
question = "What is the main topic of the first document?"
result = rag_chain.invoke({"context": retriever.invoke(question), "question": question})
print(result)
Here the retriever fetches the top 3 chunks, the prompt template formats them together with the question, and the output parser extracts the text from the LLM response.
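Note that interpolating the raw Document list into the prompt stringifies each object, metadata and all. Joining just the page_content fields usually gives the LLM cleaner context. A minimal sketch of that helper (Doc here is a stand-in for LangChain's Document class):

```python
class Doc:
    """Stand-in for langchain_core.documents.Document."""
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

def format_docs(docs):
    """Join retrieved chunks into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

docs = [Doc("First chunk."), Doc("Second chunk.")]
print(format_docs(docs))
```

With LangChain installed, the helper slots directly into the chain, e.g. `{"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()`.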
Verify the installation
Run a simple query to confirm the system works. The code below loads a new document, adds it to the store, and queries it immediately.
from langchain_community.document_loaders import TextLoader
# Load a small test text file
loader = TextLoader("./data/test.txt")
test_docs = loader.load()
test_splits = text_splitter.split_documents(test_docs)
vectorstore.add_documents(test_splits)
# Query the system
query = "Summarize the test file."
retrieved_docs = retriever.invoke(query)
print(f"Retrieved {len(retrieved_docs)} documents.")
print(f"Question: {query}")
print(f"Answer: {rag_chain.invoke({'context': retrieved_docs, 'question': query})}")
Expected output shows Retrieved 3 documents. (or fewer, if the store holds fewer chunks) followed by a concise summary of the test file. If the output is empty, check that the embedding model downloaded correctly.
Troubleshooting
- Index creation fails: Verify your Pinecone API key has write permissions. Check that your network allows outbound HTTPS connections to api.pinecone.io.
- Embedding errors: Ensure sentence-transformers is installed. If the model fails to download, run pip install --upgrade sentence-transformers and retry.
- Empty retrieval results: Check that the text_splitter chunks contain actual text. If your PDFs are scanned images, run OCR on them before splitting.
- High latency: Reduce the chunk_size, or scale up your Pinecone index if you are hitting rate limits.
- Memory issues: Close other applications to free up RAM. The embedding model requires roughly 2 GB of memory to run locally.
Adjust the chunk_size parameter if your documents are very large. A smaller chunk size improves retrieval precision but increases the number of vectors stored. Monitor your Pinecone dashboard for usage metrics and adjust your index configuration if you exceed the free tier limits.
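This tradeoff is easy to quantify: with an overlapping splitter, each chunk advances chunk_size - chunk_overlap characters, so the vector count grows roughly as below. A back-of-the-envelope sketch (the page-length figure is an assumption for illustration):

```python
import math

def approx_chunk_count(total_chars, chunk_size=500, chunk_overlap=50):
    """Approximate number of chunks a fixed-size overlapping splitter produces."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((total_chars - chunk_overlap) / step))

# A 100-page PDF at roughly 2,000 characters per page:
total = 100 * 2000
print(approx_chunk_count(total, chunk_size=500, chunk_overlap=50))    # 445 vectors
print(approx_chunk_count(total, chunk_size=1000, chunk_overlap=50))   # 211 vectors
```

Doubling chunk_size roughly halves storage and upsert cost, at the price of coarser retrieval granularity.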