How to build a RAG pipeline with LangChain and Pinecone
Build a retrieval-augmented generation system using LangChain and Pinecone. This guide targets Python 3.10, LangChain 0.2.x, and Pinecone SDK 4.x on Ubuntu 24.04. You will create a vector index, load local PDF files, and query the system to retrieve relevant context.
Prerequisites
- Ubuntu 24.04 LTS or a compatible Linux distribution.
- Python 3.10 or higher installed and active.
- Virtual environment activated (e.g., venv or conda).
- An active Pinecone account and API key.
- At least 8 GB of free RAM for embedding models.
- A directory containing PDF documents to load into the vector store.
Step 1: Install the required Python packages
Create a virtual environment and install the necessary libraries. This includes the LangChain core, the Pinecone integration, and the embedding model client.
python3 -m venv rag-env
source rag-env/bin/activate
pip install langchain langchain-community langchain-pinecone pinecone-client sentence-transformers pypdf langchain-openai
Ensure all packages install without errors. The sentence-transformers library will download the default embedding model, which requires an internet connection.
Step 2: Create a Pinecone index and configure the client
Initialize the Pinecone client using your API key. Create a new index named my-rag-index with a dimension of 384, which matches the output size of the all-MiniLM-L6-v2 embedding model. Set the metric to cosine for semantic search.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

if "my-rag-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-rag-index",
        dimension=384,  # output size of all-MiniLM-L6-v2
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    print("Created index: my-rag-index")
else:
    print("Index my-rag-index already exists.")
Wait for the index to finish initializing. The create_index call returns quickly, but the server-side setup takes a few seconds. Check readiness by calling pc.describe_index("my-rag-index") and inspecting status["ready"] until it reports True.
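A small polling loop makes the wait explicit. This is a minimal sketch: `check_ready` is a stand-in for a call such as `pc.describe_index("my-rag-index").status["ready"]`.

```python
import time

def wait_until_ready(check_ready, timeout=60.0, interval=2.0):
    """Poll check_ready() until it returns True or the timeout expires.

    Returns True if the resource became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_ready():
            return True
        time.sleep(interval)
    return False

# With the real client, you would call:
# wait_until_ready(lambda: pc.describe_index("my-rag-index").status["ready"])
```

The timeout prevents the script from hanging indefinitely if index creation fails silently.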
Step 3: Load documents and create embeddings
Use LangChain's DirectoryLoader to read PDF files from a local folder. Split the text into chunks of 500 characters with an overlap of 50 to preserve context. Create an embedding instance using the all-MiniLM-L6-v2 model, which runs locally without an API key.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
# Load documents from a local directory
loader = DirectoryLoader("./data", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = text_splitter.split_documents(documents)
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Create the Pinecone vector store (reads the PINECONE_API_KEY environment variable)
vectorstore = PineconeVectorStore(
index_name="my-rag-index",
embedding=embeddings
)
# Upsert the documents into the vector store
vectorstore.add_documents(documents=splits)
print(f"Added {len(splits)} chunks to the vector store.")
The add_documents method sends the text chunks to Pinecone. Each chunk is converted into a 384-dimensional vector. The call returns the list of IDs assigned to the inserted records.
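For intuition about chunk_size and chunk_overlap, the splitting can be sketched with a simplified fixed-size splitter. This is an illustration only: the real RecursiveCharacterTextSplitter also prefers to break on separators such as paragraphs and sentences rather than at exact character offsets.

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_text("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))     # 3 chunks for 1200 characters
print(len(chunks[0]))  # 500
```

The overlap means the end of one chunk repeats at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk.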
Step 4: Create a retriever and an LLM chain
Create a retriever object that queries the vector store. Configure the top k value to 3 to retrieve the most relevant chunks for each query. Combine the retriever with an LLM to form a chain that answers questions based on the retrieved context.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Initialize the LLM (requires the OPENAI_API_KEY environment variable)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Create the retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Define the prompt
prompt = ChatPromptTemplate.from_template(
"Use the following context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n\nContext: {context}\nQuestion: {question}"
)
# Create the chain
rag_chain = prompt | llm | StrOutputParser()
# Test the chain
question = "What is the main topic of the first document?"
result = rag_chain.invoke({"context": retriever.invoke(question), "question": question})
print(result)
Here the retriever fetches the top 3 chunks, the prompt template formats them together with the question, and the output parser extracts the text from the LLM response.
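Note that interpolating the raw Document list into the prompt stringifies each object, metadata and all. Joining just the page_content fields usually gives the LLM cleaner context. A minimal sketch of that helper (Doc here is a stand-in for LangChain's Document class):

```python
class Doc:
    """Stand-in for langchain_core.documents.Document."""
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

def format_docs(docs):
    """Join retrieved chunks into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

docs = [Doc("First chunk."), Doc("Second chunk.")]
print(format_docs(docs))
```

With LangChain installed, the helper slots directly into the chain, e.g. `{"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()`.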
Verify the installation
Run a simple query to confirm the system works. The code below loads a new document, adds it to the store, and queries it immediately.
from langchain_community.document_loaders import TextLoader
# Load a small test text file
loader = TextLoader("./data/test.txt")
test_docs = loader.load()
test_splits = text_splitter.split_documents(test_docs)
vectorstore.add_documents(test_splits)
# Query the system
query = "Summarize the test file."
retrieved_docs = retriever.invoke(query)
print(f"Retrieved {len(retrieved_docs)} documents.")
print(f"Question: {query}")
print(f"Answer: {rag_chain.invoke({'context': retrieved_docs, 'question': query})}")
Expected output shows Retrieved 3 documents. (or fewer, if the store holds fewer chunks) followed by a concise summary of the test file. If the output is empty, check that the embedding model downloaded correctly.
Troubleshooting
- Index creation fails: Verify your Pinecone API key has write permissions. Check that your network allows outbound HTTPS connections to api.pinecone.io.
- Embedding errors: Ensure sentence-transformers is installed. If the model fails to download, run pip install --upgrade sentence-transformers and retry.
- Empty retrieval results: Check that the text_splitter chunks contain actual text. If your PDFs are scanned images, run OCR on them before splitting.
- High latency: Reduce the chunk_size, or scale up your Pinecone index if you are hitting rate limits.
- Memory issues: Close other applications to free up RAM. The embedding model requires roughly 2 GB of memory to run locally.
Adjust the chunk_size parameter if your documents are very large. A smaller chunk size improves retrieval precision but increases the number of vectors stored. Monitor your Pinecone dashboard for usage metrics and adjust your index configuration if you exceed the free tier limits.
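This tradeoff is easy to quantify: with an overlapping splitter, each chunk advances chunk_size - chunk_overlap characters, so the vector count grows roughly as below. A back-of-the-envelope sketch (the page-length figure is an assumption for illustration):

```python
import math

def approx_chunk_count(total_chars, chunk_size=500, chunk_overlap=50):
    """Approximate number of chunks a fixed-size overlapping splitter produces."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((total_chars - chunk_overlap) / step))

# A 100-page PDF at roughly 2,000 characters per page:
total = 100 * 2000
print(approx_chunk_count(total, chunk_size=500, chunk_overlap=50))    # 445 vectors
print(approx_chunk_count(total, chunk_size=1000, chunk_overlap=50))   # 211 vectors
```

Doubling chunk_size roughly halves storage and upsert cost, at the price of coarser retrieval granularity.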