Retrieval-Augmented Generation (RAG) is how you make an AI model answer questions using your own documents rather than relying only on its training data. Instead of fine-tuning the model, you find the relevant documents first and include them in the prompt.
This guide walks through building a RAG application from scratch — chunking documents, creating embeddings, storing them in a vector database, retrieving relevant chunks, and generating answers.
Quick answer
Split your documents into chunks, convert chunks to embeddings, store them in a vector database, and when a question comes in, find the most relevant chunks and include them in the prompt to the language model. The model answers based on your documents, not just its training data.
Use RAG when:
- You want an AI that answers questions from your specific documents.
- Fine-tuning is too expensive or complex for your use case.
- You need the AI to cite specific sources for its answers.
How RAG works
RAG has two phases: retrieval and generation. In retrieval, you find the document chunks most relevant to the user's question. In generation, you include those chunks in the prompt and ask the model to answer based on them.
This is simpler than it sounds. The retrieval phase is essentially semantic search — converting text to vectors and finding the closest matches. The generation phase is a standard LLM call with extra context.
Chunking your documents
Documents need to be split into chunks that are small enough to embed meaningfully but large enough to contain useful information. Typical chunk sizes are 500-1000 tokens.
The chunking strategy matters more than most guides suggest. Bad chunking — splitting mid-sentence, losing context, or creating chunks that are too small — leads to poor retrieval and worse answers.
- Split at natural boundaries: paragraphs, sections, or headings
- Include overlap between chunks (100-200 tokens) so context is not lost at boundaries
- Keep metadata with each chunk: source file, page number, section heading
- Test different chunk sizes — what works for short FAQs differs from long reports
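The bullet points above can be sketched as a simple paragraph-boundary splitter with overlap. Sizes here are measured in characters rather than tokens to keep the sketch dependency-free, and the function name is illustrative:

```python
def chunk_text(text, max_chars=2000, overlap_chars=400):
    """Split text at paragraph boundaries, carrying overlap between chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one
            # so context is not lost at the boundary.
            current = current[-overlap_chars:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

In a real pipeline you would measure in tokens (e.g. with a tokenizer matching your embedding model) and attach metadata such as source file and section heading to each chunk.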
Creating and storing embeddings
Convert each chunk to a vector embedding using an embedding model. Store the embeddings in a vector database (ChromaDB, FAISS, Pinecone, Weaviate) along with the original text and metadata.
For local RAG, use a local embedding model through Ollama or sentence-transformers. For cloud RAG, use OpenAI or Cohere embeddings.
# Simple RAG setup with ChromaDB
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add each chunk with its embedding, original text, and metadata
for i, chunk in enumerate(chunks):
    embedding = model.encode(chunk["text"]).tolist()
    collection.add(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk["text"]],
        metadatas=[{"source": chunk["source"]}],
    )
Retrieval and generation
When a question arrives, embed it with the same model, search the vector database for the closest chunks (typically 3-5), and include them in the prompt.
The prompt should instruct the model to answer based only on the provided context and to say when the context does not contain enough information to answer.
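A minimal sketch of this retrieve-then-prompt step. A toy cosine-similarity search stands in for the vector database here so the sketch is self-contained; with ChromaDB you would call `collection.query(query_embeddings=[...], n_results=4)` instead. The prompt wording is one illustrative way to phrase the instruction:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, n_results=4):
    """store: list of (embedding, text) pairs; returns the n closest texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:n_results]]

def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below. If the context "
        "does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The resulting prompt string is what you pass to the language model; the question must be embedded with the same model used for the document chunks.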
Evaluating and improving RAG quality
The most common RAG failure is retrieving the wrong chunks. Before blaming the generation model, check whether the retrieved chunks actually contain the answer.
Build a small test set: 10-20 questions where you know which document contains the answer. Check whether retrieval finds the right chunks. If not, improve chunking, try a different embedding model, or add reranking.
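The retrieval check described above can be scripted as a hit-rate measurement. The test-set shape and the `retrieve_sources` callback here are assumptions; any function that returns source labels for a question would work:

```python
def retrieval_hit_rate(test_set, retrieve_sources):
    """Fraction of questions whose known source appears in the retrieved chunks.

    test_set: list of (question, expected_source) pairs.
    retrieve_sources: callable mapping a question to a list of source labels.
    """
    hits = sum(
        1 for question, expected in test_set
        if expected in retrieve_sources(question)
    )
    return hits / len(test_set)
```

If this number is low, the generation model never had a chance; fix chunking or the embedding model before touching the prompt.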
Worked example: company knowledge base Q&A
You have 200 internal documents — policies, procedures, technical docs. You chunk them, embed them with a local model, and store them in ChromaDB. A simple web interface lets employees ask questions. The system retrieves relevant chunks, passes them to the LLM, and returns an answer with source citations.
Common mistakes
- Skipping the chunking step and embedding entire documents (too coarse for good retrieval).
- Not testing retrieval quality separately from generation quality.
- Using an embedding model that does not match your document type or language.
When to use something else
For improving RAG answers with reranking, see RAG with reranking. For better chunking strategies, see chunking documents for RAG.
How to apply this in a real AI project
A RAG application becomes much more useful once it is tied to the rest of the workflow around it. In real work, the result depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not only on following one local tip correctly.
That is why the biggest win rarely comes from one clever move in isolation. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the workbook or codebase later.
- Test with realistic inputs before shipping, not just the examples that inspired the idea.
- Keep the human review step visible so the workflow stays trustworthy as it scales.
- Measure what matters for your use case instead of relying on general benchmarks.
How to extend the workflow after this guide
Once the core technique works, the next leverage usually comes from standardising it. That might mean naming inputs more clearly, keeping one review checklist, or pairing this page with neighbouring guides so the process becomes repeatable rather than person-dependent.
The follow-on guides below are the most natural next steps from this guide; they turn one useful technique into a connected, repeatable system.
- Go next to How to Improve RAG Answers With Reranking to sharpen the retrieval side of this workflow.
- Go next to How to Chunk Documents for Better RAG Results to refine the splitting step that retrieval quality depends on.
- Go next to How to Use Local AI on Your Own Files to run the whole pipeline privately on your own machine.
Related guides on this site
These guides cover RAG improvements, local AI file processing, and private AI patterns.
- How to Improve RAG Answers With Reranking
- How to Chunk Documents for Better RAG Results
- How to Use Local AI on Your Own Files
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet automation to app development, with real projects and honest tool comparisons.
Browse AI courses