Retrieval-Augmented Generation (RAG) is how you make an AI model answer questions using your own documents rather than relying only on its training data. Instead of fine-tuning the model, you find the relevant documents first and include them in the prompt.
This guide walks through building a RAG application from scratch — chunking documents, creating embeddings, storing them in a vector database, retrieving relevant chunks, and generating answers.
Quick answer
Split your documents into chunks, convert chunks to embeddings, store them in a vector database, and when a question comes in, find the most relevant chunks and include them in the prompt to the language model. The model answers based on your documents, not just its training data.
Use RAG when:
- You want an AI that answers questions from your specific documents.
- Fine-tuning is too expensive or complex for your use case.
- You need the AI to cite specific sources for its answers.
How RAG works
RAG has two phases: retrieval and generation. In retrieval, you find the document chunks most relevant to the user's question. In generation, you include those chunks in the prompt and ask the model to answer based on them.
This is simpler than it sounds. The retrieval phase is essentially semantic search — converting text to vectors and finding the closest matches. The generation phase is a standard LLM call with extra context.
Chunking your documents
Documents need to be split into chunks that are small enough to embed meaningfully but large enough to contain useful information. Typical chunk sizes are 500-1000 tokens.
The chunking strategy matters more than most guides suggest. Bad chunking — splitting mid-sentence, losing context, or creating chunks that are too small — leads to poor retrieval and worse answers.
- Split at natural boundaries: paragraphs, sections, or headings
- Include overlap between chunks (100-200 tokens) so context is not lost at boundaries
- Keep metadata with each chunk: source file, page number, section heading
- Test different chunk sizes — what works for short FAQs differs from long reports
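The bullet points above can be sketched as a simple paragraph-boundary splitter with overlap. Sizes here are measured in characters rather than tokens to keep the sketch dependency-free, and the function name is illustrative:

```python
def chunk_text(text, max_chars=2000, overlap_chars=400):
    """Split text at paragraph boundaries, carrying overlap between chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one
            # so context is not lost at the boundary.
            current = current[-overlap_chars:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

In a real pipeline you would measure in tokens (e.g. with a tokenizer matching your embedding model) and attach metadata such as source file and section heading to each chunk.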
Creating and storing embeddings
Convert each chunk to a vector embedding using an embedding model. Store the embeddings in a vector database (ChromaDB, FAISS, Pinecone, Weaviate) along with the original text and metadata.
For local RAG, use a local embedding model through Ollama or sentence-transformers. For cloud RAG, use OpenAI or Cohere embeddings.
# Simple RAG setup with ChromaDB
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add each chunk with its embedding, original text, and metadata
for i, chunk in enumerate(chunks):
    embedding = model.encode(chunk["text"]).tolist()
    collection.add(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk["text"]],
        metadatas=[{"source": chunk["source"]}],
    )
Retrieval and generation
When a question arrives, embed it with the same model, search the vector database for the closest chunks (typically 3-5), and include them in the prompt.
The prompt should instruct the model to answer based only on the provided context and to say when the context does not contain enough information to answer.
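A minimal sketch of this retrieve-then-prompt step. A toy cosine-similarity search stands in for the vector database here so the sketch is self-contained; with ChromaDB you would call `collection.query(query_embeddings=[...], n_results=4)` instead. The prompt wording is one illustrative way to phrase the instruction:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, n_results=4):
    """store: list of (embedding, text) pairs; returns the n closest texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:n_results]]

def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below. If the context "
        "does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The resulting prompt string is what you pass to the language model; the question must be embedded with the same model used for the document chunks.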
Evaluating and improving RAG quality
The most common RAG failure is retrieving the wrong chunks. Before blaming the generation model, check whether the retrieved chunks actually contain the answer.
Build a small test set: 10-20 questions where you know which document contains the answer. Check whether retrieval finds the right chunks. If not, improve chunking, try a different embedding model, or add reranking.
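The retrieval check described above can be scripted as a hit-rate measurement. The test-set shape and the `retrieve_sources` callback here are assumptions; any function that returns source labels for a question would work:

```python
def retrieval_hit_rate(test_set, retrieve_sources):
    """Fraction of questions whose known source appears in the retrieved chunks.

    test_set: list of (question, expected_source) pairs.
    retrieve_sources: callable mapping a question to a list of source labels.
    """
    hits = sum(
        1 for question, expected in test_set
        if expected in retrieve_sources(question)
    )
    return hits / len(test_set)
```

If this number is low, the generation model never had a chance; fix chunking or the embedding model before touching the prompt.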
Worked example: company knowledge base Q&A
You have 200 internal documents — policies, procedures, technical docs. You chunk them, embed them with a local model, and store them in ChromaDB. A simple web interface lets employees ask questions. The system retrieves relevant chunks, passes them to the LLM, and returns an answer with source citations.
Common mistakes
- Skipping the chunking step and embedding entire documents (too coarse for good retrieval).
- Not testing retrieval quality separately from generation quality.
- Using an embedding model that does not match your document type or language.
When to use something else
For improving RAG answers with reranking, see RAG with reranking. For better chunking strategies, see chunking documents for RAG.
How to apply this in a real AI project
A RAG application becomes much more useful once it is tied to the rest of the workflow around it. In real work, the result depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not only on following one local tip correctly.
That is why the biggest win rarely comes from one clever move in isolation. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the workbook or codebase later.
- Test with realistic inputs before shipping, not just the examples that inspired the idea.
- Keep the human review step visible so the workflow stays trustworthy as it scales.
- Measure what matters for your use case instead of relying on general benchmarks.
How to extend the workflow after this guide
Once the core technique works, the next leverage usually comes from standardising it. That might mean naming inputs more clearly, keeping one review checklist, or pairing this page with neighbouring guides so the process becomes repeatable rather than person-dependent.
The follow-on guides below are the most natural next steps from this guide; they turn one useful technique into a connected, repeatable system.
- Go next to How to Improve RAG Answers With Reranking to sharpen the retrieval side of this workflow.
- Go next to How to Chunk Documents for Better RAG Results to refine the splitting step that retrieval quality depends on.
- Go next to How to Use Local AI on Your Own Files to run the whole pipeline privately on your own machine.
Related guides on this site
These guides cover RAG improvements, local AI file processing, and private AI patterns.
- How to Improve RAG Answers With Reranking
- How to Chunk Documents for Better RAG Results
- How to Use Local AI on Your Own Files
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet automation to app development, with real projects and honest tool comparisons.
Browse AI courses