Every AI project we start now involves the same question: should we use retrieval-augmented generation (RAG) or fine-tune a model? After running both in production across twelve client projects, we have clear opinions.
RAG: Default Choice for Knowledge Tasks
RAG won on 9 of our 12 projects. The pattern is always the same: the application needs to answer questions about a frequently changing knowledge base (documentation, product catalogue, support articles). Fine-tuning a model on this data would require re-training every time the data changes — impractical and expensive.
Our RAG stack: chunked documents in a vector store (Pinecone for hosted, pgvector for self-hosted), embedding with text-embedding-3-small, retrieval of the top-5 most relevant chunks, then a GPT-4o generation call with retrieved context injected into the system prompt.
Fine-Tuning: When You Need Style, Not Knowledge
Fine-tuning shines when you need the model to behave differently — more concise, use specific terminology, follow a particular output format, or adopt a brand voice. For one client whose support team had a specific structured output requirement (JSON with 14 specific fields), RAG alone could not reliably produce the format. Fine-tuning on 2,000 examples solved it completely.
Cost Comparison (Real Numbers)
For a RAG system handling 10,000 queries/month: approximately $80/month (embeddings + generation). For fine-tuning the same use case: $2,400 for training + $120/month for inference on a dedicated endpoint. Fine-tuning only makes economic sense at high volume or when quality differences are significant.