What is Retrieval-Augmented Generation (RAG)?
Large language models like GPT-4o are remarkably capable, but they have a fundamental limitation: their knowledge is frozen at training time. Ask a model about your internal documentation, your product catalogue, or last quarter's financial reports — and it will either hallucinate an answer or tell you it doesn't know.
Retrieval-Augmented Generation (RAG) solves this by combining a language model with a search system. Instead of relying solely on what the model learned during training, RAG retrieves the most relevant documents from your own data at query time and passes them as context to the model. The model then generates a grounded answer based on what was retrieved.
“RAG is not a workaround for a weak model — it is the correct architecture for knowledge-intensive applications. Fine-tuning encodes facts into weights; RAG retrieves them on demand. Retrieval scales; fine-tuning gets stale.”
On Azure, RAG pipelines are built on three core services: Azure Blob Storage (document store), Azure AI Search (retrieval engine with vector + keyword support), and Azure OpenAI (the generative model). Together they form a pipeline that can answer questions grounded in documents your model has never seen before.
When to use RAG vs. fine-tuning
- Use RAG when your knowledge base changes frequently: product docs, support articles, internal wikis. Updating an index is fast and incremental (only the changed documents need to be re-embedded); retraining a model takes hours and costs money.
- Use RAG when you need source citations — the model can reference the exact document chunk it drew from, making answers auditable.
- Use fine-tuning when you need to change the model's tone, format, or domain-specific reasoning style — not its factual knowledge.
- Use both together for the best results: fine-tune the model on your domain's style, then supply current facts via RAG.
Indexing Your Data with Azure AI Search
The quality of a RAG pipeline is largely determined by the quality of retrieval. If the wrong documents are retrieved, the model will produce a wrong or misleading answer regardless of how capable it is. Indexing is where you invest most of your engineering effort.
Step 1: Chunking your documents
Documents must be split into chunks before indexing. A chunk is the unit of retrieval: it is what gets passed to the model as context. Chunks that are too large dilute relevance; chunks that are too small lose context. A practical starting point is 512 tokens with roughly 10% overlap (about 50 tokens) between consecutive chunks, so that a short sentence cut at a chunk boundary still appears whole at the start of the next chunk.
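from azure.storage.blob import BlobServiceClient
from openai import AzureOpenAI
import tiktoken

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens."""
    if not 0 <= overlap < max_tokens:
        # Otherwise the window never advances and the loop below would not terminate.
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # final chunk emitted; avoids a trailing chunk made only of overlap
        start += max_tokens - overlap  # advance, keeping `overlap` tokens shared
    return chunks

Step 2: Generating embeddings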
Each chunk is converted into a vector embedding — a numerical representation of its semantic meaning — using Azure OpenAI's embedding model. Chunks with similar meaning will have vectors that are close together in vector space, enabling semantic search.
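To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, the measure the index is configured for later in this article. The three-dimensional vectors and their names are hand-written toy values chosen for illustration; real embeddings have thousands of dimensions and come from the embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
refund_policy = [0.9, 0.1, 0.0]
returns_howto = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, returns_howto))  # close to 1: similar topics
print(cosine_similarity(refund_policy, office_hours))   # close to 0: unrelated topics
```

In the pipeline itself the vectors are not written by hand; they come from the embedding model, as in the call that follows.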
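import os

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        # On Azure, `model` is the name of your deployment (assumed here to match the model name).
        # text-embedding-3-large returns 3072 dimensions; text-embedding-ada-002 returns 1536,
        # so the index's vector field must be sized to whichever model you deploy.
        model="text-embedding-3-large",
    )
    return response.data[0].embedding

Step 3: Creating the Azure AI Search index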
Azure AI Search stores both the text chunks and their vector embeddings. Define your index schema with a vector field configured for cosine similarity and the HNSW algorithm, which gives fast approximate nearest-neighbour search at scale.
{
  "name": "documents-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "source", "type": "Edm.String", "filterable": true },
    { "name": "page", "type": "Edm.Int32", "filterable": true },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 3072,
      "vectorSearchProfile": "hnsw-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      { "name": "hnsw-config", "kind": "hnsw", "hnswParameters": { "metric": "cosine" } }
    ],
    "profiles": [{ "name": "hnsw-profile", "algorithm": "hnsw-config" }]
  }
}

Upload each chunk as a document to the index: store the chunk text in the content field, the embedding in contentVector, and metadata (source filename, page number) in filterable fields. Filterable metadata lets you scope retrieval to specific documents or sections at query time.
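The upload step can be sketched as follows. This is one possible shape, not a definitive implementation: `make_document` and `upload_in_batches` are hypothetical helper names, the batch size of 100 is an arbitrary assumption, and the document key is derived from a hash because index keys only allow a restricted character set. The only service call used is `upload_documents`, which is a method of the `SearchClient` class in the `azure-search-documents` package.

```python
import hashlib
from typing import Any

def make_document(chunk: str, vector: list[float], source: str,
                  page: int, position: int) -> dict[str, Any]:
    """Build one index document using the field names from the schema above."""
    # Hash source/page/position into a hex string so the key never contains
    # characters (spaces, slashes, dots) that a document key may not allow.
    raw_key = f"{source}|{page}|{position}".encode("utf-8")
    return {
        "id": hashlib.sha256(raw_key).hexdigest(),
        "content": chunk,
        "contentVector": vector,
        "source": source,
        "page": page,
    }

def upload_in_batches(search_client: Any, documents: list[dict[str, Any]],
                      batch_size: int = 100) -> int:
    """Send documents in fixed-size batches; returns how many were sent.

    `search_client` is expected to be an azure.search.documents.SearchClient
    (or any object exposing the same upload_documents method).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sent = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        search_client.upload_documents(documents=batch)
        sent += len(batch)
    return sent
```

Batching keeps each request small, which matters because 3072-dimensional vectors make individual documents fairly large. In a real run you would also inspect the per-document results returned by `upload_documents` and retry any that failed.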
Retrieval and Prompt Assembly
With the index populated, the retrieval pipeline runs on every user query. The goal is to find the chunks most likely to contain the answer, then assemble them into a prompt for GPT-4o.
Hybrid search: vector + keyword
Pure vector search finds semantically similar content but can miss exact matches such as product codes, names, or specific identifiers. Pure keyword search misses paraphrased content. Hybrid search runs both and merges the two ranked lists with Reciprocal Rank Fusion (RRF), which typically ranks the relevant chunks higher than either method alone.
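To show what the fusion step does, here is a small sketch of Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The function name, the document ids, and the two example rankings are made up for illustration, and k = 60 is the value commonly cited for RRF, treated here as an assumption rather than a documented service setting.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top result of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the two retrieval methods.
keyword_ranking = ["doc-sku-123", "doc-returns", "doc-shipping"]
vector_ranking = ["doc-returns", "doc-refunds", "doc-sku-123"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists rise to the top, while a document found by only one method still makes the fused list. Azure AI Search performs this fusion on the server; the client only has to send the keyword text and the query vector in the same request.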
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    search_client = SearchClient(
        endpoint=os.environ["SEARCH_ENDPOINT"],
        index_name="documents-index",
        credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"]),
    )
    vector_query = VectorizedQuery(
        vector=get_embedding(query),
        k_nearest_neighbors=top_k,
        fields="contentVector",
    )
    results = search_client.search(
        search_text=query,              # keyword search
        vector_queries=[vector_query],  # vector search
        select=["content", "source", "page"],
        top=top_k,
    )
    return [{"content": r["content"], "source": r["source"], "page": r["page"]} for r in results]

Assembling the prompt
Once you have the top-k retrieved chunks, assemble them into a prompt. Pass them as context in the system message, clearly separated from the user's question. Instruct the model to answer only from the provided context and to cite its sources.
def answer(query: str) -> str:
    chunks = retrieve(query, top_k=5)
    context = "\n\n".join(
        f"[Source: {c['source']}, page {c['page']}]\n{c['content']}"
        for c in chunks
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question using "
                    "only the context below. If the answer is not in the context, "
                    "say so. Always cite the source document and page number.\n\n"
                    f"Context:\n{context}"
                )
            },
            { "role": "user", "content": query }
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

Keep temperature low (0.1–0.3) for factual RAG applications. Higher temperature increases creativity but also hallucination risk. For question-answering over documents, you want the model to stay close to the retrieved content.
Evaluation and Improving Retrieval Quality
A RAG pipeline is only as good as its weakest link. Most failures fall into two categories: retrieval failures (the right chunks were not returned) and generation failures (the right chunks were returned but the model answered incorrectly). Knowing which is happening tells you where to invest.
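One way to tell the two failure types apart is to record, for each test question, which chunk should have been retrieved and whether the final answer was judged correct. The sketch below assumes you have that information; the `EvalResult` fields and the `classify` function are hypothetical names, not part of any Azure SDK.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class EvalResult:
    question: str
    expected_chunk_id: str          # chunk known to contain the answer
    retrieved_chunk_ids: list[str]  # ids returned by the retriever, best first
    answer_correct: bool            # judged by a human or an automated grader

def classify(result: EvalResult) -> str:
    """Label one result as a retrieval failure, a generation failure, or ok."""
    if result.expected_chunk_id not in result.retrieved_chunk_ids:
        # Counted as a retrieval failure even if the answer happened to be
        # right, because the answer was not grounded in the expected chunk.
        return "retrieval_failure"
    if not result.answer_correct:
        return "generation_failure"
    return "ok"

# Made-up results for illustration.
results = [
    EvalResult("What is the refund window?", "c12", ["c12", "c3"], True),
    EvalResult("Who approves expenses?", "c40", ["c7", "c9"], False),
    EvalResult("What is the SLA?", "c21", ["c21", "c22"], False),
]
summary = Counter(classify(r) for r in results)
print(dict(summary))
```

If retrieval failures dominate, work on chunking, embeddings, and search configuration; if generation failures dominate, work on the prompt and the amount of context passed to the model.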
Chunking strategy matters more than model size
- Fixed-size chunking (512 tokens with overlap) is a safe default for prose documents.
- Semantic chunking splits on paragraph or section boundaries — better for structured documents like policies or manuals.
- Hierarchical chunking stores both a summary chunk and its constituent detail chunks. Retrieve by summary, pass the detail for generation.
- Smaller chunks improve retrieval precision but reduce the context available to the model. Tune chunk size against your specific document structure.
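As a rough illustration of the second strategy in the list above, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed token offset. It is a simplified assumption-laden version: paragraphs are taken to be separated by blank lines, and token counts are approximated by counting words so the example runs without a tokenizer. The names `chunk_by_paragraph` and `approx_token_count` are hypothetical; pass a real tokenizer-based counter via `count_tokens` in practice.

```python
import re
from typing import Callable

def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def chunk_by_paragraph(text: str, max_tokens: int = 512,
                       count_tokens: Callable[[str], int] = approx_token_count) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_tokens each."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for para in paragraphs:
        size = count_tokens(para)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        # Note: a single paragraph larger than max_tokens becomes its own
        # oversized chunk here; a fuller version would split it further.
        current.append(para)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "Refunds are issued within 14 days.\n\nContact support first.\n\nShipping costs are not refunded."
print(chunk_by_paragraph(sample, max_tokens=10))
```

Because chunk boundaries always fall between paragraphs, no sentence is cut in half, at the cost of chunks that vary in size.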
Semantic ranker in Azure AI Search
Azure AI Search includes a built-in semantic ranker that re-scores the top results from hybrid search using a language model trained for relevance. Once a semantic configuration is defined on the index (the query below assumes one named "default"), enabling it is a small change to the search call, and it often improves ranking for longer or ambiguous queries.
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    query_caption="extractive",
    query_answer="extractive",
    top=top_k
)

Measuring pipeline quality
Build an evaluation set: 20–50 question/answer pairs grounded in your documents. Measure retrieval recall (did the correct chunk appear in the top-5?), answer correctness, and faithfulness (did the model answer from the context or hallucinate?). Run this evaluation every time you change chunking, embedding model, or index configuration.
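The retrieval recall metric described above is simple to compute once each question in the evaluation set is paired with the id of the chunk that answers it. The sketch below is a minimal version under that assumption; `recall_at_k` is a hypothetical helper and the chunk ids are invented for the example.

```python
def recall_at_k(expected_ids: list[str], retrieved_ids: list[list[str]], k: int = 5) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if len(expected_ids) != len(retrieved_ids):
        raise ValueError("need one retrieved list per expected id")
    if not expected_ids:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for expected, retrieved in zip(expected_ids, retrieved_ids)
        if expected in retrieved[:k]
    )
    return hits / len(expected_ids)

# One expected chunk id per evaluation question (invented ids).
expected = ["c1", "c7", "c9"]

# Retrieved ids per question under two index configurations being compared.
baseline = [["c1", "c2"], ["c3", "c4"], ["c9", "c1"]]
candidate = [["c5", "c1"], ["c7", "c3"], ["c9", "c2"]]

print(f"baseline recall@5:  {recall_at_k(expected, baseline):.2f}")
print(f"candidate recall@5: {recall_at_k(expected, candidate):.2f}")
```

Running the same set against the old and new configuration turns a chunking or embedding change into a measurable difference rather than an impression.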
Azure AI Studio's evaluation features support automated RAG evaluation using GPT-4 as a judge: it can score groundedness, relevance, and coherence at scale, which reduces manual review to spot-checking a sample of the scores.
Closing Thoughts
RAG is the right default architecture for most enterprise AI applications. It is faster to build than fine-tuning, cheaper to maintain, and produces auditable answers with source citations. Azure gives you all the building blocks: Blob Storage, AI Search with hybrid retrieval and semantic ranking, and GPT-4o for generation.
Start with a small document set, a fixed chunking strategy, and hybrid search. Get a working pipeline first, then invest in evaluation and quality improvements. The teams that ship a basic RAG pipeline in week one learn far more than those who spend month one perfecting the chunking strategy in isolation.



