
Building RAG Pipelines with Azure AI Search and GPT-4o

Author

Artan Ajredini

CEO & Cloud Architect

17 March 2025

What is Retrieval-Augmented Generation (RAG)?

Large language models like GPT-4o are remarkably capable, but they have a fundamental limitation: their knowledge is frozen at training time. Ask a model about your internal documentation, your product catalogue, or last quarter's financial reports — and it will either hallucinate an answer or tell you it doesn't know.

Retrieval-Augmented Generation (RAG) solves this by combining a language model with a search system. Instead of relying solely on what the model learned during training, RAG retrieves the most relevant documents from your own data at query time and passes them as context to the model. The model then generates a grounded answer based on what was retrieved.

RAG is not a workaround for a weak model — it is the correct architecture for knowledge-intensive applications. Fine-tuning encodes facts into weights; RAG retrieves them on demand. Retrieval scales; fine-tuning gets stale.

On Azure, RAG pipelines are built on three core services: Azure Blob Storage (document store), Azure AI Search (retrieval engine with vector + keyword support), and Azure OpenAI (the generative model). Together they form a pipeline that can answer questions grounded in documents your model has never seen before.

When to use RAG vs. fine-tuning

  • Use RAG when your knowledge base changes frequently (product docs, support articles, internal wikis). Re-indexing a changed document takes seconds to minutes; retraining a model takes hours and costs money.
  • Use RAG when you need source citations — the model can reference the exact document chunk it drew from, making answers auditable.
  • Use fine-tuning when you need to change the model's tone, format, or domain-specific reasoning style — not its factual knowledge.
  • Use both together for the best results: fine-tune the model on your domain's style, then supply current facts via RAG.

Indexing Your Data with Azure AI Search

The quality of a RAG pipeline is largely determined by the quality of retrieval. If the wrong documents are retrieved, the model will produce a wrong or misleading answer regardless of how capable it is. Indexing is where you invest most of your engineering effort.

Step 1: Chunking your documents

Documents must be split into chunks before indexing. A chunk is the unit of retrieval: it is what gets passed to the model as context. Chunks that are too large dilute relevance; chunks that are too small lose context. A practical starting point is 512 tokens with roughly 10% overlap (about 50 tokens) between consecutive chunks, so that most sentences cut at a chunk boundary still appear whole in one of the two neighbouring chunks.

python
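import tiktoken
from openai import AzureOpenAI  # used in Step 2

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_tokens tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the OpenAI embedding models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # final token reached; another window would only repeat the tail
        start += max_tokens - overlap  # advance by the stride
    return chunks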
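
To make the window arithmetic concrete, here is a minimal sketch that computes only the token offsets, with no tokenizer involved. `window_spans` is a hypothetical helper introduced for illustration, not part of the pipeline above; it assumes the window should stop as soon as the final token has been covered.

```python
def window_spans(n_tokens: int, max_tokens: int = 512, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for a sliding window with overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # last token covered; a further window would only repeat the tail
        start += max_tokens - overlap  # stride: each window re-reads `overlap` tokens
    return spans
```

With the defaults, a 975-token document yields three spans: (0, 512), (462, 974) and (924, 975). Each window starts 462 tokens after the previous one, so consecutive chunks share 50 tokens.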

Step 2: Generating embeddings

Each chunk is converted into a vector embedding — a numerical representation of its semantic meaning — using Azure OpenAI's embedding model. Chunks with similar meaning will have vectors that are close together in vector space, enabling semantic search.

python
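import os

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        # On Azure, `model` is the name of your deployment, not the base model.
        # text-embedding-3-large returns 3072 dimensions; text-embedding-ada-002
        # returns 1536, so the index's vector field must be sized to match.
        model="text-embedding-3-large",
    )
    return response.data[0].embedding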
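
"Close together in vector space" is usually measured with cosine similarity. The sketch below shows the calculation in plain Python so the idea can be checked on toy vectors; `cosine_similarity` is a hypothetical helper for illustration, and in the pipeline itself the search service performs this comparison for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Two identical vectors score 1.0 and two orthogonal ones score 0.0. Applied to real embeddings, for example `cosine_similarity(get_embedding(a), get_embedding(b))`, paraphrases of the same idea should score higher than unrelated sentences.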

Step 3: Creating the Azure AI Search index

Azure AI Search stores both the text chunks and their vector embeddings. Define your index schema with a vector field configured for cosine similarity and the HNSW algorithm, which gives fast approximate nearest-neighbour search at scale.

json
{
  "name": "documents-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "source", "type": "Edm.String", "filterable": true },
    { "name": "page", "type": "Edm.Int32", "filterable": true },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 3072,
      "vectorSearchProfile": "hnsw-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      { "name": "hnsw-config", "kind": "hnsw", "hnswParameters": { "metric": "cosine" } }
    ],
    "profiles": [{ "name": "hnsw-profile", "algorithm": "hnsw-config" }]
  }
}

Upload each chunk as a document to the index: store the chunk text in the content field, the embedding in contentVector, and metadata (source filename, page number) in filterable fields. Filterable metadata lets you scope retrieval to specific documents or sections at query time.
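
The upload step could look roughly like the sketch below. The helper names (`make_documents`, `upload_in_batches`, `source_filter`) are hypothetical, the batch size of 500 is an assumption rather than a recommendation, and for simplicity all chunks passed in one call are assumed to come from the same page. The only SDK call used is `SearchClient.upload_documents`.

```python
import hashlib
from typing import Any


def make_documents(source: str, page: int, chunks: list[str],
                   vectors: list[list[float]]) -> list[dict[str, Any]]:
    """Pair each chunk with its embedding and metadata, matching the index schema."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    docs = []
    for i, (text, vector) in enumerate(zip(chunks, vectors)):
        # Document keys may only contain letters, digits, '_', '-' and '=',
        # so hash source/page/position instead of using the filename directly.
        key = hashlib.sha1(f"{source}|{page}|{i}".encode("utf-8")).hexdigest()
        docs.append({"id": key, "content": text, "source": source,
                     "page": page, "contentVector": vector})
    return docs


def upload_in_batches(search_client: Any, docs: list[dict[str, Any]],
                      batch_size: int = 500) -> int:
    """Send documents to the index in batches; returns the number sent."""
    sent = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        search_client.upload_documents(documents=batch)
        sent += len(batch)
    return sent


def source_filter(source: str) -> str:
    """Build an OData filter on `source`; single quotes are escaped by doubling."""
    return "source eq '" + source.replace("'", "''") + "'"
```

At query time the filter string can be passed to the search call, for example `search_client.search(search_text=query, filter=source_filter("handbook.pdf"))`, where `handbook.pdf` is a placeholder filename.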

Retrieval and Prompt Assembly

With the index populated, the retrieval pipeline runs on every user query. The goal is to find the chunks most likely to contain the answer, then assemble them into a prompt for GPT-4o.

Hybrid search: vector + keyword

Pure vector search finds semantically similar content but can miss exact matches such as product codes, names, or specific identifiers. Pure keyword search misses paraphrased content. Hybrid search runs both and merges the two ranked lists with Azure AI Search's Reciprocal Rank Fusion (RRF), which typically ranks results better than either method alone.

python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    search_client = SearchClient(
        endpoint=os.environ["SEARCH_ENDPOINT"],
        index_name="documents-index",
        credential=AzureKeyCredential(os.environ["SEARCH_API_KEY"])
    )
    vector_query = VectorizedQuery(
        vector=get_embedding(query),
        k_nearest_neighbors=top_k,
        fields="contentVector"
    )
    results = search_client.search(
        search_text=query,          # keyword search
        vector_queries=[vector_query],  # vector search
        select=["content", "source", "page"],
        top=top_k
    )
    return [{"content": r["content"], "source": r["source"], "page": r["page"]} for r in results]
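
The service performs the fusion step internally, but the idea behind Reciprocal Rank Fusion is simple enough to sketch in a few lines. The function below is an illustration of the general technique, not Azure's implementation; the constant `k = 60` is the value commonly used for RRF and is an assumption here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    and the per-list scores are summed.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Given a keyword ranking `["a", "b", "c"]` and a vector ranking `["c", "a", "d"]`, the fused order is a, c, b, d: documents that appear in both lists rise above documents that only one method found.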

Assembling the prompt

Once you have the top-k retrieved chunks, assemble them into a prompt. Pass them as context in the system message, clearly separated from the user's question. Instruct the model to answer only from the provided context and to cite its sources.

python
def answer(query: str) -> str:
    chunks = retrieve(query, top_k=5)
    context = "\n\n".join(
        f"[Source: {c['source']}, page {c['page']}]\n{c['content']}"
        for c in chunks
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # on Azure, the name of your chat model deployment
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question using "
                    "only the context below. If the answer is not in the context, "
                    "say so. Always cite the source document and page number.\n\n"
                    f"Context:\n{context}"
                )
            },
            { "role": "user", "content": query }
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

Keep temperature low (0.1–0.3) for factual RAG applications. Higher temperature increases creativity but also hallucination risk. For question-answering over documents, you want the model to stay close to the retrieved content.

Evaluation and Improving Retrieval Quality

A RAG pipeline is only as good as its weakest link. Most failures fall into two categories: retrieval failures (the right chunks were not returned) and generation failures (the right chunks were returned but the model answered incorrectly). Knowing which is happening tells you where to invest.

Chunking strategy matters more than model size

  • Fixed-size chunking (512 tokens with overlap) is a safe default for prose documents.
  • Semantic chunking splits on paragraph or section boundaries — better for structured documents like policies or manuals.
  • Hierarchical chunking stores both a summary chunk and its constituent detail chunks. Retrieve by summary, pass the detail for generation.
  • Smaller chunks improve retrieval precision but reduce the context available to the model. Tune chunk size against your specific document structure.
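
As a rough illustration of semantic chunking, the sketch below packs whole paragraphs into chunks up to a token budget instead of cutting at a fixed offset. Both function names are hypothetical, and `count_tokens` is a deliberately crude stand-in (whitespace-separated words) that you would replace with a real tokenizer.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace words); swap in a real tokenizer."""
    return len(text.split())


def chunk_by_paragraph(text: str, max_tokens: int = 512) -> list[str]:
    """Group consecutive paragraphs into chunks without splitting any paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        # A single paragraph larger than the budget becomes its own oversized chunk.
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The trade-off against fixed-size chunking is that chunk lengths vary, so an unusually long paragraph may need a fallback such as the fixed-size splitter.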

Semantic ranker in Azure AI Search

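Azure AI Search includes a built-in semantic ranker that re-scores the top results from hybrid search using a language model trained for relevance. It requires a semantic configuration on the index (named "default" in the query below) and the semantic ranker enabled on the search service; after that, switching it on takes only a few extra query parameters. It typically improves answer quality noticeably, especially for longer or ambiguous queries.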

python
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    query_caption="extractive",
    query_answer="extractive",
    top=top_k
)
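
One way to use the re-ranked results is to drop chunks the ranker considers weak matches before they reach the prompt. The sketch below assumes each result exposes the score under the `"@search.reranker_score"` key, as the Python SDK does for semantic queries; `keep_confident` is a hypothetical helper and the threshold of 2.0 (on the ranker's 0 to 4 scale) is an arbitrary starting point to tune.

```python
from typing import Any, Iterable, Mapping


def keep_confident(results: Iterable[Mapping[str, Any]],
                   min_score: float = 2.0) -> list[dict[str, Any]]:
    """Keep only results whose semantic reranker score meets the threshold."""
    kept = []
    for r in results:
        score = r.get("@search.reranker_score")
        # Results without a reranker score (non-semantic queries) are skipped.
        if score is not None and score >= min_score:
            kept.append({"content": r["content"], "source": r["source"],
                         "page": r["page"], "reranker_score": score})
    return kept
```

If the filter removes every chunk, that is a useful signal in itself: the pipeline can answer that nothing relevant was found instead of asking the model to work from weak context.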

Measuring pipeline quality

Build an evaluation set: 20–50 question/answer pairs grounded in your documents. Measure retrieval recall (did the correct chunk appear in the top-5?), answer correctness, and faithfulness (did the model answer from the context or hallucinate?). Run this evaluation every time you change chunking, embedding model, or index configuration.
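
A minimal sketch of the retrieval-recall measurement, together with a triage helper that separates retrieval failures from generation failures. The evaluation-set keys (`question`, `expected_chunk_id`) and the `retrieve_ids` callable are assumptions for illustration: `retrieve_ids` stands for any function that takes a question and `k` and returns the ids of the retrieved chunks, which means the search call would also need to select the `id` field.

```python
from typing import Callable


def recall_at_k(eval_set: list[dict], retrieve_ids: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = 0
    for item in eval_set:
        retrieved = retrieve_ids(item["question"], k)
        if item["expected_chunk_id"] in retrieved[:k]:
            hits += 1
    return hits / len(eval_set)


def classify_failure(expected_chunk_retrieved: bool, answer_correct: bool) -> str:
    """Label one evaluation result as 'ok', 'retrieval' or 'generation'."""
    if answer_correct:
        return "ok"
    # Wrong answer with the right chunk in context points at the prompt or model;
    # wrong answer without it points at chunking, embeddings or the index.
    return "generation" if expected_chunk_retrieved else "retrieval"
```

Tracking the two failure labels separately across runs shows whether a change to chunking or index configuration actually moved retrieval, or whether the remaining errors sit on the generation side.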

The evaluation features in Azure AI Foundry (formerly Azure AI Studio) support automated RAG evaluation using GPT-4 as a judge: it can score groundedness, relevance, and coherence at scale without manual review.

Closing Thoughts

RAG is the right default architecture for most enterprise AI applications. It is faster to build than fine-tuning, cheaper to maintain, and produces auditable answers with source citations. Azure gives you all the building blocks: Blob Storage, AI Search with hybrid retrieval and semantic ranking, and GPT-4o for generation.

Start with a small document set, a fixed chunking strategy, and hybrid search. Get a working pipeline first, then invest in evaluation and quality improvements. The teams that ship a basic RAG pipeline in week one learn far more than those who spend month one perfecting the chunking strategy in isolation.
