Getting Started with Azure OpenAI Service

Author

Artan Ajredini

CEO & Cloud Architect

28 April 2025

Introduction to Azure OpenAI Service

Azure OpenAI Service brings OpenAI's large language models (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, and text embedding models) directly into your Azure environment. You get the same model families as OpenAI's API, with the enterprise controls that production workloads demand: regional data residency, private networking, Microsoft Entra ID (formerly Azure Active Directory) authentication, content filtering, and compliance certifications.

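For organisations already running workloads on Azure, this is the natural starting point for AI features. With regional deployment types, prompts and completions are processed in the Azure region you choose (global deployment types may process requests in other regions, so check the deployment type against your residency requirements). You manage access through the same identity platform you already use, and the service integrates with Azure Monitor, Key Vault, Private Endpoints, and your existing CI/CD pipelines.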

Azure OpenAI is not just OpenAI with a different URL. It is OpenAI's models wrapped in Azure's enterprise security, compliance, and networking model — the difference that matters when you are handling customer data in production.

Available models

  • GPT-4o — the most capable multimodal model. Accepts text and images as input. Best for complex reasoning, document analysis, and high-quality generation.
  • GPT-4 Turbo — large context window (128k tokens). Best for long documents, summarisation, and multi-turn conversations with extensive history.
  • GPT-3.5 Turbo — fast and cost-efficient. Best for simpler tasks, high-volume applications, and latency-sensitive use cases.
  • text-embedding-3-large / text-embedding-ada-002 — converts text into vector embeddings for semantic search and RAG pipelines.
  • DALL-E 3 — generates images from text prompts. Available in select regions.
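
To make the embedding entry above concrete, here is a minimal sketch of how vectors are typically compared in semantic search. The three-dimensional vectors and document ids are toy, hypothetical values standing in for real embeddings (which have far more dimensions), and cosine similarity is one common choice of metric rather than the only one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], doc_vecs: dict[str, list[float]]) -> list[str]:
    """Return document ids ordered from most to least similar to the query vector."""
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors (hypothetical); real ones would come from an embedding deployment.
docs = {
    "blob-storage": [0.9, 0.1, 0.0],
    "key-vault": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

print(rank_by_similarity(query, docs))  # → ['blob-storage', 'key-vault']
```

In a RAG pipeline, a vector index normally does this ranking for you; the sketch only shows what "semantic" closeness means numerically.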

Setting Up Azure OpenAI

Before you can make API calls, you need to provision an Azure OpenAI resource and deploy a model. The resource is the billing and access container; the deployment is the specific model instance your application will call.
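
The resource/deployment split shows up directly in the REST URL: the resource's custom subdomain is the hostname and the deployment name is part of the path. The helper below is an illustrative sketch, not part of any SDK; `chat_completions_url` is a hypothetical name, and the example values are the ones used elsewhere in this article.

```python
from urllib.parse import quote, urlencode


def chat_completions_url(custom_subdomain: str, deployment: str, api_version: str) -> str:
    """Build the chat completions URL for one deployment on one Azure OpenAI resource."""
    base = f"https://{custom_subdomain}.openai.azure.com"                         # the resource
    path = f"/openai/deployments/{quote(deployment, safe='')}/chat/completions"  # the deployment
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


print(chat_completions_url("mycompany-openai", "gpt-4o", "2024-10-21"))
# → https://mycompany-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21
```

The SDKs assemble this URL for you; seeing it spelled out helps explain why the `model` argument in SDK calls is your deployment name rather than the underlying model name.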

Step 1: Request access and provision the resource

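Azure OpenAI has historically required an approved subscription, with approval typically taking 1–2 business days after you submit a request through the Azure portal. Access requirements have been relaxed over time, so check the current access documentation for your subscription before applying. Once you have access, create the resource via the portal, Azure CLI, or Bicep.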

bicep
@description('Environment suffix, for example dev or prod')
param environment string

resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'openai-${environment}'
  location: 'swedencentral'   // choose a region with your required model availability
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    publicNetworkAccess: 'Disabled'   // use Private Endpoint for production
    customSubDomainName: 'mycompany-openai'   // must be globally unique; becomes the endpoint hostname
  }
}

Step 2: Deploy a model

Model deployments are separate from the resource. Each deployment has a name (which you reference in API calls), a model version, and a tokens-per-minute (TPM) capacity limit. Deploy through Azure AI Foundry (formerly Azure AI Studio) or via Bicep.

bicep
resource gpt4oDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAIAccount
  name: 'gpt-4o'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4o'
      version: '2024-08-06'
    }
  }
  sku: {
    name: 'Standard'
    capacity: 30   // 30K tokens per minute
  }
}
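
To pick a capacity value, it helps to estimate worst-case token throughput first. The sketch below assumes every request can use its full `max_tokens` budget, and the traffic figures are hypothetical; treat the result as a starting point to validate against real usage metrics, not an official sizing formula.

```python
import math


def required_tpm(requests_per_minute: int, avg_prompt_tokens: int, max_tokens: int) -> int:
    """Worst-case tokens per minute if every request uses its full max_tokens budget."""
    return requests_per_minute * (avg_prompt_tokens + max_tokens)


def capacity_units(tpm: int) -> int:
    """Deployment capacity is expressed in units of 1,000 TPM, rounded up."""
    return math.ceil(tpm / 1000)


# Hypothetical traffic: 40 requests/minute, ~400 prompt tokens, max_tokens=300.
tpm = required_tpm(requests_per_minute=40, avg_prompt_tokens=400, max_tokens=300)
print(tpm, capacity_units(tpm))  # → 28000 28
```

With these assumed numbers, the `capacity: 30` deployment above would leave a small amount of headroom.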

Step 3: Store credentials in Key Vault

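Never hardcode an API key in your application code or commit it to source control. The preferred option is keyless authentication: the application's managed identity authenticates to Azure OpenAI through Microsoft Entra ID, so there is no key to store at all. If a component can only use an API key, keep the key in Azure Key Vault and have the application read it at runtime with its managed identity. The endpoint URL is not a secret, but keeping it in configuration rather than code makes it easier to switch environments.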

bicep
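// Illustrative parameters: the vault name and the object ID of the app's managed identity
param keyVaultName string
param appIdentityPrincipalId string

// Reference an existing vault (it must use Azure RBAC for data-plane access)
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

// Grant the app's managed identity access to read Key Vault secrets
resource kvSecretAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: keyVault
  name: guid(keyVault.id, appIdentityPrincipalId, 'Key Vault Secrets User')
  properties: {
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6'  // Key Vault Secrets User
    )
    principalId: appIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}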
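
On the application side, reading a secret with the managed identity could look like the sketch below. It assumes the `azure-identity` and `azure-keyvault-secrets` packages are installed; the vault and secret names in the usage note are hypothetical, and `vault_url` assumes the public-cloud `vault.azure.net` suffix.

```python
def vault_url(vault_name: str) -> str:
    """Key Vault URL for a vault name (public Azure cloud suffix assumed)."""
    return f"https://{vault_name}.vault.azure.net"


def read_secret(vault_name: str, secret_name: str) -> str:
    """Fetch one secret value using the ambient Azure identity."""
    # Imported lazily so this module still loads where the Azure SDK is not installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url(vault_name), credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value
```

A call such as `read_secret("kv-myapp", "openai-api-key")` would then return the stored value. `DefaultAzureCredential` uses the managed identity when running in Azure and falls back to developer credentials (for example, an Azure CLI login) on a local machine.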

Making Your First API Call

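Azure OpenAI client libraries are available for Python, .NET, JavaScript, and Java. In Python, the official openai package includes an AzureOpenAI client. Compared with calling OpenAI directly, you supply your resource endpoint and an API version, and you pass your deployment name where the model name would normally go. For the keyless examples in this section, the calling identity needs the Cognitive Services OpenAI User role on the Azure OpenAI resource.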

Python

python
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Use managed identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

response = client.chat.completions.create(
    model="gpt-4o",        # your deployment name
    messages=[
        { "role": "system", "content": "You are a helpful Azure cloud assistant." },
        { "role": "user",   "content": "Explain Azure Blob Storage in two sentences." }
    ],
    temperature=0.3,
    max_tokens=300
)

print(response.choices[0].message.content)

.NET / C#

csharp
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;   // SystemChatMessage, UserChatMessage

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!);

// Use managed identity — no API key required
var client = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
var chatClient = client.GetChatClient("gpt-4o");  // deployment name

var response = await chatClient.CompleteChatAsync(
    new SystemChatMessage("You are a helpful Azure cloud assistant."),
    new UserChatMessage("Explain Azure Blob Storage in two sentences.")
);

Console.WriteLine(response.Value.Content[0].Text);

Streaming responses

For user-facing applications, stream the response token by token rather than waiting for the full completion. This dramatically improves perceived responsiveness — users see text appearing immediately instead of waiting several seconds for a complete response.

python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{ "role": "user", "content": "Write a short summary of Zero Trust security." }],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
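
In practice you usually want the full text as well, for logging or for appending to the conversation history. The following sketch wraps the loop above in a helper that both emits and accumulates the text. `collect_stream` and `print_inline` are hypothetical names, and the fake chunks only mimic the attribute shape (`choices[0].delta.content`) used by the streaming loop so the example runs without a live deployment.

```python
from types import SimpleNamespace


def print_inline(text: str) -> None:
    """Write a text fragment immediately, without a trailing newline."""
    print(text, end="", flush=True)


def collect_stream(stream, on_text=print_inline) -> str:
    """Emit each text fragment as it arrives and return the concatenated result."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks carry metadata only and have no choices
            continue
        text = chunk.choices[0].delta.content
        if text:               # content can be None, e.g. on the final chunk
            on_text(text)
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streaming chunk (test stand-in only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [SimpleNamespace(choices=[]), fake_chunk("Zero "), fake_chunk(None), fake_chunk("Trust")]
full_text = collect_stream(fake_stream, on_text=lambda _t: None)
print(full_text)  # → Zero Trust
```

With a real client, you would pass the object returned by `client.chat.completions.create(..., stream=True)` as `stream`.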

Production Tips

Getting a prototype working is straightforward. Getting it reliable, cost-efficient, and safe in production requires a few more considerations.

Understand token limits and costs

Every API call consumes tokens — both the input (prompt + context) and the output (completion). Token usage directly drives cost and determines whether you hit rate limits. GPT-4o costs significantly more per token than GPT-3.5 Turbo — profile your use case before choosing a model.

  • Use tiktoken (Python) or the Azure OpenAI tokenizer to estimate prompt size before sending requests.
  • Set max_tokens on every request — without it, the model may generate a very long (and expensive) response.
  • Cache responses for identical or near-identical prompts using Azure Cache for Redis.
  • Use GPT-3.5 Turbo for classification, extraction, and simple Q&A — reserve GPT-4o for tasks that genuinely need it.
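
The caching tip can be sketched as follows. This is an exact-match cache only: the in-memory class is a stand-in for Azure Cache for Redis, and `cache_key`, `InMemoryCache`, and `cached_completion` are hypothetical names. Catching near-identical prompts would need extra normalisation or a semantic cache, which this sketch does not attempt.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic key: the same model, messages, and temperature always hash the same."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "chat:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryCache:
    """Stand-in for a Redis client: get/set only, no expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cached_completion(cache, model, messages, temperature, call_model):
    """Return a cached answer if present; otherwise call the model and store the result."""
    key = cache_key(model, messages, temperature)
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = call_model(model, messages, temperature)
    cache.set(key, answer)
    return answer
```

Here `call_model` is whatever function wraps your actual `client.chat.completions.create` call. Caching suits low-temperature requests where a repeated answer is acceptable; with a real Redis instance you would also set an expiry on each entry.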

Write effective system prompts

The system prompt defines the model's persona, constraints, and output format. A well-written system prompt is one of the most effective ways to improve consistency and keep the model within its intended scope, although it does not eliminate hallucinations on its own.

python
system_prompt = """
You are a customer support assistant for NativeCloud, an Azure consulting company.

Rules:
- Answer only questions related to Azure and cloud infrastructure.
- If a question is outside your scope, say: "I can only help with Azure and cloud topics."
- Always be concise — maximum 3 sentences unless the user asks for detail.
- Never make up product names, prices, or features. If unsure, say so.
- Format lists using bullet points.
"""
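
A system prompt only takes effect if it is sent as the first message on every request. One possible way to assemble the message list is sketched below; `build_messages` and the `max_history` cut-off are illustrative assumptions rather than anything the SDK prescribes.

```python
def build_messages(
    system_prompt: str,
    history: list[dict],
    user_input: str,
    max_history: int = 6,
) -> list[dict]:
    """Put the system prompt first, keep only the most recent turns, then add the new question."""
    # history[-0:] would return the whole list, so handle max_history == 0 explicitly.
    recent = history[-max_history:] if max_history > 0 else []
    return [
        {"role": "system", "content": system_prompt.strip()},
        *recent,
        {"role": "user", "content": user_input},
    ]


history = [
    {"role": "user", "content": "What is AKS?"},
    {"role": "assistant", "content": "Azure Kubernetes Service is a managed Kubernetes offering."},
]
messages = build_messages("You are a helpful Azure cloud assistant.", history, "How is it billed?")
print([m["role"] for m in messages])  # → ['system', 'user', 'assistant', 'user']
```

Trimming old turns keeps the prompt within the token budget while the system prompt is always retained.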

Handle rate limits and errors gracefully

Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per deployment. In production, implement exponential backoff with jitter when you receive a 429 (rate limit) response. Use multiple deployments or regions as fallback for high-availability applications.

python
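import random
import time

from openai import RateLimitError

def call_with_retry(messages, max_retries=5):
    """Call the chat deployment, retrying 429 responses with exponential backoff and jitter."""
    # `client` is the AzureOpenAI client created earlier. Note that the client
    # also retries some failures itself (its max_retries setting defaults to 2).
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",  # your deployment name
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, ... plus up to 1s of random jitter so clients do not retry in lockstep
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)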
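
The fallback idea mentioned above (multiple deployments or regions) can be sketched independently of the SDK. In this example `call_with_fallback` is a hypothetical helper, the target names are made up, and the stub functions stand in for calls against real clients so the block runs on its own.

```python
def call_with_fallback(targets, messages, retriable=(Exception,)):
    """Try each (name, call) pair in order and return (name, result) from the first that succeeds."""
    last_error = None
    for name, call in targets:
        try:
            return name, call(messages)
        except retriable as exc:
            last_error = exc  # remember the failure and move on to the next target
    raise RuntimeError("all deployments failed") from last_error


def primary(_messages):
    """Stub for a deployment that is currently throttled or unreachable."""
    raise TimeoutError("primary region unavailable")


def secondary(_messages):
    """Stub for a healthy deployment in another region."""
    return "ok"


result = call_with_fallback(
    [("swedencentral", primary), ("westeurope", secondary)],
    [{"role": "user", "content": "ping"}],
)
print(result)  # → ('westeurope', 'ok')
```

In real code each `call` would wrap a client bound to a different endpoint, and `retriable` would be narrowed to errors worth failing over on, for example `openai.RateLimitError` and `openai.APIConnectionError`.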

Content filtering

Azure OpenAI has built-in content filters that screen both input and output across four harm categories: hate, violence, sexual content, and self-harm. In Azure AI Foundry (formerly Azure AI Studio), you can configure custom filter thresholds and enable Prompt Shields to detect jailbreak and indirect prompt injection attacks, which is especially important for customer-facing applications.
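
Your application should also handle the case where a filter triggers. When output is filtered, the choice's `finish_reason` is `"content_filter"`; the sketch below checks for that and substitutes a fallback message. `extract_answer` and the fallback wording are hypothetical, and the fake responses only mimic the attribute shape of a chat completion so the example runs offline.

```python
from types import SimpleNamespace

FILTERED_MESSAGE = "Sorry, I can't help with that request."  # hypothetical wording


def extract_answer(response) -> str:
    """Return the model's text, or a safe fallback if the output was content-filtered."""
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        return FILTERED_MESSAGE
    return choice.message.content or ""


def fake_response(content, finish_reason):
    """Build an object shaped like a chat completion (test stand-in only)."""
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message, finish_reason=finish_reason)])


print(extract_answer(fake_response("Blob Storage stores unstructured data.", "stop")))
print(extract_answer(fake_response(None, "content_filter")))
```

A filtered prompt, by contrast, is rejected before any completion is generated; with the Python client that typically surfaces as an `openai.BadRequestError`, which is worth catching separately.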

Closing Thoughts

Azure OpenAI Service narrows the gap between AI capability and enterprise requirements. You get GPT-4o and other OpenAI models with private networking, managed identity authentication, regional data residency, and the compliance certifications your organisation likely already requires.

Start by provisioning a resource, deploying a model (GPT-4o in the examples above, or GPT-3.5 Turbo for low-cost experiments), and making your first API call in Python or .NET. Then add managed identity and Key Vault for credential management, streaming for better UX, and a well-crafted system prompt. Those foundations will carry you from prototype to production.
