Introduction to Azure OpenAI Service
Azure OpenAI Service brings OpenAI's large language models (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, and text embedding models) directly into your Azure environment. You get the same models that OpenAI's API offers, with the enterprise controls that production workloads demand: regional data residency, private networking, Microsoft Entra ID authentication (formerly Azure Active Directory), content filtering, and compliance certifications.
For organisations already running workloads on Azure, this is the natural starting point for AI features. With a regional deployment, prompts and completions are processed in the Azure region you choose; you manage access through the same identity platform you already use; and the service integrates with Azure Monitor, Key Vault, Private Endpoints, and your existing CI/CD pipelines.
“Azure OpenAI is not just OpenAI with a different URL. It is OpenAI's models wrapped in Azure's enterprise security, compliance, and networking model — the difference that matters when you are handling customer data in production.”
Available models
- GPT-4o: the most capable multimodal model. Accepts text and images as input. Best for complex reasoning, document analysis, and high-quality generation.
- GPT-4 Turbo: large context window (128k tokens). Best for long documents, summarisation, and multi-turn conversations with extensive history.
- GPT-3.5 Turbo: fast and cost-efficient. Best for simpler tasks, high-volume applications, and latency-sensitive use cases.
- text-embedding-3-large / text-embedding-ada-002: convert text into vector embeddings for semantic search and RAG pipelines.
- DALL-E 3: generates images from text prompts. Available in select regions.
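To make the embedding use case concrete, here is a minimal sketch of the ranking step in semantic search. The three-dimensional vectors are invented toy values standing in for real embeddings (which have far more dimensions), and `cosine_similarity` and `rank_documents` are illustrative helper names, not part of any SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values); in practice these come from an embedding model.
docs = {
    "blob-storage": [0.9, 0.1, 0.0],
    "key-vault": [0.1, 0.9, 0.1],
    "virtual-network": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = rank_documents(query, docs)
print([doc_id for doc_id, _ in ranking])
```

In a RAG pipeline, the top-ranked documents would then be placed into the prompt as context for a chat model.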
Setting Up Azure OpenAI
Before you can make API calls, you need to provision an Azure OpenAI resource and deploy a model. The resource is the billing and access container; the deployment is the specific model instance your application will call.
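The resource/deployment split shows up directly in the request URL. The sketch below assumes the public-cloud endpoint format `https://<resource>.openai.azure.com` and the chat completions path of the REST API; `chat_completions_url` is a hypothetical helper, and the resource name, deployment name, and API version are the example values used elsewhere in this guide.

```python
from urllib.parse import urlencode


def chat_completions_url(resource_name, deployment_name, api_version):
    """Build the chat completions URL for one deployment of one resource."""
    # The resource (custom subdomain) identifies the billing and access container.
    base = f"https://{resource_name}.openai.azure.com"
    # The deployment name selects the model instance inside that resource.
    path = f"/openai/deployments/{deployment_name}/chat/completions"
    query = urlencode({"api-version": api_version})
    return f"{base}{path}?{query}"


url = chat_completions_url("mycompany-openai", "gpt-4o", "2024-10-21")
print(url)
```

The SDKs assemble this URL for you; seeing it spelled out helps when debugging 404 responses, which often mean the deployment name in the request does not match a deployment on the resource.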
Step 1: Request access and provision the resource
Azure OpenAI requires an approved subscription. Submit a request through the Azure portal — approval typically takes 1–2 business days. Once approved, create the resource via the portal, Azure CLI, or Bicep.
resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'openai-${environment}'
  location: 'swedencentral' // choose a region with your required model availability
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    publicNetworkAccess: 'Disabled' // use Private Endpoint for production
    customSubDomainName: 'mycompany-openai'
  }
}
Step 2: Deploy a model
Model deployments are separate from the resource. Each deployment has a name (which you reference in API calls), a model version, and a tokens-per-minute (TPM) capacity limit. Deploy through Azure AI Studio or via Bicep.
resource gpt4oDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAIAccount
  name: 'gpt-4o'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4o'
      version: '2024-08-06'
    }
  }
  sku: {
    name: 'Standard'
    capacity: 30 // 30K tokens per minute
  }
}
Step 3: Store credentials in Key Vault
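Never hardcode the API key in your application code. If you use key-based authentication, store the key in Azure Key Vault and retrieve it at runtime using a managed identity, so that no secret lives in environment variables or source control. The endpoint URL is not a secret, but keep it in configuration rather than in code so you can switch environments without a rebuild.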
// Grant the app's managed identity access to read Key Vault secrets.
// Assumes `keyVault` (a Key Vault resource) and `appIdentityPrincipalId`
// (a string parameter) are declared elsewhere in the template.
resource kvSecretAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: keyVault
  name: guid(keyVault.id, appIdentityPrincipalId, 'Key Vault Secrets User')
  properties: {
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6' // Key Vault Secrets User
    )
    principalId: appIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
Making Your First API Call
Client libraries for Azure OpenAI are available for Python, .NET, JavaScript, and Java. The API is compatible with the OpenAI SDK: you point the client at your Azure endpoint, specify an API version, and pass your deployment name where the OpenAI API expects a model name.
Python
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Use managed identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[
        {"role": "system", "content": "You are a helpful Azure cloud assistant."},
        {"role": "user", "content": "Explain Azure Blob Storage in two sentences."}
    ],
    temperature=0.3,
    max_tokens=300
)

print(response.choices[0].message.content)
.NET / C#
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat; // SystemChatMessage and UserChatMessage live in this namespace

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!);

// Use managed identity: no API key required
var client = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
var chatClient = client.GetChatClient("gpt-4o"); // deployment name

var response = await chatClient.CompleteChatAsync(
    new SystemChatMessage("You are a helpful Azure cloud assistant."),
    new UserChatMessage("Explain Azure Blob Storage in two sentences.")
);

Console.WriteLine(response.Value.Content[0].Text);
Streaming responses
For user-facing applications, stream the response token by token rather than waiting for the full completion. This dramatically improves perceived responsiveness — users see text appearing immediately instead of waiting several seconds for a complete response.
# Reuses the `client` created in the Python example above
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short summary of Zero Trust security."}],
    stream=True
)

for chunk in stream:
    # Some chunks carry no choices or no text, so check before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Production Tips
Getting a prototype working is straightforward. Getting it reliable, cost-efficient, and safe in production requires a few more considerations.
Understand token limits and costs
Every API call consumes tokens — both the input (prompt + context) and the output (completion). Token usage directly drives cost and determines whether you hit rate limits. GPT-4o costs significantly more per token than GPT-3.5 Turbo — profile your use case before choosing a model.
- Use tiktoken (Python) or the Azure OpenAI tokenizer to estimate prompt size before sending requests.
- Set max_tokens on every request — without it, the model may generate a very long (and expensive) response.
- Cache responses for identical or near-identical prompts using Azure Cache for Redis.
- Use GPT-3.5 Turbo for classification, extraction, and simple Q&A — reserve GPT-4o for tasks that genuinely need it.
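The caching tip above can be sketched as follows. This is a minimal illustration, not a production design: `PromptCache` is a hypothetical class, the in-memory dictionary stands in for a shared store such as Azure Cache for Redis, and `fake_model` replaces the real API call so the example runs offline. Treating prompts that differ only in whitespace or letter case as identical is an assumption that will not suit every application.

```python
import hashlib
import json


class PromptCache:
    """In-memory stand-in for a shared cache such as Azure Cache for Redis."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(deployment, messages, temperature, max_tokens):
        # Collapse whitespace and lowercase so near-identical prompts share a key.
        normalised = [
            {"role": m["role"], "content": " ".join(m["content"].split()).lower()}
            for m in messages
        ]
        payload = json.dumps(
            {
                "deployment": deployment,
                "messages": normalised,
                "temperature": temperature,
                "max_tokens": max_tokens,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, deployment, messages, call_model,
                    temperature=0.0, max_tokens=300):
        key = self.make_key(deployment, messages, temperature, max_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_model(deployment, messages, temperature, max_tokens)
        self._store[key] = result
        return result


calls = []


def fake_model(deployment, messages, temperature, max_tokens):
    """Placeholder for the real chat completion call."""
    calls.append(messages)
    return f"answer #{len(calls)}"


cache = PromptCache()
first = cache.get_or_call(
    "gpt-4o", [{"role": "user", "content": "What is Blob Storage?"}], fake_model
)
second = cache.get_or_call(
    "gpt-4o", [{"role": "user", "content": "  what is   blob storage? "}], fake_model
)
print(first, second, cache.hits, cache.misses)
```

Including the deployment name and sampling parameters in the key keeps answers from different models or settings apart. Caching makes most sense at low temperature, where repeated calls would return similar text anyway.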
Write effective system prompts
The system prompt defines the model's persona, constraints, and output format. A well-written system prompt is one of the most effective ways to improve consistency and reduce hallucinations.
system_prompt = """
You are a customer support assistant for NativeCloud, an Azure consulting company.
Rules:
- Answer only questions related to Azure and cloud infrastructure.
- If a question is outside your scope, say: "I can only help with Azure and cloud topics."
- Always be concise — maximum 3 sentences unless the user asks for detail.
- Never make up product names, prices, or features. If unsure, say so.
- Format lists using bullet points.
"""
Handle rate limits and errors gracefully
Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per deployment. In production, implement exponential backoff with jitter when you receive a 429 (rate limit) response. Use multiple deployments or regions as fallback for high-availability applications.
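import random
import time

from openai import RateLimitError


def call_with_retry(messages, max_retries=5):
    # Reuses the `client` created in the first Python example
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
Content filtering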
Azure OpenAI has built-in content filters that block harmful input and output across categories: hate speech, violence, sexual content, and self-harm. In Azure AI Studio, configure custom filter thresholds and enable prompt shields to protect against jailbreak and indirect prompt injection attacks — especially important for customer-facing applications.
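Your application also needs to react sensibly when a filter triggers. The sketch below assumes the response shapes described in the REST API documentation: a blocked prompt comes back as HTTP 400 with the error code `content_filter`, and a filtered completion arrives with `finish_reason` set to `content_filter`. `classify_result` is a hypothetical helper, and the dictionaries are hand-written stand-ins, not captured service output.

```python
def classify_result(status_code, body):
    """Map a chat completions response to a coarse outcome label."""
    error_code = (body.get("error") or {}).get("code")
    if status_code == 400 and error_code == "content_filter":
        return "prompt_blocked"  # the input itself was rejected
    if status_code != 200:
        return "error"  # any other failure (rate limit, auth, ...)
    choices = body.get("choices") or []
    if not choices:
        return "error"
    finish_reason = choices[0].get("finish_reason")
    if finish_reason == "content_filter":
        return "completion_filtered"  # output was cut off by the filter
    if finish_reason == "length":
        return "truncated"  # max_tokens reached before the model finished
    return "ok"


# Hand-written example payloads (assumed shapes, trimmed to the relevant fields).
blocked = {"error": {"code": "content_filter", "message": "Prompt was filtered."}}
filtered = {"choices": [{"finish_reason": "content_filter", "message": {"content": ""}}]}
normal = {"choices": [{"finish_reason": "stop", "message": {"content": "Hello"}}]}

for status, body in [(400, blocked), (200, filtered), (200, normal)]:
    print(classify_result(status, body))
```

A customer-facing application would typically show a neutral fallback message for the two filtered outcomes and log the event for review, rather than surfacing the raw error.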
Closing Thoughts
Azure OpenAI Service removes the gap between AI capability and enterprise requirements. You get GPT-4o and the full OpenAI model family with private networking, managed identity authentication, regional data residency, and the compliance certifications your organisation likely already requires.
Start by provisioning a resource, deploying GPT-3.5 Turbo, and making your first API call in Python or .NET. Then add Key Vault for credential management, streaming for better UX, and a well-crafted system prompt. Those foundations will carry you from prototype to production.



