Getting Started with Azure OpenAI Service

Introduction to Azure OpenAI Service

Azure OpenAI Service brings OpenAI's large language models (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, and text embedding models) directly into your Azure environment. You get the same models as OpenAI's API, with the enterprise controls that production workloads demand: regional data residency, private networking, Microsoft Entra ID (formerly Azure Active Directory) authentication, content filtering, and compliance certifications.

For organisations already running workloads on Azure, this is the natural starting point for AI features. With a regional deployment, prompts and completions are processed in the Azure region you choose; you manage access through the same identity platform you already use; and the service integrates with Azure Monitor, Key Vault, Private Endpoints, and your existing CI/CD pipelines.

“Azure OpenAI is not just OpenAI with a different URL. It is OpenAI's models wrapped in Azure's enterprise security, compliance, and networking model — the difference that matters when you are handling customer data in production.”

Available models

GPT-4o: the most capable model in this list, and multimodal. Accepts text and images as input. Best for complex reasoning, document analysis, and high-quality generation.
GPT-4 Turbo: large context window (128k tokens). Best for long documents, summarisation, and multi-turn conversations with extensive history.
GPT-3.5 Turbo: fast and cost-efficient. Best for simpler tasks, high-volume applications, and latency-sensitive use cases.
text-embedding-3-large / text-embedding-ada-002: convert text into vector embeddings for semantic search and RAG pipelines.
DALL-E 3: generates images from text prompts. Available in select regions.

Setting Up Azure OpenAI

Before you can make API calls, you need to provision an Azure OpenAI resource and deploy a model. The resource is the billing and access container; the deployment is the specific model instance your application will call.

Step 1: Request access and provision the resource

Azure OpenAI has required an approved subscription, and some models may still be gated. If your subscription needs approval, submit a request through the Azure portal; approval typically takes 1–2 business days. Once you have access, create the resource via the portal, Azure CLI, or Bicep.

bicep

param environmentName string   // for example 'dev' or 'prod'

resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'openai-${environmentName}'
  location: 'swedencentral'   // choose a region with your required model availability
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    publicNetworkAccess: 'Disabled'   // use Private Endpoint for production
    customSubDomainName: 'mycompany-openai'
  }
}

Step 2: Deploy a model

Model deployments are separate from the resource. Each deployment has a name (which you reference in API calls), a model and version, and a tokens-per-minute (TPM) capacity limit. Deploy through Azure AI Foundry (formerly Azure AI Studio) or via Bicep.

bicep

resource gpt4oDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAIAccount
  name: 'gpt-4o'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4o'
      version: '2024-08-06'
    }
  }
  sku: {
    name: 'Standard'
    capacity: 30   // 30K tokens per minute
  }
}
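
The capacity value in the deployment above is expressed in units of 1,000 tokens per minute, so it helps to derive it from expected traffic rather than guess. The following is a rough sizing sketch; the function name and the headroom factor are illustrative assumptions, not part of any Azure SDK or an official sizing rule.

```python
import math

def required_capacity_units(requests_per_minute, avg_tokens_per_request, headroom=1.2):
    """Estimate the deployment `capacity` value (units of 1,000 TPM).

    avg_tokens_per_request should count prompt plus completion tokens.
    headroom is an illustrative safety margin, not an Azure recommendation.
    """
    tokens_per_minute = requests_per_minute * avg_tokens_per_request * headroom
    return math.ceil(tokens_per_minute / 1000)

# 20 requests per minute at roughly 1,200 tokens each, with 20% headroom
print(required_capacity_units(20, 1200))  # prints 29
```

A result close to your quota for the region is a signal to spread load across more than one deployment.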

Step 3: Store credentials in Key Vault

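Never hardcode API keys in your application code. The most robust option is to avoid keys altogether and authenticate with a managed identity, as the examples later in this article do. If a component must use an API key, store it in Azure Key Vault and retrieve it at runtime using a managed identity, so the secret never sits in environment variables or source control. The endpoint URL is not a secret and can live in ordinary application settings.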

bicep

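param appIdentityPrincipalId string   // object ID of the app's managed identity
param keyVaultName string

// Reference an existing vault (it must use Azure RBAC for authorization)
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

// Grant the app's managed identity access to read Key Vault secrets
resource kvSecretAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {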
  scope: keyVault
  name: guid(keyVault.id, appIdentityPrincipalId, 'Key Vault Secrets User')
  properties: {
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6'  // Key Vault Secrets User
    )
    principalId: appIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
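
Once the role assignment is in place, the application can read the secret at runtime. Below is a minimal sketch, assuming the azure-identity and azure-keyvault-secrets packages are installed when build_secret_client is called. The helper names, the secret name, and the vault URL placeholder are hypothetical.

```python
def build_secret_client(vault_url):
    """Create a Key Vault client that authenticates with the ambient identity."""
    # Imported lazily so read_secret can be exercised without the Azure packages.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    return SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())


def read_secret(secret_client, secret_name):
    """Return the current value of a Key Vault secret."""
    return secret_client.get_secret(secret_name).value


# Example wiring (hypothetical vault and secret names):
# client = build_secret_client("https://<your-vault>.vault.azure.net")
# api_key = read_secret(client, "azure-openai-api-key")
```

Reading the secret once at startup and keeping it in memory avoids a Key Vault round trip on every request.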

Making Your First API Call

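Client libraries are available for Python, .NET, JavaScript, and Java. The API is compatible with the OpenAI SDK: in Python, the openai package ships an AzureOpenAI client, so you only need to set the Azure endpoint and API version and pass your deployment name where the OpenAI API expects a model name.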

Python

python

import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Use managed identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

response = client.chat.completions.create(
    model="gpt-4o",        # your deployment name
    messages=[
        { "role": "system", "content": "You are a helpful Azure cloud assistant." },
        { "role": "user",   "content": "Explain Azure Blob Storage in two sentences." }
    ],
    temperature=0.3,
    max_tokens=300
)

print(response.choices[0].message.content)

.NET / C#

csharp

using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;   // SystemChatMessage, UserChatMessage

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!);

// Use managed identity — no API key required
var client = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
var chatClient = client.GetChatClient("gpt-4o");  // deployment name

var response = await chatClient.CompleteChatAsync(
    new SystemChatMessage("You are a helpful Azure cloud assistant."),
    new UserChatMessage("Explain Azure Blob Storage in two sentences.")
);

Console.WriteLine(response.Value.Content[0].Text);

Streaming responses

For user-facing applications, stream the response token by token rather than waiting for the full completion. This dramatically improves perceived responsiveness — users see text appearing immediately instead of waiting several seconds for a complete response.

python

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{ "role": "user", "content": "Write a short summary of Zero Trust security." }],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
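
In a real application you usually want both the live tokens and the complete text, for logging or for storing the conversation. A small sketch of that pattern follows; collect_stream and fake_chunk are hypothetical helpers, and the fake chunks only imitate the attributes the loop above reads.

```python
from types import SimpleNamespace

def collect_stream(stream, on_token=None):
    """Consume a chat completion stream and return the full text.

    on_token, if given, is called with each text fragment as it arrives.
    """
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_token:
                on_token(text)
    return "".join(parts)

def fake_chunk(text):
    # Stand-in for an SDK chunk object, used here so the sketch runs offline.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo = [SimpleNamespace(choices=[]), fake_chunk("Zero "), fake_chunk(None), fake_chunk("Trust")]
print(collect_stream(demo))  # prints: Zero Trust
```

With the real client, pass the stream object returned by the create call and an on_token callback that writes to your UI.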

Production Tips

Getting a prototype working is straightforward. Getting it reliable, cost-efficient, and safe in production requires a few more considerations.

Understand token limits and costs

Every API call consumes tokens — both the input (prompt + context) and the output (completion). Token usage directly drives cost and determines whether you hit rate limits. GPT-4o costs significantly more per token than GPT-3.5 Turbo — profile your use case before choosing a model.

Use tiktoken (Python) or the Azure OpenAI tokenizer to estimate prompt size before sending requests.
Set max_tokens on every request — without it, the model may generate a very long (and expensive) response.
Cache responses for identical or near-identical prompts using Azure Cache for Redis.
Use GPT-3.5 Turbo for classification, extraction, and simple Q&A — reserve GPT-4o for tasks that genuinely need it.
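
The caching tip can be sketched as follows. This is an illustrative in-memory version: a plain dict stands in for Azure Cache for Redis, the PromptCache class is hypothetical, and "near-identical" is interpreted narrowly as prompts that differ only in whitespace. It assumes text-only messages.

```python
import hashlib
import json

class PromptCache:
    """In-memory stand-in for a Redis cache keyed by a hash of the request."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(deployment, messages, temperature):
        # Collapse runs of whitespace so trivially different prompts share a key.
        normalised = [
            {"role": m["role"], "content": " ".join(m["content"].split())}
            for m in messages
        ]
        payload = json.dumps([deployment, temperature, normalised], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, deployment, messages, temperature, compute):
        """Return the cached response, calling compute() only on a miss."""
        key = self.key_for(deployment, messages, temperature)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

In practice, compute would wrap the chat completion call. Caching makes the most sense for low-temperature requests where a repeated answer is acceptable, and entries should be given an expiry when you move to Redis.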

Write effective system prompts

The system prompt defines the model's persona, constraints, and output format. A well-written system prompt is one of the most effective ways to improve consistency and reduce hallucinations.

python

system_prompt = """
You are a customer support assistant for NativeCloud, an Azure consulting company.

Rules:
- Answer only questions related to Azure and cloud infrastructure.
- If a question is outside your scope, say: "I can only help with Azure and cloud topics."
- Always be concise — maximum 3 sentences unless the user asks for detail.
- Never make up product names, prices, or features. If unsure, say so.
- Format lists using bullet points.
"""

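A system prompt only takes effect if it is sent as the first message on every request. The sketch below shows one way to assemble the messages list while keeping conversation history bounded; build_messages and its max_history_messages parameter are hypothetical, not part of the SDK.

```python
def build_messages(system_prompt, history, user_input, max_history_messages=6):
    """Assemble the messages list for a chat completion request.

    history is a list of {"role", "content"} dicts from earlier turns.
    Only the most recent max_history_messages are kept to bound prompt size.
    """
    recent = history[-max_history_messages:] if max_history_messages > 0 else []
    return (
        [{"role": "system", "content": system_prompt.strip()}]
        + list(recent)
        + [{"role": "user", "content": user_input}]
    )

history = [
    {"role": "user", "content": "What is a resource group?"},
    {"role": "assistant", "content": "A logical container for Azure resources."},
]
messages = build_messages("Answer only Azure questions.", history, "And a subscription?")
print(len(messages))  # prints 4
```

Trimming by message count is the simplest policy; trimming by token count is more precise if your turns vary widely in length.
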
Handle rate limits and errors gracefully

Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per deployment. In production, implement exponential backoff with jitter when you receive a 429 (rate limit) response. Use multiple deployments or regions as fallback for high-availability applications.

python

import time, random
from openai import RateLimitError

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
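
The fallback idea mentioned above, using multiple deployments or regions, can be sketched like this. It is an illustration under stated assumptions: call_with_fallback is a hypothetical helper, and RateLimited is a local stand-in for openai.RateLimitError so the example runs without the SDK.

```python
class RateLimited(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs without the SDK."""

def call_with_fallback(callers, messages, retryable=(RateLimited,)):
    """Try each deployment in order, moving on when one is rate limited.

    callers is a list of (name, function) pairs; each function takes the
    messages list and returns a response. Returns (name, response).
    """
    last_error = None
    for name, call in callers:
        try:
            return name, call(messages)
        except retryable as exc:
            last_error = exc  # remember why this deployment was skipped
    if last_error is None:
        raise ValueError("no deployments configured")
    raise last_error

def primary(messages):
    raise RateLimited("429 from primary deployment")

def secondary(messages):
    return "response from secondary"

print(call_with_fallback([("primary", primary), ("secondary", secondary)], []))
```

With the real SDK, each caller would wrap a client bound to a different deployment or regional endpoint, and retryable would be set to (RateLimitError,). Combining this with the backoff loop above gives retries within a deployment and failover across deployments.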

Content filtering

Azure OpenAI has built-in content filters that block harmful input and output across four categories: hate speech, violence, sexual content, and self-harm. In Azure AI Foundry (formerly Azure AI Studio), you can configure custom filter thresholds and enable prompt shields to protect against jailbreak and indirect prompt injection attacks. This is especially important for customer-facing applications.
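
Your application also needs to handle the case where a filter triggers. To my understanding, a filtered completion is reported with a finish_reason of "content_filter", while a filtered prompt is rejected with an HTTP 400 error (raised as BadRequestError in the Python SDK); confirm both against the current service documentation. The sketch below covers the first case only. extract_text, fake_response, and the fallback message are hypothetical.

```python
from types import SimpleNamespace

FILTERED_MESSAGE = "Sorry, I can't help with that request."

def extract_text(response, fallback=FILTERED_MESSAGE):
    """Return the completion text, or a fallback if the output was filtered."""
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        return fallback
    return choice.message.content or fallback

def fake_response(finish_reason, content):
    # Stand-in for an SDK response object, used so the sketch runs offline.
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(finish_reason=finish_reason, message=message)])

print(extract_text(fake_response("stop", "Blob Storage stores unstructured data.")))
print(extract_text(fake_response("content_filter", None)))
```

Logging filtered requests separately from other errors makes it easier to tune filter thresholds later.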

Closing Thoughts

Azure OpenAI Service removes the gap between AI capability and enterprise requirements. You get GPT-4o and the full OpenAI model family with private networking, managed identity authentication, regional data residency, and the compliance certifications your organisation likely already requires.

Start by provisioning a resource, deploying GPT-3.5 Turbo, and making your first API call in Python or .NET. Then add Key Vault for credential management, streaming for better UX, and a well-crafted system prompt. Those foundations will carry you from prototype to production.
