Commit beba330

Merge pull request #304390 from anaharris-ms/rel-ai-search
Reliability: AI Search - Correcting bad merge
2 parents f351b8c + 0bcaaab commit beba330

1 file changed

Lines changed: 74 additions & 25 deletions

File tree

articles/reliability/reliability-ai-search.md
Original file line number | Diff line number | Diff line change
@@ -6,39 +6,60 @@ author: haileytap
66
ms.author: haileytapia
77
ms.service: azure-ai-search
88
ms.topic: reliability-article
9-
ms.date: 08/08/2025
9+
ms.date: 08/25/2025
1010
ms.custom: subject-reliability
1111
---
1212

1313
# Reliability in Azure AI Search
1414

15-
This article describes reliability support in Azure AI Search, covering intra-regional resiliency via [availability zones](#availability-zone-support) and [multi-region deployments](#multi-region-support).
15+
Azure AI Search is a scalable search infrastructure that indexes heterogeneous content and enables retrieval through APIs, applications, and AI agents. It's suitable for enterprise search scenarios as well as AI-powered customer experiences that require dynamic content generation through chat completion models.
1616

17-
[!INCLUDE [Shared responsibility description](includes/reliability-shared-responsibility-include.md)]
18-
19-
In Azure AI Search, you can achieve reliability by:
17+
This article describes reliability support in [Azure AI Search](/azure/search/search-what-is-azure-search), covering intra-regional resiliency via [availability zones](#availability-zone-support) and [multi-region deployments](#multi-region-support).
2018

21-
+ **Scaling a single search service**. Add multiple [replicas](/azure/search/search-capacity-planning#concepts-search-units-replicas-partitions) to increase availability and handle higher indexing and query workloads. If your region supports availability zones, replicas are distributed across different physical data centers on a best-effort basis for extra resiliency.
22-
23-
+ **Deploying multiple search services across different regions**. Each service operates independently within its region. However, in a multi-service scenario, you have options for synchronizing content across all services. You can also use a load-balancing solution to redistribute requests or fail over if there's a service outage.
19+
[!INCLUDE [Shared responsibility description](includes/reliability-shared-responsibility-include.md)]
2420

2521
## Production deployment recommendations
2622

2723
For production workloads, we recommend using a [billable tier](/azure/search/search-sku-tier) with at least [two replicas](/azure/search/search-capacity-planning#add-or-remove-partitions-and-replicas). This configuration makes your search service more resilient to transient faults and maintenance operations. It also meets the [service-level agreement](#service-level-agreement) for Azure AI Search, which requires two replicas for read-only workloads and three or more replicas for read-write workloads.
2824

2925
Azure AI Search doesn't provide a service-level agreement for the Free tier, which is limited to one replica and is strongly discouraged for production use.
3026

27+
## Reliability architecture overview
28+
29+
When you use Azure AI Search, you create a *search service*. Each search service supports many *search indexes* that store your searchable content.
30+
31+
Azure AI Search isn't designed as a primary data store. Instead, you use *indexers* to connect your search service to external data sources. An indexer crawls the source data, invokes *skills* that perform processing and enrichment, and populates your index with the skill outputs.
32+
33+
You also configure the number of *replicas* for your service. In Azure AI Search, a replica is a copy of your service's search engine. You can think of a replica as representing a single virtual machine (VM). Each search service can have between 1 and 12 replicas.
34+
35+
The addition of multiple replicas allows AI Search to:
36+
37+
- Increase the availability of your search service.
38+
- Perform maintenance on one replica while queries continue executing on other replicas.
39+
- Handle higher indexing and query workloads.
40+
- Improve resiliency by attempting to provision replicas in different availability zones, if supported in your region.
41+
42+
One replica is automatically assigned to be the *primary replica* by Azure AI Search. All write operations are performed against that replica. The other replicas are used for read operations.
43+
44+
You can also configure the number of *partitions*, which represent the storage used by the search indexes.
45+
46+
It's important to understand the impact of adding replicas and partitions, because they each affect read and write performance in different ways. For more information about replicas and partitions, see [Estimate and manage capacity of a search service](/azure/search/search-capacity-planning).
47+
3148
## Transient faults
3249

3350
[!INCLUDE [Transient fault description](includes/reliability-transient-fault-description-include.md)]
3451

52+
Azure AI Search indexers have built-in transient fault handling. If a data source is briefly unavailable, the indexer is designed to recover and retry, and uses change tracking to resume indexing from the last successfully indexed document.
53+
3554
Search services might experience transient faults during standard, unscheduled maintenance operations. Azure AI Search doesn't provide advance notification or allow scheduling of maintenance at specific times. Although every effort is made to minimize downtime, even for single-replica services, brief interruptions can still occur. To improve resiliency against these transient faults, we recommend that you use two or more replicas.
3655

56+
Any applications you build that interact with Azure AI Search should handle transient faults. Use a retry strategy, with exponential backoffs, for both read and write operations.
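
The retry guidance in the added line above could be sketched as follows. This is a minimal illustration, not code from the article or from any Azure SDK: `TransientSearchError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import random
import time


class TransientSearchError(Exception):
    """Hypothetical stand-in for a retryable failure (for example, an HTTP 503)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    The delay ceiling doubles on each attempt (capped at max_delay), and the
    actual wait is a random value below that ceiling so that many clients
    don't retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientSearchError:
            if attempt == max_attempts - 1:
                raise  # Retries exhausted: surface the error to the caller.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The same wrapper applies to both queries and indexing calls. The `sleep` parameter exists only so that the wait can be replaced in tests.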
57+
3758
## Availability zone support
3859

3960
[!INCLUDE [Availability zone support description](includes/reliability-availability-zone-description-include.md)]
4061

41-
Azure AI Search is zone redundant, which means that your replicas are distributed across multiple availability zones within the service region.
62+
Azure AI Search is zone redundant, which means that your replicas are distributed across multiple availability zones within the search service region.
4263

4364
When you add two or more replicas to your service, Azure AI Search attempts to place each replica in a different availability zone. For services with more replicas than available zones, replicas are distributed across zones as evenly as possible.
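
As a rough illustration of "as evenly as possible", the arithmetic below splits a replica count across zones. It's only a sketch of the even-split idea under the assumption of a simple remainder-based distribution. The function name `spread_replicas` is hypothetical, and the article doesn't describe the service's actual placement algorithm, which is best effort.

```python
def spread_replicas(replica_count, zone_count):
    """Return how many replicas land in each zone under an even split.

    Zones that receive the remainder get one extra replica, so no two
    zones differ by more than one.
    """
    if replica_count < 1 or zone_count < 1:
        raise ValueError("replica_count and zone_count must be at least 1")
    base, extra = divmod(replica_count, zone_count)
    return [base + 1 if zone < extra else base for zone in range(zone_count)]
```

For example, seven replicas in a three-zone region would split 3, 2, and 2 under this assumption, while two replicas would occupy two of the three zones.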
4465

@@ -51,11 +72,20 @@ Support for availability zones depends on infrastructure and storage. For a list
5172

5273
### Requirements
5374

54-
Zone redundancy is automatically enabled when your search service:
75+
Zone redundancy is automatically enabled when your search service meets all of the following criteria:
5576

5677
+ Is in a [region that has availability zones](/azure/search/search-region-support).
57-
+ Is on the [Basic tier or higher](/azure/search/search-sku-tier). Zone redundancy isn't available for the Free tier.
58-
+ Has [multiple replicas](/azure/search/search-capacity-planning#add-or-remove-partitions-and-replicas).
78+
+ Is on the [Basic tier or higher](/azure/search/search-sku-tier).
79+
+ Has [at least two replicas](/azure/search/search-capacity-planning#add-or-remove-partitions-and-replicas).
80+
81+
> [!NOTE]
82+
> Azure AI Search attempts to distribute replicas across multiple zones when you have two or more replicas. However, for read-write workloads, you should use three or more replicas so that you receive the highest possible availability service-level agreement (SLA).
83+
84+
### Instance distribution across zones
85+
86+
Azure AI Search attempts to place replicas across different availability zones. However, there are occasionally situations where all of the replicas of a search service might be placed into the same availability zone. This situation can happen when replicas are removed from your service, such as when you *scale in* by configuring your service to use fewer replicas. The reason is that replica removal currently doesn't cause the remaining replicas to be rebalanced across the availability zones.
87+
88+
To reduce the likelihood of all of your replicas being placed into a single availability zone, you can manually trigger a scale-out operation immediately after a scale-in operation. For example, suppose your search service has ten replicas and you want to scale in to seven replicas. Instead of performing a single scale operation, you can temporarily scale to six instances, then immediately scale to seven instances, to trigger zone rebalancing.
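
The scale-in workaround described above can be expressed as a small planning helper. This is a sketch only: `rebalancing_scale_plan` is a hypothetical name, it returns the sequence of replica counts to request rather than calling any Azure API, and the 1 to 12 bounds come from the replica limits stated earlier in the article.

```python
def rebalancing_scale_plan(current, target):
    """Return the replica counts to request, in order, when scaling in.

    Scaling one replica below the target and then back up to it ends the
    operation with a scale-out, which is the step the article says can
    trigger zone rebalancing.
    """
    if not 1 <= target < current <= 12:
        raise ValueError("expected 1 <= target < current <= 12")
    if target == 1:
        # A service can't drop below one replica, and a single replica
        # can't be spread across zones anyway.
        return [1]
    return [target - 1, target]
```

With the article's example, `rebalancing_scale_plan(10, 7)` yields `[6, 7]`: scale to six replicas, then immediately to seven.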
5989

6090
### Cost
6191

@@ -65,21 +95,41 @@ Each search service starts with one replica. Zone redundancy requires two or mor
6595

6696
If your search service meets the [requirements for zone redundancy](#requirements), no extra configuration is necessary. Whenever possible, Azure AI Search attempts to place your replicas in different availability zones.
6797

98+
### Capacity planning and management
99+
100+
To prepare for availability zone failure, consider *over-provisioning* the number of replicas. Over-provisioning allows the search service to tolerate some degree of capacity loss and continue to function without degraded performance. Adding replicas during an outage is challenging, so over-provisioning helps ensure that your search service can handle normal request volumes, even with reduced capacity. For more information, see [Manage capacity with over-provisioning](/azure/reliability/concept-redundancy-replication-backup#manage-capacity-with-over-provisioning).
101+
102+
### Normal operations
103+
104+
This section describes what to expect when search services are configured for zone redundancy and all availability zones are operational.
105+
106+
- **Traffic routing between zones:** Azure AI Search performs automatic load balancing of all queries and writes across all of the available replicas. Read operations can be sent to any replica in any availability zone. Write operations are sent to a single primary replica, which is selected by the Azure AI Search service.
107+
108+
- **Data replication between zones:** Changes in data are replicated between replicas across availability zones automatically. Replication occurs asynchronously, which means writes are committed to one primary replica before they're replicated to other replicas.
109+
68110
### Zone-down experience
69111

70-
When an availability zone experiences an outage, your search service continues to operate using replicas in the surviving zones. The following points summarize the expected behavior:
112+
This section describes what to expect when search services are configured for zone redundancy and there's an availability zone outage.
71113

72114
+ **Detection and response**: Azure AI Search is responsible for detecting a failure in an availability zone. You don't need to do anything to initiate a zone failover.
73115

74-
+ **Notification**: Azure AI Search doesn't notify you when a zone is down.
116+
+ **Notification**: Azure AI Search doesn't notify you when a zone is down. However, you can use [Azure Resource Health](/azure/service-health/resource-health-overview) to monitor for the health of replicas. If a zone is down, the replicas in that zone will show as unavailable. You can also use [Azure Service Health](/azure/service-health/overview) to understand the overall health of the Azure AI Search service, including any zone failures.
117+
118+
Set up alerts on these services to receive notifications of zone-level problems. For more information, see [Create Service Health alerts in the Azure portal](/azure/service-health/alerts-activity-log-service-notifications-portal) and [Create and configure Resource Health alerts](/azure/service-health/resource-health-alert-arm-template-guide).
75119

76-
+ **Active requests**: Any active requests are dropped and should be retried by the client.
120+
+ **Active requests**: Requests being processed by replicas in the failed zone are terminated and should be retried by clients, following the guidance for [handling transient faults](#transient-faults).
77121

78-
+ **Expected data loss**: A zone failure isn't expected to cause data loss.
122+
+ **Expected data loss**: If the affected availability zone only contains read replicas, no data loss is expected.
123+
124+
If the primary replica is lost because it was in the affected zone, then any write operations that haven't yet been replicated might be lost.
79125

80-
+ **Expected downtime**: A zone failure isn't expected to cause downtime to your search service, but it can temporarily reduce your service's overall capacity. To maintain optimal performance, consider provisioning more replicas than you typically need. Adding replicas during an outage is challenging, so overprovisioning helps ensure that your service can handle normal request volumes, even with reduced capacity.
126+
+ **Expected downtime**: In most situations, a zone failure isn't expected to cause downtime to your search service for read operations, because read replicas in other availability zones continue to serve requests.
81127

82-
+ **Traffic rerouting**: When a zone fails, Azure AI Search detects the failure and routes requests to active replicas in the surviving zones.
128+
If the primary replica is lost because it was in the affected zone, Azure AI Search automatically promotes another replica to become the new primary so that write operations can resume. It typically takes a few seconds for the replica promotion to occur, and during this time write operations might not succeed. Ensure that your applications are prepared by following [transient fault handling guidance](#transient-faults).
129+
130+
However, there are some unlikely situations where all of your search service's replicas could be in a single availability zone, and if this happens, you might experience downtime until the zone recovers. For more information, and to understand a workaround, see [Instance distribution](#instance-distribution-across-zones).
131+
132+
+ **Traffic rerouting**: When a zone fails, Azure AI Search detects the failure and routes requests to active replicas in the surviving zones. If the primary replica was lost, another replica is promoted to be the new primary.
83133

84134
### Zone recovery
85135

@@ -95,7 +145,7 @@ Azure AI Search is a single-region service. If the region becomes unavailable, y
95145

96146
### Alternative multi-region approaches
97147

98-
To use Azure AI Search in multiple regions, you must deploy separate services in each region. If you create an identical deployment in a secondary Azure region using a multi-region geography architecture, your application becomes less susceptible to a single-region disaster.
148+
You can optionally deploy multiple Azure AI Search services in different regions. You're responsible for deploying and configuring separate services in each region. If you create an identical deployment in a secondary Azure region using a multi-region architecture, your application becomes less susceptible to a single-region disaster.
99149

100150
When you follow this approach, you must synchronize indexes across regions to recover the last application state. You must also configure load balancing and failover policies. For more information, see [Multi-region deployments in Azure AI Search](/azure/search/search-multi-region).
101151

@@ -107,14 +157,13 @@ However, if you accidentally delete the index and don't have a backup, you can [
107157

108158
## Service-level agreement
109159

110-
The service-level agreement (SLA) for Azure AI Search describes the expected availability of the service and the conditions that must be met to achieve that availability expectation. For more information, see the [SLA for Azure AI Search](https://azure.microsoft.com/support/legal/sla/search/v1_0/).
111-
112-
SLA coverage applies to search services on billable tiers with at least two replicas. In Azure AI Search, a replica is a copy of your index. Each service can have between 1 and 12 replicas. When you [add replicas](/azure/search/search-capacity-planning#add-or-remove-partitions-and-replicas), Azure AI Search can then perform maintenance on one replica while queries continue to execute on other replicas.
160+
[!INCLUDE [SLA description](includes/reliability-service-level-agreement-include.md)]
113161

114-
Microsoft guarantees at least 99.9% availability of:
162+
In Azure AI Search, the availability SLA applies to search services that:
115163

116-
+ Read-only workloads (queries) for search services with two replicas.
117-
+ Read-write workloads (queries and indexing) for search services with three or more replicas.
164+
+ Are configured to use [a billable tier](/azure/search/search-sku-tier).
165+
+ Have at least two [replicas](/azure/search/search-capacity-planning#add-or-remove-partitions-and-replicas) for read-only workloads (queries).
166+
+ Have at least three replicas for read-write workloads (queries and indexing).
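
The three SLA conditions listed above can be checked with a few lines of code. This is an illustrative sketch: `sla_applies` and the workload labels `"read-only"` and `"read-write"` are hypothetical, and the only tier rule assumed is the one the article states (the Free tier has no SLA).

```python
# Minimum replica counts per workload type, as listed in the article.
MIN_REPLICAS = {"read-only": 2, "read-write": 3}


def sla_applies(tier, replicas, workload):
    """Return True if a service configuration meets the listed SLA conditions."""
    if workload not in MIN_REPLICAS:
        raise ValueError(f"unknown workload: {workload!r}")
    if tier.strip().lower() == "free":
        return False  # The Free tier isn't a billable tier.
    return replicas >= MIN_REPLICAS[workload]
```

For instance, a Basic-tier service with two replicas meets the conditions for queries but not for combined query and indexing workloads.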
118167

119168
## Related content
120169
