articles/api-management/genai-gateway-capabilities.md (8 additions, 8 deletions)
@@ -45,7 +45,7 @@ AI adoption in organizations involves several phases:
As AI adoption matures, especially in larger enterprises, the AI gateway helps address key challenges. It helps you:
- * Authenticate and authorize access to Foundry Tools
+ * Authenticate and authorize access to AI services
* Load balance across multiple AI endpoints
* Monitor and log AI interactions
* Manage token usage and quotas across multiple applications
@@ -78,15 +78,15 @@ More information:
One of the main resources in generative AI services is *tokens*. Microsoft Foundry and other providers assign quotas for your model deployments as tokens per minute (TPM). You distribute these tokens across your model consumers, such as different applications, developer teams, or departments within the company.
- If you have a single app connecting to an AI service backend, you can manage token consumption with a TPM limit that you set directly on the model deployment. However, when your application portfolio grows, you might have multiple apps calling single or multiple Azure AI Services endpoints. These endpoints can be pay-as-you-go or [Provisioned Throughput Units](/azure/ai-services/openai/concepts/provisioned-throughput) (PTU) instances. You need to make sure that one app doesn't use the whole TPM quota and block other apps from accessing the backends they need.
+ If you have a single app connecting to an AI service backend, you can manage token consumption with a TPM limit that you set directly on the model deployment. However, when your application portfolio grows, you might have multiple apps calling single or multiple AI service endpoints. These endpoints can be pay-as-you-go or [Provisioned Throughput Units](/azure/ai-services/openai/concepts/provisioned-throughput) (PTU) instances. You need to make sure that one app doesn't use the whole TPM quota and block other apps from accessing the backends they need.
### Token rate limiting and quotas
- Configure a token limit policy on your LLM APIs to manage and enforce limits per API consumer based on the usage of Foundry Tool tokens. By using this policy, you can set a TPM limit or a token quota over a specified period, such as hourly, daily, weekly, monthly, or yearly.
+ Configure a token limit policy on your LLM APIs to manage and enforce limits per API consumer based on the usage of AI service tokens. By using this policy, you can set a TPM limit or a token quota over a specified period, such as hourly, daily, weekly, monthly, or yearly.
:::image type="content" source="media/genai-gateway-capabilities/token-rate-limiting.png" alt-text="Diagram of limiting Azure OpenAI Service tokens in API Management.":::
- This policy provides flexibility to assign token-based limits on any counter key, such as subscription key, originating IP address, or an arbitrary key defined through a policy expression. The policy also enables precalculation of prompt tokens on the Azure API Management side, minimizing unnecessary requests to the Foundry Tool backend if the prompt already exceeds the limit.
+ This policy provides flexibility to assign token-based limits on any counter key, such as subscription key, originating IP address, or an arbitrary key defined through a policy expression. The policy also enables precalculation of prompt tokens on the Azure API Management side, minimizing unnecessary requests to the AI service backend if the prompt already exceeds the limit.
The following basic example demonstrates how to set a TPM limit of 500 per subscription key:
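A minimal sketch of what that policy can look like, assuming the `llm-token-limit` policy with the subscription ID as the counter key (attribute names should be confirmed against the policy reference):

```xml
<policies>
    <inbound>
        <base />
        <!-- Limit each subscription to 500 tokens per minute; estimate prompt tokens
             at the gateway so over-limit requests are rejected before reaching the backend. -->
        <llm-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="500"
            estimate-prompt-tokens="true"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
</policies>
```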
@@ -102,7 +102,7 @@ More information:
### Semantic caching
- Semantic caching is a technique that improves the performance of LLM APIs by caching the results (completions) of previous prompts and reusing them by comparing the vector proximity of the prompt to prior requests. This technique reduces the number of calls made to the Foundry Tool backend, improves response times for end users, and can help reduce costs.
+ Semantic caching is a technique that improves the performance of LLM APIs by caching the results (completions) of previous prompts and reusing them by comparing the vector proximity of the prompt to prior requests. This technique reduces the number of calls made to the AI service backend, improves response times for end users, and can help reduce costs.
In API Management, enable semantic caching by using [Azure Managed Redis](/azure/redis/overview) or another external cache compatible with RediSearch and onboarded to Azure API Management. By using the Embeddings API, the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) and [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policies store and retrieve semantically similar prompt completions from the cache. This approach ensures completions reuse, resulting in reduced token consumption and improved response performance.
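A minimal sketch of the two policies working together, assuming a backend entity named `embeddings-backend` that points to an embeddings deployment and a 60-second cache duration (both names and values are illustrative):

```xml
<policies>
    <inbound>
        <base />
        <!-- Look up a semantically similar prompt in the external cache before calling
             the model; "embeddings-backend" is an assumed backend entity for embeddings. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />
    </inbound>
    <outbound>
        <!-- Store the returned completion for 60 seconds so similar prompts can reuse it. -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```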
@@ -124,13 +124,13 @@ More information:
* [Deploy an API Management instance in multiple regions](api-management-howto-deploy-multi-region.md)
> [!NOTE]
- > While API Management can scale gateway capacity, you also need to scale and distribute traffic to your AI backends to accommodate increased load (see the [Resiliency](#resiliency) section). For example, to take advantage of geographical distribution of your system in a multiregion configuration, deploy backend Foundry Tools in the same regions as your API Management gateways.
+ > While API Management can scale gateway capacity, you also need to scale and distribute traffic to your AI backends to accommodate increased load (see the [Resiliency](#resiliency) section). For example, to take advantage of geographical distribution of your system in a multiregion configuration, deploy backend AI services in the same regions as your API Management gateways.
## Security and safety
An AI gateway secures and controls access to your AI APIs. By using the AI gateway, you can:
- * Use managed identities to authenticate to Foundry Tools, so you don't need API keys for authentication
+ * Use managed identities to authenticate to AI services, so you don't need API keys for authentication
* Configure OAuth authorization for AI apps and agents to access APIs or MCP servers by using API Management's credential manager
* Apply policies to automatically moderate LLM prompts by using [Azure AI Content Safety](/azure/ai-services/content-safety/overview)
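As a sketch of the managed identity item above, the gateway can acquire a token with its own identity and attach it to the forwarded request instead of an API key; the resource URI below assumes an Azure AI (Cognitive Services) backend:

```xml
<policies>
    <inbound>
        <base />
        <!-- Acquire a token for the AI service by using the API Management instance's
             managed identity, then pass it to the backend in place of an API key. -->
        <authentication-managed-identity
            resource="https://cognitiveservices.azure.com"
            output-token-variable-name="msi-access-token"
            ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
    </inbound>
</policies>
```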
@@ -146,7 +146,7 @@ More information:
## Resiliency
- One challenge when building intelligent applications is ensuring that the applications are resilient to backend failures and can handle high loads. By configuring your LLM endpoints with [backends](backends.md) in Azure API Management, you can balance the load across them. You can also define circuit breaker rules to stop forwarding requests to Foundry Tool backends if they're not responsive.
+ One challenge when building intelligent applications is ensuring that the applications are resilient to backend failures and can handle high loads. By configuring your LLM endpoints with [backends](backends.md) in Azure API Management, you can balance the load across them. You can also define circuit breaker rules to stop forwarding requests to AI service backends if they're not responsive.
articles/api-management/openai-compatible-llm-api.md (1 addition, 1 deletion)
@@ -23,7 +23,7 @@ Learn more about managing AI APIs in API Management:
## Language model API types
- API Management supports two types of language model APIs for this scenario. Choose the option suitable for your model deployment. The option determines how clients call the API and how the API Management instance routes requests to the Foundry Tool.
+ API Management supports two types of language model APIs for this scenario. Choose the option suitable for your model deployment. The option determines how clients call the API and how the API Management instance routes requests to the AI service.
* **OpenAI-compatible** - Language model endpoints that are compatible with OpenAI's API. Examples include certain models exposed by inference providers such as [Hugging Face Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/en/index) and [Google Gemini API](openai-compatible-google-gemini-api.md).