
Commit af1a6dd

Merge pull request #314557 from ftrichardson1/add/files-for-ai
Add/files for ai
2 parents 2dda89d + 37707b1 commit af1a6dd

18 files changed

Lines changed: 2209 additions & 0 deletions

File tree

articles/storage/files/TOC.yml

Lines changed: 40 additions & 0 deletions
@@ -287,6 +287,46 @@
    href: storage-java-how-to-use-file-storage.md
  - name: Python
    href: storage-python-how-to-use-file-storage.md
  - name: Files for artificial intelligence (AI)
    items:
      - name: Retrieval-augmented generation (RAG)
        items:
          - name: What is retrieval-augmented generation?
            href: artificial-intelligence/retrieval-augmented-generation/overview.md
          - name: LangChain
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/orchestrations/langchain.md
          - name: LlamaIndex
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/orchestrations/llamaindex.md
          - name: Haystack
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/orchestrations/haystack.md
          - name: Pinecone
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/vector-databases/pinecone.md
          - name: Weaviate
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/vector-databases/weaviate.md
          - name: Qdrant
            href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/vector-databases/qdrant.md
          - name: Tutorials
            items:
              - name: Get started
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/setup.md
              - name: Use LangChain with Pinecone
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/langchain-pinecone/tutorial-langchain-pinecone.md
              - name: Use LangChain with Weaviate
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/langchain-weaviate/tutorial-langchain-weaviate.md
              - name: Use LangChain with Qdrant
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/langchain-qdrant/tutorial-langchain-qdrant.md
              - name: Use LlamaIndex with Pinecone
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/llamaindex-pinecone/tutorial-llamaindex-pinecone.md
              - name: Use LlamaIndex with Weaviate
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/llamaindex-weaviate/tutorial-llamaindex-weaviate.md
              - name: Use LlamaIndex with Qdrant
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/llamaindex-qdrant/tutorial-llamaindex-qdrant.md
              - name: Use Haystack with Pinecone
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/haystack-pinecone/tutorial-haystack-pinecone.md
              - name: Use Haystack with Weaviate
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/haystack-weaviate/tutorial-haystack-weaviate.md
              - name: Use Haystack with Qdrant
                href: artificial-intelligence/retrieval-augmented-generation/open-source-frameworks/tutorials/haystack-qdrant/tutorial-haystack-qdrant.md
  - name: Troubleshooting
    items:
      - name: Troubleshoot Azure Files
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
---
title: Build RAG pipelines with Haystack and Azure Files
description: Use Haystack as an orchestration framework to build retrieval-augmented generation (RAG) pipelines using data stored in Azure Files.
author: ftrichardson1
ms.service: azure-file-storage
ms.topic: overview
ms.date: 04/09/2026
ms.author: t-flynnr
---

# Haystack with Azure Files

Haystack is an open-source framework that models every pipeline as a directed acyclic graph (DAG) of typed components. By using Haystack with Azure Files, you can build retrieval-augmented generation (RAG) pipelines that use your existing file shares as a primary data source.

Haystack separates indexing (embed and write) from querying (embed, retrieve, prompt, and generate) into distinct pipeline objects, making each independently testable and deployable.

## Why use Haystack with Azure Files?

* **Explicit pipeline DAGs:** Every component is a separate, typed node with named input and output sockets. You can visualize the pipeline, validate connections at build time, and trace data through each stage.
* **Separate indexing and query pipelines:** Because ingestion and retrieval are distinct pipeline objects, you can test, version, and deploy each one on its own schedule.
* **Custom components via `@component`:** Any Python class decorated with `@component` becomes a pipeline node with typed sockets, making it straightforward to add custom filtering or domain-specific logic as a first-class pipeline stage.
* **Built-in evaluation tools:** Haystack includes evaluation components for measuring retrieval and generation quality, so you can quantify the impact of changes to your pipeline.
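
To make the custom-component idea concrete, the following sketch shows the shape of a filtering stage in plain Python (concept only; the `Document` class and `keep_markdown` function are illustrative, and Haystack's real API additionally requires the `@component` decorator and declared output types):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for a pipeline document with content and metadata."""
    content: str
    meta: dict = field(default_factory=dict)

def keep_markdown(documents: list[Document]) -> list[Document]:
    """Concept-only filter stage: keep documents whose source path ends in .md."""
    return [d for d in documents if d.meta.get("path", "").endswith(".md")]
```

In Haystack itself, the equivalent class would be decorated with `@component` and declare its output sockets, so the framework can validate connections when the pipeline is built.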

## Tutorials

The following tutorials demonstrate how to build RAG pipelines over documents stored in Azure Files using Haystack with different vector databases:

| Vector database | Tutorial |
| :--- | :--- |
| Pinecone | [Haystack + Pinecone](../tutorials/haystack-pinecone/tutorial-haystack-pinecone.md) |
| Weaviate | [Haystack + Weaviate](../tutorials/haystack-weaviate/tutorial-haystack-weaviate.md) |
| Qdrant | [Haystack + Qdrant](../tutorials/haystack-qdrant/tutorial-haystack-qdrant.md) |

> [!NOTE]
> All tutorials require the same [project setup and prerequisites](../setup.md).

## Next steps

* [Azure Storage documentation](/azure/storage/)
* [Haystack documentation](https://docs.haystack.deepset.ai/)
* [Haystack GitHub](https://github.com/deepset-ai/haystack)
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
---
title: Build RAG pipelines with LangChain and Azure Files
description: Use LangChain as an orchestration framework to build retrieval-augmented generation (RAG) pipelines using data stored in Azure Files.
author: ftrichardson1
ms.service: azure-file-storage
ms.topic: overview
ms.date: 04/09/2026
ms.author: t-flynnr
---

# LangChain with Azure Files

LangChain is an open-source framework designed to simplify the creation of applications powered by large language models (LLMs). By using LangChain with Azure Files, you can build robust retrieval-augmented generation (RAG) pipelines that use your existing file shares as a primary data source.

LangChain's modular architecture and **LangChain Expression Language (LCEL)** allow you to swap components, such as document loaders, retrievers, and vector stores, with minimal code changes.

## Why use LangChain with Azure Files?

Integrating LangChain with Azure Files offers several advantages for AI workflows:

* **Modular integrations:** Connect Azure Files to a wide array of vector databases and LLMs without rewriting core logic.
* **Streamlined orchestration:** Use LCEL to build composable, testable pipelines that support asynchronous execution and real-time streaming.
* **Optional observability:** Integrate with tools like LangSmith to trace execution, evaluate retrieval quality, and debug latency.
* **Direct data access:** Ingest unstructured data straight from Azure Files, keeping your existing storage hierarchy as the system of record.
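
LCEL's pipe-style composition can be pictured with ordinary function chaining. This is a concept sketch in plain Python, not the LangChain API, and the stage names are illustrative:

```python
from functools import reduce

def pipe(*stages):
    """Compose stages left to right, like chaining LCEL runnables with |."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)

# Illustrative stages standing in for a retriever and a prompt formatter.
retrieve = lambda query: [f"doc about {query}"]
format_prompt = lambda docs: "Context: " + "; ".join(docs)

chain = pipe(retrieve, format_prompt)
```

Because adjacent stages agree only on input and output shape, swapping the retriever for one backed by a different vector database leaves the rest of the chain untouched, which is the property described in the bullets above.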

## Tutorials

The following tutorials demonstrate how to build RAG pipelines over documents stored in Azure Files using LangChain with different vector databases:

| Vector database | Tutorial |
| :--- | :--- |
| Pinecone | [LangChain + Pinecone](../tutorials/langchain-pinecone/tutorial-langchain-pinecone.md) |
| Weaviate | [LangChain + Weaviate](../tutorials/langchain-weaviate/tutorial-langchain-weaviate.md) |
| Qdrant | [LangChain + Qdrant](../tutorials/langchain-qdrant/tutorial-langchain-qdrant.md) |

> [!NOTE]
> All tutorials require the same [project setup and prerequisites](../setup.md).

## Next steps

* [Azure Storage documentation](/azure/storage/)
* [LangChain documentation](https://python.langchain.com/docs/introduction/)
* [LangChain GitHub](https://github.com/langchain-ai/langchain)
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
---
title: Build RAG pipelines with LlamaIndex and Azure Files
description: Use LlamaIndex as an orchestration framework to build retrieval-augmented generation (RAG) pipelines using data stored in Azure Files.
author: ftrichardson1
ms.service: azure-file-storage
ms.topic: overview
ms.date: 04/09/2026
ms.author: t-flynnr
---

# LlamaIndex with Azure Files

LlamaIndex is an open-source framework designed for building retrieval-augmented generation (RAG) applications. By using LlamaIndex with Azure Files, you can build RAG pipelines that use your existing file shares as a primary data source.

LlamaIndex provides fine-grained control over each stage of the pipeline through abstractions such as `SentenceSplitter` for chunking, `VectorStoreIndex` for indexing, and `RetrieverQueryEngine` for query-time retrieval and response synthesis.
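
As a rough, concept-only illustration of what sentence-based chunking does (plain Python; LlamaIndex's `SentenceSplitter` is considerably more sophisticated, with token-aware sizing and configurable overlap):

```python
import re

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Naive sentence-based chunking: split on sentence boundaries,
    then pack consecutive sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Chunking on sentence boundaries keeps each retrieved passage semantically coherent, which is why retrieval frameworks prefer it over fixed-size character splits.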

## Why use LlamaIndex with Azure Files?

* **Retrieval-focused abstractions:** LlamaIndex provides specialized index types (`VectorStoreIndex`, `KeywordTableIndex`, `KnowledgeGraphIndex`) and query engines that give you control over retrieval strategies without restructuring your pipeline.
* **Node-based document model:** Documents are parsed into typed nodes that carry metadata and parent-child relationships, enabling filtering and source citation at query time.
* **Broad connector ecosystem:** LlamaHub provides connectors for data sources beyond file systems, so the same retrieval patterns you build for Azure Files extend to databases, APIs, and SaaS tools.
* **Multimodal support:** LlamaIndex handles text, tables, images, and structured data within a single index, which is useful for Azure Files shares that contain mixed document types.

## Tutorials

The following tutorials demonstrate how to build RAG pipelines over documents stored in Azure Files using LlamaIndex with different vector databases:

| Vector database | Tutorial |
| :--- | :--- |
| Pinecone | [LlamaIndex + Pinecone](../tutorials/llamaindex-pinecone/tutorial-llamaindex-pinecone.md) |
| Weaviate | [LlamaIndex + Weaviate](../tutorials/llamaindex-weaviate/tutorial-llamaindex-weaviate.md) |
| Qdrant | [LlamaIndex + Qdrant](../tutorials/llamaindex-qdrant/tutorial-llamaindex-qdrant.md) |

> [!NOTE]
> All tutorials require the same [project setup and prerequisites](../setup.md).

## Next steps

* [Azure Storage documentation](/azure/storage/)
* [LlamaIndex documentation](https://docs.llamaindex.ai/)
* [LlamaIndex GitHub](https://github.com/run-llama/llama_index)
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
---
title: Prepare Azure Files data for document-based RAG applications with open-source frameworks
description: Learn how to authenticate to an Azure file share and download files for ingestion into a document-based RAG application using open-source frameworks.
author: ftrichardson1
ms.service: azure-file-storage
ms.topic: how-to
ms.date: 04/08/2026
ms.author: t-flynnr
ms.custom: devx-track-python
---

# Prepare Azure Files data for document-based RAG applications using open-source AI tooling

This article describes how to authenticate to an Azure file share and download its contents for use with open-source retrieval-augmented generation (RAG) tooling.

## Prerequisites

- An [Azure file share](/azure/storage/files/create-classic-file-share?tabs=azure-portal) containing the documents you want to query. If you don't have an Azure subscription, [create one for free](https://azure.microsoft.com/free/).
- [Python 3.12.10](https://www.python.org/downloads/release/python-31210/). On Windows, install the **x64** version.
- [Azure CLI](/cli/azure/install-azure-cli).

## Grant access to an Azure file share

This article uses Microsoft Entra ID authentication via [`DefaultAzureCredential`](/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview), the recommended credential pattern for Azure software development kits (SDKs). This approach avoids storage account keys and provides a portable authentication mechanism that works across development and production environments.
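
`DefaultAzureCredential` tries a chain of credential sources in order (environment variables, managed identity, Azure CLI sign-in, and others) and uses the first one that succeeds. The chain idea can be sketched in plain Python (concept only, not the azure-identity API; the provider lambdas are illustrative):

```python
def first_available(*providers):
    """Return the first non-None result from a chain of credential providers."""
    for provider in providers:
        credential = provider()
        if credential is not None:
            return credential
    raise RuntimeError("No credential source in the chain succeeded")

# Illustrative providers: the first (e.g. environment) yields nothing,
# so the chain falls through to the next (e.g. Azure CLI).
token = first_available(lambda: None, lambda: "cli-token")
```

This fall-through behavior is what makes the same code work unchanged on a developer workstation (CLI sign-in) and in production (managed identity).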

> [!TIP]
> Receiving a `403 Forbidden` error typically indicates missing authorization rather than failed authentication.

Assign the [**Storage File Data Privileged Reader**](/azure/storage/files/authorize-oauth-rest?tabs=portal#privileged-access-and-access-permissions-for-data-operations) role to your user account on the storage account that hosts the file share.

> [!NOTE]
> This role is required because the code accesses Azure Files using [`token_intent="backup"`](/python/api/azure-storage-file-share/azure.storage.fileshare.shareclient#keyword-only-parameters). This access pattern bypasses file-level permissions, so Azure requires a privileged role. The **Storage File Data Privileged Reader** role is sufficient because the code performs only read operations and doesn't modify file contents.

#### Azure portal

1. Sign in to the [Azure portal](https://portal.azure.com) and navigate to your storage account.
2. Select **Access Control (IAM)** > **Add** > **Add role assignment**.
3. Search for **Storage File Data Privileged Reader**, select it, and select **Next**.
4. Select **Select members**, search for your user account, and select it.
5. Select **Review + assign**.

#### Azure CLI

```bash
az login

az role assignment create \
  --assignee $(az ad signed-in-user show --query id -o tsv) \
  --role "Storage File Data Privileged Reader" \
  --scope $(az storage account show \
    --name <your-storage-account-name> \
    --query id -o tsv)
```

#### Azure PowerShell

```powershell
Connect-AzAccount

New-AzRoleAssignment `
  -SignInName (Get-AzADUser -SignedIn).UserPrincipalName `
  -RoleDefinitionName "Storage File Data Privileged Reader" `
  -Scope (Get-AzStorageAccount `
    -ResourceGroupName <your-resource-group> `
    -Name <your-storage-account-name>).Id
```

## Set environment variables

Create a `.env` file in your project directory with your Azure Files connection details:

```text
AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
AZURE_STORAGE_SHARE_NAME=<your-share-name>
```

| Variable | Description |
| :--- | :--- |
| `AZURE_STORAGE_ACCOUNT_NAME` | The name of your Azure Storage account |
| `AZURE_STORAGE_SHARE_NAME` | The name of your Azure file share |
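
The sample code later in this article reads these values from `os.environ`, so they must be loaded into the environment first. A library such as python-dotenv is the usual choice; the following stdlib-only sketch shows what such a loader does (`load_env` is an illustrative helper, not part of any package):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: apply KEY=VALUE lines to os.environ,
    skipping blanks and comments, without overriding existing values."""
    with open(path) as env_file:
        for line in env_file:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Keeping connection details in `.env` (and out of source control) lets the same code run against different storage accounts without modification.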

## Download files from an Azure file share

1. Install the required packages:

    - `azure-identity` provides `DefaultAzureCredential` for passwordless authentication.
    - `azure-storage-file-share` provides the [`ShareClient`](/python/api/azure-storage-file-share/azure.storage.fileshare.shareclient) used to connect to and download files from the share.

    ```bash
    pip install azure-identity azure-storage-file-share
    ```

2. Connect to an Azure file share, recursively enumerate its directory structure, and collect the details required to locate and download each file. The `ShareClient` requires `token_intent="backup"` when using Microsoft Entra ID–based authentication.

    ```python
    import os
    import posixpath
    import tempfile

    from azure.identity import DefaultAzureCredential
    from azure.storage.fileshare import ShareClient

    account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]
    share_name = os.environ["AZURE_STORAGE_SHARE_NAME"]

    share = ShareClient(
        account_url=f"https://{account_name}.file.core.windows.net",
        share_name=share_name,
        credential=DefaultAzureCredential(),
        token_intent="backup",
    )

    root = share.get_directory_client("")
    file_references = []
    directories_to_traverse = [root]

    # Walk the share depth-first using an explicit stack.
    while directories_to_traverse:
        current = directories_to_traverse.pop()
        for item in current.list_directories_and_files():
            if item.is_directory:
                directories_to_traverse.append(current.get_subdirectory_client(item.name))
            else:
                # Azure Files paths use posix-style separators.
                relative_path = posixpath.join(current.directory_path or "", item.name)
                file_references.append((item.name, relative_path, current))
    ```
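
The same explicit-stack traversal pattern applies to any hierarchical listing. As a local, runnable illustration using only the standard library (the helper name is illustrative):

```python
import os

def walk_with_stack(root: str) -> list[str]:
    """Enumerate all files under root iteratively, mirroring the
    explicit-stack traversal used against the file share above."""
    files, stack = [], [root]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            else:
                files.append(os.path.relpath(entry.path, root))
    return files
```

An explicit stack avoids recursion-depth limits on deeply nested directory trees and makes the traversal order easy to change (a deque gives breadth-first instead).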

3. Download the files. Before writing files to disk, the code validates each resolved file path to ensure it remains within the destination directory. This validation prevents files from being written outside the intended location when processing directory structures from an external source.

    ```python
    with tempfile.TemporaryDirectory() as destination:
        for name, relative_path, parent_directory in file_references:
            file_client = parent_directory.get_file_client(name)
            local_path = os.path.join(destination, relative_path)

            # Path traversal guard: the resolved path must stay inside destination.
            real_dest = os.path.realpath(destination) + os.sep
            if not os.path.realpath(local_path).startswith(real_dest):
                raise ValueError(f"Path traversal detected: {relative_path}")

            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            with open(local_path, "wb") as f:
                for chunk in file_client.download_file().chunks():
                    f.write(chunk)
    ```
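
The traversal check in step 3 can be factored into a small helper so it's easy to unit test in isolation (the helper name is illustrative, not part of the sample):

```python
import os

def is_within(destination: str, relative_path: str) -> bool:
    """Return True if destination/relative_path resolves to a location
    inside destination, i.e. it can't escape via .. segments or symlinks."""
    real_dest = os.path.realpath(destination) + os.sep
    candidate = os.path.realpath(os.path.join(destination, relative_path))
    return candidate.startswith(real_dest)
```

Appending `os.sep` before the prefix check matters: without it, a sibling directory such as `destination-evil` would incorrectly pass the `startswith` test.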

## Next steps

Choose a tutorial to continue with parsing, chunking, embedding, and querying:

- [LangChain](orchestrations/langchain.md): LangChain + Pinecone, Weaviate, Qdrant
- [LlamaIndex](orchestrations/llamaindex.md): LlamaIndex + Pinecone, Weaviate, Qdrant
- [Haystack](orchestrations/haystack.md): Haystack + Pinecone, Weaviate, Qdrant
