---
title: Prepare Azure Files data for document-based RAG applications with open-source frameworks
description: Learn how to authenticate to an Azure file share and download files for ingestion into a document-based RAG application using open-source frameworks.
author: ftrichardson1
ms.service: azure-file-storage
ms.topic: how-to
ms.date: 04/08/2026
ms.author: t-flynnr
ms.custom: devx-track-python
---

# Prepare Azure Files data for document-based RAG applications using open-source AI tooling

This article describes how to authenticate to an Azure file share and download its contents for use with open-source retrieval-augmented generation (RAG) tooling.

## Prerequisites

- An [Azure file share](/azure/storage/files/create-classic-file-share?tabs=azure-portal) containing the documents you want to query. If you don't have an Azure subscription, [create one for free](https://azure.microsoft.com/free/).
- [Python 3.12.10](https://www.python.org/downloads/release/python-31210/). On Windows, install the **x64** version.
- [Azure CLI](/cli/azure/install-azure-cli).

## Grant access to an Azure file share

This article uses Microsoft Entra ID authentication via [`DefaultAzureCredential`](/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview), the recommended credential pattern for Azure software development kits (SDKs). This approach avoids storage account keys and provides a portable authentication mechanism that works across development and production environments.

> [!TIP]
> Receiving a `403 Forbidden` error typically indicates missing authorization rather than failed authentication.
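
To see why the distinction matters when troubleshooting, here's a small triage helper (purely illustrative; not part of any Azure SDK) that maps the two status codes to their likely causes:

```python
def diagnose_auth_error(status_code: int) -> str:
    """Rough triage for common Azure Storage auth failures (illustrative only)."""
    if status_code == 401:
        return "authentication failed: the credential is missing, expired, or invalid"
    if status_code == 403:
        return "authorization failed: check RBAC role assignments on the storage account"
    return "not an authentication or authorization error"

print(diagnose_auth_error(403))  # → authorization failed: check RBAC role assignments on the storage account
```

In practice, you'd read the `status_code` attribute of the `HttpResponseError` that the SDK raises; the role assignment below is what resolves the `403` case.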

Assign the [**Storage File Data Privileged Reader**](/azure/storage/files/authorize-oauth-rest?tabs=portal#privileged-access-and-access-permissions-for-data-operations) role to your user account on the storage account that hosts the file share.

> [!NOTE]
> This role is required because the code accesses Azure Files using [`token_intent="backup"`](/python/api/azure-storage-file-share/azure.storage.fileshare.shareclient#keyword-only-parameters). This access pattern bypasses file-level permissions, so Azure requires a privileged role. The **Storage File Data Privileged Reader** role is sufficient because the code performs only read operations and doesn't modify file contents.

#### Azure portal

1. Sign in to the [Azure portal](https://portal.azure.com) and navigate to your storage account.
2. Select **Access control (IAM)** > **Add** > **Add role assignment**.
3. Search for **Storage File Data Privileged Reader**, select it, and select **Next**.
4. Select **Select members**, search for your user account, and select it.
5. Select **Review + assign**.
| 41 | + |
| 42 | +#### Azure CLI |
| 43 | + |
| 44 | +```bash |
| 45 | +az login |
| 46 | + |
| 47 | +az role assignment create \ |
| 48 | + --assignee $(az ad signed-in-user show --query id -o tsv) \ |
| 49 | + --role "Storage File Data Privileged Reader" \ |
| 50 | + --scope $(az storage account show \ |
| 51 | + --name <your-storage-account-name> \ |
| 52 | + --query id -o tsv) |
| 53 | +``` |
| 54 | + |
| 55 | +#### Azure PowerShell |
| 56 | + |
| 57 | +```powershell |
| 58 | +Connect-AzAccount |
| 59 | +
|
| 60 | +New-AzRoleAssignment ` |
| 61 | + -SignInName (Get-AzADUser -SignedIn).UserPrincipalName ` |
| 62 | + -RoleDefinitionName "Storage File Data Privileged Reader" ` |
| 63 | + -Scope (Get-AzStorageAccount ` |
| 64 | + -ResourceGroupName <your-resource-group> ` |
| 65 | + -Name <your-storage-account-name>).Id |
| 66 | +``` |

## Set environment variables

Create a `.env` file in your project directory with your Azure Files connection details:

```text
AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
AZURE_STORAGE_SHARE_NAME=<your-share-name>
```

| Variable | Description |
| :--- | :--- |
| `AZURE_STORAGE_ACCOUNT_NAME` | The name of your Azure Storage account |
| `AZURE_STORAGE_SHARE_NAME` | The name of your Azure file share |

The sample code reads these values from the process environment, so either export the variables in your shell or load the `.env` file before running the code (for example, with the [python-dotenv](https://pypi.org/project/python-dotenv/) package).
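
If you'd rather load the `.env` file without an extra dependency, a minimal loader for this simple `KEY=value` format could look like the following sketch (real `.env` files also support quoting and multiline values; the `python-dotenv` package handles those cases):

```python
import os

def load_env_file(path: str) -> None:
    """Minimal .env loader: KEY=value lines; skips blanks and # comments."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't overwrite variables already set in the shell.
            os.environ.setdefault(key.strip(), value.strip())

# load_env_file(".env")
```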

## Download files from an Azure file share

1. Install the required packages:

    - `azure-identity`—provides `DefaultAzureCredential` for passwordless authentication.
    - `azure-storage-file-share`—provides the [`ShareClient`](/python/api/azure-storage-file-share/azure.storage.fileshare.shareclient) used to connect to and download files from the share.

    ```bash
    pip install azure-identity
    pip install azure-storage-file-share
    ```

2. Connect to an Azure file share, recursively enumerate its directory structure, and collect the details required to locate and download each file. The `ShareClient` requires `token_intent="backup"` when using Microsoft Entra ID–based authentication.

    ```python
    import os
    import posixpath
    import tempfile

    from azure.identity import DefaultAzureCredential
    from azure.storage.fileshare import ShareClient

    account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]
    share_name = os.environ["AZURE_STORAGE_SHARE_NAME"]

    share = ShareClient(
        account_url=f"https://{account_name}.file.core.windows.net",
        share_name=share_name,
        credential=DefaultAzureCredential(),
        token_intent="backup",
    )

    root = share.get_directory_client("")
    file_references = []
    directories_to_traverse = [root]

    while directories_to_traverse:
        current = directories_to_traverse.pop()
        for item in current.list_directories_and_files():
            if item.is_directory:
                directories_to_traverse.append(current.get_subdirectory_client(item.name))
            else:
                # Azure Files paths use posix-style separators.
                relative_path = posixpath.join(current.directory_path or "", item.name)
                file_references.append((item.name, relative_path, current))
    ```

3. Download the files. Before writing files to disk, the code validates each resolved file path to ensure it remains within the destination directory. This validation prevents files from being written outside the intended location when processing directory structures from an external source.

    ```python
    with tempfile.TemporaryDirectory() as destination:
        for name, relative_path, parent_directory in file_references:
            file_client = parent_directory.get_file_client(name)
            local_path = os.path.join(destination, relative_path)

            # Path traversal guard
            real_dest = os.path.realpath(destination) + os.sep
            if not os.path.realpath(local_path).startswith(real_dest):
                raise ValueError(f"Path traversal detected: {relative_path}")

            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            with open(local_path, "wb") as f:
                for chunk in file_client.download_file().chunks():
                    f.write(chunk)
    ```
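
The stack-based traversal in step 2 can be sketched against a plain in-memory tree (dicts standing in for directories, strings for files). This is an illustrative mirror of the loop's logic, not SDK code:

```python
import posixpath

def walk_tree(tree: dict) -> list[str]:
    """Collect share-relative file paths from a nested dict, like the traversal in step 2."""
    files = []
    to_traverse = [("", tree)]  # (directory path, directory contents)
    while to_traverse:
        path, contents = to_traverse.pop()
        for name, value in contents.items():
            if isinstance(value, dict):  # subdirectory: visit later
                to_traverse.append((posixpath.join(path, name), value))
            else:  # file: record its relative path
                files.append(posixpath.join(path, name))
    return sorted(files)

share_contents = {"reports": {"q1.pdf": "...", "q2.pdf": "..."}, "readme.txt": "..."}
print(walk_tree(share_contents))  # → ['readme.txt', 'reports/q1.pdf', 'reports/q2.pdf']
```

As in the real loop, directories go onto a stack and files are recorded with posix-style relative paths, so no recursion depth limit applies.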
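
The path traversal guard in step 3 can be factored into a standalone helper and exercised directly. This sketch uses the same `realpath` prefix check as the download loop:

```python
import os

def is_within_directory(destination: str, relative_path: str) -> bool:
    """Return True if destination/relative_path resolves inside destination."""
    real_dest = os.path.realpath(destination) + os.sep
    candidate = os.path.realpath(os.path.join(destination, relative_path))
    return candidate.startswith(real_dest)

print(is_within_directory("downloads", "docs/report.pdf"))   # → True
print(is_within_directory("downloads", "../../etc/passwd"))  # → False
```

A share you manage shouldn't contain `..` segments, but validating each path before writing keeps the ingestion step safe even if the share's contents come from elsewhere.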

## Next steps

Choose a tutorial to continue with parsing, chunking, embedding, and querying:

- [LangChain](orchestrations/langchain.md)—LangChain + Pinecone, Weaviate, Qdrant
- [LlamaIndex](orchestrations/llamaindex.md)—LlamaIndex + Pinecone, Weaviate, Qdrant
- [Haystack](orchestrations/haystack.md)—Haystack + Pinecone, Weaviate, Qdrant