---
title: Create a tool definition for Microsoft Discovery
description: Learn how to write a tool definition YAML file that describes how Microsoft Discovery deploys, configures, and invokes your containerized tool.
author: mukesh-dua
ms.author: mukeshdua
ms.service: azure
ms.topic: how-to
ms.date: 04/07/2026

#CustomerIntent: As a tool publisher, I want to write a tool definition YAML file so that Microsoft Discovery can correctly deploy, configure, and invoke my containerized tool within investigations.
---

# Create a tool definition for Microsoft Discovery
A tool definition is a YAML file that serves as the integration contract between your containerized tool and Microsoft Discovery. It tells the platform where your container image is, what compute resources the tool needs, and how to invoke each operation the tool exposes.

This article explains each section of a tool definition and provides complete examples for the three supported tool types: action-based, code environment, and hybrid.

> [!NOTE]
> This article assumes your container image is already published to Azure Container Registry. See [Publish a tool container image to Azure Container Registry](how-to-publish-tool-to-acr.md).

## Prerequisites

- A container image published to Azure Container Registry (ACR).
- The full ACR image reference for your tool (for example, `myregistry.azurecr.io/my-tool:v1.0.0`).
- Benchmarked compute resource requirements for your tool.
## Step 1: Create the metadata section

Start your tool definition with the basic metadata. This information appears in the Discovery tool catalog.

```yaml
name: my-analysis-tool          # Unique identifier for the tool
description: >
  A tool that performs molecular analysis including functional group
  identification and hazard screening. Accepts SMILES, CSV, or JSON input.
version: 1.0.0                  # Semantic version; increment when making breaking changes
category: Scientific Computing  # Category for organizing tools in the catalog
license: MIT                    # License for this tool definition
```

- Use a clear, lowercase `name` with hyphens rather than spaces. If you maintain multiple versions, include the version in the name (for example, `my-tool-v2`).
- Write a `description` that explains what the tool does in enough detail for agents to understand when to invoke it. Agents use this description to decide which tool is appropriate for a given task.
## Step 2: Define the infrastructure

The `infra` section specifies the container image and compute resources.

```yaml
infra:
  - name: worker
    infra_type: container
    image:
      acr: myregistry.azurecr.io/my-analysis-tool:v1.0.0
    compute:
      min_resources:
        cpu: 4          # Cores (integer) or millicores (for example, 4000m)
        ram: 16Gi       # Memory in GiB
        storage: 64Gi   # Scratch storage in GiB
        gpu: 0          # Integer; 0 means no GPU
      max_resources:
        cpu: 8
        ram: 32Gi
        storage: 128Gi
        gpu: 0
      infiniband: false # Set true for tightly coupled MPI workloads
      recommended_sku:
        - Standard_D4_v4
        - Standard_D8_v4
    pool_type: static   # Only supported pool type during preview
    pool_size: 1        # Number of container instances to run
```

**Resource sizing guidance:**

| Field | Guidance |
|---|---|
| `min_resources` | The minimum resources your tool needs to run. Must account for platform overhead. |
| `max_resources` | The maximum your tool may use under peak load. If the tool exceeds the memory limit, it's forcefully stopped. |
| `recommended_sku` | Suggest Azure Virtual Machine (VM) SKUs that match your resource profile. The platform uses this field as a hint when scheduling. |
| `pool_size` | For parallel workloads that run many simultaneous instances, increase this value. For most tools, `1` is correct. |

> [!NOTE]
> Dynamic GPU sharing isn't currently supported. When a tool definition specifies GPUs, the `min_resources.gpu` value is used for scheduling.
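Before registering a definition, it's worth sanity-checking that every `max_resources` value is at least the corresponding `min_resources` value. The following is a minimal sketch of such a check; the `parse_qty` helper and the inline `compute` dictionary are illustrative, not part of the Discovery tooling:

```python
import re

def parse_qty(value):
    """Normalize a compute quantity to a float: '4000m' millicores -> 4.0 cores,
    '16Gi' -> 16.0; plain numbers pass through unchanged."""
    if isinstance(value, (int, float)):
        return float(value)
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(m|Gi)?", str(value))
    if not m:
        raise ValueError(f"Unrecognized quantity: {value!r}")
    num, unit = float(m.group(1)), m.group(2)
    return num / 1000 if unit == "m" else num

def check_compute(compute):
    """Return the list of fields where max_resources < min_resources."""
    lo, hi = compute["min_resources"], compute["max_resources"]
    return [k for k in lo if parse_qty(hi[k]) < parse_qty(lo[k])]

compute = {
    "min_resources": {"cpu": "4000m", "ram": "16Gi", "storage": "64Gi", "gpu": 0},
    "max_resources": {"cpu": 8, "ram": "32Gi", "storage": "128Gi", "gpu": 0},
}
print(check_compute(compute))  # → []
```

An empty list means the limits are consistent; any field name in the list indicates a ceiling set below its floor.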
## Step 3a: Define actions (action-based and hybrid tools)

Add an `actions` section for each discrete operation your tool exposes. Each action needs a name, a description, an input schema, and a command template.

```yaml
actions:
  - name: identify_functional_groups
    description: >
      Identifies common functional groups in molecular structures including
      carbonyls, amines, alcohols, ethers, and halides. Accepts SMILES (.smi),
      CSV, or JSON input files. Writes detailed results to a CSV and a summary
      to results.json in the output directory.
    infra_node: worker
    input_schema:
      type: object
      properties:
        input_directory:
          type: string
          description: "Directory containing input files (SMILES, CSV, or JSON format)."
        output_directory:
          type: string
          description: "Directory where analysis results and output files are written."
        column_name:
          type: string
          description: >
            For CSV or TSV input files, the name of the column that contains
            SMILES strings. Defaults to 'smiles' if not provided.
        batch_size:
          type: number
          description: "Number of molecules to process per batch. Defaults to 100."
        file_pattern:
          type: string
          description: "Glob pattern to filter files in the input directory. Defaults to '*.*'."
      required:
        - input_directory
        - output_directory
    command: >
      python3 /app/entrypoint.py
      --action identify_functional_groups
      --input {{input_directory}}
      --output {{output_directory}}
      {{#if column_name}}--column-name {{column_name}}{{/if}}
      {{#if batch_size}}--batch-size {{batch_size}}{{/if}}
      {{#if file_pattern}}--file-pattern {{file_pattern}}{{/if}}
    environment_variables:
      - name: TOOL_INPUT_DIR
        value: "{{ input_directory }}"
      - name: TOOL_OUTPUT_DIR
        value: "{{ output_directory }}"
    output_mount_configurations:
      - mount_path: "{{ output_directory }}"
        auto_promote: false
        output_name: "FunctionalGroupResults"
        output_description: "Functional group analysis results"
```

**Action fields:**

| Field | Required | Description |
|---|---|---|
| `name` | Yes | Unique identifier for the action within the tool. |
| `description` | Yes | Explains what the action does, what inputs it expects, and what outputs it produces. Agents use this description to decide when to invoke the action. |
| `infra_node` | Yes | Which infrastructure node runs this action. Must match a `name` in the `infra` section. |
| `input_schema` | Yes | JSON Schema describing all parameters the action accepts. |
| `input_schema.required` | Yes | Array of parameter names that must always be provided. |
| `command` | Yes | Command template executed in the container. Uses `{{parameter}}` to insert values and `{{#if parameter}}...{{/if}}` for optional parameters. |
| `environment_variables` | No | Environment variables set in the container before the command runs. |
| `output_mount_configurations` | No | Directories to capture after the action runs. Set `auto_promote: true` to automatically share outputs as storage assets without the agent calling `ShareResource`. |

**`output_mount_configurations` fields:**

| Field | Required | Description |
|---|---|---|
| `mount_path` | Yes | Absolute path in the container to capture after execution. |
| `auto_promote` | Yes | If `true`, the platform automatically creates a storage asset from the captured directory after each run. If `false`, the agent must call `ShareResource` to share the outputs. |
| `output_name` | Yes | Display name for the storage asset created when `auto_promote` is `true`. |
| `output_description` | Yes | Description of the storage asset. |
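The `{{parameter}}` and `{{#if parameter}}...{{/if}}` syntax is Handlebars-style templating. As a rough mental model of how a command expands, here's a minimal sketch; it only approximates the platform's renderer and isn't the actual Discovery implementation:

```python
import re

def expand(template, params):
    """Approximate Handlebars-style expansion: resolve {{#if name}}...{{/if}}
    blocks first, then substitute the remaining {{name}} placeholders."""
    def if_block(m):
        name, body = m.group(1), m.group(2)
        return body if params.get(name) not in (None, "") else ""
    out = re.sub(r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}", if_block, template, flags=re.S)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(params.get(m.group(1), "")), out)

cmd = ("python3 /app/entrypoint.py --action identify_functional_groups "
      "--input {{input_directory}} --output {{output_directory}} "
      "{{#if batch_size}}--batch-size {{batch_size}}{{/if}}")

# batch_size is omitted, so its {{#if}} block disappears from the command.
print(expand(cmd, {"input_directory": "/data/in", "output_directory": "/data/out"}))
```

Expanding each command template this way with representative values is a quick check that optional flags only appear when their parameters are supplied.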
## Step 3b: Define code environments (code environment and hybrid tools)

Add a `code_environments` section to allow agents to write and execute custom scripts using your container's installed libraries.

```yaml
code_environments:
  - language: python
    command: "python \"/{{scriptName}}\""
    description: >
      Python 3.11 environment with RDKit, ASE, pandas, NumPy, SciPy, and
      scikit-learn pre-installed. Use this environment to write custom
      molecular analysis scripts.
    infra_node: worker
```

When an agent uses a code environment, the Discovery platform generates a Python script, mounts it into the container at the path specified by `{{scriptName}}`, and executes it using the `command` template.
## Complete examples

### Example 1: Action-based tool

A tool that exposes two specific molecular analysis operations:

```yaml
name: molecular-groups-analyzer
description: >
  Analyzes molecular structures from SMILES input to identify functional groups
  and screen for hazardous chemical groups. Accepts SMILES (.smi), CSV, or JSON
  input files.
version: 1.0.0
category: Cheminformatics
license: MIT

infra:
  - name: worker
    infra_type: container
    image:
      acr: myregistry.azurecr.io/molecular-groups-analyzer:v1.0.0
    compute:
      min_resources:
        cpu: 2
        ram: 8Gi
        storage: 32Gi
        gpu: 0
      max_resources:
        cpu: 4
        ram: 16Gi
        storage: 64Gi
        gpu: 0
      recommended_sku:
        - Standard_D4s_v3
    pool_type: static
    pool_size: 1

actions:
  - name: identify_functional_groups
    description: >
      Identifies common functional groups in molecules (carbonyls, amines,
      alcohols, ethers, halides). Accepts SMILES, CSV, or JSON input. Writes
      a detailed CSV and results.json summary to the output directory.
    infra_node: worker
    input_schema:
      type: object
      properties:
        input_directory:
          type: string
          description: "Directory containing input molecule files."
        output_directory:
          type: string
          description: "Directory to write analysis results."
        column_name:
          type: string
          description: "Column name for SMILES strings in CSV files. Defaults to 'smiles'."
        batch_size:
          type: number
          description: "Molecules per batch. Defaults to 100."
      required:
        - input_directory
        - output_directory
    command: >
      python3 /app/entrypoint.py --action identify_functional_groups
      --input {{input_directory}} --output {{output_directory}}
      {{#if column_name}}--column-name {{column_name}}{{/if}}
      {{#if batch_size}}--batch-size {{batch_size}}{{/if}}

  - name: identify_hazardous_groups
    description: >
      Screens molecules for hazardous functional groups including explosives,
      PFAS, Chemical Weapons Convention (CWC) compounds, and reactive groups.
      Accepts SMILES, CSV, or JSON input.
    infra_node: worker
    input_schema:
      type: object
      properties:
        input_directory:
          type: string
          description: "Directory containing input molecule files."
        output_directory:
          type: string
          description: "Directory to write hazard assessment results."
        categories:
          type: string
          description: >
            Comma-separated list of hazard categories to screen, or 'all' to
            screen all categories. Supported values: us_pfas_groups, cwc_groups,
            explosive_groups, self_reactive_groups, pnnl_hazardous_groups.
            Defaults to 'all'.
        batch_size:
          type: number
          description: "Molecules per batch. Defaults to 100."
      required:
        - input_directory
        - output_directory
    command: >
      python3 /app/entrypoint.py --action identify_hazardous_groups
      --input {{input_directory}} --output {{output_directory}}
      {{#if categories}}--categories {{categories}}{{/if}}
      {{#if batch_size}}--batch-size {{batch_size}}{{/if}}
```

### Example 2: Code environment tool

A tool that exposes a Python runtime with preinstalled molecular analysis libraries:

```yaml
name: moltoolkit
description: >
  A comprehensive molecular analysis toolkit providing Python 3.11 with RDKit,
  ASE, Biopython, pandas, NumPy, PyMOL, and OpenBabel pre-installed. Use the
  Python code environment to write custom molecular analysis scripts.
version: 1.0.0
category: Scientific Computing
license: MIT

infra:
  - name: worker
    infra_type: container
    image:
      acr: myregistry.azurecr.io/moltoolkit:v1.0.0
    compute:
      min_resources:
        cpu: 1
        ram: 8Gi
        storage: 8Gi
        gpu: 0
      max_resources:
        cpu: 2
        ram: 16Gi
        storage: 32Gi
        gpu: 0
      recommended_sku:
        - Standard_D4s_v3
    pool_type: static
    pool_size: 1

code_environments:
  - language: python
    command: "python \"/{{scriptName}}\""
    description: >
      Python 3.11 environment with RDKit, ASE, Biopython, MDAnalysis,
      pandas, NumPy, SciPy, PyMOL, and OpenBabel. Use for custom molecular
      analysis, conformer generation, descriptor calculation, and data processing.
    infra_node: worker
```
## Step 4: Validate the tool definition

Before registering your tool in Discovery, validate the YAML structure:

1. **Syntax check**: Run the YAML through a validator (for example, `python -c "import yaml; yaml.safe_load(open('tool-definition.yaml'))"`) to catch formatting errors.

2. **Command template check**: Manually expand each `command` template with representative parameter values and verify that the resulting command matches what your container expects.

3. **Required parameters**: Confirm that every parameter referenced in `command` is listed in `input_schema.properties` and that any required parameters appear in the `required` array.

4. **Image reference**: Confirm that the `image.acr` value matches the exact tag you pushed to ACR.
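Checks 1 and 3 can be scripted together. This sketch uses PyYAML (`pip install pyyaml`) and an inline definition for illustration; in practice you would load your own `tool-definition.yaml`, and the `undeclared_params` helper is a hypothetical name, not part of the Discovery tooling. The example definition deliberately uses `{{batch_size}}` without declaring it, so the check flags it:

```python
import re
import yaml  # third-party: pip install pyyaml

definition = """
actions:
  - name: identify_functional_groups
    input_schema:
      type: object
      properties:
        input_directory: {type: string}
        output_directory: {type: string}
      required: [input_directory, output_directory]
    command: >
      python3 /app/entrypoint.py --input {{input_directory}}
      --output {{output_directory}}
      {{#if batch_size}}--batch-size {{batch_size}}{{/if}}
"""

tool = yaml.safe_load(definition)  # raises yaml.YAMLError on syntax problems

def undeclared_params(action):
    """Placeholders used in the command but missing from input_schema.properties."""
    props = set(action["input_schema"]["properties"])
    used = set(re.findall(r"\{\{#?(?:if )?(\w+)\}\}", action["command"]))
    return used - props

for action in tool["actions"]:
    print(action["name"], undeclared_params(action))
    # → identify_functional_groups {'batch_size'}
```

An empty set for every action means the command templates and schemas agree.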
## Step 5: Register the tool in Microsoft Discovery

After validating the definition, register the tool as a resource in your Discovery workspace. You can register a tool through the Azure portal or through the REST API.

To do so, convert the tool definition YAML created in [Step 3](#step-3a-define-actions-action-based-and-hybrid-tools) to the corresponding JSON and provide that JSON as input when you create the Discovery Tool resource.
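The YAML-to-JSON conversion is mechanical. A minimal sketch with PyYAML (`pip install pyyaml`), using an inline definition for illustration (in practice, read your `tool-definition.yaml` file instead):

```python
import json
import yaml  # third-party: pip install pyyaml

definition_yaml = """
name: my-analysis-tool
version: 1.0.0
category: Scientific Computing
"""

# Parse the YAML, then re-serialize the same structure as JSON.
tool = yaml.safe_load(definition_yaml)
definition_json = json.dumps(tool, indent=2)
print(definition_json)
```

The resulting JSON document is what you supply when creating the Discovery Tool resource.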
## Related content

- [Plan tool requirements for Microsoft Discovery](how-to-plan-tool-requirements.md)
- [Write action scripts for a Discovery tool](how-to-write-tool-action-scripts.md)
- [Create a Dockerfile for a Discovery tool](how-to-create-tool-docker-file.md)
- [Publish a tool container image to Azure Container Registry](how-to-publish-tool-to-acr.md)
- [Manage data handling with tools and agents](how-to-data-handling-with-tools-agents.md)
