Commit e3740f2

Merge pull request #54135 from staleycyn/patch-3
Content drift for the design data integration module
2 parents: d64463d + 0baf827

7 files changed

Lines changed: 93 additions & 46 deletions

learn-pr/wwl-azure/design-data-integration/includes/2-solution-azure-data-factory.md

Lines changed: 16 additions & 4 deletions
@@ -7,9 +7,12 @@
 There are four major steps to create and implement a data-driven workflow in the Azure Data Factory architecture:
 
 1. **Connect and collect**. First, ingest the data to collect all the data from different sources into a centralized location.
-2. **Transform and enrich**. Next, transform the data by using a compute service like Azure Databricks and Azure HDInsight Hadoop.
-3. **Provide continuous integration and delivery (CI/CD) and publish**. Support CI/CD by using GitHub and Azure Pipelines to deliver the ETL process incrementally before publishing the data to the analytics engine.
-4. **Monitor**. Finally, use the Azure portal to monitor the pipeline for scheduled activities and for any failures.
+
+1. **Transform and enrich**. Next, transform the data by using a compute service like Azure Databricks and Azure HDInsight Hadoop.
+
+1. **Provide continuous integration and delivery (CI/CD) and publish**. Support CI/CD by using GitHub and Azure Pipelines to deliver the ETL process incrementally before publishing the data to the analytics engine.
+
+1. **Monitor**. Finally, use the Azure portal to monitor the pipeline for scheduled activities and for any failures.
 
 The following diagram shows how Azure Data Factory orchestrates the ingestion of data from different data sources. Data is ingested into a Storage blob and stored in Azure Synapse Analytics. Analysis and visualization components are also connected to Azure Data Factory. Azure Data Factory provides a common management interface for all of your data integration needs.
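The four steps in the hunk above can be sketched end to end as a minimal orchestration loop. This is a pure-Python illustration only; every function name here is hypothetical and none of them are Azure Data Factory APIs:

```python
# Hypothetical sketch of the four Data Factory steps as a plain-Python
# pipeline; none of these names are real Azure SDK calls.

def connect_and_collect(sources):
    """Step 1: ingest data from every source into one central list."""
    centralized = []
    for source in sources:
        centralized.extend(source)          # e.g. one copy activity per source
    return centralized

def transform_and_enrich(rows):
    """Step 2: stand-in for a Databricks/HDInsight transformation."""
    return [row.strip().lower() for row in rows]

def publish(rows):
    """Step 3: hand the curated rows to the analytics engine."""
    return {"published": rows, "count": len(rows)}

def run_pipeline(sources):
    result = publish(transform_and_enrich(connect_and_collect(sources)))
    # Step 4: monitoring, reduced here to one success record per run.
    monitor_log = {"status": "Succeeded", "rows": result["count"]}
    return result, monitor_log

result, log = run_pipeline([["  Widget A ", "Widget B"], ["WIDGET C"]])
print(log)   # {'status': 'Succeeded', 'rows': 3}
```

In the real service each stage is a pipeline activity rather than a function call, but the data flow (ingest, transform, publish, monitor) is the same.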

@@ -34,16 +37,25 @@ A significant challenge for a fast-growing home improvement retailer like Tailwi
 Let's review how the Azure Data Factory components are involved in a data preparation and movement scenario for Tailwind Traders. They have many different data sources to connect to and that data needs to be ingested and transformed through stored procedures that are run on the data. Finally, the data should be pushed to an analytics platform for analysis.
 
 - In this scenario, the linked service enables Tailwind Traders to ingest data from different sources and it stores connection strings to fire up compute services on demand.
+
 - You can execute stored procedures for data transformation that happens through the linked service in Azure-SSIS, which is the integration runtime environment for Tailwind Traders.
+
 - The datasets components are used by the activity object and the activity object contains the transformation logic.
+
 - You can trigger the pipeline, which is all the activities grouped together.
+
 - You can use Azure Data Factory to publish the final dataset consumed by technologies, such as Power BI or Machine Learning.
 
 ### Things to consider when using Azure Data Factory
 
 Evaluate Azure Data Factory against the following decision criteria and consider how the service can benefit your data integration solution for Tailwind Traders.
 
 - **Consider requirements for data integration**. Azure Data Factory serves two communities: the big data community and the relational data warehousing community that uses SQL Server Integration Services (SSIS). Depending on your organization's data needs, you can set up pipelines in the cloud by using Azure Data Factory. You can access data from both cloud and on-premises data services.
+
 - **Consider coding resources**. If you prefer a graphical interface to set up pipelines, then the Azure Data Factory authoring and monitoring tool is the right fit for your needs. Azure Data Factory provides a low code/no code process for working with data sources.
-- **Consider support for multiple data sources**. Azure Data Factory supports 90+ connectors to integrate with disparate data sources.
+
+- **Consider support for multiple data sources**. Azure Data Factory supports 100+ connectors, including Microsoft Fabric Warehouse and Fabric Lakehouse alongside Azure, AWS, Google Cloud, SaaS, and database sources.
+
 - **Consider serverless infrastructure**. There are advantages to using a fully managed, serverless solution for data integration. There's no need to maintain, configure, or deploy servers, and you gain the ability to scale with fluctuating workloads.
+
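The component relationships described in the scenario bullets above (linked service → dataset → activity → pipeline → trigger) can be sketched as a simplified pipeline definition. The shape below is illustrative only, not the literal ADF/ARM resource schema, and every name in it is hypothetical:

```python
# Hypothetical, simplified shape of the Data Factory building blocks;
# not the literal ADF/ARM resource schema.
pipeline = {
    "name": "TailwindDailyLoad",
    "activities": [
        {
            "name": "RunStoredProcedure",              # activity holds the logic
            "type": "SqlServerStoredProcedure",
            "inputs": ["SalesDataset"],                # dataset used by the activity
            "linkedService": "AzureSqlLinkedService",  # stores the connection string
        }
    ],
    "trigger": {"type": "ScheduleTrigger", "frequency": "Day"},
}

# The pipeline groups its activities; each activity points at datasets,
# and each dataset resolves to a linked service at run time.
activity = pipeline["activities"][0]
print(activity["linkedService"])   # AzureSqlLinkedService
```

The nesting mirrors the prose: datasets and linked services are referenced by activities, and the pipeline is simply the grouped activities plus a trigger.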

learn-pr/wwl-azure/design-data-integration/includes/3-solution-azure-data-lake.md

Lines changed: 13 additions & 4 deletions
@@ -2,19 +2,25 @@ A data lake is a repository of data stored in its natural format, usually as blo
 
 > [!VIDEO https://learn-video.azurefd.net/vod/player?id=b4c743cb-38b8-4d39-a99a-3e7c39803836]
 
-> [!Note]
-> The current implementation of the service is Azure Data Lake Storage Gen2.
+> [!Important]
+> Azure Data Lake Storage Gen1 was retired on February 29, 2024. Existing Gen1 accounts are no longer accessible and new accounts cannot be created. This unit covers Azure Data Lake Storage Gen2 exclusively.
 
 ### Things to know about Azure Data Lake Storage
 
 To better understand Azure Data Lake Storage, let's examine the following characteristics.
 
 - Azure Data Lake Storage can store any type of data by using the data's native format. With support for any data format and massive data sizes, Azure Data Lake Storage can work with structured, semi-structured, and unstructured data.
+
 - The solution is primarily designed to work with Hadoop and all frameworks that use the Apache Hadoop Distributed File System (HDFS) as their data access layer, so those frameworks can access the stored data directly.
+
 - Azure Data Lake Storage supports high throughput for input and output–intensive analytics and data movement.
+
 - The Azure Data Lake Storage access control model supports both Azure role-based access control (RBAC) and Portable Operating System Interface for UNIX (POSIX) access control lists (ACLs).
-- Azure Data Lake Storage utilizes Azure Blob replication models. These models provide data redundancy in a single datacenter with locally redundant storage (LRS).
+
+- Azure Data Lake Storage utilizes Azure Blob replication models. These models support the same redundancy options available for Azure Blob Storage. Microsoft recommends zone-redundant storage (ZRS) for Azure Data Lake Storage workloads.
+
 - Azure Data Lake Storage offers massive storage and accepts numerous data types for analytics.
+
 - Azure Data Lake Storage is priced at Azure Blob Storage levels.
 
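The POSIX-style ACLs mentioned in the access-control bullet above use the familiar read/write/execute bits in entries of the form `scope:name:rwx`. As a rough pure-Python illustration (the real service is managed through the Azure SDK or CLI, not this code), such an entry can be decoded like this:

```python
# Decode a POSIX-style ACL entry of the form "scope:name:rwx", the same
# general shape Azure Data Lake Storage Gen2 uses for its ACLs.
# Pure-Python illustration only; not the Azure SDK.

def parse_acl_entry(entry: str):
    scope, name, perms = entry.split(":")
    if len(perms) != 3:
        raise ValueError(f"expected 3 permission bits, got {perms!r}")
    return {
        "scope": scope,                 # user, group, or other
        "name": name or "(owning)",    # empty name = the owning user/group
        "read": perms[0] == "r",
        "write": perms[1] == "w",
        "execute": perms[2] == "x",    # on a directory: permission to traverse it
    }

entry = parse_acl_entry("user:alice:r-x")
print(entry["read"], entry["write"], entry["execute"])   # True False True
```

RBAC grants coarse rights at the account or container level, while ACL entries like this one apply per file or directory, which is what the "granular access" row in the comparison table refers to.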

#### How Azure Data Lake Storage works
@@ -57,4 +63,7 @@ The following table compares storage solution criteria for using Azure Blob Stor
 | **Geographic redundancy** | Must manually configure data replication | Provides geo-redundant storage by default |
 | **Namespaces** | Supports hierarchical namespaces | Supports flat namespaces |
 | **Hadoop compatibility** | Hadoop services can use data stored in Azure Data Lake | By using Azure Blob Filesystem Driver, applications and frameworks can access data in Azure Blob Storage |
-| **Security** | Supports granular access | Granular access isn't supported |
+| **Security** | Supports granular access | Granular access isn't supported |
+
+> [!Tip]
+> Learn more with self-paced training, [Introduction to Azure Data Lake Storage Gen2](/training/modules/introduction-to-azure-data-lake-storage/).
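One practical consequence of the hierarchical-vs-flat namespace row in the table above: in a flat namespace, "directories" are only key prefixes, so renaming one means rewriting every blob key under the prefix, while a hierarchical namespace can rename the directory in a single metadata operation. A rough pure-Python illustration (not the storage service itself):

```python
# Flat namespace: a directory rename touches every key under the prefix.
# Hierarchical namespace: one metadata operation.
# Pure-Python illustration, not the Azure storage service.

def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Rewrite every key under old_prefix; returns operations performed."""
    ops = 0
    for key in [k for k in blobs if k.startswith(old_prefix)]:
        blobs[new_prefix + key[len(old_prefix):]] = blobs.pop(key)
        ops += 1
    return ops

flat = {"raw/2024/a.csv": b"...", "raw/2024/b.csv": b"...", "curated/c.csv": b"..."}
print(rename_flat(flat, "raw/", "staged/"))   # 2 operations; hierarchical would need 1
```

This per-key cost is one reason analytics frameworks that move or rename whole directories benefit from the hierarchical namespace.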

learn-pr/wwl-azure/design-data-integration/includes/4-solution-azure-data-brick.md

Lines changed: 30 additions & 14 deletions
@@ -6,15 +6,18 @@
 
 Azure Databricks is entirely based on Apache Spark, and it's a great tool for users who are already familiar with the open-source cluster-computing framework. Databricks is designed specifically for big data processing. Data scientists can take advantage of the built-in core API for core languages like SQL, Java, Python, R, and Scala.
 
-Azure Databricks has a Control plane and a Data plane:
+Azure Databricks has a Control plane and a Compute plane:
 
-- **Control Plane**: Hosts Databricks jobs, notebooks with query results, and the cluster manager. The Control plane also has the web application, hive metastore, and security access control lists (ACLs), and user sessions. Microsoft manages these components in collaboration with Azure Databricks.
-- **Data Plane**: Contains all the Azure Databricks runtime clusters that are hosted within the workspace. All data processing and storage exists within the client subscription. No data processing ever takes place within the Microsoft/Databricks-managed subscription.
+- **Control Plane**: Hosts Databricks jobs, notebooks with query results, and the cluster manager. The Control plane also has the web application, security access control lists (ACLs), and user sessions. Microsoft manages these components in collaboration with Azure Databricks.
+
+- **Compute Plane**: Contains all the Azure Databricks runtime clusters that are hosted within the workspace. All data processing and storage exists within the client subscription.
 
 Azure Databricks offers three environments for developing data intensive applications.
 
 - **Databricks SQL**: Azure Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake. You can create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
-- **Databricks Data Science & Engineering**: Azure Databricks Data Science & Engineering is an interactive *workspace* that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time by using Apache Kafka, Azure Event Hubs, or Azure IoT Hub. The data lands in a data lake for long term persisted storage within Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights by using Spark.
+
+- **Databricks Data Science & Engineering**: Azure Databricks Data Science & Engineering lets data teams work together in an interactive workspace. Data is brought into Azure through batch or real-time tools like Azure Data Factory, Kafka, Event Hubs, or IoT Hub. Data is stored in Azure Blob Storage or Data Lake Storage. Databricks reads data from these sources and uses Spark to generate insights.
+
 - **Databricks Machine Learning**: Azure Databricks Machine Learning is an integrated end-to-end machine learning environment. It incorporates managed services for experiment tracking, model training, feature development and management, and feature and model serving.
 
 #### Business scenario
@@ -23,21 +26,34 @@ Let's analyze a scenario for Tailwind Traders in the heavy machinery manufacturi
 
 Let's review why Azure Databricks can be the right choice to meet these requirements.
 
-- Azure Databricks provides an integrated Analytics *workspace* based on Apache Spark that allows collaboration between different users.
-- By using Spark components like Spark SQL and Dataframes, Azure Databricks can handle structured data. It integrates with real-time data ingestion tools like Kafka and Flume for processing streaming data.
-- Secure data integration capabilities built on top of Spark enable you to unify your data without centralization. Data scientists can visualize data in a few steps, and use familiar tools like Matplotlib, ggplot, or d3.
-- The Azure Databricks runtime abstracts out the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. Users can use existing languages skills for Python, Scala, and R, and explore the data.
-- Azure Databricks integrates deeply with Azure databases and stores like Azure Synapse Analytics, Azure Cosmos DB, Azure Data Lake Storage, and Azure Blob Storage. It supports diverse data store platforms, which satisfies the Tailwind Traders big data storage needs.
-- Integration with Power BI allows for quick and meaningful insights, which is a requirement for Tailwind Traders.
-- Azure Databricks SQL isn't the right choice because it can't handle unstructured data.
-- Azure Databricks Machine Learning is also not the right environment choice because machine learning isn't a requirement in this scenario.
+- Azure Databricks is an analytics workspace built on Apache Spark.
+
+- Supports collaboration and handles both structured and streaming data.
+
+- Integrates with real-time tools like Kafka and Flume.
+
+- Lets users work with Python, Scala, or R.
+
+- Connects to Azure databases and storage solutions, meeting big data needs.
+
+- Works with Power BI for fast insights.
+
+- Databricks SQL and Machine Learning aren't suitable here, as unstructured data and machine learning aren't required.
 
 ### Things to consider when using Azure Databricks
 
 You can use Azure Databricks as a solution for multiple scenarios. Consider how the service can benefit your data integration solution for Tailwind Traders.
 
 - **Consider data science preparation of data**. Create, clone, and edit clusters of complex, unstructured data. Turn the data clusters into specific jobs. Deliver the results to data scientists and data analysts for review.
+
 - **Consider insights in the data**. Implement Azure Databricks to build recommendation engines, churn analysis, and intrusion detection.
+
 - **Consider productivity across data and analytics teams**. Create a collaborative environment and shared workspaces for data engineers, analysts, and scientists. Teams can work together across the data science lifecycle with shared workspaces, which helps to save valuable time and resources.
-- **Consider big data workloads**. Exercise Azure Data Lake and the engine to get the best performance and reliability for your big data workloads. Create no-fuss multi-step data pipelines.
-- **Consider machine learning programs**. Take advantage of the integrated end-to-end machine learning environment. It incorporates managed services for experiment tracking, model training, feature development and management, and feature and model serving.
+
+- **Consider big data workloads**. Use Azure Data Lake and the engine to get the best performance and reliability for your big data workloads. Create no-fuss multi-step data pipelines.
+
+- **Consider machine learning programs**. Take advantage of the integrated end-to-end machine learning environment. It incorporates managed services for experiment tracking, model training, feature development and management, and feature and model serving.
+
+> [!Tip]
+> Learn more with self-paced training, [Explore Azure Databricks](/training/wwl-databricks/explore-azure-databricks/).
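The Spark work Databricks runs for a scenario like the machinery example above boils down to map/filter/aggregate over distributed records. As a rough single-machine stand-in (pure Python, no Spark cluster; the machine names and threshold are invented), a failure-flagging aggregation might look like:

```python
# Single-machine stand-in for a Spark-style aggregation: group sensor
# readings by machine and flag anomalies. Illustrative only; on
# Databricks this would be a Spark DataFrame groupBy/agg.
from collections import defaultdict

readings = [
    ("press-01", 78.2), ("press-01", 81.5), ("press-02", 120.4),
    ("press-02", 119.8), ("press-01", 79.9),
]

def summarize(rows, threshold=100.0):
    grouped = defaultdict(list)
    for machine, temp in rows:                  # "map" stage: key by machine
        grouped[machine].append(temp)
    return {                                    # "reduce" stage: aggregate
        machine: {
            "avg": round(sum(temps) / len(temps), 1),
            "alert": max(temps) > threshold,    # simple anomaly flag
        }
        for machine, temps in grouped.items()
    }

summary = summarize(readings)
print(summary["press-02"])   # {'avg': 120.1, 'alert': True}
```

On a cluster, Spark partitions `readings` across workers and performs the same shuffle-and-aggregate; the per-group logic is unchanged.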

learn-pr/wwl-azure/design-data-integration/includes/5-solution-azure-synapse-analytics.md

Lines changed: 18 additions & 6 deletions
@@ -22,11 +22,15 @@ Azure Synapse Analytics is composed of the five elements:
 
 :::image type="content" source="../media/azure-synapse-analytics-overview.png" alt-text="Diagram that shows an overview of Azure Synapse Analytics capabilities." border="false":::
 
-- **Azure Synapse SQL pool**: Synapse SQL offers both serverless and dedicated resource models to work with a node-based architecture. For predictable performance and cost, you can create dedicated SQL pools. For irregular or unplanned workloads, you can use the always-available, serverless SQL endpoint.
-- **Azure Synapse Spark pool**: This pool is a cluster of servers that run Apache Spark to process data. You write your data processing logic by using one of the four supported languages: Python, Scala, SQL, and C# (via .NET for Apache Spark). Apache Spark for Azure Synapse integrates Apache Spark (the open source big data engine used for data preparation, data engineering, ETL, and machine learning).
-- **Azure Synapse Pipelines**: Azure Synapse Pipelines applies the capabilities of Azure Data Factory. Pipelines are the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. You can include activities that transform the data as it's transferred, or you can combine data from multiple sources together.
-- **Azure Synapse Link**: This component allows you to connect to Azure Cosmos DB. You can use it to perform near real-time analytics over the operational data stored in an Azure Cosmos DB database.
-- **Azure Synapse Studio**: This element is a web-based IDE that can be used centrally to work with all capabilities of Azure Synapse Analytics. You can use Azure Synapse Studio to create SQL and Spark pools, define and run pipelines, and configure links to external data sources.
+- **Azure Synapse SQL pool**: Choose between dedicated SQL pools for consistent performance and cost, or serverless SQL endpoints for flexible, on-demand workloads.
+
+- **Azure Synapse Spark pool**: Run Apache Spark clusters to process data using Python, Scala, SQL, or C#.
+
+- **Azure Synapse Pipelines**: Use cloud-based ETL workflows to move and transform data at scale, combining multiple sources if needed.
+
+- **Azure Synapse Link**: Connect to Azure Cosmos DB for near real-time analytics on operational data.
+
+- **Azure Synapse Studio**: Work in a central web-based IDE to manage SQL and Spark pools, pipelines, and data links.
 
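The dedicated-vs-serverless choice in the SQL pool bullet above is essentially a utilization question: a pool you keep busy amortizes its fixed cost, while occasional queries are cheaper paid per scan. The unit prices below are hypothetical placeholders, not real Azure rates; only the break-even logic is the point:

```python
# Break-even sketch for dedicated vs serverless SQL pools. The two unit
# prices below are hypothetical placeholders, not Azure pricing.

DEDICATED_COST_PER_HOUR = 1.20     # fixed: pay while the pool runs
SERVERLESS_COST_PER_TB = 5.00      # pay per TB of data scanned

def cheaper_option(hours_running: float, tb_scanned: float) -> str:
    dedicated = hours_running * DEDICATED_COST_PER_HOUR
    serverless = tb_scanned * SERVERLESS_COST_PER_TB
    return "dedicated" if dedicated < serverless else "serverless"

# Steady, predictable workload: dedicated wins.
print(cheaper_option(hours_running=8, tb_scanned=4))    # dedicated
# A month of mostly idle time with occasional ad hoc queries: serverless wins.
print(cheaper_option(hours_running=720, tb_scanned=2))  # serverless
```

This mirrors the guidance in the bullet: predictable performance and cost favors dedicated pools; irregular or unplanned workloads favor the serverless endpoint.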

#### Analytical options

@@ -50,14 +54,22 @@ The following table compares storage solution criteria for using Azure Data Fact
 
 | Compare | Azure Data Factory | Azure Synapse Analytics |
 | ------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
 | **Data sharing** | Data can be shared across different data factories | Not supported |
-| **Solution templates** | Solution templates are provided with the Azure Data Factory template gallery | Solution templates are provided in the Synapse Workspace Knowledge center |
+| **Solution templates** | Solution templates are provided with the Azure Data Factory template gallery | Solution templates are provided in the Synapse Workspace Knowledge Center |
 | **Integration runtime cross region flows** | Cross region data flows are supported | Not supported |
 | **Monitor data** | Data monitoring is integrated with Azure Monitor | Diagnostic logs are available in Azure Monitor |
 | **Monitor Spark Jobs for data flow** | Not supported | Spark Jobs can be monitored for data flow by using Synapse Spark pools |
 
 Azure Synapse Analytics is an ideal solution for many other scenarios. Consider the following options:
 
 - **Consider variety of data sources**. When you have various data sources, use Azure Synapse Analytics for code-free ETL and data flow activities.
+
 - **Consider Machine Learning**. When you need to implement Machine Learning solutions by using Apache Spark, you can use Azure Synapse Analytics for built-in support for Azure Machine Learning.
+
 - **Consider data lake integration**. When you have existing data stored on a data lake and need integration with Azure Data Lake and other input sources, Azure Synapse Analytics provides seamless integration between the two components.
+
 - **Consider real-time analytics**. When you require real-time analytics, you can use features like Azure Synapse Link to analyze data in real-time and offer insights.
+
+- **Consider Microsoft Fabric**. Microsoft recommends Microsoft Fabric over new Synapse deployments.
+
+> [!Tip]
+> Learn more with self-paced training, [Introduction to end-to-end analytics using Microsoft Fabric](/training/modules/introduction-end-analytics-use-microsoft-fabric/).
