Constructing an Automated Data Pipeline with Azure Data Factory, DevOps, and Machine Learning
This guide outlines the process for creating a robust data pipeline. We will cover data ingestion from a CSV file, its transformation, and subsequent model training, demonstrating a complete end-to-end workflow using Azure services.
You will learn to retrieve data from a source file, store it in Azure Blob Storage, process the data into a usable format, and then use this refined data to train a machine learning model. The final trained model will be saved back to Blob Storage as a Python pickle file.
Prerequisites
To follow this tutorial, you will need the following:
- An active Azure subscription. You can create a free account if you don't have one.
- An operational Azure DevOps organization. If needed, you can sign up for Azure Pipelines.
- Your account must have the Administrator role for service connections within your Azure DevOps project.
- The data file, sample.csv.
- Access to the solution code, available in the data pipeline GitHub repository; fork it to your own GitHub account, since you will import your fork into Azure DevOps later.
- The DevOps for Azure Databricks extension installed in your Azure DevOps organization.
1. Provisioning Azure Resources
First, we will set up the necessary cloud infrastructure using the Azure Cloud Shell.
- Log in to the Azure portal.
- Open the Cloud Shell from the top menu. When prompted, select the Bash environment.
Note: Azure Cloud Shell requires an Azure Storage account to persist files. If this is your first time using it, you'll be guided to create a resource group, storage account, and an Azure Files share. This configuration is automatically used for future sessions.
Selecting an Azure Region
A region is a specific geographic location containing one or more Azure datacenters. To optimize performance and streamline commands, we will set a default region.
List the available regions for your subscription by running the following command in Cloud Shell.
az account list-locations \
--query "[].{Name: name, DisplayName: displayName}" \
--output table
Choose a region from the Name column that is geographically close to you (e.g., westus2).
Set your chosen region as the default, replacing <region> with the name you selected:
az config set defaults.location=<region>
For instance:
az config set defaults.location=westus2
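To confirm the default was applied, you can optionally read the setting back:
az config get defaults.location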
Creating Environment Variables
To simplify the process, we'll create several Bash variables for our resource names.
Generate a random number to ensure unique resource names.
resourceSuffix=$RANDOM
Define globally unique names for your storage account and key vault.
storageName="datacicd${resourceSuffix}"
keyVault="keyvault${resourceSuffix}"
Define variables for your resource group name and region. Set region to the same region you chose as your default.
rgName='data-pipeline-cicd-rg'
region='<region>'
Define variables for your Azure Data Factory and Azure Databricks instances.
datafactorydev='data-factory-cicd-dev'
datafactorytest='data-factory-cicd-test'
databricksname='databricks-cicd-ws'
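As an optional sanity check before creating anything, you can echo the generated names; storage account names must be globally unique and consist of 3-24 lowercase letters and numbers, so a quick glance is worthwhile:
echo "Storage account: $storageName"
echo "Key vault:       $keyVault"
echo "Resource group:  $rgName"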
Creating the Azure Resources
Now, execute the following commands to create the necessary infrastructure.
Create a resource group.
az group create --name $rgName
Create a new storage account.
az storage account create \
--name $storageName \
--resource-group $rgName \
--sku Standard_RAGRS \
--kind StorageV2
Create two blob containers within the storage account: rawdata and prepareddata.
az storage container create -n rawdata --account-name $storageName
az storage container create -n prepareddata --account-name $storageName
Create a new Azure Key Vault.
az keyvault create \
--name $keyVault \
--resource-group $rgName
Create a development Azure Data Factory instance.
az extension add --name datafactory
az datafactory create \
--name $datafactorydev \
--resource-group $rgName
Create a testing Azure Data Factory instance.
az datafactory create \
--name $datafactorytest \
--resource-group $rgName
Create an Azure Databricks workspace.
az extension add --name databricks
az databricks workspace create \
--resource-group $rgName \
--name $databricksname \
--location $region \
--sku trial
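If you want to verify that everything was provisioned before moving on, listing the contents of the resource group is a quick check:
az resource list \
--resource-group $rgName \
--output table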
2. Preparing Data and Credentials
With the infrastructure in place, let's upload the data and store our credentials securely.
- Upload Data: In the Azure portal, navigate to your data-pipeline-cicd-rg resource group, open the storage account, find the rawdata container, and upload the sample.csv file. (A CLI alternative appears after this list.)
- Set up Key Vault: We will use Key Vault to manage all our secrets.
- Generate a Databricks Personal Access Token (PAT):
- Open your Azure Databricks workspace from the Azure portal.
- Inside the Databricks UI, navigate to User Settings and generate a new personal access token. Copy this token securely.
- Get Storage Account Credentials:
- Navigate to your storage account in the Azure portal.
- Under Security + networking, select Access keys.
- Copy the value of key1 and the Connection string.
- Save Credentials to Key Vault: Execute the following commands, replacing the placeholder values with the credentials you just copied.
az keyvault secret set --vault-name "$keyVault" --name "databricks-token" --value "your-databricks-pat"
az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "your-storage-key"
az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "your-storage-connection"
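If you prefer to stay in Cloud Shell, the upload and storage-credential steps above can also be scripted. The sketch below is one possible approach and assumes sample.csv is in your current working directory; the Databricks personal access token still has to be generated in the Databricks UI.
# Fetch the storage key and connection string into shell variables.
storageKey=$(az storage account keys list \
--account-name $storageName \
--resource-group $rgName \
--query "[0].value" --output tsv)
storageConnectString=$(az storage account show-connection-string \
--name $storageName \
--resource-group $rgName \
--query connectionString --output tsv)

# Upload sample.csv to the rawdata container using the key retrieved above.
az storage blob upload \
--account-name $storageName \
--container-name rawdata \
--name sample.csv \
--file sample.csv \
--account-key "$storageKey"

# Store the credentials in Key Vault without copying values by hand.
az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "$storageKey"
az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "$storageConnectString"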
3. Configuring Azure DevOps
Now, we'll set up the Azure DevOps project to manage our CI/CD process.
- Import Code Repository:
- Sign in to your Azure DevOps organization and navigate to your project.
- Go to Repos and select Import.
- Import the forked version of the data pipeline solution from your GitHub account.
- Create a Service Connection:
- In Project Settings, go to Service connections.
- Create a new Azure Resource Manager service connection using Workload identity federation (automatic).
- Select your Azure subscription and the data-pipeline-cicd-rg resource group.
- Name the service connection azure_rm_connection.
- Select the Grant access permission to all pipelines checkbox.
- Create Pipeline Variable Groups:
- In the Pipelines > Library section, create a new variable group named datapipeline-vg. Add the following variables:
Variable Name            Value
LOCATION                 $region
RESOURCE_GROUP           $rgName
DATA_FACTORY_DEV_NAME    $datafactorydev
DATA_FACTORY_TEST_NAME   $datafactorytest
DATABRICKS_NAME          $databricksname
AZURE_RM_CONNECTION      azure_rm_connection
DATABRICKS_URL           your Azure Databricks workspace URL
STORAGE_ACCOUNT_NAME     $storageName
- Create a second variable group named keys-vg.
- Enable the option Link secrets from an Azure key vault as variables.
- Select your Azure subscription and the key vault you created ($keyVault).
- Add all available secrets (databricks-token, StorageConnectString, StorageKey) as variables.
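As an alternative to creating datapipeline-vg through the Library UI, the Azure DevOps CLI extension can create the group from Cloud Shell. The organization, project, and Databricks URL below are placeholders you would replace with your own values; linking keys-vg to Key Vault still needs to be done in the Library UI.
az extension add --name azure-devops
az pipelines variable-group create \
--name datapipeline-vg \
--authorize true \
--organization https://dev.azure.com/<your-organization> \
--project "<your-project>" \
--variables \
LOCATION="$region" \
RESOURCE_GROUP="$rgName" \
DATA_FACTORY_DEV_NAME="$datafactorydev" \
DATA_FACTORY_TEST_NAME="$datafactorytest" \
DATABRICKS_NAME="$databricksname" \
AZURE_RM_CONNECTION="azure_rm_connection" \
DATABRICKS_URL="<your-databricks-workspace-url>" \
STORAGE_ACCOUNT_NAME="$storageName"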
4. Configuring Azure Services
The final configuration steps involve linking our services together.
- Create a Databricks Secret Scope:
- In the Azure portal, navigate to your key vault and copy its Vault URI and Resource ID from the Properties page. (A CLI sketch for retrieving these values appears after this list.)
- Go to https://<your-databricks-workspace-url>#secrets/createScope (your workspace URL with #secrets/createScope appended) and create a new secret scope named testscope, pasting the Vault URI and Resource ID when prompted.
- Create a Databricks Cluster:
- In your Azure Databricks workspace, go to Compute and create a new cluster.
- After creation, view the cluster details and copy the Cluster ID from the URL. (It's the string between /clusters/ and /configuration).
- Configure Data Factory Git Integration:
- Open your dev data factory (data-factory-cicd-dev) and launch the Azure Data Factory Studio.
- In the Manage hub, select Git configuration and set up a code repository.
- Connect to your Azure DevOps Git repo, select the main branch for collaboration, and set the root folder to /azure-data-pipeline/factorydata.
- Grant Data Factory Access to Key Vault (see the CLI sketch after this list):
- In the Azure portal, navigate to your key vault and select Access configuration.
- On the Access policies page, add a new policy.
- Use the Key & Secret Management template.
- For the principal, search for and select your dev data factory's managed identity.
- Repeat the process to add an access policy for the test data factory.
- Update Linked Services in Data Factory:
- In the Data Factory Studio's Manage hub, go to Linked services.
- Update the Azure Key Vault, Azure Blob Storage, and Azure Databricks linked services to connect to your subscription and use the correct credentials and cluster ID.
- Validate and Publish Data Factory:
- In the Author hub, open the DataPipeline pipeline.
- Select Validate to check for errors.
- Select Publish to save all changes to the adf_publish branch of your repository.
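Several of the lookups and permissions in this section can also be handled from Cloud Shell. The sketch below shows one way to read the Key Vault URI and resource ID for the Databricks secret scope and to grant both data factories access to secrets; it assumes the factories were created with a system-assigned managed identity (if principalId comes back empty, use the portal access-policy steps above).
# Key Vault URI and resource ID needed for the Databricks secret scope.
az keyvault show \
--name $keyVault \
--query "{vaultUri: properties.vaultUri, resourceId: id}" \
--output table

# Grant each data factory's managed identity permission to read secrets.
adfDevPrincipal=$(az datafactory show --name $datafactorydev --resource-group $rgName --query identity.principalId --output tsv)
adfTestPrincipal=$(az datafactory show --name $datafactorytest --resource-group $rgName --query identity.principalId --output tsv)
az keyvault set-policy --name $keyVault --object-id $adfDevPrincipal --secret-permissions get list
az keyvault set-policy --name $keyVault --object-id $adfTestPrincipal --secret-permissions get list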
5. Running the CI/CD Pipeline
With all configurations complete, you can now execute the pipeline.
- In Azure DevOps, go to the Pipelines page.
- Select Create Pipeline.
- Choose Azure Repos Git and select your repository.
- Select Existing Azure Pipelines YAML file and set the path to /azure-data-pipeline/data_pipeline_ci_cd.yml.
- Save and run the pipeline. You may need to grant permissions for the pipeline to access your service connections on its first run.
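If you would rather create the pipeline from the command line, the Azure DevOps CLI offers an equivalent. The pipeline name is up to you, and the organization, project, and repository names below are placeholders for your own values; with --skip-first-run you queue the first run yourself after granting any required permissions.
az pipelines create \
--name data-pipeline-ci-cd \
--repository <your-repository-name> \
--repository-type tfsgit \
--branch main \
--yml-path azure-data-pipeline/data_pipeline_ci_cd.yml \
--organization https://dev.azure.com/<your-organization> \
--project "<your-project>" \
--skip-first-run true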
6. Cleaning Up Resources
To avoid ongoing costs, delete the resources when you are finished.
- Delete the data-pipeline-cicd-rg resource group from the Azure portal.
- Delete the Azure DevOps project if it is no longer needed.
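The same cleanup can be done from Cloud Shell. The Key Vault purge is optional and only matters if you plan to reuse the same vault name, since key vaults are soft-deleted by default:
az group delete --name data-pipeline-cicd-rg --yes
# Optional: permanently remove the soft-deleted key vault so its name can be reused.
az keyvault purge --name $keyVault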