How to: Generate a wiki out of your existing repository

14 Feb 2024


Automating Documentation with ADF API and Python

Documentation is essential for any project and beneficial for operations. It needs to be a living entity and an ongoing activity: documentation that is written once quickly becomes outdated. This article explains how to export pipeline details automatically from Azure Data Factory (ADF) and generate Markdown documents using Python code.

To automate the documentation generation for Azure Data Factory pipelines, we can leverage the ADF API, a set of RESTful web services Azure provides for programmatically interacting with Data Factory resources. Python, with its rich ecosystem of libraries, can be used to make API calls and process the retrieved information to create comprehensive Markdown documentation.
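As a minimal sketch of such an API call, the snippet below builds the URL of the Data Factory REST operation that returns a single pipeline definition and fetches it with the standard library. The function names are illustrative and not taken from the original script. How the bearer token is obtained (for example through an Azure identity library) is left open here and is an assumption of this sketch.

```python
import json
import urllib.request
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-06-01"  # Data Factory REST API version


def pipeline_url(subscription_id, resource_group_name, factory_name, pipeline_name):
    """Build the management URL that returns one pipeline definition."""
    return (
        f"{ARM_ENDPOINT}/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group_name, safe='')}"
        f"/providers/Microsoft.DataFactory/factories/{quote(factory_name, safe='')}"
        f"/pipelines/{quote(pipeline_name, safe='')}?api-version={API_VERSION}"
    )


def get_pipeline(url, access_token):
    """Fetch the pipeline definition as a dict (performs a network call).

    access_token is assumed to be a valid bearer token for Azure Resource Manager.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder subscription ID; the other names come from the sample row in this article.
url = pipeline_url(
    "00000000-0000-0000-0000-000000000000",
    "rg-Analytics",
    "adf-Analytics",
    "IngestSalesData",
)
```

The returned dictionary contains the pipeline's activities under `properties`, which is the raw material for the generated documents.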

The example below was developed to document a large number of pipelines and to keep that documentation current. It collects and processes details such as source and target linked services, activities, secrets, and triggers, and writes them into the documentation. In the generated documentation, pipelines are grouped by source linked service, with one file per source linked service.

How To:

Collect necessary information across pipelines

Collect the following information for each pipeline and store it in a delimited file, which serves as input for the script:

resource_group_name – Required. Azure resource group name
factory_name – Required. Azure Data Factory name
pipeline_name – Required. Name of the pipeline
activity_names – Required. All pipeline activities, separated by semicolons
source_secret_names – Secret name of the source linked service
source_linked_service_names – Name of the source linked service
source_linked_service_values – Value of the source linked service
target_secret_names – Secret name of the target (sink) linked service
target_linked_service_names – Name of the target linked service
target_linked_service_values – Value of the target linked service
trigger – Pipeline trigger
text – Pipeline description

Delimited file sample

resource_group_name,factory_name,pipeline_name,activity_names,source_secret_names,source_linked_service_names,source_linked_service_values,target_secret_names,target_linked_service_names,target_linked_service_values,trigger,text
rg-Analytics,adf-Analytics,IngestSalesData,LoadSalesData,SalesDBSecret,SalesDatabase,SalesTable,BlobStorageSecret,SalesDataBlobStorage,sales_data.parquet,Daily1130,This pipeline consists of an activity that ingests data from the SQL table to the Azure Blob Storage.
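A file in this layout could be read as sketched below. This is an illustration using Python's `csv` module, not the original script; the function name `read_pipelines` and the validation of required columns are assumptions based on the field list above.

```python
import csv
import io

# Columns marked "Required" in the field list.
REQUIRED = ("resource_group_name", "factory_name", "pipeline_name", "activity_names")


def read_pipelines(handle):
    """Parse the delimited file into a list of dicts, one per pipeline."""
    rows = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        missing = [name for name in REQUIRED if not (row.get(name) or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing required field(s) {missing}")
        # Activities share one column and are separated by semicolons.
        row["activity_names"] = [
            name.strip() for name in row["activity_names"].split(";") if name.strip()
        ]
        rows.append(row)
    return rows


header = (
    "resource_group_name,factory_name,pipeline_name,activity_names,"
    "source_secret_names,source_linked_service_names,source_linked_service_values,"
    "target_secret_names,target_linked_service_names,target_linked_service_values,"
    "trigger,text"
)
record = (
    "rg-Analytics,adf-Analytics,IngestSalesData,LoadSalesData,"
    "SalesDBSecret,SalesDatabase,SalesTable,"
    "BlobStorageSecret,SalesDataBlobStorage,sales_data.parquet,"
    "Daily1130,This pipeline consists of an activity that ingests data "
    "from the SQL table to the Azure Blob Storage."
)
pipelines = read_pipelines(io.StringIO(header + "\n" + record + "\n"))
```

Note that a description containing commas would have to be quoted in the file, otherwise it is split across columns.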

Upload images

Upload images or schema diagrams of the pipelines to the Wiki repository. The best practice is to create a new folder (e.g., “images/” or “diagrams/”) within your Wiki repository where you want to store the images. The file name of the image must match the name of the pipeline, e.g.:

./images/IngestSalesData.png

Only one image per pipeline can be attached to the generated document.
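The naming convention can be sketched as a simple lookup: given a pipeline name and the images folder, return a Markdown image link if a matching file exists. The helper name `image_markdown` is hypothetical, and restricting the lookup to `.png` files is an assumption based on the example above.

```python
from pathlib import Path


def image_markdown(pipeline_name, images_location, extensions=(".png",)):
    """Return a Markdown image link for the pipeline, or '' if no image exists."""
    folder = Path(images_location)
    for extension in extensions:
        candidate = folder / f"{pipeline_name}{extension}"
        if candidate.is_file():
            # as_posix() keeps forward slashes in the link on every platform.
            return f"![{pipeline_name}]({candidate.as_posix()})"
    return ""
```

Returning an empty string for a missing image lets the caller skip the image line instead of emitting a broken link.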

Script Execution

The script is generic and may need to be adapted to specific requirements, such as retrieving additional details, formatting the information differently, or including visual representations produced with external tools.

Code Repository

For detailed information regarding the code, please contact: marketing@um-orange.com

Prerequisites

pip install -r requirements.txt

Execute script

python generate_markdown.py -s <subscription_id> -f <delimited_file> -i <images_location>

Script parameters

-s, --subscription_id – Required. ID of the Azure subscription. Learn how to get a Subscription ID: https://learn.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id
-f, --file – Required. CSV file describing the ADF pipelines (descriptions, activities, and parameters). See the delimited file sample above.
-i, --images_location – Relative path to the images or diagrams folder (e.g., “./images/” or “./diagrams/”).
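A command-line interface with these parameters could be declared with `argparse` roughly as follows. The flag names are taken from the list above; the help texts, the `build_parser` name, and the placeholder values in the example call are illustrative.

```python
import argparse


def build_parser():
    """Declare the three parameters described above."""
    parser = argparse.ArgumentParser(
        description="Generate Markdown documentation for ADF pipelines."
    )
    parser.add_argument("-s", "--subscription_id", required=True,
                        help="ID of the Azure subscription")
    parser.add_argument("-f", "--file", required=True,
                        help="CSV file describing the ADF pipelines")
    parser.add_argument("-i", "--images_location", default=None,
                        help="relative path to the images or diagrams folder")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["-s", "00000000-0000-0000-0000-000000000000", "-f", "pipelines.csv", "-i", "./images/"]
)
```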

The following files will be generated

1_SalesDatabase.md
2_SQLDatabase.md
… and so on…
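The file names follow from grouping pipelines by source linked service. A minimal sketch of that grouping is shown below; numbering the groups in order of first appearance and treating `source_linked_service_names` as a single value are assumptions, and `IngestCustomerData` is a hypothetical pipeline added only to produce a second group.

```python
def group_by_source(rows):
    """Map each source linked service to the pipelines that read from it."""
    groups = {}
    for row in rows:
        service = row["source_linked_service_names"]
        groups.setdefault(service, []).append(row["pipeline_name"])
    return groups


def output_files(groups):
    """Number the groups in order of first appearance and derive file names."""
    return {
        f"{index}_{service}.md": pipelines
        for index, (service, pipelines) in enumerate(groups.items(), start=1)
    }


rows = [
    {"pipeline_name": "IngestSalesData", "source_linked_service_names": "SalesDatabase"},
    # Hypothetical second pipeline, not part of the sample file.
    {"pipeline_name": "IngestCustomerData", "source_linked_service_names": "SQLDatabase"},
]
files = output_files(group_by_source(rows))
```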

Markdown document sample

SalesDatabase
=============

Contents
========

* [1_SalesDatabase](#1_salesdatabase)
* [1_IngestSalesData](#1_ingestsalesdata)

# 1_SalesDatabase

## 1_IngestSalesData

![IngestSalesData](/.attachments/IngestSalesData.png)

*This pipeline consists of an activity that ingests data from the SQL table to the Azure Blob Storage.*

### 1. pipeline activity

**Activity name: LoadSalesData**
**Activity type: Copy**

### Source

**Dataset name:** SourceSalesData
**Source type:** SqlServerSource
**Secret name:** SalesDBSecret
**Linked service name:** SalesDatabase
**Linked service value:** dbo.SalesTable

**Additional columns**

```json
{"name": "Source", "value": "SqlServer"}
{"name": "Createdby", "value": {"value": "@pipeline().Pipeline", "type": "Expression"}}
{"name": "LoadDate", "value": {"value": "@pipeline().TriggerTime", "type": "Expression"}}
```

### Sink

**Dataset name:** TargetSalesData
**Sink type:** ParquetSink
**Secret name:** BlobStorageSecret
**Linked service name:** SalesDataBlobStorage
**Linked service value:** sales_data.parquet
**Truncate:** true
**Format:** parquet

### Mappings

<details> <summary>See mappings </summary>

```json
{"source": {"name": "Region", "type": "String", "physicalType": "String"}, "sink": {"name": "Region", "type": "String", "physicalType": "nvarchar"}}
{"source": {"name": "Country", "type": "String", "physicalType": "String"}, "sink": {"name": "Country", "type": "String", "physicalType": "nvarchar"}}
{"source": {"name": "Account - ID", "type": "String", "physicalType": "String"}, "sink": {"name": "AccountID", "type": "String", "physicalType": "nvarchar"}}
{"source": {"name": "Source", "type": "String"}, "sink": {"name": "Source", "type": "String", "physicalType": "nvarchar"}}
{"source": {"name": "Createdby", "type": "String"}, "sink": {"name": "Createdby", "type": "String", "physicalType": "nvarchar"}}
{"source": {"name": "LoadDate", "type": "String"}, "sink": {"name": "LoadDate", "type": "DateTime", "physicalType": "datetime"}}
```
</details>

### Trigger

**WeeklyRefresh**

Information exported in generated files

Pipeline activity (one pipeline can have multiple activities)

Activity name
Activity type

Source

Dataset name
Source type
Secret name
Linked service name
Linked service value

Sink (Target)

Dataset name
Sink type
Secret name
Linked service name
Linked service value
Pre-copy script

Mappings

Mappings from source columns to target columns.

Trigger

Defined trigger for pipeline execution.
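Most of the items listed above can be read from the pipeline definition returned by the ADF API. The sketch below assumes the usual JSON layout of a Copy activity (`inputs`, `outputs`, and `typeProperties` with `source`, `sink`, and `translator`) and uses a hand-written sample definition; the function name and the keys of the result are illustrative, and secrets, linked service values, and triggers are not covered here.

```python
def summarize_activities(pipeline):
    """Return one summary dict per activity of a pipeline definition."""
    summaries = []
    for activity in pipeline.get("properties", {}).get("activities", []):
        props = activity.get("typeProperties", {})
        sink = props.get("sink", {})
        summaries.append({
            "activity_name": activity.get("name"),
            "activity_type": activity.get("type"),
            # Datasets are referenced by name in the inputs and outputs lists.
            "source_datasets": [ref.get("referenceName") for ref in activity.get("inputs", [])],
            "sink_datasets": [ref.get("referenceName") for ref in activity.get("outputs", [])],
            "source_type": props.get("source", {}).get("type"),
            "sink_type": sink.get("type"),
            "pre_copy_script": sink.get("preCopyScript"),  # None when not defined
            "mappings": props.get("translator", {}).get("mappings", []),
        })
    return summaries


# Hand-written sample modeled on the Markdown document sample above.
sample_pipeline = {
    "name": "IngestSalesData",
    "properties": {
        "activities": [{
            "name": "LoadSalesData",
            "type": "Copy",
            "inputs": [{"referenceName": "SourceSalesData", "type": "DatasetReference"}],
            "outputs": [{"referenceName": "TargetSalesData", "type": "DatasetReference"}],
            "typeProperties": {
                "source": {"type": "SqlServerSource"},
                "sink": {"type": "ParquetSink"},
                "translator": {
                    "type": "TabularTranslator",
                    "mappings": [
                        {"source": {"name": "Region"}, "sink": {"name": "Region"}},
                    ],
                },
            },
        }]
    },
}
summary = summarize_activities(sample_pipeline)
```

Using `.get()` with defaults throughout keeps the function usable for activity types that lack a source, sink, or mapping.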

Conclusion

Automating the documentation generation process for Azure Data Factory pipelines using the ADF API and Python provides a streamlined way to keep your documentation accurate and current. By incorporating this automated approach into your data engineering workflows, you can enhance collaboration, facilitate troubleshooting, and maintain a comprehensive record of your data integration processes. The Python script can be scheduled or triggered as needed, for example with a task scheduler or with a tool such as Azure Logic Apps that runs the script in response to events in your Data Factory.

Author:

Sasa Tasic

Data Engineer, Orange Business - Digital Services, Berlin - Vienna
