Efficiently Reading 1 Million Records with Until and Lookup Activities in Azure Data Factory

Dhruv Singhal
May 31, 2023


Azure Data Factory (ADF) is a powerful cloud-based data integration service that enables you to orchestrate and manage data workflows. When dealing with large datasets, it’s crucial to implement efficient techniques for reading and processing records to avoid performance bottlenecks. In this tutorial, we’ll guide you through the process of efficiently reading 1 million records using the Until and Lookup activities in Azure Data Factory. By leveraging the iterative capabilities of the Until activity and the data retrieval capabilities of the Lookup activity, you’ll be able to retrieve records in batches and process them effectively. Let’s dive in!

Prerequisites:
- An Azure Data Factory instance set up and configured.
- A linked service and dataset pointing to the source table you want to read (this walkthrough assumes an Azure SQL Database source, but any Lookup-compatible source works).

Step 1: Create a Pipeline
1. In your Azure Data Factory instance, navigate to the Author & Monitor section.
2. In the Author tab, create a new pipeline.
3. Give your pipeline a name and click “OK” to create it.
4. In the pipeline’s Variables tab, add two String variables, `Counter` and `CounterTemp`, each with a default value of `0`. `Counter` tracks which batch the loop is on; `CounterTemp` is a helper used when incrementing (see Step 6).
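
Under the hood, these variables live in the pipeline’s JSON definition. A minimal sketch of the variables section (the names `Counter` and `CounterTemp` follow this walkthrough; adjust to taste):

"variables": {
    "Counter":     { "type": "String", "defaultValue": "0" },
    "CounterTemp": { "type": "String", "defaultValue": "0" }
}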

Step 2: Add Activities to the Pipeline
1. Inside the created pipeline, open the Activities pane to add activities.
2. Under “Iteration & conditionals”, find the “Until” activity and drag it onto the pipeline canvas.

Step 3: Configure the Until Activity
1. Double-click on the Until activity to open its settings.
2. In the General tab, provide a suitable name for the activity.
3. In the Condition tab, enter the expression that ends the loop. The Until activity repeats its inner activities until this condition evaluates to true, so it must describe the finished state. To read 1 million records in 1,000-row batches, the expression `@greaterOrEquals(int(variables('Counter')), 1000)` stops the loop once 1,000 batches have been fetched.
4. In the Settings tab, set the “Timeout” option to a suitable value, considering the expected execution time.
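
For reference, the configured Until activity looks roughly like this in the pipeline’s JSON view (the activity name is illustrative, and the inner activities are added in the next steps):

{
    "name": "UntilAllBatchesRead",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@greaterOrEquals(int(variables('Counter')), 1000)",
            "type": "Expression"
        },
        "timeout": "0.12:00:00",
        "activities": []
    }
}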

Step 4: Add Lookup Activity
1. Select the Until activity and open its inner canvas (the pencil icon on the activity).
2. Drag the “Lookup” activity from the Activities pane onto the Until activity’s canvas.

Step 5: Configure the Lookup Activity
1. Double-click on the Lookup activity to open its settings.
2. In the General tab, provide a suitable name for the activity.
3. In the Settings tab, select the source dataset (and its linked service) that points to the table you want to read.
4. In the same tab, uncheck “First row only” so the Lookup returns the whole batch, then choose “Query” to supply a custom query. Use the OFFSET and FETCH NEXT clauses to read records in 1,000-row batches; keep batches small, because a single Lookup run is capped at 5,000 rows and roughly 4 MB of output. For example:
SELECT *
FROM your_table
ORDER BY id_column
OFFSET @{mul(int(variables('Counter')), 1000)} ROWS
FETCH NEXT 1000 ROWS ONLY;

Replace `your_table` with your actual table name and `id_column` with a column that gives a stable, deterministic ordering (OFFSET paging returns overlapping or missing rows without one). The `@{...}` string-interpolation syntax tells ADF to evaluate the enclosed expression and splice the result into the query text before it is sent to the source.
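
For instance, on the third iteration (when `Counter` is `2`), the query that actually reaches the database resolves to:

SELECT *
FROM your_table
ORDER BY id_column
OFFSET 2000 ROWS
FETCH NEXT 1000 ROWS ONLY;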
5. The Lookup activity exposes its result set as `activity('<LookupName>').output.value`, an array of row objects. There is no mapping to define on the Lookup itself; map columns in whichever downstream activity consumes the output (for example, a stored procedure or Copy activity inside the loop).
6. Click “OK” to save the Lookup activity settings.
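
In the JSON view, the configured Lookup looks roughly like the sketch below. The activity name `LookupBatch`, the dataset name `SourceTableDataset`, and the `AzureSqlSource` type are illustrative assumptions for an Azure SQL source; substitute the source type your dataset uses.

{
    "name": "LookupBatch",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": {
                "value": "SELECT * FROM your_table ORDER BY id_column OFFSET @{mul(int(variables('Counter')), 1000)} ROWS FETCH NEXT 1000 ROWS ONLY",
                "type": "Expression"
            }
        },
        "dataset": {
            "referenceName": "SourceTableDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}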

Step 6: Set Variable Activity — Increment the Counter
1. After the Lookup activity (still inside the Until canvas), add a “Set Variable” activity and connect it to the Lookup’s success output.
2. Double-click on the Set Variable activity to open its settings.
3. In the General tab, provide a suitable name for the activity.
4. In the Variables tab, select the helper variable `CounterTemp`.
5. In the Value field, enter the expression `@string(add(int(variables('Counter')), 1))` to compute the incremented value. The increment lands in a helper variable because a Set Variable activity traditionally cannot reference the variable it is setting; if your factory supports self-referencing variables, a single Set Variable on `Counter` works as well.
6. Click “OK” to save the Set Variable activity settings.
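
The corresponding JSON for this step, assuming the activity is named `IncrementCounterTemp` and the Lookup is named `LookupBatch` as above:

{
    "name": "IncrementCounterTemp",
    "type": "SetVariable",
    "dependsOn": [
        { "activity": "LookupBatch", "dependencyConditions": ["Succeeded"] }
    ],
    "typeProperties": {
        "variableName": "CounterTemp",
        "value": {
            "value": "@string(add(int(variables('Counter')), 1))",
            "type": "Expression"
        }
    }
}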

Step 7: Set Variable Activity — Write the Counter Back
1. After the first Set Variable activity, add another Set Variable activity and connect it on success.
2. Double-click on the Set Variable activity to open its settings.
3. In the General tab, provide a suitable name for the activity.
4. In the Variables tab, select `Counter`.
5. In the Value field, enter the expression `@variables('CounterTemp')` to copy the incremented value back into the loop counter. Because both variables are Strings, no extra conversion step is needed.
6. Click “OK” to save the Set Variable activity settings.
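
And the write-back step in JSON, again with illustrative activity names:

{
    "name": "WriteBackCounter",
    "type": "SetVariable",
    "dependsOn": [
        { "activity": "IncrementCounterTemp", "dependencyConditions": ["Succeeded"] }
    ],
    "typeProperties": {
        "variableName": "Counter",
        "value": {
            "value": "@variables('CounterTemp')",
            "type": "Expression"
        }
    }
}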

Step 8: Publish and Trigger the Pipeline
1. Once you have configured the activities, click on the “Publish” button to publish the pipeline changes.
2. Navigate back to the “Author & Monitor” section and trigger the pipeline either manually (“Add trigger” > “Trigger now”) or on a schedule.

The Until activity, in combination with the Lookup activity, reads records in a loop until its condition evaluates to true. On each iteration, the Lookup fetches the next 1,000-row page based on the counter variable, and the counter is then incremented, so after 1,000 iterations the condition `@greaterOrEquals(int(variables('Counter')), 1000)` becomes true and the loop ends.
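
If the total row count is not known up front, an alternative stopping condition (assuming the Lookup activity is named `LookupBatch`) is to check whether the last page came back short, which signals that the table is exhausted:

@less(length(activity('LookupBatch').output.value), 1000)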

Make sure to adjust the timeout and any other settings based on your specific requirements and data volume.

Note: Handling and processing a large number of records in a single pipeline run may require optimizations or partitioning strategies to avoid performance issues. Consider using techniques like parallel processing, partitioning, or batching to handle large datasets efficiently.

I hope this tutorial helps you efficiently read 1 million records using the Until and Lookup activities in Azure Data Factory!

Written by Dhruv Singhal

Data engineer with expertise in PySpark, SQL, Flask. Skilled in Databricks, Snowflake, and Datafactory. Published articles. Passionate about tech and games.
