Simplifying Data Management with monotonically_increasing_id() in PySpark 🚀

Dhruv Singhal
2 min read · Dec 21, 2023


In the vast landscape of PySpark functionality, one gem stands out for data engineers: monotonically_increasing_id(). This function generates a unique, monotonically increasing 64-bit ID for each row of a DataFrame without shuffling the data, making it an essential tool for efficient data manipulation. One caveat up front: the IDs are guaranteed to be unique and increasing, but not necessarily consecutive.

The Basics: Simple Example

Let’s dive into a straightforward example. Assume we have a DataFrame with names and ages, and we want to add a unique ID to each row:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a sample DataFrame
data = [("John", 28), ("Alice", 35), ("Bob", 40)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Add a monotonically increasing ID column
df = df.withColumn("ID", monotonically_increasing_id())

# Show the result
df.show()

Output:

+-----+---+---+
| Name|Age| ID|
+-----+---+---+
| John| 28| 0|
|Alice| 35| 1|
| Bob| 40| 2|
+-----+---+---+
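A word of caution about that output: the IDs come out as 0, 1, 2 only because this tiny DataFrame landed in a single partition. Per the Spark API docs, the current implementation puts the partition index in the upper 31 bits of the ID and the row number within each partition in the lower 33 bits, so across partitions the IDs stay unique and increasing but have large gaps. Here is a pure-Python sketch of that documented layout (an illustration of the numbering scheme, not Spark's actual code):

```python
def simulated_ids(partition_sizes):
    """Mimic the documented bit layout of monotonically_increasing_id():
    upper 31 bits hold the partition index, lower 33 bits hold the row
    number within that partition. Illustration only, not Spark source."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            ids.append((partition_index << 33) + row_in_partition)
    return ids

# One partition of 3 rows -> the consecutive 0, 1, 2 shown above
print(simulated_ids([3]))    # [0, 1, 2]

# Two partitions of 2 rows each -> still unique and increasing,
# but with a jump of 2^33 between partitions
print(simulated_ids([2, 2]))  # [0, 1, 8589934592, 8589934593]
```

On a multi-core local run, even a three-row DataFrame may be split across several partitions, so don't be surprised if df.show() prints IDs in the billions.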

Setting a Custom Starting Point

Now, let’s explore the flexibility of this function by offsetting the IDs. If you want the numbering to begin at 100 instead of 0, simply add a constant (again, the tidy 100, 101, 102 below assumes the data sits in a single partition):

# Set a custom starting point for the IDs
start_id = 100
df = df.withColumn("ID", monotonically_increasing_id() + start_id)

# Show the result
df.show()

Output:

+-----+---+---+
| Name|Age| ID|
+-----+---+---+
| John| 28|100|
|Alice| 35|101|
| Bob| 40|102|
+-----+---+---+

Elevate Your PySpark Game

Mastering monotonically_increasing_id() equips you with a powerful tool for handling unique identifiers. Whether you're working with extensive datasets or simply streamlining your data, this function empowers your PySpark endeavors.

If you found this tutorial helpful, give it a round of applause 👏. Follow us for more insights, and feel free to share your thoughts in the comments. Your feedback fuels our commitment to delivering quality content!
