PySpark Data Alchemy: Unleashing the Power of between for DataFrame Sorcery

Dhruv Singhal
2 min read · Dec 21, 2023

Data manipulation is a crucial skill for every data engineer 🔧, and PySpark's between function makes it easy to filter DataFrame rows whose values fall within a specified range 📊. In this tutorial, we will explore the mechanics of between with a practical example 💻.

Input: A Pricey Endeavor

Consider a scenario with a PySpark DataFrame containing information about product prices. Let's create the DataFrame and explore how to use the between function to filter products within a specific price range.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder.appName("price_filter").getOrCreate()

# Sample DataFrame
data = [("Product_A", 75),
        ("Product_B", 120),
        ("Product_C", 50),
        ("Product_D", 90)]

columns = ["Product", "Price"]
df = spark.createDataFrame(data, columns)

# Display the original DataFrame
print("Original DataFrame:")
df.show()
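
Running the snippet should print our starting data:

Original DataFrame:
+---------+-----+
|  Product|Price|
+---------+-----+
|Product_A|   75|
|Product_B|  120|
|Product_C|   50|
|Product_D|   90|
+---------+-----+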

Output: The Price is Right

Now, let's use the between function to filter products with prices between $50 and $100. Note that between is inclusive on both ends, so a price of exactly $50 or $100 still matches.

# Use the between function to filter prices
filtered_df = df.filter(col('Price').between(50, 100))

# Display the filtered DataFrame
print("Filtered DataFrame:")
filtered_df.show()

Filtered DataFrame Output:

+---------+-----+
|  Product|Price|
+---------+-----+
|Product_A|   75|
|Product_C|   50|
|Product_D|   90|
+---------+-----+

In the output, you'll see that filtered_df contains only the rows where the 'Price' column falls between $50 and $100, inclusive, which is why Product_C, priced at exactly $50, is kept.
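
Under the hood, between(lowerBound, upperBound) is shorthand for a pair of inclusive comparisons. As a quick sketch, reusing the df from above, the same filter can be written explicitly, or tightened to exclude the boundaries:

# Equivalent to between: two inclusive comparisons combined with &
inclusive_df = df.filter((col('Price') >= 50) & (col('Price') <= 100))

# The same condition as a SQL-style string
sql_df = df.filter("Price BETWEEN 50 AND 100")

# Strict comparisons if you want to exclude the endpoints
strict_df = df.filter((col('Price') > 50) & (col('Price') < 100))

The first two produce the same three rows as filtered_df, while the strict version drops Product_C, whose price sits exactly on the lower bound.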

Conclusion: Empowering Data Engineers

Unleash the power of PySpark's between function, a game-changer for seamless DataFrame filtering within specified ranges 🚀. Mastering this tool empowers data engineers to filter and analyze vast datasets with precision and efficiency 🎯.

If PySpark's brilliance has brightened your data journey, give it a round of applause 👏 and follow for more enchanting updates. Your valuable feedback is the magic that enhances our spellbook ✨.

🚀 Happy PySparking!
