PySpark Data Alchemy: Unleashing the Power of between for DataFrame Sorcery
Data manipulation is a crucial skill for every data engineer 🔧, and PySpark's between function makes it easy to filter DataFrames by a specified range of values 📊. In this tutorial, we will walk through how it works with a practical example 💻.
Input: A Pricey Endeavor
Consider a scenario with a PySpark DataFrame containing information about product prices. Let's create the DataFrame and explore how to use the between function to filter products within a specific price range:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a Spark session
spark = SparkSession.builder.appName("price_filter").getOrCreate()
# Sample DataFrame
data = [("Product_A", 75),
("Product_B", 120),
("Product_C", 50),
("Product_D", 90)]
columns = ["Product", "Price"]
df = spark.createDataFrame(data, columns)
# Display the original DataFrame
print("Original DataFrame:")
df.show()
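Since the rows are defined inline above, the output of df.show() is fully determined by the sample data; it should look roughly like this:
+---------+-----+
|  Product|Price|
+---------+-----+
|Product_A|   75|
|Product_B|  120|
|Product_C|   50|
|Product_D|   90|
+---------+-----+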
Output: The Price is Right
Now, let’s use the between function to filter products with prices between $50 and $100.
# Use the between function to filter prices
filtered_df = df.filter(col('Price').between(50, 100))
# Display the filtered DataFrame
print("Filtered DataFrame:")
filtered_df.show()
Filtered DataFrame Output:
+---------+-----+
|  Product|Price|
+---------+-----+
|Product_A|   75|
|Product_C|   50|
|Product_D|   90|
+---------+-----+
In the output, you'll see that filtered_df contains only the rows where the 'Price' column falls between $50 and $100. Note that between is inclusive of both bounds, which is why Product_C, priced at exactly $50, makes the cut.
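If you prefer SQL-style syntax, the same range filter can also be written as an expression string, and the condition can be negated with ~ to keep rows outside the range. A minimal sketch reusing the df and col defined above (the variable names sql_filtered_df and outside_df are just illustrative):
# Equivalent filter using a SQL-style expression string (BETWEEN is inclusive)
sql_filtered_df = df.filter("Price BETWEEN 50 AND 100")
sql_filtered_df.show()
# Negate the condition with ~ to keep products priced outside the $50–$100 range
outside_df = df.filter(~col("Price").between(50, 100))
outside_df.show()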
Conclusion: Empowering Data Engineers
Unleash the power of PySpark's between function, a game-changer for seamless DataFrame filtering within specified ranges 🚀. Mastering this tool empowers data engineers to manage and analyze vast datasets efficiently and with precision 🎯.
If PySpark's brilliance has brightened your data journey, give it a round of applause 👏 and follow for more enchanting updates. Your valuable feedback is the magic that enhances our spellbook ✨.
🚀 Happy PySparking!