Exploring PySpark’s collect_list Function: A Journey into Grouping and Aggregation 🚀

Dhruv Singhal
2 min read · Dec 8, 2023

PySpark, the Python API for Apache Spark, offers a powerful aggregation function known as collect_list. In this journey, we'll dive into its world and unravel its magic, discovering how it can simplify complex data aggregation tasks.

Use-case Scenario: Unveiling Daily Product Sales 🛍️

Imagine you’re dealing with a vast dataset containing daily sales records, and your quest is to uncover the list of products sold daily. Enter collect_list – your companion in this exploration.

Let’s Begin with Sample Data 📊

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Setting the stage
spark = SparkSession.builder.appName("CollectListAdventure").getOrCreate()

# Creating our sales DataFrame
data = [("2023-01-01", "Product_A", 100),
        ("2023-01-01", "Product_B", 150),
        ("2023-01-02", "Product_A", 120),
        ("2023-01-02", "Product_C", 200),
        ("2023-01-02", "Product_B", 80)]
columns = ["Date", "Product", "Amount"]
sales_df = spark.createDataFrame(data, columns)
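
Before we cast any spells, let's take a quick peek at the raw data. This is just a sanity check; the row order that show() prints here reflects how we built the DataFrame and may differ on a real cluster.

# Peeking at the raw sales data
sales_df.show()

# Expected output (row order may vary):
# +----------+---------+------+
# |      Date|  Product|Amount|
# +----------+---------+------+
# |2023-01-01|Product_A|   100|
# |2023-01-01|Product_B|   150|
# |2023-01-02|Product_A|   120|
# |2023-01-02|Product_C|   200|
# |2023-01-02|Product_B|    80|
# +----------+---------+------+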

Embarking on the Journey of collect_list

Our journey begins with understanding how collect_list operates. Paired with groupBy, it works like a magical guide: groupBy forms the groups from the specified columns, and collect_list gathers the values of another column within each group into a list, much like pairing a SQL GROUP BY clause with an aggregate function.
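
Since the analogy to SQL runs deep, here is the very same aggregation expressed through Spark SQL. This is a minimal sketch: the temporary view name daily_sales is just an illustrative choice, not something collect_list requires.

# Registering our DataFrame as a temporary view (the name is arbitrary)
sales_df.createOrReplaceTempView("daily_sales")

# The same grouping and collection, in SQL form
spark.sql("""
    SELECT Date, collect_list(Product) AS ProductsSold
    FROM daily_sales
    GROUP BY Date
""").show(truncate=False)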

The Magical Incantation

# Applying collect_list to aggregate products sold per day
result_df = sales_df.groupBy("Date").agg(F.collect_list("Product").alias("ProductsSold"))
# Displaying the Treasure!
result_df.show(truncate=False)

Witnessing the Magic ✨

+----------+---------------------------------+
|Date      |ProductsSold                     |
+----------+---------------------------------+
|2023-01-01|[Product_A, Product_B]           |
|2023-01-02|[Product_A, Product_C, Product_B]|
+----------+---------------------------------+

Decoding the Spell

  • sales_df.groupBy("Date"): Conjures groups based on the "Date" column.
  • agg(F.collect_list("Product").alias("ProductsSold")): Unveils the list of products sold each day.

The result is a captivating DataFrame showcasing the products that found new homes on every glorious sales day.
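
One caveat before we close the spellbook: collect_list makes no promise about the order of elements inside each list, because rows can be aggregated across partitions in any order. If a stable order matters to you, a common remedy is to sort each list after collecting it. Here's a minimal sketch using sort_array; collect_set is the sibling function to reach for if you also want duplicates removed.

# Sorting each collected list alphabetically for a deterministic result
sorted_df = sales_df.groupBy("Date").agg(
    F.sort_array(F.collect_list("Product")).alias("ProductsSold")
)
sorted_df.show(truncate=False)

# Related: F.collect_set("Product") gathers only the unique products,
# at the cost of losing duplicates (and any notion of order).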

Conclusion

In our journey through PySpark’s collect_list, we've harnessed the power of aggregation with elegance. Armed with this knowledge, you're now equipped to wield collect_list in your adventures through the vast realms of data.

Happy coding! 🚀📊
