Exploring PySpark’s collect_list Function: A Journey into Grouping and Aggregation 🚀

Dhruv Singhal
2 min read · Dec 8, 2023

PySpark, the Python API for Apache Spark, offers a powerful aggregation function known as collect_list. In this journey, we'll dive into its world and unravel its magic, discovering how it can simplify complex data aggregation tasks.

Use-case Scenario: Unveiling Daily Product Sales 🛍️

Imagine you’re dealing with a vast dataset containing daily sales records, and your quest is to uncover the list of products sold daily. Enter collect_list – your companion in this exploration.

Let’s Begin with Sample Data 📊

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Setting the stage
spark = SparkSession.builder.appName("CollectListAdventure").getOrCreate()

# Creating our sales DataFrame
data = [("2023-01-01", "Product_A", 100),
        ("2023-01-01", "Product_B", 150),
        ("2023-01-02", "Product_A", 120),
        ("2023-01-02", "Product_C", 200),
        ("2023-01-02", "Product_B", 80)]
columns = ["Date", "Product", "Amount"]
sales_df = spark.createDataFrame(data, columns)
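
Before we cast any spells, let's take a quick peek at the raw data. This is just a sanity check; the row order that show() prints here reflects how we built the DataFrame and may differ on a real cluster.

# Peeking at the raw sales data
sales_df.show()

# Expected output (row order may vary):
# +----------+---------+------+
# |      Date|  Product|Amount|
# +----------+---------+------+
# |2023-01-01|Product_A|   100|
# |2023-01-01|Product_B|   150|
# |2023-01-02|Product_A|   120|
# |2023-01-02|Product_C|   200|
# |2023-01-02|Product_B|    80|
# +----------+---------+------+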

Embarking on the Journey of collect_list

Our journey begins with understanding how collect_list operates. Paired with groupBy, it works like a magical guide: groupBy forms the groups from the specified columns, and collect_list gathers the values of another column within each group into a list, much like pairing a SQL GROUP BY clause with an aggregate function.
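
Since the analogy to SQL runs deep, here is the very same aggregation expressed through Spark SQL. This is a minimal sketch: the temporary view name daily_sales is just an illustrative choice, not something collect_list requires.

# Registering our DataFrame as a temporary view (the name is arbitrary)
sales_df.createOrReplaceTempView("daily_sales")

# The same grouping and collection, in SQL form
spark.sql("""
    SELECT Date, collect_list(Product) AS ProductsSold
    FROM daily_sales
    GROUP BY Date
""").show(truncate=False)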

The Magical Incantation

# Applying collect_list to aggregate products sold per day
result_df = sales_df.groupBy("Date").agg(F.collect_list("Product").alias("ProductsSold"))
# Displaying the Treasure!
result_df.show(truncate=False)

Witnessing the Magic ✨

+----------+---------------------------------+
|Date      |ProductsSold                     |
+----------+---------------------------------+
|2023-01-01|[Product_A, Product_B]           |
|2023-01-02|[Product_A, Product_C, Product_B]|
+----------+---------------------------------+

Decoding the Spell

  • sales_df.groupBy("Date"): Conjures groups based on the "Date" column.
  • agg(F.collect_list("Product").alias("ProductsSold")): Unveils the list of products sold each day.

The result is a captivating DataFrame showcasing the products that found new homes on every glorious sales day.
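
One caveat before we close the spellbook: collect_list makes no promise about the order of elements inside each list, because rows can be aggregated across partitions in any order. If a stable order matters to you, a common remedy is to sort each list after collecting it. Here's a minimal sketch using sort_array; collect_set is the sibling function to reach for if you also want duplicates removed.

# Sorting each collected list alphabetically for a deterministic result
sorted_df = sales_df.groupBy("Date").agg(
    F.sort_array(F.collect_list("Product")).alias("ProductsSold")
)
sorted_df.show(truncate=False)

# Related: F.collect_set("Product") gathers only the unique products,
# at the cost of losing duplicates (and any notion of order).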

Conclusion

In our journey through PySpark’s collect_list, we've harnessed the power of aggregation with elegance. Armed with this knowledge, you're now equipped to wield collect_list in your adventures through the vast realms of data.

Happy coding! 🚀📊
