GroupBy in PySpark

The groupBy() transformation in PySpark allows us to group rows that share a common value in one or more columns, and then apply aggregate functions (like sum, count, average) to those groups.

This is useful when you want to:

  • Summarize data by categories (e.g., schools, countries, continents)
  • Count occurrences of each group
  • Calculate averages, totals, or other statistics per group


Example Dataset

In the following examples we will use a sample dataset from Dewey Data containing education information about schools and students.

To follow along:

  1. Go to Dewey Data

  2. Download Education By People Data Labs

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets

The dataset has 19 columns:

end_date, gpa, id, person_id, school_domain, school_facebook_url, school_id, school_linkedin_id, school_linkedin_url, school_location_continent, school_location_country, school_location_locality, school_location_name, school_location_region, school_name, school_twitter_url, school_type, school_website, start_date

A few key columns from the first rows (the full output is too wide to show here):

start_date  end_date  gpa   school_name                                                         school_location_country  school_location_continent  school_type
2016        2017      None  tafe queensland                                                     australia                oceania                    post-secondary institution
2008        2008      None  university of ballarat                                              australia                oceania                    post-secondary institution
2011        2015      None  ecole doctorale des sciences de la vie et de la santé - strasbourg  france                   europe                     post-secondary institution
2010        2011      None  aix - marseille university                                          france                   europe                     post-secondary institution
2008        2011      None  esil - ecole supérieure d’ingénieurs de luminy                      None                     None                       None
2005        2008      None  hanoi university of science and technology                          vietnam                  asia                       post-secondary institution

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before grouping, we first load the dataset into a Spark DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

edu = spark.read.csv("data/education-sample.csv", header=True, inferSchema=True)

display(edu)

The display() function (available in Databricks notebooks) lets us visually inspect the dataset; in a plain PySpark session, edu.show() prints the first rows instead.



Simple GroupBy

Counting Students by School

Suppose we want to see how many records exist for each school. We can use groupBy() with count().

student_counts = edu.groupBy("school_name").count().orderBy("count", ascending=False)

display(student_counts)

Explanation:

  • groupBy("school_name") creates groups for each unique school

  • count() counts how many rows (students/records) belong to each school

  • orderBy("count", ascending=False) sorts the results from largest to smallest

This is useful for understanding which schools have the most data records.



GroupBy with Aggregation and orderBy

Calculating Average GPA by School

We can calculate the average GPA per school using groupBy() with avg().

from pyspark.sql.functions import avg

avg_gpa = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa")
).orderBy("avg_gpa", ascending=False)

display(avg_gpa)

Explanation:

  • agg(avg("gpa").alias("avg_gpa")) computes the average GPA for each school

  • alias("avg_gpa") renames the resulting column for clarity

  • Sorting by avg_gpa shows schools with the highest average GPA at the top

This is useful for comparing academic performance across schools.



Grouping by Multiple Columns

We can also group by more than one column, such as school_location_country and school_type.

country_school_type_counts = edu.groupBy(
    "school_location_country", "school_type"
).count().orderBy(
    # countries A to Z, then largest school-type groups first within each country
    "school_location_country", "count", ascending=[True, False]
)

display(country_school_type_counts)

Explanation:

  • groupBy("school_location_country", "school_type") creates groups for each country and school type combination

  • count() counts how many records exist in each group

  • Sorting by country and then by count lets us see the distribution within each country

This is useful for analyzing data patterns across multiple dimensions.



Aggregating Multiple Metrics

We can aggregate multiple statistics at once using agg() with multiple functions.

from pyspark.sql.functions import avg, max, min  # note: these shadow Python's built-in max() and min()

school_stats = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa"),
    max("gpa").alias("max_gpa"),
    min("gpa").alias("min_gpa")
).orderBy("avg_gpa", ascending=False)

display(school_stats)

Explanation:

  • avg("gpa"), max("gpa"), min("gpa") calculate average, maximum, and minimum GPAs for each school

  • agg() allows multiple metrics in one step

  • Sorting by avg_gpa helps quickly identify top-performing schools

This is especially useful when you want a comprehensive summary of each group.



Summary

Using groupBy() in PySpark allows you to:

  • Summarize and aggregate data by categories

  • Count, average, sum, min, max values per group

  • Group by one or multiple columns

  • Produce group-level insights and summaries while leaving the original row-level DataFrame unchanged

GroupBy is an essential tool for data aggregation, reporting, and exploratory analysis.