The groupBy() transformation in PySpark groups rows that share a common value in one or more columns, so that aggregate functions (like sum, count, and average) can be applied to each group. This is useful when you want to:

- Count how many records fall into each category
- Compute per-group statistics such as the average, minimum, or maximum of a column
- Summarize data across one or several dimensions at once
In the following examples we will use a sample dataset from Dewey Data containing education information about schools and students. To follow along:

1. Go to Dewey Data
2. Download Education By People Data Labs
3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets
| end_date | gpa | id | person_id | school_domain | school_facebook_url | school_id | school_linkedin_id | school_linkedin_url | school_location_continent | school_location_country | school_location_locality | school_location_name | school_location_region | school_name | school_twitter_url | school_type | school_website | start_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | None | 1311739095 | –262iwMlKiHUMD2H9Znyw_0000 | tafeqld.edu.au | None | anN7Y9T9ByrEmn3AigOAhQ_0 | 2503325 | linkedin.com/school/tafe-queensland | oceania | australia | brisbane | brisbane, queensland, australia | queensland | tafe queensland | None | post-secondary institution | tafeqld.edu.au | 2016 |
| 2008 | None | 1311742094 | –262iwMlKiHUMD2H9Znyw_0000 | federation.edu.au | facebook.com/feduniaustralia | nCzICJS71EKGdpez5KRlbw_0 | 24588 | linkedin.com/school/university-of-ballarat | oceania | australia | mount helen | mount helen, victoria, australia | victoria | university of ballarat | twitter.com/feduniaustralia | post-secondary institution | federation.edu.au | 2008 |
| 2015 | None | 76801023 | –3NBEfBmPwBrn5RcJ2yQg_0000 | unistra.fr | facebook.com/unistra | 7QlvQV8XPUWgY9FyZup67g_0 | 18551169 | linkedin.com/school/ecole-doctorale-vie-et-sante-strasbourg | europe | france | strasbourg | strasbourg, grand est, france | grand est | ecole doctorale des sciences de la vie et de la santé - strasbourg | twitter.com/unistra | post-secondary institution | ed.vie-sante.unistra.fr | 2011 |
| 2011 | None | 76804022 | –3NBEfBmPwBrn5RcJ2yQg_0000 | univ-amu.fr | facebook.com/aixmarseilleuniversite | Rys6e9VDribv1lMcEJEO3g_0 | 15250774 | linkedin.com/school/aix-marseille-universite | europe | france | marseille | marseille, provence-alpes-côte d’azur, france | provence-alpes-côte d’azur | aix - marseille university | twitter.com/univamu | post-secondary institution | univ-amu.fr | 2010 |
| 2011 | None | 76807021 | –3NBEfBmPwBrn5RcJ2yQg_0000 | None | None | None | None | None | None | None | None | None | None | esil - ecole supérieure d’ingénieurs de luminy | None | None | None | 2008 |
| 2008 | None | 76810020 | –3NBEfBmPwBrn5RcJ2yQg_0000 | hust.edu.vn | facebook.com/dhbkhanoi | J3lJBEV1hagUITlozL8KDA_0 | 18063954 | linkedin.com/school/dhbkhn | asia | vietnam | hanoi | hanoi, hà nội, vietnam | hà nội | hanoi university of science and technology | None | post-secondary institution | hust.edu.vn | 2005 |
The table above shows the first few rows as rendered by display(), letting us see the column names and understand the dataset's structure.

Before grouping, we first load the dataset into a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already defined
edu = spark.read.csv("data/education-sample.csv", header=True, inferSchema=True)
display(edu)

The display() function (available in Databricks notebooks) allows us to visually inspect the dataset.
## Counting Students by School
Suppose we want to see how many records exist for each
school. We can use groupBy() with
count().
student_counts = edu.groupBy("school_name").count().orderBy("count", ascending=False)
display(student_counts)

Explanation:
- groupBy("school_name") creates one group for each unique school
- count() counts how many rows (students/records) belong to each school
- orderBy("count", ascending=False) sorts the results from largest to smallest
This is useful for understanding which schools have the most data records.
## Calculating Average GPA by School

We can calculate the average GPA per school using groupBy() with avg().
from pyspark.sql.functions import avg

avg_gpa = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa")
).orderBy("avg_gpa", ascending=False)
display(avg_gpa)

Explanation:
- agg(avg("gpa").alias("avg_gpa")) computes the average GPA for each school
- alias("avg_gpa") renames the resulting column for clarity
- Sorting by avg_gpa shows schools with the highest average GPA at the top
This is useful for comparing academic performance across schools.
## Grouping by Multiple Columns

We can also group by more than one column, such as school_location_country and school_type.
country_school_type_counts = edu.groupBy(
    "school_location_country", "school_type"
).count().orderBy("school_location_country", "count", ascending=[True, False])
display(country_school_type_counts)

Explanation:
- groupBy("school_location_country", "school_type") creates one group for each country and school-type combination
- count() counts how many records exist in each group
- Sorting by country and then by count lets us see the distribution within each country
This is useful for analyzing data patterns across multiple dimensions.
## Aggregating Multiple Statistics at Once

We can aggregate multiple statistics at once using agg() with multiple functions.
from pyspark.sql.functions import avg, max, min  # note: these shadow Python's built-in max/min

school_stats = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa"),
    max("gpa").alias("max_gpa"),
    min("gpa").alias("min_gpa")
).orderBy("avg_gpa", ascending=False)
display(school_stats)

Explanation:
- avg("gpa"), max("gpa"), and min("gpa") calculate the average, maximum, and minimum GPA for each school
- agg() computes multiple metrics in one step
- Sorting by avg_gpa helps quickly identify top-performing schools
This is especially useful when you want a comprehensive summary of each group.
Using groupBy() in PySpark allows you to:

- Summarize and aggregate data by categories
- Count, average, sum, min, and max values per group
- Group by one or multiple columns
- Distill row-level detail into group-level insights and summaries
GroupBy is an essential tool for data aggregation, reporting, and exploratory analysis.