GroupBy in PySpark

The groupBy() transformation in PySpark allows us to group rows that share a common value in one or more columns, and then apply aggregate functions (like sum, count, average) to those groups.

This is useful when you want to:

  • Summarize data by categories (e.g., schools, countries, continents)
  • Count occurrences of each group
  • Calculate averages, totals, or other statistics per group


Example Dataset

In the following examples we will use a sample dataset from Dewey Data containing education information about schools and students.

To follow along:

  1. Go to Dewey Data

  2. Download Education By People Data Labs

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets

The dataset has 19 columns:

end_date, gpa, id, person_id, school_domain, school_facebook_url, school_id, school_linkedin_id, school_linkedin_url, school_location_continent, school_location_country, school_location_locality, school_location_name, school_location_region, school_name, school_twitter_url, school_type, school_website, start_date

A few key columns from the first rows (the full output is too wide to show here):

start_date  end_date  gpa   school_name                                                         school_location_country  school_location_continent  school_type
2016        2017      None  tafe queensland                                                     australia                oceania                    post-secondary institution
2008        2008      None  university of ballarat                                              australia                oceania                    post-secondary institution
2011        2015      None  ecole doctorale des sciences de la vie et de la santé - strasbourg  france                   europe                     post-secondary institution
2010        2011      None  aix - marseille university                                          france                   europe                     post-secondary institution
2008        2011      None  esil - ecole supérieure d’ingénieurs de luminy                      None                     None                       None
2005        2008      None  hanoi university of science and technology                          vietnam                  asia                       post-secondary institution

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before grouping, we first load the dataset into a Spark DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

edu = spark.read.csv("data/education-sample.csv", header=True, inferSchema=True)

display(edu)

The display() function (available in Databricks notebooks) lets us visually inspect the dataset; in a plain PySpark session, edu.show() prints the first rows instead.



Simple GroupBy

Counting Students by School

Suppose we want to see how many records exist for each school. We can use groupBy() with count().

student_counts = edu.groupBy("school_name").count().orderBy("count", ascending=False)

display(student_counts)

Explanation:

  • groupBy("school_name") creates groups for each unique school

  • count() counts how many rows (students/records) belong to each school

  • orderBy("count", ascending=False) sorts the results from largest to smallest

This is useful for understanding which schools have the most data records.



GroupBy with Aggregation and orderBy

Calculating Average GPA by School

We can calculate the average GPA per school using groupBy() with avg().

from pyspark.sql.functions import avg

avg_gpa = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa")
).orderBy("avg_gpa", ascending=False)

display(avg_gpa)

Explanation:

  • agg(avg("gpa").alias("avg_gpa")) computes the average GPA for each school

  • alias("avg_gpa") renames the resulting column for clarity

  • Sorting by avg_gpa shows schools with the highest average GPA at the top

This is useful for comparing academic performance across schools.



Grouping by Multiple Columns

We can also group by more than one column, such as school_location_country and school_type.

country_school_type_counts = edu.groupBy(
    "school_location_country", "school_type"
).count().orderBy(
    # countries A to Z, then largest school-type groups first within each country
    "school_location_country", "count", ascending=[True, False]
)

display(country_school_type_counts)

Explanation:

  • groupBy("school_location_country", "school_type") creates groups for each country and school type combination

  • count() counts how many records exist in each group

  • Sorting by country and then by count lets us see the distribution within each country

This is useful for analyzing data patterns across multiple dimensions.



Aggregating Multiple Metrics

We can aggregate multiple statistics at once using agg() with multiple functions.

from pyspark.sql.functions import avg, max, min  # note: these shadow Python's built-in max() and min()

school_stats = edu.groupBy("school_name").agg(
    avg("gpa").alias("avg_gpa"),
    max("gpa").alias("max_gpa"),
    min("gpa").alias("min_gpa")
).orderBy("avg_gpa", ascending=False)

display(school_stats)

Explanation:

  • avg("gpa"), max("gpa"), min("gpa") calculate average, maximum, and minimum GPAs for each school

  • agg() allows multiple metrics in one step

  • Sorting by avg_gpa helps quickly identify top-performing schools

This is especially useful when you want a comprehensive summary of each group.



Summary

Using groupBy() in PySpark allows you to:

  • Summarize and aggregate data by categories

  • Count, average, sum, min, max values per group

  • Group by one or multiple columns

  • Produce group-level insights and summaries while leaving the original row-level DataFrame unchanged

GroupBy is an essential tool for data aggregation, reporting, and exploratory analysis.