Aggregation in PySpark

Aggregation in PySpark allows you to combine multiple rows into a single summary value using functions like:

  • sum()
  • avg() (average)
  • max()
  • min()
  • count()

Aggregation is useful for things like:

  • Summarizing data
  • Calculating daily, weekly, or monthly statistics
  • Finding trends, extremes, or averages across groups
  • Building inputs for reports and dashboards


Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download Daily Weather Data by Custom Weather

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

The dataset's column names are long and descriptive. The full names are:

  • average_dew_point_f___float_value_to_nearest_hundredths_place
  • average_relative_humidity_____float_value_to_nearest_hundredths_place
  • average_sea_level_pressure_millibars___float_value_to_nearest_hundredths_place
  • average_temperature_c___float_value_to_nearest_hundredths_place
  • average_visibility_kilometers___float_value_to_nearest_hundredths_place
  • average_wind_speed_knots___float_value_to_nearest_hundredths_place
  • city_location_identifier__up_to_9_alphanumeric_characters_
  • cooling_degree_days_c___float_value_to_nearest_hundredths_place
  • heating_degree_days_c___float_value_to_nearest_hundredths_place
  • maximum_temperature_c___float_value_to_nearest_hundredths_place
  • minimum_temperature_c___float_value_to_nearest_hundredths_place
  • precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation
  • valid_date_as_yyyymmdd

Here are the first few rows, with headers abbreviated for readability:

dew_pt  rel_hum  pressure  avg_temp  visib  wind   city  cdd  hdd    max_t  min_t  precip  date
-3.47   66.74    1014.60   1.5       16.10  4.80   K04W  0    16.83  7      -4     -1      2024-04-01
-0.95   84.05    1010.19   2.0       14.14  5.98   K04W  0    16.33  4      0      -1      2024-04-02
-6.49   47.54    1006.81   5.0       16.10  12.39  K04W  0    13.33  9      1      182     2024-04-03
-4.31   55.88    1013.92   5.0       16.10  5.42   K04W  0    13.33  12     -2     -1      2024-04-04
-4.03   58.93    1021.37   4.5       16.10  2.94   K04W  0    13.83  14     -5     -1      2024-04-05
-4.02   50.16    1020.35   6.5       16.10  7.06   K04W  0    11.83  15     -2     -1      2024-04-06

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before aggregating, we load the dataset into a Spark DataFrame.

# `spark` and `display()` are provided automatically in Databricks notebooks.
weather = spark.read.table("workspace.default.daily_weather_data_sample")

display(weather)

The display() function lets us visually inspect the dataset.



Aggregating a Single Column

Suppose we want the average temperature across all days.

from pyspark.sql.functions import avg

avg_temp = weather.agg(avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temperature"))

display(avg_temp)

Explanation:

  • agg() applies an aggregation function to a column

  • avg() calculates the mean temperature

  • alias("avg_temperature") gives a readable name to the result

This is useful to get a quick summary of the dataset.



Aggregating Multiple Columns

We can calculate multiple statistics in one step.

from pyspark.sql.functions import avg, max, min

weather_summary = weather.agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
)

display(weather_summary)

Explanation:

  • avg(), max(), min() provide a full summary of temperature ranges

  • agg() allows combining multiple aggregate metrics

  • Results help understand typical, high, and low temperatures



Aggregation with Grouping

Suppose we want to see the average temperature by city. We combine groupBy() with agg().

avg_temp_by_city = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
).orderBy("avg_temp", ascending=False)

display(avg_temp_by_city)

Explanation:

  • groupBy("city_location_identifier__up_to_9_alphanumeric_characters_") creates groups for each city

  • agg() calculates average, max, and min temperatures per city

  • Sorting by avg_temp helps identify the warmest and coolest cities

This is useful for city-level weather summaries.



Count and Sum Aggregations

We can also calculate counts and sums. For example, total precipitation per city:

from pyspark.sql.functions import sum, count

precip_summary = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    sum("precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation").alias("total_precip"),
    count("valid_date_as_yyyymmdd").alias("days_recorded")
).orderBy("total_precip", ascending=False)

display(precip_summary)

Explanation:

  • sum() calculates total precipitation per city

  • count() counts the number of days recorded per city

  • This helps identify cities with the most precipitation or data completeness



Summary

Aggregation in PySpark allows you to:

  • Summarize numeric data using sum, average, min, max

  • Combine grouping with aggregation for multi-level summaries

  • Quickly explore large datasets to extract meaningful insights

This is especially useful for reporting, dashboards, and statistical analysis.



More aggregation functions you might run into:

  • countDistinct() – Counts the number of unique values in a column.

  • first() / last() – Returns the first or last value in a column within a group.

  • collect_list() – Collects all values from a column into a list for each group.

  • collect_set() – Collects unique values into a set for each group.

  • variance() / var_samp() / var_pop() – Calculates variance for a column (sample or population).

  • stddev() / stddev_samp() / stddev_pop() – Calculates standard deviation for a column.

  • skewness() – Measures asymmetry of a distribution.

  • kurtosis() – Measures the tailedness of a distribution.

  • approx_count_distinct() – Estimates the number of distinct items efficiently for very large datasets.

  • sumDistinct() (renamed sum_distinct() in Spark 3.2+) – Sums distinct values only, ignoring duplicates.

  • And many more – see the pyspark.sql.functions module documentation for the full list.