Aggregation in PySpark allows you to combine multiple rows into a single summary value using functions such as:

- sum()
- avg() (average)
- max()
- min()
- count()

Aggregation is useful for tasks like computing totals, averages, and extreme values across a dataset.
In the following examples we will use a sample dataset from Dewey Data.
To follow along:

1. Go to Dewey Data
2. Download the Daily Weather Data by Custom Weather sample
3. Place the file in your project’s data folder
If you need help downloading datasets, see the tutorial section:
How to Get Dewey Data Sample Sets
| average_dew_point_f___float_value_to_nearest_hundredths_place | average_relative_humidity_____float_value_to_nearest_hundredths_place | average_sea_level_pressure_millibars___float_value_to_nearest_hundredths_place | average_temperature_c___float_value_to_nearest_hundredths_place | average_visibility_kilometers___float_value_to_nearest_hundredths_place | average_wind_speed_knots___float_value_to_nearest_hundredths_place | city_location_identifier__up_to_9_alphanumeric_characters_ | cooling_degree_days_c___float_value_to_nearest_hundredths_place | heating_degree_days_c___float_value_to_nearest_hundredths_place | maximum_temperature_c___float_value_to_nearest_hundredths_place | minimum_temperature_c___float_value_to_nearest_hundredths_place | precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation | valid_date_as_yyyymmdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -3.47 | 66.74 | 1014.60 | 1.5 | 16.10 | 4.80 | K04W | 0 | 16.83 | 7 | -4 | -1 | 2024-04-01 |
| -0.95 | 84.05 | 1010.19 | 2.0 | 14.14 | 5.98 | K04W | 0 | 16.33 | 4 | 0 | -1 | 2024-04-02 |
| -6.49 | 47.54 | 1006.81 | 5.0 | 16.10 | 12.39 | K04W | 0 | 13.33 | 9 | 1 | 182 | 2024-04-03 |
| -4.31 | 55.88 | 1013.92 | 5.0 | 16.10 | 5.42 | K04W | 0 | 13.33 | 12 | -2 | -1 | 2024-04-04 |
| -4.03 | 58.93 | 1021.37 | 4.5 | 16.10 | 2.94 | K04W | 0 | 13.83 | 14 | -5 | -1 | 2024-04-05 |
| -4.02 | 50.16 | 1020.35 | 6.5 | 16.10 | 7.06 | K04W | 0 | 11.83 | 15 | -2 | -1 | 2024-04-06 |
This displays the first few rows, letting us see column names and understand the dataset structure.
Before aggregating, we load the dataset into a Spark DataFrame.
```python
# `spark` and `display()` are provided automatically in Databricks notebooks
weather = spark.read.table("workspace.default.daily_weather_data_sample")
display(weather)
```

The display() function lets us visually inspect the dataset.
Suppose we want the average temperature across all days.
```python
from pyspark.sql.functions import avg

avg_temp = weather.agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temperature")
)
display(avg_temp)
```

Explanation:

- agg() applies an aggregation function to a column
- avg() calculates the mean temperature
- alias("avg_temperature") gives a readable name to the result

This is useful for getting a quick summary of the dataset.
We can calculate multiple statistics in one step.
```python
# note: these imports shadow Python's built-in max() and min() in this notebook
from pyspark.sql.functions import avg, max, min

weather_summary = weather.agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
)
display(weather_summary)
```

Explanation:

- avg(), max(), min() provide a full summary of the temperature range
- agg() allows combining multiple aggregate metrics in a single pass
- The results show the typical, highest, and lowest temperatures
Suppose we want to see the average temperature by city. We combine groupBy() with agg().
```python
avg_temp_by_city = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
).orderBy("avg_temp", ascending=False)
display(avg_temp_by_city)
```

Explanation:

- groupBy("city_location_identifier__up_to_9_alphanumeric_characters_") creates one group per city
- agg() calculates the average, max, and min temperature per city
- Sorting by avg_temp helps identify the warmest and coolest cities

This is useful for city-level weather summaries.
We can also calculate counts and sums. For example, total precipitation per city:
```python
# note: these imports shadow Python's built-in sum() in this notebook
from pyspark.sql.functions import sum, count

precip_summary = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    sum("precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation").alias("total_precip"),
    count("valid_date_as_yyyymmdd").alias("days_recorded")
).orderBy("total_precip", ascending=False)
display(precip_summary)
```

Explanation:

- sum() calculates the total precipitation per city
- count() counts the number of days recorded per city
- This helps identify the cities with the most precipitation and gauge data completeness
Aggregation in PySpark allows you to:

- Summarize numeric data using sum, average, min, and max
- Combine grouping with aggregation for multi-level summaries
- Quickly explore large datasets to extract meaningful insights

This is especially useful for reporting, dashboards, and statistical analysis.
More aggregation functions you might run into:

- countDistinct() – Counts the number of unique values in a column.
- first() / last() – Returns the first or last value in a column within a group.
- collect_list() – Collects all values from a column into a list for each group.
- collect_set() – Collects unique values into a set for each group.
- variance() / var_samp() / var_pop() – Calculates variance for a column (sample or population).
- stddev() / stddev_samp() / stddev_pop() – Calculates standard deviation for a column.
- skewness() – Measures the asymmetry of a distribution.
- kurtosis() – Measures the tailedness of a distribution.
- approx_count_distinct() – Estimates the number of distinct items efficiently for very large datasets.
- sumDistinct() – Sums distinct values only, ignoring duplicates.
And many more that can’t be listed here.