Aggregation in PySpark

Aggregation in PySpark allows you to combine multiple rows into a single summary value using functions like:

  • sum()
  • avg() (average)
  • max()
  • min()
  • count()

Aggregation is useful for things like:

  • Summarizing data
  • Calculating daily, weekly, or monthly statistics
  • Finding trends, extremes, or averages across groups
  • Building inputs for reports and dashboards


Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download Daily Weather Data by Custom Weather

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

The dataset's column names are long and descriptive. The full names are:

  • average_dew_point_f___float_value_to_nearest_hundredths_place
  • average_relative_humidity_____float_value_to_nearest_hundredths_place
  • average_sea_level_pressure_millibars___float_value_to_nearest_hundredths_place
  • average_temperature_c___float_value_to_nearest_hundredths_place
  • average_visibility_kilometers___float_value_to_nearest_hundredths_place
  • average_wind_speed_knots___float_value_to_nearest_hundredths_place
  • city_location_identifier__up_to_9_alphanumeric_characters_
  • cooling_degree_days_c___float_value_to_nearest_hundredths_place
  • heating_degree_days_c___float_value_to_nearest_hundredths_place
  • maximum_temperature_c___float_value_to_nearest_hundredths_place
  • minimum_temperature_c___float_value_to_nearest_hundredths_place
  • precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation
  • valid_date_as_yyyymmdd

Here are the first few rows, with headers abbreviated for readability:

dew_pt  rel_hum  pressure  avg_temp  visib  wind   city  cdd  hdd    max_t  min_t  precip  date
-3.47   66.74    1014.60   1.5       16.10  4.80   K04W  0    16.83  7      -4     -1      2024-04-01
-0.95   84.05    1010.19   2.0       14.14  5.98   K04W  0    16.33  4      0      -1      2024-04-02
-6.49   47.54    1006.81   5.0       16.10  12.39  K04W  0    13.33  9      1      182     2024-04-03
-4.31   55.88    1013.92   5.0       16.10  5.42   K04W  0    13.33  12     -2     -1      2024-04-04
-4.03   58.93    1021.37   4.5       16.10  2.94   K04W  0    13.83  14     -5     -1      2024-04-05
-4.02   50.16    1020.35   6.5       16.10  7.06   K04W  0    11.83  15     -2     -1      2024-04-06

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before aggregating, we load the dataset into a Spark DataFrame.

# `spark` and `display()` are provided automatically in Databricks notebooks.
weather = spark.read.table("workspace.default.daily_weather_data_sample")

display(weather)

The display() function lets us visually inspect the dataset.



Aggregating a Single Column

Suppose we want the average temperature across all days.

from pyspark.sql.functions import avg

avg_temp = weather.agg(avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temperature"))

display(avg_temp)

Explanation:

  • agg() applies an aggregation function to a column

  • avg() calculates the mean temperature

  • alias("avg_temperature") gives a readable name to the result

This is useful to get a quick summary of the dataset.



Aggregating Multiple Columns

We can calculate multiple statistics in one step.

from pyspark.sql.functions import avg, max, min

weather_summary = weather.agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
)

display(weather_summary)

Explanation:

  • avg(), max(), min() provide a full summary of temperature ranges

  • agg() allows combining multiple aggregate metrics

  • Results help understand typical, high, and low temperatures



Aggregation with Grouping

Suppose we want to see the average temperature by city. We combine groupBy() with agg().

avg_temp_by_city = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
).orderBy("avg_temp", ascending=False)

display(avg_temp_by_city)

Explanation:

  • groupBy("city_location_identifier__up_to_9_alphanumeric_characters_") creates groups for each city

  • agg() calculates average, max, and min temperatures per city

  • Sorting by avg_temp helps identify the warmest and coolest cities

This is useful for city-level weather summaries.



Count and Sum Aggregations

We can also calculate counts and sums. For example, total precipitation per city:

from pyspark.sql.functions import sum, count

precip_summary = weather.groupBy("city_location_identifier__up_to_9_alphanumeric_characters_").agg(
    sum("precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation").alias("total_precip"),
    count("valid_date_as_yyyymmdd").alias("days_recorded")
).orderBy("total_precip", ascending=False)

display(precip_summary)

Explanation:

  • sum() calculates total precipitation per city

  • count() counts the number of days recorded per city

  • This helps identify cities with the most precipitation or data completeness



Summary

Aggregation in PySpark allows you to:

  • Summarize numeric data using sum, average, min, max

  • Combine grouping with aggregation for multi-level summaries

  • Quickly explore large datasets to extract meaningful insights

This is especially useful for reporting, dashboards, and statistical analysis.



More aggregation functions you might run into:

  • countDistinct() – Counts the number of unique values in a column.

  • first() / last() – Returns the first or last value in a column within a group.

  • collect_list() – Collects all values from a column into a list for each group.

  • collect_set() – Collects unique values into a set for each group.

  • variance() / var_samp() / var_pop() – Calculates variance for a column (sample or population).

  • stddev() / stddev_samp() / stddev_pop() – Calculates standard deviation for a column.

  • skewness() – Measures asymmetry of a distribution.

  • kurtosis() – Measures the tailedness of a distribution.

  • approx_count_distinct() – Estimates the number of distinct items efficiently for very large datasets.

  • sumDistinct() (renamed sum_distinct() in Spark 3.2+) – Sums distinct values only, ignoring duplicates.

  • And many more – see the pyspark.sql.functions module documentation for the full list.