Window Functions in PySpark

Window functions allow us to perform calculations across sets of rows related to the current row.
Unlike grouped aggregations, window functions don’t collapse rows, so you can compute things like:

  • Running totals
  • Differences between current and previous rows
  • Rankings within a group
  • Moving averages

Windowing is especially useful for time series data or grouped comparisons, where context matters.



Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download Airline Employment Data By WARN Database

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

carrier_group  carrier_name   full_time  month_date  month_date_parsed  part_time    total
Major          Alaska            19,273  Oct 2025    2025-10-01             2,322   21,595
Major          Allegiant Air      5,364  Oct 2025    2025-10-01               958    6,322
Major          American          96,547  Oct 2025    2025-10-01            12,016  108,563
Major          Atlas Air          4,281  Oct 2025    2025-10-01                53    4,334
Major          Delta            103,018  Oct 2025    2025-10-01             1,468  104,486
Major          Envoy Air         12,971  Oct 2025    2025-10-01             6,835   19,806

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before defining any windows, we first load the dataset into a Spark DataFrame.

from pyspark.sql.functions import col

airline = spark.read.csv(
    "data/airline-employment-data-sample.csv",
    header=True,
    inferSchema=True
)

display(airline)

The display() function, available in Databricks notebooks, lets us visually inspect the dataset; in a plain PySpark session, use .show() instead. Here spark is the active SparkSession, which notebooks and the PySpark shell provide for you.



Setting Up a Window

A window defines:

  • Partitioning: which group each row belongs to (e.g., carrier_name)

  • Ordering: how rows are sorted within the partition (e.g., month_date_parsed)

  • Frame: which rows within the partition are included in the calculation (with an ordering specified, the default frame runs from the start of the partition up to the current row)

from pyspark.sql.window import Window

# Partition by airline and order by month
windowSpec = Window.partitionBy("carrier_name").orderBy("month_date_parsed")

Explanation:

  • partitionBy("carrier_name") creates separate groups for each airline.

  • orderBy("month_date_parsed") ensures calculations consider the temporal order of rows.



Calculating Previous Month Full-Time Employees (lag)

We can use lag() to get the value of a column from the previous row in the window.

from pyspark.sql.functions import lag

airline_lag = airline.withColumn(
    "prev_full_time",
    lag("full_time").over(windowSpec)
)

display(airline_lag.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time"))

Explanation:

  • lag("full_time") retrieves the previous row’s full_time value for each airline.

  • Useful to see month-to-month changes without aggregating rows.



Calculating Change in Full-Time Employees

We can subtract the previous month’s value from the current month to calculate change in full-time employees.

from pyspark.sql.functions import col

airline_change = airline_lag.withColumn(
    "full_time_change",
    col("full_time") - col("prev_full_time")
)

display(airline_change.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time", "full_time_change"))

Explanation:

  • full_time_change shows growth or reduction in employees for each airline.

  • This helps analyze trends over time.



Running Total of Employees

Windowing can also calculate a cumulative sum of full-time employees over months.

from pyspark.sql.functions import sum as spark_sum

# Alias the import so it doesn't shadow Python's built-in sum()
airline_cumsum = airline.withColumn(
    "cumulative_full_time",
    spark_sum("full_time").over(windowSpec)
)

display(airline_cumsum.select("carrier_name", "month_date_parsed", "full_time", "cumulative_full_time"))

Explanation:

  • Summing full_time over windowSpec keeps a running total for each airline.

  • Running totals are useful for tracking long-term trends without collapsing the dataset.



Ranking Airlines by Month

Windowing also allows ranking within a partition.
For example, rank airlines by full-time employees each month:

from pyspark.sql.functions import rank, desc

month_window = Window.partitionBy("month_date_parsed").orderBy(desc("full_time"))

airline_rank = airline.withColumn(
    "rank_full_time",
    rank().over(month_window)
)

display(airline_rank.select("carrier_name", "month_date_parsed", "full_time", "rank_full_time"))

Explanation:

  • rank() assigns a rank to each airline within the same month, highest full_time first.

  • Useful for leaderboards or identifying top contributors.



Summary

Window functions in PySpark let you analyze data in context:

  • lag / lead – compare rows within a group

  • sum, avg – running totals or moving averages

  • rank, dense_rank – rank rows within a group

  • All of these work over partitions to compute group-wise metrics without collapsing rows

Windowing is essential for time series, trend analysis, and group-based insights.