Window functions allow us to perform calculations across sets
of rows related to the current row.
Unlike aggregations, window functions don’t collapse rows, so you can compute values such as running totals, month-over-month changes, and per-group rankings while keeping every row.
Windowing is especially useful for time series data or grouped comparisons, where context matters.
In the following examples we will use a sample dataset from Dewey Data. To follow along:

1. Go to Dewey Data
2. Download the Airline Employment Data By WARN Database
3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets.
| carrier_group | carrier_name | full_time | month_date | month_date_parsed | part_time | total |
|---|---|---|---|---|---|---|
| Major | Alaska | 19,273 | Oct 2025 | 2025-10-01 | 2,322 | 21,595 |
| Major | Allegiant Air | 5,364 | Oct 2025 | 2025-10-01 | 958 | 6,322 |
| Major | American | 96,547 | Oct 2025 | 2025-10-01 | 12,016 | 108,563 |
| Major | Atlas Air | 4,281 | Oct 2025 | 2025-10-01 | 53 | 4,334 |
| Major | Delta | 103,018 | Oct 2025 | 2025-10-01 | 1,468 | 104,486 |
| Major | Envoy Air | 12,971 | Oct 2025 | 2025-10-01 | 6,835 | 19,806 |
This displays the first few rows, letting us see column names and understand the dataset structure.
Before filtering, we first load the dataset into a Spark DataFrame. Note that later examples reference this DataFrame as `airline`:

```python
# `spark` is the SparkSession provided by the Databricks / PySpark shell environment
airline = spark.read.csv("data/airline-employment-data-sample.csv", header=True, inferSchema=True)
display(airline)
```

The `display()` function allows us to visually inspect the dataset.
A window defines:

- Partitioning: which group each row belongs to (e.g., `carrier_name`)
- Ordering: how rows are sorted within the partition (e.g., `month_date_parsed`)
- Frame: which rows are included in the calculation (with an `orderBy`, the default is all rows from the start of the partition up to the current row)
```python
from pyspark.sql.window import Window

# Partition by airline and order by month
windowSpec = Window.partitionBy("carrier_name").orderBy("month_date_parsed")
```

Explanation:

- `partitionBy("carrier_name")` creates separate groups for each airline.
- `orderBy("month_date_parsed")` ensures calculations consider the temporal order of rows.
We can use `lag()` to get the value of a column from the previous row in the window.

```python
from pyspark.sql.functions import lag

airline_lag = airline.withColumn(
    "prev_full_time",
    lag("full_time").over(windowSpec)
)

display(airline_lag.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time"))
```

Explanation:

- `lag("full_time")` retrieves the previous row’s `full_time` value for each airline (the first row in each partition gets `NULL`).
- This is useful for seeing month-to-month changes without aggregating rows.
We can subtract the previous month’s value from the current month to calculate the change in full-time employees.

```python
from pyspark.sql.functions import col

airline_change = airline_lag.withColumn(
    "full_time_change",
    col("full_time") - col("prev_full_time")
)

display(airline_change.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time", "full_time_change"))
```

Explanation:

- `full_time_change` shows the growth or reduction in employees for each airline.
- This helps analyze trends over time.
Windowing can also calculate a cumulative sum of full-time employees over months.

```python
# Alias the import so it doesn't shadow Python's built-in sum()
from pyspark.sql.functions import sum as spark_sum

airline_cumsum = airline.withColumn(
    "cumulative_full_time",
    spark_sum("full_time").over(windowSpec)
)

display(airline_cumsum.select("carrier_name", "month_date_parsed", "full_time", "cumulative_full_time"))
```

Explanation:

- `spark_sum("full_time").over(windowSpec)` keeps a running total for each airline.
Running totals are useful for tracking long-term trends without collapsing the dataset.
Windowing also allows ranking within a partition. For example, rank airlines by total employees each month:

```python
from pyspark.sql.functions import rank, desc

month_window = Window.partitionBy("month_date_parsed").orderBy(desc("full_time"))

airline_rank = airline.withColumn(
    "rank_full_time",
    rank().over(month_window)
)

display(airline_rank.select("carrier_name", "month_date_parsed", "full_time", "rank_full_time"))
```

Explanation:

- `rank()` assigns a rank to each airline within the same month, highest `full_time` first.
Useful for leaderboards or identifying top contributors.
Window functions in PySpark let you analyze data in context:

- `lag` / `lead` – compare rows within a group
- `sum`, `avg` – running totals or moving averages
- `rank`, `dense_rank` – rank rows within a group
- Partitions let you calculate group-wise metrics without collapsing rows
Windowing is essential for time series, trend analysis, and group-based insights.