The filter() transformation in PySpark
is used to select only the rows that meet a specific
condition.
In many datasets, we do not want to analyze every row. Instead, we
often want to focus on only the data that is relevant to our question.
The filter() function helps us do this by creating a
new DataFrame that only contains rows that match a
condition.
For example, imagine a dataset containing people’s ages from 15 to 21. If we only want to analyze adults, we could filter the dataset to include only ages 18 and older.
Filtering is especially useful when working with large datasets, because it allows us to quickly narrow the data down to only the information we need.
In the following examples we will use a sample dataset from Dewey Data.
To follow along:

1. Go to Dewey Data
2. Download the WARN Layoff Data sample
3. Place the file in your project's `data` folder

If you need help downloading datasets, see the tutorial section "How to Get Dewey Data Sample Sets".
| city | closure_layoff | company | county | effective_date | industry | notes | number_of_workers | region | state | temporary_permanent | unions | warn_received_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Statewide | None | Nordstrom Card Services | None | None | None | None | 15 | None | Utah | None | None | 2026-02-21 |
| Clackamas | Large Layoff - 10 or more workers | Vacuum Technique LLC | None | 04/30/2026-06/30/2026 | None | 9415 | 78 | None | Oregon | None | None | 2025-12-31 |
| Johnson City | Layoff | US Endodontics, LLC | Washington | 2/27/26 | None | #202500025 | 70 | None | Tennessee | Permanent | None | 2025-12-30 |
| Atlanta | Permanent Layoff / Reduction in Force | The Coca-Cola Company | Fulton | 2/28/25 | None | GA202500061 | 75 | None | Georgia | None | None | 2025-12-30 |
| Deposit | Layoff - Temporary | Crockett & Sons Concrete, Inc. | None | 12/26/25 | 238110 | 61 Marian Drive Port Deposit, MD 21904 | 4 | Susquehanna | Maryland | None | None | 2025-12-29 |
| Fayette | Closure | Eagle Machining - First Brands Group, LLC | Fulton | 02/27/2026-03/31/2026 | None | 007-25-87 | 251 | None | Ohio | Permanent | UAW Local 1181 | 2025-12-29 |
This displays the first few rows, letting us see column names and understand the dataset structure.
Before filtering, we first load the dataset into a Spark DataFrame.
```python
from pyspark.sql.functions import col

# Load the WARN sample into a Spark DataFrame
warn = spark.read.csv(
    "data/warn-layoff-data-sample.csv",
    header=True,
    inferSchema=True,
)

display(warn)
```

The `display()` function allows us to visually inspect the dataset.
Sometimes we want to filter numeric columns, like
number_of_workers, to focus on larger
layoffs or smaller events.
For example, we can filter the dataset to include only events where 50 or more workers were affected.
This creates a new DataFrame that excludes small events affecting fewer than 50 workers.
To better understand the filtered data, we can count how many events exist for each number of workers:
```python
worker_counts = (
    large_layoffs
    .groupBy("number_of_workers")
    .count()
    .orderBy("number_of_workers", ascending=True)
)

worker_counts.show()
```

This allows us to see how many events affect a given number of workers, which is useful for summary statistics or further analysis.
Filtering is also useful for categorical data, such as states.
For example, we can filter the dataset to show only events in California:
This returns only rows where the state column equals “California”.
We can also see how many events occur in each state:
This helps us understand the distribution of layoff events across states.
We can filter rows based on more than one condition.
For example, keep only rows where:

- `number_of_workers` ≥ 50, AND
- `closure_layoff` contains the word "Layoff"
```python
filtered_and = warn.filter(
    (col("number_of_workers") >= 50) &
    (col("closure_layoff").like("%Layoff%"))
)

display(filtered_and)
```

Or using OR, to keep rows where either condition is true:
```python
filtered_or = warn.filter(
    (col("number_of_workers") >= 50) |
    (col("state") == "California")
)

display(filtered_or)
```

Why this is useful:
Combining conditions allows you to focus on events that meet multiple criteria simultaneously or any one of several criteria, giving you more control over your analysis.
### isin() for Multiple Values

When filtering for multiple categories, `isin()` is very convenient.
```python
filtered_states = warn.filter(
    col("state").isin("California", "Texas", "Washington")
)

display(filtered_states)
```

This code creates a new DataFrame containing only rows where the `state` column matches any one of the values "California", "Texas", or "Washington". Internally, `isin()` checks each row and returns True if the state is in the provided list. The `display()` function then shows all rows that meet this condition, letting us quickly examine multiple states at once without writing multiple OR conditions.
Why this is useful:
It lets you select multiple states at once without writing multiple OR conditions, which keeps your code cleaner and easier to read.
### between() for Numeric Ranges

The `between()` function allows you to filter numeric columns within a specific range.

For example, we can find layoffs that affect between 50 and 200 workers:
```python
filtered_workers = warn.filter(
    col("number_of_workers").between(50, 200)
)

display(filtered_workers)
```

This code filters the dataset to include only rows where `number_of_workers` is between 50 and 200, inclusive. Each row is checked, and only those meeting the numeric range condition are kept in the new DataFrame. The `display()` function then shows all rows that fall within this range, which is especially useful for analyzing mid-sized layoffs or creating reports based on specific thresholds.
Why this is useful:
between() provides a readable way to filter
numeric columns inclusively, which is especially handy for
reporting or summary statistics.
### like Filters

PySpark allows you to filter text columns using pattern matching. Two common approaches are:

- `like()` — simple wildcard matching
- `rlike()` / regexp — full regular expression matching

These are especially useful when you want to find rows that match a pattern, rather than an exact value.
### like() for Simple Patterns

The `like()` function works like a SQL wildcard search.

For example, we can find all layoffs where the `closure_layoff` column contains the word "Temporary":
Explanation:

- `%` is a wildcard that matches any sequence of characters.
- This keeps all rows where `closure_layoff` contains "Temporary".
- It's useful for quick text filtering without needing full regular expressions.
### rlike() for Regular Expressions

The `rlike()` function allows more complex pattern matching using regular expressions.

For example, to find layoffs that mention either "Layoff" or "Closure":
```python
layoff_or_closure = warn.filter(
    col("closure_layoff").rlike("Layoff|Closure")
)

display(layoff_or_closure)
```

Explanation:
- The `|` symbol means OR in regular expressions.
- This keeps all rows where `closure_layoff` contains either "Layoff" or "Closure".
- Regular expressions let you capture complex text patterns, like multiple keywords, optional words, or specific formats (e.g., dates, codes).
- `like()` is perfect for simple substring searches.
- `rlike()` is powerful when you need flexible pattern matching, e.g., multiple keywords, optional terms, or numeric patterns in text columns.
Both methods let you filter large datasets based on text content, which is very common in WARN or layoff reports.