Filtering in PySpark

The filter() transformation in PySpark is used to select only the rows that meet a specific condition.

In many datasets, we do not want to analyze every row. Instead, we often want to focus on only the data that is relevant to our question. The filter() function helps us do this by creating a new DataFrame that only contains rows that match a condition.

For example, imagine a dataset containing people’s ages from 15 to 21. If we only want to analyze adults, we could filter the dataset to include only ages 18 and older.

Filtering is especially useful when working with large datasets, because it allows us to quickly narrow the data down to only the information we need.



Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download WARN Layoff Data

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

| city | closure_layoff | company | county | effective_date | industry | notes | number_of_workers | region | state | temporary_permanent | unions | warn_received_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Statewide | None | Nordstrom Card Services | None | None | None | None | 15 | None | Utah | None | None | 2026-02-21 |
| Clackamas | Large Layoff - 10 or more workers | Vacuum Technique LLC | None | 04/30/2026-06/30/2026 | None | 9415 | 78 | None | Oregon | None | None | 2025-12-31 |
| Johnson City | Layoff | US Endodontics, LLC | Washington | 2/27/26 | None | #202500025 | 70 | None | Tennessee | Permanent | None | 2025-12-30 |
| Atlanta | Permanent Layoff / Reduction in Force | The Coca-Cola Company | Fulton | 2/28/25 | None | GA202500061 | 75 | None | Georgia | None | None | 2025-12-30 |
| Deposit | Layoff - Temporary | Crockett & Sons Concrete, Inc. | None | 12/26/25 | 238110 | 61 Marian Drive Port Deposit, MD 21904 | 4 | Susquehanna | Maryland | None | None | 2025-12-29 |
| Fayette | Closure | Eagle Machining - First Brands Group, LLC | Fulton | 02/27/2026-03/31/2026 | None | 007-25-87 | 251 | None | Ohio | Permanent | UAW Local 1181 | 2025-12-29 |
Fayette Closure Eagle Machining - First Brands Group, LLC Fulton 02/27/2026-03/31/2026 None 007-25-87 251 None Ohio Permanent UAW Local 1181 2025-12-29

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before filtering, we first load the dataset into a Spark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# In a Databricks notebook, `spark` is already defined; elsewhere, create it:
spark = SparkSession.builder.getOrCreate()

warn = spark.read.csv("data/warn-layoff-data-sample.csv", header=True, inferSchema=True)

display(warn)

The display() function, available in notebook environments such as Databricks, renders the DataFrame as an interactive table. Outside a notebook, warn.show(5) prints the first rows instead.



Numeric Filter Example

Sometimes we want to filter numeric columns, like number_of_workers, to focus on larger layoffs or smaller events.

For example, we can filter the dataset to include only events where 50 or more workers were affected.

large_layoffs = warn.filter(col("number_of_workers") >= 50)

display(large_layoffs)

This creates a new DataFrame that excludes events affecting fewer than 50 workers. Note that rows where number_of_workers is null are also dropped, because the comparison does not evaluate to true for null values.


Example Use Case

To better understand the filtered data, we can count how many events exist for each number of workers:

worker_counts = (
    large_layoffs
    .groupBy("number_of_workers")
    .count()
    .orderBy("number_of_workers", ascending=True)
)

worker_counts.show()

This allows us to see how many events affect a given number of workers, which is useful for summary statistics or further analysis.



Categorical Filter Example

Filtering is useful for categorical data, such as states.

For example, we can filter the dataset to show only events in California:

filtered_state = warn.filter(col("state") == "California")

display(filtered_state)

This returns only rows where the state column equals “California”.


Example Use Case

We can also see how many events occur in each state:

state_counts = (
    warn
    .groupBy("state")
    .count()
    .orderBy("state")
)

state_counts.show()

This helps us understand the distribution of layoff events across states.



Multiple Conditions (AND / OR) Filtering

We can filter rows based on more than one condition.

For example, keep only rows where:

  • number_of_workers ≥ 50 AND

  • closure_layoff contains the word “Layoff”

filtered_and = warn.filter(
    (col("number_of_workers") >= 50) &
    (col("closure_layoff").like("%Layoff%"))
)

display(filtered_and)

Or using OR, to keep rows where either condition is true:

filtered_or = warn.filter(
    (col("number_of_workers") >= 50) |
    (col("state") == "California")
)

display(filtered_or)

Why this is useful:

Combining conditions allows you to focus on events that meet multiple criteria simultaneously or any one of several criteria, giving you more control over your analysis.



Using isin() for Multiple Values

When filtering for multiple categories, isin() is very convenient.

filtered_states = warn.filter(
    col("state").isin("California", "Texas", "Washington")
)

display(filtered_states)

This code creates a new DataFrame containing only rows where the state column matches one of the values “California”, “Texas”, or “Washington”. Internally, isin() checks each row and returns True if the state is in the provided list, letting us examine several states at once with a single, readable condition.

Why this is useful:

It lets you select multiple states at once without writing multiple OR conditions, which keeps your code cleaner and easier to read.



Using between() for Numeric Ranges

The between() function allows you to filter numeric columns within a specific range.

For example, we can find layoffs that affect between 50 and 200 workers:

filtered_workers = warn.filter(
    col("number_of_workers").between(50, 200)
)

display(filtered_workers)

This code filters the dataset to include only rows where number_of_workers is between 50 and 200, inclusive. Each row is checked, and only those meeting the numeric range condition are kept in the new DataFrame. The display() function then shows all rows that fall within this range, which is especially useful for analyzing mid-sized layoffs or creating reports based on specific thresholds.

Why this is useful:

between() provides a readable way to filter numeric columns inclusively, which is especially handy for reporting or summary statistics.

Regex and like Filters

PySpark allows you to filter text columns using pattern matching. Two common approaches are:

  • like() — simple wildcard matching

  • rlike() — full regular expression matching (the equivalent of SQL’s RLIKE / REGEXP)

These are especially useful when you want to find rows that match a pattern, rather than an exact value.


Using like() for Simple Patterns

The like() function works like a SQL wildcard search.
For example, we can find all layoffs where the closure_layoff column contains the word "Temporary":

temp_layoffs = warn.filter(
    col("closure_layoff").like("%Temporary%")
)

display(temp_layoffs)

Explanation:

  • % is a wildcard that matches any sequence of characters.

  • This keeps all rows where closure_layoff contains "Temporary".

  • It’s useful for quick text filtering without needing full regular expressions.


Using rlike() for Regular Expressions

The rlike() function allows more complex pattern matching using regular expressions.
For example, to find layoffs that mention either "Layoff" or "Closure":

layoff_or_closure = warn.filter(
    col("closure_layoff").rlike("Layoff|Closure")
)

display(layoff_or_closure)

Explanation:

  • The | symbol means OR in regular expressions.

  • This keeps all rows where closure_layoff contains either “Layoff” or “Closure”.

  • Regular expressions let you capture complex text patterns, like multiple keywords, optional words, or specific formats (e.g., dates, codes).


Why this is useful

  • like() is perfect for simple substring searches.

  • rlike() is powerful when you need flexible pattern matching, e.g., multiple keywords, optional terms, or numeric patterns in text columns.

  • Both methods let you filter large datasets based on text content, which is very common in WARN or layoff reports.