Filtering in PySpark

The filter() transformation in PySpark is used to select only the rows that meet a specific condition.

In many datasets, we do not want to analyze every row. Instead, we often want to focus on only the data that is relevant to our question. The filter() function helps us do this by creating a new DataFrame that only contains rows that match a condition.

For example, imagine a dataset containing people’s ages from 15 to 21. If we only want to analyze adults, we could filter the dataset to include only ages 18 and older.

Filtering is especially useful when working with large datasets, because it allows us to quickly narrow the data down to only the information we need.



Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download WARN Layoff Data

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

| city | closure_layoff | company | county | effective_date | industry | notes | number_of_workers | region | state | temporary_permanent | unions | warn_received_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Statewide | None | Nordstrom Card Services | None | None | None | None | 15 | None | Utah | None | None | 2026-02-21 |
| Clackamas | Large Layoff - 10 or more workers | Vacuum Technique LLC | None | 04/30/2026-06/30/2026 | None | 9415 | 78 | None | Oregon | None | None | 2025-12-31 |
| Johnson City | Layoff | US Endodontics, LLC | Washington | 2/27/26 | None | #202500025 | 70 | None | Tennessee | Permanent | None | 2025-12-30 |
| Atlanta | Permanent Layoff / Reduction in Force | The Coca-Cola Company | Fulton | 2/28/25 | None | GA202500061 | 75 | None | Georgia | None | None | 2025-12-30 |
| Deposit | Layoff - Temporary | Crockett & Sons Concrete, Inc. | None | 12/26/25 | 238110 | 61 Marian Drive Port Deposit, MD 21904 | 4 | Susquehanna | Maryland | None | None | 2025-12-29 |
| Fayette | Closure | Eagle Machining - First Brands Group, LLC | Fulton | 02/27/2026-03/31/2026 | None | 007-25-87 | 251 | None | Ohio | Permanent | UAW Local 1181 | 2025-12-29 |
Fayette Closure Eagle Machining - First Brands Group, LLC Fulton 02/27/2026-03/31/2026 None 007-25-87 251 None Ohio Permanent UAW Local 1181 2025-12-29

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before filtering, we first load the dataset into a Spark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# In a Databricks notebook, `spark` is already defined; elsewhere, create it:
spark = SparkSession.builder.getOrCreate()

warn = spark.read.csv("data/warn-layoff-data-sample.csv", header=True, inferSchema=True)

display(warn)

The display() function, available in notebook environments such as Databricks, renders the DataFrame as an interactive table. Outside a notebook, warn.show(5) prints the first rows instead.



Numeric Filter Example

Sometimes we want to filter numeric columns, like number_of_workers, to focus on larger layoffs or smaller events.

For example, we can filter the dataset to include only events where 50 or more workers were affected.

large_layoffs = warn.filter(col("number_of_workers") >= 50)

display(large_layoffs)

This creates a new DataFrame that excludes events affecting fewer than 50 workers. Note that rows where number_of_workers is null are also dropped, because the comparison does not evaluate to true for null values.


Example Use Case

To better understand the filtered data, we can count how many events exist for each number of workers:

worker_counts = (
    large_layoffs
    .groupBy("number_of_workers")
    .count()
    .orderBy("number_of_workers", ascending=True)
)

worker_counts.show()

This allows us to see how many events affect a given number of workers, which is useful for summary statistics or further analysis.



Categorical Filter Example

Filtering is useful for categorical data, such as states.

For example, we can filter the dataset to show only events in California:

filtered_state = warn.filter(col("state") == "California")

display(filtered_state)

This returns only rows where the state column equals “California”.


Example Use Case

We can also see how many events occur in each state:

state_counts = (
    warn
    .groupBy("state")
    .count()
    .orderBy("state")
)

state_counts.show()

This helps us understand the distribution of layoff events across states.



Multiple Conditions (AND / OR) Filtering

We can filter rows based on more than one condition.

For example, keep only rows where:

  • number_of_workers ≥ 50 AND

  • closure_layoff contains the word “Layoff”

filtered_and = warn.filter(
    (col("number_of_workers") >= 50) &
    (col("closure_layoff").like("%Layoff%"))
)

display(filtered_and)

Or using OR, to keep rows where either condition is true:

filtered_or = warn.filter(
    (col("number_of_workers") >= 50) |
    (col("state") == "California")
)

display(filtered_or)

Why this is useful:

Combining conditions allows you to focus on events that meet multiple criteria simultaneously or any one of several criteria, giving you more control over your analysis.



Using isin() for Multiple Values

When filtering for multiple categories, isin() is very convenient.

filtered_states = warn.filter(
    col("state").isin("California", "Texas", "Washington")
)

display(filtered_states)

This code creates a new DataFrame containing only rows where the state column matches one of the values “California”, “Texas”, or “Washington”. Internally, isin() checks each row and returns True if the state is in the provided list, letting us examine several states at once with a single, readable condition.

Why this is useful:

It lets you select multiple states at once without writing multiple OR conditions, which keeps your code cleaner and easier to read.



Using between() for Numeric Ranges

The between() function allows you to filter numeric columns within a specific range.

For example, we can find layoffs that affect between 50 and 200 workers:

filtered_workers = warn.filter(
    col("number_of_workers").between(50, 200)
)

display(filtered_workers)

This code filters the dataset to include only rows where number_of_workers is between 50 and 200, inclusive. Each row is checked, and only those meeting the numeric range condition are kept in the new DataFrame. The display() function then shows all rows that fall within this range, which is especially useful for analyzing mid-sized layoffs or creating reports based on specific thresholds.

Why this is useful:

between() provides a readable way to filter numeric columns inclusively, which is especially handy for reporting or summary statistics.

Regex and like Filters

PySpark allows you to filter text columns using pattern matching. Two common approaches are:

  • like() — simple wildcard matching

  • rlike() — full regular expression matching (the equivalent of SQL’s RLIKE / REGEXP)

These are especially useful when you want to find rows that match a pattern, rather than an exact value.


Using like() for Simple Patterns

The like() function works like a SQL wildcard search.
For example, we can find all layoffs where the closure_layoff column contains the word "Temporary":

temp_layoffs = warn.filter(
    col("closure_layoff").like("%Temporary%")
)

display(temp_layoffs)

Explanation:

  • % is a wildcard that matches any sequence of characters.

  • This keeps all rows where closure_layoff contains "Temporary".

  • It’s useful for quick text filtering without needing full regular expressions.


Using rlike() for Regular Expressions

The rlike() function allows more complex pattern matching using regular expressions.
For example, to find layoffs that mention either "Layoff" or "Closure":

layoff_or_closure = warn.filter(
    col("closure_layoff").rlike("Layoff|Closure")
)

display(layoff_or_closure)

Explanation:

  • The | symbol means OR in regular expressions.

  • This keeps all rows where closure_layoff contains either “Layoff” or “Closure”.

  • Regular expressions let you capture complex text patterns, like multiple keywords, optional words, or specific formats (e.g., dates, codes).


Why this is useful

  • like() is perfect for simple substring searches.

  • rlike() is powerful when you need flexible pattern matching, e.g., multiple keywords, optional terms, or numeric patterns in text columns.

  • Both methods let you filter large datasets based on text content, which is very common in WARN or layoff reports.