Sorting and Ordering in PySpark

The sort() and orderBy() transformations in PySpark reorder the rows of a DataFrame based on one or more columns. The two are interchangeable: orderBy() is an alias for sort().

Sorting allows us to quickly see the largest, smallest, earliest, or latest values in a dataset. Unlike filtering, which removes rows, sorting keeps all rows but changes their order.

For example, in a dataset of people’s ages, sorting by age ascending shows the youngest to oldest, while descending shows oldest to youngest.

Sorting is especially useful when working with large datasets, because it helps us identify trends, outliers, or top/bottom values quickly.
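Before moving to Spark, the idea can be illustrated in plain Python with the ages example above (the values here are made up for illustration):

```python
# Plain-Python illustration of ascending vs. descending order.
# Sorting reorders values; unlike filtering, no rows are removed.
ages = [34, 19, 57, 23, 41]

youngest_first = sorted(ages)                # ascending
oldest_first = sorted(ages, reverse=True)    # descending

print(youngest_first)  # [19, 23, 34, 41, 57]
print(oldest_first)    # [57, 41, 34, 23, 19]
assert len(youngest_first) == len(ages)      # every row is kept
```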



Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download Daily Weather Data by Custom Weather

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

The dataset's columns have verbose names that encode units and formats:

  • average_dew_point_f___float_value_to_nearest_hundredths_place
  • average_relative_humidity_____float_value_to_nearest_hundredths_place
  • average_sea_level_pressure_millibars___float_value_to_nearest_hundredths_place
  • average_temperature_c___float_value_to_nearest_hundredths_place
  • average_visibility_kilometers___float_value_to_nearest_hundredths_place
  • average_wind_speed_knots___float_value_to_nearest_hundredths_place
  • city_location_identifier_up_to_9_alphanumeric_characters
  • cooling_degree_days_c___float_value_to_nearest_hundredths_place
  • heating_degree_days_c___float_value_to_nearest_hundredths_place
  • maximum_temperature_c___float_value_to_nearest_hundredths_place
  • minimum_temperature_c___float_value_to_nearest_hundredths_place
  • precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation
  • valid_date_as_yyyymmdd

Sample rows (headers abbreviated for readability):

dew_point  rel_humid  pressure  avg_temp  visibility  wind   city  cool_dd  heat_dd  max_temp  min_temp  precip  date
-3.47      66.74      1014.60   1.5       16.10       4.80   K04W  0        16.83    7         -4        -1      2024-04-01
-0.95      84.05      1010.19   2.0       14.14       5.98   K04W  0        16.33    4         0         -1      2024-04-02
-6.49      47.54      1006.81   5.0       16.10       12.39  K04W  0        13.33    9         1         182     2024-04-03
-4.31      55.88      1013.92   5.0       16.10       5.42   K04W  0        13.33    12        -2        -1      2024-04-04
-4.03      58.93      1021.37   4.5       16.10       2.94   K04W  0        13.83    14        -5        -1      2024-04-05
-4.02      50.16      1020.35   6.5       16.10       7.06   K04W  0        11.83    15        -2        -1      2024-04-06

These first few rows show the column names and give a sense of the dataset's structure.



Loading the Data in PySpark

Before sorting, we load the dataset into a Spark DataFrame.

from pyspark.sql.functions import col

# `spark` (the active SparkSession) and display() are provided
# automatically in Databricks notebooks.
temp = spark.read.table("workspace.default.daily_weather_data_sample")

display(temp)

The display() function renders the DataFrame so we can visually inspect it.



Numeric Sort Example

The precipitation column is an integer with sentinel values:

  • -1 = no precipitation

  • 0 = trace amounts

  • positive values = measured precipitation, in hundredths of a millimeter
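This encoding can be made concrete with a small plain-Python helper (hypothetical, not part of the dataset or PySpark):

```python
# Hypothetical decoder for the precipitation encoding described above:
# -1 -> no precipitation, 0 -> trace amount, n > 0 -> n/100 millimeters.
def decode_precip(value: int):
    if value == -1:
        return None          # no precipitation recorded
    if value == 0:
        return "trace"       # trace amounts
    return value / 100.0     # hundredths of a millimeter -> millimeters

print(decode_precip(-1))     # None
print(decode_precip(0))      # trace
print(decode_precip(182))    # 1.82 (mm, as on 2024-04-03 in the sample)
```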

First, rename the column for easier use.

temp = temp.withColumnRenamed(
    "precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation",
    "precipitation"
)

Now, sort the dataset by precipitation ascending:

sorted_temp_asc = temp.sort(col("precipitation").asc())

display(sorted_temp_asc)

Sorting descending (largest to smallest):

sorted_temp_desc = temp.sort(col("precipitation").desc())

display(sorted_temp_desc)


Checking the Sorted Results

We can group by precipitation and count to confirm the order:

prec_count = (
    temp
    .groupBy("precipitation")
    .count()
    .orderBy("precipitation", ascending=True)
)

prec_count.show()
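The same group-and-count check can be mimicked in plain Python with collections.Counter, shown here as an analogy (the values below are illustrative, not the full dataset):

```python
from collections import Counter

# Illustrative precipitation values (not the real dataset).
precip = [-1, -1, 182, 0, -1, 182, 40]

# Analogy of groupBy("precipitation").count().orderBy("precipitation"):
# count occurrences of each value, then order by the value itself.
counts = sorted(Counter(precip).items())
for value, n in counts:
    print(value, n)
# Rows come out in ascending order of precipitation:
# (-1, 3), (0, 1), (40, 1), (182, 2)
```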



Categorical Sort Example

Sorting is also useful for categorical columns, like state names.

la = spark.read.table("workspace.default.layoff")

display(la)

Sort by state alphabetically:

sorted_states = la.sort(col("state").asc())

display(sorted_states)

Sort descending:

sorted_states_desc = la.sort(col("state").desc())

display(sorted_states_desc)


Example: Counting Entries by State

Grouping by state and counting shows why an ordered view is useful:

state_counts = (
    la
    .groupBy("state")
    .count()
    .orderBy("state", ascending=True)
)

state_counts.show()

This shows the distribution of records by state.



Sorting After Changing Data Types

Sometimes numeric columns are stored as text. We convert them before sorting.

edu = spark.read.table("workspace.default.education")

display(edu)

Convert end_date to numeric:

from pyspark.sql.functions import expr

edu2 = edu.withColumn(
    "end_date",
    expr("try_cast(end_date as bigint)")
)
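Spark SQL's try_cast returns NULL instead of raising an error when a value cannot be converted. A plain-Python sketch of that behavior:

```python
# Plain-Python sketch of Spark SQL's try_cast(x as bigint):
# values that cannot be parsed become None (NULL) instead of raising.
def try_cast_bigint(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print(try_cast_bigint("20240401"))  # 20240401
print(try_cast_bigint("N/A"))       # None
```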

Sort by end_date ascending:

sorted_edu = edu2.sort(col("end_date").asc())

display(sorted_edu)

Sort descending:

sorted_edu_desc = edu2.sort(col("end_date").desc())

display(sorted_edu_desc)


Example: Counting Records by End Date

Grouping by the numeric end_date gives one count per date, in chronological order:

date_counts = (
    edu2
    .groupBy("end_date")
    .count()
    .orderBy("end_date", ascending=True)
)

date_counts.show()

This allows us to see records in chronological order after sorting.
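Because end_date is a yyyymmdd integer, counts per calendar year (rather than per individual date) can be derived by integer division. A plain-Python sketch with illustrative values:

```python
from collections import Counter

# end_date is a yyyymmdd integer, e.g. 20240401.
# Integer-dividing by 10000 strips the month and day, leaving the year.
end_dates = [20230115, 20230620, 20240401, 20240402]  # illustrative values

years = [d // 10000 for d in end_dates]
year_counts = sorted(Counter(years).items())
print(year_counts)  # [(2023, 2), (2024, 2)]
```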

Multi-Column Sort Example

Sometimes we want to sort by more than one column.
For example, in the daily_weather_data_sample dataset, we might sort first by precipitation (ascending) and then by maximum temperature (descending), so that days with equal precipitation appear hottest first.

Because the maximum-temperature column also has a verbose name, rename it first:

temp = temp.withColumnRenamed(
    "maximum_temperature_c___float_value_to_nearest_hundredths_place",
    "temperature_max"
)

multi_sorted = temp.sort(
    col("precipitation").asc(),
    col("temperature_max").desc()
)

display(multi_sorted)
In this sort:

  1. col("precipitation").asc() – sorts by precipitation from smallest to largest.

  2. col("temperature_max").desc() – within equal precipitation values, sorts by maximum temperature from largest to smallest.

This is especially useful when analyzing hierarchies or ties in the data.
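The tie-breaking behavior can be seen in a plain-Python analogy (made-up (precipitation, temperature_max) pairs; a tuple key with a negated numeric component mirrors asc-then-desc):

```python
# Plain-Python analogy of a two-key sort: precipitation ascending,
# then max temperature descending within ties.
rows = [(0, 12), (-1, 7), (0, 15), (-1, 14)]  # (precipitation, temperature_max)

multi_sorted = sorted(rows, key=lambda r: (r[0], -r[1]))
print(multi_sorted)  # [(-1, 14), (-1, 7), (0, 15), (0, 12)]
```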


Multi-Column Sort with Categorical Data

We can also sort by a categorical column and a numeric column. For example, sorting the layoff dataset by state ascending and number_of_layoffs descending:

sorted_layoffs = la.sort(
    col("state").asc(),
    col("number_of_layoffs").desc()
)

display(sorted_layoffs)

This lets us quickly identify which states had the largest layoffs while keeping the states in alphabetical order.
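As a plain-Python analogy (with made-up (state, layoffs) pairs), mixed sort directions can be expressed either with a tuple key that negates the numeric column, or more generally by exploiting sort stability and sorting by the least significant key first:

```python
rows = [("TX", 300), ("CA", 500), ("TX", 900), ("CA", 100)]  # (state, layoffs)

# Option 1: tuple key, negating the numeric column for descending order.
by_tuple = sorted(rows, key=lambda r: (r[0], -r[1]))

# Option 2: two stable passes, least significant key first.
# This also works when the descending key is a string and cannot be negated.
by_passes = sorted(rows, key=lambda r: r[1], reverse=True)  # layoffs desc
by_passes = sorted(by_passes, key=lambda r: r[0])           # state asc (stable)

print(by_tuple)  # [('CA', 500), ('CA', 100), ('TX', 900), ('TX', 300)]
assert by_tuple == by_passes
```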