The `sort()` or `orderBy()` transformation in PySpark is used to reorder rows in a DataFrame based on one or more columns.
Sorting allows us to quickly see the largest, smallest, earliest, or latest values in a dataset. Unlike filtering, which removes rows, sorting keeps all rows but changes their order.
For example, in a dataset of people’s ages, sorting by age ascending shows the youngest to oldest, while descending shows oldest to youngest.
Sorting is especially useful when working with large datasets, because it helps us identify trends, outliers, or top/bottom values quickly.
In the following examples we will use a sample dataset from Dewey Data.
To follow along:

1. Go to Dewey Data
2. Download Daily Weather Data by Custom Weather
3. Place the file in your project's data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets.
| average_dew_point_f___float_value_to_nearest_hundredths_place | average_relative_humidity_____float_value_to_nearest_hundredths_place | average_sea_level_pressure_millibars___float_value_to_nearest_hundredths_place | average_temperature_c___float_value_to_nearest_hundredths_place | average_visibility_kilometers___float_value_to_nearest_hundredths_place | average_wind_speed_knots___float_value_to_nearest_hundredths_place | city_location_identifier_up_to_9_alphanumeric_characters | cooling_degree_days_c___float_value_to_nearest_hundredths_place | heating_degree_days_c___float_value_to_nearest_hundredths_place | maximum_temperature_c___float_value_to_nearest_hundredths_place | minimum_temperature_c___float_value_to_nearest_hundredths_place | precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation | valid_date_as_yyyymmdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -3.47 | 66.74 | 1014.60 | 1.5 | 16.10 | 4.80 | K04W | 0 | 16.83 | 7 | -4 | -1 | 2024-04-01 |
| -0.95 | 84.05 | 1010.19 | 2.0 | 14.14 | 5.98 | K04W | 0 | 16.33 | 4 | 0 | -1 | 2024-04-02 |
| -6.49 | 47.54 | 1006.81 | 5.0 | 16.10 | 12.39 | K04W | 0 | 13.33 | 9 | 1 | 182 | 2024-04-03 |
| -4.31 | 55.88 | 1013.92 | 5.0 | 16.10 | 5.42 | K04W | 0 | 13.33 | 12 | -2 | -1 | 2024-04-04 |
| -4.03 | 58.93 | 1021.37 | 4.5 | 16.10 | 2.94 | K04W | 0 | 13.83 | 14 | -5 | -1 | 2024-04-05 |
| -4.02 | 50.16 | 1020.35 | 6.5 | 16.10 | 7.06 | K04W | 0 | 11.83 | 15 | -2 | -1 | 2024-04-06 |
This displays the first few rows, letting us see column names and understand the dataset structure.
Before sorting, we load the dataset into a Spark DataFrame.

```python
from pyspark.sql.functions import col

# read the sample table registered in the workspace catalog
temp = spark.read.table("workspace.default.daily_weather_data_sample")
display(temp)
```

The `display()` function lets us visually inspect the dataset.
The precipitation column contains numeric data:

- `-1` = no precipitation
- `0` or higher = actual measurements (in hundredths of a millimeter)
First, rename the column for easier use.

```python
temp = temp.withColumnRenamed(
    "precipitation_integer_in_hundredths_of_a_millimeter___liquid_equivalent____0__is_used_for_trace_amounts_and___1__is_used_for_no_precipitation",
    "precipitation"
)
```

Now, sort the dataset by precipitation ascending:
Sorting descending (largest to smallest):
Sorting is also useful for categorical columns, like state names. The next examples use the layoff dataset (loaded as `la`), which also appears in the multi-column sorting example later in this section.
Sort by state alphabetically:
Sort descending:
As an example of why this is useful, we can count entries by state:
This shows the distribution of records by state.
Sometimes numeric columns are stored as text, so we convert them before sorting. For example, in the education dataset (loaded as `edu`), convert end_date to numeric:
```python
from pyspark.sql.functions import expr

# try_cast returns NULL instead of raising an error on values that cannot be cast
edu2 = edu.withColumn(
    "end_date",
    expr("try_cast(end_date as bigint)")
)
```

Sort by end_date ascending:
Sort descending:
As an example of why this is useful, we can count records by year:
This allows us to see records in chronological order after sorting.
Sometimes we want to sort by more than one column. For example, in the daily_weather_data_sample dataset, we might want to sort first by precipitation (ascending) and then by temperature_max (descending). This helps us see, for example, days with the same precipitation but the hottest first.
```python
# mirror the precipitation rename: shorten the long maximum-temperature column name
temp = temp.withColumnRenamed(
    "maximum_temperature_c___float_value_to_nearest_hundredths_place",
    "temperature_max"
)
multi_sorted = temp.sort(
    col("precipitation").asc(),
    col("temperature_max").desc()
)
display(multi_sorted)
```

- `precipitation.asc()` – sorts by precipitation from smallest to largest.
- `temperature_max.desc()` – within the same precipitation value, sorts by maximum temperature from largest to smallest.
This is especially useful when analyzing hierarchies or ties in the data.
We can also sort by a categorical column and a numeric column. For example, sorting the layoff dataset by state ascending and number_of_layoffs descending:
```python
sorted_layoffs = la.sort(
    col("state").asc(),
    col("number_of_layoffs").desc()
)
display(sorted_layoffs)
```

This lets us quickly identify which states had the largest layoffs while keeping the states in alphabetical order.