

Understanding Postgres Window Functions

Posted on Thursday, Jun 02, 2016

Recently I stumbled into a very handy feature of Postgres: window functions. A window function gives you a “window” into other rows which are somehow related to the current row. You can then perform a calculation across that set of rows. Window functions behave much like regular aggregate functions, with the caveat that the rows do not become grouped into a single row! Let’s explore this feature, and why it might be useful.

Let’s imagine that we have an application that is used to record sales data for different stores. To keep things simple, we will say that the application’s database has only three tables: sales, users, and stores.

SALES                                                        STORES
+------+-----------+------------+----------+--------------+  +------+---------------+
|   id |   user_id |   store_id | amount   | entered_at   |  |   id | name          |
|------+-----------+------------+----------+--------------|  |------+---------------|
|    1 |         1 |          1 | 10000.0  | 2015-11-07   |  |    1 | Gucci         |
|    2 |         1 |          1 | 30000.0  | 2015-11-29   |  |    2 | Louis Vuitton |
|    3 |         1 |          2 | 11000.0  | 2015-11-08   |  |    3 | Fendi         |
|    4 |         1 |          2 | 19000.0  | 2015-11-26   |  +------+---------------+
|    5 |         1 |          3 | 14000.0  | 2015-11-15   |
|    6 |         1 |          3 | 14000.0  | 2015-11-23   |  USERS
|    7 |         1 |          1 | 30000.0  | 2015-12-08   |  +------+--------+
|    8 |         1 |          1 | 40000.0  | 2015-12-29   |  |   id | name   |
|    9 |         1 |          2 | 10000.0  | 2015-12-01   |  |------+--------|
|   10 |         1 |          2 | 11400.0  | 2015-12-14   |  |    1 | Sarah  |
|   11 |         1 |          3 | 21000.0  | 2015-12-26   |  +------+--------+
|   12 |         1 |          3 | 21000.0  | 2015-12-26   |
+------+-----------+------------+----------+--------------+

In our application, users enter sales data for a given store. The sales table has a column for “entered_at” which represents when the sales data was entered.
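
If you want to follow along, here is a minimal sketch of the schema with a few of the sample rows loaded. The column types are assumptions on my part; the original tables could just as well use timestamps or a different numeric type for amount:

CREATE TABLE stores (id serial PRIMARY KEY, name text NOT NULL);
CREATE TABLE users  (id serial PRIMARY KEY, name text NOT NULL);
CREATE TABLE sales (
    id serial PRIMARY KEY,
    user_id integer REFERENCES users (id),
    store_id integer REFERENCES stores (id),
    amount numeric NOT NULL,
    entered_at date NOT NULL
);

INSERT INTO users (name) VALUES ('Sarah');
INSERT INTO stores (name) VALUES ('Gucci'), ('Louis Vuitton'), ('Fendi');
-- the first two sales rows from the table above
INSERT INTO sales (user_id, store_id, amount, entered_at)
VALUES (1, 1, 10000.0, '2015-11-07'),
       (1, 1, 30000.0, '2015-11-29');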

Using Window Functions with Aggregate Functions

Let’s say we wanted to aggregate the sales for each store. We could do this easily using a GROUP BY clause.

SELECT store_id, sum(amount) as amt FROM sales GROUP BY sales.store_id

which yields the following rows:

+------------+----------+
|   store_id | amt      |
|------------+----------|
|          1 | 110000.0 |
|          2 | 51400.0  |
|          3 | 70000.0  |
+------------+----------+

But what if we wanted to return each row from the sales table individually, along with the corresponding store’s total sales amount? We can do this using a window function.

SELECT store_id, sum(amount) OVER (PARTITION BY store_id) FROM sales

which results in:

+------------+----------+
|   store_id | sum      |
|------------+----------|
|          1 | 110000.0 |
|          1 | 110000.0 |
|          1 | 110000.0 |
|          1 | 110000.0 |
|          2 | 51400.0  |
|          2 | 51400.0  |
|          2 | 51400.0  |
|          2 | 51400.0  |
|          3 | 70000.0  |
|          3 | 70000.0  |
|          3 | 70000.0  |
|          3 | 70000.0  |
+------------+----------+

Let’s break this down a bit. The store_id column is self-explanatory. The sum aggregates the amount from every row that has the same store_id as the current row. The OVER keyword is what causes the aggregate function to be treated as a window function, computing the aggregate across the rows in the window without grouping them into one row.
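
If window functions are new to you, it may help to see the same result written without one. The sketch below joins each sale back onto the grouped totals from the earlier GROUP BY query; the window version above produces the same rows in a single, more readable statement:

SELECT s.store_id, t.sum
FROM sales s
JOIN (SELECT store_id, sum(amount) AS sum FROM sales GROUP BY store_id) t
  ON t.store_id = s.store_id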

But how do we define the window? How does Postgres know which subset of rows we want to aggregate? The answer lies in the PARTITION keyword. The partition is what divides the table rows into distinct window frames. In this case we partition the table by the store_id.
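
A quick aside: the PARTITION BY clause is optional. If we leave it out, the window is the entire result set, so every row sees the grand total across all stores:

SELECT store_id, amount,
       sum(amount) OVER () AS total_sales  -- one window covering every row
FROM sales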

Note that any aggregate function can be used as a window function. We could have just as easily written

SELECT store_id, COUNT(*) OVER (PARTITION BY store_id) FROM sales
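
We can also put several window expressions in the same SELECT, even over the same partition. A quick sketch mixing a per-store total, count, and average:

SELECT store_id, amount,
       sum(amount) OVER (PARTITION BY store_id) AS store_total,
       count(*)    OVER (PARTITION BY store_id) AS store_sale_count,
       avg(amount) OVER (PARTITION BY store_id) AS store_avg
FROM sales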

Using the Rank Function

Postgres provides a number of window-specific functions, which are listed in the official documentation. One of the most common use cases for window functions is to rank a window of rows using the rank function.

Let’s say we wanted to rank each sales record by the amount entered, within a given store_id. We can do this easily with the rank function:

SELECT store_id, amount, rank() OVER (PARTITION BY store_id ORDER BY amount DESC) AS amount_rank FROM sales

The resulting table would look like this:


+------------+----------+---------------+
|   store_id | amount   |   amount_rank |
|------------+----------+---------------|
|          1 | 40000.0  |             1 |
|          1 | 30000.0  |             2 |
|          1 | 30000.0  |             2 |
|          1 | 10000.0  |             4 |
|          2 | 19000.0  |             1 |
|          2 | 11400.0  |             2 |
|          2 | 11000.0  |             3 |
|          2 | 10000.0  |             4 |
|          3 | 21000.0  |             1 |
|          3 | 21000.0  |             1 |
|          3 | 14000.0  |             3 |
|          3 | 14000.0  |             3 |
+------------+----------+---------------+

You may notice that if two or more rows have the same amount, they receive the same rank. The rank function gives tied rows the same number, but it then leaves a gap in the sequence after them. Take a look at the rows with a store_id of 1: there are two rows with a rank of 2, so the next rank is not 3 but 4. The dense_rank function does not leave these gaps; with dense_rank, the row ranked 4 above would instead be ranked 3.
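
To see the difference side by side, we can compute both functions over the same window:

SELECT store_id, amount,
       rank()       OVER (PARTITION BY store_id ORDER BY amount DESC) AS amount_rank,
       dense_rank() OVER (PARTITION BY store_id ORDER BY amount DESC) AS amount_dense_rank
FROM sales

For store_id 1, the 10000.0 row comes back with an amount_rank of 4 but an amount_dense_rank of 3.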

These have been some pretty simple examples of window functions. Let’s move on to something a little more interesting.

Complex Window Partitions

Let’s say that in our application we want to be able to get the most recent sales data entered for a given month/year combination and a given store. This problem can be solved handily using window functions.

Let’s start by crafting a query to give us the store_id, the month of the entered date, the year of the entered_date, and the amount of the sale.

SELECT store_id, EXTRACT(month from entered_at) as mm, EXTRACT(year from entered_at) as yyyy, amount FROM sales

Which yields:

+------------+------+--------+----------+
|   store_id |   mm |   yyyy | amount   |
|------------+------+--------+----------|
|          1 |   11 |   2015 | 10000.0  |
|          1 |   11 |   2015 | 30000.0  |
|          2 |   11 |   2015 | 11000.0  |
|          2 |   11 |   2015 | 19000.0  |
|          3 |   11 |   2015 | 14000.0  |
|          3 |   11 |   2015 | 14000.0  |
|          1 |   12 |   2015 | 30000.0  |
|          1 |   12 |   2015 | 40000.0  |
|          2 |   12 |   2015 | 10000.0  |
|          2 |   12 |   2015 | 11400.0  |
|          3 |   12 |   2015 | 21000.0  |
|          3 |   12 |   2015 | 21000.0  |
+------------+------+--------+----------+

Note that we are using the EXTRACT function which allows us to retrieve subfields such as year, hour, day, month from date/time values.
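
As a quick standalone example:

SELECT EXTRACT(month FROM DATE '2015-11-07') AS mm,   -- 11
       EXTRACT(year  FROM DATE '2015-11-07') AS yyyy  -- 2015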

The next thing we want to do is to partition the table. The original intent of our query was to grab the most recent sales data entered for a given month/year combination and a given store. So our partition should be defined as follows “for each row, get me the rows with the same month/year in the entered date, and the same store_id”. In other words, we need to partition by the extracted month, year, and store_id.

So how do we get window frames defined by several expressions at once? Just like GROUP BY, PARTITION BY accepts a comma-separated list of expressions, so we could partition by the extracted month, the extracted year, and the store_id directly (there is an example of that form further down). Another approach, used here, is the string function concat: build a single partition key by concatenating the extracted month, the extracted year, and the store_id.

SELECT store_id,
       EXTRACT(month FROM entered_at) AS mm,
       EXTRACT(year FROM entered_at) AS yyyy,
       amount,
       rank() OVER (PARTITION BY concat(EXTRACT(month FROM entered_at), EXTRACT(year FROM entered_at), store_id)
                    ORDER BY entered_at DESC) AS entered_at_rank
FROM sales
ORDER BY store_id

This effectively partitions the rows by the month of the entered date, the year of the entered date, and the store id. It gives us exactly the window frames we want!

+------------+------+--------+----------+-------------------+
|   store_id |   mm |   yyyy | amount   |   entered_at_rank |
|------------+------+--------+----------+-------------------|
|          1 |   11 |   2015 | 30000.0  |                 1 |
|          1 |   11 |   2015 | 10000.0  |                 2 |
|          1 |   12 |   2015 | 40000.0  |                 1 |
|          1 |   12 |   2015 | 30000.0  |                 2 |
|          2 |   11 |   2015 | 11000.0  |                 2 |
|          2 |   12 |   2015 | 11400.0  |                 1 |
|          2 |   12 |   2015 | 10000.0  |                 2 |
|          2 |   11 |   2015 | 19000.0  |                 1 |
|          3 |   11 |   2015 | 14000.0  |                 1 |
|          3 |   11 |   2015 | 14000.0  |                 2 |
|          3 |   12 |   2015 | 21000.0  |                 1 |
|          3 |   12 |   2015 | 21000.0  |                 1 |
+------------+------+--------+----------+-------------------+
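
For comparison, here is the same ranking written with the comma-separated form of PARTITION BY mentioned above. It produces the same window frames without building a string key:

SELECT store_id,
       EXTRACT(month FROM entered_at) AS mm,
       EXTRACT(year FROM entered_at) AS yyyy,
       amount,
       rank() OVER (PARTITION BY EXTRACT(month FROM entered_at),
                                 EXTRACT(year FROM entered_at),
                                 store_id
                    ORDER BY entered_at DESC) AS entered_at_rank
FROM sales
ORDER BY store_id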

The next step is to apply a WHERE clause that selects only the sales with the highest rank. Since a window function (or its alias) cannot be referenced in a WHERE clause, we have to use a subquery in our FROM clause:

SELECT store_id, mm, yyyy, amount, entered_at_rank
FROM (SELECT store_id,
             EXTRACT(month FROM entered_at) AS mm,
             EXTRACT(year FROM entered_at) AS yyyy,
             amount,
             rank() OVER (PARTITION BY concat(EXTRACT(month FROM entered_at), EXTRACT(year FROM entered_at), store_id)
                          ORDER BY entered_at DESC) AS entered_at_rank
      FROM sales
      ORDER BY entered_at_rank) AS t
ORDER BY store_id

This results in exactly the same table as above, but now we can use a WHERE clause on entered_at_rank like this:

SELECT store_id, mm, yyyy, amount, entered_at_rank
FROM (SELECT store_id,
             EXTRACT(month FROM entered_at) AS mm,
             EXTRACT(year FROM entered_at) AS yyyy,
             amount,
             rank() OVER (PARTITION BY concat(EXTRACT(month FROM entered_at), EXTRACT(year FROM entered_at), store_id)
                          ORDER BY entered_at DESC) AS entered_at_rank
      FROM sales
      ORDER BY entered_at_rank) AS t
WHERE entered_at_rank = 1
ORDER BY store_id

Which results in

+------------+------+--------+----------+-------------------+
|   store_id |   mm |   yyyy | amount   |   entered_at_rank |
|------------+------+--------+----------+-------------------|
|          1 |   12 |   2015 | 40000.0  |                 1 |
|          1 |   11 |   2015 | 30000.0  |                 1 |
|          2 |   12 |   2015 | 11400.0  |                 1 |
|          2 |   11 |   2015 | 19000.0  |                 1 |
|          3 |   12 |   2015 | 21000.0  |                 1 |
|          3 |   11 |   2015 | 14000.0  |                 1 |
|          3 |   12 |   2015 | 21000.0  |                 1 |
+------------+------+--------+----------+-------------------+

There is a problem though! The store with an id of 3 has two December sales that share the same rank, since they were entered on the same date (and our rank orders by entered_at). What to do?

Row Number to the Rescue

One thing we can do is use the row_number() window function, which gives the number of the current row within its partition. Unlike rank, it never assigns the same number twice, so ties on entered_at are broken and each partition has exactly one row numbered 1.

SELECT store_id, mm, yyyy, amount, entered_at_row
FROM (SELECT store_id,
             EXTRACT(month FROM entered_at) AS mm,
             EXTRACT(year FROM entered_at) AS yyyy,
             amount,
             row_number() OVER (PARTITION BY concat(EXTRACT(month FROM entered_at), EXTRACT(year FROM entered_at), store_id)
                                ORDER BY entered_at DESC) AS entered_at_row
      FROM sales
      ORDER BY entered_at_row) AS t
WHERE entered_at_row = 1
ORDER BY store_id

Which yields:

+------------+------+--------+----------+------------------+
|   store_id |   mm |   yyyy | amount   |   entered_at_row |
|------------+------+--------+----------+------------------|
|          1 |   11 |   2015 | 30000.0  |                1 |
|          1 |   12 |   2015 | 40000.0  |                1 |
|          2 |   11 |   2015 | 19000.0  |                1 |
|          2 |   12 |   2015 | 11400.0  |                1 |
|          3 |   11 |   2015 | 14000.0  |                1 |
|          3 |   12 |   2015 | 21000.0  |                1 |
+------------+------+--------+----------+------------------+

Now we have exactly one sale for each store in a given month/year!
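
One caveat: when two sales in the same partition share the same entered_at, as store 3 does in December, row_number picks one of them arbitrarily. Adding a secondary sort key such as id to the window's ORDER BY makes the choice deterministic:

SELECT store_id, amount, entered_at,
       row_number() OVER (PARTITION BY concat(EXTRACT(month FROM entered_at),
                                              EXTRACT(year FROM entered_at),
                                              store_id)
                          ORDER BY entered_at DESC, id DESC) AS entered_at_row
FROM sales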
