ᚱᛗ
© 2022
Powered by Hugo

Window Functions

Table of Contents

Definition

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities and they all are displayed in the query’s output.

Example: Sales Orders Table

Let’s analyze (artificial) sales orders data (Table 1). The table has the following attributes:

  • order_id: the id of the sales order
  • sales_value_thsd: the total amount of the sales order, in thousands of dollars
  • point_of_sale: represents the store where the sale took place
  • date: date of the transaction, YYYY-MM-DD

Table0. Sales Orders Table

order_id sales_value_thsd point_of_sale date
1 6080.25 1 2019-11-29
2 8175.90 2 2019-11-29
3 8175.00 2 2020-05-25
4 2199.00 5 2020-06-29
5 3970.100 5 2020-07-29
6 3299.33 3 2020-10-29
7 2088.75 1 2020-11-29
8 5299.10 1 2021-01-29
9 1199.00 5 2022-12-22

Query1: Running Sum/Total Across All Orders by Date (No Partitions)

--running sum across all orders by date
SELECT date, sales_value_thsd, SUM(sales_value_thsd) OVER
	(ORDER BY date ASC ROWS UNBOUNDED PRECEDING) as running_sum
    -- UNBOUNDED PRECEDING is the default
FROM sales_orders;

Table1. Running Sum/Total without Partitions

date sales_value_thsd running_sum
2019-11-29 6080.25 6080.25
2019-11-29 8175.90 14256.15
2020-05-25 8175.00 22431.15
2020-06-29 2199.00 24630.15
2020-07-29 3970.100 28600.250
2020-10-29 3299.33 31899.580
2020-11-29 2088.75 33988.330
2021-01-29 5299.10 39287.430
2022-12-22 1199.00 40486.430

Query2: Ranking Largest Orders Partitioning by point_of_sale

--ranking largest orders by value_thsd partitioning by point_of_sale
SELECT point_of_sale, order_id, sales_value_thsd, RANK() OVER
	(PARTITION BY point_of_sale ORDER BY sales_value_thsd DESC)
FROM sales_orders;

Table2. Value Rank by Partition (descending)

point_of_sale order_id sales_value_thsd rank
1 1 6080.25 1
1 8 5299.10 2
1 7 2088.75 3
2 2 8175.90 1
2 3 8175.00 2
3 6 3299.33 1
5 5 3970.100 1
5 4 2199.00 2
5 9 1199.00 3

As shown here, the RANK function produces a numerical rank for each distinct ORDER BY value in the current row’s partition, using the order defined by the ORDER BY clause. rank needs no explicit parameter, because its behavior is entirely determined by the OVER clause.

The rows considered by a window function are those of the “virtual table” produced by the query’s FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways using different OVER clauses, but they all act on the same collection of rows defined by this virtual table.

Query3: Average Sales Partitioning by point_of_sale

--average sales_value_thsd by point of sale
SELECT point_of_sale, sales_value_thsd, round(avg_sales_thsd, 2) AS pos_avg
FROM (
    SELECT point_of_sale, sales_value_thsd, avg(sales_value_thsd)  OVER
        (partition by point_of_sale) as avg_sales_thsd
    FROM sales_orders
) as subq;

Table3. Average Sales by Partition

point_of_sale sales_value_thsd pos_avg
1 2088.75 4489.37
1 5299.10 4489.37
1 6080.25 4489.37
2 8175.90 8175.45
2 8175.00 8175.45
3 3299.33 3299.33
5 1199.00 2456.03
5 2199.00 2456.03
5 3970.100 2456.03

Query4: Average Sales and RunningSum Partitioning by point_of_sale

--average and running sum of sales by point_of_sale
SELECT
	point_of_sale,
	date,
	order_id,
	sales_value_thsd,
	round(running_sum, 2) AS running_sum,
	round(running_avg, 2) AS running_avg
FROM (
	SELECT
		point_of_sale,
		date,
		order_id,
		sales_value_thsd,
		sum(sales_value_thsd) OVER w AS running_sum,
		avg(sales_value_thsd) OVER w AS running_avg
	FROM sales_orders
	WINDOW w AS (PARTITION BY point_of_sale	ORDER BY date ASC)
) AS subq;

Table4. Two Window Functions (sum and avg) Over the Same Window

point_of_sale date order_id sales_value_thsd running_sum running_avg
1 2019-11-29 1 6080.25 6080.25 6080.25
1 2020-11-29 7 2088.75 8169.00 4084.50
1 2021-01-29 8 5299.10 13468.10 4489.37
2 2019-11-29 2 8175.90 8175.90 8175.90
2 2020-05-25 3 8175.00 16350.90 8175.45
3 2020-10-29 6 3299.33 3299.33 3299.33
5 2020-06-29 4 2199.00 2199.00 2199.00
5 2020-07-29 5 3970.100 6169.10 3084.55
5 2022-12-22 9 1199.00 7368.10 2456.03

Window Function Terminology

A window function call always contains an OVER clause directly following the window function’s name and argument(s). This is what syntactically distinguishes it from a normal function or non-window aggregate. The OVER clause determines exactly how the rows of the query are split up for processing by the window function.

The PARTITION BY clause within OVER divides the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.

You can also control the order in which rows are processed by window functions using ORDER BY within OVER (the window ORDER BY does not even have to match the order in which the rows are output).

Window functions are permitted only in the SELECT list and the ORDER BY clause of the query. They are forbidden elsewhere, such as in GROUP BY, HAVING and WHERE clauses. This is because they logically execute after the processing of those clauses. Also, window functions execute after non-window aggregate functions. This means it is valid to include an aggregate function call in the arguments of a window function, but not vice versa. If there is a need to filter or group rows after the window calculations are performed, you can use a sub-select (see Query3, which uses a sub-select to execute a window function and an outer select to round it).

The frame_clause specifies the set of rows constituting the window frame, which is a subset of the current partition, for those window functions that act on the frame instead of the whole partition. The set of rows in the frame can vary depending on which row is the current row. The frame can be specified in RANGE, ROWS or GROUPS mode; in each case, it runs from the frame_start to the frame_end. If frame_end is omitted, the end defaults to CURRENT ROW.

A frame_start of UNBOUNDED PRECEDING means that the frame starts with the first row of the partition, and similarly a frame_end of UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition. The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Where BETWEEN is not specified, the frame is implicitly bounded by the current row.

With ORDER BY, this sets the frame to be all rows from the partition start up through the current row’s last ORDER BY peer. Without ORDER BY, this means all rows of the partition are included in the window frame, since all rows become peers of the current row.

Window Function Processing

When multiple window functions are used, all the window functions having syntactically equivalent PARTITION BY and ORDER BY clauses in their window definitions are guaranteed to be evaluated in a single pass over the data. Therefore they will see the same sort ordering, even if the ORDER BY does not uniquely determine an ordering. However, no guarantees are made about the evaluation of functions having different PARTITION BY or ORDER BY specifications. (In such cases a sort step is typically required between the passes of window function evaluations, and the sort is not guaranteed to preserve ordering of rows that its ORDER BY sees as equivalent.)

Currently, window functions always require presorted data, and so the query output will be ordered according to one or another of the window functions' PARTITION BY/ORDER BY clauses. It is not recommended to rely on this, however. Use an explicit top-level ORDER BY clause if you want to be sure the results are sorted in a particular way.

Table of General-Purpose Window Functions

function description
row_number () → bigint Returns the number of the current row within its partition, counting from 1.
rank () → bigint Returns the rank of the current row, with gaps; that is, the row_number of the first row in its peer group.
dense_rank () → bigint Returns the rank of the current row, without gaps; this function effectively counts peer groups.
percent_rank () → double precision Returns the relative rank of the current row, that is (rank - 1) / (total partition rows - 1). The value thus ranges from 0 to 1 inclusive.
cume_dist () → double precision Returns the cumulative distribution, that is (number of partition rows preceding or peers with current row) / (total partition rows). The value thus ranges from 1/N to 1.
ntile ( num_buckets integer ) → integer Returns an integer ranging from 1 to the argument value, dividing the partition as equally as possible.
lag ( value anycompatible [, offset integer [, default anycompatible ]] ) → anycompatible Returns value evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead returns default (which must be of a type compatible with value). Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to NULL.
lead ( value anycompatible [, offset integer [, default anycompatible ]] ) → anycompatible Returns value evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead returns default (which must be of a type compatible with value). Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to NULL.
first_value ( value anyelement ) → anyelement Returns value evaluated at the row that is the first row of the window frame.
last_value ( value anyelement ) → anyelement Returns value evaluated at the row that is the last row of the window frame.
nth_value ( value anyelement, n integer ) → anyelement Returns value evaluated at the row that is the n’th row of the window frame (counting from 1); returns NULL if there is no such row.

Bibliography