May 11, 2016

Subsets Through the Window

Many of you are no doubt familiar with SQL window functions. These functions provide capabilities similar to those provided by the GROUP BY clause in a SQL statement, but instead return result sets that preserve each row, rather than collapsing them into a single summary row. Window functions have been part of the SQL Standard since SQL:2003. This page of the PostgreSQL manual describes the rich set of these functions that PostgreSQL supports. Many relational database systems support window functions. I recently came up with what was to me, at least, a novel approach to generating a subset of rows from a database table by using the row_number() function.

The idea of using a subset of data is a powerful one in the realm of Test Data Management. One of the techniques described very well in the blog post “5 Best Practices for Test Data Management” (as well as in other sources) is to extract a subset of production data to build a testing database. There are many ways that such extracts could be created, but the row_number() window function provides a quick and easy one for accomplishing this task.

I’ll refer to the well-known “Adventure Works” database, which I’ve ported to PostgreSQL. The version of “Adventure Works” I used is an older iteration of the database and was originally migrated to MySQL. I brought it to our database platform of choice, PostgreSQL.

For this example, I’ll be working with the DIM_CUSTOMER table. Here is the structure of that table.


The requirement we have is to grab a subset of this data (18,484 total rows), in a way that provides data from each geographic group (“GeographyKey”) in the original table. We want our subset to include data from each GeographyKey to more closely resemble production. We’d like to have no more than five rows from each group, though.

Here is the query I came up with to achieve this objective:

select * from (
  select *, row_number()
  over (partition by "GeographyKey”)
  from dim_customer) as records
where row_number <= 5

As you can see from the output snippet below, this query returns exactly what we are looking for. For each GeographyKey, there are no more than five rows returned, which is enabled by using the row_number() function and then limiting on it in the outer query.


We can get insight into how PostgreSQL executes this query by looking at the query plan:


In order to return the partitioning by GeographyKey we requested, PostgreSQL performed a full scan of the DIM_CUSTOMER table, then sorted the data by GeographyKey; it then aggregated that sorted data using the window function row_number(), and then finally limited the result in each “bucket” to five rows.

Here’s the query plan (using explain analyze to actually execute the query and get true running times):

                            QUERY PLAN
Subquery Scan on records (cost=4312.80..4867.32 rows=6161
width=254) (actual time=60.658..77.614 rows=1496 loops=1)
 Filter: (records.row_number <= 5)
 Rows Removed by Filter: 16988
 -> WindowAgg (cost=4312.80..4636.27 rows=18484 width=246)
 (actual time=60.649..74.576 rows=18484 loops=1)
    -> Sort (cost=4312.80..4359.01 rows=18484 width=246) 
    (actual time=60.628..64.211 rows=18484 loops=1)
       Sort Key: dim_customer."GeographyKey"
       Sort Method: external merge Disk: 4808kB
       -> Seq Scan on dim_customer (cost=0.00..853.84 rows=18484
       width=246) (actual time=0.008..11.447 rows=18484 loops=1)
Planning time: 0.108 ms
Execution time: 103.503 ms

So, in my test database, this query executes in about 103 ms.

It’s often extremely useful to add an index on the “partition by” column, in this case GeographyKey. After creating the index on “GeographyKey”, the query plan now looks like this:


Now the query plan is a bit more efficient, because the full-table scan of table DIM_CUSTOMER is gone. Instead, PostgreSQL performed an index scan, followed by aggregating the data for the row_number() function, and finally limited the result set in each GeographyKey partition to only five rows. Note that not only is the full-table scan gone, the sort operation that was required before we added the index IX_GEOGRAPHY_KEY is no longer needed.

The corresponding explain analyze put shows a marked performance improvement:

                            QUERY PLAN
Subquery Scan on records (cost=0.29..3673.80 rows=6161 width=254)
(actual time=0.047..36.024 rows=1496 loops=1)
 Filter: (records.row_number <= 5)
 Rows Removed by Filter: 16988
 -> WindowAgg (cost=0.29..3442.75 rows=18484 width=246) (actual
 time=0.043..32.793 rows=18484 loops=1)
    -> Index Scan using ix_geography_key on dim_customer 
    (cost=0.29..3165.49 rows=18484 width=246) (actual 
    time=0.037..14.337 rows=18484 loops=1)
Planning time: 0.309 ms
Execution time: 36.150 ms

Instead of an execution of 103 ms, the query now executes in a total of about 36 ms.

Perhaps you’d like to have a random sample of rows within each geographic region, still limited to five rows. You can achieve this by using the order by random() construct in the inner query, like so:

select * from (
  select *, row_number()
  over (partition by "GeographyKey" order by "GeographyKey", random())
  from dim_customer) as records
where row_number <= 5

Here are snippets of the output from two different runs of this query, showing that the result set differs each time.

First run: image

Second run: image

If you are using PostgreSQL 9.5+, you have the ability to work with a sample of the table before applying your window function. The TABLESAMPLE qualifier for an SQL FROM clause is part of the SQL Standard, as of SQL:2003. PostgreSQL has implemented a version of it, starting with 9.5.0. An excellent discussion is available on the PostgreSQL Wiki page TABLESAMPLE Implementation.

To apply our windowing criterion over a random 5% sample of the data, the query looks like this:

select * from (
  select *, row_number()
  over (partition by "GeographyKey")
  from dim_customer tablesample system(5)) as records
where row_number <= 5

This query will generate the sample of data (using a “sample scan”) before doing any other work, meaning that it will be very fast.

Subquery Scan on records (cost=186.76..214.48 rows=308 width=254)
(actual time=1.366..2.179 rows=715 loops=1)
 Filter: (records.row_number <= 5)
 Rows Removed by Filter: 199
 -> WindowAgg (cost=186.76..202.93 rows=924 width=246) (actual 
 time=1.360..1.959 rows=914 loops=1)
    -> Sort (cost=186.76..189.07 rows=924 width=246) (actual 
    time=1.342..1.390 rows=914 loops=1)
       Sort Key: dim_customer."GeographyKey"
       Sort Method: quicksort Memory: 464kB
       -> Sample Scan on dim_customer (cost=0.00..141.24 
       rows=924 width=246) (actual time=0.009..0.575 rows=914
       Sampling: system ('5'::real)
Planning time: 0.658 ms
Execution time: 2.302 ms

Because we worked only with a subset of the data in the original table, this version is by far the fastest of the queries that we’ve looked at in this article. Note that, in this case, the index we created earlier doesn’t help us, because the number of rows in the subset is so small (only 924).

Window functions are a very powerful way to manipulate data when you want to perform some kind of summary calculations, but wish to retain details of each row. We’ve shown here that they can also help build a realistic testing environment. One thing to keep in mind is that the row_number() function is non-deterministic, meaning that you are not guaranteed always to receive the same result set from the query. For our purposes, this was not a problem.

Glenn Street

Data Architect