May 31, 2016

Tips & Tricks: Most Common Values in a Table Column (PostgreSQL)

Recently I was looking for an easy way to get a sense about the values stored in each column of a PostgreSQL table, for some early exploratory analysis. I wanted a way to get this information quickly without having to scan the table, because this particular table had millions of rows. From having worked with other relational databases, I knew that statistics about tables and rows are typically stored in the data dictionary. All modern relational database systems have cost-based query optimizers and basic statistical information is the lifeblood of a cost-based optimizer.

I discovered the pg_stats view for PostgreSQL.

With a little experimentation, I came up with the following query, to pull just the columns that I was interested in, and for just the table I wanted to explore:

select tablename as "Table", attname as "Column", most_common_vals as "Most Common Values"
from pg_stats
where schemaname = 'public' and tablename = 'call_center'
order by tablename, attname;

This query returns the following results (truncated):

[screenshot of query results]
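The pg_stats view exposes more than just the most common values. For instance, most_common_freqs holds the observed frequency of each of those values, and null_frac gives the fraction of NULLs in the column. A variation on the query above (against the same call_center table) might look like this:

```sql
-- Pair each common value with its estimated frequency,
-- and show the fraction of NULLs per column.
select attname           as "Column",
       most_common_vals  as "Most Common Values",
       most_common_freqs as "Frequencies",
       null_frac         as "Null Fraction"
from pg_stats
where schemaname = 'public' and tablename = 'call_center'
order by attname;
```

Each entry in most_common_freqs is the estimated fraction of rows holding the corresponding entry in most_common_vals.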

There are some things to keep in mind before using pg_stats to view this kind of information.

  • The statistics are populated by the ANALYZE command. If ANALYZE has never been run on a particular table, there won’t be any statistics to view. And if it hasn’t been run in some time, the statistics will be out of date.
  • Don’t count on this information being completely accurate. For example, with a large table, ANALYZE uses a random sample of rows in the table to generate the statistics.
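If you’re not sure whether (or when) a table’s statistics were last gathered, the pg_stat_user_tables view records timestamps for both manual and autovacuum-initiated ANALYZE runs. A quick check (again using call_center as the example table) could look like:

```sql
-- When were statistics last gathered for this table?
select relname, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname = 'call_center';
```

If both timestamps are NULL, ANALYZE has never been run and pg_stats will have nothing to show for the table.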

There are some things to keep in mind when using ANALYZE manually, which you may be tempted to do to get “up-to-date” statistics.

  • The autovacuum daemon runs ANALYZE periodically to keep such statistical information reasonably current (assuming that your DBA hasn’t disabled the daemon), so you may have no need to run ANALYZE manually.
  • If the autovacuum daemon has been disabled, a good rule of thumb is to run ANALYZE on your tables once a day. However, if a given table has seen a lot of CRUD activity, you may choose to run ANALYZE more often.
  • You can run ANALYZE manually; it requires only a read lock on the table in question. It can still have some negative impact on overall system performance, though, so if you can, run it during off-peak times.
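Running ANALYZE on a single table is straightforward. If the sampled statistics turn out too coarse for a particular column, you can also raise that column’s statistics target before re-analyzing; the column name below (status) is just a placeholder:

```sql
-- Analyze just one table rather than the whole database.
analyze call_center;

-- Optionally collect more detailed statistics for one column
-- (the column name here is hypothetical), then re-analyze
-- so the new target takes effect.
alter table call_center alter column status set statistics 500;
analyze call_center;
```

The default statistics target is 100; raising it yields more most-common values and finer histograms, at the cost of a longer ANALYZE.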

There’s a thorough discussion of ANALYZE in the always-excellent PostgreSQL manual.

While I was using the pg_stats view to explore the characteristics of a new database, you can no doubt envision other situations where the information in this view would be useful. For example, if you wanted to derive a realistic subset (see my previous post “Subsets Through the Window”), information about distinct values is extremely valuable in deriving subset criteria. Other information from the pg_stats view, like average column width, can help with capacity-planning exercises. No doubt you can think of even more use cases.
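As a sketch of that capacity-planning angle, the n_distinct and avg_width columns of pg_stats can be pulled with a query much like the original one:

```sql
-- n_distinct > 0 is an estimated count of distinct values;
-- n_distinct < 0 is the negated fraction of rows that are distinct.
-- avg_width is the average stored width of the column, in bytes.
select attname    as "Column",
       n_distinct as "Distinct Values",
       avg_width  as "Avg Width (bytes)"
from pg_stats
where schemaname = 'public' and tablename = 'call_center'
order by attname;
```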

Glenn Street

Data Architect