Recently I was looking for an easy way to get a sense about the values stored in each column of a PostgreSQL table, for some early exploratory analysis. I wanted a way to get this information quickly without having to scan the table, because this particular table had millions of rows. From having worked with other relational databases, I knew that statistics about tables and rows are typically stored in the data dictionary. All modern relational database systems have cost-based query optimizers and basic statistical information is the lifeblood of a cost-based optimizer.
I discovered the pg_stats view for PostgreSQL.
With a little experimentation, I came up with the following query, to pull just the columns that I was interested in, and for just the table I wanted to explore:
Which returns the following results (results truncated):
There are some things to keep in mind before using pg_stats to view this kind of information.
There are some things to keep in mind when using ANALYZE manually, which you may be tempted to do to get “up-to-date” statistics.
There’s an excellent discussion of ANALYZE in the always-excellent PostgreSQL manual.
While I was using the pg_stats view for exploring some characteristics of a new database, you can no doubt envision other situations for looking into the information contained in this view. For example, if you wanted to derive a realistic subset (see my previous post “Subsets Through the Window, having information about distinct values is extremely valuable in deriving subset criteria. Information from the pg_stats view, like average column width can help with capacity planning exercises. No doubt you can think of even more use cases.