December 23, 2015

Full Text Searching in PostgreSQL--Can it Measure Up?

At CCC, we have many different uses in our applications for full text searching and have had excellent results with both Solr and, more recently, Elasticsearch. However, the operational overhead of deploying sharded Solr or Elasticsearch clusters for internal-only applications sent us looking for a simpler solution.

We have used full text search functionality that is built into a commercial relational database in some of our legacy products, but that was in the days before the Cloud, before our thorough embrace of open source software, and before other technology changes. As we have switched all new development to using PostgreSQL for relational database storage, the question we set out to answer was “Is full text search in PostgreSQL good enough in certain cases?”

PostgreSQL offers a robust solution for full text search. It has support for generating lexemes, stemming, GIN indexes (more about those later), as well as built-in operators for working with and manipulating full text data. GIN stands for “Generalized Inverted Index”. I refer folks interested in these details to the always-excellent PostgreSQL manual pages on full text search.

Some Background

A full text document in PostgreSQL is a string column or combination of columns that has been converted to the PostgreSQL data type tsvector. A tsvector looks like this:

select to_tsvector('english', 'It was the best of times, it was the worst of times')

'best':4 'time':6,12 'worst':10

Note some of the transformations that have happened to the original string:

  • The stop words “It”, “was”, “the”, and “of” have been removed.
  • Words have been normalized into lexemes; thus “times” is represented as “time”.
  • The numbers next to the lexemes are the position in the original string where the corresponding word occurs, excluding spaces.
  • Since “times” occurs twice in the string, its position is noted by a comma-separated list of locations in the string (6,12).

The building block of text search in PostgreSQL is a tsquery that matches a tsvector, using the match operator, @@.

PostgreSQL Configuration

We did our research using AWS EC2, so that we could experiment easily with different host characteristics. Memory is an important factor in the speed of full text searching, as the more data that can fit into memory, the faster the response time for searches.

We chose an r3.4xlarge instance, which has 122GB of RAM and 16 virtual CPUs. We ran PostgreSQL 9.4.5, which was the most recent production release at the time of writing this post.

We made some modifications to the basic postgresql.conf file, as follows:

work_mem = 80MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 50GB

Following the advice of the PostgreSQL manual, we sized our shared_buffers to no more than 40% of the total system memory.

PostgreSQL offers a couple of different ways to support full text search with database objects. We tried both, to see if there were significant differences between the two. These two approaches are:

  1. Simply create a GIN index on any of the string columns, or combination of columns, against which you want to perform full text search. This approach is the simplest and requires the least amount of additional storage, as well as not requiring additional database objects.
  2. Create a new column that contains the tsvector representation of the “document” you want to search. Then, add a GIN index to this column. Note that in order to keep the tsvector column in-sync with the original column when updates and additions occur, you will also require a trigger. This seems like a drawback compared to the fact that a GIN index alone automatically stays in-sync with the data in its underlying column.

Optimization

It would be ideal if we could keep the GIN index in memory for fastest access. By default, there is no mechanism to instruct PostgreSQL to “pin” objects into the buffer cache. However, there is an optional PostgreSQL module that allows you to do this, pg_prewarm.

Once you install this module, adding an object to the buffer cache is as simple as the following:

SELECT pg_prewarm('object_name');

Unlike similar functionality in some databases, the object pushed into the buffer cache is not guaranteed to remain there over time. PostgreSQL’s normal management of the cache may cause the object to “age out”. However, if you are planning to execute many similar queries, it’s much less likely that PostgreSQL will eject the index or table from the cache with the passage of time.

How can we verify that the GIN index is in memory? Use another optional module, pg_buffercache to examine the current contents of the buffer cache. Then to verify that your index is cached, issue this query:

SELECT c.relname AS object, count(*) as buffers
FROM pg_class c
JOIN pg_buffercache b
ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname
ORDER BY 2 DESC;

This query returns the list of relations (e.g., tables and indexes) in the buffer cache, along with the number of buffers stored in the cache. Here is a snippet of the output, which shows that the GIN index we want is stored in memory for faster access.

image

Since the size on disk of the index gin_work_metadata_main_title is about 25GB, we can see that the index entirely fits into the buffer cache (3338241 buffers * 8K blocks ≅ 25GB).

Testing

To test the performance our PostgreSQL full text search configuration, we created a JMeter test plan.

We used 5 threads, to simulate simultaneous users submitting search queries, but added a random pause between requests of from 4.5 to 5.5 seconds (using a Gaussian Random Timer), to introduce a more realistic delay interval (real users aren’t likely to be searching constantly within a tiny interval). In order to add more realism, we parameterized the SQL queries to read search terms from a CSV file, rather than repeating the same search each time. The searches returned a maximum of 10,000 records each time.

Approach 1: Using only a GIN Index

Our first set of queries executed a search against a VARCHAR column, which had a GIN index added to it to support full text searches.

SELECT work_id, main_title, main_title_ts
FROM work_metadata
WHERE to_tsvector('english', main_title) @@ to_tsquery('search_term')
limit 10000;

For each row that satisfies the query criteria, PostgreSQL must convert the VARCHAR column into its tsvector representation.

This table shows the results or our tests:

image

Approach 2: Using a TSVECTOR Column with a GIN Index

To enable our second test, we created a tsvector column and set it to the equal the tsvector representation of the original column (main_title in this case). We then created A GIN index on that new tsvector column.  We then queried the tsvector column.

SELECT work_id, main_title, main_title_ts
FROM work_metadata
WHERE main_title_ts @@ to_tsquery('search_term')
limit 10000;

The results, for the same test configuration are shown in the next table.

image

It’s not surprising that the second test shows results that are slightly faster, on average, than the first test. Having the tsvector column saves PostgreSQL from converting a string to tsvector for each row that satisfies the query criteria.

We plotted the % of peak density (using a probability density function) for both approaches. As the chart below shows, adding the tsvector column with a GIN index not only makes the response faster, but also makes it more consistent - the red line is underneath the blue line on both ends of the distribution.

image

The question thus becomes whether the tradeoff of additional storage and database objects, including a trigger to keep the primary column(s) in-sync with the tsvector column(s) is worth the performance improvement, as compared to simply querying a string column that has a GIN index created on it. The answer, of course, will depend on your particular use case and the expectations of application users.

Conclusion

Obviously there are limits to PostgreSQL’s full text searching which make it unlikely that you would want to use it for large-scale externally-facing search applications. However, for an internal-only application, PostgreSQL’s full text search is more than up to the job.

Glenn Street

Data Architect