At CCC, we have many different uses in our applications for full text searching and have had excellent results with both Solr and, more recently, Elasticsearch. However, the operational overhead of deploying sharded Solr or Elasticsearch clusters for internal-only applications sent us looking for a simpler solution.
We have used full text search functionality that is built into a commercial relational database in some of our legacy products, but that was in the days before the Cloud, before our thorough embrace of open source software, and before other technology changes. As we have switched all new development to using PostgreSQL for relational database storage, the question we set out to answer was “Is full text search in PostgreSQL good enough in certain cases?”
PostgreSQL offers a robust solution for full text search. It has support for generating lexemes, stemming, GIN indexes (more about those later), as well as built-in operators for working with and manipulating full text data. GIN stands for “Generalized Inverted Index”. I refer folks interested in these details to the always-excellent PostgreSQL manual pages on full text search.
A full text document in PostgreSQL is a string column or combination of columns that has been converted to the PostgreSQL data type tsvector. A tsvector looks like this:
Note some of the transformations that have happened to the original string:
The building block of text search in PostgreSQL is a tsquery that matches a tsvector, using the match operator, @@.
We did our research using AWS EC2, so that we could experiment easily with different host characteristics. Memory is an important factor in the speed of full text searching, as the more data that can fit into memory, the faster the response time for searches.
We chose an r3.4xlarge instance, which has 122GB of RAM and 16 virtual CPUs. We ran PostgreSQL 9.4.5, which was the most recent production release at the time of writing this post.
We made some modifications to the basic postgresql.conf file, as follows:
Following the advice of the PostgreSQL manual, we sized our shared_buffers to no more than 40% of the total system memory.
PostgreSQL offers a couple of different ways to support full text search with database objects. We tried both, to see if there were significant differences between the two. These two approaches are:
It would be ideal if we could keep the GIN index in memory for fastest access. By default, there is no mechanism to instruct PostgreSQL to “pin” objects into the buffer cache. However, there is an optional PostgreSQL module that allows you to do this, pg_prewarm.
Once you install this module, adding an object to the buffer cache is as simple as the following:
Unlike similar functionality in some databases, the object pushed into the buffer cache is not guaranteed to remain there over time. PostgreSQL’s normal management of the cache may cause the object to “age out”. However, if you are planning to execute many similar queries, it’s much less likely that PostgreSQL will eject the index or table from the cache with the passage of time.
How can we verify that the GIN index is in memory? Use another optional module, pg_buffercache to examine the current contents of the buffer cache. Then to verify that your index is cached, issue this query:
This query returns the list of relations (e.g., tables and indexes) in the buffer cache, along with the number of buffers stored in the cache. Here is a snippet of the output, which shows that the GIN index we want is stored in memory for faster access.
Since the size on disk of the index gin_work_metadata_main_title is about 25GB, we can see that the index entirely fits into the buffer cache (3338241 buffers * 8K blocks ≅ 25GB).
To test the performance our PostgreSQL full text search configuration, we created a JMeter test plan.
We used 5 threads, to simulate simultaneous users submitting search queries, but added a random pause between requests of from 4.5 to 5.5 seconds (using a Gaussian Random Timer), to introduce a more realistic delay interval (real users aren’t likely to be searching constantly within a tiny interval). In order to add more realism, we parameterized the SQL queries to read search terms from a CSV file, rather than repeating the same search each time. The searches returned a maximum of 10,000 records each time.
Approach 1: Using only a GIN Index
Our first set of queries executed a search against a VARCHAR column, which had a GIN index added to it to support full text searches.
For each row that satisfies the query criteria, PostgreSQL must convert the VARCHAR column into its tsvector representation.
This table shows the results or our tests:
Approach 2: Using a TSVECTOR Column with a GIN Index
To enable our second test, we created a tsvector column and set it to the equal the tsvector representation of the original column (main_title in this case). We then created A GIN index on that new tsvector column. We then queried the tsvector column.
The results, for the same test configuration are shown in the next table.
It’s not surprising that the second test shows results that are slightly faster, on average, than the first test. Having the tsvector column saves PostgreSQL from converting a string to tsvector for each row that satisfies the query criteria.
We plotted the % of peak density (using a probability density function) for both approaches. As the chart below shows, adding the tsvector column with a GIN index not only makes the response faster, but also makes it more consistent - the red line is underneath the blue line on both ends of the distribution.
The question thus becomes whether the tradeoff of additional storage and database objects, including a trigger to keep the primary column(s) in-sync with the tsvector column(s) is worth the performance improvement, as compared to simply querying a string column that has a GIN index created on it. The answer, of course, will depend on your particular use case and the expectations of application users.
Obviously there are limits to PostgreSQL’s full text searching which make it unlikely that you would want to use it for large-scale externally-facing search applications. However, for an internal-only application, PostgreSQL’s full text search is more than up to the job.