If you had asked me two years ago to describe the difference between structured and full-text search, you would have received a novice answer. However, when I inherited the technical stewardship of an application whose primary function was to surface publication title and article search results, I had to become a quick study in this arena. My resource bible was (and still is) Manning’s “Taming Text”, which proved invaluable. During this same time frame we were also undertaking a search technology leap of sorts, migrating from Solr to Elasticsearch. The project initiative was bold, with an aggressive time frame. Just as I was about to introduce tagging and exclusion filters within our existing Solr facet implementation, I was immersed in an Elasticsearch ramp-up on aggregations, multi-tenancy, schemaless types, and the rich query DSL. All in all, a proverbial baptism by fire. The end product, RightFind Professional™, delivers on its promise of an integrated workflow solution and a one-stop shop for researchers to find articles within their local holdings and subscriptions or purchase additional content. However, we have to ask ourselves: how relevant and meaningful are the search results we currently deliver to our varied end users across the gamut of industries and academic institutions who utilize RightFind Professional™?
Relevancy is the black box within that big search engine. Matching is relatively easy, and nearly everyone understands the underlying boolean logic. Just skimming the theoretical surface of relevancy isn’t too intimidating either. Relevance can be simply defined “as the numerical output of an algorithm that determines which documents are most textually similar to the query.”^1^ Term Frequency/Inverse Document Frequency (TF/IDF) for single-term queries and the Vector Space Model for multi-term queries are conceptually pretty straightforward. But the practical application and deeper dive can be quite daunting. Naturally, the first question you must ask yourself is when to apply boosting. You have two choices at your disposal: index-time or query-time boosting. The consensus in the search relevancy world is that query-time boosting should be favored over index-time boosting in most situations. Index-time boosts are literally applied as new documents are inserted into the index. If searching across multiple indices, you could also apply boosts at the index level when creating new indices within your cluster. Boosting at index time usually involves applying custom logic or rules. Within our index builder code base, my preference was not to incorporate any such business rules or logic: invariably they would be subject to change, could be rather haphazardly or arbitrarily defined, and any change within this logic would entail a full re-index of 140+ million works. In addition, some boosts must be applied at query time because there simply isn’t enough information at index time to calculate the boost. Borrowing from another Copyright Clearance Center product’s Solr implementation, we actually employed a hybrid index- and query-time approach to boost documents based on the existence of a category rank field.
Articles sourced and loaded from select publishers or aggregators were stamped with a category ranking of “1” within our works metadata database tables, and that ranking was later propagated along during index building. Within our Elasticsearch technology migration we effectively carried over the same hybrid index- and query-time solution. The query-time implementation involved applying a boost on the should clause that matched whether a work document had a category ranking of “1”. Obviously not a comprehensive boosting/relevancy strategy, but it addresses one core business requirement. Time now to evaluate some other query-time solution approaches…
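The query-time half of that hybrid can be sketched with the raw query DSL. Below is a minimal Python sketch; the field names (`title`, `category_rank`) and the boost value are illustrative, not our production mapping:

```python
# Sketch of a query-time boost: a bool query whose should clause rewards,
# but does not require, a category ranking of "1". Field names are illustrative.
def category_ranked_query(user_terms, category_boost=2.0):
    return {
        "query": {
            "bool": {
                # The must clause performs the actual text matching.
                "must": [{"match": {"title": user_terms}}],
                # The should clause only influences scoring: works stamped
                # with category_rank "1" float toward the top of the results.
                "should": [
                    {"term": {"category_rank": {"value": "1",
                                                "boost": category_boost}}}
                ]
            }
        }
    }

q = category_ranked_query("gene therapy")
```

Because the boost lives in the query rather than the index, tuning it is a code change in one place, with no re-index of those 140+ million works.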
In our back-end search implementation we introduced our own Elasticsearch client library wrapper. It’s a jar dependency which insulates our search-index-consuming applications from having to know the intricacies of Elasticsearch and from having to programmatically craft all those complex query builders and aggregations. Within our Elasticsearch client library we expose the boost parameter on all our query type classes, but only the one product team utilizes it (refer to the preceding paragraph and the category ranking match boost). And currently we do not have wrapper support for what is a foundational relevancy building block: the FunctionScoreQueryBuilder class. We have already proven pretty good at two important pieces of the FunctionScoreQuery, the query and applied filter components. Now it’s time to define which function(s) we could use to calculate a score, incorporating score_mode and boost_mode properties to effectively combine the output of these functions. The appropriate function(s) would ultimately be driven by our business use cases and by surveying the various needs within our end-user communities. Maybe we employ a FunctionScoreQuery to boost articles that have been published within the past six months or year? In combination with a decay function, and accounting for some pre-defined origin/time threshold and scale, we could add a lot of value here for end users who tend to favor the latest and greatest. Another compelling use case has recently surfaced in discussions with our business: boosting articles that are already in a particular organization’s digital library (aka holdings) or bundled within a widely used subscription held by a given organization. In essence, boosting by popularity. It’s not your traditional article page hit or view count, but definitely in the same ballpark.
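As a concrete sketch of the recency idea, here is a function_score query with a gauss decay, again expressed as raw query DSL built in Python. The `publication_date` field name, the 180-day scale, and the decay value are assumptions for illustration:

```python
# Sketch of a function_score query favoring recently published articles.
# The gauss decay yields 1.0 at the origin ("now") and falls to `decay`
# (0.5 here) once a document's date is `scale` (~6 months) from the origin.
def recency_boosted_query(user_terms):
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": user_terms}},
                "functions": [{
                    "gauss": {
                        "publication_date": {
                            "origin": "now",   # newest articles score highest
                            "scale": "180d",  # roughly a six-month window
                            "decay": 0.5
                        }
                    }
                }],
                "score_mode": "multiply",  # how multiple functions combine
                "boost_mode": "multiply"   # how the result combines with the query score
            }
        }
    }
```

With boost_mode "multiply", a stale article still needs a strong text match to rank; recency amplifies relevance rather than replacing it.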
However, there is a catch here: if we simply boost the score using a popularity-type metric, we could completely swamp the effect of our full-text article “title” scores. To mitigate this problem, it’s strongly recommended that in the popularity or likes boosting scenario one utilize a logarithm to temper the effect.^2^
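One way to express that tempering is a field_value_factor function with a log1p modifier. The sketch below assumes a hypothetical `holdings_count` field carrying the popularity metric:

```python
import math

# Sketch: a popularity boost tempered by log1p so a widely held article
# cannot swamp the full-text title score. "holdings_count" is hypothetical.
def popularity_boosted_query(user_terms):
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": user_terms}},
                "field_value_factor": {
                    "field": "holdings_count",
                    "modifier": "log1p",  # applies log(1 + value)
                    "factor": 1.0,
                    "missing": 0          # no boost when the field is absent
                },
                "boost_mode": "sum"       # add the tempered boost to the text score
            }
        }
    }

# Why the logarithm helps: a 100x jump in popularity yields only ~3x the factor.
small, large = math.log1p(10), math.log1p(1000)  # ~2.4 vs ~6.9
```

Without the modifier, an article held by a thousand organizations would carry a raw factor a hundred times larger than one held by ten, burying the text score entirely.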
I have already alluded to Term Frequency/Inverse Document Frequency (TF/IDF), as it serves as the default text relevance score in both Solr and Elasticsearch. But reliance on this default can prove problematic and produce result rankings that are way off the mark. I recently came across a new term, signal modeling, which is the data science/analysis behind turning “relevance scores into smarter, domain specific signals that quantifiably measure important criteria to you and your data.”^3^ Signal modeling might encompass analysis which ultimately drives the definition and enablement of domain-specific synonyms and stopwords that shape the terms within your index, or tweaking the default TF/IDF scoring features, or amalgamating fields into larger fields leveraging copyFields. There are a slew of other signal modeling techniques at your disposal. But the prevailing message in my research has been that fields are not just for storage and retrieval; they are also containers for enabling scoring.^4^
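As one small example of such a lever, domain-specific synonyms and stopwords plug into the index’s analysis chain. A sketch of hypothetical index settings follows; the synonym pairs and stopword terms shown are purely illustrative:

```python
# Sketch of analysis settings wiring domain-specific synonym and stopword
# filters into a custom analyzer. All terms here are illustrative stand-ins
# for what a curated, domain-expert-driven list would contain.
domain_analysis_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "domain_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "heart attack, myocardial infarction",
                        "cancer, neoplasm"
                    ]
                },
                "domain_stopwords": {
                    "type": "stop",
                    # terms that carry little discriminating signal here
                    "stopwords": ["article", "study", "journal"]
                }
            },
            "analyzer": {
                "domain_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # order matters: lowercase before stopword/synonym matching
                    "filter": ["lowercase", "domain_stopwords",
                               "domain_synonyms"]
                }
            }
        }
    }
}
```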
On the subject of relevancy best practices and signal modeling, title fields are often portrayed as the example problem child or statistical outlier. TF/IDF is not always your friend here: term frequency does not always matter, and a title’s aboutness often has no correlation to it. A good title search uses phrase matches. Phrase matches are predicated on term positions being enabled on the title field mapping. But the Catch-22 is that term positions are only available when term frequencies are enabled. Possible solutions I came across included writing a custom Lucene Similarity plug-in. Seriously? I don’t like that one. Better yet, define two title fields: one with term frequencies and positions disabled, and the other with both features enabled. A given query could conceivably include a match on the disabled flavor and a match phrase on the field with both features enabled. The same query could apply boosts/weights to both clauses, to in effect tune the influence of term frequencies versus phrase matching.
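A sketch of that two-field approach, using modern mapping syntax (the field and sub-field names are my own; `index_options: "docs"` keeps only term presence, dropping both frequencies and positions):

```python
# Sketch: one stored title indexed two ways. The parent "title" field keeps
# frequencies and positions (phrase-capable); the "no_freqs" sub-field keeps
# only term presence, so repeated terms cannot inflate its score.
title_mapping = {
    "properties": {
        "title": {
            "type": "text",  # full postings: freqs + positions
            "fields": {
                "no_freqs": {
                    "type": "text",
                    "index_options": "docs"  # presence only
                }
            }
        }
    }
}

def tuned_title_query(user_terms, term_weight=1.0, phrase_weight=2.0):
    return {
        "query": {
            "bool": {
                "should": [
                    # frequency-blind term match
                    {"match": {"title.no_freqs": {"query": user_terms,
                                                  "boost": term_weight}}},
                    # phrase match on the position-enabled field
                    {"match_phrase": {"title": {"query": user_terms,
                                                "boost": phrase_weight}}}
                ]
            }
        }
    }
```

Raising `phrase_weight` relative to `term_weight` is then the tuning knob for how much exact phrasing should dominate loose term overlap.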
There are suggested strategies for handling other aspects of TF/IDF scoring, including norms and the IDF ratio itself. This introduces an interesting word entry into the lexicon of relevant search: “Pantheon”. A Pantheon is defined as a “list of topical areas and or subjects in a specific domain, professionally curated by domain experts.”^5^ I think of it as a domain- or industry-specific term dictionary. Compiling and updating these domain-specific lists could itself be problematic, but that’s countered with the argument that there are legitimate sources from which we could pull this information. A perfect example, and one particularly relevant to both medicine and Copyright Clearance Center, is PubMed’s MeSH, the official NLM controlled thesaurus for indexing articles in PubMed.
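On the norms point specifically, the mapping-level knob is straightforward. A sketch for a short, fairly uniform-length field where length normalization adds noise rather than signal (the field name is illustrative; older Elasticsearch releases spell this `"norms": {"enabled": false}`):

```python
# Sketch: disabling norms on a field whose length should not affect scoring.
# Journal names are short and roughly uniform, so length normalization
# would mostly punish legitimately longer names.
journal_mapping = {
    "properties": {
        "journal_name": {
            "type": "text",
            "norms": False  # skip length normalization for this field
        }
    }
}
```

Like term frequencies, norms can only be chosen at mapping time, so these decisions belong in the same index-design conversation as the title-field split above.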
It is an inexact science, and there is neither a silver bullet nor a comprehensive go-to resource. This is why I am very excited to dig into what looks like a very promising book release, Manning’s Taming Search (aka Relevant Search). I downloaded the MEAP first chapter and quickly found comfort in the fact that others in the search space have been either confronted or confounded by what it means to deliver relevant results, and by how not to simply turn a blind eye. A blind eye, or better yet, blind faith in the search engine’s default settings. As I peruse the first chapter of “Taming Search”, I see a consistent theme emerging: a solution will invariably incorporate a mix of technology and human and/or business domain factors. Determination of what constitutes relevancy is an ongoing, collaborative, continuous feedback loop. There is also a dramatic shift from being the search engineer to what the book defines as a “relevance engineer”. There are business rules to account for, and the workflow within the system itself establishes and detects patterns which in turn can be harnessed into relevancy factors and criteria. All of these things contribute to the overall domain-specific relevance model and relevance strategy employed. In many respects, the modern search engine must have a human-like quality in being able to interpret what we are really asking for. This may sound like something out of a science fiction novel, but it is the here and now, and part of the very evolution of machine learning and the information retrieval sciences. All in all, the subject of relevancy forces us to revisit the application’s overall product strategy and re-assess the user community (or communities). There’s a lot of analysis involved here, as our team (business and engineering) must collectively identify the most important pieces of data to focus on for relevancy. And for those data elements identified, how do we inform the search engine about them?
And finally, we need to balance the weights of each piece of data against all the others within the context of the end user’s query. Within this final step, decisions are largely driven by a combination of machine learning and classification. Aspects of these data elements, called features, are incorporated within the very algorithms that drive relevancy decisions and determinations.
For me, there is still much to learn, and abstractions which need to become more concrete and fully understood. However, unlike a few years ago, when I was more or less a search apprentice, I now have a more solid foundation and several iterative search engine implementations under my belt. I look forward, in subsequent posts, to sharing the insights and practical knowledge I have gained within the realm of search relevancy.