The Theory of General Relevancy

  1. Synopsis
  2. Proposed Strategies
  3. Further Reading

I. Synopsis

Corporations across all vertical markets are re-architecting their search offerings to leverage Fusion “AI” capabilities. Initially, searches are most commonly performed using the default request handler, or the eDisMax (Extended DisMax) query parser. While generally effective out of the box, this often does not provide the degree of relevancy ultimately sought or expected. Given this, many companies seek to extend and enhance their implementation to leverage supervised Deep Learning built on Apache Spark, or real-time unsupervised Machine Learning using JavaScript in a ‘master’ query pipeline, for example. In the background, Spark jobs will generally be running, analyzing signal and user data to develop profiles that can be applied to enrich the incoming query and improve relevancy. Where relevancy is concerned, the more data you have to crunch, the better. Machine Learning, at the end of the day, is really about crunching data into a usable, interpretable form.

A Note on expectations: This document proposes a generalized workflow for indexing and querying with relevancy. The actual implementation will vary. Tuning of the stages, jobs, and experiments will be required to find the optimal “sweet spot” offering the highest degree of relevancy.

II. Proposed Strategies

In order to meet the expectations of modern search users, companies will often deploy a multi-tiered Machine Learning (ML)/Natural Language Processing (NLP) strategy that provides optimal relevancy given the historical signal and aggregated meta information available. Some metadata will be extracted at request time from the context of the request (E.g. the “WHERE”); other metadata will be queried from a previously indexed composite of external data sources. Primary intent will be determined by extrapolating the “WHO”, “WHAT”, “WHEN”, and “WHERE” context of any given query for enhanced relevance. The data sources for each of these primary categories should be determined prior to the implementation of any ML jobs. At the conclusion of the first phase of the search upgrade, the goal should be to have built up metadata profiles for both the searchable documents and the users who perform the searches, and then to construct queries using this metadata to optimize relevance.

Metadata ML Index Pipeline Stage using SVM Model

Relevancy metadata will be ingested and aggregated into a composite profile index using, by way of one example, Fusion’s ML Index Pipeline Stage. This stage leverages Apache Spark’s MLlib framework. This would be used primarily for ingesting the various ‘WHO’ data sources. Given that this data will come from varied external sources, document fields and values should be normalized during indexing. Fields requiring any type of transformation should be identified prior to indexing. JavaScript stages would generally be used to perform the normalization on identified fields. Normalization serves two purposes: a) formatting the value of a field (E.g. a date field) into a consistent format; and b) placing it in either a single-value or multi-value field, depending on its determined usage.
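
The document proposes JavaScript stages for this normalization; purely as an illustration of the logic involved, here is a minimal sketch in Scala (the language used for the Spark jobs below), with hypothetical field names:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative normalization of a "WHO" source document before indexing:
// (a) reformat a date value into one canonical format, and
// (b) route a value into a single-value or multi-value field.
// Field names (hire_date, dept_ss, ...) are hypothetical.
def normalize(doc: Map[String, String]): Map[String, Any] = {
  val inFmt  = DateTimeFormatter.ofPattern("MM/dd/yyyy")   // assumed source format
  val outFmt = DateTimeFormatter.ISO_LOCAL_DATE            // normalized target format

  val hireDate = doc.get("hire_date")
    .map(d => LocalDate.parse(d, inFmt).format(outFmt))

  // A semicolon-delimited department string becomes a multi-value field.
  val departments = doc.get("departments")
    .map(_.split(";").map(_.trim).filter(_.nonEmpty).toSeq)
    .getOrElse(Seq.empty)

  Map(
    "hire_date_dt" -> hireDate.getOrElse(""),   // single-value date field
    "dept_ss"      -> departments               // multi-value string field
  )
}
```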

The metadata index pipeline jobs will be written in Scala, the language in which Spark itself is implemented. The ML/Support Vector Machine (SVM) and ML/Random Forest (RF) stages are just two of the possible avenues you may pursue to categorize and tokenize the normalized metadata, resulting in the production of the proposed relevancy composite profile.

These ML processes are normally supervised, leveraging the set of predetermined categories. The result of tokenization is analyzed for contextual relevance, and the classifier is trained with that result. Bodies of content will be distilled into statistically interesting tags and compiled into a composite profile of the given user. This information is then applied at query time.
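
A minimal sketch of how such a supervised categorization job might be assembled with Spark ML Pipelines, assuming a pre-labeled training set and wrapping LinearSVC in OneVsRest to cover the full set of predetermined categories (file paths and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF, StringIndexer}
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

val spark = SparkSession.builder.appName("metadata-svm-classifier").getOrCreate()

// Hypothetical training set: one row per document with its text and predetermined category.
val training = spark.read.json("who_metadata_training.json")   // columns: text, category

val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val tokenizer    = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val tf           = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")
val idf          = new IDF().setInputCol("rawFeatures").setOutputCol("features")

// LinearSVC is binary, so wrap it in OneVsRest to handle the full category set.
val svm = new LinearSVC().setMaxIter(50).setRegParam(0.1)
val ovr = new OneVsRest().setClassifier(svm)

val model = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, tf, idf, ovr))
  .fit(training)

model.write.overwrite().save("models/metadata-svm")   // reused at indexing time
```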

Searchable Data ML Index Pipeline Stage using ALS and Random Forest

Data searchable by end users will also be indexed using the ML Index Stage, applying the Alternating Least Squares (ALS) and Random Forest Fusion jobs. The ALS job creates an index used for offering recommendations. The Random Forest job classifies each document according to the previously mentioned pre-trained classifier. The goal here is to build up an index for providing recommendations as a feature, by surfacing potential relationships between users and documents. Because persistent metadata and searchable data will have been classified with the same classifier (or the same set of classifications), we should see organic relationships emerge between the two sets of data. This will be verifiable by running Fusion experiments, which are discussed later in this paper.
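
As a rough sketch of the recommendations side, an ALS job over aggregated click signals might look like the following; the collection layout and parameters are assumptions rather than the Fusion job’s actual configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

val spark = SparkSession.builder.appName("als-recommendations").getOrCreate()

// Hypothetical aggregated click signals: numeric user and document ids plus a click count.
val interactions = spark.read.parquet("signals_aggr.parquet")   // columns: userId, docId, clickCount

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("docId")
  .setRatingCol("clickCount")
  .setImplicitPrefs(true)   // clicks are implicit, not explicit, feedback
  .setRank(10)
  .setRegParam(0.1)

val model = als.fit(interactions)

// Top-10 document recommendations per user, to be indexed for query-time lookup.
val userRecs = model.recommendForAllUsers(10)
userRecs.write.mode("overwrite").parquet("recommendations/items_for_users.parquet")
```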

Background Spark Jobs doing extended data mining for relevancy

In order to avoid metadata becoming stale, and to encourage ever-improving relevancy, background jobs will run on a to-be-determined (TBD) schedule. Background jobs will not ingest data from external sources; rather, they will mine existing data for further relevancy. These sources would include all searchable indices, metadata profiles and signals collections. These jobs would use both the Random Forest and Matrix Decomposition jobs to more tightly relate documents to users.
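
A hedged sketch of such a background job, re-classifying already-indexed documents with a Random Forest over the same category set (paths, columns, and tree count are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF, StringIndexer}
import org.apache.spark.ml.classification.RandomForestClassifier

val spark = SparkSession.builder.appName("background-rf-refresh").getOrCreate()

// Mine already-indexed documents (no new external sources) on the TBD schedule.
val existing = spark.read.parquet("exports/searchable_docs.parquet")   // columns: text, category

val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val tokenizer    = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val tf           = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")
val idf          = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val rf           = new RandomForestClassifier().setNumTrees(100)

val refreshed = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, tf, idf, rf))
  .fit(existing)

refreshed.write.overwrite().save("models/background-rf")   // replaces the stale model
```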

Aggregate Signals Jobs and Experiments

Signals are end user interactions with the application. A signals aggregation job will leverage Head/Tail analysis, and will examine overall user interaction, as well as actions from any given end user. The result of this aggregated data mining will be applied at query time to enhance relevancy using the Boost with Signals stage.
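
As an illustration, the aggregation and a simple head/tail split might be computed as follows; the frequency threshold and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("signals-aggregation").getOrCreate()

// Raw signals: one row per end-user interaction.
val signals = spark.read.parquet("signals_raw.parquet")   // columns: userId, query, docId, type

// (query, doc) click counts, the kind of weights the Boost with Signals stage consumes.
val aggregated = signals
  .filter(col("type") === "click")
  .groupBy("query", "docId")
  .agg(count("*").as("clickCount"))

// Simple head/tail split: frequent queries form the "head", the long tail gets the rest.
val headTail = signals
  .groupBy("query")
  .agg(count("*").as("freq"))
  .withColumn("segment", when(col("freq") >= 100, "head").otherwise("tail"))

aggregated.write.mode("overwrite").parquet("signals_aggr.parquet")
```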

Real-time ML/NLP in the Query Pipeline

The primary query pipeline will be based on the Item-for-User query stage, providing a layer of quasi-unsupervised ML. At query request time, this stage will fold relevant aggregated signal information and extracted metadata into the context of the request. Any features and/or use cases that fall beyond the scope of the Item-for-User stage, and that are not available Out-of-the-Box (OOTB), will be handled via custom JavaScript stages that extract, enhance, and normalize for relevance as required. In general, OOTB stages will be able to handle 90% of the use cases; however, in just about any enterprise deployment, some type of customized JavaScript implementation will be required to provide functionality for the remaining 10%. This will effectively represent a layer of supervised ML and NLP. A second role of this pipeline will be to act as a router, of sorts, retrieving and responding with the results of ancillary query pipelines for identified custom or specialized queries beyond the standard user search (E.g. Calendar searches). If any type of transformation is required, it would happen after subsequent calls to these supporting pipelines using a custom JavaScript stage.
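
For illustration only, the boost-folding step such a custom stage would perform might reduce to something like the following sketch, where the field name and weights are assumptions:

```scala
// Hypothetical sketch of folding profile-derived boosts into the outgoing query parameters,
// roughly what a custom stage (proposed as JavaScript in this document) would do.
case class CategoryBoost(category: String, weight: Double)

def withBoosts(userQuery: String, boosts: Seq[CategoryBoost]): Map[String, String] = {
  // Build an eDisMax boost query from the user's highest-ranking categories.
  val bq = boosts.map(b => s"category_s:${b.category}^${b.weight}").mkString(" ")
  Map("q" -> userQuery, "defType" -> "edismax", "bq" -> bq)
}

// Example: a construction-team user searching for cables.
val params = withBoosts("cables", Seq(CategoryBoost("construction", 5.0)))
```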

Search Intent in the Query and Index Pipelines

There are four basic types of search intent: navigational, informational, transactional, and commercial. These four intents represent our end-result container for keyword buckets with regard to searchable data. From a user-intent perspective, our top-level container will be either “exhaustive” or “selective.” Exhaustive is defined as requiring more than two queries of searchable data to obtain the final result; selective is defined as requiring two or fewer queries to produce a result.

The analysis of the user’s signal data and a prioritized set of other variants will determine the final intent. For example, if a user consistently uses the search platform to look for travel arrangements, we would consider this user selective. If the user’s search criteria spanned a much wider set of categories, ranging consistently across a wide variety of keyword buckets, we would consider this user exhaustive. This categorization will factor into what type of boosting, ranking, and querying occurs in the pipeline.
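
A minimal sketch of this categorization, assuming the breadth threshold and the signal-derived summary fields shown here:

```scala
// Hedged sketch of the selective/exhaustive categorization; the names and the
// breadth threshold are assumptions, with inputs derived from aggregated signals.
case class UserSignalSummary(userId: String, distinctKeywordBuckets: Int, avgQueriesPerResult: Double)

def intentStyle(summary: UserSignalSummary, breadthThreshold: Int = 5): String =
  if (summary.distinctKeywordBuckets > breadthThreshold || summary.avgQueriesPerResult > 2.0)
    "exhaustive"   // wide-ranging, or typically needs more than two queries
  else
    "selective"    // narrow, typically two or fewer queries to a result
```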

Contextual ambiguity will be resolved using interpretation prior to determining intent. Interpretation will leverage a tiered structure:

I) Dominant: what is the most common interpreted meaning (Selective)

II) Common: providing results matching multiple contexts (Exhaustive)

III) Minor: less common interpretations that may be locale or date specific. (Exhaustive + Selective)

Note: Another applicable context will be so-called “micro-moments.” This is a type of informational query where time is generally considered critical (E.g. a flight schedule). These are not considered commercial or transactional queries, but rather a need-to-know-now type of query.

At a high level, the intent determination process is as follows (a routing sketch follows the list):

  1. Accept the incoming request
  2. Determine the user and user type
  3. Perform Query Interpretation
  4. Determine Query Intent
  5. Route to appropriate query pipeline
  6. Send the response
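
The sketch below strings these steps together; the heuristics, names, and pipeline labels are placeholders, not Fusion APIs:

```scala
// Illustrative end-to-end sketch of the steps above, with placeholder heuristics.
case class QueryRequest(userId: String, q: String)

// Step 3: interpretation, here a stand-in check against the "head" query set.
def interpret(q: String, headQueries: Set[String]): String =
  if (headQueries.contains(q.toLowerCase)) "dominant" else "common"

// Step 4: intent from the interpretation plus the user's selective/exhaustive type.
def determineIntent(interpretation: String, userType: String): String =
  (interpretation, userType) match {
    case ("dominant", "selective") => "navigational"
    case ("common", "exhaustive")  => "informational"
    case _                         => "commercial"
  }

// Step 5: route to the appropriate (hypothetical) query pipeline.
def route(req: QueryRequest, userType: String, headQueries: Set[String]): String =
  determineIntent(interpret(req.q, headQueries), userType) match {
    case "navigational"  => "navigational-pipeline"
    case "informational" => "informational-pipeline"
    case _               => "default-search-pipeline"
  }
```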

Ultimately, Search Intent is an aggregative process, one step building upon the last, and amounts to an educated guess. It is a prediction, rather than a binary yes-or-no answer. Given this, the greater the number of variables you have with which to make your prediction, the better. Varying degrees of accuracy will be expected. By way of a baseline metric: 80% is generally considered high by industry standards. A continuing series of tuning iterations should be anticipated.

A user profile will be pre-determined by looking at the contextual range of the user’s querying habits in the signals aggregation, e.g. do they generally search for one type of thing, or many? Are they in the office, or away? Are they on a project location, en route, or elsewhere? What is their role? The more of these types of questions that can be answered, the better the quality of the predicted intent. The priority with which each item would be boosted would be determined by running experiments.

Signals aggregation will be leveraged once again for interpretation, determining whether the query phrase itself is dominant, common, or minor. Once interpreted, query intent will be determined by querying the signals query logs, as well as via Head/Tail analysis. It will further take into consideration ‘micro-moments’ (E.g. holidays, location, etc.). The outcome of these processes will determine intent, and thus the query pipeline to be called and the search data returned.

The Boost with Signals, JavaScript and other query pipeline stages will be used to ultimately determine intent.

Example 1: User ‘1’ conducts search for ‘X’ and ultimately clicks on ‘Y’. Thus there is a correlation between ‘X’ and ‘Y’ for user ‘1’ that may or may not exist for users ‘2’, ‘3’, ‘4’, etcetera. Result ‘Y’ has been classified as a member of category ‘Z’; as previously determined by aforementioned classification/categorization jobs. User ‘1’ has a high-ranking relationship to category ‘Z’ as well. This provides an organic path to contextually-relevant intent, as we can assume with a fair degree of certainty that when user ‘1’ searches for ‘X’ his most favored category of result will be ‘Z’.

Example 2: User ‘2’ is onsite at a construction project in Kuala Lumpur and conducts a search for cables. User ‘2’ is known to be a member of a construction team, so construction items, like steel cable, would be boosted, whereas computer cables would be less relevant. Further, the search would take into account the location and project. What suppliers are close by? What materials would be most relevant to this project? The answers to these questions would provide boosting relevancy.

Experiments will be run to determine optimal relevancy

Once the aforementioned pipelines and jobs have been created, Fusion experiments should be run to determine optimal relevancy tuning. Experiments are a fundamental component of relevancy tuning; this cannot be overstated. It is easy to focus on the classifiers and stages, jobs and pipelines, etc. Make sure to prioritize and allocate time to become comfortable with the Fusion experiments UI, as well as with reading and interpreting result data.

Experiments will leverage Signals extensively. Relevancy tuning will leverage the Relevancy Tuning Workbench.
