- Proposed Strategies
- Further Reading
A Note on expectations: This document proposes a generalized workflow for indexing and querying with relevancy. The actual implementation will vary. Tuning of the stages, jobs, and experiments will be required to find the “sweet spot” that offers the highest degree of relevancy.
II. Proposed Strategies
To meet the expectations of modern search users, companies often deploy a multi-tiered Machine Learning (ML)/Natural Language Processing (NLP) strategy that provides optimal relevancy given the historical signals and aggregated meta information available. Some metadata will be extracted at request time from the context of the request (e.g. WHERE), while other metadata will be queried from a previously indexed composite of external data sources. Primary intent will be determined by extrapolating the “WHO”, “WHAT”, “WHEN”, and “WHERE” context of any given query for enhanced relevance. The data sources for each of these primary categories should be determined prior to the implementation of any ML jobs. By the conclusion of the first phase of the search upgrade, the goal should be to have built up metadata profiles for both the searchable documents and the users who perform the searches, and then to construct queries using this metadata to optimize relevance.
Metadata ML Index Pipeline Stage using an SVM Model
The metadata index pipeline jobs will be written in Scala and run on Spark. The Support Vector Machine (SVM) and Random Forest (RF) ML stages are just two of the possible avenues you may pursue to categorize and tokenize the normalized metadata, resulting in the production of the proposed relevancy composite profile.
These ML processes are normally supervised, leveraging a set of predetermined categories. The result of tokenization is analyzed for contextual relevance, and the classifier is trained on that result. Bodies of content will be indexed into statistically interesting tags and compiled into a composite profile of the given user. This information is then applied at query time.
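Fusion provides these stages ready-made, and the production jobs would run as Scala on Spark. Purely as an illustration of the underlying technique, the following self-contained sketch trains a linear SVM (Pegasos-style stochastic sub-gradient descent) over bag-of-words features; the data, category labels, and function names are hypothetical and not part of any Fusion API.

```python
import random
from collections import Counter

def tokenize(text):
    # Lowercase and split into word tokens (real pipelines would normalize further).
    return text.lower().split()

def featurize(text):
    # Bag-of-words term counts as a sparse feature vector.
    return Counter(tokenize(text))

def train_linear_svm(examples, epochs=50, lam=0.01):
    """Train a linear SVM with hinge loss via Pegasos-style SGD.
    `examples` is a list of (text, label) pairs with label in {+1, -1}."""
    w = {}
    t = 0
    rng = random.Random(0)
    data = [(featurize(x), y) for x, y in examples]
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w.get(f, 0.0) * v for f, v in feats.items())
            # L2 regularization shrink on every step.
            for f in w:
                w[f] *= (1 - eta * lam)
            if margin < 1:
                # Sub-gradient step on the hinge loss for misclassified
                # or low-margin examples.
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def predict(w, text):
    # Sign of the linear score decides the category.
    score = sum(w.get(f, 0.0) * v for f, v in featurize(text).items())
    return 1 if score >= 0 else -1
```

Trained on a handful of labeled snippets per category, the weight vector can then tag new content at index time, which is the role the SVM stage plays in the composite-profile pipeline.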
Searchable Data ML Index Pipeline Stage using ALS and Random Forest
Data searchable by end users will also be indexed using the ML Index Stage, applying the Alternating Least Squares (ALS) and Random Forest Fusion jobs. The ALS job builds an index used to offer recommendations. The Random Forest job classifies each document according to the previously mentioned pre-trained classifier. The goal here is to build up an index for providing recommendations as a feature, by surfacing potential relationships between users and documents. Persistent metadata and searchable data will have been classified with the same classifier, or the same set of classifications; as such, we should see organic relationships emerge between the two sets of data. This will be verifiable by running Fusion experiments, which are discussed later in this paper. See the Fusion documentation for more information on training a classifier.
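To make the ALS job concrete, here is a minimal rank-2 alternating least squares sketch in plain Python. The real job runs on Spark MLlib at scale; the ratings matrix, rank, regularization value, and iteration count below are illustrative assumptions only.

```python
import random

def solve2(a11, a12, a21, a22, b1, b2):
    # Solve a 2x2 linear system via Cramer's rule.
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

def als(ratings, n_users, n_items, lam=0.1, iters=20, seed=0):
    """Rank-2 alternating least squares over observed interactions.
    `ratings` is a dict {(user, item): rating}."""
    rng = random.Random(seed)
    U = [[rng.random(), rng.random()] for _ in range(n_users)]
    V = [[rng.random(), rng.random()] for _ in range(n_items)]

    def update(rows, fixed, by_row):
        # For each row, solve the regularized normal equations
        # (X^T X + lam*I) f = X^T r, holding the other factor fixed.
        for i, obs in by_row.items():
            a11 = a22 = lam
            a12 = b1 = b2 = 0.0
            for j, r in obs:
                x = fixed[j]
                a11 += x[0] * x[0]
                a12 += x[0] * x[1]
                a22 += x[1] * x[1]
                b1 += x[0] * r
                b2 += x[1] * r
            rows[i] = list(solve2(a11, a12, a12, a22, b1, b2))

    by_user, by_item = {}, {}
    for (u, it), r in ratings.items():
        by_user.setdefault(u, []).append((it, r))
        by_item.setdefault(it, []).append((u, r))
    for _ in range(iters):
        update(U, V, by_user)   # fix items, solve users
        update(V, U, by_item)   # fix users, solve items
    return U, V

def predict_rating(U, V, u, it):
    # Dot product of user and item factors estimates affinity,
    # including for (user, item) pairs never observed -- the basis
    # for recommendations.
    return U[u][0] * V[it][0] + U[u][1] * V[it][1]
```

The learned factors let you score unobserved user-document pairs, which is exactly the "potential relationships between users and documents" the recommendation index is built to surface.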
Background Spark Jobs doing extended data mining for relevancy
To keep metadata from becoming stale, and to encourage ever-improving relevancy, background jobs will run on a to-be-determined (TBD) schedule. Background jobs will not ingest data from external sources; rather, they will mine existing data for further relevancy. These sources would include all searchable indices, metadata profiles, and signals collections. These jobs would use both the Random Forest and Matrix Decomposition jobs to more tightly relate documents to users.
Aggregate Signals Jobs and Experiments
Signals are end-user interactions with the application. A signals aggregation job will leverage Head/Tail analysis, examining both overall user interaction and the actions of any given end user. The results of this aggregated data mining will be applied at query time to enhance relevancy using the Boost with Signals stage.
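A minimal sketch of the aggregation side of this, assuming signals arrive as (user, query, clicked-document) tuples and an 80% head-share cutoff; both assumptions are illustrative only, not Fusion's actual Head/Tail implementation.

```python
from collections import Counter

def aggregate_signals(signals):
    """Count raw click signals per query phrase.
    `signals` is a list of (user_id, query, clicked_doc) tuples."""
    return Counter(q for _, q, _ in signals)

def head_tail_split(query_counts, head_share=0.8):
    """Head/Tail analysis: the most frequent queries that together
    account for `head_share` of all traffic form the head; the long
    tail of rare queries is everything else."""
    total = sum(query_counts.values())
    head, seen = set(), 0
    for q, c in query_counts.most_common():
        if seen / total >= head_share:
            break
        head.add(q)
        seen += c
    tail = set(query_counts) - head
    return head, tail
```

Head queries are good candidates for direct signal boosting, while tail queries typically lean harder on the classifier- and profile-based relevancy described above.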
Real-time ML/NLP in the Query Pipeline
Search Intent in the Query and Index Pipelines
There are four basic types of search intent: navigational, informational, transactional, and commercial. These four intents represent our end-result container for keyword buckets with regard to searchable data. From a user-intent perspective, our top-level container will be either “exhaustive” or “selective.” Exhaustive is defined as requiring more than two queries of searchable data to obtain the final result; selective as requiring two or fewer queries to produce a result.
The analysis of the user’s signal data, along with a prioritized set of other variables, will determine the final intent. For example, if a user consistently uses the search platform to look for travel arrangements, we would consider that user selective. If the user’s search criteria spanned a much wider set of categories, ranging consistently across a wide variety of keyword buckets, we would consider the user exhaustive. This categorization will factor into what type of boosting, ranking, and querying occurs in the pipeline.
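The selective/exhaustive split above can be sketched as a simple bucket-spread test over the user's aggregated signals. The two-bucket threshold below mirrors the "two or fewer" definition but is an assumption to be tuned, not a fixed rule.

```python
from collections import Counter

def classify_user(signal_categories, selective_max=2):
    """Label a user 'selective' if their historical queries concentrate
    in at most `selective_max` keyword buckets, else 'exhaustive'.
    `signal_categories` is the list of category labels the aggregation
    jobs attached to the user's past queries."""
    buckets = Counter(signal_categories)
    return "selective" if len(buckets) <= selective_max else "exhaustive"
```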
Contextual ambiguity will be resolved using interpretation prior to determining intent. Interpretation will leverage a tiered structure:
I) Dominant: the most common interpreted meaning (Selective)
II) Common: results matching multiple contexts (Exhaustive)
III) Minor: less common interpretations that may be locale- or date-specific (Exhaustive + Selective)
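The tier selection might look like the following sketch, which scores candidate meanings by how often users selected results matching each one. The dominance threshold (0.6) and common-meaning floor (0.2) are placeholder values that would be tuned through experiments.

```python
def interpret(query, meaning_counts, dominance=0.6):
    """Assign an interpretation tier to a query from aggregated signals.
    `meaning_counts` maps each candidate meaning to how often users
    selected results matching it."""
    total = sum(meaning_counts.values())
    top_meaning, top = max(meaning_counts.items(), key=lambda kv: kv[1])
    if top / total >= dominance:
        # One clear meaning: treat as Dominant (Selective).
        return "dominant", [top_meaning]
    common = [m for m, c in meaning_counts.items() if c / total >= 0.2]
    if len(common) > 1:
        # Several live meanings: return results for each (Exhaustive).
        return "common", common
    # No strong consensus: fall back to locale/date-specific handling.
    return "minor", list(meaning_counts)
```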
Note: Another applicable context is the so-called “micro-moment.” This is a type of informational query where time is generally critical (e.g. a flight schedule). These are not considered commercial or transactional queries, but rather need-to-know-now queries.
From a high-level, the intent determination process:
- Accept the incoming request
- Determine the user and user type
- Perform Query Interpretation
- Determine Query Intent
- Route to appropriate query pipeline
- Send the response
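The steps above can be wired together as a simple dispatcher. Everything in this sketch is a placeholder: the pipeline names, the two-category selective threshold, and the informational fallback are hypothetical choices, not Fusion behavior.

```python
def determine_user_type(profile):
    # Selective users concentrate on few categories (assumed threshold of 2).
    return "selective" if len(profile.get("categories", [])) <= 2 else "exhaustive"

def determine_intent(query, signal_log):
    # Look the phrase up in the aggregated signal log; unseen phrases
    # fall back to informational intent.
    return signal_log.get(query, "informational")

# Hypothetical mapping of (user type, intent) to a named query pipeline.
PIPELINES = {
    ("selective", "transactional"): "boosted-transactional-pipeline",
    ("selective", "informational"): "boosted-informational-pipeline",
}

def route(request, profiles, signal_log):
    # Accept the request, resolve the user and user type, interpret the
    # query and determine intent, then route to a query pipeline.
    user_type = determine_user_type(profiles.get(request["user"], {}))
    intent = determine_intent(request["query"], signal_log)
    return PIPELINES.get((user_type, intent), "default-pipeline")
```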
Ultimately, Search Intent is an aggregative process, each step building upon the last, and amounts to an educated guess: a prediction rather than a binary yes-or-no answer. Given this, the more variables you have with which to make the prediction, the better. Varying degrees of accuracy should be expected; as a baseline metric, 80% is generally considered high by industry standards. A continuing series of tuning iterations should be anticipated.
A user profile will be predetermined by looking at the contextual range of the user’s querying habits in the signals aggregation. Do they generally search for one type of thing, or many? Are they in the office, or away? Are they at a project location, en route, or elsewhere? What is their role? The more of these questions that can be answered, the better the quality of the predicted intent. The priority with which each item is boosted will be determined by running experiments.
Signals aggregation will be leveraged once again for interpretation, determining whether the query phrase itself is dominant, common, or minor. Once interpreted, query intent will be determined by querying the signals query logs, as well as via Head/Tail analysis. It will further take ‘micro-moments’ into consideration (e.g. holidays, location, etc.). The outcome of these processes determines intent, and thus the query pipeline to be called and the search data returned.
Example 1: User ‘1’ conducts a search for ‘X’ and ultimately clicks on ‘Y’. Thus there is a correlation between ‘X’ and ‘Y’ for user ‘1’ that may or may not exist for users ‘2’, ‘3’, ‘4’, and so on. Result ‘Y’ has been classified as a member of category ‘Z’, as previously determined by the aforementioned classification/categorization jobs. User ‘1’ also has a high-ranking relationship to category ‘Z’. This provides an organic path to contextually relevant intent: we can assume with a fair degree of certainty that when user ‘1’ searches for ‘X’, their most favored category of result will be ‘Z’.
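Example 1 reduces to counting which document categories a user's clicks land in, given the category labels the classification jobs already assigned. A minimal sketch, with hypothetical signal and category data:

```python
from collections import Counter

def category_affinity(clicks, doc_categories):
    """Derive each user's favored result categories from click signals.
    `clicks` is a list of (user, query, doc) tuples; `doc_categories`
    maps each doc to the category the classification jobs assigned it."""
    affinity = {}
    for user, _, doc in clicks:
        affinity.setdefault(user, Counter())[doc_categories[doc]] += 1
    return affinity

def favored_category(affinity, user):
    # The user's highest-ranking category, e.g. 'Z' for user '1'.
    return affinity[user].most_common(1)[0][0]
```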
Example 2: User ‘2’ is onsite at a construction project in Kuala Lumpur and conducts a search for cables. User ‘2’ is known to be a member of a construction team, so construction items, like steel cable, would be boosted, whereas computer cables would be less relevant. Further, the search would take into account the location and project. What suppliers are close by? What materials would be most relevant to this project? The answers to these questions would provide boosting relevancy.
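Example 2 amounts to additive context boosts at query time. In this sketch the boost weights, field names, and documents are all placeholders; in practice the weights would come out of the experiments described below.

```python
def boost(doc, user_context):
    """Compute a context-boosted score from request-time WHO/WHERE data.
    The weights below are placeholders to be tuned via experiments."""
    score = doc["base_score"]
    if doc.get("category") == user_context.get("team_category"):
        score += 2.0  # role match, e.g. construction team -> steel cable
    if doc.get("supplier_region") == user_context.get("region"):
        score += 1.0  # nearby suppliers rank higher
    return score

def rank(docs, user_context):
    # Re-rank candidate documents by their context-boosted scores.
    return sorted(docs, key=lambda d: boost(d, user_context), reverse=True)
```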
Experiments will be run to determine optimal relevancy
Once the aforementioned pipelines and jobs have been created, Fusion experiments should be run to determine optimal relevancy tuning. Experiments are a fundamental component of relevancy tuning; this cannot be overstated. It is easy to focus on the classifiers and stages, jobs and pipelines, and so on. Make sure to prioritize and allocate time to become comfortable with the Fusion experiments UI, as well as with reading and interpreting result data.
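As a stand-in for what an experiment ultimately reports, the following sketch compares pipeline variants by click-through rate computed from the signals collections. The metric choice and variant names are illustrative; Fusion's experiments UI supports richer metrics than this.

```python
def ctr(clicks, impressions):
    # Click-through rate, guarding against zero impressions.
    return clicks / impressions if impressions else 0.0

def compare_variants(results):
    """Pick the winning pipeline variant of an A/B experiment by
    click-through rate. `results` maps variant name to a
    (clicks, impressions) pair drawn from the signals collections."""
    scored = {v: ctr(c, n) for v, (c, n) in results.items()}
    winner = max(scored, key=scored.get)
    return winner, scored
```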