The Future of Search is Semantic and Lexical


“It’s what you all been waiting for, ain’t it?
Your weekly entertainment
For me to get a hold of this beat and go ‘head, claim it
I’m about to paint a picture, you ***** go ‘head, frame it
Since we getting Seinfeld on some Jerry and Elaine shit” — Drake, Barry Bonds Freestyle.

After Planet Earth, Game of Thrones, and The Wire, I cannot think of a show more recognizable to my friends than Seinfeld. When I think of the two names “Jerry and Elaine,” I think Seinfeld, and I never even watched the series seriously. My brain has placed the phrase “Jerry and Elaine” near the Seinfeld series. It’s a great illustration of how semantic search could work one day. It’s also a great view into the shortcomings of the lexical approach.

Seinfeld, for illustration purposes, sitting squarely between the two people in question

Lucene, the library at the core of most search engines used by humans and computers alike, does not ordinarily consider the phrase “Jerry and Elaine” to bear any relation to “Seinfeld.” Out of the box, the lexical method fails to infer the intent that facilitates information discovery. To build a search engine in 2022 that would surface a document titled “Seinfeld” from the query “Jerry and Elaine,” you would need to take at least one of the following three actions:

(1) Maintain a list of synonyms for phrase queries, where “Jerry and Elaine” maps to “Seinfeld” for all future queries.
(2) Pull in a document with “Seinfeld” in the title field by boosting a phrase query on a description field, ANDed together with many other clauses in a boolean query, and pray that no other title in your corpus features a Jerry and an Elaine (sketched just below).
(3) Encode all the documents as vectors, index those vectors in the search index, and calculate the similarity between the vectorized input and the stored vectors at query time.
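For concreteness, option (2) might look something like the following with the Elasticsearch query DSL and its Python client. The index and field names are hypothetical:

```python
# A rough sketch of option (2): lexical matching with boolean clauses
# and a boosted phrase query. Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="shows",
    query={
        "bool": {
            "must": [  # AND semantics: every clause has to match
                {"match": {"description": "Jerry"}},
                {"match": {"description": "Elaine"}},
            ],
            "should": [  # boost documents whose description has the exact phrase
                {
                    "match_phrase": {
                        "description": {"query": "Jerry and Elaine", "boost": 5}
                    }
                }
            ],
        }
    },
)
# Now pray that nothing else in the corpus also stars a Jerry and an Elaine.
```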

Hint: the one that sounds the hardest upfront is the easiest.
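That would be option (3). Here is a minimal sketch, assuming the open-source sentence-transformers library and one of its small pretrained models; the catalog of titles is illustrative:

```python
# A minimal sketch of option (3), assuming the open-source
# sentence-transformers library and a small pretrained model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Index time: encode every document title and store the vectors.
titles = ["Seinfeld", "Friends", "The Wire", "Planet Earth"]
title_vectors = model.encode(titles)

# Query time: encode the input and score it against the stored vectors.
query_vector = model.encode("Jerry and Elaine")
scores = util.cos_sim(query_vector, title_vectors)[0]

# With a reasonable model, "Seinfeld" should score highest even though
# it shares no terms with the query.
print(titles[int(scores.argmax())])
```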

We lack humans. Humans in the loop are latency in the process. Retailers and media companies, financial services and security firms: all are getting their asses collectively kicked by companies that have embraced machine learning. The strongest solution is a combination of all three methods from the introduction, in part because of real-life use cases and their associated requirements. People are constrained by rules, and the rules of vector space are ill-defined. I’m going to break down how the best “vector database,” cue the buzzwords, is actually about satisfying all the requirements of a general-purpose search engine while also being smarter than lexical matching. Lexical matching still matters, just less so.

If I ask in my search query that I only want shows from 2020, I don’t give a damn about Seinfeld. Exclude it from the results. This is the fraudulent positioning of many of these so-called vector databases: they have never been companies powering production search workloads. Users and their end users want some modicum of control. ML doesn’t always win! For many search use cases, and for the capacity of their teams to manage complexity, ML-only is a much more expensive option than the correct approach. Search companies of the future will have three characteristics that will shape our usage of them and our success with those tools. A flexible data model that supports transactions, a lexical relevance framework, and dense vectors that enable approximate nearest neighbor search will converge to power the systems that help us with domain-specific search.
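As a sketch of what that control looks like, Elasticsearch 8.x lets you attach a hard filter to an approximate nearest neighbor query, so the semantic side can never override the constraint. The index, field names, and the 2020 cutoff below are hypothetical:

```python
# Hypothetical sketch: pre-filtered approximate nearest neighbor search.
# The filter constrains the candidate set, so a 1989 sitcom cannot leak
# into a request for shows from 2020. Index and field names are made up.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")

resp = es.search(
    index="shows",
    knn={
        "field": "title_vector",  # a dense_vector field in the mapping
        "query_vector": model.encode("Jerry and Elaine").tolist(),
        "k": 10,
        "num_candidates": 100,
        "filter": {"range": {"release_year": {"gte": 2020}}},
    },
)
```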

Each of these approaches has its shortcomings, yet the document model shall allay them. The document model is the modern representation of humanly readable data, as I argued in an earlier post about cognitive track and field. I asked my research assistant to review some data on just how much developers love the document model, and, more subtly, what share of the programming languages that are more loved than hated rely largely on an object-oriented paradigm for giving instructions.

Without going into too much detail about databases, because this post isn’t about why developers love them: five of the six most popular databases support JSON at some level, and three (Elasticsearch, Google Cloud Firestore, and MongoDB) have rich, first-class support for the document model.

SQLite is super light on document support

And of the 42 most used languages, almost all are primarily used for object-oriented programming. Of the ones that are very loved, Elixir (one of my personal favorites) and Clojure are the only primarily functional languages, and both of them support objects in a very natural manner. Users work with objects everywhere they look.

The document model and the paradigm shift that is the flexible schema make it such that users can build auxiliary attributes that enrich a search corpus to facilitate query understanding, the first step of great search relevance. If a customer enters a ZIP or postal code, an effective search system should be able to identify that entity. The document model enables you to enrich documents by extracting entities from the corpus and dropping them into the appropriate buckets at index time. Woe is the developer still doing a wildcard search and praying that something hits without any data enrichment.
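A minimal sketch of that index-time enrichment, assuming a US-style ZIP code and a free-form description field (both illustrative):

```python
# A minimal sketch of index-time enrichment under the document model:
# extract a US ZIP code from free text and store it as its own field
# so queries can filter on it directly. Field names are illustrative.
import re

ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def enrich(doc: dict) -> dict:
    match = ZIP_RE.search(doc.get("description", ""))
    if match:
        doc["zip_code"] = match.group()  # auxiliary attribute, added at index time
    return doc

print(enrich({"title": "Cozy loft", "description": "Near Lake Merritt, 94612"}))
# {'title': 'Cozy loft', 'description': 'Near Lake Merritt, 94612', 'zip_code': '94612'}
```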

Do you understand Lucene? If not, I don’t blame you. It’s too hard and requires a multi-disciplinary set of skills. I’ve dedicated the past 4+ years to making it easier to exploit, but not enough time to making it easier to understand. That’s about to change. Big ’tings on the horizon. In the meantime, let’s see if I can break it down in the next two paragraphs.

Lucene is language agnostic but gives you a load of options for language-specific analysis, the mechanism that determines how your data will be altered at index and query time to match a broader range of human input. By default, Lucene’s language analyzers strip special characters, punctuation, and other things that matter to humans, like suffixes. That means a word in the corpus like friendliness might be reduced to a stem like frien so that it matches friendly, friend, friends, and friendship, which have been stemmed down to the same root. You can control this behavior with some customization in your index and query analyzers, but almost every use case needs to be tested to truly understand the effects.
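You can see the idea with any off-the-shelf stemmer. A minimal sketch using NLTK’s Snowball stemmer as a stand-in for Lucene’s (the exact stems differ by algorithm, but the conflation behavior is the same idea):

```python
# Stemming in miniature: NLTK's Snowball stemmer standing in for the
# stemmer inside Lucene's english analysis chain. Exact stems vary by
# algorithm; the point is which words collapse to the same root.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["friendliness", "friendly", "friend", "friends", "friendship"]:
    print(word, "->", stemmer.stem(word))
# Words that share a stem will match each other at query time,
# whether or not the human who typed the query intended that.
```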

Secondarily, Apache Lucene, viewed as a DBMS, is a very powerful indexing and query engine in its own right. You can have an index with thousands of fields, though I’d never recommend it. What that means under the hood is that if you need to build a search system that filters on a non-deterministic number of fields, like all the newly added enrichment fields from NER pipelines, it’s no problem for Lucene. The block k-d (BKD) tree that Lucene builds on a single index makes Postgres index intersection look like a John Witherspoon (RIP) stand-up set.

Postgres/file cabinet JOIN hell
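To make the Lucene side concrete, here is a hedged sketch of filtering on several enrichment fields at once, via Elasticsearch; the index and field names follow the hypothetical enrichment above:

```python
# A hedged sketch of filtering on many enrichment fields at once.
# Lucene (here behind Elasticsearch) intersects these clauses cheaply;
# the index and field names follow the hypothetical enrichment above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="enriched_docs",
    query={
        "bool": {
            "filter": [  # each clause targets a different enriched field
                {"term": {"zip_code": "94612"}},
                {"term": {"entities.person": "Elaine"}},
                {"range": {"release_year": {"gte": 2020}}},
            ]
        }
    },
)
```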

Dense vectors, semantic search, and the language models that power them are a whole new world. They are far from perfect today because they are too damn complex to modify for hoi polloi. Floating points are still buoyant. The average search requirement is not multi-modal in nature. Companies don’t want to scale their clusters sufficiently to support new indexing and querying costs. And most users are satisfied with shitty results. In fact, most customers looking for Seinfeld are not going to type “Jerry and Elaine.” They will simply type “Seinfeld.”

The future search engine will help you find what you are looking for before you know exactly what it is, when it is on the tip of your tongue. A production-ready search engine will be a mixture of many tools, lexical and semantic, and none of them closed-source. The models need to be inspectable. The search engine needs to be inspectable. And the database needs to be inspectable. I wouldn’t trust anything else. If you find yourself needing to build a search engine that supports images, sound, text, and generally unstructured inquiry and intent like most search engines do, let me know. I’ll keep it a buck!

I’m in this business to help people find information, whether that means selling products and services or finding a cure. You need to know the tools at your disposal to improve this world.

A leading English-language poet of today conveyed, in the epigraph above, the interconnectedness that powers our brains. I struggle to imagine a more perfect introduction to the future of search engines than that lyric from the genius.com corpus. As a builder and user of search engines, I’ve long thought that the dataset was an interesting one. The site forces users to convert lyrics from a recognized dialect, Black Vernacular English, to the King’s tongue, minus a few extraneous u’s. I presume that it is the lack of standardization, not ignorance, that leads the team to make such demands. Without that normalization, the search functionality on the site would not work as well out of the box. The lucene.english analyzer’s analysis chain would not remove stop words like da (the) or fo (for). They, too, need semantic search.
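A hypothetical sketch of the lexical half of that fix: Elasticsearch index settings that extend English analysis with vernacular stop words the stock english analyzer misses. The analyzer plumbing is standard; the word list and index name are illustrative.

```python
# Hypothetical sketch: a custom analyzer that drops vernacular stop
# words (da, fo) alongside the standard English ones (the, for).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="lyrics",
    settings={
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "vernacular_stop": {"type": "stop", "stopwords": ["da", "fo"]},
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                "lyrics_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "vernacular_stop",  # drops da and fo alongside the and for
                        "english_stemmer",
                    ],
                }
            },
        }
    },
)
```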

The first use case to tackle will be domain-specific search like genius.com, but we can anticipate that challengers to Google based on Lucene will emerge. DuckDuckGo raised $100 million from a star-studded cast of suspects, and it is based on Lucene. There will be more. They will be smarter. At some point in the near future, we as consumers will have an array of options and magnificent search experiences on most of the digital properties we visit. You can thank Euclid and his Egyptian tutors for the future that is the present.

I don’t often drink the ML Kool-Aid, but when I do I do it with syntactic sugar.

© Marcus Eagan.