Unlocking the Power of Vector Search in Enterprise
The advance of large language models (LLMs) and generative AI has unlocked the potential to deliver more refined search experiences. To get answers to their questions, people can now talk to AI chatbots such as ChatGPT and receive an answer within seconds, instead of spending hours sifting through long pages of search results. However, in their current form, LLMs are not grounded in accurate and trustworthy knowledge, and can sometimes generate responses based on flawed or biased information. One solution to this is to integrate a search system that supplies LLMs with credible information.
Thankfully, the advent of LLMs has also lent itself to search. Vector search, which uses embeddings derived from these LLMs, has introduced a new level of intelligence to search systems. In this blog post, we will benchmark text embedding offerings in the market and show how Glean creates advanced embedding models for enterprises by refining them with specific company language. These enterprise-specific embedding models are then combined with traditional information retrieval methods and advanced personalization to create a hybrid system that elevates enterprise search capabilities.
An Introduction to Vector Search
Embeddings are numerical representations of text that capture its semantic information, making it easier for computers to understand relationships between concepts. Unlike traditional keyword-matching information retrieval methods, vector search leverages these representations to deliver more accurate results in certain scenarios.
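To make this concrete, here is a minimal sketch of how vector search ranks documents: each query and document is represented as an embedding vector, and documents are ordered by cosine similarity to the query. The toy 4-dimensional vectors and document names below are illustrative only; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for the output of a real embedding model.
query_vec = [0.1, 0.9, 0.2, 0.0]
doc_vecs = {
    "quarterly revenue report": [0.1, 0.8, 0.3, 0.1],
    "office snack inventory":   [0.9, 0.0, 0.1, 0.4],
}

# Rank documents by semantic similarity to the query embedding.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
```

Because similarity is computed between dense vectors rather than matched keywords, a document can rank highly even when it shares no exact terms with the query.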
To understand the performance of embeddings in enterprise search, we conducted an experiment using the best text embeddings from two leading LLM providers, and three top-performing open source models. The experiment evaluated the performance of different text embedding models on an enterprise search evaluation set. To assess the effectiveness of these embeddings, we used two key metrics: NDCG@10 and R@100. NDCG@10 measures the quality of the top 10 search results by taking into account both relevance and ranking, while R@100 (recall at 100) gauges the search system's ability to retrieve relevant results, expressed as the fraction of all relevant documents that appear in the top 100 results.
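For readers unfamiliar with these metrics, the following is a minimal sketch of how NDCG@k and R@k are typically computed; it follows the standard textbook definitions rather than any particular evaluation library.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each result's relevance grade is
    # discounted by the log of its rank (rank 1 -> log2(2), etc.).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k: DCG of the top k results, normalized by the DCG of
    the ideal ordering, so a perfect ranking scores 1.0."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """R@k: fraction of all relevant documents found in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0
```

NDCG rewards placing the most relevant documents earliest, while recall measures coverage regardless of ordering, which is why the two metrics complement each other.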
The results of our experiment revealed that, on this enterprise search task, open source embeddings such as E5-large, Instructor-XL, and MPNet still outperformed the embeddings provided by commercial API providers such as OpenAI (text-embedding-ada-002) and Cohere (large). At least for this use case, open source embeddings remain the stronger choice for enterprise search. However, AI is advancing rapidly, and it will be fascinating to observe how the field evolves.
Adapting Vector Search to Your Company Language
At Glean, we understand that the way you communicate within your company can differ greatly from how other companies communicate. For example, companies often have specific acronyms, project code names, or technical concepts that are unique to their business, creating a language endemic to each particular workplace and vertical (medicine, legal, banking, etc.). These terms and phrases may not be recognized by generic text embeddings, and thus may not deliver the desired results in enterprise search.
That's why we've developed a method for fine-tuning embeddings on the unique language of our clients, resulting in each customer having their own customized large language model that performs well in any business or vertical. In other words, no need for constructed vertical training sets! Our experiments demonstrate that this in-domain fine-tuning significantly improves vector search performance, surpassing both commercial API providers and top-performing open source models. This is in line with external research that highlights the shortcomings of dense vector search methods on out-of-domain data.
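One common recipe for this kind of fine-tuning (not necessarily the exact method Glean uses) is an in-batch contrastive objective: each query is pulled toward its matching document and pushed away from the other documents in the batch. The sketch below shows the loss computation on plain Python lists; `temperature` and the zero-indexed positive pairing are illustrative assumptions.

```python
import math

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss: the positive for
    query i is doc i; every other doc in the batch is a negative."""
    def sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every doc, scaled by temperature.
        logits = [sim(q, d) / temperature for d in doc_vecs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        # Cross-entropy with the positive doc as the target class.
        total += -(logits[i] - log_denom)
    return total / len(query_vecs)
```

During training, gradients of this loss reshape the embedding space so that company-specific terms (acronyms, project code names) land near the documents they actually refer to.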
Not only does this method improve the initial search performance, but our research also shows that the longer a customer stays with us, the better their language model becomes. The continuous adaptation and fine-tuning of the model leads to an increasingly improved user experience and more accurate search results.
Vector Search as a Component of Modern Enterprise Search and Knowledge Discovery
While vector search is a foundational technology for semantic understanding, it alone isn't enough to deliver quality results in enterprise search and knowledge discovery. Glean utilizes a multi-dimensional approach combining vector search, traditional keyword search, and advanced personalization into a powerful hybrid search system.
Interested in seeing how it all works? Get a demo today and discover how we're revolutionizing search for enterprise environments.