The Why, What, Who and Where of Vector Databases

ABSTRACT: Vector databases are a new category of NoSQL database that stores, manages, and searches vector embeddings of unstructured data.

2023 is turning out to be the year in which vector databases emerge from the shadows and into the limelight. They will power the next generation of AI applications and serve as long-term memory for workflows and workloads based on LLMs (Large Language Models). LLMs are a subset of foundation models; trained on large volumes of structured and unstructured data, they represent language as dense vectors.

The vector database is a new species of database that elegantly handles the management of vectors, also called embeddings: compressed numerical representations of data objects. We can view the vector database as the next generation of search engine and an essential component for organizations that build proprietary large language models.

Why does one need it?

With data being generated at unprecedented rates, vector databases that store and retrieve these vectors will become increasingly important for making sense of it all. The world is rapidly transitioning from keyword search to semantic search, which finds data and returns results based on concepts.


Vector search excels at matching concepts rather than just words and text.


Today the world needs vector databases to unleash the value trapped in unstructured data. The landscape of vector databases and their solutions and use cases are evolving at breakneck speeds. Reference architecture and best practices have yet to emerge and mature.

Search used to be text-based, but with vector encoding of data it is becoming increasingly convenient to search through images, audio, video, and concepts, or even combinations of these. This leads to multi-modal search, which allows searching for one type of content using another (for example, searching for text using images or vice versa) and is already revolutionizing the field.

What is it?

Vectors are numerical arrays that characterize an object. Deep learning models generate vectors to simplify how they process multi-structured content. Designed to be multilingual and multimodal, vector databases can process any natural language and unstructured data in any form (images, videos, audio, text) within the same vector space (see figure 1).


A vector database stores, indexes, and searches large unstructured datasets by processing the embedded vectors of deep learning models.


Figure 1 shows a high-level scheme for generating vectors. Searching in vector databases relies on similarity metrics and indices. The former defines how the database evaluates the distance between two vectors; the most commonly used similarity metric is the Euclidean distance, also known as the L2 norm. Vector databases can generate a wide range of indices, each with its advantages and disadvantages. Indices play a critical role in speeding up queries and handling concurrency. Vector databases allow configuring both the indexing algorithm and the similarity metric based on the use case.
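As a minimal illustration of these similarity metrics, the Euclidean (L2) distance and the widely used cosine similarity can be computed in plain Python. The function names here are illustrative, not any particular database's API:

```python
import math

def euclidean_distance(a, b):
    # L2 norm of the difference vector: lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based metric: 1.0 means the vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
doc = [0.9, 0.1, 1.1]
dist = euclidean_distance(query, doc)
sim = cosine_similarity(query, doc)
```

In a real vector database the metric is chosen at index-creation time; Euclidean distance suits magnitude-sensitive embeddings, while cosine similarity is common for normalized text embeddings.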

Before using and deploying vector databases to production, enterprises should understand criteria such as the following.

  • Scalability: Vectors can exceed the limits of a single machine. Scaling embedding vectors can prove challenging across both compute and storage. Practitioners can shard the data and indices to help scale out storage. This also helps them retrieve billions of vectors at high concurrency and guarantee low latencies across read, write, and update workloads.

  • Stability and Reliability: When selecting a vector database for your use case based on non-functional requirements (NFRs), evaluate its ability to handle failures at the hardware and software level without data loss and with minimal downtime.

  • Speed: Vector databases support blazing-fast performance, especially for processing and indexing data in real time. For use cases that process and query hundreds or thousands of data points per second, latency is a critical factor in choosing a vector database.

  • Flexibility: The ability to configure and choose different indices and search algorithms, and support for different computing architectures (CPU, GPU, TPU, etc.).

The index is the core component of a vector database; the database relies on it to deliver consistent performance at scale. Vector databases abstract the underlying indexing technique from the end user. Algorithms for vector index generation consider the filter type, data characteristics such as cardinality and distributions, and query-load statistics. For filtered searches, the vector database converts each predicate into a binary classifier that yields a boolean value per vector.
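A rough sketch of how a predicate becomes a boolean filter during search, assuming a toy brute-force index of (vector, metadata) pairs; `filtered_search` and the record layout are hypothetical, not any vendor's API:

```python
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_search(index, query, predicate, k=2):
    # The predicate acts as a binary classifier over each record's
    # metadata; only vectors it accepts are ranked by distance.
    candidates = [(l2(vec, query), meta) for vec, meta in index if predicate(meta)]
    return sorted(candidates, key=lambda c: c[0])[:k]

# Toy index: (embedding, metadata) pairs
index = [
    ([0.1, 0.9], {"lang": "en"}),
    ([0.2, 0.8], {"lang": "fr"}),
    ([0.9, 0.1], {"lang": "en"}),
]
hits = filtered_search(index, [0.0, 1.0], lambda m: m["lang"] == "en")
```

Production systems apply the same idea against approximate indices (e.g. graph- or cluster-based), deciding per use case whether to filter before or after the vector scan.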

Who are the players?

Some of the major players include Vespa, Milvus, Qdrant, Weaviate, Pinecone, Zilliz, CozoDB, Twelve Labs (focused on video), and Redis.

Relational and NoSQL databases are integrating the vector data type as a first-class citizen, taking advantage of advances in vector-based data representation and supporting vector search either natively or through vector search libraries.

Postgres supports an extension called pgvector for vector-based similarity search.

Vector search libraries and engines include Vald, a vector search engine, and libraries such as Faiss and ScaNN, which are fast and efficient at performing vector similarity searches. These libraries help create the indices that speed up query processing. Libraries are useful for small to medium-sized datasets, and vector databases are built on top of them. Elasticsearch also offers vector search as a plugin. Plugin-based approaches are limited and unoptimized, and their performance, especially at scale, can be questionable.

Where to use them?

Search: Vector databases are designed for search and data mining, and they especially excel at semantic search. Their primary use case is finding unstructured data: text, images, audio, and video. Unlike traditional searches, which retrieve only exact matches, vector databases return similar, close approximations.

Similarity-search use cases mostly involve unstructured data, such as images, video, and audio, that is hard to classify in relational databases. Vector databases can analyze data in real time. For example, they can quickly look up objects in front of autonomous cars to identify and categorize them, or support facial recognition systems. Vector databases can also be applied to audio search, for instance to identify the name of a song or a user's voice.

Vector databases can also be applied to searching through genomic data and DNA sequences, which relies heavily on approximate nearest-neighbor search results.

Cluster Identification: Use cases like de-duplicating a list or automatically categorizing its items are possible with a vector database. Vector databases can also be applied to graphs by storing graph embeddings (node, edge, and sub-graph embeddings) and identifying similar networks and clusters. They find use in recommendation engines, where similarity search can suggest relevant items to users, as well as in content-management search and item categorization.
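A de-duplication pass of the kind described above can be sketched as a cosine-similarity threshold over embeddings; the `dedupe` helper and the 0.95 threshold are illustrative assumptions, not a standard API:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dedupe(embeddings, threshold=0.95):
    # Keep a vector only if it is not near-identical to an already-kept one.
    kept = []
    for vec in embeddings:
        if all(cosine(vec, existing) < threshold for existing in kept):
            kept.append(vec)
    return kept
```

A vector database performs the same check with its index instead of a linear scan, so each insert needs only one approximate nearest-neighbor lookup rather than a comparison against every stored vector.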

Recommendation: Recommender systems can use vector databases to store, index, search, and retrieve unstructured data. Home Depot, for example, dramatically improved its search experience using vector databases. Semantic-search use cases, where searching is based on understanding natural language, are a good fit for vector databases. In such cases, vector databases can index vector embeddings from NLP models to capture the context of the text.

LLM Workflow: One of the shining-star use cases of vector databases is providing a form of persistence for LLMs, letting them query and retrieve domain-specific information by performing a similarity search for the closest match among the persisted vector embeddings. Vector databases have been successfully used to alter the workflow of prompts with ChatGPT: end-user questions are first routed to a vector database to retrieve the most relevant results, which are consolidated with the original question and resubmitted to the ChatGPT engine. This altered workflow has been found to return better results.
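This altered workflow can be sketched end to end, with the vector store as a plain in-memory list and the final LLM call omitted; `retrieve` and `build_prompt` are hypothetical helpers, and the two-dimensional embeddings are hand-made stand-ins for a real embedding model:

```python
import math

def cosine(a, b):
    # Similarity between a query embedding and a stored embedding
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(store, query_vec, k=2):
    # Rank persisted (vector, text) pairs by similarity to the query embedding
    ranked = sorted(store, key=lambda item: cosine(item[0], query_vec), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, context_chunks):
    # Consolidate the retrieved context with the original question
    # before resubmitting the prompt to the LLM.
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy persisted embeddings with their source text
store = [
    ([0.9, 0.1], "Returns are accepted within 30 days."),
    ([0.1, 0.9], "Shipping is free over $50."),
]
chunks = retrieve(store, [0.95, 0.05], k=1)
prompt = build_prompt("What is the return window?", chunks)
```

The prompt, rather than the raw question, is what gets sent to the LLM, grounding its answer in the retrieved domain-specific text.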

Conclusion

Vector databases have brought a new category to the NoSQL space. These databases are still relatively new and are continuously evolving and adding capabilities. Non-functional features like horizontal scaling, pagination, sharding, and GPU/TPU support are still in development. In a world that is increasingly AI-driven, vector databases look to be an important cog in the data and AI flywheel.