Dr. Ram Sriharsha is the VP of Engineering and R&D at Pinecone.
Before joining Pinecone, Ram held senior roles at Yahoo, Databricks, and Splunk. At Yahoo, he was first a principal software engineer and then a research scientist; at Databricks, he was the product and engineering lead for the unified analytics platform for genomics; and, in his three years at Splunk, he played multiple roles including Sr. Principal Scientist, Distinguished Engineer, and VP of Engineering.
Pinecone is a fully managed vector database that makes it easy to add vector search to production applications. It combines vector search libraries, capabilities such as filtering, and distributed infrastructure to provide high performance and reliability at any scale.
What initially attracted you to machine learning?
High dimensional statistics, learning theory, and related topics were what attracted me to machine learning. They are mathematically well defined, can be reasoned about, and offer fundamental insights into what learning means and how to design algorithms that can learn efficiently.
Previously you were Vice President of Engineering at Splunk, a data platform that helps turn data into action for Observability, IT, Security and more. What were some of your key takeaways from this experience?
I hadn’t realized until I got to Splunk how diverse the use cases in enterprise search are: people use Splunk for log analytics, observability, and security analytics, among myriad other use cases. What is common to many of these use cases is the idea of detecting similar events, or highly dissimilar (anomalous) events, in unstructured data. This turns out to be a hard problem, and traditional means of searching through such data aren’t very scalable. During my time at Splunk I initiated research into how we could use machine learning (and deep learning) for log mining, security analytics, and related areas. Through that work, I came to realize that vector embeddings and vector search would end up being a fundamental primitive for new approaches to these domains.
Could you describe for us what vector search is?
In traditional search (otherwise known as keyword search), you are looking for keyword matches between a query and documents (these could be tweets, web documents, legal documents, what have you). To do this, you split your query into tokens, retrieve documents that contain those tokens, and then merge and rank them to determine the most relevant documents for the query.
The main problem, of course, is that to get relevant results, your query has to have keyword matches in the documents. A classic example: if you search for “pop” you will match “pop music”, but you will not match documents containing “soda”, as there is no keyword overlap between “pop” and “soda”, even though we know that colloquially, in many areas of the US, “pop” means the same as “soda”.
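To make those mechanics concrete, here is a minimal sketch of keyword search over an inverted index. The toy corpus, whitespace tokenizer, and hit-count scoring are simplifications of my own; production engines add stemming, BM25-style ranking, and much more:

```python
from collections import defaultdict

# Toy corpus: doc id -> text
docs = {
    1: "pop music charts this week",
    2: "soda sales rise in summer",
}

# Inverted index: token -> set of doc ids containing that token
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def keyword_search(query):
    """Rank documents by how many query tokens they contain."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for doc_id in index.get(token, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(keyword_search("pop"))   # [1] -- matches "pop music"
print(keyword_search("soda"))  # [2] -- a "pop" query never retrieves this
```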
In vector search, you start by converting both queries and documents into vectors in some high dimensional space. This is usually done by passing the text through a deep learning model, such as one of OpenAI’s language models. What you get as a result is an array of floating point numbers that can be thought of as a vector in a high dimensional space.
The core idea is that nearby vectors in this high dimensional space are also semantically similar. Going back to our example of “soda” and “pop”: if the model is trained on the right corpus, it is likely to consider “pop” and “soda” semantically similar, so the corresponding embeddings will be close to each other in the embedding space. If that is the case, then retrieving relevant documents for a given query becomes the problem of searching for the nearest neighbors of the query vector in this high dimensional space.
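To illustrate that retrieval step, here is a brute-force sketch using cosine similarity. The three-dimensional vectors below are invented for the example; a real embedding model produces vectors with hundreds or thousands of dimensions:

```python
import numpy as np

# Pretend these documents have already been embedded (values invented
# for illustration; a real model emits much higher-dimensional vectors).
doc_vectors = {
    "pop music charts": np.array([0.9, 0.1, 0.2]),
    "soda sales rise":  np.array([0.8, 0.2, 0.1]),
    "tax law changes":  np.array([0.1, 0.9, 0.3]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(query_vector, k=2):
    """Exact (brute-force) nearest-neighbor search by cosine similarity."""
    scored = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

# A query embedded near "pop"/"soda" retrieves both documents,
# even though the two share no keywords.
print(nearest_neighbors(np.array([0.85, 0.15, 0.15])))
```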
Could you describe what a vector database is and how it enables the building of high-performance vector search applications?
A vector database stores, indexes and manages these embeddings (or vectors). The main challenges a vector database solves are:
- Building an efficient search index over vectors to answer nearest neighbor queries
- Building efficient auxiliary indices and data structures to support query filtering. For example, if you want to search over only a subset of the corpus, you should be able to leverage the existing search index without having to rebuild it
- Supporting efficient updates while keeping both the data and the search index fresh, consistent, and durable
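As a rough illustration of those responsibilities, here is a toy in-memory index of my own devising; it is not Pinecone’s implementation or API. It supports upserts and filtered queries, while a real vector database replaces the linear scan with an approximate nearest-neighbor (ANN) index and adds consistency, durability, and distribution:

```python
import numpy as np

class TinyVectorIndex:
    """Illustrative only: brute-force search plus metadata filtering
    and in-place updates, all in memory."""

    def __init__(self):
        self.vectors = {}   # id -> np.ndarray
        self.metadata = {}  # id -> dict

    def upsert(self, item_id, vector, metadata=None):
        # Insert-or-update keeps the index fresh without a rebuild.
        self.vectors[item_id] = np.asarray(vector, dtype=np.float32)
        self.metadata[item_id] = metadata or {}

    def query(self, vector, top_k=5, filter=None):
        q = np.asarray(vector, dtype=np.float32)
        results = []
        for item_id, v in self.vectors.items():
            # Filtering restricts the search to a subset of the corpus
            # without rebuilding the underlying index.
            meta = self.metadata[item_id]
            if filter and any(meta.get(k) != val for k, val in filter.items()):
                continue
            score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            results.append((score, item_id))
        return sorted(results, reverse=True)[:top_k]

idx = TinyVectorIndex()
idx.upsert("a", [1.0, 0.0], {"lang": "en"})
idx.upsert("b", [0.9, 0.1], {"lang": "de"})
print(idx.query([1.0, 0.0], top_k=1, filter={"lang": "en"}))  # [(1.0, 'a')]
```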
What are the different types of machine learning algorithms that are used at Pinecone?
We generally work on approximate nearest neighbor search algorithms, and we develop new algorithms for efficiently updating, querying, and otherwise dealing with large amounts of data in as cost-effective a manner as possible.
We also work on algorithms that combine dense and sparse retrieval for improved search relevance.
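One simple way to combine the two signals, sketched here as the general idea rather than Pinecone’s actual method, is a weighted sum of a dense similarity score and a sparse keyword-overlap score:

```python
import numpy as np

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.7):
    """Convex combination of dense (semantic) and sparse (keyword)
    similarity; alpha trades off the two signals. The sparse_* arguments
    are dicts mapping token -> weight (e.g., TF-IDF or BM25 weights)."""
    dense = float(dense_q @ dense_d /
                  (np.linalg.norm(dense_q) * np.linalg.norm(dense_d)))
    sparse = sum(w * sparse_d.get(tok, 0.0) for tok, w in sparse_q.items())
    return alpha * dense + (1 - alpha) * sparse

# Invented example values: dense embeddings plus sparse keyword weights.
q_dense, d_dense = np.array([1.0, 0.0]), np.array([0.8, 0.2])
q_sparse, d_sparse = {"pop": 1.0}, {"pop": 0.5, "music": 0.3}
print(hybrid_score(q_dense, d_dense, q_sparse, d_sparse))
```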
What are some of the challenges behind building scalable search?
While approximate nearest neighbor search has been researched for decades, we believe there is a lot left to be uncovered.
In particular, designing large-scale nearest neighbor search that is cost-effective, performing efficient filtering at scale, and designing algorithms that support high-volume updates and keep indexes fresh are all challenging problems today.
What are some of the different types of use cases that this technology can be used for?
The spectrum of use cases for vector databases is growing by the day. Apart from semantic search, we also see them being used in image search and retrieval, generative AI, security analytics, and more.
What is your vision for the future of search?
I think the future of search will be AI-driven, and I don’t think this is very far off. In that future, I expect vector databases to be a core primitive. We like to think of vector databases as the long-term memory (or the external knowledge base) of AI.
Thank you for the great interview. Readers who wish to learn more should visit Pinecone.