Show HN: Vicinity – Fast, Lightweight Nearest Neighbors with Flexible Back Ends

github.com

54 points by Pringled 3 days ago

We’ve just open-sourced Vicinity, a lightweight approximate nearest neighbors (ANN) search package that makes it easy to experiment with and compare a large number of well-known ANN algorithms.

Main features:

- Lightweight: the base package depends only on NumPy

- Unified interface: use HNSW, Annoy, FAISS, and many more algorithms and libraries through a single interface (see the sketch after this list)

- Easy evaluation: evaluate the performance of your backend with a simple function to measure queries per second vs recall

- Serialization: save and load your index for persistence
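
To make the unified-interface point concrete, here is roughly what usage looks like (a minimal sketch based on the README; exact names may drift between releases):

  import numpy as np
  from vicinity import Vicinity, Backend

  # Toy data: 1,000 items with random 128-dimensional vectors.
  items = [f"item_{i}" for i in range(1000)]
  vectors = np.random.rand(1000, 128)

  # Build an index with the NumPy-only BASIC backend; switching algorithms
  # (e.g. Backend.HNSW, Backend.ANNOY) is just a different enum value.
  vicinity = Vicinity.from_vectors_and_items(
      vectors=vectors,
      items=items,
      backend_type=Backend.BASIC,
  )

  # Query the 10 nearest neighbors of a new vector.
  results = vicinity.query(np.random.rand(128), k=10)

  # Persist the index and load it back later.
  vicinity.save("my_vector_store")
  vicinity = Vicinity.load("my_vector_store")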

After working with a large number of ANN libraries over the years, we found it increasingly cumbersome to learn the interface, features, quirks, and limitations of each one. Having written custom evaluation code to compare speed and performance for what felt like the 100th time, we decided to build Vicinity: a simple, unified interface to a large number of algorithms and libraries that makes quick comparison and evaluation easy.

We are curious to hear your feedback! Are there any algorithms you use that we're missing? Any extra evaluation metrics that would be useful?

davnn 2 days ago

That's actually quite similar to the nearness library [1]. The main difference appears to be vicinity's focus on simplicity while nearness tries to expose most of the functionality of the underlying backends.

[1] https://github.com/davnn/nearness

bravura 3 days ago

This is great.

I think the next step might be to add some sugar that lets you run a random or fixed grid of hyperparameters and get a report of accuracy and speed for your specific dataset.

  • Pringled 2 days ago

    Thanks! This is actually something we have been experimenting with already (essentially auto-tuning on a specific dataset). It turned out to be quite complicated: a grid search over all the index and parameter combinations gets very costly on larger datasets. That's why we first opted for this approach, where you evaluate a chosen index and parameter set yourself, but auto-tuning is definitely still something we plan to do.
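
    For anyone who wants to roll their own in the meantime, the idea is roughly the loop below: build an index per (backend, parameters) combination and record recall against a brute-force baseline plus queries per second. The parameter names in the grid and the kwargs forwarding to the backend are assumptions for illustration, not our exact API:

      import time
      import numpy as np
      from vicinity import Vicinity, Backend

      rng = np.random.default_rng(42)
      vectors = rng.random((10_000, 64), dtype=np.float32)
      items = list(range(len(vectors)))
      queries = rng.random((100, 64), dtype=np.float32)
      k = 10

      # Ground truth: exact top-k neighbors by brute force (Euclidean distance).
      exact = [set(np.argsort(np.linalg.norm(vectors - q, axis=1))[:k]) for q in queries]

      # Hypothetical grid; assumes extra kwargs are forwarded to the backend.
      grid = [
          (Backend.HNSW, {"ef_construction": 100}),
          (Backend.HNSW, {"ef_construction": 200}),
          (Backend.ANNOY, {"n_trees": 50}),
      ]

      for backend, params in grid:
          index = Vicinity.from_vectors_and_items(
              vectors=vectors, items=items, backend_type=backend, **params
          )
          start = time.perf_counter()
          # Assumes query returns (item, score) pairs for a single query vector.
          approx = [{item for item, _ in index.query(q, k=k)} for q in queries]
          qps = len(queries) / (time.perf_counter() - start)
          recall = sum(len(a & e) for a, e in zip(approx, exact)) / (len(queries) * k)
          print(f"{backend} {params}: recall@{k}={recall:.3f}, qps={qps:.0f}")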

antman 2 days ago

What does it mean that insertion is only supported for a few of the indexes? Also will this allow hybrid search for the backends that support it?

  • Pringled 2 days ago

    Some backends/algorithms don't natively support dynamic inserts and require you to rebuild the index when you want to add vectors (Annoy and PyNNDescent are the only supported backends without dynamic inserts).
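
    Concretely, the difference looks something like this (the insert() call is an assumption about the shape of the dynamic-insert API; the second path is the rebuild you'd need for Annoy or PyNNDescent):

      import numpy as np
      from vicinity import Vicinity, Backend

      vectors = np.random.rand(100, 16)
      items = [f"item_{i}" for i in range(100)]
      new_vectors = np.random.rand(1, 16)

      # Backend with native dynamic inserts: add vectors in place.
      hnsw = Vicinity.from_vectors_and_items(
          vectors=vectors, items=items, backend_type=Backend.HNSW
      )
      hnsw.insert(["item_100"], new_vectors)  # assumed signature: insert(items, vectors)

      # Backend without dynamic inserts (e.g. Annoy): rebuild from the full data.
      annoy = Vicinity.from_vectors_and_items(
          vectors=np.vstack([vectors, new_vectors]),
          items=items + ["item_100"],
          backend_type=Backend.ANNOY,
      )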

    Hybrid search is a really cool idea though; it's not something we support at the moment, but it's definitely something we could investigate and add as an upcoming feature. Thanks for the suggestion!

aravindputrevu 3 days ago

Some questions:

1. When you say backends, do you plan to integrate with some "vector" stores as a client?

2. Also, any benchmarks?

3. Lastly, why Python?

  • Pringled 2 days ago

    1: That could be something for the future, but at the moment this is just meant as a way to quickly try out and evaluate various algorithms and libraries (we call those backends) without having to learn each one's syntax.

    2: We adopted the same methodology as ann-benchmarks for our evaluation, so technically the benchmarks there are valid for the backends we support. However, it's a good suggestion to add those explicitly to the repo; I'll add a todo for that.

    3: Mainly because (a) it's the language we are most comfortable developing in, (b) it's the most widely used and adopted language for ML, and (c) (almost) all the algorithms we support are already written in C/C++/Cython.