1. Faiss
  2. MNIST KNN
  3. Faiss Gotchas

1. Faiss

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

In this article, we will show how to use Faiss, on a GPU, for the KNN (K nearest neighbor) algorithm for a classification problem. While this is a small problem and the speedup factor versus Scikit-learn‘s CPU version is small, we want to show how to properly use Faiss and avoid some subtle errors.

Below are links to the Faiss documentation:

2. MNIST KNN

We modified the MNIST handwritten digits data set by passing the 2d data thru a CNN autoencoder to transform the 2d data into 1d vectors of length 20. This was not done for this specific project, as we could have used the flat 784 length vectors of the original data. The length 20 vectors were created for other projects, so we decided to use them here.

Calculations were performed on an AWS EC2 g3.4xlarge instance: 16 CPUs, NVIDIA Tesla M60 GPU,

The Faiss GPU implementation of the KNN algorithm:

import numpy as np
import faiss
from sklearn.metrics import classification_report

gpu_resource = faiss.StandardGpuResources()  # use a single GPU
cpu_index = faiss.IndexFlatL2(num_dimensions)  # create a CPU index
gpu_index = faiss.index_cpu_to_gpu(gpu_resource, 0, cpu_index)  # transfer the index to GPU
gpu_index.add(x_calib)  # add vectors to the index
_, array_knn_indices_gpu = gpu_index.search(x_prod, num_neighbors)  # _ = distances
   
# check dimensions
assert(array_knn_indices_gpu.shape[0] == num_prod_points)
assert(array_knn_indices_gpu.shape[1] == num_neighbors)

# get predicted y values for production data using majority voting
y_production_predicted_faiss_gpu = np.full(num_prod_points, -1)
for i in range(num_prod_points):
    array_calib_classes = np.take(y_calib, array_knn_indices_gpu[i], axis=0)
    array_unique, array_counts = np.unique(array_calib_classes, return_counts=True)
    y_production_predicted_faiss_gpu[i] = array_unique[np.argmax(array_counts)]
            
# evaluate predictions
print('\n\n *** classification_report *** \n')
print(classification_report(y_prod, y_production_predicted_faiss_gpu, 
                            labels=class_names_list, digits=3))

elapsed time = 0.003002 minutes
accuracy = 0.978

The Scikit-learn CPU version (nearest neighbors, ball tree):

import numpy as np
from sklearn.metrics import classification_report
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=neighbors, algorithm='ball_tree',
                      metric='minkowski', p=2, n_jobs=-1).fit(xcalib)
array_nn_indices_balltree = nn.kneighbors(X=xprod, return_distance=False)

# check dimensions
assert(array_nn_indices_balltree.shape[0] == num_prod_pts)
assert(array_nn_indices_balltree.shape[1] == neighbors)

# get predicted y values for production data using majority voting
y_prod_predicted_balltree = np.full(num_prod_pts, -1)
for i in range(num_prod_pts):
    array_calib_classes = np.take(ycalib, array_nn_indices_balltree[i], axis=0)
    array_unique, array_counts = np.unique(array_calib_classes, return_counts=True)
    y_prod_predicted_balltree[i] = array_unique[np.argmax(array_counts)]
            
# evaluate predictions
print('\n\n *** sklearn ball tree classification_report *** \n')
print(classification_report(yprod, y_prod_predicted_balltree, 
                            labels=list_class_names, digits=3))

elapsed time = 0.158966 minutes
accuracy = 0.978

While we do not show it here, the confusion matrices are identical.

3. Faiss Gotchas

1. When using Numpy arrays as input, they must be set to dtype = ‘float32’. If you are reading a saved Numpy array from a file:

npy_array = np.load(directory + 'npy_array.npy', allow_pickle=True).astype('float32')

2. Faiss assumes that Numpy arrays are contiguous. Use:

if not npy_array.flags['C_CONTIGUOUS']:
    ny_array = np.ascontiguousarray(npy_array)