1. Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.
In this article, we will show how to use Faiss, on a GPU, for the KNN (K nearest neighbor) algorithm for a classification problem. While this is a small problem and the speedup factor versus Scikit-learn‘s CPU version is small, we want to show how to properly use Faiss and avoid some subtle errors.
Below are links to the Faiss documentation:
2. MNIST KNN
We modified the MNIST handwritten digits data set by passing the 2d data thru a CNN autoencoder to transform the 2d data into 1d vectors of length 20. This was not done for this specific project, as we could have used the flat 784 length vectors of the original data. The length 20 vectors were created for other projects, so we decided to use them here.
Calculations were performed on an AWS EC2 g3.4xlarge instance: 16 CPUs, NVIDIA Tesla M60 GPU,
The Faiss GPU implementation of the KNN algorithm:
import numpy as np import faiss from sklearn.metrics import classification_report gpu_resource = faiss.StandardGpuResources() # use a single GPU cpu_index = faiss.IndexFlatL2(num_dimensions) # create a CPU index gpu_index = faiss.index_cpu_to_gpu(gpu_resource, 0, cpu_index) # transfer the index to GPU gpu_index.add(x_calib) # add vectors to the index _, array_knn_indices_gpu = gpu_index.search(x_prod, num_neighbors) # _ = distances # check dimensions assert(array_knn_indices_gpu.shape[0] == num_prod_points) assert(array_knn_indices_gpu.shape[1] == num_neighbors) # get predicted y values for production data using majority voting y_production_predicted_faiss_gpu = np.full(num_prod_points, -1) for i in range(num_prod_points): array_calib_classes = np.take(y_calib, array_knn_indices_gpu[i], axis=0) array_unique, array_counts = np.unique(array_calib_classes, return_counts=True) y_production_predicted_faiss_gpu[i] = array_unique[np.argmax(array_counts)] # evaluate predictions print('\n\n *** classification_report *** \n') print(classification_report(y_prod, y_production_predicted_faiss_gpu, labels=class_names_list, digits=3))
elapsed time = 0.003002 minutes
accuracy = 0.978
The Scikit-learn CPU version (nearest neighbors, ball tree):
import numpy as np from sklearn.metrics import classification_report from sklearn.neighbors import NearestNeighbors nn = NearestNeighbors(n_neighbors=neighbors, algorithm='ball_tree', metric='minkowski', p=2, n_jobs=-1).fit(xcalib) array_nn_indices_balltree = nn.kneighbors(X=xprod, return_distance=False) # check dimensions assert(array_nn_indices_balltree.shape[0] == num_prod_pts) assert(array_nn_indices_balltree.shape[1] == neighbors) # get predicted y values for production data using majority voting y_prod_predicted_balltree = np.full(num_prod_pts, -1) for i in range(num_prod_pts): array_calib_classes = np.take(ycalib, array_nn_indices_balltree[i], axis=0) array_unique, array_counts = np.unique(array_calib_classes, return_counts=True) y_prod_predicted_balltree[i] = array_unique[np.argmax(array_counts)] # evaluate predictions print('\n\n *** sklearn ball tree classification_report *** \n') print(classification_report(yprod, y_prod_predicted_balltree, labels=list_class_names, digits=3))
elapsed time = 0.158966 minutes
accuracy = 0.978
While we do not show it here, the confusion matrices are identical.
3. Faiss Gotchas
1. When using Numpy arrays as input, they must be set to dtype = ‘float32’. If you are reading a saved Numpy array from a file:
npy_array = np.load(directory + 'npy_array.npy', allow_pickle=True).astype('float32')
2. Faiss assumes that Numpy arrays are contiguous. Use:
if not npy_array.flags['C_CONTIGUOUS']: ny_array = np.ascontiguousarray(npy_array)