  1. Finding Similar Stocks
  2. Results
  3. Code

1. Finding Similar Stocks

There are many ways to find stocks with similar behavior, depending on how one defines similarity and which data are used. In this article we use a 12-period channel where the value for each period is (current adjusted close price – minimum value)/(maximum value – minimum value). The maximum and minimum values are computed over the adjusted close prices for the past 21 trading days (representing a trading month), then 42 days, …, up to 252 days. This normalizes the channel so that all values lie in the interval [0, 1]. We use Euclidean distance as our measure of similarity: the smaller the distance between two channels, the more similar the stocks.
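The channel described above can be sketched as follows. This is a minimal illustration rather than the code used for the article: the function name `channel_features`, the choice to include the current day in each window, the value 0.5 for a flat window, and the synthetic prices are all assumptions.

```python
import numpy as np

def channel_features(adj_close, step=21, n_periods=12):
    """Position of the latest price inside each lookback window's min-max range."""
    adj_close = np.asarray(adj_close, dtype="float64")
    current = adj_close[-1]
    features = np.empty(n_periods, dtype="float32")
    for k in range(1, n_periods + 1):
        # Last 21, 42, ..., 252 trading days (assumption: window includes today)
        window = adj_close[-k * step:]
        low, high = window.min(), window.max()
        # Assumption: a flat window has no range, so place it at the mid-point
        features[k - 1] = 0.5 if high == low else (current - low) / (high - low)
    return features

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300)))
channel = channel_features(prices)
print(channel.shape)
```

Stacking one such 12-value vector per stock gives the matrix that is searched for nearest neighbors.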

After transforming our data into normalized channels, our task then becomes finding the K nearest neighbors. We will use the Faiss Python library.
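To make the task concrete, here is a brute-force NumPy sketch of what a K nearest neighbor search under Euclidean distance computes. It is not how Faiss is implemented; the helper name `knn_euclidean` and the toy data are hypothetical, and the approach only scales to small inputs because it builds the full pairwise distance matrix.

```python
import numpy as np

def knn_euclidean(x, k):
    """Indices of the k nearest rows (excluding the row itself) for every row of x."""
    # Pairwise squared Euclidean distances via broadcasting: shape (n, n)
    diff = x[:, None, :] - x[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    # Squared distances give the same ordering as true distances
    return np.argsort(dist2, axis=1)[:, :k]

toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]], dtype="float32")
nn_idx = knn_euclidean(toy, k=2)
print(nn_idx)
```

In the toy data, the first two points sit close together, as do the last two, so each point's nearest neighbor is its partner.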

2. Results

Our stock basket consists of the constituents of the Russell 3000 index. However, we used a version of the index from last year, so there are differences between our basket and the current Russell 3000 index. Our last data date is 2020-11-11, and the channel extends backwards for 252 trading days. We find the 10 nearest neighbors for each stock.

Below we show the first 5 and last 5 stocks when sorted alphabetically by symbol, each with its 10 nearest neighbors.

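Symbol | NN_1 | NN_2 | NN_3 | NN_4 | NN_5 | NN_6 | NN_7 | NN_8 | NN_9 | NN_10
A      | PCAR | MTN  | CLCT | MTD  | PDCO | CROX | CLFD | AMH  | EYE  | FIZZ
AA     | MLR  | RYI  | ENVA | CPRI | FGBI | CDXS | VNDA | FMBH | FSBW | MVBF
AAL    | RUBY | ZGNX | MRC  | UTL  | AJRD | DS   | THS  | TWIN | FCN  | TDC
AAN    | COHU | ONTO | NSIT | CBSH | IT   | GKOS | PI   | HTH  | RFL  | OZK
AAOI   | NBIX | LRN  | CLXT | VYGR | PING | CTXS | AXDX | ADMA | CFMS | LIVX
ZTS    | FND  | EPAM | TPX  | TOL  | CCS  | MDC  | NVDA | MITK | MCD  | ESCA
ZUMZ   | TUP  | BL   | CLCT | RICK | CLFD | MTD  | BKE  | HCA  | SYK  | XLNX
ZUO    | PRDO | CACC | CNSL | DBX  | VBIV | QLYS | VAPO | NUVA | ZYXI | BLKB
ZYNE   | ADVM | UTMD | IVR  | BCEL | CALA | FULC | SIEB | NINE | HP   | LQDA
ZYXI   | INO  | CNSL | KALA | VBIV | CACC | LIVX | ZUO  | PRDO | GTHX | NAT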

The entire spreadsheet (zip file) can be accessed here.

For such a small problem, Scikit-learn’s CPU-based nearest neighbor implementation will suffice. What we wanted to show here is a template that you can use for your own data when the dataset is large enough to require the computational speed available via GPU computing. For example, if you are performing similarity search on intraday data, the dataset can become large enough to make GPU computing necessary.
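For reference, a CPU-only version using Scikit-learn might look like the sketch below. The random matrix `channels` is a synthetic stand-in for the real channel data, and the variable names are assumptions; only `NearestNeighbors` and its `kneighbors` method come from the library.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the (stocks x 12) channel matrix
rng = np.random.default_rng(0)
channels = rng.random((200, 12)).astype("float32")

k = 10
# Ask for k + 1 neighbors because each query point is also in the fitted data
model = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(channels)
distances, indices = model.kneighbors(channels)

# Drop the first column, which is normally the query point itself
neighbor_idx = indices[:, 1:]
print(neighbor_idx.shape)
```

As with the Faiss template, dropping the first column assumes no two stocks have identical channels. With exact duplicates, the query point is not guaranteed to come first.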

When coding your own similarity search using the code below as a template, there are two issues that you should note:

  1. See Faiss Gotchas for how to configure Numpy arrays for input into Faiss.
  2. See MetricType and distances for distance types supported by Faiss.
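As a small illustration of the first point, the sketch below converts an array that is neither float32 nor C-contiguous into the layout Faiss expects. The array `raw` is made up for the example. On the second point, note that `IndexFlatL2` reports squared Euclidean distances. This does not change the neighbor ordering, but take the square root if you need true distances.

```python
import numpy as np

# Deliberately awkward input: float64 values stored in column-major (Fortran) order
raw = np.asfortranarray(np.arange(6, dtype="float64").reshape(3, 2))
print(raw.dtype, raw.flags["C_CONTIGUOUS"])

# ascontiguousarray fixes dtype and memory layout in one call,
# copying only when the input does not already satisfy both
matrix = np.ascontiguousarray(raw, dtype=np.float32)
print(matrix.dtype, matrix.flags["C_CONTIGUOUS"])
```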

3. Code

import pandas as pd
import numpy as np
import faiss

if __name__ == '__main__':
    base_dir = 'YOUR_DIRECTORY/'  # placeholder: replace with the folder holding the pickle file (keep the trailing slash)
    
    df = pd.read_pickle(base_dir + 'df_channel_21_252.pkl')
    
    list_symbols = df.index.to_list()
    array_channel = df.values.astype('float32')
    if not array_channel.flags['C_CONTIGUOUS']:
        array_channel = np.ascontiguousarray(array_channel)
    
    neighbors = 10 + 1  # 10 neighbors plus the query point itself
    dimension = array_channel.shape[1]
    num_pts = array_channel.shape[0]
    faiss_index = faiss.IndexFlatL2(dimension)
    # https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs
    res = faiss.StandardGpuResources()  # use a single GPU
    gpu_ind_flat = faiss.index_cpu_to_gpu(res, 0, faiss_index)
    gpu_ind_flat.add(array_channel)  # add vectors to the index
    _, array_nn_indices = gpu_ind_flat.search(array_channel, neighbors)  # _ = distances
    
    
    # check dimensions: one row per stock, one column per requested neighbor
    assert array_nn_indices.shape == (num_pts, neighbors)
    
    
    list_nn = []
    for i in range(num_pts):
        array_nn = np.take(list_symbols, array_nn_indices[i], axis=0)
        list_nn.append(array_nn[1:])  # exclude self (assumes the query itself is returned first)
        
    # column names NN_1 ... NN_10
    header = ['NN_' + str(i + 1) for i in range(neighbors - 1)]
        
    df_results = pd.DataFrame(index=df.index, data=list_nn, columns=header)