1. Introduction
We use machine learning algorithms to match current input data with historical data. Because the prices that followed each historical data point are known, we use these matched future prices to form percent returns, z-score returns (in standard deviation units), and up/down percents. Returns are then divided into percentiles, yielding statistical ranges of possible returns for current assets. Price projections are derived from the percent return projections.
2. Input And Output Data
We use a formation period of 252 days, divided into lookback lengths of 21, 42, …, 231, 252 days as indicated below. The prices for these days are used to create the following inputs.
- Single Moving Average (SMA) – (current price – N day average)/(N day average), where N = 21, 42, etc. as noted above, formed into an array. Additionally, we convert this array into binary form, mapping values > 0 to A and values <= 0 to B, and merge the resulting letters into a word.
- Dual Moving Average (DMA) – Same as SMA with 21 day average substituted for the current price.
- Bollinger Band (BB) – Same as SMA with the denominator replaced with the N day standard deviation.
- Percent Return (PR) – 100*(current price – N day price)/(N day price) where N is the same as used for the SMA input. We use the same procedure to create a word.
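The four inputs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production code: the function names are hypothetical, the N day window is assumed to include the current day, "N day price" is read as the price N trading days ago, and the population standard deviation is assumed for the Bollinger Band input.

```python
from statistics import mean, pstdev

LOOKBACKS = list(range(21, 253, 21))  # 21, 42, ..., 231, 252


def sma_input(prices, n):
    """(current price - N day average) / N day average."""
    avg = mean(prices[-n:])  # assumption: window includes the current day
    return (prices[-1] - avg) / avg


def dma_input(prices, n, fast=21):
    """Same as sma_input, with the 21 day average in place of the current price."""
    avg = mean(prices[-n:])
    return (mean(prices[-fast:]) - avg) / avg


def bb_input(prices, n):
    """Same as sma_input, with the N day standard deviation as the denominator."""
    window = prices[-n:]
    return (prices[-1] - mean(window)) / pstdev(window)  # assumption: population std


def pr_input(prices, n):
    """100 * (current price - price N days ago) / (price N days ago)."""
    past = prices[-1 - n]  # assumption: "N day price" = price N trading days ago
    return 100.0 * (prices[-1] - past) / past


def to_word(values):
    """Map each value to 'A' if > 0, else 'B', and merge the letters into a word."""
    return "".join("A" if v > 0 else "B" for v in values)


# Synthetic, steadily rising prices (253 closes so the 252 day return is defined).
prices = [100.0 + i for i in range(253)]
sma_array = [sma_input(prices, n) for n in LOOKBACKS]
sma_word = to_word(sma_array)
```

Under these definitions the DMA value for N = 21 is always zero, so the first letter of a DMA word would always be B.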
Outputs are projections of percent returns and prices for the SMA, DMA, and PR real array inputs, and z-score returns for the BB real array inputs. For all input data, including words, we compute up/down percents. Projections are computed for 10 percentiles at 21, 63, 126, 189, and 252 days forward (one market month and all four market quarters).
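As a rough sketch of how return percentiles turn into price projections: given the forward percent returns of the matched historical points, take percentiles of those returns and apply each one to the current price. The text does not say which 10 percentile levels are reported, so the levels 5, 15, …, 95 below are purely an illustrative assumption, as are the function names and the linear-interpolation percentile rule.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in 0..100) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


# Assumption: ten evenly spaced levels; the actual levels are not stated.
DEFAULT_LEVELS = (5, 15, 25, 35, 45, 55, 65, 75, 85, 95)


def project(current_price, matched_returns_pct, levels=DEFAULT_LEVELS):
    """Return {level: (percent return, projected price)} for one horizon."""
    ordered = sorted(matched_returns_pct)
    out = {}
    for p in levels:
        r = percentile(ordered, p)
        # Price projection derived from the percent return projection.
        out[p] = (r, current_price * (1.0 + r / 100.0))
    return out
```

The same call would be repeated once per horizon (21, 63, 126, 189, and 252 days forward), each time with the matched returns measured over that horizon.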
3. Calibration Data
For the calibration data, the formation days data begins in 2017 so that the first projections are in 2018. The data consists of stocks and ETFs subject to a minimum close price of $1.00 and a minimum volume of 1,000 shares. Words are computed from all available data. Real arrays use subsampled data consisting of 50,000 data points per data year, totaling over 250,000 data points.
4. Exact Pattern Matching and Machine Learning Algorithms
At the end of each trading day, arrays and words are computed as described above.
Exact pattern matching is a string match of words. The outputs are up/down percents for the days forward indicated above. Additionally, we present a word count, which is the number of data points for a given word in the calibration data. Word counts indicate how prevalent a particular pattern is in the data.
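A minimal sketch of exact word matching, assuming the calibration data is available as (word, forward percent return) pairs for one horizon. The function name and data layout are hypothetical, and counting a zero return as "down" is an assumption that mirrors the > 0 / <= 0 convention used to build the words.

```python
from collections import defaultdict


def word_table(history):
    """Build {word: {"count", "up_pct", "down_pct"}} from calibration pairs.

    history: iterable of (word, forward_return_pct) for a single horizon.
    """
    buckets = defaultdict(list)
    for word, fwd in history:
        buckets[word].append(fwd)

    table = {}
    for word, rets in buckets.items():
        ups = sum(1 for r in rets if r > 0)  # assumption: zero counts as down
        table[word] = {
            "count": len(rets),  # word count: prevalence of this pattern
            "up_pct": 100.0 * ups / len(rets),
            "down_pct": 100.0 * (len(rets) - ups) / len(rets),
        }
    return table
```

Matching a current word is then a plain dictionary lookup, for example `table.get(current_word)`, which returns `None` when the pattern never occurred in the calibration data.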
Array matching is done with the k nearest neighbors algorithm, using k = 125 and the Euclidean distance metric. Outliers are also handled, and raw data is scaled appropriately. Outputs are presented as return percentiles and price percentiles (derived from percent returns), up/down percents, median positive and negative returns, and expectancy, defined as (up fraction)*(median positive return) – (down fraction)*abs(median negative return). Outliers are also noted. They are identified by taking the maximum computed distance for each current data point, computing the mean and standard deviation of these maxima across all current data points, and marking points more than two standard deviations from the mean as outliers.
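The matching, expectancy, and outlier steps can be sketched as follows. This is a brute-force illustration with hypothetical function names, not the production implementation: scaling is omitted because the scaling method is not specified, zero returns are grouped with the down side, and only unusually large maximum distances are flagged, which is one reading of "two standard deviations from the mean".

```python
import math
from statistics import mean, median, pstdev


def knn(query, calib_arrays, k):
    """Return (distance, index) pairs for the k calibration arrays nearest to query."""
    dists = [(math.dist(query, row), i) for i, row in enumerate(calib_arrays)]
    return sorted(dists)[:k]


def expectancy(returns):
    """(up fraction)*(median positive return) - (down fraction)*abs(median negative return)."""
    ups = [r for r in returns if r > 0]
    downs = [r for r in returns if r <= 0]  # assumption: zero counts as down
    up_frac = len(ups) / len(returns)
    med_up = median(ups) if ups else 0.0
    med_down = median(downs) if downs else 0.0
    return up_frac * med_up - (1.0 - up_frac) * abs(med_down)


def flag_outliers(max_distances, n_std=2.0):
    """Flag points whose largest neighbor distance exceeds mean + n_std * std.

    Assumption: one-sided test, since a far-away match is the case of interest.
    """
    mu, sigma = mean(max_distances), pstdev(max_distances)
    return [d > mu + n_std * sigma for d in max_distances]
```

In use, `knn` would be called with k = 125 for each current data point, the forward returns of the returned indices would feed the percentile and `expectancy` calculations, and the last (largest) distance of each neighbor list would be collected into `max_distances`.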
5. K-Means Clusters
K-Means clusters are computed from the input arrays for current data only. Using 100 points per cluster, the number of clusters = number of data points/points per cluster. Assigned cluster numbers for each input data type are in ‘K-Means Cluster Number.csv’.
The K-Means algorithm computes K points in the data space, called centroids, such that the sum of squared Euclidean distances between each input data point and its nearest centroid is minimized. We compute a distance matrix of the centroid locations so that nearby and distant clusters can be identified. The distance matrix is presented as CSV files and interactive HTML files for the various input data types.
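For illustration, the clustering and the centroid distance matrix can be sketched with a plain Lloyd's iteration. This is a teaching sketch only: the function names, the random initialization, the fixed iteration count, and the use of floor division for the cluster count are all assumptions, and a production run would more likely rely on a library implementation.

```python
import math
import random


def n_clusters(n_points, points_per_cluster=100):
    """number of clusters = number of data points / points per cluster."""
    return max(1, n_points // points_per_cluster)  # assumption: floor division


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def centroid_distance_matrix(centroids):
    """Symmetric matrix of pairwise Euclidean distances between centroids."""
    return [[math.dist(a, b) for b in centroids] for a in centroids]
```

Small entries in the resulting matrix identify clusters whose centroids lie close together, and large entries identify distant ones.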
6. Percent Returns
Percent returns for year, quarter, and month to date as well as for 21, 42, …, 231, 252 days are computed and presented in ‘Percent Returns.csv’.
7. Results
All results are in ‘Historical Data Pattern Matching And Projections YYYY-MM-DD.zip’, which contains the following files:
Bollinger Band K-Means Centroids Distance Matrix.csv
Bollinger Band K-Means Centroids Distance Matrix.html
Bollinger Band KNN Projections 126 Days.csv
Bollinger Band KNN Projections 189 Days.csv
Bollinger Band KNN Projections 21 Days.csv
Bollinger Band KNN Projections 252 Days.csv
Bollinger Band KNN Projections 63 Days.csv
Bollinger Band Word Letters Up Down Percents.csv
DMA K-Means Centroids Distance Matrix.csv
DMA K-Means Centroids Distance Matrix.html
DMA KNN Projections 126 Days.csv
DMA KNN Projections 189 Days.csv
DMA KNN Projections 21 Days.csv
DMA KNN Projections 252 Days.csv
DMA KNN Projections 63 Days.csv
DMA Word Letters Up Down Percents.csv
K-Means Cluster Number.csv
Percent Return K-Means Centroids Distance Matrix.csv
Percent Return K-Means Centroids Distance Matrix.html
Percent Return KNN Projections 126 Days.csv
Percent Return KNN Projections 189 Days.csv
Percent Return KNN Projections 21 Days.csv
Percent Return KNN Projections 252 Days.csv
Percent Return KNN Projections 63 Days.csv
Percent Return Word Letters Up Down Percents.csv
Percent Returns.csv
SMA K-Means Centroids Distance Matrix.csv
SMA K-Means Centroids Distance Matrix.html
SMA KNN Projections 126 Days.csv
SMA KNN Projections 189 Days.csv
SMA KNN Projections 21 Days.csv
SMA KNN Projections 252 Days.csv
SMA KNN Projections 63 Days.csv
SMA Word Letters Up Down Percents.csv
A free data sample is available at Quantitative And Machine Learning Asset Analysis – Free Data.
See Quantitative And Machine Learning Asset Analysis to subscribe to daily updated data.