We use the K-means algorithm to answer two questions regarding portfolio diversification. How diversified is a given portfolio? How can a diversified portfolio be constructed? Additionally, we use the multidimensional scaling (MDS) algorithm to visualize results.


  1. Take the last 120 days of adjusted close data.
  2. Z-score the data (subtract the mean and divide by the standard deviation).
  3. Apply the piecewise aggregate approximation (PAA).
  4. Apply K-means.
  5. Plot cluster centroids using MDS.

We arbitrarily choose to use 120 days of data. This parameter should be chosen to capture the stock behavior that you deem important. Additionally, we require a minimum average daily volume of 50,000 shares to eliminate thinly traded stocks.
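As a rough illustration of the first two steps and the volume filter, the sketch below uses NumPy. The function name `prepare_series` is hypothetical, and averaging volume over the same 120-day window is an assumption; the text does not say which period the volume average covers.

```python
import numpy as np

WINDOW = 120              # days of adjusted close data, as chosen in the text
MIN_AVG_VOLUME = 50_000   # minimum average daily volume, as chosen in the text

def prepare_series(adj_close, volume, window=WINDOW, min_avg_volume=MIN_AVG_VOLUME):
    """Return the z-scored last `window` closes, or None if the stock is filtered out."""
    adj_close = np.asarray(adj_close, dtype=float)
    volume = np.asarray(volume, dtype=float)
    if len(adj_close) < window or len(volume) < window:
        return None                      # not enough history
    recent_close = adj_close[-window:]
    # Assumption: the volume average is taken over the same window.
    if volume[-window:].mean() < min_avg_volume:
        return None                      # thinly traded
    std = recent_close.std()
    if std == 0:
        return None                      # a flat series cannot be z-scored
    return (recent_close - recent_close.mean()) / std
```

After z-scoring, every retained series has mean 0 and standard deviation 1, so the clustering compares the shape of price movements rather than price levels.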

The PAA algorithm takes a time series and slices it into P contiguous, non-overlapping pieces, each of length L. Each piece is then averaged, which reduces dimensionality and smooths the data. Here we use P=12 and L=10, so each 120-day series is reduced to 12 values. K-means uses the Euclidean distance measure, so it is advisable to keep the dimension low, ideally not much greater than ~10, because distances become harder to interpret in high dimensions. Stock market data is also noisy, so some smoothing of the raw data is beneficial.
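A minimal sketch of PAA as described above, assuming the series length is an exact multiple of the number of pieces (as with 120 = 12 x 10). The function name `paa` is ours.

```python
import numpy as np

def paa(series, n_pieces):
    """Average `series` over `n_pieces` contiguous, non-overlapping pieces."""
    series = np.asarray(series, dtype=float)
    if len(series) % n_pieces != 0:
        raise ValueError("series length must be divisible by n_pieces")
    # Each row of the reshaped array is one piece of length L = len(series) / n_pieces.
    return series.reshape(n_pieces, -1).mean(axis=1)
```

With P=12, a 120-day z-scored series becomes a 12-dimensional point, which is what K-means then clusters.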

If you wish to determine whether a given portfolio is sufficiently diversified, then the number of clusters should equal the number of assets in the portfolio. If you wish to construct a diversified portfolio, first decide how many assets it should hold according to your own criteria, then use that number as the number of clusters for K-means.
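To make the clustering step concrete, here is a bare-bones sketch of K-means (Lloyd's algorithm) in NumPy. It is not necessarily the implementation used for the results here; the random initialization, the iteration cap, and the function name `kmeans` are all assumptions. Each row of `points` is one stock's PAA vector, and `k` is the number of clusters chosen as described above.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, centroids) after running Lloyd's algorithm."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialise with k distinct rows picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                          # assignments have stabilised
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids
```

Because the result depends on the starting centroids, it is common to run K-means several times with different seeds and keep the run with the lowest total within-cluster distance.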

MDS takes data in D dimensions and projects it into fewer than D dimensions while trying to preserve the pairwise distances between points. We project the cluster centroids to 2 dimensions and plot them to facilitate visualization.
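The text does not say which MDS variant is used. As one possibility, the sketch below implements classical (Torgerson) MDS in NumPy, which embeds points using the top eigenvectors of the double-centered squared-distance matrix. The function name `classical_mds` is ours.

```python
import numpy as np

def classical_mds(points, n_components=2):
    """Embed `points` in `n_components` dimensions with classical (Torgerson) MDS."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Squared Euclidean distances between all pairs of points.
    diff = points[:, None, :] - points[None, :, :]
    sq_dists = (diff ** 2).sum(axis=2)
    # Double-centre the squared distances to recover a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)
```

Passing the K-means centroids to this function yields one 2-D coordinate per cluster, which can then be drawn as a scatter plot. Distances in the plot only approximate the original 12-dimensional distances, so nearby points should be read as a hint rather than a guarantee of similarity.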

For this project we use only stocks that are traded on the NYSE and NASDAQ exchanges. We exclude ETFs as a matter of convenience, to avoid redundant assets that could only be handled by manually creating a list of ETFs.


Cluster number assignments can be downloaded here: Cluster Number Assignments.

[Table: Number of Points in Cluster. Columns: Cluster Number, Number of Points in Cluster.]

As mentioned previously, there are two use cases for portfolio diversification. In the first case, where we want to assess a given portfolio, we examine how many of its holdings reside in the same cluster or in clusters that lie close together.
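One simple way to check the first case is to group a portfolio's holdings by cluster and report any cluster that contains more than one of them. This is only a sketch: the function name `shared_clusters`, the `cluster_of` mapping from ticker to cluster number, and the tickers in the usage note are hypothetical.

```python
from collections import defaultdict

def shared_clusters(portfolio, cluster_of):
    """Return {cluster: [tickers]} for clusters holding more than one portfolio asset."""
    groups = defaultdict(list)
    for ticker in portfolio:
        groups[cluster_of[ticker]].append(ticker)
    # Only clusters with two or more holdings indicate overlap.
    return {cluster: tickers for cluster, tickers in groups.items() if len(tickers) > 1}
```

For example, with `cluster_of = {"AAA": 0, "BBB": 0, "CCC": 1}`, the portfolio `["AAA", "BBB", "CCC"]` reports that AAA and BBB share cluster 0, which suggests those two holdings behaved similarly over the window.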

In the second case, where we want to construct a diversified portfolio, we would choose assets that reside in different clusters and be wary of clusters that lie close to one another.
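To flag clusters that are close to one another, one option is to list the centroid pairs whose Euclidean distance falls below a cutoff. The sketch below assumes NumPy; the function name `close_cluster_pairs` and the `threshold` parameter are hypothetical, and the cutoff value is a judgment call rather than something the method prescribes.

```python
import numpy as np

def close_cluster_pairs(centroids, threshold):
    """Return (i, j, distance) for every pair of centroids closer than `threshold`."""
    centroids = np.asarray(centroids, dtype=float)
    pairs = []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            dist = float(np.linalg.norm(centroids[i] - centroids[j]))
            if dist < threshold:
                pairs.append((i, j, dist))
    return pairs
```

Assets drawn from a flagged pair of clusters may offer less diversification than their different cluster numbers suggest.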

For both cases, there are no hard rules that can be used. However, K-means clustering does yield enough information to create some criteria for ensuring and measuring portfolio diversification.