Dataset Cluster Analysis

This will be a bit of a different post, almost like a mini-paper, with an iPython notebook documenting my code and work along the way. I hope to make this the new standard format for my posts, so they are more helpful to me, my team, and others.


  • In our self-driving dataset, specifically the “direct” mode subset, where the car is lane following, there are several distinct “sub-behaviors” that occur.
    • Some example “sub-behaviors” include sharp turns to the left/right, slow forward movement, and fast forward movement.
  • My long-term goal is to create a Hierarchical Task Network (HTN)[1] with one network actuating the sub-behaviors (moving the car appropriately for a given sub-behavior) and a second network that chooses between these sub-behaviors.
    • The first stage of this project is to create a model that automatically classifies behaviors given the motor/steer output. This is what I explore here, using unsupervised classification techniques.


  • We start with a K-Means clustering approach to analyze the dataset; it is a computationally inexpensive clustering technique with which we can explore the data.
  • Once we’ve found the optimal hyperparameters through K-Means clustering, we can extend the approach to a Gaussian Mixture Model (GMM), which provides a probability value for each cluster rather than a binary assignment; we can utilize these probabilities later in the network.
  • The GMM will be described later; here I simply describe exploration using the K-Means clustering approach.
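To make the hard-vs.-soft assignment distinction concrete, here is a minimal sketch contrasting scikit-learn’s `KMeans` labels with a `GaussianMixture`’s per-cluster probabilities. It uses synthetic stand-in data (two made-up blobs of control samples), not the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for (steer, motor) samples: two synthetic blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[-40, 60], scale=5, size=(200, 2)),  # "turn left" blob
    rng.normal(loc=[40, 60], scale=5, size=(200, 2)),   # "turn right" blob
])

# K-Means gives each sample exactly one cluster label...
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_                  # shape (400,)

# ...while a GMM gives a probability per cluster per sample,
# which a higher-level network can consume directly.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_labels = gmm.predict_proba(X)            # shape (400, 2), rows sum to 1

print(hard_labels.shape, soft_labels.shape)
```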


Exploratory Analysis

Elbow Method

In the elbow method[2], one plots the percent of variance explained vs. the number of clusters for a dataset. The point where one sees an “elbow”, i.e. where the plot starts to become completely linear, is the natural number of clusters. Here is an example from Wikipedia:



import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from sklearn.cluster import KMeans
import numpy as np
from scipy.spatial.distance import cdist, pdist
import progressbar

a = pd.read_csv('out.csv')

# Calculate the total sum of squares (TSS), splitting the dataset into
# chunks of 10 rows so the pairwise distance matrix fits in memory.
# For a chunk g, sum(pdist(g)**2) == len(g) * (within-chunk sum of
# squared deviations), hence the division by the chunk size.
tss_bar = progressbar.ProgressBar()
tss = 0
for _, g in tss_bar(a.groupby(np.arange(len(a)) // 10)):
    tss += sum(pdist(g)**2) / len(g)
# Source Code adapted from

def elbow(df, n):
    cluster_bar = progressbar.ProgressBar()
    final_variance = []

    for k in cluster_bar(range(1, n + 1)):
        kMeansVar = KMeans(n_clusters=k).fit(df.values)
        centroids = kMeansVar.cluster_centers_
        k_euclid = cdist(df.values, centroids)
        dist = np.min(k_euclid, axis=1)
        wcss = sum(dist**2)               # within-cluster sum of squares
        bss = tss - wcss                  # between-cluster sum of squares
        final_variance.append(bss / tss)  # fraction of variance explained
    plt.plot(range(1, n + 1), final_variance)
    plt.xticks(range(1, n + 1))
    return final_variance

%matplotlib inline

# Load dataset
a = pd.read_csv('out.csv')
# Run Algorithm
final_variance = elbow(a, 20)



These results look pretty good: we can fairly clearly see an elbow-type shape in the graph, but it’s not immediately obvious where the elbow lies. After blowing the image up and studying it for some time, I concluded that the elbow appears at k=10.

elbow_k = 10  # the k at which the elbow appears
plt.figure(figsize=(15, 8))
plt.plot(range(1, 21), final_variance, '--b', zorder=-1)
# Plot every point in blue except the elbow, which is highlighted in green.
# Note final_variance[k - 1] holds the value for k clusters.
plt.scatter([x for x in range(1, 21) if x != elbow_k],
            [v for x, v in zip(range(1, 21), final_variance) if x != elbow_k],
            c='b')
plt.scatter(elbow_k, final_variance[elbow_k - 1], c='g', s=100)
plt.xlabel('# of Clusters')
plt.ylabel('Explained Variance')
plt.xticks(range(1, 21))


Zooming in closely, it’s apparent that the final point of linearization occurs where “k”, the number of clusters, is equal to 10. This is likely the natural number of clusters that fits our dataset.

Cluster Visualization

Now we can take this a step further with cluster visualization at this k, followed by a 2D histogram of the data we have available.

from sklearn import cluster
import pandas
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np

%matplotlib inline

# Load dataset
a = pandas.read_csv('out.csv')
# Initialize model
k = 10
kmeans = cluster.KMeans(n_clusters=k)

# Fit model to data
kmeans.fit(a.values)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print(len(labels))

# Visualize Model
fig, axes = plt.subplots(figsize=(15, 8))
axes.set_title('Motor vs. Steer {}-Means Cluster'.format(k))
for i in range(k):
    # select only data observations with cluster label == i
    cluster_points = np.where(labels == i)
    ds = a['steer'].values[cluster_points], a['motor'].values[cluster_points]
    # plot the data observations
    axes.plot(ds[0], ds[1], 'o', markersize=2)
    # plot the centroids
    lines = axes.plot(centroids[i, 0], centroids[i, 1], 'kx')
    # make the centroid x's bigger
    plt.setp(lines, markersize=15.0, markeredgewidth=2.0)


  • This is very interesting: although the split largely follows steering, there are more interesting clusters happening, especially in the middle section of this graph.
  • The question is how these clusters relate to the actual distribution of motor and steer in the dataset. To answer this, we will graph a histogram.

Motor vs. Steer Histogram

fig, axes = plt.subplots()
axes.set_title('Motor vs. Steer Histogram')
h = axes.hist2d(a['steer'],a['motor'], bins=50, norm=LogNorm())


  • Interestingly, there are three large dataless horizontal lines; these are most likely due to certain ranges that don’t register on the controller that we use to control the car.
  • Overall, the majority of the data is focused on straight driving, as this is the most common modality in standard lane-following + obstacle-avoidance courses. Although the majority of the data is here, we still have quite a lot of turning samples. The motor power stays roughly constant in the range below 75.
  • We notice that both this graph and the cluster visualization are symmetric; this is because we mirror every training datapoint to make sure there is no inherent left/right bias for the networks.
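The mirroring mentioned above can be sketched in a few lines of pandas. This sketch assumes steer is signed and centered at zero; if the dataset stores steer as an unsigned range, the reflection would instead be about the midpoint of that range. The column names follow the `out.csv` layout used earlier, and the sample values are made up:

```python
import pandas as pd

# Hypothetical mini-batch of (steer, motor) samples.
df = pd.DataFrame({'steer': [-30, 0, 45], 'motor': [60, 70, 55]})

# Mirror left/right: negate steering, keep motor power unchanged.
mirrored = df.copy()
mirrored['steer'] = -mirrored['steer']

# The augmented set is symmetric in steer by construction.
augmented = pd.concat([df, mirrored], ignore_index=True)
print(len(augmented))  # twice the original sample count
```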


  • Our dataset is fairly condensed into a certain range of motor and steering values, as is visible from the histogram.
  • From the cluster diagram we can see that the clustering is at least semi-interesting, especially in the center. These clusters should make for interesting “sub-behaviors” for a network to learn, as there is wide variation within these zones which the low-level HTN network can learn.
  • The modal info tensor should have 10 channels, one for each of the 10 clusters we found to be optimal for the dataset. This is about 3x bigger than the modal tensor I described in my MTL paper, but should yield more interesting results for the higher-level network to pick between.
  • We can map these steering and power zones to an image and overlay this image on top of the input image. This essentially becomes a “potential field” over the image, where we can see which areas have the highest probabilities.
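As a rough sketch of that potential-field idea (with made-up data and grid dimensions, not the real dataset or image size), each cell of an overlay grid in (steer, motor) space can be assigned the id of its nearest cluster; with a GMM, each cell would instead get a per-cluster probability:

```python
import numpy as np
from sklearn.cluster import KMeans

# Fake (steer, motor) samples standing in for out.csv.
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(500, 2))
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# A 48x64 overlay grid; each cell maps to one (steer, motor) coordinate.
steer_axis = np.linspace(0, 100, 64)
motor_axis = np.linspace(0, 100, 48)
ss, mm = np.meshgrid(steer_axis, motor_axis)
grid = np.column_stack([ss.ravel(), mm.ravel()])

# One cluster id per cell; this 2D array can be colormapped and
# alpha-blended over the input image as the overlay.
field = kmeans.predict(grid).reshape(48, 64)
print(field.shape)
```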

Future Work/Next Steps

  1. The next step is to use these 10 clusters, but to model them with Gaussian Mixture Models, so as to get a mixture of probabilities for each area in the field.
  2. Once the Gaussian Mixture Model is created, I can begin to train the low-level network on our dataset.
  3. After this, I will have to design the higher-level network and train it end-to-end with the pretrained low-level network.

The Gaussian Mixture Model isn’t as interesting as the other tasks, so my next post will most likely come once I’ve moved on to the low-level network design.

Written on October 7, 2017