New Code Base

I’ve been away from the repository we use at the lab for a few months now, and in that time several large modifications have been made to our training system so that it produces reproducible, consistent results and scales to running many different kinds of research experiments. However, all these improvements are useless if I don’t know how to use them! I decided to go through the current code base and publish a detailed guide on its usage so that my coworkers and I can continue to train networks efficiently.

Parameter System

One of the largest modifications made to our training system, and the one I’ll be focusing on in this post, is the configuration system that was introduced to replace the existing Parameters system I had implemented. Previously, I used Python’s argparse module, which required a string of command line arguments for each network trained, defined in a separate Parameters file. Training a network looked something like this:

python Train.py --gpu 0 --epoch 100 --batch-size 100 --verbose --ignore left play racing --resume save/epoch5.torch
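
For context, a rough sketch of what the old argparse-based Parameters setup might have looked like is below. The flag names mirror the command above, but the file layout and defaults are my assumptions rather than the exact original:

# Parameters.py (hypothetical reconstruction of the old argparse setup)
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Train a driving network.')
    parser.add_argument('--gpu', type=int, default=0, help='CUDA GPU ID to train on')
    parser.add_argument('--epoch', type=int, default=100, help='number of epochs to train')
    parser.add_argument('--batch-size', type=int, default=100, help='data moments per batch')
    parser.add_argument('--verbose', action='store_true', help='enable debug-level logging')
    parser.add_argument('--ignore', nargs='+', default=[], help='skip runs carrying these labels')
    parser.add_argument('--resume', type=str, default=None, help='checkpoint file to resume from')
    return parser.parse_args()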

If the data handling was to be changed or a new/different network was to be used, it required creating a new branch/version of the repository for that network and data handling strategy and directly modifying the code. While this worked for a small set of networks, it clearly did not scale to training many different networks across many experiments.

Configuration Files

In the new system this is scrapped almost completely in favor of a JSON-style configuration system. A couple of sample config files look like this:

Parent Configuration File Ex.
{
  "parent_config": null,
  "logging": {
    "verbose": false
  },

  "training": {
    "start_epoch": 0,
    "num_epochs": 100,
    "learning_rate": null,
    "rand_seed": 123123123,
    "p_exclude_run": 0.9,
    "dataset": {
      "path": "/hostroot/home/ehou/trainingAll/training/data/train/",
      "p_subsample": 0.005,
      "batch_size": 75,
      "shuffle": true,
      "train_ratio": 1.0
    }
  },

  "validation": {
    "rand_seed": 123123123,
    "shuffle": true,
    "dataset": {
      "path": "/hostroot/home/ehou/trainingAll/training/data/val/",
      "batch_size": 50,
      "shuffle": true,
      "train_ratio": 0.975
    }
  },

  "hardware": {
    "use_gpu": true,
    "gpu": 0
  },

  "model": {
    "past_frames": 6,
    "future_frames": 20,
    "frame_stride": 10,
    "save_path": "./save/nonsqueeze_0.1/",
    "resume_path": "./save/nonsqueeze_0.1/"
  },

  "dataset": {
    "path": "/hostroot/data/dataset/bair_car_data_Main_Dataset/",
    "train_path": null,
    "ignore_labels": ["reject_run", "left", "out1_in2", "play", "Smyth", "racing"],
    "train_ratio": 0.8,
    "val_ratio": 0.2,
    "include_labels": ["campus", "local", "Tilden"],
    "use_aruco": false
  }
}
Daughter Configuration File Ex.
{
  "parent_config": "eric/nonsqueeze/default.json",
  "logging": {
    "path": "nvidia_training.log",
    "training_loss": "./logs/nonsqueeze_0.1/nvidia_trainloss.log",
    "validation_loss": "./logs/nonsqueeze_0.1/nvidia_valloss.log"
  },

  "model": {
    "name": "nvidia",
    "py_path": "nets.eric.nonsqueeze.Nvidia",
    "separate_frames": true,
    "metadata_shape": [6, 8, 23, 41]
  }
}
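
To make the inheritance concrete, here is a minimal sketch of how a daughter config could be resolved against its parent_config. The function names and the recursive merge behavior are my assumptions about how the system behaves, not the repository’s actual implementation:

import json
import os

def deep_merge(parent, child):
    """Recursively overlay the child's values on top of the parent's."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_config(path, config_root='.'):
    """Load a config file, resolving its parent_config chain first."""
    with open(os.path.join(config_root, path)) as f:
        config = json.load(f)
    parent_path = config.get('parent_config')
    if parent_path is None:
        return config
    return deep_merge(load_config(parent_path, config_root), config)

# Loading the daughter file above would inherit everything from
# eric/nonsqueeze/default.json and override only its logging and model sections.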
JSON Key/Value Table
parent_config: Parent config file whose configuration gets inherited. Useful for defining several variations of a network to be trained under the same conditions in an experiment.
logging/verbose: Level of logging to use. When true, the logs include debug-level information; when false, they contain everything except debug messages.
training/start_epoch: The starting epoch, traditionally set to 0; if resuming from a saved file this can be changed.
training/num_epochs: Number of epochs to train before halting.
training/learning_rate: Optional argument used to change the default learning rate of the Adadelta optimizer.
training/rand_seed: Important argument that decides which data will be dropped/kept; it should be held constant between trials of the same experiment for deterministic results (see the sketch after this table).
training/p_exclude_run: Fraction of runs to discard (0.9 = discard 90% of the runs).
training/dataset/path: Path to the dataset; should point to a directory of directories, each containing several subfolders representing different segments.
training/dataset/p_subsample: Probability of keeping any specific data moment (0.005 = 0.5% probability of being kept).
training/dataset/batch_size: Number of data moments in a batch.
training/dataset/shuffle: Whether to shuffle data moments within an epoch (true/false).
training/dataset/train_ratio: Fraction of the dataset to be isolated as training data (if using separate train/val datasets keep at 1.0).
validation/rand_seed: Important argument that decides which data will be dropped/kept; it should be held constant between trials of the same experiment for deterministic results.
validation/dataset/path: Path to the dataset; should point to a directory of directories, each containing several subfolders representing different segments.
validation/dataset/p_subsample: Probability of keeping any specific data moment (0.005 = 0.5% probability of being kept).
validation/dataset/train_ratio: Fraction of the dataset to be isolated as validation data (if using separate train/val datasets keep at 1.0).
hardware/use_gpu: Whether to use a GPU at all.
hardware/gpu: CUDA GPU ID of the GPU to use.
model/past_frames: How many past frames to include in the input data.
model/future_frames: How many future frames to include in each input data moment.
model/frame_stride: Number of timesteps between input/output frames.
model/save_path: Path to the model save directory (path must exist).
model/resume_path: Path to the model resume directory (path must exist).
dataset/ignore_labels: Ignore a data moment if one of these labels is present (list of strings).
dataset/include_labels: Only use data moments which contain these labels (list of strings).
logging/path: Path to the logging file (doesn’t need to exist).
logging/training_loss: Path to the training loss logging file (path needs to exist).
logging/validation_loss: Path to the validation loss logging file (path needs to exist).
model/name: Name of the neural network to train.
model/py_path: Dotted import path to the neural network module; it should be importable, and the module it points to should set the variable Net to the main neural network class.
model/separate_frames: Usually false; set to true if a data moment should consist of only one frame.
model/metadata_shape: Shape of the metadata binary image to create; varies depending on the network used.
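
Because rand_seed and p_subsample together decide which data moments survive, here is a small illustration of how that deterministic selection can work. This is my own sketch of the idea, not the repository’s actual sampling code:

import random

def select_moments(moment_ids, p_subsample, rand_seed):
    """Keep each data moment with probability p_subsample, deterministically."""
    rng = random.Random(rand_seed)
    return [m for m in moment_ids if rng.random() < p_subsample]

# Reusing the same seed yields the same subset on every trial, which is what
# keeps repeated runs of an experiment comparable.
subset_a = select_moments(range(100000), p_subsample=0.005, rand_seed=123123123)
subset_b = select_moments(range(100000), p_subsample=0.005, rand_seed=123123123)
assert subset_a == subset_b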

With the new system, a configuration file stores all the information about your networks, and an experiment can be run with a single parent config representing the controlled variables and several sub config files representing your independent variables, while the results (the dependent variables) are logged to the appropriate directories.
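
As an illustration of that workflow, the snippet below (my own example, not part of the repository) launches one training run per daughter config while the shared parent config holds the controlled variables; the sibling config names are hypothetical:

import subprocess

daughter_configs = [
    'eric/nonsqueeze/nvidia.json',
    'eric/nonsqueeze/squeezenet.json',
]

for config_path in daughter_configs:
    # Each run differs only in what its daughter config overrides.
    subprocess.run(['python', 'Train.py', '--config', config_path], check=True)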

Network Defining System

"""SqueezeNet 1.1 modified for regression."""
import torch
import torch.nn as nn
import torch.nn.init as init
from torch.autograd import Variable
import logging
logging.basicConfig(filename='training.log', level=logging.DEBUG)

class Nvidia(nn.Module):

    def __init__(self, n_steps=10, n_frames=2):
        super(Nvidia, self).__init__()

        self.n_steps = n_steps
        self.n_frames = n_frames
        # Five strided convolutions; the input channel count (3 * 2 * n_frames)
        # matches the number of stacked input image channels.
        self.conv_nets = nn.Sequential(
            nn.Conv2d(3 * 2 * self.n_frames, 24, kernel_size=5, stride=2),
            nn.Conv2d(24, 36, kernel_size=5, stride=2),
            nn.Conv2d(36, 48, kernel_size=5, stride=2),
            nn.Conv2d(48, 64, kernel_size=3, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, stride=1)
        )
        # Fully connected layers map the flattened features to 4 outputs per
        # predicted timestep.
        self.fcl = nn.Sequential(
            nn.Linear(768, 100),
            nn.Linear(100, 4 * self.n_steps)
        )


    def forward(self, x, metadata):
        # metadata is accepted for interface compatibility but is unused here.
        x = self.conv_nets(x)
        x = x.view(x.size(0), -1)     # flatten convolutional features per sample
        x = self.fcl(x)
        x = x.view(x.size(0), -1, 4)  # reshape to (batch, n_steps, 4)
        return x

    def num_params(self):
        # Total number of trainable parameters in the network.
        return sum([reduce(lambda x, y: x * y, [dim for dim in p.size()], 1) for p in self.parameters()])

def unit_test():
    test_net = Nvidia(20, 6)
    a = test_net(Variable(torch.randn(5, 36, 94, 168)),
                 Variable(torch.randn(5, 12, 23, 41)))
    logging.debug('Net Test Output = {}'.format(a))
    logging.debug('Network was Unit Tested')
    print(test_net.num_params())

unit_test()

Net = Nvidia

In the new system, because the network itself is a variable set by the config, each network-defining file needs one small addition: at the end of the file, set a variable named “Net” to the main network class. This lets the main training code find the appropriate network even when there is more than one class in the file, and makes the system more flexible.
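
For illustration, this is roughly how the training code might use model/py_path together with the Net variable to pick up the right class; the helper below is my sketch, not the repository’s actual loader:

import importlib

def load_net_class(py_path):
    """Import the module named by model/py_path and return its Net class."""
    module = importlib.import_module(py_path)  # e.g. 'nets.eric.nonsqueeze.Nvidia'
    return module.Net

# net_class = load_net_class(config['model']['py_path'])
# net = net_class()  # constructor arguments depend on the particular network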

Training

With these new changes, training becomes as simple as

python Train.py --config abcd.config
Written on December 27, 2017