Training the model

Now that we have created our (toy) dataset, we can proceed to train our model.

The configuration file

Training orchestration, including the model to be used, the dataset location, and the training schedule, is specified in a YAML config file. Luminoth consumes this file and merges it with the default configuration to start the training session.

You can see a minimal config file example in sample_config.yml. This file illustrates the entries you’ll most probably need to modify, which are:

  • train.run_name: the run name for the training session, used to identify it.
  • train.job_dir: directory in which both model checkpoints and summaries (for TensorBoard consumption) will be saved. The actual files will be stored under <job_dir>/<run_name>.
  • dataset.dir: directory from which to read the TFRecord files.
  • model.type: model to use for object detection (fasterrcnn or ssd).
  • network.num_classes: number of classes to predict (depends on your dataset).

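Concretely, a minimal sketch covering just those entries might look like the following; the run name, directories and class count are placeholder values you would replace for your own dataset:

train:
  run_name: my-run
  job_dir: /path/to/job_dir
dataset:
  dir: /path/to/tfrecords
model:
  type: fasterrcnn
  network:
    num_classes: 8
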
To see all the possible configuration options, most of which relate to the model itself, you can check the base_config.yml file.

Building the config file for your dataset

Probably the most important setting for training is the learning rate. You will most likely want to tune it for your dataset, which you can do via the train.learning_rate setting in the configuration. For example, the following would be a good setting for training on the full COCO dataset:

learning_rate:
  decay_method: piecewise_constant
  boundaries: [250000, 450000, 600000]
  values: [0.0003, 0.0001, 0.00003, 0.00001]

To get to values like these, you will need to run some experiments and see what works best. For the toy dataset we built earlier in this tutorial, we will use the following config file:

train:
  # Run name for the training session.
  run_name: traffic
  job_dir: <change this directory>
  learning_rate:
    decay_method: piecewise_constant
    # Custom dataset for Luminoth Tutorial
    boundaries: [90000, 160000, 250000]
    values: [0.0003, 0.0001, 0.00003, 0.00001]
dataset:
  type: object_detection
  dir: <directory with your dataset>
model:
  type: fasterrcnn
  network:
    num_classes: 8
  anchors:
    # Add one more scale to be better at detecting small objects
    scales: [0.125, 0.25, 0.5, 1, 2]

Running the training

Assuming you already have both your dataset (TFRecords) and the config file ready, you can start your training session by running the following command:

lumi train -c config.yml

You can use the -o option to override any configuration option using dot notation (e.g. -o model.rpn.proposals.nms_threshold=0.8).
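
For example, the following invocation (reusing the override shown above; the threshold value is only illustrative) combines a config file with a command-line override:

lumi train -c config.yml -o model.rpn.proposals.nms_threshold=0.8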

If you are using a CUDA-based GPU, you can select which GPU to use by setting the CUDA_VISIBLE_DEVICES environment variable.
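
For instance, in a typical shell, prefixing the command as below would restrict Luminoth to the first GPU (device 0); adjust the index to match the GPU you want to use:

CUDA_VISIBLE_DEVICES=0 lumi train -c config.yml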

While the training is running, you should see Luminoth print out, for each step, the minibatch being processed (a single image) and the training loss for that minibatch.

From image to image, the training loss will jump around; this is expected. The overall trend, however, is for the loss to gradually decrease, which is easier to see with a tool like TensorBoard.

Storing partial weights (checkpoints)

As the training progresses, Luminoth will periodically save a checkpoint with the current weights of the model. These weights let you resume training from where you left off!

The files will be written to your <job_dir>/<run_name> folder. By default, a checkpoint is saved every 600 seconds of training, but you can configure this with the train.save_checkpoint_secs setting in your config file.

By default, only the latest checkpoint is stored (that is, when a new checkpoint is generated, the previous one gets deleted) in order to conserve storage. If you want to keep more checkpoints around, use the train.checkpoints_max_keep option in your config file.
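
As a sketch, both checkpoint-related settings go under the train section of your config file; the values below are only illustrative:

train:
  # Save a checkpoint every 30 minutes instead of the default 600 seconds.
  save_checkpoint_secs: 1800
  # Keep the 3 most recent checkpoints instead of only the latest one.
  checkpoints_max_keep: 3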


Next: Using TensorBoard to visualize the training process