Step by Step: Faster RCNN on a CPU platform

In the prior blog, I shared a survey of some of the best Faster RCNN tutorials out there. I am now going to dive into the details of how I got mine working — from beginning to end.

Step 1: Install Anaconda if you have not already. As of the writing of this blog, the default download using Python 3.8.5 or higher should work.

Step 2: Launch Anaconda Prompt from your Windows Start menu. It should look like the below:

Step 3: Familiarize yourself with Sovit’s tutorial from Nov 2021, “A Simple Pipeline to Train PyTorch Faster RCNN Object Detection Model”. I will recreate some of the early steps here, but definitely refer back to that tutorial as you go through the remaining steps. That way, as I deviate or fill in the blanks, you have a frame of reference.

Step 4: Create the environment and install OpenCV and PyTorch. This assumes PyTorch version 1.10 or higher.

From Anaconda Prompt:

conda create --name frcnnenv
conda activate frcnnenv
conda install -c conda-forge opencv
conda install -c conda-forge pytorch-cpu
conda install -c conda-forge pytorch-lightning
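
Before moving on, it can help to confirm that the packages are visible from inside the new environment. The following is a minimal sketch and not part of Sovit’s tutorial; the helper name check_packages is my own, and it only checks that each module can be found, not that it works. Note that the import names differ from the conda package names (opencv is imported as cv2, pytorch as torch).

```python
import importlib.util


def check_packages(names):
    """Return {module_name: True/False} depending on whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names, not conda package names
    for name, found in check_packages(["cv2", "torch", "pytorch_lightning"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Save it to a file and run it with python from the activated frcnnenv environment. Any line that prints MISSING points at an install step to repeat.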

Step 5: Download the source code from Sovit’s tutorial – Section “Setting Up the Training Configuration”. You will have to enter your email address. This will send you to a Google Drive where you can download a *.ZIP file that contains the original source.

Step 6: Unzip (Extract All) the contents into the folder you want to run your training from. The ZIP already contains exactly the directory structure outlined in Sovit’s tutorial. Here is a screenshot.

As a bit of clean-up: we are NOT going to do the Uno image detection. So…

  • Under “data”, instead of “Uno Cards.v2-raw.voc” folder; create a folder called “hands”, with 3 subfolders “test”, “train”, and “valid”
  • Under “data”, delete folder “uno_custom_test_data”. This contains a video that makes the whole package big.
  • Under “outputs”, delete everything there. We will be training our own custom model.
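
If you would rather script the folder creation than click through Explorer, here is a minimal sketch using pathlib. The function name make_hands_dirs is hypothetical, and it assumes you run it from the main folder that contains “data”.

```python
import pathlib


def make_hands_dirs(base="data"):
    """Create <base>/hands/{test,train,valid} and return the three folder paths."""
    hands = pathlib.Path(base) / "hands"
    created = []
    for split in ("test", "train", "valid"):
        folder = hands / split
        # parents=True also creates "hands"; exist_ok=True makes re-runs harmless
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in make_hands_dirs("data"):
        print("ready:", folder)
```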

Step 7: Get images of people with hands off the internet. There are many ways to obtain images. For the Uno tutorial, Sovit got a dataset from Roboflow. In my case, I wanted to drive towards a very small dataset.

  • I installed Fatkun Batch Download for Chrome.
  • I put Google into Image mode, and then searched terms like:
    • people with hands
    • people cheering
    • people at work
  • Moved all 45 images into the folder “data/hands/train”
  • I repeated the same steps to get more images. Put 12 images into the folder “data/hands/valid”

Step 8: Renaming the images. In this step, I borrow heavily from the first Faster RCNN tutorial, by Josh Schmidt. It makes image annotation cleaner if the names are sequenced 001, 002, etc.

Josh’s version was probably built with an older version of PyTorch. The torch.utils didn’t have get_filenames_of_path, so I wrote a version that just uses the original pathlib. And of course, I pointed root to “./data/hands” and inputs to either train or valid.

Here is the source to my version of the renaming script:

import pathlib
#from torch.utils import get_filenames_of_path
#Created a custom "get_filenames_of_path" using pathlib
def get_filenames_of_path(path: pathlib.Path, ext: str = '*'):
    """Returns a list of files in a directory/path. Uses pathlib."""
    filenames = [file for file in path.glob(ext) if file.is_file()]
    return filenames

root = pathlib.Path("./data/hands")

#inputs = get_filenames_of_path(root / 'train')
inputs = get_filenames_of_path(root / 'valid')

for idx, path in enumerate(inputs):
    old_extension = path.suffix
    parent = path.parent
    # zero-pad the index to three digits: 000.jpg, 001.jpg, ...
    new_name = str(idx).zfill(3) + old_extension
    path.rename(parent / new_name)

From Anaconda Prompt:

cd <path to your main folder>

The results are cleanly numbered images – see below.

NOTE: After I renamed my “valid” images, which were in fact a completely different set, I went back to the 12 images and added a “1” to the front, giving “101.jpg” up to “112.jpg”.
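
The manual renumbering of the “valid” set could also be folded into the renaming itself. The sketch below is my own variation, not Josh’s code: the function name rename_sequentially and its start and width parameters are hypothetical. It renames in two passes through temporary names, on the assumption that no file is already named “__tmp_…”, so a new name cannot collide with a file that has not been renamed yet.

```python
import pathlib


def rename_sequentially(folder, start=0, width=3):
    """Rename every file in `folder` to a zero-padded number, keeping its extension."""
    folder = pathlib.Path(folder)
    files = sorted(p for p in folder.iterdir() if p.is_file())

    # Pass 1: move everything to temporary names
    temp_paths = []
    for i, path in enumerate(files):
        tmp = path.with_name(f"__tmp_{i}{path.suffix}")
        path.rename(tmp)
        temp_paths.append(tmp)

    # Pass 2: assign the final sequence numbers, beginning at `start`
    renamed = []
    for number, tmp in enumerate(temp_paths, start=start):
        target = tmp.with_name(str(number).zfill(width) + tmp.suffix)
        tmp.rename(target)
        renamed.append(target)
    return renamed
```

For example, rename_sequentially("./data/hands/valid", start=101) would number 12 files from 101 through 112 in one go.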

Step 9: Annotating the images. I used a browser-based tool to annotate my images.

  • Open the folder where the images are, select them all, and drag them into the browser with the tool open.
  • Create two classes – ‘left_hand’ and ‘right_hand’ — and hit “Start Project”
  • Annotate the images one by one. This is where perspective becomes very tricky!!!
  • Export the annotations as a Pascal VOC XML.
  • Unzip (Extract All) the XML files into the same folder as the images.
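
A quick sanity check on the exported files can save a failed training run later. The following is a minimal sketch that reads the boxes out of one Pascal VOC annotation using only the standard library. The function name read_voc_boxes and the sample XML are made up for illustration; your exported files will carry more fields than this.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_text):
    """Return [(label, xmin, ymin, xmax, ymax), ...] from Pascal VOC XML text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Some tools write coordinates as floats, so go through float() first
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


# Hand-written sample, only for illustration
SAMPLE = """<annotation>
  <filename>001.jpg</filename>
  <object>
    <name>left_hand</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>"""

print(read_voc_boxes(SAMPLE))  # [('left_hand', 10, 20, 110, 220)]
```

To check a real file, pass pathlib.Path("data/hands/train/001.xml").read_text() instead of SAMPLE, assuming that file exists in your folder.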

Step 10: Updating the configuration. I changed four things from the original. With a small image set, I reduced BATCH_SIZE to 4. I also changed the directories for TRAIN_DIR and VALID_DIR. And finally, I modified CLASSES to background, left_hand and right_hand.

Here is the source to my version of the configuration file:

import torch

BATCH_SIZE = 4 # increase / decrease according to GPU memory
RESIZE_TO = 416 # resize the image for training and transforms
NUM_EPOCHS = 10 # number of epochs to train for

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# training images and XML files directory
TRAIN_DIR = 'data/hands/train'
# validation images and XML files directory
VALID_DIR = 'data/hands/valid'

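# classes: 0 index is reserved for background
CLASSES = [
    '__background__', 'left_hand', 'right_hand'
]
NUM_CLASSES = len(CLASSES)

# whether to visualize images after creating the data loaders
VISUALIZE_TRANSFORMED_IMAGES = False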

# location to save model and plots
OUT_DIR = './outputs'

NOTE: I set VISUALIZE_TRANSFORMED_IMAGES = False because of an error. I have not had the chance to go back and debug it.

Step 11: Modifying the utility functions

I discovered the hard way that the previous output files were not being overwritten. So the first time I did the training, the outputs folder still held the original weights that had been populated by Sovit. My fix was to save the output files with the epoch number added, and for the “best weight” case, I also tack on the loss value. The first 3 mods apply on every epoch. The last mod applies only when a new best model is found. Search the code for the following:

savepath = "./output/last_model" + str(epoch) + ".pth"

savepath1 = "./output/train_loss" + str(epoch) + ".png"

savepath2 = "./output/valid_loss" + str(epoch) + ".png"

savepath = "./output/best_model" + str(epoch) + str(current_valid_loss) + ".pth"
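
The string concatenation above works, but the best-model name ends up with the epoch and the full loss value run together. As a variation, and not what my code does, here is a minimal sketch of one helper that builds all four names. The function name epoch_path, the underscore separator, and the four-decimal rounding are all my own assumptions.

```python
import pathlib


def epoch_path(out_dir, stem, epoch, suffix, loss=None):
    """Build <out_dir>/<stem><epoch><suffix>, adding _<loss> before the suffix when given."""
    name = f"{stem}{epoch}"
    if loss is not None:
        # Fixed precision keeps the file names short and easy to sort
        name += f"_{loss:.4f}"
    return pathlib.Path(out_dir) / f"{name}{suffix}"


print(epoch_path("output", "last_model", 3, ".pth").name)            # last_model3.pth
print(epoch_path("output", "best_model", 3, ".pth", loss=0.5).name)  # best_model3_0.5000.pth
```

torch.save accepts a pathlib.Path as the destination, so the returned value can be passed to it directly.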

Here is the source to my version of the utility functions:

import albumentations as A
import cv2
import numpy as np
import torch
import matplotlib.pyplot as plt

from albumentations.pytorch import ToTensorV2
from config import DEVICE, CLASSES

import pathlib
save_dest = pathlib.Path("./output")

plt.style.use('ggplot')

# this class keeps track of the training and validation loss values...
# ... and helps to get the average for each epoch as well
class Averager:
    def __init__(self):
        self.current_total = 0.0
        self.iterations = 0.0
    def send(self, value):
        self.current_total += value
        self.iterations += 1
    def value(self):
        if self.iterations == 0:
            return 0
        return 1.0 * self.current_total / self.iterations
    def reset(self):
        self.current_total = 0.0
        self.iterations = 0.0

class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's 
    validation loss is less than the previous least loss, then save the
    model state.
    """
    def __init__(
        self, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss

    def __call__(
        self, current_valid_loss, 
        epoch, model, optimizer
    ):
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"\nBest validation loss: {self.best_valid_loss}")
            print(f"\nSaving best model for epoch: {epoch+1}\n")
            # one file per improvement: epoch number plus loss value in the name
            savepath = "./output/best_model" + str(epoch) + str(current_valid_loss) + ".pth"
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, savepath)
                #}, './output/best_model.pth')

def collate_fn(batch):
    """
    To handle the data loading as different images may have different number 
    of objects and to handle varying size tensors as well.
    """
    return tuple(zip(*batch))

# define the training transforms
def get_train_transform():
    return A.Compose([
        A.MedianBlur(blur_limit=3, p=0.1),
        A.Blur(blur_limit=3, p=0.1),
        # convert the image to a PyTorch tensor
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc',
        'label_fields': ['labels']
    })

# define the validation transforms
def get_valid_transform():
    return A.Compose([
        # convert the image to a PyTorch tensor
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc', 
        'label_fields': ['labels']
    })

def show_tranformed_image(train_loader):
    """
    This function shows the transformed images from the `train_loader`.
    Helps to check whether the transformed images along with the corresponding
    labels are correct or not.
    Only runs if `VISUALIZE_TRANSFORMED_IMAGES = True` in the config file.
    """
    if len(train_loader) > 0:
        for i in range(1):
            images, targets = next(iter(train_loader))
            images = list(image.to(DEVICE) for image in images)
            targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
            boxes = targets[i]['boxes'].cpu().numpy().astype(np.int32)
            labels = targets[i]['labels'].cpu().numpy().astype(np.int32)
            sample = images[i].permute(1, 2, 0).cpu().numpy()
            for box_num, box in enumerate(boxes):
                cv2.rectangle(sample,
                            (box[0], box[1]),
                            (box[2], box[3]),
                            (0, 0, 255), 2)
                cv2.putText(sample, CLASSES[labels[box_num]], 
                            (box[0], box[1]-10), cv2.FONT_HERSHEY_SIMPLEX, 
                            1.0, (0, 0, 255), 2)
            cv2.imshow('Transformed image', sample)
            # wait for a key press, then close the preview window
            cv2.waitKey(0)
            cv2.destroyAllWindows()

def save_model(epoch, model, optimizer):
    """
    Function to save the trained model till current epoch, or whenever called
    """
    # one file per epoch: epoch number in the name
    savepath = "./output/last_model" + str(epoch) + ".pth"
    torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, savepath)
                #}, './output/last_model.pth')

def save_loss_plot(OUT_DIR, train_loss, val_loss, epoch):
    # one pair of plots per epoch: epoch number in the name
    savepath1 = "./output/train_loss" + str(epoch) + ".png"
    savepath2 = "./output/valid_loss" + str(epoch) + ".png"
    figure_1, train_ax = plt.subplots()
    figure_2, valid_ax = plt.subplots()
    train_ax.plot(train_loss, color='tab:blue')
    train_ax.set_ylabel('train loss')
    valid_ax.plot(val_loss, color='tab:red')
    valid_ax.set_ylabel('validation loss')
    figure_1.savefig(savepath1)
    figure_2.savefig(savepath2)
    # release the figures so they do not pile up across epochs
    plt.close('all')


Step 12: Train the weights

From Anaconda Prompt:


The training run first pulls in the images and confirms the annotations. Then it starts training across the 10 epochs. Here is a screenshot of my system when it was done.

Step 13: Modifying the inference script

This file had the most rewriting of the original. I had to use imutils. I got rid of the hardcoded default to recorded video. And I completely swapped out the methodology on while (true). This version actually works more like the Hotwheels and Matchbox detector from a previous blog.

Here is the source to my version of the inference script:

import numpy as np
import cv2
import torch
#pchenmod import os
import time
#pchenmod import argparse
#pchenmod import pathlib
from imutils.video import VideoStream
import imutils

from model import create_model
from config import (
    NUM_CLASSES, DEVICE, CLASSES
)

# this will help us create a different color for each class
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

# load the best model and trained weights
model = create_model(num_classes=NUM_CLASSES)

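checkpoint = torch.load('outputs/best_model.pth', map_location=DEVICE)
# restore the trained weights and switch to evaluation mode
model.load_state_dict(checkpoint['model_state_dict'])
model.to(DEVICE).eval()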

# define the detection threshold...
# ... any detection having score below this will be discarded
detection_threshold = 0.45
RESIZE_TO = (512, 512)

vs = VideoStream(src=0).start()
#cap = cv2.VideoCapture(0)
frame_gap = 0

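# grab the current frame
frame = vs.read()
if frame is None:
    # no camera frame available, so stop here
    raise SystemExit("Could not read a frame from the camera")

frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second

while True:
    # capture each frame of the video
    frame = vs.read()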
    frame = imutils.resize(frame, width = 512, height = 512)
    image = frame.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    # make the pixel range between 0 and 1
    image /= 255.0
    # bring color channels to front
    image = np.transpose(image, (2, 0, 1)).astype(np.float32)
    # convert to tensor
    #pchenmod image = torch.tensor(image, dtype=torch.float).cuda()
    image = torch.tensor(image, dtype=torch.float)
    # add batch dimension
    image = torch.unsqueeze(image, 0)
    # get the start time
    start_time = time.time()
    with torch.no_grad():
        # get predictions for the current frame
        outputs = model(image.to(DEVICE))
    end_time = time.time()
    # get the current fps
    fps = 1 / (end_time - start_time)
    # add `fps` to `total_fps`
    total_fps += fps
    # increment frame count
    frame_count += 1
    # load all detection to CPU for further operations
    outputs = [{k: v.to('cpu') for k, v in t.items()} for t in outputs]
    # carry further only if there are detected boxes
    if len(outputs[0]['boxes']) != 0:
        boxes = outputs[0]['boxes'].data.numpy()
        scores = outputs[0]['scores'].data.numpy()
        # filter out boxes according to `detection_threshold`
        boxes = boxes[scores >= detection_threshold].astype(np.int32)
        draw_boxes = boxes.copy()
        # get all the predicited class names
        pred_classes = [CLASSES[i] for i in outputs[0]['labels'].cpu().numpy()]
        # draw the bounding boxes and write the class name on top of it
        for j, box in enumerate(draw_boxes):
            class_name = pred_classes[j]
            color = COLORS[CLASSES.index(class_name)]
            cv2.rectangle(frame,
                        (int(box[0]), int(box[1])),
                        (int(box[2]), int(box[3])),
                        color, 2)
            cv2.putText(frame, class_name, 
                        (int(box[0]), int(box[1]-5)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 
                        2, lineType=cv2.LINE_AA)

    cv2.imshow("Frame", frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# close all frames and video windows
cv2.destroyAllWindows()
vs.stop()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

The camera source (src=0) defaults to the laptop camera. If you want to use a USB webcam, plug it in and change the VideoStream line to:

vs = VideoStream(src=1).start()

In the code above, we set detection_threshold to 0.45. Feel free to adjust it down if Step 15 doesn’t actually detect hands.
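
To see what the threshold does without a camera attached, here is a minimal sketch in plain Python. The function name filter_detections and the sample boxes, scores, and labels are made up for illustration. Unlike the script above, it filters boxes, scores, and labels together so the three stay aligned.

```python
def filter_detections(boxes, scores, labels, threshold=0.45):
    """Keep only the (box, score, label) triples whose score is at least `threshold`."""
    return [
        (box, score, label)
        for box, score, label in zip(boxes, scores, labels)
        if score >= threshold
    ]


# Made-up detections, only for illustration
boxes = [[10, 10, 50, 50], [20, 20, 60, 60], [5, 5, 15, 15]]
scores = [0.91, 0.52, 0.30]
labels = ["left_hand", "right_hand", "left_hand"]

print(len(filter_detections(boxes, scores, labels)))                  # 2
print(len(filter_detections(boxes, scores, labels, threshold=0.25)))  # 3
```

Lowering the threshold lets weaker detections through, which helps when nothing is detected but also brings in more false positives.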

Step 15: Running the detector

From Anaconda Prompt:


The results are like the previous post:

