Step by Step: Faster RCNN on a CPU platform

In the prior blog, I shared a survey of some of the best Faster RCNN tutorials out there. I am now going to dive into the details of how I got mine working — from beginning to end.

Step 1: Install Anaconda if you have not already. As of the writing of this blog, the default download using Python 3.8.5 or higher should work.

Step 2: Launch Anaconda Prompt from your Windows Start menu. It should look like this:

Anaconda command prompt

Step 3: Familiarize yourself with Sovit’s tutorial from Nov 2021 — “A Simple Pipeline to Train PyTorch Faster RCNN Object Detection Model”. I will recreate some of the early steps here — but definitely refer back to this tutorial as you go through the remaining steps. That way as I deviate OR fill-in the blanks — you have a frame of reference.

Step 4: Create the environment and install OpenCV and PyTorch. This assumes PyTorch version 1.10 or higher.

From Anaconda Prompt:

conda create --name frcnnenv
conda activate frcnnenv
conda install -c conda-forge opencv
conda install -c conda-forge pytorch-cpu
conda install -c conda-forge pytorch-lightning
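Before moving on to the download, it is worth a quick sanity check that the installs worked. Below is a minimal check, run from inside the activated frcnnenv environment; the file name check_env.py is just a suggestion:

import cv2
import torch

# both imports should succeed inside the frcnnenv environment
print("torch:", torch.__version__)
print("opencv:", cv2.__version__)

# on this CPU-only setup, CUDA should report as unavailable
print("cuda available:", torch.cuda.is_available())

From Anaconda Prompt:

python check_env.py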

Step 5: Download the source code from Sovit’s tutorial – Section “Setting Up the Training Configuration”. You will have to enter your email address. This will send you to a Google Drive where you can download a *.ZIP file that contains the original source.

Step 6: Unzip (Extract All) the contents into the folder you want to run your training from, following the directory structure outlined in Sovit’s tutorial. The content of the ZIP is exactly the structure you need. Here is a screenshot.

As a bit of clean-up, we are NOT going to do the Uno image detection. So…

  • Under “data”, instead of the “Uno Cards.v2-raw.voc” folder, create a folder called “hands”, with 3 subfolders “test”, “train”, and “valid” (a small script after this list can create these for you)
  • Under “data”, delete folder “uno_custom_test_data”. This contains a video that makes the whole package big.
  • Under “outputs”, delete everything there. We will be training our own custom model.
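If you prefer to script that folder setup rather than click through Explorer, here is a minimal sketch using pathlib; it assumes you run it from the main folder and it only creates the “hands” subfolders described above:

import pathlib

# create data/hands with its test, train and valid subfolders
data_dir = pathlib.Path("./data/hands")
for sub in ("test", "train", "valid"):
    (data_dir / sub).mkdir(parents=True, exist_ok=True)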

Step 7: Get images of people with hands off the internet. There are many ways to obtain images. For the Uno tutorial, Sovit got a dataset from Roboflow. In my case, I wanted to drive towards a very small dataset.

  • I installed Fatkun Batch Download for Chrome.
  • I put Google into Image mode and then searched terms like:
    • people with hands
    • people cheering
    • people at work
  • I moved all 45 images into the folder “data/hands/train”
  • I repeated the same steps to get more images and put 12 of them into the folder “data/hands/valid” (a quick count check follows this list)
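With the downloads in place, a quick count confirms the images landed where expected. This is a small sketch, assuming the folder layout above; it should report 45 under train and 12 under valid:

import pathlib

root = pathlib.Path("./data/hands")

# count the image files in each split; extensions vary with what was downloaded
for split in ("train", "valid"):
    images = [f for f in (root / split).glob("*")
              if f.suffix.lower() in (".jpg", ".jpeg", ".png")]
    print(split, len(images))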

Step 8: Renaming the images. In this step, I borrow heavily from the first Faster RCNN tutorial, by Josh Schmidt. Image annotation is cleaner if the file names are sequenced 001, 002, etc.

Josh’s version of rename_images.py was built with what may have been an older version of PyTorch; torch.utils didn’t have get_filenames_of_path, so I rewrote a version that just uses pathlib. And of course, I now pointed root to “./data/hands” and inputs to either train or valid.

Here is the source to my version of “rename_input.py”:

"""
Created on Fri Jun 10 20:28:58 2022

@author: squac
"""

import pathlib
#from torch.utils import get_filenames_of_path
#Created a custom "get_filenames_of_path" using pathlib
def get_filenames_of_path(path: pathlib.Path, ext: str = '*'):
    """Returns a list of files in a directory/path. Uses pathlib."""
    filenames = [file for file in path.glob(ext) if file.is_file()]
    return filenames

root = pathlib.Path("./data/hands")

#inputs = get_filenames_of_path(root / 'train')
inputs = get_filenames_of_path(root / 'valid')
inputs.sort()


for idx, path in enumerate(inputs):
    old_name = path.stem
    old_extension = path.suffix
    dir = path.parent
    new_name = str(idx).zfill(3) + old_extension
    path.rename(pathlib.Path(dir, new_name))

From Anaconda Prompt:

cd <path to your main folder>
python rename_input.py

The results are cleanly numbered images – see below.

NOTE: After I renamed my “valid” images, which were in fact a completely different set, I went back to the 12 images and renamed them with a leading “1”, so “101.jpg” up to “112.jpg”.
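In hindsight, the manual renaming can be avoided by numbering the validation set starting at 101 directly. Here is a small variation on the loop in rename_input.py above; it reuses root and get_filenames_of_path from that script, and only the enumerate call changes:

# number the valid images 101, 102, ... instead of 000, 001, ...
inputs = get_filenames_of_path(root / 'valid')
inputs.sort()

for idx, path in enumerate(inputs, start=101):
    new_name = str(idx).zfill(3) + path.suffix
    path.rename(pathlib.Path(path.parent, new_name))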

Step 9: Annotating the Images. I used https://www.makesense.ai/ to annotate my images.

  • Open the folder where the images are; select them all; and drag into the browser with Makesense.ai open.
  • Create two classes – ‘left_hand’ and ‘right_hand’ — and hit “Start Project”
  • Annotate the images one by one. This is where perspective becomes very tricky!!!
  • Export the annotations in Pascal VOC XML format.
  • Unzip (Extract All) the XML files into the same folder as the images (a quick sanity check of the annotations follows this list).
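The export gives one Pascal VOC XML file per image. Before training, a quick pass to confirm that every annotation uses exactly the ‘left_hand’ and ‘right_hand’ labels, and that each XML points at an image that exists, can save a confusing error later. Here is a minimal sketch using only the standard library; the file name check_annotations.py is hypothetical, and you can run it once for train and once for valid by changing folder:

import pathlib
import xml.etree.ElementTree as ET

folder = pathlib.Path("./data/hands/train")
expected = {"left_hand", "right_hand"}

for xml_file in sorted(folder.glob("*.xml")):
    root_el = ET.parse(xml_file).getroot()
    # each <object> element in a Pascal VOC file carries a <name> tag with the class label
    labels = {obj.find("name").text for obj in root_el.findall("object")}
    if not labels <= expected:
        print(xml_file.name, "has unexpected labels:", labels - expected)
    # the image referenced by the annotation should sit next to the XML file
    filename_el = root_el.find("filename")
    if filename_el is not None and not (folder / filename_el.text).exists():
        print(xml_file.name, "points to a missing image:", filename_el.text)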

Step 10: Updating and running config.py. I changed four things in the original config.py. With a small image set, I reduced BATCH_SIZE to 4. I also changed the directories for TRAIN_DIR and VALID_DIR. And finally, I modified CLASSES to background, left_hand and right_hand.

Here is the source to my version of “config.py”:

import torch

BATCH_SIZE = 4 # increase / decrease according to GPU memory
RESIZE_TO = 416 # resize the image for training and transforms
NUM_EPOCHS = 10 # number of epochs to train for
NUM_WORKERS = 4

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')


# training images and XML files directory
TRAIN_DIR = 'data/hands/train'
# validation images and XML files directory
VALID_DIR = 'data/hands/valid'


CLASSES = [
    '__background__', 'left_hand', 'right_hand'
]

NUM_CLASSES = len(CLASSES)

# whether to visualize images after creating the data loaders
#VISUALIZE_TRANSFORMED_IMAGES = True
VISUALIZE_TRANSFORMED_IMAGES = False

# location to save model and plots
OUT_DIR = './outputs'

NOTE: I set VISUALIZE_TRANSFORMED_IMAGES = False because of an error. I have not had the chance to go back and debug it.

Step 11: Modifying custom_utils.py.

I discovered the hard way that torch.save no longer overwrites the previous files. So the first time I did the training, the outputs folder still kept the original weights that had been populated by Sovit. My solution was to save the output files with the epoch number appended, and for “best weight” situations, to tack on the loss value as well. The first 3 mods are done for every epoch; the last mod is for the best model situation. Search in the code for the following:

savepath = "./output/last_model"+ str(epoch) + ".pth";

savepath1 = "./output/train_loss"+ str(epoch) + ".png";

savepath2 = "./output/valid_loss"+ str(epoch) + ".png";

savepath = "./output/best_model"+ str(epoch) + str(current_valid_loss)+".pth";

Here is the source to my version of “custom_utils.py”:

import albumentations as A
import cv2
import numpy as np
import torch
import matplotlib.pyplot as plt

from albumentations.pytorch import ToTensorV2
from config import DEVICE, CLASSES

import pathlib
# make sure the local output folder exists before any checkpoints or plots are written to it
save_dest = pathlib.Path("./output")
save_dest.mkdir(parents=True, exist_ok=True)

plt.style.use('ggplot')

# this class keeps track of the training and validation loss values...
# ... and helps to get the average for each epoch as well
class Averager:
    def __init__(self):
        self.current_total = 0.0
        self.iterations = 0.0
        
    def send(self, value):
        self.current_total += value
        self.iterations += 1
    
    @property
    def value(self):
        if self.iterations == 0:
            return 0
        else:
            return 1.0 * self.current_total / self.iterations
    
    def reset(self):
        self.current_total = 0.0
        self.iterations = 0.0

class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's 
    validation loss is less than the previous least loss, then save the
    model state.
    """
    def __init__(
        self, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss
        
    def __call__(
        self, current_valid_loss, 
        epoch, model, optimizer
    ):
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"\nBest validation loss: {self.best_valid_loss}")
            print(f"\nSaving best model for epoch: {epoch+1}\n")
            savepath = "./output/best_model"+ str(epoch) + str(current_valid_loss)+".pth"; 
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, savepath)
                #}, './output/best_model.pth')

def collate_fn(batch):
    """
    To handle the data loading as different images may have different number 
    of objects and to handle varying size tensors as well.
    """
    return tuple(zip(*batch))

# define the training transforms
def get_train_transform():
    return A.Compose([
        A.Flip(0.5),
        A.RandomRotate90(0.5),
        A.MotionBlur(p=0.2),
        A.MedianBlur(blur_limit=3, p=0.1),
        A.Blur(blur_limit=3, p=0.1),
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc',
        'label_fields': ['labels']
    })

# define the validation transforms
def get_valid_transform():
    return A.Compose([
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc', 
        'label_fields': ['labels']
    })


def show_tranformed_image(train_loader):
    """
    This function shows the transformed images from the `train_loader`.
    Helps to check whether the transformed images along with the corresponding
    labels are correct or not.
    Only runs if `VISUALIZE_TRANSFORMED_IMAGES = True` in config.py.
    """
    if len(train_loader) > 0:
        for i in range(1):
            images, targets = next(iter(train_loader))
            images = list(image.to(DEVICE) for image in images)
            targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
            boxes = targets[i]['boxes'].cpu().numpy().astype(np.int32)
            labels = targets[i]['labels'].cpu().numpy().astype(np.int32)
            sample = images[i].permute(1, 2, 0).cpu().numpy()
            for box_num, box in enumerate(boxes):
                cv2.rectangle(sample,
                            (box[0], box[1]),
                            (box[2], box[3]),
                            (0, 0, 255), 2)
                cv2.putText(sample, CLASSES[labels[box_num]], 
                            (box[0], box[1]-10), cv2.FONT_HERSHEY_SIMPLEX, 
                            1.0, (0, 0, 255), 2)
            cv2.imshow('Transformed image', sample)
            cv2.waitKey(0)
            cv2.destroyAllWindows()

def save_model(epoch, model, optimizer):
    """
    Function to save the trained model up to the current epoch, or whenever called
    """
    savepath = "./output/last_model"+ str(epoch) + ".pth";  
    torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, savepath)
                #}, './output/last_model.pth')

def save_loss_plot(OUT_DIR, train_loss, val_loss,epoch):
    savepath1 = "./output/train_loss"+ str(epoch) + ".png";  
    savepath2 = "./output/valid_loss"+ str(epoch) + ".png";        
    figure_1, train_ax = plt.subplots()
    figure_2, valid_ax = plt.subplots()
    train_ax.plot(train_loss, color='tab:blue')
    train_ax.set_xlabel('iterations')
    train_ax.set_ylabel('train loss')
    valid_ax.plot(val_loss, color='tab:red')
    valid_ax.set_xlabel('iterations')
    valid_ax.set_ylabel('validation loss')
    #figure_1.savefig(f"{OUT_DIR}/train_loss.png")
    #figure_2.savefig(f"{OUT_DIR}/valid_loss.png")
    figure_1.savefig(savepath1)
    figure_2.savefig(savepath2)
    print('SAVING PLOTS COMPLETE...')

    plt.close('all')

Step 12: Train the weights

From Anaconda Prompt:

python train.py

train.py uses config.py to pull in the images and confirm the annotations. Then it starts training across the 10 epochs. Here is a screenshot of my system when it was done.
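Once training finishes, the per-epoch .pth files from the Step 11 changes should show up under ./output. A quick way to confirm a checkpoint was written correctly is to load it back and inspect its keys. This is a minimal sketch; the file name below is just an example, so adjust it to one that actually exists on your machine:

import torch

# one of the checkpoints written during training; the exact name depends on the
# epoch (and, for best_model files, the loss value)
CHECKPOINT = "./output/last_model9.pth"

checkpoint = torch.load(CHECKPOINT, map_location="cpu")

# save_model() and SaveBestModel from Step 11 store exactly these three keys
print(list(checkpoint.keys()))   # ['epoch', 'model_state_dict', 'optimizer_state_dict']
print("epoch:", checkpoint["epoch"])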

Step 13: Modifying inference_video.py

This one had the most rewrites of the original inference_video.py. I had to use imutils. I got rid of the hardcoded default to recorded video, and completely swapped out the methodology of the while(True) loop. This version actually works more like the Hotwheels and Matchbox detector from a previous blog.

Here is the source to my version of “inference_mod_video.py”:

import numpy as np
import cv2
import torch
#pchenmod import os
import time
#pchenmod import argparse
#pchenmod import pathlib
from imutils.video import VideoStream
import imutils

from model import create_model
from config import (
    NUM_CLASSES, DEVICE, CLASSES
)


# this will help us create a different color for each class
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

# load the best model and trained weights
model = create_model(num_classes=NUM_CLASSES)

checkpoint = torch.load('outputs/best_model.pth', map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(DEVICE).eval()

# define the detection threshold...
# ... any detection having score below this will be discarded
detection_threshold = 0.45
RESIZE_TO = (512, 512)


vs = VideoStream(src=0).start()
#cap = cv2.VideoCapture(0)
time.sleep(2.0)
frame_gap = 0

# grab the current frame
frame = vs.read()
if frame is None:
    exit()

frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second


while(True):
    # capture each frame of the video
    frame = vs.read()
    frame = imutils.resize(frame, width = 512, height = 512)
    image = frame.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    # make the pixel range between 0 and 1
    image /= 255.0
    # bring color channels to front
    image = np.transpose(image, (2, 0, 1)).astype(np.float32)
    # convert to tensor
    #pchenmod image = torch.tensor(image, dtype=torch.float).cuda()
    image = torch.tensor(image, dtype=torch.float)
    # add batch dimension
    image = torch.unsqueeze(image, 0)
    # get the start time
    start_time = time.time()
    with torch.no_grad():
        # get predictions for the current frame
        outputs = model(image.to(DEVICE))
    end_time = time.time()

    # get the current fps
    fps = 1 / (end_time - start_time)
    # add `fps` to `total_fps`
    total_fps += fps
    # increment frame count
    frame_count += 1

    # load all detection to CPU for further operations
    outputs = [{k: v.to('cpu') for k, v in t.items()} for t in outputs]
    # carry further only if there are detected boxes
    if len(outputs[0]['boxes']) != 0:
        boxes = outputs[0]['boxes'].data.numpy()
        scores = outputs[0]['scores'].data.numpy()
        # filter out boxes according to `detection_threshold`
        boxes = boxes[scores >= detection_threshold].astype(np.int32)
        draw_boxes = boxes.copy()
        # get all the predicted class names
        pred_classes = [CLASSES[i] for i in outputs[0]['labels'].cpu().numpy()]
        
        # draw the bounding boxes and write the class name on top of it
        for j, box in enumerate(draw_boxes):
            class_name = pred_classes[j]
            color = COLORS[CLASSES.index(class_name)]
            cv2.rectangle(frame,
                        (int(box[0]), int(box[1])),
                        (int(box[2]), int(box[3])),
                        color, 2)
            cv2.putText(frame, class_name, 
                        (int(box[0]), int(box[1]-5)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 
                        2, lineType=cv2.LINE_AA)

    cv2.imshow("Frame", frame)
    #out.write(frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break


vs.stop()
# close all frames and video windows
cv2.destroyAllWindows()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

The camera source (src=0) defaults to the laptop camera. If you want to use a USB webcam, plug it in and change the VideoStream line to:

vs = VideoStream(src=1).start()

In the code above, we set detection_threshold to 0.45. Feel free to adjust it down if Step 15 doesn’t actually detect hands.

Step 15: Running the inference_mod_video.py

From Anaconda Prompt:

python inference_mod_video.py

The results are similar to those in the previous post:
