In the prior blog, I shared a survey of some of the best Faster RCNN tutorials out there. I am now going to dive into the details of how I got mine working — from beginning to end.
Step 1: Install Anaconda if you have not already. As of the writing of this blog, the default download using Python 3.8.5 or higher should work.
Step 2: Launch Anaconda Prompt from your Windows Start menu. It should look like the screenshot below:
Step 3: Familiarize yourself with Sovit’s tutorial from Nov 2021 — “A Simple Pipeline to Train PyTorch Faster RCNN Object Detection Model”. I will recreate some of the early steps here — but definitely refer back to this tutorial as you go through the remaining steps. That way, as I deviate or fill in the blanks, you have a frame of reference.
Step 4: Create the environment and install OpenCV and PyTorch. This assumes PyTorch version 1.10 or higher.
From Anaconda Prompt:
conda create --name frcnnenv
conda activate frcnnenv
conda install -c conda-forge opencv
conda install -c conda-forge pytorch-cpu
conda install -c conda-forge pytorch-lightning
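Once those installs finish, a quick sanity check never hurts (my own addition, not part of Sovit’s tutorial). Run python from inside the frcnnenv environment and try the imports:

# quick check that the packages installed into frcnnenv are importable
import torch
import cv2
print("torch:", torch.__version__)                  # expecting 1.10 or higher
print("opencv:", cv2.__version__)
print("cuda available:", torch.cuda.is_available()) # False is fine for the CPU-only install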
Step 5: Download the source code from Sovit’s tutorial – Section “Setting Up the Training Configuration”. You will have to enter your email address. This will send you to a Google Drive where you can download a *.ZIP file that contains the original source.
Step 6: Unzip (Extract All) the contents into the Directory Structure outlined in Sovit’s tutorial, in the folder from which you want to run your training. The content of the ZIP is exactly the structure you need. Here is a screenshot.
As a bit of clean-up, we are NOT going to do the Uno image detection, so make the following changes (the resulting layout is sketched after this list):
- Under “data”, instead of the “Uno Cards.v2-raw.voc” folder, create a folder called “hands” with 3 subfolders: “test”, “train”, and “valid”
- Under “data”, delete the folder “uno_custom_test_data”. It contains a video that makes the whole package big.
- Under “outputs”, delete everything there. We will be training our own custom model.
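For reference, after this clean-up my working folder looked roughly like the sketch below. I am only listing the pieces this post touches; your ZIP will also contain a few other scripts that I leave as-is.

config.py
custom_utils.py
model.py
train.py
inference_video.py
data
    hands
        test
        train
        valid
outputs        (now empty)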
Step 7: Get images off the internet of people with hands. There are many ways to obtain images. For the Uno tutorial, Sovit got a dataset from Roboflow. In my case, I wanted to drive towards a very small dataset.
- I installed the Fatkun Batch Download Image extension for Chrome.
- I put Google into Image mode and then searched terms like:
- people with hands
- people cheering
- people at work
- Moved all 45 images into the folder “data/hands/train”
- I repeated the same steps to get more images. Put 12 images into the folder “data/hands/valid”
Step 8: Renaming the images. In this step, I borrow heavily from the first Faster RCNN tutorial, Josh Schmidt’s. It makes image annotation cleaner if the names are sequenced 001, 002, etc.
Josh’s version of rename_images.py was built with perhaps an older version of PyTorch. The torch.utils didn’t have get_filenames_of_path, so I rewrote a version that just uses plain pathlib. And of course, I now pointed root to “./data/hands” and inputs to either train or valid.
Here is the source to my version of “rename_input.py“
"""
Created on Fri Jun 10 20:28:58 2022
@author: squac
"""
import pathlib
#from torch.utils import get_filenames_of_path
#Created a custom "get_filenames_of_path" using pathlib
def get_filenames_of_path(path: pathlib.Path, ext: str = '*'):
    """Returns a list of files in a directory/path. Uses pathlib."""
    filenames = [file for file in path.glob(ext) if file.is_file()]
    return filenames

root = pathlib.Path("./data/hands")
#inputs = get_filenames_of_path(root / 'train')
inputs = get_filenames_of_path(root / 'valid')
inputs.sort()

# rename each file to a zero-padded sequence number, keeping its extension
for idx, path in enumerate(inputs):
    old_name = path.stem
    old_extension = path.suffix
    dir = path.parent
    new_name = str(idx).zfill(3) + old_extension
    path.rename(pathlib.Path(dir, new_name))
From Anaconda Prompt:
cd <path to your main folder>
python rename_input.py
The results are cleanly numbered images – see below.
NOTE: After I renamed my “valid” images (which were in fact a completely different set), I went back to the 12 images and added a “1” to the front, so they run from “101.jpg” up to “112.jpg”.
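If you would rather not rename things by hand, one option is to start the numbering higher in the script itself. This is a hypothetical tweak to rename_input.py, not what I actually ran:

# hypothetical tweak: number the valid images from 101 upward instead of 000
for idx, path in enumerate(inputs, start=101):
    new_name = str(idx).zfill(3) + path.suffix
    path.rename(path.with_name(new_name))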
Step 9: Annotating the Images. I used https://www.makesense.ai/ to annotate my images.
- Open the folder where the images are, select them all, and drag them into the browser with Makesense.ai open.
- Create two classes – ‘left_hand’ and ‘right_hand’ — and hit “Start Project”
- Annotate the images one by one. This is where perspective becomes very tricky!!!
- Export the annotations in Pascal VOC XML format.
- Unzip (Extract All) the XML files into the same folder as the images (a quick check is sketched below).
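Before moving on to the config, here is a quick check I find useful (my own addition, not part of Sovit’s pipeline) to confirm that every image got an XML file and that only the two class names appear:

import pathlib
import xml.etree.ElementTree as ET

folder = pathlib.Path("./data/hands/train")   # repeat with ./data/hands/valid
expected = {"left_hand", "right_hand"}
for img in sorted(folder.glob("*.jpg")):      # adjust the pattern if some images are .png
    xml_file = img.with_suffix(".xml")
    if not xml_file.exists():
        print("missing annotation for", img.name)
        continue
    names = {obj.findtext("name") for obj in ET.parse(xml_file).getroot().iter("object")}
    unexpected = names - expected
    if unexpected:
        print(xml_file.name, "has unexpected labels:", unexpected)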
Step 10: Updating and running config.py. I changed four things in the original config.py. With a small image set, I reduced BATCH_SIZE to 4. I also changed the directories for TRAIN_DIR and VALID_DIR. And finally, I modified CLASSES to background, left_hand and right_hand.
Here is the source to my version of “config.py“
import torch
BATCH_SIZE = 4 # increase / decrease according to GPU memory
RESIZE_TO = 416 # resize the image for training and transforms
NUM_EPOCHS = 10 # number of epochs to train for
NUM_WORKERS = 4
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# training images and XML files directory
TRAIN_DIR = 'data/hands/train'
# validation images and XML files directory
VALID_DIR = 'data/hands/valid'
CLASSES = [
    '__background__', 'left_hand', 'right_hand'
]
NUM_CLASSES = len(CLASSES)
# whether to visualize images after creating the data loaders
#VISUALIZE_TRANSFORMED_IMAGES = True
VISUALIZE_TRANSFORMED_IMAGES = False
# location to save model and plots
OUT_DIR = './outputs'
NOTE: I set VISUALIZE_TRANSFORMED_IMAGES = False because of an error. I have not had the chance to go back and debug it.
Step 11: Modifying custom_utils.py.
I discovered the hard way that my saved weights were not replacing the previous files: the first time I ran training, the outputs folder still held the original weights that Sovit had shipped. My fix was to save the output files with the epoch number appended, and for “best weight” situations, to also tack on the loss value. The first 3 mods are done for every epoch; the last mod is for the best-model situation. Search the code for the following:
savepath = "./output/last_model" + str(epoch) + ".pth"
savepath1 = "./output/train_loss" + str(epoch) + ".png"
savepath2 = "./output/valid_loss" + str(epoch) + ".png"
savepath = "./output/best_model" + str(epoch) + str(current_valid_loss) + ".pth"
Here is the source to my version of “custom_utils.py“
import albumentations as A
import cv2
import numpy as np
import torch
import matplotlib.pyplot as plt
from albumentations.pytorch import ToTensorV2
from config import DEVICE, CLASSES
import pathlib
save_dest = pathlib.Path("./output")
plt.style.use('ggplot')
# this class keeps track of the training and validation loss values...
# ... and helps to get the average for each epoch as well
class Averager:
    def __init__(self):
        self.current_total = 0.0
        self.iterations = 0.0

    def send(self, value):
        self.current_total += value
        self.iterations += 1

    @property
    def value(self):
        if self.iterations == 0:
            return 0
        else:
            return 1.0 * self.current_total / self.iterations

    def reset(self):
        self.current_total = 0.0
        self.iterations = 0.0
class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's
    validation loss is less than the previous least loss, then save the
    model state.
    """
    def __init__(
        self, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss

    def __call__(
        self, current_valid_loss,
        epoch, model, optimizer
    ):
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"\nBest validation loss: {self.best_valid_loss}")
            print(f"\nSaving best model for epoch: {epoch+1}\n")
            # mod: include epoch and loss in the file name so earlier saves are kept
            savepath = "./output/best_model" + str(epoch) + str(current_valid_loss) + ".pth"
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, savepath)
            #}, './output/best_model.pth')
def collate_fn(batch):
    """
    To handle the data loading as different images may have different number
    of objects and to handle varying size tensors as well.
    """
    return tuple(zip(*batch))
# define the training transforms
def get_train_transform():
    return A.Compose([
        A.Flip(0.5),
        A.RandomRotate90(0.5),
        A.MotionBlur(p=0.2),
        A.MedianBlur(blur_limit=3, p=0.1),
        A.Blur(blur_limit=3, p=0.1),
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc',
        'label_fields': ['labels']
    })
# define the validation transforms
def get_valid_transform():
    return A.Compose([
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc',
        'label_fields': ['labels']
    })
def show_tranformed_image(train_loader):
    """
    This function shows the transformed images from the `train_loader`.
    Helps to check whether the transformed images along with the corresponding
    labels are correct or not.
    Only runs if `VISUALIZE_TRANSFORMED_IMAGES = True` in config.py.
    """
    if len(train_loader) > 0:
        for i in range(1):
            images, targets = next(iter(train_loader))
            images = list(image.to(DEVICE) for image in images)
            targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
            boxes = targets[i]['boxes'].cpu().numpy().astype(np.int32)
            labels = targets[i]['labels'].cpu().numpy().astype(np.int32)
            sample = images[i].permute(1, 2, 0).cpu().numpy()
            for box_num, box in enumerate(boxes):
                cv2.rectangle(sample,
                              (box[0], box[1]),
                              (box[2], box[3]),
                              (0, 0, 255), 2)
                cv2.putText(sample, CLASSES[labels[box_num]],
                            (box[0], box[1]-10), cv2.FONT_HERSHEY_SIMPLEX,
                            1.0, (0, 0, 255), 2)
            cv2.imshow('Transformed image', sample)
            cv2.waitKey(0)
            cv2.destroyAllWindows()
def save_model(epoch, model, optimizer):
    """
    Function to save the trained model till current epoch, or whenever called
    """
    # mod: append the epoch number so earlier checkpoints are not overwritten
    savepath = "./output/last_model" + str(epoch) + ".pth"
    torch.save({
        'epoch': epoch+1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, savepath)
    #}, './output/last_model.pth')
def save_loss_plot(OUT_DIR, train_loss, val_loss, epoch):
    # mod: append the epoch number so each epoch's plots are kept
    savepath1 = "./output/train_loss" + str(epoch) + ".png"
    savepath2 = "./output/valid_loss" + str(epoch) + ".png"
    figure_1, train_ax = plt.subplots()
    figure_2, valid_ax = plt.subplots()
    train_ax.plot(train_loss, color='tab:blue')
    train_ax.set_xlabel('iterations')
    train_ax.set_ylabel('train loss')
    valid_ax.plot(val_loss, color='tab:red')
    valid_ax.set_xlabel('iterations')
    valid_ax.set_ylabel('validation loss')
    #figure_1.savefig(f"{OUT_DIR}/train_loss.png")
    #figure_2.savefig(f"{OUT_DIR}/valid_loss.png")
    figure_1.savefig(savepath1)
    figure_2.savefig(savepath2)
    print('SAVING PLOTS COMPLETE...')
    plt.close('all')
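One thing to watch: these mods write checkpoints to “./output” with the epoch and loss baked into the file name, while the inference script in Step 13 loads “outputs/best_model.pth”. You will need to copy (or rename) whichever checkpoint you want into that expected location before running inference. A minimal sketch, where the source file name is illustrative and should be replaced with the one your run actually produced:

# copy the chosen checkpoint to the path the inference script expects
import shutil
shutil.copyfile("./output/best_model90.1234.pth", "./outputs/best_model.pth")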
Step 12: Train the weights
From Anaconda Prompt:
python train.py
The train.py will use config.py to pull in the images and confirm the annotations. Then it starts training across the 10 epochs. With the custom_utils.py mods above, the ./output folder fills up with last_model<epoch>.pth checkpoints, per-epoch loss plots, and a best_model checkpoint whenever the validation loss improves. Here is a screenshot of my system when it was done.
Step 13: Modifying inference_video.py
This one had the most rewriting of the original inference_video.py. I had to use imutils, I got rid of the hardcoded default to recorded video, and I completely swapped out the while(True) methodology. This version actually works more like the Hotwheels and Matchbox detector from a previous blog.
Here is the source to my version of “inference_mod_video.py“
import numpy as np
import cv2
import torch
#pchenmod import os
import time
#pchenmod import argparse
#pchenmod import pathlib
from imutils.video import VideoStream
import imutils
from model import create_model
from config import (
NUM_CLASSES, DEVICE, CLASSES
)
# this will help us create a different color for each class
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
# load the best model and trained weights
model = create_model(num_classes=NUM_CLASSES)
checkpoint = torch.load('outputs/best_model.pth', map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(DEVICE).eval()
# define the detection threshold...
# ... any detection having score below this will be discarded
detection_threshold = 0.45
RESIZE_TO = (512, 512)
vs = VideoStream(src=0).start()
#cap = cv2.VideoCapture(0)
time.sleep(2.0)
frame_gap = 0
# grab the current frame
frame = vs.read()
if frame is None:
    exit()
frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second
while(True):
    # capture each frame of the video
    frame = vs.read()
    frame = imutils.resize(frame, width = 512, height = 512)
    image = frame.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    # make the pixel range between 0 and 1
    image /= 255.0
    # bring color channels to front
    image = np.transpose(image, (2, 0, 1)).astype(np.float32)
    # convert to tensor
    #pchenmod image = torch.tensor(image, dtype=torch.float).cuda()
    image = torch.tensor(image, dtype=torch.float)
    # add batch dimension
    image = torch.unsqueeze(image, 0)
    # get the start time
    start_time = time.time()
    with torch.no_grad():
        # get predictions for the current frame
        outputs = model(image.to(DEVICE))
    end_time = time.time()
    # get the current fps
    fps = 1 / (end_time - start_time)
    # add `fps` to `total_fps`
    total_fps += fps
    # increment frame count
    frame_count += 1
    # load all detection to CPU for further operations
    outputs = [{k: v.to('cpu') for k, v in t.items()} for t in outputs]
    # carry further only if there are detected boxes
    if len(outputs[0]['boxes']) != 0:
        boxes = outputs[0]['boxes'].data.numpy()
        scores = outputs[0]['scores'].data.numpy()
        # filter out boxes according to `detection_threshold`
        boxes = boxes[scores >= detection_threshold].astype(np.int32)
        draw_boxes = boxes.copy()
        # get all the predicted class names
        pred_classes = [CLASSES[i] for i in outputs[0]['labels'].cpu().numpy()]
        # draw the bounding boxes and write the class name on top of it
        for j, box in enumerate(draw_boxes):
            class_name = pred_classes[j]
            color = COLORS[CLASSES.index(class_name)]
            cv2.rectangle(frame,
                          (int(box[0]), int(box[1])),
                          (int(box[2]), int(box[3])),
                          color, 2)
            cv2.putText(frame, class_name,
                        (int(box[0]), int(box[1]-5)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color,
                        2, lineType=cv2.LINE_AA)
    cv2.imshow("Frame", frame)
    #out.write(frame)
    # press `q` to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
vs.stop()
# close all frames and video windows
cv2.destroyAllWindows()
# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")
The camera source (src=0) defaults to the laptop camera. If you want to use a USB webcam, plug it in and change the VideoStream line to:
vs = VideoStream(src=1).start()
In the code above, detection_threshold is set to 0.45. Feel free to adjust it down if Step 15 doesn’t actually detect hands.
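And if you ever want to point the script back at a recorded clip instead of a live camera (closer to what Sovit’s original did), the loop can be driven by cv2.VideoCapture instead of VideoStream. A minimal sketch, with an illustrative file path:

# hypothetical variant: read frames from a video file instead of the webcam
import cv2
cap = cv2.VideoCapture("data/hands/sample_clip.mp4")   # illustrative path
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # ...same preprocessing, inference, and drawing as in the while loop above...
    cv2.imshow("Frame", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()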
Step 15: Running the inference_mod_video.py
From Anaconda Prompt:
python inference_mod_video.py
The results are like the previous post: