CN115588150A - Pet dog video target detection method and system based on improved YOLOv5-L - Google Patents

Pet dog video target detection method and system based on improved YOLOv5-L

Info

Publication number
CN115588150A
CN115588150A
Authority
CN
China
Prior art keywords
video
pet dog
module
model
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211151017.8A
Other languages
Chinese (zh)
Inventor
黄步添
汪志刚
刘振广
焦颖颖
许曼迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yunxiang Network Technology Co Ltd
Original Assignee
Hangzhou Yunxiang Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yunxiang Network Technology Co Ltd filed Critical Hangzhou Yunxiang Network Technology Co Ltd
Priority to CN202211151017.8A priority Critical patent/CN115588150A/en
Publication of CN115588150A publication Critical patent/CN115588150A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pet dog video target detection method based on improved YOLOv5-L, which comprises the following steps: collecting pet dog image data to construct an initial training set; collecting video data containing pet dogs to construct a test set; performing frame extraction on the videos in the test set and storing the resulting frame images; preprocessing the initial training set to obtain a final training set; improving the YOLOv5-L model, specifically: building a BackBone network, improving the Pred module, and adding an SK attention mechanism behind the BackBone network; setting training parameters, training the improved model, and saving an optimal weight parameter file; loading the weight parameter file into a detector, detecting the videos in the test set, saving all video frames in which a pet dog is detected, and evaluating the detection result with the AP index. The invention reduces the parameter count of the model and improves the detection accuracy on blurred and occluded video frame images.

Description

Pet dog video target detection method and system based on improved YOLOv5-L
Technical Field
The invention relates to the technical field of video target detection, in particular to a pet dog video target detection method and system based on improved YOLOv 5-L.
Background
Pet dogs are currently among the most common companion animals; many people keep them to relieve loneliness or for entertainment. After domestication, dogs are intelligent, active animals that understand human moods and are loyal to their owners, so understanding the behavior of pet dogs is an important research topic.
Target detection is currently a hotspot in the field of computer vision. A traditional classification task generally concerns only the image as a whole and produces a single content description of it, whereas target detection focuses on specific object targets: it must extract the targets of interest from the background and determine their positions, so its output is a list containing the category and position of each target. Existing target detection algorithms generally fall into two types: two-stage detection models and one-stage detection models. A two-stage detection model first generates candidate regions, called region proposals, and then classifies the samples with a convolutional network; common two-stage detection models include R-CNN, SPP-Net and Fast R-CNN. A one-stage detection model does not need to generate region proposals; it extracts features directly from the input data and directly predicts the category and position of an object. Common algorithms include SSD and YOLO.
Although existing two-stage detection models achieve good test accuracy on general-purpose data sets, their detection speed is very slow; in particular, for video detection, two-stage models cannot process videos with an fps greater than 25 in real time. One-stage detection models are faster, and the detection speed of the YOLOv5 model far exceeds that of two-stage models. However, existing target detection models are only well suited to objects with regular shapes; in pet dog video target detection, the dog's form changes as it moves, and the model has difficulty detecting it accurately.
Disclosure of Invention
In view of the above problems, the present invention aims to provide an improved YOLOv 5-L-based target detection model, and to enhance data by preprocessing a data set, so as to improve the accuracy of detecting a motion video frame of a pet dog.
Based on the above purpose, the invention provides a pet dog video target detection method and system based on improved YOLOv 5-L.
A pet dog video target detection method based on improved YOLOv5-L comprises the following steps:
respectively constructing an initial training set and a test set based on the acquired image data containing the pet dog and the acquired video data containing the pet dog;
carrying out frame extraction on the video containing the pet dog to obtain a frame image;
preprocessing the initial training set to obtain a final training set;
improving and training a YOLOv5-L model, specifically comprising the following steps: building a BackBone network, improving a Pred module, and adding an SK attention mechanism behind the BackBone network; setting training parameters, training the improved YOLOv5-L model, and saving an optimal weight parameter file; loading the optimal weight parameter file into a detector, detecting the videos in the final test set, saving all video frames in which a pet dog is detected, and evaluating the detection result with the AP index to obtain the optimal improved YOLOv5-L model;
and inputting the video of the pet dog to be detected into the optimal improved YOLOv5-L model to obtain a corresponding detection result.
As an implementation, constructing the initial training set and the test set comprises the following steps:
obtaining all marked pet dog pictures based on the obtained image data containing the pet dog;
labeling all the pictures, which have different backgrounds, with the LabelImg labeling tool to obtain labeled pet dog pictures, wherein the different backgrounds comprise at least one or more of grassland, snow mountain, indoor and street;
merging the marked pet dog pictures into an initial training set;
searching for videos of interaction between a person and a pet dog on a video website, and downloading and storing the videos with a 4K Video tool;
and cutting the stored videos, splitting the original videos into short clips of 3s-10s, and storing all the short clips to obtain a test set.
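The 3s-10s clip splitting described above can be sketched as follows; `plan_clips` and its parameters are illustrative assumptions, not names from the patent:

```python
# Sketch: plan how one long source video is cut into 3 s - 10 s test clips.

def plan_clips(duration_s, min_len=3.0, max_len=10.0):
    """Return (start, end) second pairs cutting a video into short clips;
    a trailing remainder shorter than min_len is dropped."""
    clips, start = [], 0.0
    while duration_s - start >= min_len:
        end = min(start + max_len, duration_s)
        clips.append((start, end))
        start = end
    return clips

print(plan_clips(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

Each planned (start, end) pair would then be handed to a video cutter; the planning arithmetic alone guarantees every clip length falls in the stated 3s-10s range.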
As an implementation, the frame extraction of the video in the test set and the preprocessing of the initial training set include the following steps:
extracting the video in the test set frame by frame through an extractor algorithm, and storing all video frame images;
selecting, from the video frame images, pictures in which part of the pet dogs have abnormal shapes or motion blur, and labeling them to obtain labeled pictures;
randomly selecting a plurality of labeled pictures to perform left-right translation, multi-picture superposition and proportional scaling to obtain processed labeled pictures with various morphological characteristics;
and merging the processed labeled picture and the initial training set to obtain a final training set.
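The three augmentations named above (left-right translation, multi-picture superposition, proportional scaling) can be sketched on raw NumPy arrays; the function names are assumptions, not from the patent:

```python
import numpy as np

def translate_lr(img, shift):
    """Left-right translation: shift columns, zero-filling the vacated strip."""
    out = np.zeros_like(img)
    if shift >= 0:
        out[:, shift:] = img[:, : img.shape[1] - shift]
    else:
        out[:, :shift] = img[:, -shift:]
    return out

def overlay(img_a, img_b, alpha=0.5):
    """Multi-picture superposition: weighted blend of two same-sized images."""
    return (alpha * img_a + (1 - alpha) * img_b).astype(img_a.dtype)

def scale(img, factor):
    """Proportional scaling by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    ys = (np.arange(int(h * factor)) / factor).astype(int)
    xs = (np.arange(int(w * factor)) / factor).astype(int)
    return img[ys][:, xs]

patch = np.full((4, 4, 3), 100, dtype=np.uint8)
print(scale(patch, 2.0).shape)  # (8, 8, 3)
```

In practice such transforms would also remap the bounding-box labels; the sketch covers only the pixel side.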
As an implementation, the BackBone network is built comprising a down-sampling module, a CBR module, a Res module and a CSP_X module;
the down-sampling module divides the 640×640-pixel RGB image into a 12-channel feature map with a split algorithm, and obtains a 64-channel feature map by convolution;
the CBR module comprises a 3×3 convolution layer, a regularization layer and a ReLU function;
the Res module comprises two CBR modules and an empty-layer (identity) residual, connected with each other;
the CSP_X module is used for extracting features and comprises a CBR module, X Res modules and an empty-layer residual, connected with one another, wherein X denotes the number of Res modules.
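The down-sampling module's split algorithm (assuming the usual YOLOv5 Focus-style slicing, which the 3→12 channel figure above matches) can be sketched as follows:

```python
import numpy as np

def split_downsample(img):
    """Focus-style slice: (H, W, C) -> (H/2, W/2, 4C), e.g. 3 -> 12 channels.
    The four even/odd pixel phases become four channel groups; no pixel
    information is lost, unlike plain strided pooling."""
    return np.concatenate(
        [img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2]],
        axis=-1,
    )

x = np.zeros((640, 640, 3))
print(split_downsample(x).shape)  # (320, 320, 12)
```

The subsequent convolution to 64 channels is an ordinary learned layer and is omitted from the sketch.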
As an implementation, the improved Pred module comprises: adding a flatten algorithm before the output layer to flatten the feature map into one dimension, and replacing the convolution layer in the output layer with a fully connected layer.
As an implementation, the SK attention mechanism comprises a split unit, a fuse unit and a select unit; the split unit convolves the original feature map with convolution kernels of three sizes; the fuse unit calculates the weight of each convolution kernel: it sums the feature maps of the three branches element-wise and generates channel statistics through global average pooling, obtaining a new feature of dimension C×1; the select unit calculates a weight for each branch with softmax and fuses all the branches to form the final output.
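The fuse/select arithmetic can be sketched as below. This is a simplified illustration: the real SK block derives the branch logits from fully connected layers applied to the pooled statistic, which is omitted here, and the stand-in per-branch logits are an assumption.

```python
import numpy as np

def sk_select(branches):
    """branches: list of (C, H, W) feature maps from kernels of different sizes."""
    fused = np.sum(branches, axis=0)                  # element-wise sum of branches
    stat = fused.mean(axis=(1, 2))                    # global average pool -> C x 1 statistic
    # Stand-in for the FC-derived logits: per-branch pooled means (assumption).
    logits = np.stack([b.mean(axis=(1, 2)) for b in branches])            # (B, C)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over branches
    out = sum(w[:, None, None] * b for w, b in zip(weights, branches))
    return out, weights, stat
```

The softmax guarantees the per-channel branch weights sum to 1, so the output is a convex combination of the three branch feature maps.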
As an implementation, the improving YOLOv5-L model and training further comprises the following steps:
modifying the class-number field in the YAML configuration file to change the detection categories, the categories comprising: dog, human;
setting an NMS mechanism to retain the best prediction box and reduce the confidence of the remaining prediction boxes to 0;
setting a Loss function as DIOU _ Loss;
setting the training hyper-parameters: the number of training rounds is 300, the optimizer is an improved SGD, the initial learning rate is 0.01, the learning-rate momentum is 0.95 and the training batch size is 64;
and feeding the training set into the model for training, obtaining the optimal weight parameters through multiple iterations, and saving the file as best.pt.
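The DIOU_Loss chosen above combines IoU with a normalized centre-distance penalty; a minimal sketch for two axis-aligned (x1, y1, x2, y2) boxes:

```python
def diou_loss(box_a, box_b):
    """DIoU loss: 1 - IoU + (centre distance)^2 / (enclosing-box diagonal)^2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # squared distance between box centres
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    return 1.0 - iou + (rho2 / c2 if c2 > 0 else 0.0)

print(diou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

Unlike plain IoU loss, the distance term still provides a gradient when the predicted and ground-truth boxes do not overlap at all.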
As an implementation, the optimal weight parameters are loaded into a detector, a scaling algorithm is added to fix the size of each incoming video frame at 640×640 pixels, the test-set videos are fed in for detection, and all video frames in which a pet dog is detected are saved; the accuracy of the model is evaluated with the AP index, calculated as: AP = number of video frames in which the pet dog is detected / total number of video frames in which a pet dog appears.
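The AP index as defined above is a frame-level ratio (effectively frame recall, not the usual precision-recall-curve AP); a minimal sketch:

```python
def frame_ap(detected_frames, gt_frames):
    """AP as defined in the text: frames in which the dog was detected,
    divided by all frames in which a dog appears. Both arguments are sets
    of frame indices; only frames that truly contain a dog count as hits."""
    hits = len(detected_frames & gt_frames)
    return hits / len(gt_frames) if gt_frames else 0.0

print(frame_ap({1, 2, 3}, {1, 2, 3, 4}))  # 0.75
```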
A pet dog video target detection system based on improved YOLOv5-L comprises a data acquisition module, an image extraction module, a preprocessing module, a model improvement training module and a result detection module;
the data acquisition module is used for respectively constructing an initial training set and a test set based on the acquired image data containing the pet dog and the acquired video data containing the pet dog;
the image extraction module is used for carrying out frame extraction on the video containing the pet dog to obtain a frame image;
the preprocessing module is used for preprocessing the initial training set to obtain a final training set;
the model improvement training module is used for improving and training a YOLOv5-L model, specifically: building a BackBone network, improving a Pred module, and adding an SK attention mechanism behind the BackBone network; setting training parameters, training the improved YOLOv5-L model, and saving an optimal weight parameter file; loading the optimal weight parameter file into a detector, detecting the videos in the final test set, saving all video frames in which a pet dog is detected, and evaluating the detection result with the AP index to obtain the optimal improved YOLOv5-L model;
and the result detection module is used for inputting the video of the pet dog to be detected into the optimal YOLOv5-L model to obtain a corresponding detection result.
Compared with the prior art, the pet dog video target detection method based on the improved YOLOv5-L has the following beneficial effects:
1. by combining a plurality of data sets as training sets, the data volume during training is increased, and the features which can be trained by the model are enriched;
2. by improving the YOLOv5-L model, the parameter quantity of the model is reduced, and the detection speed is increased;
3. blurred and occluded frames are extracted from the test-set videos and merged into the training set, which improves the accuracy of detecting motion-blurred pet dogs; when the form of the pet dog changes, the detection accuracy is higher than that of the unimproved YOLOv5-L model;
4. an SK attention mechanism is added, the attention degree of the model to important features is improved, and local and global relations are better acquired.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention.
Fig. 2 is a diagram of the steps of frame extraction for videos in the test set and preprocessing for the initial training set.
Fig. 3 shows the detection result of a video frame of a video in the test set.
Detailed Description
In order to clearly illustrate the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings, so that those skilled in the art can implement the embodiments with reference to the description.
FIG. 1 is a diagram of the overall process steps of the invention, and a pet dog video target detection method based on improved YOLOv5-L comprises the following steps:
step one, constructing and constructing an initial training set and a test set: collecting a data set dogbred and a data set Dogs vs Cats Redox on a kaggee, and extracting pictures related to pet Dogs in the two data sets; collecting pictures with different background noises (such as grassland, snow mountain, indoor and street), wherein the pictures contain the pictures of pet dogs; labeling all the pictures by using a LabelImg labeling tool to obtain labeled pictures of the pet dog; merging the marked pet dog pictures into an initial training set; collecting videos of interaction between a person and a pet dog on a youtube website, and downloading and storing the videos by using a 4Kvideo tool; and cutting the stored video, splitting the original video into short videos of 3s-10s, and storing all the short videos to obtain a test set.
Step two, performing frame extraction on the videos in the test set and preprocessing the initial training set, specifically: extracting the videos in the test set frame by frame with an extractor algorithm and storing all video frame images; selecting and labeling pictures in which part of the pet dogs have abnormal shapes or motion blur, obtaining labeled pictures; randomly selecting pictures in the training set for left-right translation, multi-picture superposition and scaling, thereby enriching the morphological characteristics of the pet dog; and merging the labeled pictures with the initial training set to obtain the final training set.
Step three, improving the YOLOv5-L model by first building the BackBone network, which specifically comprises a down-sampling module, CBR modules, Res modules and CSP_X modules; the down-sampling module divides the 640×640-pixel RGB image into a 12-channel feature map with a split algorithm and obtains a 64-channel feature map by convolution; the Backbone comprises 5 CBR modules, each consisting of a 3×3 convolution layer, a regularization layer and a ReLU function; the Res module connects two CBR modules with an empty-layer (identity) residual; the CSP_X module, used for extracting the main features, connects a CBR module and X Res modules with an empty-layer residual; the Backbone comprises one CSP_2 module, two CSP_4 modules and one CSP_8 module.
Step four, improving the YOLOv5-L model by then improving the Pred module, specifically: adding a flatten algorithm before the output module to flatten the feature map into one dimension, and replacing the convolution layer in the output module with a fully connected layer; since the model has few detection classes, the fully connected layer does not add excessive parameter computation and can achieve better detection accuracy.
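The flatten-plus-fully-connected output of this step can be sketched as follows; all shapes here are illustrative assumptions, not figures from the patent:

```python
import numpy as np

def pred_head(feature_map, weight, bias):
    """Flatten the feature map to 1-D, then apply a fully connected layer
    in place of the output convolution."""
    flat = feature_map.reshape(-1)   # the flatten algorithm
    return weight @ flat + bias      # fully connected output

rng = np.random.default_rng(0)
fm = rng.normal(size=(8, 4, 4))        # C x H x W feature map (assumed shape)
w = rng.normal(size=(2, 8 * 4 * 4))    # two outputs: dog, human
out = pred_head(fm, w, np.zeros(2))
print(out.shape)  # (2,)
```

The parameter count of the fully connected layer grows with the flattened size times the number of outputs, which stays modest here because only two classes are detected, matching the reasoning in the step above.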
Step five, improving the YOLOv5-L model by adding an SK attention mechanism behind the BackBone network; the SK attention mechanism consists of split, fuse and select parts; the split part first convolves the original feature map with convolution kernels of three sizes; the fuse part calculates the weight of each convolution kernel: it sums the feature maps of the three branches element-wise and generates channel statistics through global average pooling, obtaining a new feature of dimension C×1; the select part calculates a weight for each branch with softmax and fuses all the branches to form the final output.
Step six, training the improved model, specifically: modifying the class-number field in the YAML configuration file to change the detection categories, the categories comprising: dog, human; setting an NMS mechanism to retain the best prediction box and reduce the confidence of the remaining prediction boxes to 0; setting the Loss function to DIOU_Loss; setting the training hyper-parameters: 300 training rounds, an improved SGD optimizer, an initial learning rate of 0.01, a learning-rate momentum of 0.95 and a training batch size of 64; feeding the training set into the model for training, obtaining the optimal weight parameters through multiple iterations, and saving the file as best.pt.
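The NMS mechanism set in this step (retain the best prediction box, drop the confidence of overlapping boxes to 0) can be sketched for a single class; the single suppression round and the threshold value are simplifying assumptions:

```python
def nms_keep_best(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box; zero the confidence of every other box
    whose IoU with it exceeds the threshold."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / ua if ua else 0.0
    best = max(range(len(scores)), key=scores.__getitem__)
    return [s if i == best or iou(boxes[i], boxes[best]) <= iou_thr else 0.0
            for i, s in enumerate(scores)]

print(nms_keep_best([(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)],
                    [0.9, 0.8, 0.7]))  # [0.9, 0.0, 0.7]
```

A distant non-overlapping box keeps its confidence, so multiple dogs in one frame can survive suppression.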
Step seven, loading the weight parameter file best.pt into a detector, adding a scaling algorithm to fix the size of each incoming video frame at 640×640 pixels, feeding in the test-set videos for detection, and saving all video frames in which a pet dog is detected; the accuracy of the model is evaluated with the AP index, calculated as: AP = number of video frames in which the pet dog is detected / total number of video frames in which a pet dog appears.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the present invention.

Claims (9)

1. A pet dog video target detection method based on improved YOLOv5-L is characterized by comprising the following steps:
respectively constructing an initial training set and a test set based on the acquired image data containing the pet dog and the acquired video data containing the pet dog;
carrying out frame extraction on the video containing the pet dog to obtain a frame image;
preprocessing the initial training set to obtain a final training set;
improving and training a YOLOv5-L model, specifically comprising the following steps: building a BackBone network, improving a Pred module, and adding an SK attention mechanism behind the BackBone network; setting training parameters, training the improved YOLOv5-L model, and saving an optimal weight parameter file; loading the optimal weight parameter file into a detector, detecting the videos in the final test set, saving all video frames in which a pet dog is detected, and evaluating the detection result with the AP index to obtain the optimal improved YOLOv5-L model;
and inputting the video of the pet dog to be detected into the optimal YOLOv5-L model to obtain a corresponding detection result.
2. The improved YOLOv 5-L-based pet dog video target detection method as claimed in claim 1, wherein the constructing of the initial training set and the test set comprises the following steps:
obtaining all marked pet dog pictures based on the obtained image data containing the pet dog;
labeling all the pictures, which have different backgrounds, with the LabelImg labeling tool to obtain labeled pet dog pictures, wherein the different backgrounds comprise at least one or more of grassland, snow mountain, indoor and street;
merging the marked pet dog pictures into an initial training set;
searching for videos of interaction between a person and a pet dog on a video website, and downloading and storing the videos with a 4K Video tool;
and cutting the stored videos, splitting the original videos into 3s-10s short clips, and storing all the short clips to obtain a test set.
3. The improved YOLOv 5-L-based pet dog video target detection method as claimed in claim 1, wherein the steps of performing frame extraction on the videos in the test set and performing pre-processing on the initial training set comprise the following steps:
extracting the video in the test set frame by frame through an extractor algorithm, and storing all video frame images;
selecting, from the video frame images, pictures in which part of the pet dogs have abnormal shapes or motion blur, and labeling them to obtain labeled pictures;
randomly selecting a plurality of labeled pictures to perform left-right translation, multi-picture superposition and proportional scaling to obtain processed labeled pictures with various morphological characteristics;
and merging the processed labeled picture and the initial training set to obtain a final training set.
4. The improved YOLOv5-L-based pet dog video target detection method as claimed in claim 1, wherein the BackBone network is built comprising a down-sampling module, a CBR module, a Res module and a CSP_X module;
the down-sampling module divides the 640×640-pixel RGB image into a 12-channel feature map with a split algorithm, and obtains a 64-channel feature map by convolution;
the CBR module comprises a 3×3 convolution layer, a regularization layer and a ReLU function;
the Res module comprises two CBR modules and an empty-layer (identity) residual, connected with each other;
the CSP_X module is used for extracting features and comprises a CBR module, X Res modules and an empty-layer residual, connected with one another, wherein X denotes the number of Res modules.
5. The improved YOLOv5-L-based pet dog video target detection method as claimed in claim 1, wherein the improved Pred module comprises: adding a flatten algorithm before the output layer to flatten the feature map into one dimension, and replacing the convolution layer in the output layer with a fully connected layer.
6. The improved YOLOv5-L-based pet dog video target detection method according to claim 1, wherein the SK attention mechanism comprises a split unit, a fuse unit and a select unit; the split unit convolves the original feature map with convolution kernels of three sizes; the fuse unit calculates the weight of each convolution kernel: it sums the feature maps of the three branches element-wise and generates channel statistics through global average pooling, obtaining a new feature of dimension C×1; the select unit calculates a weight for each branch with softmax and fuses all the branches to form the final output.
7. The improved YOLOv5-L based pet dog video target detection method of claim 1, wherein the improved YOLOv5-L model is trained and further comprises the following steps:
modifying the class-number field in the YAML configuration file to change the detection categories;
setting an NMS mechanism to retain the best prediction box and reduce the confidence of the other prediction boxes to 0;
setting a Loss function as DIOU _ Loss;
setting the training hyper-parameters: the number of training rounds is 300, the optimizer is an improved SGD, the initial learning rate is 0.01, the learning-rate momentum is 0.95 and the training batch size is 64;
and the training set enters a model for training, and the optimal weight parameter is obtained through multiple iterations.
8. The improved YOLOv5-L-based pet dog video target detection method as claimed in claim 1, wherein the optimal weight parameters are loaded into a detector, a scaling algorithm is added to fix the size of each incoming video frame at 640×640 pixels, the test-set videos are fed in for detection, and all video frames in which a pet dog is detected are saved; the accuracy of the model is evaluated with the AP index, calculated as: AP = number of video frames in which the pet dog is detected / total number of video frames in which a pet dog appears.
9. A pet dog video target detection system based on improved YOLOv5-L is characterized by comprising a data acquisition module, an image extraction module, a preprocessing module, a model improvement training module and a result detection module;
the data acquisition module is used for respectively constructing an initial training set and a test set based on the acquired image data containing the pet dog and the acquired video data containing the pet dog;
the image extraction module is used for carrying out frame extraction on the video containing the pet dog to obtain a frame image;
the preprocessing module is used for preprocessing the initial training set to obtain a final training set;
the model improvement training module is used for improving and training a YOLOv5-L model, specifically: building a BackBone network, improving a Pred module, and adding an SK attention mechanism behind the BackBone network; setting training parameters, training the improved YOLOv5-L model, and saving an optimal weight parameter file; loading the optimal weight parameter file into a detector, detecting the videos in the final test set, saving all video frames in which a pet dog is detected, and evaluating the detection result with the AP index to obtain the optimal improved YOLOv5-L model;
and the result detection module is used for inputting the video of the pet dog to be detected into the optimal YOLOv5-L model to obtain a corresponding detection result.
CN202211151017.8A 2022-09-21 2022-09-21 Pet dog video target detection method and system based on improved YOLOv5-L Pending CN115588150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211151017.8A CN115588150A (en) 2022-09-21 2022-09-21 Pet dog video target detection method and system based on improved YOLOv5-L

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211151017.8A CN115588150A (en) 2022-09-21 2022-09-21 Pet dog video target detection method and system based on improved YOLOv5-L

Publications (1)

Publication Number Publication Date
CN115588150A true CN115588150A (en) 2023-01-10

Family

ID=84773007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211151017.8A Pending CN115588150A (en) 2022-09-21 2022-09-21 Pet dog video target detection method and system based on improved YOLOv5-L

Country Status (1)

Country Link
CN (1) CN115588150A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132767A (en) * 2023-10-23 2023-11-28 中国铁塔股份有限公司湖北省分公司 Small target detection method, device, equipment and readable storage medium
CN117132767B (en) * 2023-10-23 2024-03-19 中国铁塔股份有限公司湖北省分公司 Small target detection method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN113470076B (en) Multi-target tracking method for yellow feather chickens in flat raising chicken house
CN112419202B (en) Automatic wild animal image recognition system based on big data and deep learning
CN111783712A (en) Video processing method, device, equipment and medium
CN113487576B (en) Insect pest image detection method based on channel attention mechanism
CN112613548B (en) User customized target detection method, system and storage medium based on weak supervised learning
CN112528058B (en) Fine-grained image classification method based on image attribute active learning
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN113392937B (en) 3D point cloud data classification method and related device thereof
CN112528961A (en) Video analysis method based on Jetson Nano
CN112580458A (en) Facial expression recognition method, device, equipment and storage medium
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN116310718A (en) Method, system and equipment for detecting pest target based on YOLOv5 model
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN116434002A (en) Smoke detection method, system, medium and equipment based on lightweight neural network
CN115588150A (en) Pet dog video target detection method and system based on improved YOLOv5-L
CN111898418A (en) Human body abnormal behavior detection method based on T-TINY-YOLO network
EP3467677A1 (en) Image screening method and device
CN114782859A (en) Method for establishing space-time perception positioning model of target behaviors and application
CN112560668A (en) Human behavior identification method based on scene prior knowledge
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN115578624A (en) Agricultural disease and pest model construction method, detection method and device
CN115294467A (en) Detection method and related device for tea diseases
CN111950586A (en) Target detection method introducing bidirectional attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination