CN112036214A

CN112036214A - Method for identifying small target in low-image-quality video in real time

Info

Publication number: CN112036214A
Application number: CN201910479019.1A
Authority: CN
Inventors: 张昭智
Original assignee: Shanghai Paidao Intelligent Technology Co ltd
Current assignee: Shanghai Paidao Intelligent Technology Co ltd
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2020-12-04

Abstract

The invention provides a method for identifying small targets in a low-image-quality video in real time, in which the small target and a large target have a necessary relative positional relationship, and the small target has a standard state and a non-standard state. A certain number of pictures are extracted from the video as a data set for labeling. The data set is labeled with two categories, small target in the standard state and small target in the non-standard state, and each category corresponds to two rectangular frames: for the standard-state category, the first frame is the large target in the standard state and the second frame is the small target in the standard state; for the non-standard-state category, the first frame is the large target in the non-standard state and the second frame is the small target in the non-standard state. The data set is used as the training reference for a target detection algorithm, and the target detection algorithm is then used to identify the target state in the video. The invention achieves higher detection accuracy.

Description

Method for identifying small target in low-image-quality video in real time

Technical Field

The invention relates to the field of computer vision, in particular to a method for identifying small targets in a low-quality video in real time.

Background

The task of target detection in computer vision is to extract, from an image, information that a computer can understand. In the actual detection process, the position of each target in the picture must be obtained in addition to its category. At present, target detection algorithms based on deep learning fall into two major categories: classification-based target detection algorithms and regression-based target detection algorithms.

The classification-based target detection algorithm divides the target detection process into two stages. The first stage selects candidate regions, and the second stage classifies the candidate regions and adjusts their positions; the target detection result is obtained after both stages. A typical model of this scheme is the Faster region-based convolutional neural network algorithm (Faster R-CNN) proposed by Ren S. et al. in 2015. By introducing a candidate region generation network (RPN), the target detection system is divided into two modules: the first module is a deep fully convolutional network that proposes candidate regions, and the second module is a Fast R-CNN detector that performs detection on the proposed regions. The whole system is a single, unified target detection network. The Faster R-CNN algorithm works as follows. First, the whole picture is taken as input and a feature layer is obtained through convolution; the convolutional features are then input into the RPN to obtain the feature information of the candidate frames. Next, a classifier judges whether the features extracted from each candidate frame belong to a specific class. Finally, a regressor further adjusts the position of each candidate frame that belongs to a class. The whole network shares the feature information extracted by the convolutional neural network.

On a convolutional feature map of a given size, the RPN can generate candidate frames of several sizes, which leads to a mismatch between the variable target size and the fixed receptive field. If the number of candidate frames is increased, the detection speed of the algorithm drops, and the real-time requirements of an actual production environment become difficult to meet.

The regression-based target detection algorithm simplifies the target detection process into a unified end-to-end regression problem, so that the position and category information of the detected target can be obtained simultaneously by processing the picture only once (in contrast to selecting and classifying multiple candidate regions). Unlike two-stage models based on region extraction, the single-stage approach can achieve feature sharing through a single complete training. Typical representatives of such algorithms are You Only Look Once (YOLO) and the Single Shot MultiBox Detector (SSD). The following description takes SSD as an example.

In 2016, Liu W. et al. proposed the SSD algorithm, which applies a single deep neural network to image target detection. In the SSD framework, the localization bounding boxes are defined as a set of spatially discrete default boxes that correspond to different aspect ratios and mapping locations. During prediction, the network generates a probability score for each target class in each default box and adjusts the default box to better match the target shape. In addition, the network combines predictions from multiple feature maps of different resolutions, which allows it to detect targets of multiple sizes.
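The prediction step described above (a class score per default box, plus an adjustment of the box) can be sketched as follows. This is a minimal illustration, assuming the center-size offset parameterization commonly used with SSD and omitting the variance scaling found in many implementations; the function names `decode_box` and `softmax` are hypothetical and not taken from the patent.

```python
import math

def decode_box(default_box, offsets):
    """Adjust a default box (cx, cy, w, h) with predicted offsets (t_cx, t_cy, t_w, t_h).

    Assumed parameterization: center offsets are relative to the default box
    size, and size offsets are log-scale factors.
    """
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return (cx, cy, w, h)

def softmax(scores):
    """Turn raw per-class scores for one default box into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Zero offsets leave the default box unchanged.
unchanged = decode_box((0.5, 0.5, 0.2, 0.4), (0.0, 0.0, 0.0, 0.0))
# Hypothetical scores for (background, standard state, non-standard state).
probabilities = softmax([0.1, 2.0, -1.0])
```

With zero offsets the decoded box equals the default box, which makes the sketch easy to sanity-check before plugging in real network outputs.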

In the SSD algorithm, because there is no candidate-region stage, region regression is difficult and convergence problems arise easily. Feature maps from different layers of the SSD are used as independent inputs to the classification network, so the same object is detected by frames of different sizes at the same time, which causes repeated computation. Because small targets correspond to small areas in the feature map and cannot be trained sufficiently, the detection performance of the SSD on small targets is still not ideal.

When existing computer vision technology is used to detect and identify small targets in a low-quality video, the detection accuracy obtained with traditional deep learning methods is low because the targets are small.

Therefore, the problem of detecting and identifying small targets in low-quality videos in real time urgently needs to be solved.

Disclosure of Invention

The invention aims to solve the problem of detecting and identifying small targets in a low-quality video in real time. It improves on existing computer vision technology so that small targets in a low-quality video are detected and identified with high accuracy.

To achieve this purpose, the invention provides a method for identifying small targets in a low-quality video in real time, in which the small target and a large target have a necessary relative positional relationship, and the small target has a standard state and a non-standard state. A certain number of pictures are extracted from the video as a data set for labeling. The data set is labeled with two categories, small target in the standard state and small target in the non-standard state, and each category corresponds to two rectangular frames: for the standard-state category, the first frame is the large target in the standard state and the second frame is the small target in the standard state; for the non-standard-state category, the first frame is the large target in the non-standard state and the second frame is the small target in the non-standard state. The data set is used as the training reference for a target detection algorithm, and the target detection algorithm is then used to identify the target state in the video.

Wherein the small target is inside the large target. The small target is a human head, the large target is a human body, and the standard state of the small target is a state in which a safety helmet is worn on the human head.

Further, the target detection algorithm is an SSD algorithm. The localization bounding box of the object detection algorithm is defined as a set of spatially discrete default boxes and corresponds to different aspect ratios and mapping locations. When the target detection algorithm performs prediction, the network generates a corresponding probability score for the target category in each default frame, and adjusts the default frame to achieve good matching with the target shape. The network in the target detection algorithm also makes a complete prediction on targets with different image qualities by combining a plurality of feature mappings of the targets, and realizes a detection task on the targets with multiple sizes.

A new correspondence is formed in the target detection algorithm according to the relationship between the large and small targets and the state classification, and it replaces the corresponding feature layer in the algorithm. Specifically, the input picture for detection is first compressed and input into the target detection algorithm, which yields a first loss value; the corresponding image region is then extracted using the output position information and input into the algorithm to obtain a second loss value and a detection result. A total loss value is obtained as a linear combination of the first loss value and the second loss value, and the model is trained using this process. Further, in the model prediction stage, the detection result can be output directly by the algorithm that yields the second loss value, which speeds up the computation of the model.

Compared with the prior art, the method uses the correlation between objects so that the target position can be located quickly during detection and the target area can then be classified directly, which yields a detection scheme with both high detection accuracy and high detection speed. The scheme also has a low storage footprint, and for the small-target detection problem in low-image-quality video it achieves higher detection accuracy than other deep learning network models with the same detection speed and storage footprint.

In order to make the aforementioned objects, features and embodiments of the present invention more comprehensible, the following detailed description of the structural design and operational procedures of the present invention is provided in conjunction with the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a schematic diagram of an algorithm according to an embodiment of the present invention;

FIG. 2 is a flow chart of one embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Referring first to FIG. 1, FIG. 1 is a schematic view illustrating the present invention. One embodiment of the present invention monitors and identifies whether a worker in a work area is wearing a safety helmet. Because the safety helmet is a small target, it is difficult to identify clearly in existing low-quality monitoring video, especially where automatic identification by a computer is required. In this embodiment, since the head of a person is attached to the body of the person, the correlation between the two objects is exploited: the human body is detected first, the detected body is then cropped from the picture, and a secondary detection is performed on the cropped region to determine whether the head is wearing a safety helmet. The human body occupies a larger part of the original image than the head does, so the detection accuracy is higher.
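The benefit of cropping the body before looking for the head can be illustrated with a small sketch. It assumes pixel boxes in `(x_min, y_min, x_max, y_max)` form and an image stored as a list of rows; the helper names and the example frame and box sizes are hypothetical values chosen only for illustration.

```python
def crop_region(image, box):
    """Cut box = (x_min, y_min, x_max, y_max) out of an image given as a list of rows."""
    x_min, y_min, x_max, y_max = box
    height = len(image)
    width = len(image[0]) if height else 0
    # Clamp the box to the image so a slightly oversized detection still crops cleanly.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [row[x_min:x_max] for row in image[y_min:y_max]]

def relative_area(box, image_width, image_height):
    """Fraction of the image area covered by the box."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min) / (image_width * image_height)

# Hypothetical 1920x1080 frame with one worker.
body_box = (900, 300, 1020, 660)   # 120 x 360 pixels
head_box = (940, 300, 980, 350)    # 40 x 50 pixels

head_share_in_frame = relative_area(head_box, 1920, 1080)
# The same head, measured against the cropped body region instead of the full frame.
head_share_in_crop = relative_area(head_box, 120, 360)

tiny_image = [[row * 10 + col for col in range(6)] for row in range(4)]
tiny_crop = crop_region(tiny_image, (1, 1, 4, 3))
```

In this example the head covers roughly 0.1% of the full frame but several percent of the body crop, which is the effect the embodiment relies on.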

In one embodiment of the invention, the network structure, the model size and the calculation speed of the SSD network model are optimized, and a target detection and identification algorithm that can effectively identify small targets in low-quality video is established. The description below takes as an example detecting whether a worker is wearing a safety helmet in an actual plant environment.

Monitoring data collected by actual video monitoring equipment are used. A certain number of pictures are extracted from monitoring videos recorded in different seasons, in different weather and at different times of the year, and these pictures serve as the data set to be labeled. The data are selected manually to avoid data imbalance, and the selected data are then annotated. During annotation, the collected data set is labeled with two categories, wearing a safety helmet and not wearing a safety helmet, and each category corresponds to two rectangular frames as targets. For the category of wearing a safety helmet, the first frame is the whole human body and the second frame is the safety helmet; for the category of not wearing a safety helmet, the first frame is the whole human body and the second frame is the head.
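The labeling scheme above can be sketched as a small data structure. This is an illustrative assumption about how such labels might be stored, not a format given in the patent; the category names `helmet` and `no_helmet`, the `Annotation` class and the helper functions are hypothetical. The containment check assumes the small target lies inside the large target, as in the helmet embodiment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

CATEGORIES = ("helmet", "no_helmet")  # hypothetical names for the two categories

@dataclass(frozen=True)
class Annotation:
    category: str   # one of CATEGORIES
    large_box: Box  # first frame: the whole human body
    small_box: Box  # second frame: the safety helmet, or the bare head

def is_valid(annotation: Annotation) -> bool:
    """Check the category and that the small box lies inside the large box."""
    if annotation.category not in CATEGORIES:
        return False
    lx0, ly0, lx1, ly1 = annotation.large_box
    sx0, sy0, sx1, sy1 = annotation.small_box
    return lx0 <= sx0 < sx1 <= lx1 and ly0 <= sy0 < sy1 <= ly1

def class_counts(annotations):
    """Count labels per category, e.g. to spot data imbalance during selection."""
    return Counter(a.category for a in annotations)

labels = [
    Annotation("helmet", (900, 300, 1020, 660), (940, 300, 980, 340)),
    Annotation("no_helmet", (200, 280, 330, 650), (245, 280, 290, 335)),
]
counts = class_counts(labels)
```

Counting labels per category is one simple way to check the balance that the manual selection step is meant to ensure.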

The algorithm model adopted by this scheme is based on the existing SSD model, and part of its connection structure is shown in Table 1. The average pooling layer and the Softmax layer in the original SSD network are removed, and three new feature layers are added after the Conv2d_13_pointwise layer through three groups of depthwise and pointwise convolution kernels. The Conv2d_11_pointwise and Conv2d_13_pointwise layers of the original MobileNet network and the newly added Conv2d_14_pointwise, Conv2d_15_pointwise, Conv2d_16_pointwise and Conv2d_17_pointwise layers are taken as the feature extraction layers for the SSD anchor boxes. The anchor boxes are configured as follows: the minimum scale factor is set to 0.2 and the maximum scale factor to 0.9; the scale factors of the anchor boxes on the six feature layers are 0.2, 0.34, 0.48, 0.62, 0.86 and 0.9, respectively; and five aspect ratios plus an additional 1:1 box are configured for the anchor boxes of each layer, so that there are six anchor boxes per anchor position on each feature layer.
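The anchor configuration can be sketched as follows, assuming the linear scale rule and the aspect-ratio construction commonly used with SSD. The five aspect-ratio values and the scale used for the extra 1:1 box are assumptions, since the text does not list them. Under the linear rule, a minimum of 0.2 and a maximum of 0.9 over six layers give 0.76 for the fifth layer, whereas the text lists 0.86; the sketch does not resolve that difference.

```python
import math

def linear_scales(s_min, s_max, num_layers):
    """Evenly spaced scale factors from s_min to s_max, one per feature layer."""
    step = (s_max - s_min) / (num_layers - 1)
    return [round(s_min + step * k, 2) for k in range(num_layers)]

def anchor_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs, relative to the image, for one anchor position.

    One box per aspect ratio, plus one extra square box whose scale is the
    geometric mean of this layer's scale and the next layer's scale (assumed).
    """
    shapes = [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes

ASSUMED_ASPECT_RATIOS = [1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0]  # not given in the text

scales = linear_scales(0.2, 0.9, 6)
first_layer_boxes = anchor_shapes(scales[0], scales[1], ASSUMED_ASPECT_RATIOS)
```

Five aspect ratios plus the extra square box give the six anchor boxes per position described above.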

Analysis of the MobileNet network structure shows that MobileNet maintains a high image classification accuracy while greatly reducing network parameters and computation, and that it can still extract image features well at this much lower computational cost. After the network is modified as described above, the extracted feature maps are smaller than those of the SSD, and the number of anchor boxes required by the new network is only one third of that of the SSD network. Experience and experiments also show that this adjustment clearly improves the detection performance of the algorithm.
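The one-third figure can be checked with a rough count. The feature map sizes below are assumptions, since the text does not state them: a 300x300 input is assumed, which would give 19x19 and 10x10 maps at the Conv2d_11_pointwise and Conv2d_13_pointwise layers and 5, 3, 2 and 1 for the added layers. The comparison uses the standard SSD300 layout with 8732 default boxes.

```python
def total_anchors(feature_map_sizes, anchors_per_location):
    """Sum anchors over square feature maps: size * size * anchors at each position."""
    return sum(
        size * size * count
        for size, count in zip(feature_map_sizes, anchors_per_location)
    )

# Assumed layout of the modified network: six anchors per position on every layer.
modified_count = total_anchors([19, 10, 5, 3, 2, 1], [6, 6, 6, 6, 6, 6])

# Standard SSD300 layout for comparison.
ssd300_count = total_anchors([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4])

ratio = modified_count / ssd300_count
```

Under these assumptions the modified network uses 3000 anchor boxes against 8732, a ratio of about 0.34, which is consistent with the one-third figure stated above.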

Table 1. Network structure added by this scheme compared with the original network model

The above algorithm is abbreviated as MSSD, and the data flow diagram of the present invention is shown in FIG. 1. The input picture for detection is first compressed and input into the algorithm MSSD_1, which yields a first loss value Loss_1. The corresponding image region is then extracted using the output position information and input into the algorithm MSSD_2, which yields a second loss value Loss_2. A total loss value is obtained as a linear combination of the first loss value Loss_1 and the second loss value Loss_2, and the model is trained using this process. In the model prediction stage, the detection result can be output directly by the algorithm MSSD_2, which speeds up the computation of the model and reduces the storage space it occupies.

With the detection algorithm proposed herein, the trained model detects small targets in low-image-quality video clearly better than the original SSD algorithm.
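The training flow can be sketched as one step that chains the two stages and combines their losses. This is a minimal sketch under stated assumptions: the two stages are passed in as plain callables standing in for MSSD_1 and MSSD_2, the image compression step is left to the caller, and the weights `weight_1` and `weight_2` are hypothetical, since the text only says that the total loss is a linear combination.

```python
def training_step(image, stage_1, stage_2, crop, weight_1=1.0, weight_2=1.0):
    """Run both stages on one picture and return (detection result, total loss).

    stage_1(image)  -> (body_box, loss_1)   stands in for MSSD_1
    stage_2(region) -> (result, loss_2)     stands in for MSSD_2
    crop(image, box) extracts the region located by the first stage.
    """
    body_box, loss_1 = stage_1(image)
    region = crop(image, body_box)
    result, loss_2 = stage_2(region)
    total_loss = weight_1 * loss_1 + weight_2 * loss_2
    return result, total_loss

# Stub stages with fixed outputs, used only to exercise the data flow.
def stub_stage_1(image):
    return (0, 0, 2, 2), 0.5

def stub_stage_2(region):
    return "helmet", 0.25

def simple_crop(image, box):
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]

picture = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
result, loss = training_step(
    picture, stub_stage_1, stub_stage_2, simple_crop, weight_1=1.0, weight_2=2.0
)
```

In a real training loop the two losses would come from the detection networks and the total would be backpropagated; the stubs here only show how the values are combined.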

Referring next to FIG. 2, FIG. 2 is a flowchart illustrating an embodiment of the present invention. In the embodiment shown in FIG. 2, in the first step, a large target and a small target are selected; the small target and the large target have a necessary relative positional relationship, and the small target can be inside the large target or at a determined position outside it. The small target has a standard state and a non-standard state. In the second step, a certain number of pictures are extracted from the video as a data set for labeling; the data set is labeled with two categories, small target in the standard state and small target in the non-standard state, and each category corresponds to two rectangular frames: for the standard-state category, the first frame is the large target in the standard state and the second frame is the small target in the standard state; for the non-standard-state category, the first frame is the large target in the non-standard state and the second frame is the small target in the non-standard state. In the third step, a new correspondence is formed in the target detection algorithm according to the relationship between the large and small targets and the state classification, and it replaces the corresponding feature layer in the algorithm. In the fourth step, the input picture for detection is compressed and input into the algorithm, which yields a first loss value. In the fifth step, the corresponding image region is extracted using the output position information and input into the algorithm to obtain a second loss value and a detection result. In the sixth step, a total loss value is obtained as a linear combination of the first loss value and the second loss value, and the model is trained using this process.
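The fourth and fifth steps imply some coordinate bookkeeping between the compressed picture, the original frame and the extracted region. The sketch below assumes that the first stage reports a box in the coordinates of the compressed picture, that the region is cut from the original frame, and that second-stage boxes are mapped back to the frame for display; none of these details are specified in the text, and the function names and sizes are hypothetical.

```python
def scale_box(box, from_size, to_size):
    """Rescale a pixel box from one image size (width, height) to another."""
    from_w, from_h = from_size
    to_w, to_h = to_size
    sx, sy = to_w / from_w, to_h / from_h
    x_min, y_min, x_max, y_max = box
    return (round(x_min * sx), round(y_min * sy), round(x_max * sx), round(y_max * sy))

def to_frame_coords(box_in_region, region_origin):
    """Shift a box found inside an extracted region back into frame coordinates."""
    x_min, y_min, x_max, y_max = box_in_region
    origin_x, origin_y = region_origin
    return (x_min + origin_x, y_min + origin_y, x_max + origin_x, y_max + origin_y)

# Large target found on an assumed 300x300 compressed picture of a 1920x1080 frame.
body_in_compressed = (100, 50, 150, 200)
body_in_frame = scale_box(body_in_compressed, (300, 300), (1920, 1080))

# Small target found inside the extracted body region, mapped back to the frame.
head_in_region = (10, 0, 50, 40)
head_in_frame = to_frame_coords(head_in_region, body_in_frame[:2])
```

Extracting the region from the original frame rather than from the compressed picture keeps as many pixels as possible on the small target, which is one plausible reading of the flow.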

In the model prediction stage, the detection result can be output directly by the algorithm that yields the second loss value, which speeds up the computation of the model.

The localization bounding box of the object detection algorithm is defined as a set of spatially discrete default boxes and corresponds to different aspect ratios and mapping locations. When the target detection algorithm performs prediction, the network generates a corresponding probability score for the target category in each default frame, and adjusts the default frame to achieve good matching with the target shape. The network in the target detection algorithm also makes a complete prediction on targets with different image qualities by combining a plurality of feature mappings of the targets, and realizes a detection task on the targets with multiple sizes.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for identifying a small target in a low-image-quality video in real time, characterized in that the small target and a large target have a necessary relative positional relationship, and the small target has a standard state and a non-standard state; a certain number of pictures are extracted from the video as a data set for labeling; the data set is labeled with two categories, small target in the standard state and small target in the non-standard state, and each category corresponds to two rectangular frames: for the standard-state category, the first frame is the large target in the standard state and the second frame is the small target in the standard state; for the non-standard-state category, the first frame is the large target in the non-standard state and the second frame is the small target in the non-standard state; the data set is used as the training reference for a target detection algorithm; and the target detection algorithm is used to identify the target state in the video.

2. The method of claim 1, wherein the small object is inside the large object.

3. The method according to claim 1, wherein the small object is a head, the large object is a human body, and the standard state of the small object is a state in which a helmet is worn on the head.

4. The method of claim 1, wherein the object detection algorithm is a Single Shot MultiBox Detector (SSD) algorithm.

5. The method as claimed in claim 1, wherein the location bounding box of the object detection algorithm is defined as a set of spatially discrete default boxes corresponding to different aspect ratios and mapping locations.

6. The method of claim 1, wherein the target detection algorithm generates a probability score for each target class in the default frame during the prediction, and adjusts the default frame to achieve a good match with the target shape.

7. The method as claimed in claim 1, wherein the network of the object detection algorithm further performs a complete prediction of the objects with different image quality combined with the feature maps thereof to perform a task of detecting the objects with different sizes.

8. The method as claimed in claim 1, wherein the object detection algorithm is further characterized by forming a new corresponding relation to replace the corresponding feature layer in the algorithm according to the relation and status classification of the large and small objects.

9. The method as claimed in claim 8, wherein the input image for detection is input into the algorithm by image compression to obtain a first loss value, the corresponding image position is extracted by using the output position information, and then the input image is input into the algorithm to obtain a second loss value and a detection result, and the total loss value is obtained by using the linear combination of the first loss value and the second loss value, and the model training is performed by using the process.

10. The method as claimed in claim 9, wherein the detecting result is directly outputted by an algorithm for obtaining the second loss value in the model predicting stage, thereby speeding up the calculation of the model.