CN112818913B - Real-time smoking calling identification method - Google Patents


Info

Publication number: CN112818913B
Authority
CN
China
Prior art keywords: smoking, calling, real-time, pedestrian
Prior art date
Legal status: Active
Application number: CN202110207092.0A
Other languages: Chinese (zh)
Other versions: CN112818913A
Inventors: 张全, 赵磊, 彭博, 周文俊, 张伟, 涂然
Current Assignee: Southwest Petroleum University
Original Assignee: Southwest Petroleum University
Application filed by Southwest Petroleum University
Priority application: CN202110207092.0A
Application publication: CN112818913A
Grant publication: CN112818913B

Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213: Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06V 20/52: Scenes; context or environment of the image; surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection
    • Y02P 90/30: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation; computing systems specially adapted for manufacturing

Abstract

The invention discloses a real-time smoking and calling identification method comprising the following steps: S1: establishing a real-time smoking and calling identification model; S2: locating the pedestrian area in a surveillance video of the target scene with a multi-target tracking algorithm; S3: performing real-time smoking and calling identification on the pedestrian area with the model. The invention constructs the backbone network from Se-Res2Block basic modules, which fuses more features and raises detection speed. To address the low identification precision on small targets, the resolution of the input image is increased and the SPP and ASFF modules are introduced, strengthening context interaction and improving small-target precision. Most existing smoking and calling detection methods rely on single-frame detection and suffer a high false detection rate; the invention introduces multi-frame information through a multi-target tracking algorithm and computes the IOU between the pedestrian area and the rectangular boxes of the mobile phone and cigarette, which reduces the false detection rate and yields higher robustness.

Description

Real-time smoking calling identification method
Technical Field
The invention relates to the technical field of target detection in computer vision, in particular to a real-time smoking and calling identification method.
Background
Smoking and calling identification plays an important role in fields such as gas stations and chemical engineering. In practical applications, existing smoking and calling recognition algorithms have the following disadvantages: (1) smoking and calling detection typically employs target detection algorithms such as SSD and RCNN, but these algorithms demand substantial GPU resources, increasing deployment cost; (2) mobile phones and cigarettes are small targets and are difficult to detect; (3) single-frame detection yields a high false detection rate and low robustness.
Disclosure of Invention
In view of the above problems, the present invention is directed to a real-time smoking and phone call recognition method.
The technical scheme of the invention is as follows:
a real-time smoking and calling identification method comprises the following steps:
s1: establishing a real-time smoking and calling identification model;
s2: positioning a pedestrian area by utilizing a multi-target tracking algorithm according to a monitoring video of a target scene;
s3: and performing smoking and calling real-time identification on the pedestrian area according to the real-time smoking and calling identification model.
Preferably, in step S1, the establishing of the real-time smoking and calling recognition model specifically includes the following sub-steps:
s11: collecting smoking and calling picture data to obtain a data set, labeling objects in the data set, and dividing the data set into a training set, a verification set and a test set, wherein the objects comprise pedestrians, mobile phones and cigarettes;
s12: establishing an improved YOLOV3 model, which specifically comprises the following substeps:
s121: constructing a lightweight backbone network by taking Se-Res2Block as a basic module;
s122: introducing an SPP module behind the lightweight backbone network, and adjusting the resolution of an input image;
s123: adding an ASFF module to obtain the improved Yolov3 model;
s13: generating anchors by using the data in the training set through a kmeans clustering algorithm, wherein the number of the anchors required to be generated is 9;
s14: training and verifying the improved Yolov3 model by taking data of a training set and a verification set as input of the improved Yolov3 model;
s15: taking data of a test set as input of the improved Yolov3 model, and testing the accuracy of the improved Yolov3 model; and when the accuracy reaches a target threshold value, obtaining the real-time smoking and calling identification model.
Preferably, in step S11, when labeling the object in the data set, labeling is performed using yolomark.
Preferably, in step S11, the training set, the validation set and the test set are divided in a ratio of 8:1:1.
Preferably, in step S121, the Se-Res2Block module consists of two parts, namely a Res2Block module and a channel attention mechanism, with the channel attention mechanism arranged behind the Res2Block module.
Preferably, the lightweight backbone network comprises, connected in sequence: two 3 × 3 convolutions with stride 2; a bottleneck layer followed by three Se-Res2Block modules with 36 convolution kernels; a bottleneck layer followed by three Se-Res2Block modules with 72 convolution kernels; and a bottleneck layer followed by three Se-Res2Block modules with 144 convolution kernels.
Preferably, in step S122, introducing the SPP module behind the lightweight backbone network specifically comprises: downsampling the output of the upper layer through three maximum pooling kernels with stride 1 and sizes 5 × 5, 9 × 9 and 13 × 13, and fusing the three downsampled outputs with the output of the upper layer by concatenation, obtaining receptive fields of different sizes.
Preferably, in step S122, when the resolution of the input image is adjusted, the resolution of the input image is adjusted to 512 × 512.
Preferably, step S123 specifically comprises: adding the ASFF fusion mode on the basis of YOLOV3. YOLOV3 outputs at three scales; ASFF assigns a weight parameter to the feature layer of each scale, the three weight parameters summing to 1. During fusion, the feature layers of different scales are adjusted to the same size through upsampling or downsampling, and each feature layer is multiplied by its respective weight parameter to form the output.
Preferably, in step S14, when training the improved YOLOV3 model, data enhancement is performed on the data in the training set, specifically, data enhancement is performed by changing one or more of an angle, a contrast and a brightness of a picture of the data set.
Preferably, in step S2, when the pedestrian area is located, a deepsort multi-target tracking algorithm is used for locating.
Preferably, in step S3, when performing real-time smoking and calling identification on the pedestrian area: when the confidence output by the real-time smoking and calling identification model exceeds the threshold of 0.5, a valid cigarette or mobile phone is considered detected, and the smoking and calling judgment is then carried out, the specific judgment method comprising:
determining a pedestrian head area;
if the detection result is the mobile phone and the IOU values of the mobile phone area and the pedestrian head area are larger than 0.08, the calling behavior is considered to exist;
if the detection result is a cigarette and the IOU values of the cigarette area and the pedestrian head area are larger than 0, the smoking behavior is considered to exist;
recording the judgment result of each frame for each pedestrian: if the judgment in the current frame is that smoking or calling behaviour exists, the score is 1 × 0.2; if the judgment is normal, the score is 1 × 0; this gives the instantaneous judgment result of the current frame;
adding the instantaneous judgment results of the same pedestrian in 5 continuous frames, if the calculation result is more than 0.5, considering that the pedestrian has smoking or calling behaviors in the period of time, otherwise, judging that the pedestrian is normal.
Preferably, when the head area of the pedestrian is determined, the pedestrian is divided according to the proportion of the head to the body, wherein the proportion of the head to the body is 1.
The invention has the beneficial effects that:
in the real-time smoking and calling identification model, the backbone network is constructed from Se-Res2Block basic modules, which fuses more features at higher speed; the resolution of the input picture is increased and the SPP and ASFF modules are introduced, which better fuses context information and improves the detection precision of small target objects; most existing smoking and calling methods adopt single-frame detection and have a high false detection rate, whereas the invention introduces multi-frame information through a multi-target tracking algorithm and calculates the IOU between the pedestrian area and the rectangular boxes of the mobile phone and cigarette, reducing the false detection rate and giving higher robustness.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a real-time smoking and phone call recognition method according to the present invention;
FIG. 2 is a schematic diagram of a Se-Res2Block base module according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating sizes of convolution kernels and a connection manner adopted by an SPP module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an ASFF module fusion in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of an improved YOLOV3 detection network according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the results of cigarette testing in accordance with one embodiment of the improved YOLOV3 network of the present invention;
FIG. 7 is a diagram illustrating a cigarette inspection result of an embodiment of an original Yolov3 network;
fig. 8 is a schematic diagram illustrating a mobile phone detection result of an embodiment of the improved YOLOV3 network according to the present invention;
fig. 9 is a schematic diagram of a mobile phone detection result of an embodiment of an original YOLOV3 network;
fig. 10 is a schematic diagram of a cigarette, a cell phone area and a pedestrian head area IOU according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
It should be noted that, in the present application, the embodiments and the technical features of the embodiments may be combined with each other without conflict.
It is noted that, unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
As shown in fig. 1-10, a real-time smoking and calling identification method includes the following steps:
s1: establishing a real-time smoking and calling identification model, which specifically comprises the following substeps:
s11: the method comprises the steps of collecting smoking and calling picture data to obtain a data set, marking objects in the data set, and dividing the data set into a training set, a verification set and a test set, wherein the objects comprise pedestrians, mobile phones and cigarettes.
In a specific embodiment, smoking and calling behaviors are simulated under a monitoring camera, then a monitoring video is collected, a picture is captured from the obtained video every 24 frames, useless pictures are manually removed, and effective pictures are used as a data set.
In a specific embodiment, when the objects in the data set are labeled, yolomark is used for labeling. It should be noted that, in addition to the labeling by using the method in the present embodiment, other labeling methods in the prior art may also be used for labeling.
In a specific embodiment, the training set, the validation set and the test set are divided in a ratio of 8:1:1. It should be noted that the division ratio may be adjusted according to the amount of data in the data set; besides the ratio used in this embodiment, other division ratios such as 6:2:2 may also be used.
S12: establishing an improved YOLOV3 model, which specifically comprises the following substeps:
s121: constructing a lightweight backbone network with Se-Res2Block as the basic module. The Se-Res2Block module consists of a Res2Block module and a channel attention mechanism, the channel attention mechanism being arranged behind the Res2Block module. The Res2Block module first passes the input features through a bottleneck layer and then splits them into four parts X1, X2, X3 and X4. The X2 part extracts features with a grouped convolution of kernel size 3 and group number 2, and is then fused with the X3 part by addition; the X3 and X4 parts fuse features in the same way. Finally, the X1 part and the feature-extracted X2, X3 and X4 parts are fused by concatenation; the fused features pass through the channel attention mechanism to extract useful features, and a bottleneck layer reduces dimensionality at the end. The channel attention mechanism compresses the features by average pooling, then reduces the dimension to 1/r through an FC layer (r is taken as 16 in a specific embodiment), finally adjusts the weight of each channel with a logistic function, and multiplies the obtained weights with the corresponding input features. The lightweight backbone network comprises, connected in sequence: two 3 × 3 convolutions with stride 2; a bottleneck layer followed by three Se-Res2Block modules with 36 convolution kernels; a bottleneck layer followed by three Se-Res2Block modules with 72 convolution kernels; and a bottleneck layer followed by three Se-Res2Block modules with 144 convolution kernels.
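The channel attention step described above (global average pooling, an FC layer reducing to 1/r, a logistic gate, then channel-wise rescaling) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the patented implementation: the function name `se_attention` and the explicit weight matrices `w1`/`w2` are made up for the example, and a real network learns these weights and operates on tensors.

```python
import math

def se_attention(features, w1, w2):
    """Squeeze-and-excitation channel attention sketch.

    features: list of C channels, each a 2-D list (H x W).
    w1: C x (C/r) weight matrix of the reducing FC layer.
    w2: (C/r) x C weight matrix of the expanding FC layer.
    """
    c = len(features)
    # Squeeze: one scalar per channel via global average pooling.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation, part 1: FC down to C/r units with ReLU.
    hidden = [max(0.0, sum(z[i] * w1[i][j] for i in range(c)))
              for j in range(len(w1[0]))]
    # Excitation, part 2: FC back up to C units with a logistic (sigmoid) gate.
    scores = [1.0 / (1.0 + math.exp(-sum(hidden[k] * w2[k][j]
                                         for k in range(len(hidden)))))
              for j in range(c)]
    # Scale: multiply every value of each channel by that channel's weight.
    scaled = [[[v * scores[ch] for v in row] for row in features[ch]]
              for ch in range(c)]
    return scaled, scores
```

With C = 4 channels and r = 2 the weight matrices are 4 × 2 and 2 × 4, and every channel weight lands strictly inside (0, 1) because of the sigmoid.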
S122: and introducing an SPP module behind the lightweight backbone network, and adjusting the resolution of the input image.
In a specific embodiment, the introduction of the SPP module behind the lightweight backbone network is specifically: an SPP module is connected behind the lightweight backbone network; the output of the upper layer is downsampled through three maximum pooling kernels with stride 1 and sizes 5 × 5, 9 × 9 and 13 × 13, and the three downsampled outputs are concatenated with the output of the upper layer to realize multi-feature fusion and obtain receptive fields of different sizes.
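The SPP step above, stride-1 max pooling at 5 × 5, 9 × 9 and 13 × 13 followed by concatenation with the input, can be illustrated with a minimal pure-Python sketch. The helper names `max_pool_same` and `spp` are assumptions, and clipping the pooling window at the border stands in for the implicit padding a deep learning framework would apply.

```python
def max_pool_same(x, k):
    """Stride-1 max pooling over a 2-D list, output same size as input
    (the k x k window is clipped at the borders instead of zero-padded)."""
    h, w = len(x), len(x[0])
    p = k // 2
    return [[max(x[a][b]
                 for a in range(max(0, i - p), min(h, i + p + 1))
                 for b in range(max(0, j - p), min(w, j + p + 1)))
             for j in range(w)]
            for i in range(h)]

def spp(channels, kernels=(5, 9, 13)):
    """Concatenate the input channels with their pooled copies at each
    kernel size, realizing the multi-feature fusion of the SPP module."""
    fused = [list(map(list, ch)) for ch in channels]   # original features first
    for k in kernels:
        fused.extend(max_pool_same(ch, k) for ch in channels)
    return fused
```

Because the stride is 1 and the window is clipped, each pooled map keeps the spatial size of its input, which is exactly what makes the concatenation possible.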
In a specific embodiment, when the resolution of the input image is adjusted, if the resolution of the input image is smaller, the resolution of the input image is increased to 512 × 512, so that the identification precision of small objects can be improved; if the resolution of the input image is large, the resolution of the input image is reduced to 512 × 512, which can reduce the amount of calculation and increase the calculation speed. It should be noted that the resolution of 512 × 512 is a preferable resolution in this embodiment, and in actual application, other resolutions may be adopted according to the recognition accuracy requirement and the calculation requirement.
S123: adding an ASFF module to obtain the improved YOLOV3 model. Specifically, the ASFF fusion mode is added on the basis of YOLOV3. YOLOV3 outputs features at three scales, level1, level2 and level3, denoted X^1, X^2 and X^3 respectively; a weight parameter α^l, β^l, γ^l is designed for the feature layer of each scale l, and the three weight parameters sum to 1. Because an addition mode is adopted, the feature maps must be the same size during fusion: the feature layers of different scales are adjusted to the same size through upsampling or downsampling, and each feature layer is multiplied by its respective weight parameter to form the output. This can be expressed by the following formula:

y_{ij}^{l} = α_{ij}^{l} · x_{ij}^{1→l} + β_{ij}^{l} · x_{ij}^{2→l} + γ_{ij}^{l} · x_{ij}^{3→l},   l = 1, 2, 3

where x_{ij}^{n→l} is the feature at position (i, j) of level n after resizing to the resolution of level l. The weight parameters α, β and γ are obtained from the level1-level3 feature maps by 1 × 1 convolution, and after concatenation they are constrained to the range [0, 1] by a softmax, so that α_{ij}^{l} + β_{ij}^{l} + γ_{ij}^{l} = 1.
S13: and generating anchors by using the data in the training set through a kmeans clustering algorithm, wherein the number of the anchors required to be generated is 9.
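The anchor-generation step can be sketched as k-means over box shapes with 1 - IOU as the distance, as is conventional for YOLO-family models. The deterministic initialisation from the first k boxes and the mean update below are simplifying assumptions; the patent does not specify them.

```python
def iou_wh(b1, b2):
    """IOU of two (w, h) box shapes aligned at a common corner."""
    inter = min(b1[0], b2[0]) * min(b1[1], b2[1])
    union = b1[0] * b1[1] + b2[0] * b2[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=20):
    """k-means on labelled box shapes using 1 - IOU as the distance."""
    centroids = [list(b) for b in boxes[:k]]           # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:                                # assign to nearest centroid
            best = min(range(k), key=lambda c: 1 - iou_wh(b, centroids[c]))
            clusters[best].append(b)
        for c, members in enumerate(clusters):         # update as the mean box
            if members:
                centroids[c] = [sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)]
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

In the real pipeline k = 9 anchors are generated from the training-set labels and split across the three output scales; the tiny two-cluster example below is only for illustration.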
S14: taking data of a training set and a verification set as input of an improved YOLOV3 model, and training and verifying the improved YOLOV3 model;
in a specific embodiment, when the improved YOLOV3 model is trained, data enhancement is performed on the data in the training set, specifically by changing one or more of the angle, contrast and brightness of the pictures in the data set. It should be noted that training with data enhancement increases the sample size, enhances the diversity of the data set, and thereby improves the accuracy and robustness of the model; however, it is not an essential technical means, and data enhancement may be omitted when the data set is already sufficiently large and diverse.
S15: taking data of a test set as input of the improved Yolov3 model, and testing the accuracy of the improved Yolov3 model; and when the accuracy reaches a target threshold value, obtaining the real-time smoking and calling identification model.
In a specific embodiment, the target threshold is 80%. It should be noted that the target threshold is determined according to the user's requirement for accuracy; besides the target threshold of this embodiment, other target thresholds such as 85%, 90% and 95% may also be used.
In a specific embodiment, the detection performance of the improved YOLOV3 model of the present invention and the original YOLOV3 model on cell phones and cigarettes was verified on a test data set, and the test results are shown in fig. 6-9. In this embodiment, the precision of the improved YOLOV3 model of the present invention is 90.81%, and the precision of the original YOLOV3 model is 75.6%. Meanwhile, the experimental result shows that the improved YOLOV3 model can detect the mobile phone and the cigarette more easily in the same picture, and the detection effect on the mobile phone and the cigarette is obvious.
S2: locating the pedestrian area with a multi-target tracking algorithm according to the surveillance video of the target scene. In a specific embodiment, the deepsort multi-target tracking algorithm is adopted for locating the pedestrian area. It should be noted that the multi-target tracking algorithm serves mainly to introduce multi-frame information; besides the deepsort algorithm adopted in this embodiment, other back-end tracking optimization algorithms such as SORT, which combine Kalman filtering with Hungarian/KM matching, or multithreaded single-target tracking algorithms such as KCF, may also be adopted.
S3: performing real-time smoking and calling identification on the pedestrian area according to the real-time smoking and calling identification model, specifically as follows: when the confidence output by the real-time smoking and calling identification model exceeds the threshold of 0.5, a valid cigarette or mobile phone is considered detected, and the smoking and calling judgment is then carried out. The specific judgment method is as follows:
determining a pedestrian head area; in a specific embodiment, when determining the head area of the pedestrian, the head area is divided according to the ratio of the head to the body, wherein the ratio of the head to the body is 1. It should be noted that, in addition to the determination of the head area of the pedestrian by using the proportional division of the head and the body in the embodiment, other prior art techniques may be used to determine the head area of the pedestrian.
If the detection result is the mobile phone and the IOU values of the mobile phone area and the pedestrian head area are larger than 0.08, the calling behavior is considered to exist;
if the detection result is a cigarette and the IOU values of the cigarette area and the pedestrian head area are larger than 0, the smoking behavior is considered to exist;
recording the judgment result of each frame for each pedestrian: if the judgment in the current frame is that smoking or calling behaviour exists, the score is 1 × 0.2; if the judgment is normal, the score is 1 × 0; this gives the instantaneous judgment result of the current frame;
adding the instantaneous judgment results of the same pedestrian in 5 continuous frames, if the calculation result is more than 0.5, considering that the pedestrian has smoking or calling behaviors in the period of time, otherwise, judging that the pedestrian is normal.
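The judgment procedure above, an IOU test between the detected object box and the pedestrian head area followed by a five-frame vote in which each positive frame contributes 1 × 0.2, can be sketched as follows. The function names `iou` and `smoking_or_calling` are assumptions, and scanning every window of 5 consecutive frames is one plausible reading of the rule.

```python
def iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def smoking_or_calling(frame_flags):
    """Temporal vote: a positive frame scores 1 * 0.2, a normal frame 1 * 0;
    the behaviour is confirmed when any 5 consecutive frames sum to more
    than 0.5, i.e. at least 3 positive detections out of 5."""
    scores = [0.2 if f else 0.0 for f in frame_flags]
    return any(sum(scores[i:i + 5]) > 0.5
               for i in range(len(scores) - 4))
```

For a calling decision the phone box must overlap the head area with IOU above 0.08; for smoking any positive overlap with the cigarette box suffices, which matches the much smaller size of a cigarette relative to a phone.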
The pedestrian area is located with the multi-target tracking algorithm, and combining multi-frame information reduces the false detection rate. Target detection alone is single-frame detection, in which small targets such as cigarettes and mobile phones are easily misidentified; introducing the multi-target tracking algorithm, and with it the information of consecutive frames, greatly reduces the false detection rate and improves the robustness of the model. In addition, the IOU is calculated between the pedestrian head position and the positions of the cigarette and mobile phone, and a threshold is set to screen the result, so the false detection rate is low.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A real-time smoking and calling identification method is characterized by comprising the following steps:
s1: the method for establishing the real-time smoking and calling identification model specifically comprises the following substeps:
s11: collecting smoking and calling picture data to obtain a data set, labeling objects in the data set, and dividing the data set into a training set, a verification set and a test set, wherein the objects comprise pedestrians, mobile phones and cigarettes;
s12: establishing an improved YOLOV3 model, which specifically comprises the following substeps:
s121: constructing a lightweight backbone network with Se-Res2Block as the basic module, wherein the lightweight backbone network comprises, connected in sequence: two 3 × 3 convolutions with stride 2; a bottleneck layer followed by three Se-Res2Block modules with 36 convolution kernels; a bottleneck layer followed by three Se-Res2Block modules with 72 convolution kernels; and a bottleneck layer followed by three Se-Res2Block modules with 144 convolution kernels;
s122: introducing an SPP module behind the lightweight backbone network and adjusting the resolution of the input image; the introduction of the SPP module behind the lightweight backbone network specifically comprises: downsampling the output of the upper layer through three maximum pooling kernels with stride 1 and sizes 5 × 5, 9 × 9 and 13 × 13, and fusing the three downsampled outputs with the output of the upper layer by concatenation to obtain receptive fields of different sizes;
s123: adding an ASFF module to obtain the improved Yolov3 model;
s13: generating anchors by using the data in the training set through a kmeans clustering algorithm, wherein the number of the anchors required to be generated is 9;
s14: training and verifying the improved Yolov3 model by taking data of a training set and a verification set as input of the improved Yolov3 model;
s15: taking data of a test set as input of the improved Yolov3 model, and testing the accuracy of the improved Yolov3 model; when the accuracy reaches a target threshold value, obtaining the real-time smoking and calling identification model;
s2: positioning a pedestrian area by utilizing a multi-target tracking algorithm according to a monitoring video of a target scene;
s3: according to the real-time smoking and calling identification model, smoking and calling real-time identification is carried out on the pedestrian area, and the smoking and calling real-time identification result of the pedestrian is determined according to the instantaneous judgment addition result of continuous 5 frames of the same pedestrian;
when performing real-time smoking and calling identification on the pedestrian area: when the confidence output by the real-time smoking and calling identification model exceeds the threshold of 0.5, a valid cigarette or mobile phone is considered detected, and the smoking and calling judgment is carried out, the specific judgment method being as follows:
determining a pedestrian head area;
if the detection result is the mobile phone and the IOU values of the mobile phone area and the pedestrian head area are larger than 0.08, the calling behavior is considered to exist;
if the detection result is a cigarette and the IOU values of the cigarette area and the pedestrian head area are larger than 0, the smoking behavior is considered to exist;
recording the judgment result of each frame for each pedestrian: if the judgment in the current frame is that smoking or calling behaviour exists, the score is 1 × 0.2; if the judgment is normal, the score is 1 × 0; this gives the instantaneous judgment result of the current frame;
adding the instantaneous judgment results of the same pedestrian in 5 continuous frames, if the calculation result is more than 0.5, considering that the pedestrian has smoking or calling behaviors in the period of time, otherwise, judging that the pedestrian is normal.
2. The method according to claim 1, wherein in step S11, yolomark is used for labeling the objects in the data set.
3. The real-time smoking call recognition method according to claim 1, wherein in step S121, the Se-Res2Block module is composed of two parts, namely a Res2Block module and a channel attention mechanism, and the channel attention mechanism is arranged behind the Res2Block module.
4. The real-time smoking calling identification method according to claim 1, wherein the step S123 specifically comprises: increasing a fusion mode of ASFF on the basis of YOLOV3, wherein the YOLOV3 is output in three scales, the ASFF designs a weight parameter for the feature layer in each scale, the sum of the three weight parameters is 1, the feature layers in different scales are adjusted to be the same in size through up-sampling or down-sampling during fusion, and each feature layer is multiplied by the respective weight parameter to serve as output.
5. The method of claim 1, wherein in step S14, when the improved YOLOV3 model is trained, data enhancement is performed on data in the training set, specifically by changing one or more of an angle, a contrast, and a brightness of a picture of the data set.
6. The real-time smoking and calling identification method of claim 1, wherein in step S2, the pedestrian region is located using the DeepSort multi-target tracking algorithm.
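Claim 6 names the DeepSort tracker. As a rough stand-in for its data-association step only (the real algorithm additionally uses Kalman-filter motion prediction and appearance embeddings), a greedy IOU matcher can illustrate how detections are linked to existing pedestrian tracks:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def greedy_match(tracks, detections, iou_thresh=0.3):
    """tracks: {track_id: box}; detections: list of boxes.

    Returns {detection_index: track_id} for the best-overlapping pairs.
    """
    assignments, used = {}, set()
    for di, det in enumerate(detections):
        best_id, best_iou = None, iou_thresh
        for tid, tbox in tracks.items():
            if tid in used:
                continue
            v = iou(det, tbox)
            if v > best_iou:
                best_id, best_iou = tid, v
        if best_id is not None:
            assignments[di] = best_id
            used.add(best_id)
    return assignments
```

Per-track identity is what makes the 5-frame temporal rule of claim 1 well-defined: the instantaneous scores being summed must all belong to the same pedestrian.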
CN202110207092.0A 2021-02-24 2021-02-24 Real-time smoking calling identification method Active CN112818913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110207092.0A CN112818913B (en) 2021-02-24 2021-02-24 Real-time smoking calling identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110207092.0A CN112818913B (en) 2021-02-24 2021-02-24 Real-time smoking calling identification method

Publications (2)

Publication Number Publication Date
CN112818913A CN112818913A (en) 2021-05-18
CN112818913B true CN112818913B (en) 2023-04-07

Family

ID=75865407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110207092.0A Active CN112818913B (en) 2021-02-24 2021-02-24 Real-time smoking calling identification method

Country Status (1)

Country Link
CN (1) CN112818913B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591662A (en) * 2021-07-24 2021-11-02 深圳市铁越电气有限公司 Method, system and storage medium for recognizing smoking calling behavior
CN115880683B (en) * 2023-03-02 2023-05-16 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Urban waterlogging ponding intelligent water level detection method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111368696A (en) * 2020-02-28 2020-07-03 淮阴工学院 Dangerous chemical transport vehicle illegal driving behavior detection method and system based on visual cooperation
CN112052815A (en) * 2020-09-14 2020-12-08 北京易华录信息技术股份有限公司 Behavior detection method and device and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020596B (en) * 2012-12-05 2016-06-22 华北电力大学 A kind of based on Human bodys' response method abnormal in the power generation of block models
CN105913022A (en) * 2016-04-11 2016-08-31 深圳市飞瑞斯科技有限公司 Handheld calling state determining method and handheld calling state determining system based on video analysis
CN108960216A (en) * 2018-09-21 2018-12-07 浙江中正智能科技有限公司 A kind of detection of dynamic human face and recognition methods
US11308325B2 (en) * 2018-10-16 2022-04-19 Duke University Systems and methods for predicting real-time behavioral risks using everyday images
CN110633643A (en) * 2019-08-15 2019-12-31 青岛文达通科技股份有限公司 Abnormal behavior detection method and system for smart community
CN110723621B (en) * 2019-10-11 2021-09-17 浙江新再灵科技股份有限公司 Device and method for detecting smoking in elevator car based on deep neural network
CN111222449B (en) * 2020-01-02 2023-04-11 上海中安电子信息科技有限公司 Driver behavior detection method based on fixed camera image
CN111898514B (en) * 2020-07-24 2022-10-18 燕山大学 Multi-target visual supervision method based on target detection and action recognition
CN112115775A (en) * 2020-08-07 2020-12-22 北京工业大学 Smoking behavior detection method based on computer vision in monitoring scene
CN112257643A (en) * 2020-10-30 2021-01-22 天津天地伟业智能安全防范科技有限公司 Smoking behavior and calling behavior identification method based on video streaming

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111368696A (en) * 2020-02-28 2020-07-03 淮阴工学院 Dangerous chemical transport vehicle illegal driving behavior detection method and system based on visual cooperation
CN112052815A (en) * 2020-09-14 2020-12-08 北京易华录信息技术股份有限公司 Behavior detection method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Driver Hand Motion Detection Method Based on Pose Estimation; Liu Tangbo et al.; Journal of Signal Processing (《信号处理》), Issue 12, pp. 136-143 *

Also Published As

Publication number Publication date
CN112818913A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN110929560B (en) Video semi-automatic target labeling method integrating target detection and tracking
CN112818913B (en) Real-time smoking calling identification method
CN109977782B (en) Cross-store operation behavior detection method based on target position information reasoning
CN102831402B (en) Sparse coding and visual saliency-based method for detecting airport through infrared remote sensing image
CN109815863B (en) Smoke and fire detection method and system based on deep learning and image recognition
CN108388879A (en) Mesh object detection method, device and storage medium
CN109858547A (en) A kind of object detection method and device based on BSSD
CN111222396A (en) All-weather multispectral pedestrian detection method
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN108647695A (en) Soft image conspicuousness detection method based on covariance convolutional neural networks
WO2023083280A1 (en) Scene text recognition method and device
CN110781917B (en) Method and device for detecting repeated image, electronic equipment and readable storage medium
CN111274942A (en) Traffic cone identification method and device based on cascade network
CN111175318A (en) Screen scratch fragmentation detection method and equipment
CN113723377A (en) Traffic sign detection method based on LD-SSD network
CN111539456B (en) Target identification method and device
CN111881984A (en) Target detection method and device based on deep learning
CN115115973A (en) Weak and small target detection method based on multiple receptive fields and depth characteristics
CN115719463A (en) Smoke and fire detection method based on super-resolution reconstruction and adaptive extrusion excitation
CN111553337A (en) Hyperspectral multi-target detection method based on improved anchor frame
CN113095445B (en) Target identification method and device
CN111476314B (en) Fuzzy video detection method integrating optical flow algorithm and deep learning
CN111881914A (en) License plate character segmentation method and system based on self-learning threshold
CN113111888B (en) Picture discrimination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant