CN114898434A - Method, device and equipment for training mask recognition model and storage medium - Google Patents
Method, device and equipment for training mask recognition model and storage medium
- Publication number
- CN114898434A (application CN202210549746.2A)
- Authority
- CN
- China
- Prior art keywords
- training
- model
- picture
- network
- image block
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Human Computer Interaction (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a training method, device, equipment and storage medium for a mask recognition model. The method comprises the following steps: cutting each picture in a first training set into a plurality of image blocks of the same size and labeling the image blocks to obtain a second training set; cutting each picture in a third training set into a plurality of image blocks of the same size and labeling the image blocks to obtain a fourth training set; pre-training a first preset model with the second training set to obtain a pre-training model; and performing formal training with the fourth training set and a second preset model to obtain the mask recognition model. The first preset model and the second preset model can adopt a YOLOV4-tiny network model, which comprises a backbone network and a remaining network, and the parameters of the backbone network are frozen during both the pre-training and the formal training. The method improves the running speed of the model and the accuracy of recognizing small targets at longer distances.
Description
Technical Field
The invention relates to the technical field of new-generation information technology, and in particular to a training method, device, equipment and storage medium for a mask recognition model.
Background
With the development of computer vision technology, using computer vision to detect faces in images and identify whether a mask is worn has very important research significance and application value.
At present, mask identification is mostly performed with neural networks, and many security monitoring systems realize mask identification through system upgrades. For example, a Multi-Task Cascaded Convolutional Neural Network (MTCNN) is used as the network model for mask-wearing recognition: a region of interest (ROI) is marked on the spectral image, coordinate and category information is acquired, a support vector machine (SVM) classifier is trained, and a classification judgment is made on whether a mask is worn.
Because of their large numbers of model parameters, the existing identification methods run slowly and have long identification periods, and they cannot accurately identify small targets at long distances.
Disclosure of Invention
The invention provides a training method, device, equipment and storage medium for a mask recognition model, which improve the accuracy of recognizing small targets at longer distances and solve the problems of low running speed and long recognition period caused by large model parameters.
In a first aspect, an embodiment of the present invention provides a training method for a mask recognition model, including:
respectively cutting each picture in a first training set into a plurality of image blocks, and labeling the image blocks to obtain a second training set, wherein the first training set is a first face mask data set, and labels of the image blocks in the second training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
respectively cutting each picture in a third training set into a plurality of image blocks, and labeling the image blocks to obtain a fourth training set, wherein the third training set is a second face mask data set, and labels of the image blocks in the fourth training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
pre-training a first preset model by using the second training set to obtain a pre-training model, wherein the first preset model comprises a backbone network and a remaining network, and parameters of the backbone network of the first preset model are frozen in the pre-training process;
and performing formal training by using the fourth training set and a second preset model to obtain a mask recognition model, wherein the second preset model comprises a backbone network and a remaining network, parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-training model, initial parameters of the remaining network of the second preset model are the parameters of the remaining network of the pre-training model, and the parameters of the backbone network of the second preset model are frozen in the formal training process.
In a possible implementation manner, the first preset model adopts a YOLOV4-tiny network model, and the pre-training of the first preset model with the second training set to obtain a pre-training model includes:
loading parameters obtained by training on the ImageNet data set to a backbone network of the first preset model;
after the loading is finished, freezing a backbone network of the first preset model;
updating parameters of the remaining network of the first preset model by using the following iterative process until an iteration condition is met, and taking the trained model as the pre-training model:
inputting a first number of image blocks in the second training set into the first preset model for training each time to obtain a training result;
determining a YOLO loss value according to the labels of the input image blocks and the training results of the input image blocks;
performing back propagation according to the YOLO loss value to obtain updated parameters of the remaining network of the first preset model;
and updating the parameters of the remaining network of the first preset model with the updated parameters.
In a possible implementation manner, the second preset model adopts a YOLOV4-tiny network model, and the performing of formal training using the fourth training set and the second preset model to obtain the mask recognition model includes:
loading parameters of the backbone network of the pre-training model to the backbone network of the second preset model;
after the loading is finished, freezing the backbone network of the second preset model;
loading parameters of the remaining network of the pre-training model to the remaining network of the second preset model;
updating parameters of the remaining network of the second preset model by using the following iterative process until an iteration condition is met, and taking the trained model as the mask recognition model:
inputting the first number of image blocks in the fourth training set into the second preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain updated parameters of the remaining network of the second preset model;
and updating the parameters of the remaining network of the second preset model with the updated parameters.
In one possible implementation manner, the cutting each picture in the first training set and the third training set into a plurality of image blocks respectively includes:
starting from any corner of the picture, sliding an image frame in steps of a preset number of pixels and cutting out the image inside the image frame to form an image block, wherein the preset number of pixels is smaller than the length and the width of the image frame;
and obtaining the frame coordinates of the image block according to the frame coordinates labeled in the picture and the size of the sliding image frame.
In a possible implementation manner, before the pictures in the first training set and the third training set are respectively cut into a plurality of image blocks, the method further includes:
performing data enhancement processing on the pictures in the first training set and the third training set, wherein the data enhancement processing comprises one or more of the following processing: randomly adjusting the size of the picture, randomly adjusting the contrast of the picture, randomly adjusting the tone of the picture, randomly adjusting the brightness of the picture, randomly adding noise to the picture, randomly changing a color model of the picture and randomly cutting the picture.
In a possible implementation manner, the size of the image block is 416 × 416 pixels, and the backbone network includes 6 cascaded cross-stage partial (CSP) networks, each CSP network being configured to perform feature extraction on the input image;
the target CSP networks in the backbone network are connected to the input of a CAT module, the output of the CAT module is connected to the remaining network, the feature maps extracted by the target CSP networks are 26 × 26 pixels and 13 × 13 pixels, and the CAT module is used to connect the feature maps extracted by the target CSP networks.
In one possible implementation manner, the proportion of the face targets of a part of the pictures in the first face mask data set is greater than a preset threshold, and the proportion of the face targets of the rest of the pictures is less than or equal to the preset threshold;
and the proportion of the face target included in each picture in the second face mask data set is greater than the preset threshold value.
In a second aspect, an embodiment of the present invention provides a mask recognition method, applied to a mask recognition model obtained by the method of the first aspect, and the method includes:
cutting a picture to be recognized into a plurality of image blocks, and determining position information of each image block in the picture to be recognized, wherein the picture to be recognized comprises at least one face target;
inputting the plurality of image blocks into the mask recognition model to obtain a first recognition result of each image block, wherein the first recognition result is used for indicating whether the face in the image block wears a mask or not;
when the same face target in the picture to be recognized exists in different image blocks, calculating the confidence coefficient of each image block in which the face target is located according to the position information of each image block in which the face target is located in the picture to be recognized and a face target detection frame in each image block, and selecting a first recognition result of the image block with the maximum confidence coefficient as the recognition result of the face target;
and outputting the recognition result of the face target in the picture to be recognized.
In a possible implementation manner, the calculating a confidence of each image block in which the face target is located according to the position information of each image block in which the face target is located in the picture to be recognized and the face target detection frame in each image block includes:
restoring each image block to the picture to be recognized according to the position information of each image block where the human face target is located in the picture to be recognized;
for each image block, calculating the ratio of a human face target detection frame in the image block to a human face target detection frame in the picture to be recognized;
and taking the calculated ratio as the confidence coefficient of the image block.
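For illustration, the following is a minimal Python sketch of the confidence calculation described above, under the assumption that detection frames are given as (x1, y1, x2, y2) pixel coordinates and that the frame found in an image block is restored to picture coordinates by adding the block's top-left offset; the helper names and box format are assumptions for illustration and are not defined by the method itself.

def box_area(box):
    # Area of an (x1, y1, x2, y2) frame; zero if the frame is degenerate.
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def block_confidence(block_box, block_offset, picture_box):
    # Restore the detection frame found in the image block to the coordinates of the
    # picture to be recognized, then return the ratio of its area to the area of the
    # face target detection frame in the whole picture.
    ox, oy = block_offset  # top-left corner of the image block in the picture
    restored = (block_box[0] + ox, block_box[1] + oy,
                block_box[2] + ox, block_box[3] + oy)
    return box_area(restored) / max(box_area(picture_box), 1)

The first recognition result of the image block with the largest confidence would then be kept as the recognition result of the face target.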
In a third aspect, an embodiment of the present invention provides a training apparatus for a mask recognition model, including:
the first cutting module is used for cutting each picture in a first training set into a plurality of image blocks respectively and labeling the image blocks to obtain a second training set, wherein the first training set is a first face mask data set, and labels of the image blocks in the second training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
the second cutting module is used for cutting each picture in a third training set into a plurality of image blocks respectively and labeling the image blocks to obtain a fourth training set, wherein the third training set is a second face mask data set, and labels of the image blocks in the fourth training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
the pre-training module is used for pre-training a first preset model by using the second training set to obtain a pre-training model, wherein the first preset model comprises a backbone network and a remaining network, and parameters of the backbone network of the first preset model are frozen in the pre-training process;
and the formal training module is used for performing formal training by using the fourth training set and a second preset model to obtain a mask recognition model, wherein the second preset model comprises a backbone network and a remaining network, parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-training model, initial parameters of the remaining network of the second preset model are the parameters of the remaining network of the pre-training model, and the parameters of the backbone network of the second preset model are frozen in the formal training process.
In a possible implementation manner, the first preset model adopts a YOLOV4-tiny network model, and the pre-training module is specifically configured to:
loading parameters obtained by training on the ImageNet data set to a backbone network of the first preset model;
after the loading is finished, freezing a backbone network of the first preset model;
updating parameters of the remaining network of the first preset model by using the following iterative process until an iteration condition is met, and taking the trained model as the pre-training model:
inputting a first number of image blocks in the second training set into the first preset model for training each time to obtain a training result;
determining a YOLO loss value according to the labels of the input image blocks and the training results of the input image blocks;
performing back propagation according to the YOLO loss value to obtain updated parameters of the remaining network of the first preset model;
and updating the parameters of the remaining network of the first preset model with the updated parameters.
In a possible implementation manner, the second preset model adopts a YOLOV4-tiny network model, and the formal training module is specifically configured to:
loading parameters of the backbone network of the pre-training model to the backbone network of the second preset model;
after the loading is finished, freezing the backbone network of the second preset model;
loading parameters of the remaining network of the pre-training model to the remaining network of the second preset model;
updating parameters of the remaining network of the second preset model by using the following iterative process until an iteration condition is met, and taking the trained model as the mask recognition model:
inputting the first number of image blocks in the fourth training set into the second preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain updated parameters of the remaining network of the second preset model;
and updating the parameters of the remaining network of the second preset model with the updated parameters.
In a possible implementation manner, the first cutting module and the second cutting module are specifically configured to:
starting from any corner of the picture, sliding an image frame in steps of a preset number of pixels and cutting out the image inside the image frame to form an image block, wherein the preset number of pixels is smaller than the length and the width of the image frame;
and obtaining the frame coordinates of the image block according to the frame coordinates labeled in the picture and the size of the sliding image frame.
In a possible implementation manner, the first cutting module and the second cutting module further include:
an enhancement unit, configured to perform data enhancement processing on the pictures in the first training set and the third training set, where the data enhancement processing includes one or more of the following processing: randomly adjusting the size of the picture, randomly adjusting the contrast of the picture, randomly adjusting the tone of the picture, randomly adjusting the brightness of the picture, randomly adding noise to the picture, randomly changing a color model of the picture and randomly cutting the picture.
In a possible implementation manner, the size of the image block in the first cutting module or the second cutting module is 416 × 416 pixels, and the backbone network in the pre-training module or the formal training module includes 6 cascaded cross-stage partial (CSP) networks, each CSP network being configured to perform feature extraction on the input image;
the target CSP networks in the backbone network are connected to the input of a CAT module, the output of the CAT module is connected to the remaining network, the feature maps extracted by the target CSP networks are 26 × 26 pixels and 13 × 13 pixels, and the CAT module is used to connect the feature maps extracted by the target CSP networks.
In one possible implementation manner, the proportion of the face targets of a part of the pictures in the first face mask data set is greater than a preset threshold, and the proportion of the face targets of the rest of the pictures is less than or equal to the preset threshold;
and the proportion of the face target included in each picture in the second face mask data set is greater than the preset threshold value.
In a fourth aspect, an embodiment of the present invention provides a mask recognition device, the device comprising:
the image recognition system comprises a cutting module, a recognition module and a processing module, wherein the cutting module is used for cutting a picture to be recognized into a plurality of image blocks and determining the position information of each image block in the picture to be recognized, and the picture to be recognized comprises at least one face target;
the input module is used for inputting the image blocks into the mask identification model to obtain a first identification result of each image block, and the first identification result is used for indicating whether the face in the image block wears a mask or not;
the calculation module is used for calculating the confidence of each image block where the human face target is located according to the position information of each image block where the human face target is located in the picture to be recognized and the human face target detection frame in each image block when the same human face target in the picture to be recognized exists in different image blocks, and selecting the first recognition result of the image block with the maximum confidence as the recognition result of the human face target;
and the output module is used for outputting the recognition result of the face target in the picture to be recognized.
In a possible implementation manner, the calculation module is specifically configured to:
restoring each image block to the picture to be recognized according to the position information of each image block where the human face target is located in the picture to be recognized;
for each image block, calculating the ratio of a human face target detection frame in the image block to a human face target detection frame in the picture to be recognized;
and taking the calculated ratio as the confidence coefficient of the image block.
In a fifth aspect, an electronic device for mask recognition comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the training method of the mask recognition model provided by the first aspect of the present invention.
In a sixth aspect, a computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the training method of the mask recognition model provided in the first aspect of the present invention.
According to the training method and device for the mask recognition model, the model parameters are reduced through the design of the model structure, and the processing of small targets is improved by cutting the training-set pictures into image blocks, so that the running speed is increased, the recognition period is shortened, and the accuracy of recognizing small targets at longer distances is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic structural diagram of a YOLOV4-tiny network model according to the present invention;
FIG. 2 is a schematic structural diagram of a CSP network in the YOLOV4-tiny network model according to the present invention;
FIG. 3 is a flowchart of a method for training a mask recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a picture cut;
FIG. 5 is a diagram illustrating image block location information;
FIG. 6 is a flowchart of a pre-training method for a mask recognition model according to a second embodiment of the present invention;
fig. 7 is a flowchart of a formal training method for a mask recognition model according to a third embodiment of the present invention;
fig. 8 is a flowchart of a mask recognition method according to a fourth embodiment of the present invention;
fig. 9 is a schematic structural diagram of a training apparatus for a mask recognition model according to a fifth embodiment of the present invention;
fig. 10 is a schematic structural diagram of a mask recognition device according to a sixth embodiment of the present invention;
fig. 11 is a schematic structural diagram of a mask identification device according to a seventh embodiment of the present invention.
With the above figures, certain embodiments of the invention have been illustrated and described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate it by those skilled in the art with reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The embodiment of the invention provides a training method and a using method for a mask recognition model, the mask recognition model being used for recognizing whether a face wears a mask. The mask recognition model can adopt a YOLOV4-tiny network model, and fig. 1 is a schematic structural diagram of the YOLOV4-tiny network model. As shown in fig. 1, the YOLOV4-tiny network model is mainly composed of a backbone network 107, a Concat connection module 108 and a remaining network 109. In the structure shown in fig. 1, the last two stages of CSP networks are connected to the input end of the Concat connection module 108, the target feature maps extracted by these two stages of CSP networks enter the remaining network 109 through the output end of the Concat connection module 108, and the remaining network 109 is a convolution layer that performs convolution processing on the input target feature maps.
Illustratively, when an image block of 416 × 416 pixels is input, feature maps of 208 × 208 pixels, 104 × 104 pixels, 52 × 52 pixels, 26 × 26 pixels and 13 × 13 pixels are sequentially extracted by the 6 cascaded cross-stage partial (CSP) networks of the backbone network; the feature maps of 26 × 26 pixels and 13 × 13 pixels are input into the remaining network through the Concat connection module, and the recognition result of the image block is finally output after processing by the remaining network.
The invention adopts the YOLOV4-tiny network model to identify whether a face target in a picture to be recognized wears a mask; the network model improves the running speed of the model by using a plurality of CSP networks with a skip connection structure.
The CSP network structure in the YOLOV4-tiny network model is described in detail below with reference to FIG. 2.
FIG. 2 is a schematic structural diagram of a CSP network in the YOLOV4-tiny network model according to the present invention. As shown in fig. 2, the CSP network is composed of an input module 201, an output module 206, convolutional layers 202, 203, 204 and 205, and CAT modules 25 and 26. The convolution kernels of convolutional layers 202, 203 and 204 are all 3 × 3, and the convolution kernel of convolutional layer 205 is 1 × 1. The CAT modules 25 and 26 connect two arrays without changing the characteristics of the arrays. The CSP network comprises two skip connection structures, shown as 207 and 208, which divide the input feature map into two parts along the channel dimension; after passing through a skip connection structure, only the feature map on one channel branch needs to undergo convolution calculation, thereby reducing the calculation amount of the model and accelerating the running speed of the computer.
Specifically, as shown in fig. 2, the feature map output from convolutional layer 202 is divided along the channels into path 21 and path 22 when passing through skip connection structure 207; the feature map output through path 21 is directly connected with the feature map output from convolutional layer 205 in CAT module 26 and enters output module 206. The feature map output through path 22 enters convolutional layer 203; the feature map output after convolution calculation by convolutional layer 203 is divided along the channels into path 23 and path 24 when passing through skip connection structure 208; the feature map output through path 24 enters convolutional layer 204, and the feature map output through path 23 and the feature map output by convolutional layer 204 are connected in CAT module 25 and enter convolutional layer 205.
Illustratively, the input module 201 inputs a feature map of 64@104 × 104 pixels into the first convolutional layer 202, which performs convolution calculation with the convolution function Conv3 × 3 and outputs a feature map; when the output feature map passes through skip connection structure 207, it is divided into two parts along the channels so that the channels are halved. The part of 32@104 × 104 pixels on path 21 enters CAT module 26, while the other part of 32@104 × 104 pixels enters convolutional layer 203 through path 22 and is processed by the convolution function Conv3 × 3; when the output feature map passes through skip connection structure 208, it is again divided into two parts along the channels so that the channels are halved. The part of 16@104 × 104 pixels on path 23 enters CAT module 25, while the other part of 16@104 × 104 pixels enters convolutional layer 204 through path 24 and is processed by the convolution function Conv3 × 3; the output feature map enters CAT module 25, is merged with the feature map input through path 23, is then input into convolutional layer 205 and processed by the convolution function Conv1 × 1, enters CAT module 26, is merged with the feature map input through path 21, and is finally output through the output module 206.
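The following is a minimal PyTorch sketch of the CSP structure as it is described for fig. 2; the layer attributes mirror the reference numerals 202-205 and the CAT modules 25 and 26, and the channel counts follow the 64@104 × 104 example above. Normalisation and activation layers, and the exact assignment of the two channel halves to paths 21/22 and 23/24, are assumptions made for illustration rather than details taken from the patent.

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.conv202 = nn.Conv2d(channels, channels, 3, padding=1)  # Conv3x3
        self.conv203 = nn.Conv2d(half, half, 3, padding=1)          # Conv3x3 on one half
        self.conv204 = nn.Conv2d(quarter, quarter, 3, padding=1)    # Conv3x3 on one quarter
        self.conv205 = nn.Conv2d(half, half, 1)                     # Conv1x1 after CAT module 25

    def forward(self, x):
        x = self.conv202(x)
        path21, path22 = x.chunk(2, dim=1)    # skip connection 207: channels halved
        y = self.conv203(path22)
        path23, path24 = y.chunk(2, dim=1)    # skip connection 208: channels halved again
        z = self.conv204(path24)
        z = torch.cat([path23, z], dim=1)     # CAT module 25
        z = self.conv205(z)
        return torch.cat([path21, z], dim=1)  # CAT module 26 -> output module 206

# Shape check for the example above:
# CSPBlock(64)(torch.randn(1, 64, 104, 104)).shape == (1, 64, 104, 104)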
The following describes the technical solution of the present invention and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the invention will be described below with reference to the drawings.
It should be noted that the network model used in the following embodiment is a YOLOV4-tiny network model shown in fig. 1.
Before using the YOLOV4-tiny network model to identify whether a face target in a picture to be detected wears a mask, the parameters of the YOLOV4-tiny network model need to be trained in order to obtain a YOLOV4-tiny network model with optimal recognition efficiency. The training of the YOLOV4-tiny network model is described in detail below with reference to fig. 3.
Fig. 3 is a flowchart of a training method for a mask recognition model according to an embodiment of the present invention. As shown in fig. 3, the training method of the mask recognition model includes the following steps.
S301, each picture in the first training set is respectively cut into a plurality of image blocks, and the image blocks are labeled to obtain a second training set.
The first training set is a first face mask data set. Specifically, the proportion of the face targets in some pictures in the first face mask data set is greater than a preset threshold, and the proportion of the face targets in the remaining pictures is less than or equal to the preset threshold. For example, the data set may be the masked face data set FaceMask_CelebA, or one or more of the real masked face recognition data set, the simulated masked face recognition data set and the real masked face verification data set in the Real-World Masked Face Dataset (RMFD).
Specifically, the data set includes a plurality of pictures and labels of the pictures, and the labels are information of whether the face wears the mask or not. For example, when a face in a picture wears a mask, the category information is 1, and when the face does not wear the mask, the category information is 0.
Each picture in the first training set is divided into a plurality of image blocks, the size of each image block may be the same or different, and this embodiment does not limit this. And labeling the cut image blocks with labels, wherein the labels comprise position information of the image blocks in each picture and category information of whether the face in the image blocks wears the mask.
The existing identification method for wearing the mask is poor in detection effect on the small targets far away, and in order to improve the detection effect on the small targets far away, the method provided by the invention adopts a mode of amplifying firstly and then detecting, namely, each picture in a training set is cut into a plurality of image blocks, and then the image blocks are detected.
Fig. 4 is a schematic diagram of picture cutting. As shown in fig. 4, an image frame 40 of a certain size is slid over the picture in steps of a preset number of pixels, and the area inside each image frame is cut out to obtain an image block, wherein the preset number of pixels is smaller than the length and the width of the image frame. For example, the size of the image frame 40 may be 416 × 416 pixels, and the preset number of pixels may be 300.
As shown in fig. 4, in one possible implementation, starting from the top left corner of the picture, an image frame with a fixed size is used to slide from left to right in sequence by preset pixels, and the image in the image frame is cut off to form a plurality of image blocks. And when the image frame slides to the rightmost end, sliding the image frame downwards by a step length according to the sliding step length, then sliding the image frame from left to right or from right to left sequentially according to preset pixels, and finishing the cutting of the image block according to the mode. The sliding step is the size of a plurality of pixels, and the sliding step may be the same as or different from the preset pixel.
In another possible implementation, starting from the upper left corner of the picture, frames with a certain size are used to sequentially slide on the picture from top to bottom according to the size of the preset pixels, and the image in the image frame is cut off to form a plurality of image blocks. And when the image frame slides to the lowest end, sliding the image frame by a step length to the right according to the sliding step length, then sequentially sliding the image frame from top to bottom or sequentially sliding the image frame from bottom to top according to the preset pixels, and finishing the cutting of the image block according to the mode. The sliding step is the size of a plurality of pixels, and the sliding step may be the same as or different from the preset pixel.
The cut image blocks are labeled to obtain the second training set. The labels of the image blocks in the second training set comprise the position information of the image block in the picture and the category information of whether the face in the image block wears a mask. The category information of an image block is the same as that of the picture before cutting; that is, if the category information of the face in the picture before cutting is that a mask is worn, the category information of the face in each image block after cutting is also that a mask is worn. After cutting is completed, some image blocks may contain no face, and the category information of an image block without a face is that no mask is worn.
The position information of an image block may be the position of the cutting point of the image block in the picture, obtained from the coordinates of the picture before cutting. Fig. 5 is a schematic diagram of image block positions. As shown in fig. 5, point A(0, 0) of the picture is taken as the coordinate origin, and points B and C are the position information of the first image block and the second image block in the picture, respectively. It is to be understood that the position of an image block here is not a physical coordinate but a pixel position. Illustratively, when the size of the cutting frame is 416 × 416 and the sliding step is 300 pixels, the position information of the first image block is B(416, 416) and the position information of the second image block is C(716, 416). By analogy, the position information of each image frame in the picture can be obtained by translation according to the coordinates of the picture, the size of the cutting frame and the sliding pixels.
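The following is a minimal NumPy sketch of this cutting scheme, assuming a 416 × 416 image frame slid in steps of 300 pixels; following the fig. 5 example, the recorded position of each image block is the bottom-right corner of the sliding frame, so the first block yields (416, 416) and the block to its right yields (716, 416). The function name, return format and the handling of picture borders are assumptions for illustration.

import numpy as np

def cut_into_blocks(picture: np.ndarray, frame: int = 416, step: int = 300):
    # Return a list of (image_block, position) pairs, where position is the
    # bottom-right pixel corner of the sliding frame in the original picture
    # (origin A(0, 0) at the top-left corner of the picture).
    height, width = picture.shape[:2]
    blocks = []
    for y in range(0, max(height - frame, 0) + 1, step):
        for x in range(0, max(width - frame, 0) + 1, step):
            blocks.append((picture[y:y + frame, x:x + frame], (x + frame, y + frame)))
    return blocks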
Optionally, before each picture in the first training set is respectively cut into a plurality of image blocks, data enhancement processing needs to be performed on each picture in the first training set. Specifically, the data enhancement processing mode may be one or more of the following processing modes: randomly adjusting the size of the picture, randomly adjusting the contrast of the picture, randomly adjusting the tone of the picture, randomly adjusting the brightness of the picture, randomly adding noise to the picture, randomly changing a color model of the picture and randomly cutting the picture.
For example, randomly adjusting the brightness and contrast of a picture changes its quality so that it matches the inconsistent imaging quality caused by environmental factors such as air quality in real shooting scenes; randomly cropping a picture changes the position of the face target in the picture so that it matches the depth-of-field changes of the foreground and background caused by changes in the position of the face target in real scenes. By enhancing the data set pictures in this way, the pictures in the final training set contain the various problems that may affect imaging quality under ordinary viewing conditions.
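As an illustration, the following torchvision sketch applies some of the random enhancements listed above (brightness, contrast and hue adjustment, and added noise); the concrete values are placeholders rather than values given by the invention. Geometric enhancements such as random resizing or cropping would also require the labeled frame coordinates to be adjusted accordingly, which is omitted here.

import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, hue=0.1),  # random brightness/contrast/hue
    transforms.ToTensor(),
    # add a small amount of random noise and keep pixel values in [0, 1]
    transforms.Lambda(lambda img: (img + 0.02 * torch.randn_like(img)).clamp(0.0, 1.0)),
])
# augmented = augment(pil_picture)  # pil_picture: a PIL.Image from the training set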
S302, respectively cutting each picture in the third training set into a plurality of image blocks, and labeling the image blocks to obtain a fourth training set.
The third training set is a second face mask data set. Specifically, the proportion of the face targets included in each picture in the second face mask data set is greater than the preset threshold. The pictures in this data set are taken by staff in daily scenes and, compared with the pictures in the first face mask data set, are closer to pictures taken under a monitoring camera, that is, the face targets in the pictures are closer to small targets. By training with the pictures of this non-public face training set, the invention can improve the accuracy of small target detection.
Specifically, the data set includes a plurality of pictures and labels of the pictures, and the labels are information of whether the face wears the mask or not. For example, when a face in a picture wears a mask, the category information is 1, and when the face does not wear the mask, the category information is 0.
The cut image blocks are labeled to obtain the fourth training set. The labels of the image blocks in the fourth training set comprise the position information of the image block in the picture and the category information of whether the face in the image block wears a mask. The category information of an image block is the same as that of the picture before cutting; that is, if the category information of the face in the picture before cutting is that a mask is worn, the category information of the face in each image block after cutting is also that a mask is worn. After cutting is completed, some image blocks may contain no face, and the category information of an image block without a face is that no mask is worn.
Specifically, the method for cutting the picture in the third training set and the method for labeling the position information of the image block are the same as those in step S301, and are not described herein again.
Optionally, before each picture in the third training set is respectively cut into a plurality of image blocks, data enhancement processing needs to be performed on each picture in the third training set, and the enhancement processing manner and the effect thereof are the same as those in step S301, and are not described again here.
And S303, pre-training the first preset model by using a second training set to obtain a pre-training model.
The first preset model adopts a YOLOV4-tiny network model, which can adopt the structure shown in fig. 1; it can be understood that the network structure of the YOLOV4-tiny network model may vary, for example the number of CSP networks may be greater or smaller than in the network model shown in fig. 1. The image blocks in the second training set are input into the first preset model in batches of a certain number for model pre-training; illustratively, the number of image blocks input each time may be 16, and correspondingly the output results of 16 image blocks are obtained, the output results being used to indicate whether the face in the image block wears a mask. A YOLO loss value is obtained according to the output result of the model training and the labels of the image blocks, back propagation is performed through the YOLO loss value, and iteration is performed with the gradient descent method to continuously update the model parameters until the iteration is finished, obtaining the pre-training model.
The first preset model comprises a backbone network and a remaining network, and the parameters of the backbone network of the first preset model are frozen in the pre-training process. Freezing the parameters of the backbone network means that they are not updated and remain unchanged during the iterative training, and only the parameters of the remaining network are updated; freezing the backbone network parameters accelerates the training process and shortens the training time.
And S304, performing formal training by using the fourth training set and the second preset model to obtain the mask recognition model.
The second preset model adopts a YOLOV4-tiny network model, which can adopt the structure shown in fig. 1. The image blocks in the fourth training set are input into the second preset model in batches of a certain number for formal training; illustratively, the number of image blocks input each time may be 16, and correspondingly the output results of 16 image blocks are obtained, the output results being used to indicate whether the face in the image block wears a mask. A YOLO loss value is obtained according to the output result of the model training and the labels of the image blocks, back propagation is performed through the YOLO loss value, and iteration is performed with the gradient descent method to continuously update the model parameters until the iteration is finished, obtaining the mask recognition model.
The second preset model comprises a backbone network and a remaining network; the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-training model, and the initial parameters of the remaining network of the second preset model are the parameters of the remaining network of the pre-training model. Because the pre-training sets the parameters of the remaining network to a better state, the convergence of the model in formal training can be accelerated.
In the formal training process, the parameters of the backbone network are frozen as in the pre-training process, and only the parameters of the remaining network are updated.
In this embodiment, the second training set and the fourth training set are obtained by cutting the pictures in the first training set and the third training set, respectively, the pre-training model is obtained by inputting the image blocks in the second training set into the first preset model for pre-training, and the mask recognition model is obtained by inputting the image blocks in the fourth training set into the second preset model for formal training on the basis of the pre-training model. According to the method, the picture is divided into a plurality of image blocks, and small image blocks are trained and recognized, so that the accuracy rate of recognizing small targets with long distances is improved.
Fig. 6 is a flowchart of a pre-training method for a mask recognition model according to a second embodiment of the present invention. This embodiment is a detailed description of step S303 in the first embodiment. As shown in fig. 6, the pre-training method of the mask recognition model includes the following steps.
S601, loading parameters obtained by training the ImageNet data set to a backbone network of a first preset model.
Inputting the image blocks in the ImageNet data set to a backbone network of a YOLOV4-tiny network model, training parameters of the model on the backbone network, and finally filling the trained parameters in corresponding positions of the backbone network in sequence.
For example, when the model of the backbone network is n = Ax + By + Cz, the ImageNet data set is input into the backbone network and trained, and the values of the parameters A, B and C can be obtained and filled in the corresponding positions of the backbone network.
S602, after the loading is finished, freezing the backbone network of the first preset model.
According to step S601, after each parameter has been filled in sequence into the corresponding position of the backbone network, freezing the backbone network ensures that the parameters of the backbone network are not changed during the pre-training.
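A minimal PyTorch sketch of steps S601 and S602 follows; the attribute name backbone and the checkpoint file name are assumptions for illustration, not identifiers defined by the patent.

import torch

def load_and_freeze_backbone(model: torch.nn.Module, checkpoint: str = "imagenet_backbone.pth"):
    state = torch.load(checkpoint, map_location="cpu")  # parameters obtained by training on ImageNet
    model.backbone.load_state_dict(state)               # S601: fill the parameters into the backbone in order
    for p in model.backbone.parameters():               # S602: freeze, so these parameters stay unchanged
        p.requires_grad = False
    return model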
And S603, inputting the first number of image blocks in the second training set into a first preset model for training each time to obtain a training result.
The first number of image blocks in the second training set is input to the YOLOV4-tiny network model, and the first number may be 16 image blocks, for example. And inputting the first number of image blocks each time until all the image blocks in the second training set are input into the YOLOV4-tiny network model, training the model according to the first number of image blocks input each time to obtain a training result each time, and outputting the training result.
S604, determining a YOLO loss value according to the label of the input image block and the training result of the input image block.
The label of an input image block is the category information of whether the face wears a mask; exemplarily, the category information is 1 when the face wears a mask and 0 when it does not. The training result of an image block is a real number between 0 and 1, for example 0.12 or 0.9: when the value is close to 1, the face target of the image block wears a mask, and when the value is close to 0, it does not. The difference between the label of the input image block and the training result of the input image block is taken as the YOLO loss value.
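The description above simplifies the YOLO loss to the difference between the image-block label (1 = mask worn, 0 = not worn) and the model output in [0, 1]. A minimal sketch of that simplified term is given below; the full YOLOv4 loss also contains bounding-box and objectness terms, which are omitted here as an assumption of this illustration.

import torch

def simplified_yolo_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # Mean absolute difference between the training results and the 0/1 labels.
    return (pred - label).abs().mean()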
And S605, performing back propagation according to the YOLO loss value to obtain the updated parameters of the remaining network of the first preset model.
Back propagation is performed according to the YOLO loss value, that is, the gradients of the YOLO loss value with respect to each model parameter are obtained, and the parameters of the remaining network of the first preset model are then updated with these gradients by the gradient descent method. Gradient descent is an iterative method for minimizing an objective function; that is, when solving for the minimum of the loss function, the minimum loss value and the corresponding model parameters can be obtained by iterating step by step with the gradient descent method.
And S606, updating the parameters of the remaining network of the first preset model by using the updated parameters.
And S607, judging whether the iteration condition is satisfied.
And after updating is completed each time, judging whether the iteration condition is met, executing the step S608 when the iteration condition is met, and returning to execute the step S603 when the iteration condition is not met.
The iteration condition is, for example, that a preset number of times of iterative training is completed, where the preset number may be 100 or 120, and the like, where one iterative training refers to performing one training on all image blocks in the second training set.
And S608, taking the model obtained by training as a pre-training model.
The parameters of the remaining network of the first preset model are iteratively updated through steps S603-S608 until the iteration condition is met, and the trained model is taken as the pre-training model.
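The following PyTorch sketch summarises steps S603 to S608: batches of a first number (for example 16) of image blocks are fed to the model, a YOLO loss value is computed from the labels and training results, back propagation is performed, and only the parameters that are not frozen (the remaining network) are updated by gradient descent until a preset number of passes is reached. The loader, model and loss objects are assumptions for illustration.

import torch

def pretrain(model, loader, yolo_loss, epochs: int = 100, lr: float = 1e-3):
    # Only parameters that are not frozen (i.e. the remaining network) are updated.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)  # gradient descent method
    for _ in range(epochs):                        # iteration condition: preset number of passes
        for blocks, labels in loader:              # S603: a first number of image blocks per step
            results = model(blocks)                # training result
            loss = yolo_loss(results, labels)      # S604: YOLO loss value
            optimizer.zero_grad()
            loss.backward()                        # S605: back propagation
            optimizer.step()                       # S606: update remaining-network parameters
    return model                                   # S608: the pre-training model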
After the pre-training is finished, all parameters of the output model are filled in sequence into the corresponding positions of the remaining network of the first preset network model, and formal training is then performed on the second preset network model.
In this embodiment, the parameters obtained by training on the ImageNet data set are loaded into the backbone network of the first preset network model, and the backbone network is frozen. The first preset network model is pre-trained with the data of the second training set; a YOLO loss value is obtained from the difference between the labels of the input image blocks and their training results, back propagation of the YOLO loss value yields the gradients of the YOLO loss value with respect to the model parameters, the model parameters are updated with these gradients by the gradient descent method, and the pre-training model is obtained through iteration. By pre-training the first preset network model, the method sets its parameters to a better state, thereby accelerating the convergence of the second preset network model in formal training.
Fig. 7 is a flowchart of a formal training method for a mask recognition model according to a third embodiment of the present invention. This embodiment is a detailed description of step S304 in the first embodiment. As shown in fig. 7, the formal training method of the mask recognition model includes the following steps.
And S701, loading parameters of the backbone network in the pre-training model to the backbone network of a second preset model.
The parameters of the backbone network of the pre-training model obtained in the second embodiment are filled in sequence into the corresponding positions of the backbone network of the second preset model.
S702, after the loading is finished, freezing the backbone network of the second preset model.
Freezing the backbone network of the second preset model means ensuring that the parameters of the backbone network are not changed while the remaining network of the second preset model is trained.
And S703, loading the parameters of the network of the rest part of the pre-training model to the network of the rest part of the second preset model.
Each parameter of the remaining network of the pre-training model obtained in the second embodiment is sequentially filled into the corresponding position of the remaining network of the second preset model, and formal training of the second preset model is performed on the basis of the parameters obtained by pre-training. This helps to accelerate the convergence of the second preset model during formal training and improves training efficiency.
And S704, inputting the first number of image blocks in the fourth training set into a second preset model for training each time to obtain a training result.
A first number of image blocks from the fourth training set are input into the second preset model each time; the specific first number and the method of inputting the image blocks are the same as in step S603 of the second embodiment and are not described here again.
S705, determining a YOLO loss value according to the label of the input image block and the training result of the input image block.
The label information of the input image block is the same as in step S604 of the second embodiment and is not described here again. The difference between the label of the input image block and the training result of the input image block is taken as the YOLO loss value.
And S706, performing back propagation according to the YOLO loss value to obtain the update parameters of the rest part network of the second preset model.
Performing back propagation according to the YOLO loss value means computing the gradient of the YOLO loss value with respect to each model parameter; iteration is then performed by gradient descent, and the model parameters are updated with the gradients. During formal training of the model, the relevant hyper-parameters of this training stage need to be set according to the specific training requirements of the stage, including the learning rate, the number of iterations and the learning-rate decay strategy. Tuning these parameters can speed up training and yield better model parameters, which further improves the recognition accuracy of the model.
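As an illustrative sketch of how these stage-specific settings might be configured (the numeric values and helper names are assumptions, not values given in the disclosure), building on the `pretrain` sketch above:

```python
import torch

def setup_formal_training(model, pretrained_model, lr=1e-3, decay_step=30, gamma=0.1):
    """Formal-training setup (sketch): load pre-trained parameters, freeze the
    backbone, and configure learning rate, iteration count and decay strategy."""
    # S701/S703: load backbone and remaining-network parameters from the pre-training model.
    model.load_state_dict(pretrained_model.state_dict())
    # S702: freeze the backbone of the second preset model.
    for p in model.backbone.parameters():
        p.requires_grad = False
    rest_params = [p for p in model.parameters() if p.requires_grad]
    # Stage-specific hyper-parameters: learning rate and learning-rate decay strategy.
    optimizer = torch.optim.SGD(rest_params, lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_step, gamma=gamma)
    return optimizer, scheduler
```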
And S707, updating the parameters of the rest network of the second preset model by using the updated parameters.
S708, judging whether the iteration condition is satisfied.
After each update, it is determined whether an iteration condition is satisfied, and when the iteration condition is satisfied, step S709 is executed, and when the iteration condition is not satisfied, the process returns to step S704.
The iteration condition is, for example, that the YOLO loss value has decreased to a preset level and no longer changes significantly during continued training; whether the YOLO loss value still changes can be judged from its variance. Specifically, when iterative training is performed by gradient descent, the YOLO loss value decreases continuously during training; when it has fallen to a low level and no longer decreases noticeably, the minimum YOLO loss value has been reached, the iteration is complete, and training is stopped.
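One illustrative way to implement such a plateau test (the window size and variance threshold are assumptions, not values given in the disclosure):

```python
from collections import deque
import statistics

class LossPlateauStopper:
    """Stop when the variance of recent YOLO loss values falls below a threshold."""

    def __init__(self, window=50, var_threshold=1e-4):
        self.losses = deque(maxlen=window)
        self.var_threshold = var_threshold

    def should_stop(self, yolo_loss_value: float) -> bool:
        self.losses.append(yolo_loss_value)
        if len(self.losses) < self.losses.maxlen:
            return False                      # not enough history yet
        # The loss is considered to have stopped changing when its variance is tiny.
        return statistics.pvariance(self.losses) < self.var_threshold
```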
And S709, sequentially filling all model parameters obtained in the formal training into corresponding positions of the rest part of the network in the second preset model to obtain the mask recognition model.
In this embodiment, on the basis of the pre-training model, the backbone network of the second preset model is frozen, its parameters being the same as those of the backbone network of the pre-training model. The YOLOV4-tiny network model is then formally trained with the data of the fourth training set: a YOLO loss value is obtained from the difference between the label of each input image block and the training result for that image block, back propagation is performed on the YOLO loss value to obtain the gradient of the loss with respect to the model parameters, and the model parameters are updated with these gradients by iterative gradient descent. During training the YOLO loss value decreases continuously; when it has fallen to a low level and no longer decreases noticeably, i.e. the minimum YOLO loss value has been reached, training is stopped and the mask recognition model is obtained. Formal training of the YOLOV4-tiny network model in this way improves the recognition accuracy for the target detection object.
Fig. 8 is a flowchart of a mask recognition method according to a fourth embodiment of the present invention, which uses the mask recognition model obtained by training according to the first, second and third embodiments. As shown in fig. 8, the mask recognition method includes the following steps.
S801, cutting a picture to be recognized into a plurality of image blocks, and determining position information of each image block in the picture to be recognized, wherein the picture to be recognized comprises at least one human face target.
The picture to be recognized is an image obtained from a surveillance video; specifically, the surveillance video obtained in real time is split into individual frames. An image obtained from the surveillance video contains at least one face target, for example 1 or 3 face targets.
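A minimal sketch of such frame extraction, assuming OpenCV and an illustrative stream URL (the URL is not specified in the disclosure):

```python
import cv2

def frames_from_stream(source="rtsp://camera.example/stream"):
    """Yield frames one by one from a (real-time) surveillance video stream."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break          # stream ended or read failed
            yield frame        # each frame becomes a picture to be recognized
    finally:
        cap.release()
```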
The image to be recognized is cut into a plurality of image blocks with the same size, and the specific cutting method is the same as that in the first embodiment, and is not described herein again.
The method for determining the position information of each image block in the picture to be recognized is the same as that in the first embodiment, and is not described herein again.
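As an illustrative sketch only (the block size follows the 416 x 416 pixels mentioned elsewhere in this description; the stride is an assumption), the cutting and position bookkeeping could be done as follows:

```python
def cut_into_blocks(picture, block=416, stride=208):
    """Cut a picture (H x W x C array) into equal-size blocks by a sliding window
    and record each block's position (x, y offsets) in the original picture."""
    h, w = picture.shape[:2]
    blocks = []
    for y in range(0, max(h - block, 0) + 1, stride):
        for x in range(0, max(w - block, 0) + 1, stride):
            blocks.append({
                "image": picture[y:y + block, x:x + block],
                "position": (x, y),   # top-left corner in the picture to be recognized
            })
    return blocks
```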
And S802, inputting the plurality of image blocks into the mask recognition model to obtain a first recognition result of each image block, wherein the first recognition result is used for indicating whether the face in the image block wears the mask or not.
The plurality of image blocks are input into the mask recognition model, where the number of image blocks is the number of blocks obtained by cutting the picture to be recognized.
The first recognition result obtained for each image block through the mask recognition model is the category information indicating whether the face in the image block wears a mask. The first recognition result is a real number between 0 and 1: a value close to 1 indicates that the face target in the image block wears a mask, and a value close to 0 indicates that it does not.
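For illustration only (the 0.5 threshold is an assumption, not a value given in the disclosure), such a score can be mapped to a wearing / not-wearing decision as follows:

```python
def interpret_result(score: float, threshold: float = 0.5) -> str:
    """Map the [0, 1] first recognition result to a category label."""
    return "mask worn" if score >= threshold else "no mask"
```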
And S803, when the same face target in the picture to be recognized exists in different image blocks, calculating the confidence of each image block where the face target is located, and selecting the first recognition result of the image block with the maximum confidence as the recognition result of the face target.
When the same face target in the picture to be recognized appears in different image blocks, the confidence of each image block is calculated according to the position information of each image block containing the face target in the picture to be recognized and the face target detection frame in each image block.
The confidence degrees of all image blocks with the same human face target are calculated, and the first recognition result of the image block with the maximum confidence degree is used as the recognition result of the human face target.
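A hedged sketch of this selection step, following the confidence calculation described in claim 9 below (the ratio of the face detection frame in the block to the face detection frame in the whole picture); the dictionary keys are illustrative assumptions:

```python
def pick_result(candidates):
    """candidates: list of dicts, one per image block containing the same face target:
       {"position": (x, y),            # block offset in the picture to be recognized
        "box_area_in_block": float,    # area of the face detection frame in the block
        "box_area_in_picture": float,  # area of the face detection frame in the picture
        "result": float}               # first recognition result of the block
    Returns the first recognition result of the block with the highest confidence."""
    best = max(candidates,
               key=lambda c: c["box_area_in_block"] / c["box_area_in_picture"])
    return best["result"]
```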
And S804, outputting the recognition result of the face target in the picture to be recognized.
When the picture to be recognized has a plurality of human face targets, the recognition result of each human face target can be respectively obtained according to the first recognition result of the image block where the human face target is located.
In this embodiment, an image obtained from a surveillance video is used as the picture to be recognized. The picture is cut into a plurality of image blocks of the same size, the image blocks are input into the mask recognition model to obtain a first recognition result for each block, and the result indicating whether each face target in the picture wears a mask is derived from these first recognition results. When the same face target appears in different image blocks, the confidence of each such block is calculated and the first recognition result of the block with the highest confidence is selected to indicate whether that face wears a mask. In this way, the mask recognition model is used to identify and check whether target objects wear masks, and the accuracy of the recognition model is further verified from the recognition results.
Fig. 9 is a schematic structural diagram of a training apparatus for a mask recognition model according to a fifth embodiment of the present invention. As shown in fig. 9, the training device 90 for mask recognition models includes: a first cutting module 901, a second cutting module 902, a pre-training module 903 and a formal training module 904.
The first cutting module 901 is configured to cut each picture in the first training set into a plurality of image blocks respectively, and to label the image blocks to obtain a second training set, where the first training set is a first face mask data set, and the labels of the image blocks in the second training set include position information of the image blocks in the picture and category information of whether the face in the image blocks wears a mask;
the second cutting module 902 is configured to cut each picture in the third training set into a plurality of image blocks respectively, and to label the image blocks to obtain a fourth training set, where the third training set is a second face mask data set, and the labels of the image blocks in the fourth training set include position information of the image blocks in the picture and category information of whether the face in the image blocks wears a mask;
a pre-training module 903, configured to pre-train a first preset model using the second training set to obtain a pre-training model, where the first preset model includes a backbone network and a remaining network, and the parameters of the backbone network of the first preset model are frozen during the pre-training process;
and a formal training module 904, configured to perform formal training using the fourth training set and a second preset model to obtain the mask recognition model, where the second preset model includes a backbone network and a remaining network, the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-training model, the initial parameters of the remaining network of the second preset model are the parameters of the remaining network of the pre-training model, and the parameters of the backbone network of the second preset model are frozen during the formal training process.
In a possible implementation manner, the first preset model adopts a YOLOV4-tiny network model, and the pre-training module 903 is specifically configured to:
loading parameters obtained by training the ImageNet data set to a backbone network of a first preset model;
after the loading is finished, freezing a backbone network of the first preset model;
updating parameters of the rest part network of the first preset model by using the following iterative process until the iterative condition is met, and taking the trained model as a pre-training model:
inputting a first number of image blocks in a second training set into a first preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain an update parameter of the rest network of the first preset model;
the parameters of the remaining network of the first pre-set model are updated using the updated parameters.
In a possible implementation manner, the second preset model is a YOLOV4-tiny network model, and the formal training module 904 is specifically configured to:
loading parameters of the backbone network of the pre-training model to the backbone network of the second preset model;
after the loading is finished, freezing a backbone network of a second preset model;
loading parameters of the remaining network of the pre-training model to the remaining network of the second preset model;
and updating parameters of the rest part network of the second preset model by using the following iterative process until the iterative condition is met, and taking the trained model as the mask recognition model:
inputting the first number of image blocks in the fourth training set into a second preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain the update parameters of the rest part network of the second preset model;
the parameters of the remaining network of the second pre-set model are updated using the updated parameters.
In a possible implementation manner, the first cutting module 901 and the second cutting module 902 are specifically configured to:
starting from any corner of the picture, sliding the picture frame by preset pixels, and cutting the image in the picture frame to form an image block, wherein the size of the preset pixels is smaller than the length and the width of the picture frame;
and obtaining the coordinates of the frame of the image block according to the picture frame coordinates marked on the picture and the size of the sliding picture frame.
In a possible implementation manner, the first cutting module 901 and the second cutting module 902 further include:
an enhancement unit, configured to perform data enhancement processing on the pictures in the first training set and the third training set, where the data enhancement processing includes one or more of the following processing: randomly adjusting the size of the picture, randomly adjusting the contrast of the picture, randomly adjusting the tone of the picture, randomly adjusting the brightness of the picture, randomly adding noise to the picture, randomly changing a color model of the picture and randomly cutting the picture.
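As a hedged sketch of such an enhancement unit covering a few of the listed operations (the probability and parameter ranges are illustrative assumptions only):

```python
import random
import numpy as np

def enhance(picture):
    """Apply random enhancement operations to a picture (H x W x 3 uint8 array).
    Only contrast, brightness and noise are shown; resizing, hue, color-model
    conversion and random cropping would be applied in the same spirit."""
    img = picture.astype(np.float32)
    if random.random() < 0.5:                       # randomly adjust contrast
        img = (img - img.mean()) * random.uniform(0.8, 1.2) + img.mean()
    if random.random() < 0.5:                       # randomly adjust brightness
        img = img + random.uniform(-25, 25)
    if random.random() < 0.5:                       # randomly add noise
        img = img + np.random.normal(0, 5, img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)
```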
In a possible implementation manner, the size of the image block in the first cutting module 901 or the second cutting module 902 is 416 × 416 pixels, the backbone network in the pre-training module or the formal training module includes 6 cascaded cross-stage partial CSP networks, and each CSP network is configured to perform feature extraction on an input image;
the target CSP network in the main network is connected with the input of the CAT module, the output end of the CAT module is connected with the rest part network, the characteristic graphs extracted by the target CSP network are 26 pixels by 26 pixels and 13 pixels by 13, and the CAT module is used for connecting the characteristic graphs extracted by the target CSP network.
In one possible implementation manner, the proportion of the face targets of a part of pictures in the first face mask data set is greater than a preset threshold, and the proportion of the face targets of the rest of pictures is less than or equal to the preset threshold;
and the proportion of the face target contained in each picture in the second face mask data set is greater than a preset threshold value.
The apparatus provided in this embodiment may be used to perform the method steps of the first embodiment, the second embodiment, or the third embodiment, and specific implementation and technical effects are similar and will not be described herein again.
Fig. 10 is a schematic structural diagram of a mask recognition device according to a sixth embodiment of the present invention. As shown in fig. 10, the mask recognition device 10 includes: the cutting module 110, the input module 120, the calculation module 130 and the output module 140.
The cutting module 110 is configured to cut a picture to be recognized into a plurality of image blocks, and determine position information of each image block in the picture to be recognized, where the picture to be recognized includes at least one human face target;
the input module 120 is configured to input the plurality of image blocks into the mask recognition model to obtain a first recognition result of each image block, where the first recognition result is used to indicate whether a face in the image block wears a mask;
the calculation module 130 is configured to calculate confidence levels of the image blocks in which the human face targets are located according to position information of the image blocks in which the human face targets are located in the image to be recognized and human face target detection frames in the image blocks when the same human face target in the image to be recognized exists in different image blocks, and select a first recognition result of the image block with the highest confidence level as a recognition result of the human face target;
and the output module 140 is configured to output a recognition result of the face target in the picture to be recognized.
The apparatus provided in this embodiment may be configured to perform the method steps of the fourth embodiment, and the specific implementation manner and the technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of an electronic device 11 for mask recognition according to a seventh embodiment of the present invention. As shown in fig. 11, the electronic device 11 includes:
at least one processor 111; and
a memory 112 communicatively coupled to the at least one processor 111; wherein,
the memory 112 stores instructions executable by the at least one processor 111, the instructions being executable by the at least one processor 111 to enable the at least one processor 111 to perform the training method of the mask recognition model as described above.
For a specific implementation process of the processor 111, reference may be made to the above method embodiment, and a specific implementation manner and a technical effect are similar, which are not described herein again.
An eighth embodiment of the present invention provides a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the method steps in the foregoing method embodiments. The specific implementation manner and technical effects are similar and are not described here again.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (16)
1. A training method of a mask recognition model is characterized by comprising the following steps:
respectively cutting each picture in a first training set into a plurality of image blocks, and labeling the image blocks to obtain a second training set, wherein the first training set is a first face mask data set, and labels of the image blocks in the second training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
respectively cutting each picture in a third training set into a plurality of image blocks, and labeling the image blocks to obtain a fourth training set, wherein the third training set is a second face mask data set, and labels of the image blocks in the fourth training set comprise position information of the image blocks in the picture and class information of whether a face in the image blocks wears a mask or not;
pre-training a first preset model by using the second training set to obtain a pre-trained model, wherein the first preset model comprises a backbone network and a rest network, and parameters of the backbone network of the first preset model are frozen in the pre-training process;
and performing formal training by using the fourth training set and a second preset model to obtain a mask recognition model, wherein the second preset model comprises a backbone network and a rest network, parameters of the backbone network of the second preset model are parameters of the backbone network of the pre-training model, initial parameters of the rest network of the second preset model are parameters of the rest network of the pre-training model, and the parameters of the backbone network of the second preset model are frozen in the formal training process.
2. The method of claim 1, wherein the first pre-set model is a YOLOV4-tiny network model, and the pre-training the first pre-set model with the second training set to obtain a pre-trained model comprises:
loading parameters obtained by training on the ImageNet data set to a backbone network of the first preset model;
after the loading is finished, freezing a backbone network of the first preset model;
updating parameters of the rest network of the first preset model by using the following iterative process until the iterative condition is met, and taking the trained model as the pre-training model:
inputting a first number of image blocks in the second training set into the first preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain an update parameter of the rest network of the first preset model;
and updating the parameters of the rest network of the first preset model by using the updated parameters.
3. The method according to claim 2, wherein the second preset model adopts a YOLOV4-tiny network model, and the formal training using the fourth training set and the second preset model to obtain the mask recognition model comprises:
loading parameters of the backbone network of the pre-training model to the backbone network of the second preset model;
after the loading is finished, freezing the backbone network of the second preset model;
loading parameters of a network of a remaining portion of the pre-training model onto the network of the remaining portion of the second pre-set model;
updating the parameters of the rest network of the second preset model by using the following iterative process until the iterative condition is met, and taking the trained model as the mask recognition model:
inputting the first number of image blocks in the fourth training set into the second preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain an update parameter of the rest network of the second preset model;
and updating the parameters of the rest network of the second preset model by using the updated parameters.
4. The method according to any of claims 1-3, wherein the slicing each picture in the first training set and the third training set into a plurality of tiles comprises:
starting from any corner of the picture, sliding an image frame by using preset pixels, and cutting an image in the image frame to form the image block, wherein the size of the preset pixels is smaller than the length and the width of the image frame;
and obtaining the coordinates of the frame of the image block according to the coordinates of the picture frame marked in the picture and the size of the sliding picture frame.
5. The method of claim 4, wherein before the pictures in the first training set and the third training set are respectively cut into a plurality of image blocks, the method further comprises:
performing data enhancement processing on the pictures in the first training set and the third training set, wherein the data enhancement processing comprises one or more of the following processing: randomly adjusting the size of the picture, randomly adjusting the contrast of the picture, randomly adjusting the tone of the picture, randomly adjusting the brightness of the picture, randomly adding noise to the picture, randomly changing a color model of the picture and randomly cutting the picture.
6. The method according to claim 2 or 3, wherein the image block has a size of 416 x 416 pixels, and the backbone network includes 6 cascaded cross-stage partial CSP networks, each CSP network being configured to perform feature extraction on an input image;
the target CSP network in the backbone network is connected with the input of a CAT module, the output end of the CAT module is connected with the rest network, the feature maps extracted by the target CSP network are 26 pixels by 26 pixels and 13 pixels by 13 pixels, and the CAT module is used for concatenating the feature maps extracted by the target CSP network.
7. The method according to any one of claims 1 to 3, wherein the proportion of the face targets of a part of the pictures in the first face mask data set is greater than a preset threshold, and the proportion of the face targets of the remaining part of the pictures is less than or equal to the preset threshold;
and the proportion of the face target included in each picture in the second face mask data set is greater than the preset threshold value.
8. A mask recognition method applied to a mask recognition model trained by the method of any one of claims 1 to 7, the method comprising:
cutting a picture to be recognized into a plurality of image blocks, and determining position information of each image block in the picture to be recognized, wherein the picture to be recognized comprises at least one face target;
inputting the plurality of image blocks into the mask recognition model to obtain a first recognition result of each image block, wherein the first recognition result is used for indicating whether the face in the image block wears a mask or not;
when the same face target in the picture to be recognized exists in different image blocks, calculating the confidence coefficient of each image block in which the face target is located according to the position information of each image block in which the face target is located in the picture to be recognized and a face target detection frame in each image block, and selecting a first recognition result of the image block with the maximum confidence coefficient as the recognition result of the face target;
and outputting the recognition result of the face target in the picture to be recognized.
9. The method according to claim 8, wherein the calculating the confidence of each image block in which the face target is located according to the position information of each image block in which the face target is located in the picture to be recognized and the face target detection frame in each image block comprises:
restoring each image block to the picture to be recognized according to the position information of each image block where the human face target is located in the picture to be recognized;
for each image block, calculating the ratio of a human face target detection frame in the image block to a human face target detection frame in the picture to be recognized;
and taking the calculated ratio as the confidence coefficient of the image block.
10. A training device for a mask recognition model, comprising:
a first cutting module, configured to cut each picture in a first training set into a plurality of image blocks respectively and label the image blocks to obtain a second training set, wherein the first training set is a first face mask data set, each picture is labeled with the coordinates of a picture frame and the category information of whether a face wears a mask, and the label of each image block in the second training set is used for indicating the coordinates of the image block frame and the category information of whether the face wears a mask;
a second cutting module, configured to cut each picture in a third training set into a plurality of image blocks respectively and label the image blocks to obtain a fourth training set, wherein the third training set is a second face mask data set, each picture is labeled with the coordinates of a picture frame and the category information of whether a face wears a mask, and the label of each image block in the fourth training set is used for indicating the coordinates of the image block frame and the category information of whether the face wears a mask;
the pre-training module is used for pre-training a first pre-set model by using the second training set to obtain a pre-training model, wherein the first pre-set model comprises a trunk network and a rest network, and parameters of the trunk network of the first pre-set model are frozen in the pre-training process;
and the formal training module is used for performing formal training by using the fourth training set and a second preset model to obtain a mask recognition model, wherein the second preset model comprises a trunk network and a remaining part network, parameters of the trunk network of the second preset model are parameters of the trunk network of the pre-training model, initial parameters of the remaining part network of the second preset model are parameters of the remaining part network of the pre-training model, and the parameters of the trunk network of the second preset model are frozen in the formal training process.
11. The apparatus of claim 10, wherein the first predetermined model is a YOLOV4-tiny network model, and the pre-training module is specifically configured to:
loading parameters obtained by training on the ImageNet data set to a backbone network of the first preset model;
after the loading is finished, freezing a backbone network of the first preset model;
updating parameters of the rest network of the first preset model by using the following iterative process until the iterative condition is met, and taking the trained model as the pre-training model:
inputting a first number of image blocks in the second training set into the first preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain an update parameter of the rest network of the first preset model;
and updating the parameters of the rest network of the first preset model by using the updated parameters.
12. The apparatus of claim 11, wherein the second predetermined model is a YOLOV4-tiny network model, and the formal training module is specifically configured to:
loading parameters of the backbone network of the pre-training model to the backbone network of the second preset model;
after the loading is finished, freezing the backbone network of the second preset model;
loading parameters of a network of a remaining portion of the pre-training model onto the network of the remaining portion of the second pre-set model;
updating the parameters of the rest network of the second preset model by using the following iterative process until the iterative condition is met, and taking the trained model as the mask recognition model:
inputting the first number of image blocks in the fourth training set into the second preset model for training each time to obtain a training result;
determining a YOLO loss value according to the label of the input image block and the training result of the input image block;
performing back propagation according to the YOLO loss value to obtain an update parameter of the rest network of the second preset model;
and updating the parameters of the rest network of the second preset model by using the updated parameters.
13. The device according to claim 10, characterized in that said first and second cutting modules are in particular adapted to:
starting from any corner of the picture, sliding an image frame by using a preset pixel, and cutting an image in the image frame to form the image block, wherein the size of the preset pixel is smaller than the length and the width of the image frame;
and obtaining the coordinates of the frame of the image block according to the coordinates of the picture frame marked in the picture and the size of the sliding picture frame.
14. A mask recognition apparatus applied to a mask recognition model trained by the method of any one of claims 1 to 7, the apparatus comprising:
the image recognition system comprises a cutting module, a recognition module and a processing module, wherein the cutting module is used for cutting a picture to be recognized into a plurality of image blocks and determining the position information of each image block in the picture to be recognized, and the picture to be recognized comprises at least one face target;
the input module is used for inputting the image blocks into the mask identification model to obtain a first identification result of each image block, and the first identification result is used for indicating whether the face in the image block wears a mask or not;
the calculation module is used for calculating the confidence of each image block where the human face target is located according to the position information of each image block where the human face target is located in the picture to be recognized and the human face target detection frame in each image block when the same human face target in the picture to be recognized exists in different image blocks, and selecting the first recognition result of the image block with the maximum confidence as the recognition result of the human face target;
and the output module is used for outputting the recognition result of the face target in the picture to be recognized.
15. An electronic device for mask recognition, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
16. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the method of any one of claims 1 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210549746.2A CN114898434A (en) | 2022-05-20 | 2022-05-20 | Method, device and equipment for training mask recognition model and storage medium |
PCT/CN2023/080248 WO2023221608A1 (en) | 2022-05-20 | 2023-03-08 | Mask recognition model training method and apparatus, device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210549746.2A CN114898434A (en) | 2022-05-20 | 2022-05-20 | Method, device and equipment for training mask recognition model and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114898434A true CN114898434A (en) | 2022-08-12 |
Family
ID=82724758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210549746.2A Pending CN114898434A (en) | 2022-05-20 | 2022-05-20 | Method, device and equipment for training mask recognition model and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114898434A (en) |
WO (1) | WO2023221608A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116052094A (en) * | 2023-03-07 | 2023-05-02 | 浙江华是科技股份有限公司 | Ship detection method, system and computer storage medium |
WO2023221608A1 (en) * | 2022-05-20 | 2023-11-23 | 卡奥斯工业智能研究院(青岛)有限公司 | Mask recognition model training method and apparatus, device, and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830304B (en) * | 2024-03-04 | 2024-05-24 | 浙江华是科技股份有限公司 | Water mist ship detection method, system and computer storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113989179B (en) * | 2021-09-08 | 2024-06-28 | 湖南工业大学 | Train wheel set tread defect detection method and system based on target detection algorithm |
CN114283462B (en) * | 2021-11-08 | 2024-04-09 | 上海应用技术大学 | Mask wearing detection method and system |
CN114283469B (en) * | 2021-12-14 | 2022-09-23 | 贵州大学 | Improved YOLOv4-tiny target detection method and system |
CN114898434A (en) * | 2022-05-20 | 2022-08-12 | 卡奥斯工业智能研究院(青岛)有限公司 | Method, device and equipment for training mask recognition model and storage medium |
- 2022-05-20: CN CN202210549746.2A patent/CN114898434A/en active Pending
- 2023-03-08: WO PCT/CN2023/080248 patent/WO2023221608A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2023221608A1 (en) | 2023-11-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||