CN113536965B - Method and related device for training a face occlusion recognition model - Google Patents

Method and related device for training a face occlusion recognition model

Info

Publication number
CN113536965B
Authority
CN
China
Prior art keywords
image
face
layer
detected
convolution
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110711009.3A
Other languages
Chinese (zh)
Other versions
CN113536965A (en)
Inventor
曾梦萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202110711009.3A
Publication of CN113536965A
Application granted
Publication of CN113536965B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

Embodiments of the invention relate to the technical field of intelligent recognition and disclose a method and a related device for training a face occlusion recognition model. Because a depthwise separable convolution layer has few parameters and requires little computation, a neural network built on depthwise separable convolution layers is lightweight and fast to train. Moreover, depthwise separable convolution layers with a stride greater than 1 are used in the first M layers for downsampling, so the feature maps they generate have low resolution, a large receptive field, and spatial invariance, which improves the accuracy of the trained face occlusion recognition model.

Description

Method and related device for training a face occlusion recognition model
Technical Field
Embodiments of the present invention relate to the technical field of intelligent recognition, and in particular to a method and a related device for training a face occlusion recognition model.
Background
With the continuous progress of machine learning technology, recognition technology is ever more widely applied in daily life. When analyzing a face, some application scenarios require detecting whether the face is covered by an occluder and identifying the category of the occluder.
For example, during the current epidemic, public places need to detect whether people are wearing masks. In daily skin analysis, it may be necessary to detect accessories on the user's face, such as whether a hat or glasses are worn. Existing occlusion recognition algorithm models judge only whether an occlusion exists, based simply on pixel differences, and cannot identify the category of the occluder. Even when an existing object recognition algorithm is applied to face occluder recognition, it is easily interfered with by facial features and its accuracy is low.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method for training a face occlusion recognition model, a method for recognizing face occlusion, an electronic device, and a storage medium. The face occlusion recognition model trained by the method can accurately recognize multiple occluder categories, and the model converges quickly during training.
To solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for training a face occlusion recognition model, including:
acquiring an image sample set, wherein each image in the image sample set includes a face;
cropping the face area of a target image to generate a face region image, wherein the target image is any image in the image sample set;
dividing the face region image into at least one local region image, wherein each local region image is annotated with a true label, one local region image and its true label form a sample pair, and the true label includes an occluder category;
taking the at least one sample pair corresponding to each image in the image sample set as a training set, inputting the training set into a preset neural network for training, and stopping training when an iteration termination condition is met, thereby obtaining the face occlusion recognition model;
wherein the preset neural network includes a feature extraction network, the feature extraction network includes a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, and each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer;
and the stride of the depthwise convolution layer in the first M of the N depthwise separable convolution layers is a preset value greater than 1, where M is less than or equal to N.
In some embodiments, each depthwise separable convolution layer further includes a first linear convolution layer and a second linear convolution layer, and each convolution kernel in the first linear convolution layer and in the second linear convolution layer has a size of 1 × 1;
wherein the first linear convolution layer is located between the depthwise convolution layer and the pointwise convolution layer, and the second linear convolution layer is located after the pointwise convolution layer;
the number of convolution kernels in the first linear convolution layer is a preset multiple, greater than 1, of the number of convolution kernels in the depthwise convolution layer;
and the number of convolution kernels in the second linear convolution layer is the same as the number of convolution kernels in the pointwise convolution layer.
In some embodiments, the method further includes:
smoothing the true label of each sample pair to obtain smoothed true labels, so that the smoothed true labels participate in training the preset neural network, wherein the smoothing adds noise to the true labels.
In some embodiments, the step of smoothing the true label of each sample pair to obtain smoothed true labels includes:
smoothing a target true label according to the following formula to obtain a smoothed target true label, wherein the target true label is any true label:

ŷ_k = y_k × (1 − α) + α / K

where k is the occluder category, ŷ_k is the probability of category k in the smoothed target true label, y_k is the probability of category k in the target true label (y_k equals 1 when k is the correct occluder category and 0 otherwise), α is a preset parameter value, and K is the total number of occluder categories in the training set.
In some embodiments, before the step of inputting the training set into the preset neural network for training, the method further includes:
performing data augmentation on the training set.
In some embodiments, the local region image is an image in which a local region of the face region image exhibits geometric features.
To solve the above technical problem, in a second aspect, an embodiment of the present invention provides a method for recognizing face occlusion, including:
acquiring an image to be detected, wherein the image to be detected includes a face;
cropping the face area of the image to be detected to generate a face region image to be detected;
dividing the face region image to be detected into at least one local region image to be detected;
inputting the at least one local region image to be detected into the face occlusion recognition model trained by the method according to the first aspect, wherein the face occlusion recognition model outputs the occluder category of each local region image to be detected;
and determining the occlusion condition of the image to be detected according to the occluder category of each local region image to be detected.
In some embodiments, the method further includes (a sketch follows this list):
acquiring a region attribute of a target local region image to be detected, wherein the region attribute reflects the facial geometric feature contained in the target local region image to be detected, and the target local region image to be detected is any local region image to be detected;
judging whether the region attribute matches the occluder category of the target local region image to be detected;
and if not, determining that the target local region image to be detected contains no occluder.
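A minimal sketch of this region-attribute check, assuming an illustrative mapping from facial regions to plausible occluder categories (all names below are assumptions for illustration, not part of the patent text):

    # Plausible occluders per facial region (assumed mapping for illustration).
    ALLOWED_OCCLUDERS = {
        "forehead": {"bangs", "hat"},
        "eyes": {"glasses", "facial_mask"},
        "nose_to_chin": {"mask", "nose_patch"},
    }

    def resolve_occlusion(region_attribute: str, predicted_label: str) -> str:
        """Keep the predicted occluder category only if it is plausible for the
        region; otherwise treat the region as containing no occluder."""
        if predicted_label in ALLOWED_OCCLUDERS.get(region_attribute, set()):
            return predicted_label
        return "no_occlusion"

    # A "glasses" prediction on the nose-to-chin region is rejected:
    print(resolve_occlusion("nose_to_chin", "glasses"))  # -> no_occlusion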
To solve the above technical problem, in a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to the first aspect.
To solve the above technical problem, in a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium storing computer executable instructions that, when executed, cause an electronic device to perform the method according to the first aspect.
The embodiments of the present invention have the following beneficial effects. Compared with the prior art, in the method for training a face occlusion recognition model provided by the embodiments of the present invention, the images used for training are all local region images. Compared with learning features of the whole image, learning from local region images allows the preset neural network to better learn the local features of specific image regions while excluding interference from other regions, so the preset neural network converges quickly to the face occlusion recognition model and the classification accuracy of the trained model is improved. Second, dividing the image into local regions reduces the image size, which speeds up computation during both model training and model prediction. In addition, the preset neural network includes a feature extraction network comprising a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, where each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer, and the stride of the depthwise convolution layer in the first M of the N depthwise separable convolution layers is a preset value greater than 1, with M less than or equal to N. Because depthwise separable convolution layers have few parameters and require little computation, the parameter count and computation of the feature extraction network, and thus of the whole preset neural network, are effectively reduced, making the network lightweight and fast to train. Downsampling with depthwise separable convolution layers whose stride is greater than 1 enlarges the receptive field of the network and emphasizes invariance of the feature space, so the finally generated feature maps have low resolution, a large receptive field, and spatial invariance, which improves the accuracy of the trained face occlusion recognition model.
Drawings
One or more embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals indicate similar elements, and the figures are not to scale unless otherwise indicated.
FIG. 1 is a schematic view of an operating environment of a method for training a face occlusion recognition model and a method for recognizing a face occlusion according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a face occlusion recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic layer structure of a preset neural network according to an embodiment of the present invention;
FIG. 5(a) is a schematic convolution operation diagram of a standard convolution layer according to an embodiment of the present invention, FIG. 5(b) is a schematic convolution operation diagram of a depthwise convolution layer according to an embodiment of the present invention, and FIG. 5(c) is a schematic convolution operation diagram of a pointwise convolution layer according to an embodiment of the present invention;
FIG. 6 is a schematic layer structure of a depthwise separable convolution layer according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for recognizing face occlusion according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for recognizing face occlusion according to another embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that, provided they do not conflict, the various features of the embodiments of the present invention may be combined with each other, and all such combinations fall within the protection scope of the present application. In addition, although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps may be performed in an order different from the module division in the device or the order in the flowcharts. Moreover, the words "first", "second", "third", and the like used herein do not limit the data or the order of execution, but merely distinguish identical or similar items having substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
FIG. 1 is a schematic view of the operating environment of the related methods in an embodiment of the present invention, where the related methods include the method of training a face occlusion recognition model and the method of recognizing face occlusion. Referring to fig. 1, the operating environment includes an electronic device 10 and an image acquisition device 20, which are communicatively connected.
The communication connection may be a wired connection, for example a fiber-optic cable, or a wireless communication connection, such as a WIFI connection, a Bluetooth connection, a 4G wireless communication connection, or a 5G wireless communication connection.
The image acquisition device 20 is configured to acquire the image sample set, where each image in the image sample set includes a face, and may also be configured to acquire the image to be detected, which includes a face. The image acquisition device 20 may be a terminal capable of capturing images, for example a mobile phone, a tablet computer, a video recorder, or a camera.
The electronic device 10 is a device capable of automatically processing massive data at high speed according to a program, and is generally composed of a hardware system and a software system, for example a computer or a smartphone. The electronic device 10 may be a local device directly connected to the image acquisition device 20; it may also be a cloud device, for example a cloud server, cloud host, cloud service platform, or cloud computing platform. A cloud device is connected to the image acquisition device 20 via a network, and the two are communicatively connected through a predetermined communication protocol, which in some embodiments may be TCP/IP, NetBEUI, or IPX/SPX.
It can be appreciated that the image acquisition device 20 and the electronic device 10 may also be integrated as a single device, such as a computer or smartphone with a camera.
The electronic device 10 receives the image sample set sent by the image acquisition device 20, where each image includes a face, trains on the image sample set to obtain the face occlusion recognition model, and uses the model to detect the occluder categories of the face in the image to be detected sent by the image acquisition device 20. It can be appreciated that the method of training the face occlusion recognition model and the method of recognizing face occlusion may be performed on the same electronic device or on different electronic devices.
On the basis of fig. 1, other embodiments of the present invention provide an electronic device 10. Referring to fig. 2, which is a hardware configuration diagram of the electronic device 10 provided in an embodiment of the present invention, the electronic device 10 includes at least one processor 11 and a memory 12 that are communicatively connected (in fig. 2, connection by a bus and one processor are taken as an example).
The processor 11 is configured to provide computing and control capabilities to control the electronic device 10 to perform corresponding tasks, for example to perform any method of training a face occlusion recognition model or any method of recognizing face occlusion provided in the following embodiments of the invention.
It can be understood that the processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP); it may also be a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory 12, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the method of training a face occlusion recognition model or to the method of recognizing face occlusion in the embodiments of the present invention. By running the non-transitory software programs, instructions, and modules stored in the memory 12, the processor 11 can implement the method of training the face occlusion recognition model and the method of recognizing face occlusion in any of the method embodiments described below. In particular, the memory 12 may include high-speed random access memory and non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 12 may also include memory located remotely from the processor and connected to it via a network; examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Referring to fig. 3, an embodiment of the present invention provides a method S20 of training a face occlusion recognition model. The method S20 includes, but is not limited to, the following steps:
S21: acquiring an image sample set, wherein each image in the image sample set includes a face.
S22: cropping the face area of a target image to generate a face region image, wherein the target image is any image in the image sample set.
S23: dividing the face region image into at least one local region image, wherein each local region image is annotated with a true label, one local region image and its true label form a sample pair, and the true label includes an occluder category.
S24: taking the at least one sample pair corresponding to each image in the image sample set as a training set, inputting the training set into a preset neural network for training, and stopping training when an iteration termination condition is met to obtain the face occlusion recognition model;
wherein the preset neural network includes a feature extraction network, the feature extraction network includes a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, and each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer;
and the stride of the depthwise convolution layer in the first M depthwise separable convolution layers is a preset value greater than 1, where M is less than or equal to N.
Each image in the image sample set includes a face and may be acquired by the image acquisition device; for example, an image may be an identification photo or a selfie captured by the image acquisition device. It can be understood that an occlusion condition exists in each image, that is, an occluder is present on at least part of the face area, and the occluder categories can be set according to the scenario the model is intended to recognize.
In one optional scenario, for example where public places need to detect whether people are wearing masks, the face in every image in the image sample set wears a mask. In another optional scenario, for detecting facial accessories, such as whether the user wears a hat or glasses, has bangs or a facial mask, or wears a nose patch, the faces in some of the images wear hats, the faces in some images wear glasses, some have bangs, some wear facial masks, and some wear nose patches. It can be understood that one or more occluder categories may be present in the same image; for example, a face in an image may both wear glasses and have bangs. Of course, the above two usage scenarios can also be combined, that is, six occluder categories need to be detected: mask, hat, glasses, bangs, facial mask, and nose patch. In this scenario a face in the same image may, for example, wear glasses, wear a mask, and have bangs.
It can be understood that an image includes a face and a background, where the face is the target area for occluder detection. To reduce interference from the background on occlusion detection and shorten the training time of the subsequent algorithm model, each image in the sample set is cropped. The following description uses a target image, which is any image in the image sample set: the face area of the target image is cropped to generate a face region image. Specifically, as shown in fig. 4, a number of facial keypoints can be located by an existing facial keypoint algorithm, including points of the eyebrows, eyes, nose, mouth, and face contour, and the face area is then cropped according to the face contour.
The existing facial keypoint algorithm may be Active Appearance Models (AAM), Constrained Local Models (CLM), Explicit Shape Regression (ESR), the Supervised Descent Method (SDM), or the like.
The face region image is then divided into at least one local region image, where a local region image is an image in which a local region of the face region image exhibits geometric features; for example, the eye region, forehead region, nose region, and mouth region all belong to local regions of the face region image and exhibit geometric features. Thus, a local region image includes any one or more of the following region images: a forehead region image, an eye region image, and a nose-to-chin region image. For example, if a public place only needs to detect whether a person wears a mask, the at least one local region image includes only the nose-to-chin region image. When detecting whether the user wears a hat or glasses, has bangs or a facial mask, or wears a nose patch, the at least one local region image includes the forehead region image, the eye region image, and the nose-to-chin region image. When all six occluder categories (mask, hat, glasses, bangs, facial mask, and nose patch) need to be detected, the at least one local region image likewise includes the forehead region image, the eye region image, and the nose-to-chin region image.
To divide the local region images, the facial keypoint algorithm above is used to locate the keypoints of the eyes, nose, mouth, and face contour, and each local region is then cropped according to the coordinate information of the keypoints of that region. To facilitate network learning, the local region images are scaled to a uniform size, for example 64 × 64 × 3, as sketched below.
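A minimal sketch of this cropping-and-scaling step, assuming OpenCV and an (N, 2) keypoint array from any of the facial keypoint algorithms above; the region boundaries chosen below are illustrative assumptions, not boundaries fixed by the patent:

    import cv2
    import numpy as np

    def crop_local_regions(face_img: np.ndarray, landmarks: np.ndarray) -> dict:
        # Divide the face region image into forehead, eye, and nose-to-chin
        # local regions, then scale each to a uniform 64x64x3. The eyebrow and
        # nose lines are rough estimates from the keypoint heights.
        ys = landmarks[:, 1]
        brow_y = max(int(ys.min()), 1)                 # topmost keypoint ~ eyebrow line
        nose_y = max(int(np.median(ys)), brow_y + 1)   # median height ~ nose line
        regions = {
            "forehead":     face_img[:brow_y, :],
            "eyes":         face_img[brow_y:nose_y, :],
            "nose_to_chin": face_img[nose_y:, :],
        }
        return {name: cv2.resize(img, (64, 64)) for name, img in regions.items()}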
It can be understood that each local region image is annotated with a true label, and the true label includes an occluder category. For example, if the local region image is a forehead region image, the occluder category may be bangs or hat; if it is an eye region image, the occluder category may be glasses or facial mask; and if it is a nose-to-chin region image, the occluder category may be mask or nose patch. It can be understood that if the local region image is unoccluded, its true label is no occlusion.
A local region image and its true label are taken as a sample pair; for example, when the true label of local region image 1# is mask, (local region image 1#, mask) is a sample pair. It can be understood that if the image sample set includes 400 images and 3 local region images are obtained from each image, 400 × 3 = 1200 sample pairs are obtained.
The at least one sample pair corresponding to each image in the image sample set is taken as the training set; in the above example, the 1200 sample pairs form the training set. The training set is input into the preset neural network for training until the iteration termination condition is met, at which point the face occlusion recognition model is obtained.
The preset neural network may be built in advance on an existing deep learning framework, for example the Keras framework. It can be understood that the preset neural network may directly call an existing algorithm network in the deep learning framework, for example MobileNetV1, MobileNetV2, or MobileNetV3, or may be obtained by modifying an existing network structure according to the training set and the requirements, for example by adding or removing layers or changing layer parameters, where the layers include, but are not limited to, the convolution layers, normalization layers, and activation function layers in the algorithm network.
When the training set is input into the preset neural network, the network learns the training set using its model parameters and outputs a prediction result. A prediction result is output at every learning iteration; during each iteration the preset neural network adjusts the model parameters, continually reducing the error between the output prediction and the true labels, that is, the training set error, and training stops when the iteration termination condition is met. The model parameters at that point are the optimal model parameters, and the preset neural network with the optimal model parameters is the trained face occlusion recognition model. In some embodiments, the iteration termination condition may be reaching an iteration count threshold: training stops when the threshold is reached, and the model parameters at that time are taken as the optimal model parameters to obtain the face occlusion recognition model. It can be understood that in some embodiments the iteration termination condition may instead be that the training set error only fluctuates within a preset range, and the model parameters at that time are taken as the optimal model parameters to obtain the face occlusion recognition model. A minimal training sketch follows.
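A minimal sketch of this training procedure under the Keras framework mentioned above, assuming `model` is the preset neural network (for example, built as in the sketches later in this description) and `train_images`/`train_labels` hold the training-set sample pairs; the hyperparameters are illustrative:

    import tensorflow as tf

    # Compile with a loss that measures the error between the predicted-label
    # probability distribution and the true labels.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # Iteration termination: stop at an epoch threshold (epochs=100), or earlier
    # once the training error only fluctuates within a preset range.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss",
                                                  min_delta=1e-4, patience=5)
    model.fit(train_images, train_labels,
              epochs=100, batch_size=64, callbacks=[early_stop])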
As can be seen from the above, the images in the training set are all local region images, so the preset neural network learns local region features. Compared with learning features of the whole image, learning from local region images allows the network to better learn the local features of a specific region while excluding interference from other regions; for example, when the network learns the features of the forehead region, it is not interfered with by the eye region or the nose-to-chin region. The preset neural network therefore converges quickly to the face occlusion recognition model, and the classification accuracy of the trained model is improved. In addition, dividing the image into local regions reduces the image size, which speeds up computation during both model training and model prediction.
To further lighten the network model while maintaining its accuracy, in this embodiment the preset neural network includes a feature extraction network that extracts features from the local region image and generates feature map data.
The feature extraction network includes a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, where each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer; the stride of the depthwise convolution layer in the first M depthwise separable convolution layers is a preset value greater than 1, where M is less than or equal to N.
It can be understood that, as shown in fig. 4, the local region image is first input to the standard convolution layer, and the output of each layer in the feature extraction network is the input of the next layer, until the feature map output by the last layer is the feature map finally output by the feature extraction network.
The depthwise separable convolution layer reduces both the parameter count and the computation, so the preset neural network is lightweight and fast to train.
Specifically, the standard convolution layer includes standard convolution kernels whose parameters are DK1 × DK1 × M1 × N1, where DK1 × DK1 is the kernel size, M1 is the number of input channels, and N1 is the number of output channels (that is, the number of kernels); the number of input channels M1 must match the number of image channels of the input image. As shown in fig. 5(a), during a standard convolution operation one kernel convolves all image channels of the input image simultaneously, and the results are weighted and summed into one feature map, so the number of output feature maps equals the number of output channels. For example, if the input image has dimensions DF1 × DF1 × M1, where M1 is the number of image channels, and the standard kernel parameters are DK1 × DK1 × M1 × N1, then the output after convolution has dimensions DT1 × DT1 × N1, that is, N1 feature maps of size DT1 × DT1. During the convolution, each standard kernel performs DT1 × DT1 convolution operations on every image channel, each followed by a weighted summation across channels. Thus the parameter count of the standard convolution layer is DK1 × DK1 × M1 × N1, and its computation is N1 × DF1 × DF1 × M1 × DK1 × DK1.
The depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer. Specifically, the depthwise convolution layer includes depthwise convolution kernels whose parameters are DK2 × DK2 × 1 × M2, where DK2 × DK2 is the kernel size and M2 is the number of output channels (the number of kernels); M2 equals the number of image channels M of an input image DF × DF × M. It can be understood that the input images of the standard convolution layer and the depthwise convolution layer are not the same size; the input DF × DF × M is used here only to illustrate the computation of the different kinds of convolution layers. As shown in fig. 5(b), during the convolution operation one depthwise kernel convolves only one image channel of the input image to obtain one output feature map, so the number of output feature maps equals the number of output channels M2 (that is, the number of image channels of the input image). For example, if the input image has dimensions DF2 × DF2 × M2 and the depthwise kernel parameters are DK2 × DK2 × 1 × M2, then the output feature maps after convolution have dimensions DT2 × DT2 × M2, that is, M2 feature maps of size DT2 × DT2. During the convolution each depthwise kernel convolves only one image channel, yielding the M2 output feature maps. Thus the parameter count of the depthwise convolution layer is DK2 × DK2 × M2, and its computation is M2 × DF2 × DF2 × DK2 × DK2.
The pointwise convolution layer includes pointwise convolution kernels whose parameters are 1 × 1 × M3 × N3, where 1 × 1 is the kernel size, M3 is the number of input channels, and N3 is the number of output channels. As shown in fig. 5(c), the pointwise convolution layer has the same structure and the same convolution computation as the standard convolution layer; the only difference is that the kernel size is 1 × 1. Thus the parameter count of the pointwise convolution layer is 1 × 1 × M3 × N3, and its computation is N3 × DF3 × DF3 × M3.
Given the structures of the standard convolution layer and the depthwise separable convolution layer above: because each depthwise kernel convolves only one image channel to produce one output feature map, no weighted summation over multiple image channels is needed, and because the pointwise kernel size is 1 × 1, the depthwise separable convolution layer reduces both the parameter count and the computation, making the preset neural network lightweight and fast to train. A small numeric sketch of this comparison follows.
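A small sketch reproducing the parameter and multiplication counts derived above, for an input feature map of size DF × DF with M input channels, N output channels, and a DK × DK kernel (the concrete numbers are examples):

    def standard_conv_cost(DF, M, N, DK):
        # Standard convolution: DK*DK*M*N parameters,
        # N*DF*DF*M*DK*DK multiplications.
        return DK * DK * M * N, N * DF * DF * M * DK * DK

    def depthwise_separable_cost(DF, M, N, DK):
        # Depthwise: one DK x DK filter per input channel; pointwise: 1x1xMxN.
        params = DK * DK * M + M * N
        mults = M * DF * DF * DK * DK + N * DF * DF * M
        return params, mults

    std = standard_conv_cost(DF=64, M=32, N=64, DK=3)
    sep = depthwise_separable_cost(DF=64, M=32, N=64, DK=3)
    print(std, sep, sep[1] / std[1])  # cost ratio ~ 1/N + 1/DK**2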
To further reduce the computation of the network and increase the training speed, in this embodiment the stride of the depthwise convolution layer in the first M of the N depthwise separable convolution layers is set to a preset value greater than 1.
For example, M may be 3, that is, the strides of the depthwise convolution layers in the first three depthwise separable convolution layers 1#, 2#, and 3# are set to the preset value. In some embodiments, referring again to fig. 4, the preset value may be 2, that is, the strides of depthwise separable convolution layers 1#, 2#, and 3# are all 2. Because the kernel moves across the input feature map by more than one position per step, the feature maps output by the first M depthwise separable convolution layers are smaller than their input feature maps; that is, the feature map resolution decreases quickly, which reduces the computation of the network structure.
This is equivalent to the preset network structure performing M downsampling steps with depthwise separable convolution layers whose stride is greater than 1, which helps enlarge the receptive field of the network and emphasizes invariance over the feature space, so the finally generated feature maps have low resolution, a large receptive field, and spatial invariance, improving the accuracy of the classification network model.
In this embodiment, because the depthwise separable convolution layers have few parameters and require little computation, the parameter count and computation of the feature extraction network, and thus of the whole preset neural network, are effectively reduced, making the preset neural network lightweight and fast to train. In addition, the preset network structure uses M depthwise separable convolution layers with a stride greater than 1 for downsampling, which helps enlarge the receptive field of the network and emphasizes invariance of the feature space; the finally generated feature maps therefore have low resolution, a large receptive field, and spatial invariance, and the accuracy of the trained face occlusion recognition model can be improved. A Keras-style sketch of such a feature extraction network follows.
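A Keras-style sketch of the feature extraction network described above; the values of N, M, the channel widths, and the 64 × 64 × 3 input size are illustrative assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers

    def depthwise_separable(x, filters, stride):
        # One depthwise separable convolution layer: depthwise convolution
        # followed by a 1x1 pointwise convolution, each with BN + ReLU.
        x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(filters, 1, padding="same")(x)  # pointwise
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

    def build_feature_extractor(N=6, M=3):
        # Standard convolution followed by N depthwise separable layers;
        # the first M depthwise convolutions use stride 2 (> 1) to downsample.
        inp = layers.Input(shape=(64, 64, 3))
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
        filters = 32
        for i in range(N):
            stride = 2 if i < M else 1
            filters = min(filters * 2, 256)
            x = depthwise_separable(x, filters, stride)
        return tf.keras.Model(inp, x)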
In summary, in the method of training a face occlusion recognition model provided by the embodiments of the present invention, the images used for training are all local region images. Compared with learning features of the whole image, learning from local region images allows the preset neural network to better learn the local features of specific image regions while excluding interference from other regions, so the network converges quickly to the face occlusion recognition model and the classification accuracy of the trained model is improved. Second, dividing the image into local regions reduces the image size, which speeds up computation during both model training and model prediction. In addition, the preset neural network includes a feature extraction network comprising a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, where each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer, and the stride of the depthwise convolution layer in the first M of the N depthwise separable convolution layers is a preset value greater than 1, with M less than or equal to N. Because depthwise separable convolution layers have few parameters and require little computation, the parameter count and computation of the feature extraction network, and thus of the whole preset neural network, are effectively reduced, making the network lightweight and fast to train; downsampling with stride-greater-than-1 depthwise separable convolution layers enlarges the receptive field of the network and emphasizes spatial invariance of the features, so the finally generated feature maps have low resolution, a large receptive field, and spatial invariance, which improves the accuracy of the trained face occlusion recognition model.
It can be understood that the standard convolution layer, the depthwise convolution layer, and the pointwise convolution layer are each provided with a normalization layer and an activation function layer. The normalization layer can be implemented with the existing Batch Normalization algorithm to normalize the data, so that the data input to the next layer have a mean of 0 and a variance of 1; this improves the generalization ability of the network on one hand and the training speed of the network on the other. The activation function layer can be implemented with the existing ReLU function to increase the nonlinearity of the model and overcome the vanishing-gradient problem.
However, the ReLU activation turns negative values directly into 0, and this operation loses much feature space information, which is unfavorable both for fitting the network model and for its feature expression. To address this problem, in some embodiments the depthwise separable convolution layer further includes a first linear convolution layer and a second linear convolution layer, and the size of each convolution kernel in both linear convolution layers is 1 × 1.
As shown in fig. 6, the first linear convolution layer is located between the depthwise convolution layer and the pointwise convolution layer, and the second linear convolution layer is located after the pointwise convolution layer. Within such a depthwise separable convolution layer, the feature map data output by the depthwise convolution layer is normalized by a normalization layer and then activated by the activation function; the activated data is input to the first linear convolution layer for convolution and then passes in sequence through the pointwise convolution layer, a normalization layer, an activation function layer, and the second linear convolution layer. A 1 × 1 convolution kernel does not change the feature map size, and it can increase the nonlinear capacity and the depth of the model.
The number of convolution kernels in the first linear convolution layer is a preset multiple, greater than 1, of the number of convolution kernels in the depthwise convolution layer. For example, setting the number of kernels in the first linear convolution layer to twice that of the depthwise convolution layer guarantees redundancy of information, so that after the next activation by the activation function layer the diversity of the feature space is preserved, compensating for the feature space information lost by the activation.
The number of convolution kernels in the second linear convolution layer is the same as that of the pointwise convolution layer, which increases the nonlinear capacity of the model and makes it accurate.
In this embodiment, adding the first linear convolution layer and the second linear convolution layer to the depthwise separable convolution layer guarantees redundancy of information and preserves the diversity of the feature space after activation, compensating for the feature space information lost by the activation function layer, while also increasing the nonlinear capacity of the model and making it accurate. A sketch of this variant block follows.
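A sketch of this variant block under the same Keras assumptions as above; the expansion factor of 2 is the example multiple from the text, and the channel widths are illustrative:

    from tensorflow.keras import layers

    def depthwise_separable_with_linear(x, out_channels, stride, expand=2):
        # Depthwise conv -> BN -> ReLU -> first linear 1x1 conv (expand times
        # the depthwise kernel count, no activation) -> pointwise conv -> BN ->
        # ReLU -> second linear 1x1 conv (same channel count as the pointwise
        # layer, no activation).
        in_channels = x.shape[-1]
        x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(in_channels * expand, 1)(x)  # first linear 1x1
        x = layers.Conv2D(out_channels, 1)(x)          # pointwise convolution
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        return layers.Conv2D(out_channels, 1)(x)       # second linear 1x1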
It can be understood that, referring again to fig. 4, the preset neural network further includes a fully connected layer and a softmax layer. Each feature map output by the last layer of the feature extraction network is input to the fully connected layer for integration, and a one-dimensional vector is output whose dimension equals the number of neurons in the fully connected layer. It can be understood that in the fully connected layer each neuron is configured with a convolution kernel whose size is the same as that of the input feature map and whose number of input channels matches the number of channels of the input feature maps. The number of neurons is determined by the number of occluder categories in the training set; for example, it is 6 when the categories include the six occluders mask, hat, glasses, bangs, facial mask, and nose patch.
For example, with those six occluder categories, if the last layer of the feature extraction network outputs 12 × 12 × 20 feature maps (20 image channels, that is, 20 feature maps), the fully connected layer includes 6 neurons, and the convolution kernel configured for each neuron is 12 × 12 × 20 (20 input channels). Each 12 × 12 × 20 kernel convolves the 20 feature maps and the results are weighted and summed into one value; that is, each neuron outputs one value, and the 6 neurons yield a 1 × 6 vector.
The vector output by the fully connected layer is then input to the softmax layer, which contains a softmax function. After the vector passes through the softmax function, a one-dimensional predicted-label probability distribution is output whose dimension equals the number of neurons; it represents the probability that the local region image belongs to each category. Those skilled in the art understand that the softmax function is an existing function, so its formula is not repeated here. For example, the 1 × 6 vector in the example above yields, after the softmax function, a 1 × 6 predicted-label probability distribution giving the probabilities of the mask, hat, glasses, bangs, facial mask, and nose patch categories respectively; the six probability values sum to 1. For example, the predicted-label probability distribution [0.7, 0.1, 0.08, 0.06, 0.04, 0.02] indicates a mask probability of 0.7, hat 0.1, glasses 0.08, bangs 0.06, facial mask 0.04, and nose patch 0.02. It can be understood that the predicted label output by the preset neural network is this predicted-label probability distribution. A sketch of this classification head follows.
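A sketch of this head, attached to the feature extractor sketched above; implementing each neuron's full-size kernel as a Dense layer over the flattened feature maps is an equivalent formulation, and the class count of 6 follows the example:

    from tensorflow.keras import layers, Model

    def attach_head(feature_extractor, num_classes=6):
        # One fully connected neuron per occluder category integrates all
        # feature maps into one value; softmax turns the resulting 1x6 vector
        # into the predicted-label probability distribution (summing to 1).
        x = layers.Flatten()(feature_extractor.output)
        x = layers.Dense(num_classes)(x)
        out = layers.Softmax()(x)
        return Model(feature_extractor.input, out)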
One-hot encoding is used to represent the true labels; for example, the true label [1, 0, 0, 0, 0, 0] represents the mask category. When the loss is computed, only the category with a true probability of 1 participates in the loss calculation, while the categories with a true probability of 0 do not, so the relationship between the true category and the other categories is ignored; the trained model is therefore overconfident, overfits, and generalizes poorly.
To solve this problem caused by representing true labels with one-hot encoding, in some embodiments the method S20 further includes:
S25: smoothing the true label of each sample pair to obtain smoothed true labels, so that the smoothed true labels participate in training the preset neural network.
The smoothing adds noise to the true label; the noise is a random positive or negative value, and the true label and the noise are summed to obtain the smoothed true label. The relationship among the true label, the predicted-label probability distribution, and the smoothed true label is illustrated in table 1 below:
TABLE 1

Label name              Mask    Hat     Glasses  Bangs   Facial mask  Nose patch
Predicted probability   0.7     0.1     0.08     0.06    0.04         0.02
True label              1       0       0        0       0            0
Smoothed true label     0.75    0.05    0.05     0.05    0.05         0.05
As can be seen from table 1 above, no category probability in the smoothed true label is an absolute 0 or 1. When the smoothed true labels participate in training the preset neural network and the loss is computed, the probability of every category contributes to the loss calculation, so the preset neural network can learn the relationship between the true category and the other categories. This alleviates the model's overconfidence to some extent, makes the clusters of each class more compact, increases the inter-class distance, and decreases the intra-class distance, thereby improving the generalization ability of the model.
In some embodiments, step S25 specifically includes:
smoothing a target true label according to the following formula to obtain a smoothed target true label, wherein the target true label is any true label:

ŷ_k = y_k × (1 − α) + α / K

where k is the occluder category, ŷ_k is the probability of category k in the smoothed target true label, y_k is the probability of category k in the target true label (y_k equals 1 when k is the correct occluder category and 0 otherwise), α is a preset parameter value, and K is the total number of occluder categories in the training set.
The parameter value α is an empirical value and may be 0.1. K is the total number of occluder categories; in the embodiment above with the six categories mask, hat, glasses, bangs, facial mask, and nose patch, K = 6. If the target true label is [1, 0, 0, 0, 0, 0], then when k is the mask category, y_k equals 1 and ŷ_k = 1 × (1 − 0.1) + 0.1/6 = 0.917; when k is the hat, glasses, bangs, facial mask, or nose patch category, ŷ_k = 0 × (1 − 0.1) + 0.1/6 = 0.017. The smoothed form of the target true label [1, 0, 0, 0, 0, 0] is therefore [0.917, 0.017, 0.017, 0.017, 0.017, 0.017].
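A direct sketch of this formula in NumPy; the one-hot example is the mask label from above:

    import numpy as np

    def smooth_labels(y: np.ndarray, alpha: float = 0.1) -> np.ndarray:
        # Label smoothing per the formula above: y_hat_k = y_k*(1-alpha) + alpha/K.
        K = y.shape[-1]
        return y * (1.0 - alpha) + alpha / K

    y = np.array([1, 0, 0, 0, 0, 0], dtype=float)  # one-hot "mask" true label
    print(smooth_labels(y))  # -> approximately [0.917, 0.017, 0.017, 0.017, 0.017, 0.017]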
In this embodiment, smoothing the true labels with the above formula adds noise to the true labels, which constrains the model and reduces overfitting; the clusters of each class become more compact, the inter-class distance increases, and the intra-class distance decreases, thereby improving the generalization ability of the model.
In some embodiments, before step S24, the method further includes:
S26: performing data augmentation on the training set.
The data augmentation includes flipping, rotating, translating, scaling, adding noise to, or adjusting the brightness of some of the local region images in the training set to generate new local region images. This enlarges the training set, enhances sample diversity, and improves the generalization ability of the model. A sketch of this step follows.
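A sketch of this augmentation step using the Keras framework named above, assuming the `train_images`/`train_labels` arrays from the training sketch; the parameter values are illustrative assumptions:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        horizontal_flip=True,         # flipping
        rotation_range=15,            # rotation (degrees)
        width_shift_range=0.1,        # translation
        height_shift_range=0.1,
        zoom_range=0.1,               # scaling
        brightness_range=(0.8, 1.2),  # brightness adjustment
    )
    # Yields augmented batches of (local region image, label) pairs for training.
    train_flow = augmenter.flow(train_images, train_labels, batch_size=64)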
In summary, in the method of training a face occlusion recognition model according to the embodiments of the present invention, a face region image is obtained by cropping the face area of each image in the image sample set, and each face region image is then divided into at least one local region image, for example a forehead region, an eye region, a nose region, or a chin region. Each local region image is annotated with a true label that includes an occluder category, for example mask, glasses, facial mask, bangs, hat, or nose patch. A local region image and its true label form a sample pair, so the at least one sample pair corresponding to each image in the image sample set can be taken as the training set; the training set is input into the preset neural network for training, and training stops when the iteration termination condition is met, yielding the face occlusion recognition model. The images used for training are thus all local region images, so the preset neural network learns local region features; compared with learning features of the whole image, this allows the network to better learn the local features of specific regions while excluding interference from other regions, so the network converges quickly and the classification accuracy of the trained face occlusion recognition model is improved. Second, dividing the image into local regions reduces the image size, which speeds up computation during both model training and model prediction. In addition, the preset neural network includes a feature extraction network comprising a standard convolution layer and N depthwise separable convolution layers arranged layer by layer, where each depthwise separable convolution layer includes a depthwise convolution layer and a pointwise convolution layer arranged layer by layer, and the stride of the depthwise convolution layer in the first M of the N depthwise separable convolution layers is a preset value greater than 1, with M less than or equal to N. Because depthwise separable convolution layers have few parameters and require little computation, the parameter count and computation of the feature extraction network, and thus of the whole preset neural network, are effectively reduced, making the network lightweight and fast to train; downsampling with depthwise separable convolution layers whose stride is greater than 1 enlarges the receptive field of the network and emphasizes invariance of the feature space, so the finally generated feature maps have low resolution, a large receptive field, and spatial invariance, which improves the accuracy of the trained face occlusion recognition model.
Referring to fig. 7, the method S30 includes, but is not limited to, the following steps:
S31: acquiring an image to be detected, wherein the image to be detected includes a human face.
S32: intercepting the face area of the image to be detected to generate a face area image to be detected.
S33: dividing the face area image to be detected into at least one local area image to be detected.
S34: inputting the at least one local area image to be detected into the face shielding recognition model of any of the above embodiments, wherein the face shielding recognition model outputs the shielding object type of each local area image to be detected.
S35: determining the shielding condition of the image to be detected according to the shielding object type of each local area image to be detected.
The image to be detected includes a human face and can be acquired by an image acquisition device; for example, it can be an ID photo or a selfie captured by the image acquisition device.
It will be appreciated that the image to be detected includes both a face and a background, and the face is the target area for shielding object detection. To reduce interference from the background and shorten the recognition time, the face area of the image to be detected is cut out to generate the face area image to be detected. Specifically, a number of key points of the human face, covering areas such as the eyebrows, eyes, nose, mouth, and face contour, can be located by an existing face key point algorithm; the face area is then cut out according to the face contour to generate the face area image to be detected.
The existing face key point algorithm may be Active Appearance Models (AAM), Constrained Local Models (CLM), Explicit Shape Regression (ESR), Supervised Descent Method (SDM), or the like.
The face area image to be detected is divided into at least one local area image to be detected, where a local area image to be detected is an image of a local area of the face area image to be detected that exhibits geometric features; for example, the eye area, forehead area, nose area, and mouth area are such local areas. Accordingly, the local area image to be detected includes any one or more of the following area images: a forehead area image, an eye area image, and a nose-chin area image, which can be set according to the recognition requirements.
It can be understood that the division into local area images to be detected can likewise use the above face key point algorithm: key points of areas such as the eyes, nose, mouth, and face contour are located, and each local area image to be detected is then cut out according to the coordinate information of those key points, as sketched below.
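As one hedged possibility (the patent does not name a specific key point algorithm), the cropping could use dlib's 68-point landmark model; the model file name, the landmark index ranges, and the padding value are assumptions of this sketch:

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def crop_local_regions(image):
        """Cut local area images out of a BGR face image using the standard
        68-point scheme (jaw 0-16, brows 17-26, nose 27-35, eyes 36-47,
        mouth 48-67)."""
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        face = detector(gray)[0]                      # assume a single face
        pts = [(p.x, p.y) for p in predictor(gray, face).parts()]

        def crop(idx, pad=10):
            xs = [pts[i][0] for i in idx]
            ys = [pts[i][1] for i in idx]
            return image[max(min(ys) - pad, 0):max(ys) + pad,
                         max(min(xs) - pad, 0):max(xs) + pad]

        # The 68-point scheme has no forehead points, so the forehead is
        # approximated as the band between the detection box top and the brows.
        brow_top = min(pts[i][1] for i in range(17, 27))
        return {
            "forehead": image[max(face.top(), 0):brow_top,
                              max(face.left(), 0):face.right()],
            "eye": crop(range(17, 48)),                            # brows + eyes
            "nose_chin": crop(list(range(27, 68)) + list(range(6, 11))),
        }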
The face shielding recognition model outputs the shielding object type of each local area image to be detected, from which the shielding condition of the image to be detected can be determined; that is, the shielding condition consists of the shielding object types output by the face shielding recognition model for the local area images to be detected. A minimal sketch of this recognition flow follows.
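Putting steps S32-S35 together, the recognition flow might look as follows; crop_local_regions is the hypothetical helper sketched above, preprocess stands for an assumed resize-and-normalize transform, model stands for the trained face shielding recognition model, and the category list is illustrative:

    import torch

    CATEGORIES = ["no shielding object", "mask", "glasses",
                  "bangs", "hat", "nose patch"]        # illustrative labels

    def detect_occlusion(image, model, preprocess):
        model.eval()
        shielding_condition = {}
        for name, region in crop_local_regions(image).items():    # S32-S33
            x = preprocess(region).unsqueeze(0)                    # NCHW batch of 1
            with torch.no_grad():
                probs = torch.softmax(model(x), dim=1)             # S34
            shielding_condition[name] = CATEGORIES[int(probs.argmax(dim=1))]  # S35
        return shielding_condition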
It can be understood that the face shielding recognition model here is obtained by the training method of the above embodiments and has the same structure and function as described there, so it is not described again in detail.
To further verify the accuracy of shielding recognition and reduce false positives of the model, in some embodiments the shielding condition output by the model is logically checked. Specifically, referring to fig. 8, the method S30 further includes:
S36: acquiring the region attribute of the target local area image to be detected, wherein the region attribute reflects the facial geometric features included in the target local area image to be detected, and the target local area image to be detected is any one of the local area images to be detected.
S37: judging whether the region attribute matches the shielding object type of the target local area image to be detected.
S38: if not, determining that the target local area image to be detected contains no shielding object.
For any local area image to be detected, namely the target local area image to be detected, its region attribute is acquired. The region attribute reflects the facial geometric features included in the target local area image to be detected: for example, if the target local area image to be detected is a forehead area image, the region attribute is the forehead; if it is an eye area image, the region attribute is the eyes; and if it is a nose-chin area image, the region attribute is the mouth.
It can be understood that shielding object types correspond to region attributes; for example, the shielding object in the forehead area cannot be glasses, a nose patch, a mask, or the like, and the shielding object in the eye area cannot be a mask, a nose patch, or the like.
Therefore, after the model outputs the shielding object type of the target local area image to be detected, whether the region attribute matches the shielding object type is further judged in order to verify the correctness of the output. If they do not match, for example if the region attribute of the target local area image to be detected is the forehead while the shielding object type is a mask, the target local area image to be detected is determined to contain no shielding object.
In this embodiment, logically checking the shielding condition output by the model reduces misjudgment by the face shielding recognition model, so that its output better conforms to objective facts. A hedged sketch of this check follows.
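The check in S36-S38 can be a simple lookup; the mapping from region attributes to plausible shielding object types below is an illustrative assumption, not an enumeration from the patent:

    ALLOWED = {
        "forehead": {"bangs", "hat"},
        "eye": {"glasses", "bangs"},
        "nose_chin": {"mask", "nose patch"},
    }

    def verify(region_attr, shielding_type):
        """If the predicted shielding object cannot occur in a region with
        this attribute, treat the region as having no shielding object."""
        if (shielding_type != "no shielding object"
                and shielding_type not in ALLOWED.get(region_attr, set())):
            return "no shielding object"
        return shielding_type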
Another embodiment of the present invention also provides a non-transitory computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the above method of training a face shielding recognition model or the above method of recognizing face shielding.
It should be noted that the above-described apparatus embodiments are merely illustrative: the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a general-purpose hardware platform, or by hardware alone. Those skilled in the art will appreciate that all or part of the processes of the above methods may be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. The technical features of the above embodiments, or of different embodiments, may be combined within the idea of the invention, and the steps may be implemented in any order; many other variations of the different aspects of the invention exist, which are not provided in detail for the sake of brevity. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A method of training a facial occlusion recognition model, comprising:
acquiring an image sample set, wherein each image in the image sample set comprises a human face;
intercepting a face area of a target image to generate a face area image, wherein the target image is any image in the image sample set;
dividing the face region image into at least one local region image, wherein each local region image is marked with a real label, one local region image and the real label marked by the local region image serve as a sample pair, and the real label comprises a shielding object type;
At least one sample pair corresponding to each image in the image sample set is used as a training set, the training set is input into a preset neural network for training, and training is stopped until iteration termination conditions are met, so that a face shielding recognition model is obtained;
the preset neural network comprises a feature extraction network, wherein the feature extraction network comprises a common convolution layer and N depth separable convolution layers which are arranged layer by layer, and one depth separable convolution layer comprises a depth convolution layer and a point convolution layer which are arranged layer by layer;
the step length of the depth convolution layer in the first M depth separable convolution layers is a preset value, and the preset value is larger than 1, wherein M is smaller than or equal to N.
2. The method of claim 1, wherein the depth separable convolution layer further comprises a first linear convolution layer and a second linear convolution layer, each convolution kernel in the first linear convolution layer having a size of 1×1, and each convolution kernel in the second linear convolution layer having a size of 1×1;
wherein the first linear convolution layer is located between the depth convolution layer and the point-wise convolution layer, and the second linear convolution layer is located after the point-wise convolution layer;
The number of convolution kernels in the first linear convolution layer is a preset multiple of the number of convolution kernels in the depth convolution layer, and the preset multiple is larger than 1;
the number of convolution kernels in the second linear convolution layer is the same as the number of convolution kernels in the point-by-point convolution layer.
3. The method according to claim 1, wherein the method further comprises:
and carrying out smoothing processing on the real labels of each sample pair to obtain each smoothed real label so as to enable each smoothed real label to participate in training of the preset neural network, wherein the smoothing processing is adding noise into the real labels.
4. A method according to claim 3, wherein the step of smoothing the real labels of each of the pairs of samples to obtain smoothed real labels comprises:
smoothing the target real label according to the following formula to obtain a smoothed target real label, wherein the target real label is any real label:

y′_k = (1 − α) · y_k + α / K

wherein k is an occlusion category, y′_k is the probability of category k in the smoothed target real label, y_k is the probability of category k in the target real label, y_k is equal to 1 when the occlusion category k is correctly classified and equal to 0 when the occlusion category k is misclassified, α is a preset parameter value, and K is the total number of occlusion categories in the training set.
5. The method of claim 1, further comprising, prior to the step of inputting the training set into a predetermined neural network for training:
and carrying out data enhancement processing on the training set.
6. The method of claim 1, wherein the partial region image is an image of a partial region in the face region image that exhibits geometric features.
7. A method of identifying facial occlusion, comprising:
acquiring an image to be detected, wherein the image to be detected comprises a human face;
intercepting a face area of the image to be detected to generate a face area image to be detected;
dividing the face region image to be detected into at least one local region image to be detected;
inputting the at least one partial region image to be detected into the face shielding recognition model according to any one of claims 1-6, wherein the face shielding recognition model outputs shielding object types of the partial region images to be detected;
And determining the shielding condition of the image to be detected according to the shielding object type of each image of the local area to be detected.
8. The method of claim 7, wherein the method further comprises:
obtaining region attributes of a target to-be-detected local region image, wherein the region attributes reflect facial geometric features included in the target to-be-detected local region image, and the target to-be-detected local region image is any to-be-detected local region image;
judging whether the region attribute is matched with the type of the shielding object of the target to-be-detected local region image or not;
if not, determining the target to-be-detected local area image as a non-shielding object.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the method of any one of claims 1-8.
CN202110711009.3A 2021-06-25 2021-06-25 Method and related device for training face shielding recognition model Active CN113536965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110711009.3A CN113536965B (en) 2021-06-25 2021-06-25 Method and related device for training face shielding recognition model


Publications (2)

Publication Number Publication Date
CN113536965A CN113536965A (en) 2021-10-22
CN113536965B true CN113536965B (en) 2024-04-09

Family

ID=78096756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110711009.3A Active CN113536965B (en) 2021-06-25 2021-06-25 Method and related device for training face shielding recognition model

Country Status (1)

Country Link
CN (1) CN113536965B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475357B (en) * 2023-12-27 2024-03-26 北京智汇云舟科技有限公司 Monitoring video image shielding detection method and system based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Pedestrian's recognition methods again is blocked based on the study of adaptive depth measure
CN111611874A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Face mask wearing detection method based on ResNet and Canny
CN112232231A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Pedestrian attribute identification method, system, computer device and storage medium
CN112949565A (en) * 2021-03-25 2021-06-11 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism
WO2021114866A1 (en) * 2019-12-13 2021-06-17 苏州科达科技股份有限公司 Method and apparatus for detecting occluded image, electronic device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685262B2 (en) * 2015-03-20 2020-06-16 Intel Corporation Object recognition based on boosting binary convolutional neural network features


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yan Pengcheng; Zhang Yiming; Tong Guanghong; Huang Feng; Ou Xianfeng. Face recognition method for video surveillance based on convolutional neural network. Journal of Chengdu Technological University. 2020, (01), full text. *
Zou Jiancheng; Cao Xiuling. A facial expression recognition method based on an improved convolutional neural network. Journal of North China University of Technology. 2020, (02), full text. *

Also Published As

Publication number Publication date
CN113536965A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
US10943126B2 (en) Method and apparatus for processing video stream
US20180204094A1 (en) Image recognition method and apparatus
US11816880B2 (en) Face recognition method and apparatus, computer device, and storage medium
WO2020199611A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
KR20210100602A (en) Face image-based risk recognition method, apparatus, computer device and storage medium
US20170140210A1 (en) Image processing apparatus and image processing method
EP4099217A1 (en) Image processing model training method and apparatus, device, and storage medium
US10692089B2 (en) User classification using a deep forest network
CN111368672A (en) Construction method and device for genetic disease facial recognition model
US20210382542A1 (en) Screen wakeup method and apparatus
CN110991380A (en) Human body attribute identification method and device, electronic equipment and storage medium
CN111985458B (en) Method for detecting multiple targets, electronic equipment and storage medium
CN114332994A (en) Method for training age prediction model, age detection method and related device
CN113657195A (en) Face image recognition method, face image recognition equipment, electronic device and storage medium
CN114677730A (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
US11335128B2 (en) Methods and systems for evaluating a face recognition system using a face mountable device
CN116863522A (en) Acne grading method, device, equipment and medium
CN113536965B (en) Method and related device for training face shielding recognition model
CN113269010B (en) Training method and related device for human face living body detection model
CN112200109A (en) Face attribute recognition method, electronic device, and computer-readable storage medium
CN110163049B (en) Face attribute prediction method, device and storage medium
RU2768797C1 (en) Method and system for determining synthetically modified face images on video
CN113221695B (en) Method for training skin color recognition model, method for recognizing skin color and related device
CN111191675B (en) Pedestrian attribute identification model realization method and related device
CN109544515B (en) Trend determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant