CN111950496B - Mask person identity recognition method - Google Patents
- Publication number
- CN111950496B (application CN202010843398A)
- Authority
- CN
- China
- Prior art keywords
- image
- features
- mask
- person
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The application discloses a masked-person identity recognition method, comprising the following steps: inputting a to-be-identified masked-person image into a preset region segmentation network for segmentation processing to obtain a masked-person region image; inputting the masked-person region image into a preset encoding-decoding model for feature separation, deleting the extracted appearance features, and outputting the typical features and pose features of the masked-person region image; averaging the typical features of the masked-person region image to obtain static gait features, and inputting the pose features into a preset LSTM network for processing to obtain dynamic gait features; and inputting the static gait features and the dynamic gait features into a preset classifier for classification and recognition to obtain an identity recognition result for the to-be-identified masked-person image. This solves the technical problem that existing face recognition methods struggle to identify the identity of a masked person.
Description
Technical Field
The application relates to the technical field of identity recognition, and in particular to a masked-person identity recognition method.
Background
In the prior art, identity recognition is usually performed through face recognition. However, when facing a masked person whose facial features are blocked by sunglasses, a hat, a mask, or the like, face recognition methods cannot accurately extract the person's facial features, so the person's identity cannot be recognized. Identification is even more difficult when the capture device has low resolution.
Disclosure of Invention
The application provides a masked-person identity recognition method, which is used to solve the technical problem that existing face recognition methods struggle to identify the identity of a masked person.
In view of this, a first aspect of the present application provides a masked-person identity recognition method, including:
inputting a to-be-identified masked-person image into a preset region segmentation network for segmentation processing to obtain a masked-person region image;
inputting the masked-person region image into a preset encoding-decoding model for feature separation, deleting the extracted appearance features, and outputting typical features and pose features of the masked-person region image;
averaging the typical features of the masked-person region image to obtain static gait features, and inputting the pose features into a preset LSTM network for processing to obtain dynamic gait features;
and inputting the static gait features and the dynamic gait features into a preset classifier for classification and recognition to obtain an identity recognition result for the to-be-identified masked-person image.
Optionally, before the step of inputting the to-be-identified masked-person image into the preset region segmentation network for segmentation processing to obtain the masked-person region image, the method further includes:
inputting the to-be-identified masked-person image into a preset super-resolution network for processing, and outputting a super-resolution masked-person image;
correspondingly, the step of inputting the to-be-identified masked-person image into the preset region segmentation network for segmentation processing to obtain the masked-person region image includes:
inputting the super-resolution masked-person image into the preset region segmentation network for segmentation processing to obtain the masked-person region image.
Optionally, the preset super-resolution network includes a first convolution module and a second convolution module;
the first convolution module includes six convolution layers and one sub-pixel convolution layer, and is used to increase the pixel resolution of the to-be-identified masked-person image in the length and width directions;
the second convolution module includes four convolution layers and one sub-pixel convolution layer, and is used to increase the pixel resolution of the to-be-identified masked-person image in the height direction.
Optionally, the configuration process of the preset encoding-decoding model includes:
framing the acquired video data to obtain training sample images;
sequentially inputting the training sample images into an encoding-decoding network, so that an encoder in the encoding-decoding network encodes each training sample image and outputs its appearance features, pose features and typical features, and a decoder in the encoding-decoding network reconstructs an image from the features output by the encoder and outputs the reconstructed image, wherein the training sample images are masked-person region images obtained by segmenting to-be-trained masked-person images;
based on the reconstructed images and the training sample images, separating the non-pose features of the training sample images through a cross-reconstruction loss function, separating the pose features of the training sample images through a pose similarity loss function, and separating the typical features of the training sample images from the non-pose features through a canonical consistency loss function, wherein the non-pose features comprise the appearance features and the typical features.
Optionally, the cross-reconstruction loss function is:

$$\mathcal{L}_{\text{cross}} = \left\| D\!\left(f_a^{t_1}, f_c^{t_1}, f_p^{t_2}\right) - X^{t_2} \right\|_2^2$$

wherein $t_1, t_2$ are different moments in the same video, $f_a$ is the appearance feature, $f_p$ is the pose feature, $f_c$ is the typical feature, $X^{t_2}$ is the training sample image at moment $t_2$, and $D(\cdot)$ is the decoding function.
Optionally, the pose similarity loss function is:

$$\mathcal{L}_{\text{pose-sim}} = \left\| \frac{1}{n_1}\sum_{t=1}^{n_1} f_p^{c_1,t} - \frac{1}{n_2}\sum_{t=1}^{n_2} f_p^{c_2,t} \right\|_2^2$$

wherein $n_1$ is the number of video frames in video scene $c_1$ and $n_2$ is the number of video frames in video scene $c_2$.
Optionally, the canonical consistency loss function is:

$$\mathcal{L}_{\text{cano}} = \mathbb{E}_{i,j}\left[ \left\| f_c^{c_1,t_i} - f_c^{c_1,t_j} \right\|_2^2 + \left\| f_c^{c_1,t_i} - f_c^{c_2,t_i} \right\|_2^2 \right]$$
optionally, the preset area dividing network is a trained Mask R-CNN network.
From the above technical solutions, the application has the following advantages:

The application provides a masked-person identity recognition method, comprising the following steps: inputting a to-be-identified masked-person image into a preset region segmentation network for segmentation processing to obtain a masked-person region image; inputting the masked-person region image into a preset encoding-decoding model for feature separation, deleting the extracted appearance features, and outputting the typical features and pose features of the masked-person region image; averaging the typical features of the masked-person region image to obtain static gait features, and inputting the pose features into a preset LSTM network for processing to obtain dynamic gait features; and inputting the static gait features and the dynamic gait features into a preset classifier for classification and recognition to obtain an identity recognition result for the to-be-identified masked-person image.

According to the masked-person identity recognition method of the application, the masked-person region and the background region are segmented by the preset region segmentation network to obtain the masked-person region image, reducing interference from the background and other factors. The typical features and pose features of the masked person are extracted to obtain gait features, and identity is recognized from the gait features; this avoids the problem that face recognition methods cannot accurately extract the facial features of a masked person and therefore cannot recognize the identity. In addition, a masked person may dress differently in different scenes, so the appearance features extracted by a convolutional neural network would differ across scenes, and letting these varying appearance features participate in identification would degrade the recognition result. To avoid the influence of such appearance changes, the appearance features are separated out and deleted by the preset encoding-decoding model, which improves the accuracy of masked-person identity recognition and thereby solves the technical problem that existing face recognition methods struggle to identify the identity of a masked person.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flowchart of a masked-person identity recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a masked-person identity recognition method according to another embodiment of the present application;
FIG. 3 is a schematic structural diagram of a super-resolution network according to an embodiment of the present application.
Detailed Description
The application provides a masked-person identity recognition method, which is used to solve the technical problem that existing face recognition methods struggle to identify the identity of a masked person.
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
For ease of understanding, referring to FIG. 1, an embodiment of the masked-person identity recognition method provided by the present application includes:

Step 101, inputting the to-be-identified masked-person image into a preset region segmentation network for segmentation processing to obtain a masked-person region image.

If the to-be-identified masked-person image were input directly into the preset encoding-decoding model for feature extraction, unnecessary background features would be extracted and would affect the identity recognition result. Therefore, in this embodiment, the preset region segmentation network is used to segment the to-be-identified masked-person image and separate the masked-person region from it, which reduces unnecessary features and improves the quality of the effective feature information. The to-be-identified masked-person image can be obtained by framing a video stream captured by a monitoring device.
Step 102, inputting the masked-person region image into a preset encoding-decoding model for feature separation, deleting the extracted appearance features, and outputting the typical features and pose features of the masked-person region image.

In different scenes, the appearance of a masked person differs (for example, the clothing), and in that case a trained classifier may produce a wrong recognition result because of the change in appearance features. For example, the same masked person may dress differently in different places, and the trained classifier may then identify that one person as two different identities. Therefore, in this embodiment, the input masked-person region image is separated by the preset encoding-decoding model into typical features, appearance features and pose features, and the appearance features are deleted so that they do not participate in subsequent recognition. Typical features are body-shape features such as height and arm length; pose features are the representation, in a particular frame, of the person's gait information while moving.
Further, the preset encoding-decoding model includes an encoder and a decoder, and its configuration process specifically includes:

framing the acquired video data to obtain to-be-trained masked-person images, segmenting the to-be-trained masked-person images to obtain masked-person region images, and taking these region images as training sample images; sequentially inputting the training sample images into an encoding-decoding network, so that an encoder in the network encodes each training sample image and outputs its appearance features, pose features and typical features, and a decoder in the network reconstructs an image from the features output by the encoder and outputs the reconstructed image; and, based on the reconstructed images and the training sample images, separating the non-pose features of the training sample images through a cross-reconstruction loss function, separating the pose features through a pose similarity loss function, and separating the typical features from the non-pose features through a canonical consistency loss function, wherein the non-pose features comprise the appearance features and the typical features.
It should be noted that the encoding-decoding network is also a typical CNN structure comprising convolution layers and a max-pooling layer, where each convolution layer is followed by a ReLU activation function and the last layer uses a Sigmoid activation function so that values in $[0, 1]$ are output for the next operation. First, the encoder $\varepsilon$ encodes an input masked-person region image $X$; the encoded feature representation is:

$$f_a, f_p, f_c = \varepsilon(X)$$

wherein $f_a$ is the appearance feature, $f_p$ is the pose feature, $f_c$ is the typical feature, and $\varepsilon(\cdot)$ is the encoding function.

To fully extract the features of the masked-person region image, a reconstructed image $\hat{X}$ approximating the original image $X$ is produced by the decoder $D$, where the reconstructed image can be expressed as:

$$\hat{X} = D(f_a, f_p, f_c)$$

where $D(\cdot)$ is the decoding function.
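To make the encode/decode data flow above concrete, here is a minimal NumPy sketch. The linear maps, the 64×64 image size, and the 48/48/32 split between $f_a$, $f_p$ and $f_c$ are all illustrative assumptions; the patent's encoder and decoder are convolutional networks, not linear layers.

```python
import numpy as np

# Toy "encoder" epsilon: partitions one latent vector into the three
# parts named in the text (appearance f_a, pose f_p, typical f_c), and
# toy "decoder" D: maps them back to an image-sized reconstruction.
# All sizes below are made-up assumptions for illustration only.
FEAT = 128                      # total latent size (assumed)
SPLIT = (48, 48, 32)            # sizes of f_a, f_p, f_c (assumed)

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((FEAT, 64 * 64)) * 0.01   # linear stand-in for the CNN encoder
W_dec = rng.standard_normal((64 * 64, FEAT)) * 0.01   # linear stand-in for the CNN decoder

def encode(x):
    """epsilon(X): produce (f_a, f_p, f_c) from a 64x64 image."""
    z = W_enc @ x.ravel()
    a, p = SPLIT[0], SPLIT[0] + SPLIT[1]
    return z[:a], z[a:p], z[p:]

def decode(f_a, f_p, f_c):
    """D(.): reconstruct an image from the concatenated features."""
    return (W_dec @ np.concatenate([f_a, f_p, f_c])).reshape(64, 64)

x = rng.standard_normal((64, 64))      # stand-in masked-person region image
f_a, f_p, f_c = encode(x)
x_hat = decode(f_a, f_p, f_c)          # reconstructed image X-hat
```

The point of the sketch is only the shape of the data flow: one input image, three named feature vectors, one reconstruction.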
After the three kinds of features are fully learned, the appearance features, typical features and pose features of the masked person are separated by designing different loss functions, specifically as follows:
(1) Cross-reconstruction loss function:

$$\mathcal{L}_{\text{cross}} = \left\| D\!\left(f_a^{t_1}, f_c^{t_1}, f_p^{t_2}\right) - X^{t_2} \right\|_2^2$$

wherein $t_1, t_2$ are different moments in the same video and $c_1, c_2$ are different video scenes.

The cross-reconstruction loss function proposed by the application uses the appearance feature $f_a$ and typical feature $f_c$ at moment $t_1$ together with the pose feature $f_p$ at moment $t_2$ to reconstruct the image at moment $t_2$. Because $f_a$ and $f_c$ do not depend on pose, the $f_p$ of the current frame can be matched with the $f_a$ and $f_c$ of any other frame of the same video to reconstruct the same object. This forces $f_a$ and $f_c$ to stay similar across all frames of the video, i.e. to behave as constant factors, and thereby separates out the non-pose features ($f_a$ and $f_c$).
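The cross-reconstruction idea can be sketched numerically as follows. Only the loss computation mirrors the text (decode the frame at $t_2$ from $t_1$'s appearance and typical features plus $t_2$'s pose feature, then take the mean squared error); the `toy_decode` function and all feature sizes are stand-in assumptions, not the patent's CNN decoder.

```python
import numpy as np

def cross_reconstruction_loss(decode, f_a_t1, f_c_t1, f_p_t2, frame_t2):
    """Reconstruct the t2 frame from (f_a, f_c) of t1 and f_p of t2,
    then penalise the pixel-wise squared error against the real frame."""
    recon = decode(f_a_t1, f_c_t1, f_p_t2)
    return float(np.mean((recon - frame_t2) ** 2))

def toy_decode(f_a, f_c, f_p):
    # Stand-in decoder for demonstration only (assumption, not the
    # patent's network): an outer product plus a scalar offset.
    return np.outer(f_a, f_p) + f_c.mean()

rng = np.random.default_rng(1)
f_a1, f_c1 = rng.standard_normal(8), rng.standard_normal(4)   # features at t1
f_p2 = rng.standard_normal(8)                                 # pose at t2
frame2 = toy_decode(f_a1, f_c1, f_p2)   # a frame the toy decoder can hit exactly

loss = cross_reconstruction_loss(toy_decode, f_a1, f_c1, f_p2, frame2)
```

With a perfect reconstruction the loss is zero; any mismatch between the decoded frame and the real frame increases it.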
(2) Pose similarity loss function:

$$\mathcal{L}_{\text{pose-sim}} = \left\| \frac{1}{n_1}\sum_{t=1}^{n_1} f_p^{c_1,t} - \frac{1}{n_2}\sum_{t=1}^{n_2} f_p^{c_2,t} \right\|_2^2$$

wherein $n_1$ is the number of video frames in video scene $c_1$ and $n_2$ is the number of video frames in video scene $c_2$.

In different scenes, $f_p$ is subject to interference from $f_a$. To ensure that $f_p$ contains only pose information, the pose similarity loss function is proposed; it exploits the consistency of the same person's pose information across different scenes to delete the appearance components present in $f_p$.
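A minimal NumPy sketch of the pose-similarity term: average the per-frame pose features of the same walker in two scenes and penalise the distance between the two averages, so that $f_p$ cannot carry scene-specific appearance. The feature size and the synthetic data are illustrative assumptions.

```python
import numpy as np

def pose_similarity_loss(fp_scene1, fp_scene2):
    """fp_scene*: (n_frames, d) arrays of per-frame pose features for
    the same person in two different scenes."""
    mean1 = fp_scene1.mean(axis=0)   # (1/n1) * sum over the n1 frames of c1
    mean2 = fp_scene2.mean(axis=0)   # (1/n2) * sum over the n2 frames of c2
    return float(np.sum((mean1 - mean2) ** 2))

rng = np.random.default_rng(2)
base = rng.standard_normal(16)                        # shared gait signature
fp_c1 = base + 0.01 * rng.standard_normal((5, 16))    # n1 = 5 frames, scene c1
fp_c2 = base + 0.01 * rng.standard_normal((7, 16))    # n2 = 7 frames, scene c2
loss = pose_similarity_loss(fp_c1, fp_c2)
```

Note that the two scenes may have different frame counts; only the averages are compared.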
(3) Canonical consistency loss function:

$$\mathcal{L}_{\text{cano}} = \mathbb{E}_{i,j}\left[ \left\| f_c^{c_1,t_i} - f_c^{c_1,t_j} \right\|_2^2 + \left\| f_c^{c_1,t_i} - f_c^{c_2,t_i} \right\|_2^2 \right]$$

wherein $i, j \in [1, n_1]$. Each individual's typical features are invariant across moments and scenes; based on this property, the canonical consistency loss function separates the typical features from the non-pose features.
During feature separation, the appearance features are deleted, while the pose features and typical features of the masked person are kept for subsequent recognition.
Step 103, averaging the typical features of the masked-person region image to obtain static gait features, and inputting the pose features into a preset LSTM network for processing to obtain dynamic gait features.

In step 102, after the typical features and pose features of the person are obtained by separation, the typical features are averaged to obtain the person's static gait features, and the pose features are input into a multi-layer LSTM network designed with an incremental identity loss function for feature processing to obtain the dynamic gait features.
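Step 103 can be sketched as follows: the static gait feature is the per-sequence average of the typical features $f_c$, and the dynamic gait feature is the final hidden state of an LSTM run over the per-frame pose features $f_p$. The single-cell LSTM below uses random weights purely to show the data flow; the patent's multi-layer LSTM and its incremental identity loss are not reproduced, and all dimensions are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(seq, W, U, b, hidden):
    """Run one LSTM layer over seq (T, d_in); return the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:
        z = W @ x + U @ h + b                    # all four gates at once
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                        # cell state update
        h = o * np.tanh(c)                       # hidden state update
    return h

rng = np.random.default_rng(3)
T, d_fc, d_fp, hidden = 10, 32, 16, 24           # assumed sizes
f_c_seq = rng.standard_normal((T, d_fc))         # typical features per frame
f_p_seq = rng.standard_normal((T, d_fp))         # pose features per frame

static_gait = f_c_seq.mean(axis=0)               # averaging -> static gait feature

W = rng.standard_normal((4 * hidden, d_fp)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
dynamic_gait = lstm_last_hidden(f_p_seq, W, U, b, hidden)   # dynamic gait feature
```

The two resulting vectors are what the classifier in step 104 would consume.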
Step 104, inputting the static gait features and the dynamic gait features into a preset classifier for classification and recognition to obtain an identity recognition result for the to-be-identified masked-person image.

It should be noted that inputting both the static and dynamic gait features of the masked person into the preset classifier yields an identity recognition result that is more accurate and reliable than one obtained from either the static gait features alone or the dynamic gait features alone.
According to the masked-person identity recognition method of the application, the masked-person region and the background region are segmented by the preset region segmentation network to obtain the masked-person region image, reducing interference from the background and other factors. The typical features and pose features of the masked person are extracted to obtain gait features, and identity is recognized from the gait features; this avoids the problem that face recognition methods cannot accurately extract the facial features of a masked person and therefore cannot recognize the identity. In addition, a masked person may dress differently in different scenes, so the appearance features extracted by a convolutional neural network would differ across scenes, and letting these varying appearance features participate in identification would degrade the recognition result. To avoid the influence of such appearance changes, the appearance features are separated out and deleted by the preset encoding-decoding model, which improves the accuracy of masked-person identity recognition and thereby solves the technical problem that existing face recognition methods struggle to identify the identity of a masked person.
The above is one embodiment of the masked-person identity recognition method provided by the present application; another embodiment is described below.

For ease of understanding, referring to FIG. 2, another embodiment of the masked-person identity recognition method provided by the present application includes:
step 201, inputting the to-be-identified mask image into a preset super-resolution network for processing, and outputting the super-resolution mask image.
In consideration of the problem of monitoring equipment, the quality of the photographed images of the people with the face is poor, so that the extraction of effective characteristics is affected. According to the application, the super-resolution network is preset to process the to-be-identified mask image, and the super-resolution mask image is output, so that the quality of the to-be-identified mask image is improved.
Further, the preset super-resolution network in this embodiment includes a first convolution module and a second convolution module, as shown in FIG. 3. The first convolution module includes six convolution layers and one sub-pixel convolution layer (upsampling layer), and increases the pixel resolution of the to-be-identified masked-person image in the length and width directions; the second convolution module includes four convolution layers and one sub-pixel convolution layer, and increases the pixel resolution of the to-be-identified masked-person image in the height direction. The input to-be-identified masked-person image is handled with a patch-based segmentation method, and low-resolution patches of 7×7 pixels are input. Each convolution layer performs feature learning with a 3×3 convolution kernel and is followed by a rectified linear unit (ReLU); the features learned by each convolution layer are the input of the next layer. To obtain the spatial information among the features of the to-be-identified masked-person image, the first convolution layer of the first convolution module is connected to the third convolution layer by a short skip connection, and to the sixth convolution layer by a long skip connection. The activation maps produced by the sixth convolution layer are processed by the sub-pixel convolution layer, which outputs the enlarged activation maps, thereby increasing the pixel resolution of the to-be-identified masked-person image in the length and width directions.

The input of the second convolution module is the output of the first convolution module (the enlarged activation maps). The first convolution layer of the second convolution module is connected to the fourth convolution layer by a long skip connection, and the sub-pixel convolution layer of the second module further enlarges the activation maps output by the fourth convolution layer, increasing the pixel resolution of the to-be-identified masked-person image in the height direction and finally outputting the super-resolution masked-person image.

The sub-pixel convolution layers in the preset super-resolution network have scale factors for the two axis directions; with r = 2, a sub-pixel convolution layer requires 4 activation maps as input. The activation maps produced from the low-resolution masked-person image are rearranged pixel-by-pixel into the corresponding high-resolution output, achieving super-resolution along both axes and outputting the super-resolution masked-person image.
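The pixel rearrangement performed by a sub-pixel convolution layer can be sketched directly in NumPy for the r = 2 case described above: 4 low-resolution activation maps are interleaved into one map of doubled resolution on both axes. The implementation below is a generic pixel-shuffle sketch, not the patent's code.

```python
import numpy as np

def pixel_shuffle(act_maps, r=2):
    """Rearrange (r*r, H, W) activation maps into one (H*r, W*r) map.

    With r = 2 this consumes 4 activation maps and interleaves their
    pixels so that output[y*r + i, x*r + j] = act_maps[i*r + j, y, x],
    doubling the resolution on both axes.
    """
    c, h, w = act_maps.shape
    assert c == r * r, "need r*r activation maps"
    out = act_maps.reshape(r, r, h, w)        # split channel dim into r x r offsets
    out = out.transpose(2, 0, 3, 1)           # (h, r, w, r): group offsets with pixels
    return out.reshape(h * r, w * r)

# 4 tiny 3x3 activation maps -> one 6x6 high-resolution map
maps = np.arange(4 * 3 * 3).reshape(4, 3, 3).astype(float)
hi_res = pixel_shuffle(maps, r=2)
```

Each 2×2 block of the output is drawn from the same spatial position of the 4 input maps, which is what makes the operation a learnable alternative to interpolation-based upsampling.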
Step 202, inputting the super-resolution masked-person image into the preset region segmentation network for segmentation processing to obtain a masked-person region image.

The preset region segmentation network in this embodiment is preferably a trained Mask R-CNN network. The Mask R-CNN network performs feature extraction and region segmentation through a ResNeXt-101 + FPN backbone, and RoIAlign is adopted in place of the pooling-only operation: an interpolation step is introduced, with bilinear interpolation performed first and pooling performed afterwards, thereby avoiding the quantization misalignment caused by sampling through pooling alone.
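The interpolation step that distinguishes RoIAlign from quantized pooling can be illustrated with a small bilinear sampler: given a fractional coordinate inside a feature map, blend the four surrounding values instead of snapping to the nearest cell. The clamping of border indices below is an assumption of this sketch, not taken from the patent.

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample img at fractional (y, x) by bilinear interpolation,
    the step RoIAlign performs before pooling each bin."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)   # clamp at the border (assumption)
    dy, dx = y - y0, x - x0
    top = img[y0, x0] * (1 - dx) + img[y0, x1] * dx   # blend along x, top row
    bot = img[y1, x0] * (1 - dx) + img[y1, x1] * dx   # blend along x, bottom row
    return top * (1 - dy) + bot * dy                  # blend along y

img = np.array([[0.0, 2.0],
                [4.0, 6.0]])
center = bilinear_sample(img, 0.5, 0.5)   # average of the four corners
```

Because the sample point never gets rounded to a grid cell, region features stay aligned with the box coordinates, which is exactly the misalignment that pooling alone introduces.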
And 203, inputting the mask area image into a preset encoding-decoding model to perform feature separation, deleting the extracted external features, and outputting typical features and gesture features of the mask area image.
And 204, carrying out averaging treatment on typical features of the area image of the mask person to obtain static gait features, and inputting the gesture features into a preset LSTM network for treatment to obtain dynamic gait features.
Step 205, inputting the static gait feature and the dynamic gait feature into a preset classifier for classification and identification, and obtaining an identity identification result of the to-be-identified mask image.
The details of steps 203 to 205 are identical to those of steps 102 to 104 and are not repeated here.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (7)
1. A mask person identity recognition method, comprising:
inputting the to-be-identified mask image into a preset region segmentation network for segmentation processing to obtain a mask region image;
inputting the mask area image into a preset encoding-decoding model for feature separation, deleting the extracted external features, and outputting typical features and gesture features of the mask area image;
averaging the typical features of the mask region image to obtain static gait features, and inputting the gesture features into a preset LSTM network for processing to obtain dynamic gait features;
inputting the static gait characteristics and the dynamic gait characteristics into a preset classifier for classification and identification to obtain an identification result of the to-be-identified mask image;
the configuration process of the preset encoding-decoding model comprises the following steps:
framing the acquired video data to obtain a training sample image;
sequentially inputting the training sample images to an encoding-decoding network, so that an encoder in the encoding-decoding network encodes the training sample images, outputting external features, gesture features and typical features of the training sample images, performing image reconstruction by a decoder in the encoding-decoding network based on the features output by the encoder, and outputting reconstructed images, wherein the training sample images are mask region images obtained by dividing mask images to be trained;
based on the reconstructed image and the training sample image, separating the non-gesture features of the training sample image through a cross reconstruction loss function, separating the gesture features of the training sample image through a pose similarity loss function, and separating the typical features of the training sample image from the non-gesture features through a canonical consistency loss function, wherein the non-gesture features comprise the external features and the typical features.
2. The mask person identity recognition method according to claim 1, wherein, before the step of inputting the to-be-identified mask image into the preset region segmentation network for segmentation processing to obtain the mask region image, the method further comprises:
inputting the to-be-identified mask image into a preset super-resolution network for processing, and outputting a super-resolution mask image;
correspondingly, the step of inputting the to-be-identified mask image into the preset region segmentation network for segmentation processing to obtain the mask region image comprises:
and inputting the super-resolution mask image into a preset region segmentation network for segmentation processing to obtain a mask region image.
3. The method of claim 2, wherein the preset super-resolution network comprises a first convolution module and a second convolution module;
the first convolution module comprises 6 convolution layers and one sub-pixel convolution layer, and is used for improving pixels in the length direction and the width direction of the to-be-identified mask image;
the second convolution module comprises 4 convolution layers and one sub-pixel convolution layer and is used for improving pixels in the height direction of the to-be-identified mask image.
4. The method of claim 1, wherein the cross reconstruction loss function is:

$$\mathcal{L}_{\text{cross-recon}} = \left\| D\!\left(f_a^{t_1}, f_c^{t_1}, f_p^{t_2}\right) - x^{t_2} \right\|_2^2$$

wherein $t_1, t_2$ are different moments within the same video, $f_a$ is the external feature, $f_p$ is the gesture feature, $f_c$ is the typical feature, $x^{t_2}$ is the training sample image at moment $t_2$, and $D(\cdot)$ is the decoding function.
5. The method of claim 4, wherein the pose similarity loss function is:

$$\mathcal{L}_{\text{pose-sim}} = \left\| \frac{1}{n_1}\sum_{t=1}^{n_1} f_p^{c_1,t} - \frac{1}{n_2}\sum_{t=1}^{n_2} f_p^{c_2,t} \right\|_2^2$$

wherein $n_1$ is the number of video frames under video scene $c_1$, and $n_2$ is the number of video frames under video scene $c_2$.
6. The method of claim 5, wherein the canonical consistency loss function:
7. The method of claim 1, wherein the preset region segmentation network is a trained Mask R-CNN network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843398.0A CN111950496B (en) | 2020-08-20 | 2020-08-20 | Mask person identity recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111950496A CN111950496A (en) | 2020-11-17 |
CN111950496B true CN111950496B (en) | 2023-09-15 |
Family
ID=73358545
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010843398.0A Active CN111950496B (en) | 2020-08-20 | 2020-08-20 | Mask person identity recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950496B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494962A (en) * | 2022-01-24 | 2022-05-13 | 上海商汤智能科技有限公司 | Object identification method, network training method, device, equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815874A (*) | 2019-01-17 | 2019-05-28 | 苏州科达科技股份有限公司 | Personnel identity recognition method, apparatus, device, and readable storage medium |
CN110084156A (*) | 2019-04-12 | 2019-08-02 | 中南大学 | Gait feature extraction method and gait-feature-based pedestrian identity recognition method |
CN110222634A (*) | 2019-06-04 | 2019-09-10 | 河海大学常州校区 | Human posture recognition method based on convolutional neural networks |
CN110991281A (*) | 2019-11-21 | 2020-04-10 | 电子科技大学 | Dynamic face recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||