CN110969089A — Lightweight face recognition system and recognition method under noise environment


Info

Publication number
CN110969089A
CN110969089A (publication) · CN201911059976.5A (application) · CN110969089B (grant)
Authority
CN
China
Prior art keywords
denoising
features
face image
face recognition
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911059976.5A
Other languages
Chinese (zh)
Other versions
CN110969089B (en)
Inventor
白慧慧
郭璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN201911059976.5A priority Critical patent/CN110969089B/en
Publication of CN110969089A publication Critical patent/CN110969089A/en
Application granted granted Critical
Publication of CN110969089B publication Critical patent/CN110969089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a lightweight face recognition system and recognition method for noisy environments, belonging to the technical field of computer face recognition. A noisy picture is first input and features are extracted with a 3 × 3 convolution kernel; a denoising block is then added to denoise the features extracted by the convolutional layer; a depthwise separable convolution and a ten-layer bottleneck structure follow; and the output features are finally obtained through a 1 × 1 convolution and a linear global depthwise separable convolution. The invention provides an effective, self-contained denoising module, LD-MobileFaceNet, which can be conveniently combined with any convolutional neural network structure. The denoising operation adopts a non-local means method, improving the performance of MobileFaceNet and the accuracy of face recognition in noisy environments. Part of the bottleneck layers are removed from the network, making the system lighter while achieving better robustness on noisy data sets. The nonlinear activation function swish replaces PReLU, compensating for the loss of recognition precision and ensuring recognition accuracy under different noise levels.

Description

Lightweight face recognition system and recognition method under noise environment
Technical Field
The invention relates to the technical field of computer face recognition, in particular to a lightweight face recognition system and a recognition method in a noise environment.
Background
Nowadays, face recognition has been widely applied to the fields of face login, mobile payment, identity authentication, and the like. These application scenarios often require high accuracy, however, face recognition still faces a great challenge due to interference of complex environments such as occlusion, illumination, noise, and the like. With the development of artificial intelligence, deep learning has made a major breakthrough in face recognition research, and some high-performance methods based on convolutional neural networks are proposed.
For example, DeepFace achieves high accuracy on the Labeled Faces in the Wild (LFW) dataset. The triplet loss enabled further progress in face recognition research: it considers the relative difference between the distances of matched and non-matched pairs, and can extract better features. The additive angular margin loss (ArcFace loss) uses the arccosine function to add an extra angular margin to the target, achieving an exact correspondence between angle and arc on the normalized hypersphere.
The above identification methods based on convolutional neural networks all provide more distinctive features for face identification, but some practical problems also exist. The deep neural network based on deep learning has a large number of parameters and layers, and occupies a large memory space.
Compared with classical image classification tasks, human faces are highly similar to one another; simple features cannot distinguish them accurately, so a suitable loss function must be designed to improve recognition ability. Moreover, the model size of the ArcFace (LResNet100E-IR) network reaches about 250 megabytes. In practical applications, face recognition networks often run on mobile terminals and embedded devices, where memory and computing resources are limited, so large face recognition networks are difficult to deploy widely.
Lightweight convolutional neural networks have therefore emerged, such as MobileNetV1 and MobileNetV2, which are widely used in engineering and industrial production. The MobileNetV2 model uses structures such as depthwise separable convolutions, inverted residuals, and linear bottlenecks to achieve an efficient, lightweight model. MobileFaceNet adopts the ArcFace loss function and improves on MobileNetV2, more than doubling face recognition speed while reaching 99.55% accuracy on the LFW dataset.
However, real-time face recognition based on lightweight convolutional neural networks often faces complicated backgrounds: noise, occlusion, and insufficient contrast all hinder the practical application of face recognition network models. To apply such network models more widely in real life, improving the robustness of the algorithm in complicated, changing environments is indispensable. A common approach is to add a denoising network that removes image noise and cleans the data in the image preprocessing stage. For example, an end-to-end denoising network can output a clear image, but it adds a large number of parameters and makes the network structure more complex; used in the preprocessing stage, it is not suitable for real-time face recognition.
Disclosure of Invention
The invention aims to provide a lightweight face recognition system and a recognition method in a noise environment, so as to solve the technical problems in the background technology.
In order to achieve the purpose, the invention adopts the following technical scheme:
in one aspect, the present invention provides a lightweight face recognition system in a noise environment, including:
the image acquisition module is used for acquiring an original face image to be recognized;
the characteristic extraction module is used for extracting the identification characteristics of the collected original face image;
the denoising module is used for denoising the face image with the extracted identification features;
the depth separable convolution module is used for carrying out channel correspondence on the face image subjected to the denoising processing and the original face image;
and the characteristic output module is used for outputting the final identification characteristic.
Preferably, the feature extraction module is a 3 × 3 convolutional layer.
Preferably, the system further comprises a ten-layer bottleneck structure.
Preferably, the system further comprises a 1 × 1 convolutional layer.
Preferably, the feature output module comprises a linear global depth separable convolutional layer.
In another aspect, the present invention provides a method for performing lightweight face recognition in a noise environment by using the system described above, including:
extracting identification characteristics of an input original face image with noise;
carrying out feature denoising processing on the extracted identification features;
performing channel correspondence on the face image subjected to noise removal processing and the original face image;
and sequentially obtaining the final identification features through ten layers of bottleneck structures, 1 × 1 convolution and linear global depth separable convolution.
Preferably, the feature denoising processing includes:
matching the images by combining the Euclidean distance and a weighted subregion matching method, and reflecting local and global characteristics of the images by the calculated similarity, wherein the expression is as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j \in S} w(x_i, x_j)\, g(x_j) \tag{1}$$

In formula (1), i is the index of an output position, j represents all possible positions, x is the input original face image, y is the denoised feature map with the same pixels as the corresponding positions of x, $g(x_j)$ is the feature obtained by a down-sampling operation on x, w is a Gaussian function, $\frac{1}{C(x)}$ is a normalization operation, S denotes all spatial positions, and $y = F(x)$ represents the mapping of input x to output y.
Preferably,

$$w(x_i, x_j) = e^{x_i^T x_j} \tag{2}$$

where e is the exponential function, x is an input pixel, and T denotes the transpose of x; the estimate of the current pixel is obtained by a Gaussian-weighted average of its neighborhood pixels.
Preferably, the activation function in the bottleneck structure is a swish activation function:
f(x)=x·sigmoid(x) (3)。
Preferably, the loss function of the linear global depthwise separable convolutional layer is the ArcFace loss, denoted $L_1$:

$$L_1 = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{r \cos(\theta_{y_i} + t)}}{e^{r \cos(\theta_{y_i} + t)} + \sum_{j=1, j \neq y_i}^{n} e^{r \cos \theta_j}} \tag{4}$$

where $\theta_j$ is the angle between the current weight and the target feature, N is the batch size, $y_i$ is the class to which the input feature belongs, $\theta_{y_i}$ represents the angle between the input feature and its true weight, n represents the total number of classes, t represents an additive angular margin penalty that enhances intra-class compactness and inter-class difference, and all extracted face features are distributed on a hypersphere of radius r.
The beneficial effects of the invention are as follows: an effective, self-contained denoising module, LD-MobileFaceNet, that can be conveniently combined with any convolutional neural network structure; a non-local means method in the denoising operation, improving the performance of MobileFaceNet and the accuracy of face recognition in noisy environments; removal of part of the bottleneck layers of the network, making the system lighter while achieving better robustness on noisy data sets; and replacement of PReLU with the nonlinear swish activation function, compensating for the loss of recognition precision and ensuring recognition accuracy under different noise levels.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic block diagram of a lightweight face recognition system in a noise environment according to an embodiment of the present invention.
Fig. 2 is a schematic block diagram of a denoising module according to an embodiment of the present invention.
FIG. 3 is a graph comparing performance of LD-MobileFaceNet (swish) and MobileFaceNet according to the embodiment of the present invention.
Fig. 4 is a performance comparison graph of different recognition methods using different loss functions according to the embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating the recognition accuracy comparison of three models, namely D-MobileFaceNet (4.0MB), LD-MobileFaceNet (3.0MB) and LD-MobileFaceNet (swish) (3.0MB), with an original picture size of 112 × 112 according to an embodiment of the present invention.
Detailed Description
The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or modules, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, and/or groups thereof.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained by taking specific embodiments as examples with reference to the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
It will be understood by those of ordinary skill in the art that the figures are merely schematic representations of one embodiment and that the elements or devices in the figures are not necessarily required to practice the present invention.
Example 1
An embodiment 1 of the present invention provides a lightweight face recognition system in a noise environment, including:
the image acquisition module is used for acquiring an original face image to be recognized;
the characteristic extraction module is used for extracting the identification characteristics of the collected original face image;
the denoising module is used for denoising the face image with the extracted identification features;
the depth separable convolution module is used for carrying out channel correspondence on the face image subjected to the denoising processing and the original face image;
and the characteristic output module is used for outputting the final identification characteristic.
The feature extraction module is a 3 × 3 convolutional layer.
The system also includes a ten-layer bottleneck structure and a 1 × 1 convolutional layer.
The feature output module includes a linear global depth separable convolutional layer.
When the system is used for carrying out lightweight face recognition in a noise environment, extracting recognition features of an input original face image with noise; carrying out feature denoising processing on the extracted identification features; performing channel correspondence on the face image subjected to noise removal processing and the original face image; and sequentially obtaining the final identification features through ten layers of bottleneck structures, 1 × 1 convolution and linear global depth separable convolution.
The feature denoising processing comprises the following steps:
matching the images by combining the Euclidean distance and a weighted subregion matching method, and reflecting local and global characteristics of the images by the calculated similarity, wherein the expression is as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j \in S} w(x_i, x_j)\, g(x_j) \tag{1}$$

In formula (1), i is the index of an output position, j represents all possible positions, x is the input original face image, y is the denoised feature map with the same pixels as the corresponding positions of x, $g(x_j)$ is the feature obtained by a down-sampling operation on x, w is a Gaussian function, $\frac{1}{C(x)}$ is a normalization operation, S denotes all spatial positions, and $y = F(x)$ represents the mapping of input x to output y.
$$w(x_i, x_j) = e^{x_i^T x_j} \tag{2}$$

where e is the exponential function, x is an input pixel, and T denotes the transpose of x; the estimate of the current pixel is obtained by a Gaussian-weighted average of its neighborhood pixels.
The activation function in the bottleneck structure is the swish activation function:
f(x)=x·sigmoid(x) (3)。
The loss function of the linear global depthwise separable convolutional layer is the ArcFace loss, denoted $L_1$:

$$L_1 = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{r \cos(\theta_{y_i} + t)}}{e^{r \cos(\theta_{y_i} + t)} + \sum_{j=1, j \neq y_i}^{n} e^{r \cos \theta_j}} \tag{4}$$

where $\theta_j$ is the angle between the current weight and the target feature, N is the batch size, $y_i$ is the class to which the input feature belongs, $\theta_{y_i}$ represents the angle between the input feature and its true weight, n represents the total number of classes, t represents an additive angular margin penalty that enhances intra-class compactness and inter-class difference, and all extracted face features are distributed on a hypersphere of radius r.
Example 2
Fig. 1 shows the structure of the LD-MobileFaceNet(swish) network proposed in embodiment 2 of the present invention. A noisy picture is first input; features are extracted with a 3 × 3 convolution kernel; a denoising block then denoises the features extracted by the convolutional layer; a depthwise separable convolution and a ten-layer bottleneck structure follow; and the output features are finally obtained through a 1 × 1 convolution and a linear global depthwise separable convolution, so that the network can learn better and the accuracy of face matching is improved.
In embodiment 2 of the present invention, the idea of a non-local mean algorithm is used, taking into account every position information in space. A denoising module is defined, and denoising representation of extracting features is realized as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j \in S} w(x_i, x_j)\, g(x_j) \tag{1}$$

In formula (1), i is the index of an output position, j represents all possible positions, x is the input original face image, y is the denoised feature map with the same pixels as the corresponding positions of x, $g(x_j)$ is the feature obtained by a down-sampling operation on x, w is a Gaussian function, $\frac{1}{C(x)}$ is a normalization operation, S denotes all spatial positions, and $y = F(x)$ represents the mapping of input x to output y.
Formula (1) combines the Euclidean distance with a weighted sub-region matching method to match images; the computed similarity reflects both local and global characteristics of the image. Equation (2) is the Gaussian function used for smooth image denoising; for images with random noise, this function achieves a good denoising effect.
$$w(x_i, x_j) = e^{x_i^T x_j} \tag{2}$$

where e is the exponential function, x is an input pixel, and T denotes the transpose of x; the estimate of the current pixel is obtained by a Gaussian-weighted average of its neighborhood pixels.
Fig. 2 shows the structure of the denoising module. The input feature map x has 64 channels; maximum pooling layers are added at positions a and b to reduce the picture size and the amount of computation, and the module realizes formula (2) through matrix multiplication. A 1 × 1 convolution then yields a denoised feature map whose output channels match the input channels. Because this operation keeps the input and output channels unchanged, the denoising block can be applied to any layer without affecting the network structure. Since denoising also removes part of the useful information, a residual connection is used to preserve more of the original features.
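The denoising operation described above can be sketched in NumPy as a non-local means step. This minimal sketch implements equations (1) and (2) as a softmax-normalized pairwise dot-product average plus the residual connection; the max pooling at positions a and b and the channel-restoring 1 × 1 convolution of Fig. 2 are omitted for brevity, so this illustrates the operation rather than reproducing the exact module of the patent.

```python
import numpy as np

def nonlocal_denoise(x):
    """Non-local mean denoising of equations (1)-(2) on a (C, H, W) feature map.

    Each output position is a weighted average over ALL spatial positions,
    with weights w(x_i, x_j) = exp(x_i^T x_j) normalized by C(x) (a softmax
    over j), followed by the residual connection mentioned in the text.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w).T                # (N, C): one row per position
    logits = feats @ feats.T                     # pairwise x_i^T x_j
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    wts = np.exp(logits)
    wts /= wts.sum(axis=1, keepdims=True)        # the 1/C(x) normalization
    y = wts @ feats                              # denoised features
    return x + y.T.reshape(c, h, w)              # residual keeps original info

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))               # toy 4-channel feature map
z = nonlocal_denoise(x)
```

Because the input and output shapes match, such a block can be dropped after any layer, which is the property the text relies on.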
In embodiment 2 of the present invention, considering the compatibility of practical applications, a lightweight robust face recognition network, LD-MobileFaceNet(swish), is designed based on the above denoising module. Features are first extracted through a 3 × 3 convolution kernel with stride 2, and a denoising block then denoises the features. The depthwise separable convolution is a building block of MobileNetV1 used to reduce computational expense.
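The saving from a depthwise separable convolution can be checked with simple parameter arithmetic. A sketch; the 64-channel width is borrowed from the Fig. 2 description for illustration, not a fixed network width:

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) plus pointwise 1 x 1."""
    return c_in * k * k + c_in * c_out

std = conv_params(64, 64, 3)                  # 36864 weights
sep = depthwise_separable_params(64, 64, 3)   # 576 + 4096 = 4672 weights
ratio = sep / std                             # equals 1/c_out + 1/k^2
```

For a 3 × 3 kernel this ratio is roughly 1/9 plus a small pointwise term, which is the source of MobileNetV1's cost reduction.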
In embodiment 2 of the present invention, to reduce the model size, part of the redundant layers are deleted and only a ten-layer bottleneck structure is used. In the bottleneck structure, the swish activation function replaces PReLU as the nonlinear activation to improve the accuracy of the neural network; the function is defined as:
f(x)=x·sigmoid(x) (3)。
the swish function uses a sigmoid method, and only a simple scalar needs to be input by using self-gating. It is a smooth non-monotonic function, and experiments show that the performance is superior to that of ReLU.
Finally, the features are output by using a linear global depth separable convolution and a linear 1 x 1 convolution.
In embodiment 2 of the invention, the better-performing ArcFace loss is used as the loss function to extract more discriminative features. The loss function, denoted $L_1$, is

$$L_1 = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{r \cos(\theta_{y_i} + t)}}{e^{r \cos(\theta_{y_i} + t)} + \sum_{j=1, j \neq y_i}^{n} e^{r \cos \theta_j}} \tag{4}$$

where $\theta_j$ is the angle between the current weight and the target feature, N is the batch size, $y_i$ is the class to which the input feature belongs, $\theta_{y_i}$ represents the angle between the input feature and its true weight, n represents the total number of classes, t represents an additive angular margin penalty that enhances intra-class compactness and inter-class difference, and all extracted face features are distributed on a hypersphere of radius r.
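A NumPy sketch of the ArcFace loss of equation (4), computed from per-class angles. The scale r = 64 and margin t = 0.5 are common ArcFace defaults assumed here for illustration, not values fixed by the text:

```python
import numpy as np

def arcface_loss(theta, labels, r=64.0, t=0.5):
    """Equation (4) from a matrix of per-class angles theta (N, n), in radians.

    The target-class angle theta_{y_i} receives the additive angular margin t
    before the cosine; r scales features onto a hypersphere of radius r.
    """
    n_samples = theta.shape[0]
    logits = r * np.cos(theta)
    rows = np.arange(n_samples)
    logits[rows, labels] = r * np.cos(theta[rows, labels] + t)  # add margin
    logits -= logits.max(axis=1, keepdims=True)                 # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

theta = np.array([[0.3, 1.2, 1.4],    # sample 0 is closest to class 0
                  [1.5, 0.2, 1.3]])   # sample 1 is closest to class 1
labels = np.array([0, 1])
loss = arcface_loss(theta, labels)
```

With t = 0 this reduces to an ordinary softmax cross-entropy over r·cos θ; the margin makes the target strictly harder, which is what enforces intra-class compactness.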
Example 3
In embodiment 3 of the present invention, a denoising module is used to train on a CASIA-Webface data set with a noise level of 25.
LFW data sets with different noise levels are used for testing. The data set contains 13,233 pictures; 6,000 face pairs are randomly selected to form verification pairs, of which 3,000 pairs belong to the same person and 3,000 pairs belong to different persons, with one face picture of each person used for test and verification.
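The pair construction described above can be sketched as a toy sampler in the spirit of the 3,000 matched + 3,000 mismatched pairs; this is not the official LFW pair protocol, and the gallery identities and file names below are hypothetical:

```python
import random

def build_pairs(identity_to_images, n_pos, n_neg, seed=0):
    """Sample (image_a, image_b, same_person) verification pairs."""
    rng = random.Random(seed)
    multi = [i for i, imgs in identity_to_images.items() if len(imgs) >= 2]
    pos, neg = [], []
    while len(pos) < n_pos:                       # matched pairs: same identity
        pid = rng.choice(multi)
        a, b = rng.sample(identity_to_images[pid], 2)
        pos.append((a, b, True))
    all_ids = list(identity_to_images)
    while len(neg) < n_neg:                       # mismatched: distinct identities
        p, q = rng.sample(all_ids, 2)
        neg.append((rng.choice(identity_to_images[p]),
                    rng.choice(identity_to_images[q]), False))
    return pos + neg

# Hypothetical tiny gallery: identity -> list of image paths.
gallery = {f"id{i}": [f"id{i}_a.jpg", f"id{i}_b.jpg"] for i in range(5)}
pairs = build_pairs(gallery, n_pos=3, n_neg=3)
```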
The experimental data sets are corrupted with additive white Gaussian noise: noise of level 25 is added to the training set, and testing is carried out on test sets with noise levels 0, 15, 25, 35, and 50. The input picture size is 112 × 96; faces are detected with MTCNN, and face alignment is performed according to the five labeled feature points. For better performance, training and testing at 112 × 112 are also used. Training uses the SGD optimizer with a batch size of 256 and a momentum of 0.9. The initial learning rate is set to 0.1, decreased by a factor of 0.1 every 20 iterations, for a total of 70 iterations.
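The corruption and schedule above can be sketched as follows. The step decay is interpreted here as "multiply by 0.1 every 20 training stages", which is an assumption about the wording, and the flat gray test image is purely illustrative:

```python
import numpy as np

def add_gaussian_noise(img, sigma=25.0, seed=0):
    """Additive white Gaussian noise at level sigma on an 8-bit image,
    matching the sigma = 25 corruption applied to the training set."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def learning_rate(stage, base=0.1, gamma=0.1, step=20):
    """Step schedule: start at 0.1, multiply by 0.1 every 20 stages."""
    return base * gamma ** (stage // step)

clean = np.full((112, 96), 128, dtype=np.uint8)   # flat gray 112 x 96 image
noisy = add_gaussian_noise(clean)
```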
Table 1 compares the performance of the denoising models of the embodiment of the present invention with CosFace, ArcFace, and MobileFaceNet; in Table 1, the features extracted by the first convolutional layer are denoised. MobileFaceNet serves as the base model; L-MobileFaceNet uses a 10-layer bottleneck to make the network lightweight; D-MobileFaceNet adds a denoising module to the base model for comparison; LD-MobileFaceNet adds the denoising block to the L-MobileFaceNet model; and LD-MobileFaceNet(swish) further replaces the PReLU activation function with the swish nonlinear activation. These models were trained on a 112 × 96 CASIA-WebFace data set with noise level 25 and tested on LFW data sets with different noise levels.
Table 1: performance comparison of the denoising models with CosFace, ArcFace, and MobileFaceNet. Models using denoising blocks were all trained on the CASIA-WebFace data set (σ = 25, size 112 × 96) and tested on LFW data sets of different noise levels.
[Table 1 is rendered as an image in the original document.]
As Table 1 shows, the effectiveness of the denoising block is evident, and the swish activation function helps improve accuracy.
Table 2 compares different positions and numbers of denoising blocks: Model A uses a denoising block after the first convolutional layer, Model B adds it after the depthwise separable convolutional layer, and Model C adds denoising blocks at both positions. Table 2 shows that the position and number of denoising blocks have little effect on the result, but using multiple denoising blocks increases computation and parameters, and placing the block earlier performs better. Meanwhile, fig. 3 compares the performance of LD-MobileFaceNet(swish) and MobileFaceNet; a comparative experiment on a 112 × 112 data set demonstrates the generalization of the method of the embodiment of the present invention. Fig. 3(a) shows the comparison for an original-image pixel size of 112 × 96, and fig. 3(b) for a pixel size of 112 × 112.
TABLE 2
[Table 2 is rendered as an image in the original document.]
For a fair comparison of the denoising models of the embodiment of the present invention, different loss functions were used to test the performance of LD-MobileFaceNet(swish); as shown in fig. 4, ArcFace loss, CosFace loss, SphereFace loss, and Softmax loss are compared, with ArcFace loss and CosFace loss performing better. Fig. 5 visually displays the accuracy of the three models D-MobileFaceNet, LD-MobileFaceNet, and LD-MobileFaceNet(swish) on a 112 × 112 data set.
Therefore, the denoising module of the embodiment of the present invention effectively improves recognition accuracy. LD-MobileFaceNet(swish) shows better robustness in noisy environments and clearly surpasses MobileFaceNet, and its model is smaller; the lightweight, robust face recognition network therefore has greater practical application value.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A lightweight face recognition system in a noisy environment, comprising:
the image acquisition module is used for acquiring an original face image to be recognized;
the characteristic extraction module is used for extracting the identification characteristics of the collected original face image;
the denoising module is used for denoising the face image with the extracted identification features;
the depth separable convolution module is used for carrying out channel correspondence on the face image subjected to the denoising processing and the original face image;
and the characteristic output module is used for outputting the final identification characteristic.
2. The system of claim 1, wherein the feature extraction module is a 3 x 3 convolutional layer.
3. The system of claim 1, further comprising a ten-layer bottleneck structure.
4. The lightweight face recognition system in a noisy environment of claim 2, further comprising a 1 × 1 convolutional layer.
5. The lightweight face recognition system in a noisy environment according to claim 4, wherein said feature output module comprises a linear global depthwise separable convolutional layer.
6. A method for lightweight face recognition in a noisy environment using the system according to any one of claims 1 to 5, characterized by:
extracting identification features from an input noisy original face image;
performing feature denoising processing on the extracted identification features;
performing channel correspondence between the denoised face image and the original face image;
and sequentially obtaining the final identification features through a ten-layer bottleneck structure, a 1 × 1 convolution, and a linear global depthwise separable convolution.
7. The method of claim 6, wherein:
the feature denoising processing comprises the following steps:
matching the images by combining the Euclidean distance with a weighted sub-region matching method, the calculated similarity reflecting both the local and the global characteristics of the image, with the expression:

$$y_i = \frac{1}{C(x)} \sum_{\forall j \in S} w(x_i, \hat{x}_j)\,\hat{x}_j \tag{1}$$

in formula (1), $i$ is the index of an output position, $j$ ranges over all possible positions, $x$ is the input original face image, $y$ is the denoised feature map with the same pixels as the corresponding $x$, $\hat{x}$ is the feature obtained by a down-sampling operation on $x$, $w$ is a Gaussian function, $\frac{1}{C(x)}$ is the normalization operation, $S$ denotes all spatial positions, and $y = f(x)$ denotes the mapping from input $x$ to output $y$.
8. The method of claim 7, wherein:
$$w(x_i, \hat{x}_j) = e^{x_i^{T} \hat{x}_j} \tag{2}$$

wherein $e$ is the exponential function, $x$ is an input pixel, $T$ denotes the transpose, and the estimate of the current pixel is obtained as a Gaussian-function-weighted average over the neighborhood pixels.
9. The method of claim 8, wherein:
the activation function in the bottleneck structure is the swish activation function:
f(x)=x·sigmoid(x) (3)。
10. The method of claim 9, wherein:
the loss function of the linear global depthwise separable convolutional layer is the ArcFace loss, denoted $L_1$:

$$L_1 = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{r\cos(\theta_{y_i} + t)}}{e^{r\cos(\theta_{y_i} + t)} + \sum_{j=1, j \neq y_i}^{n} e^{r\cos\theta_j}} \tag{4}$$

where $\theta_j$ is the angle between the current weight and the target feature, $N$ is the batch size, $y_i$ is the class to which the input feature belongs, $\theta_{y_i}$ is the angle between the input feature and its true-class weight, $n$ is the total number of classes, $t$ is the additive angular margin penalty, and all extracted face features are distributed on a hypersphere of radius $r$.
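Read together, claims 7 to 10 describe three concrete operations: a Gaussian-weighted non-local denoising step (formulas (1) and (2)), the swish activation (formula (3)), and an ArcFace-style angular-margin loss (formula (4)). A minimal NumPy sketch of these operations is given below; the batch handling, the default feature scale r = 64 and margin t = 0.5, and all function names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def swish(x):
    # Formula (3): f(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def nonlocal_denoise(x, x_hat):
    # Formulas (1)-(2): y_i = (1/C(x)) * sum_j e^{x_i^T xhat_j} * xhat_j
    # x: (S, d) features at S spatial positions; x_hat: (S', d) down-sampled features
    logits = x @ x_hat.T                          # pairwise similarities x_i^T xhat_j
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability before exp
    w = np.exp(logits)                            # Gaussian function of formula (2)
    w /= w.sum(axis=1, keepdims=True)             # normalization 1/C(x)
    return w @ x_hat                              # weighted average over all positions

def arcface_loss(features, weights, labels, r=64.0, t=0.5):
    # Formula (4): additive angular margin t on a hypersphere of radius r
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(f @ W, -1.0, 1.0)               # cos(theta_j) for every class j
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos[idx, labels])         # angle to the true-class weight
    logits = r * cos
    logits[idx, labels] = r * np.cos(theta_y + t) # apply margin to the true class only
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(p[idx, labels]))       # mean cross-entropy over the batch
```

The exponential weighting in `nonlocal_denoise` is the embedded-Gaussian form of the non-local operation; after normalization it is a softmax over positions, so each output pixel is a convex combination of the down-sampled features, which is what lets the similarity capture both local and global structure.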
CN201911059976.5A 2019-11-01 2019-11-01 Lightweight face recognition system and recognition method in noise environment Active CN110969089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059976.5A CN110969089B (en) 2019-11-01 2019-11-01 Lightweight face recognition system and recognition method in noise environment


Publications (2)

Publication Number Publication Date
CN110969089A true CN110969089A (en) 2020-04-07
CN110969089B CN110969089B (en) 2023-08-18

Family

ID=70029993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911059976.5A Active CN110969089B (en) 2019-11-01 2019-11-01 Lightweight face recognition system and recognition method in noise environment

Country Status (1)

Country Link
CN (1) CN110969089B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563858A (en) * 2020-05-14 2020-08-21 大连理工大学 Denoising method of human embryo heart ultrasonic image based on deep convolutional neural network
CN111680595A (en) * 2020-05-29 2020-09-18 新疆爱华盈通信息技术有限公司 Face recognition method and device and electronic equipment
CN111898413A (en) * 2020-06-16 2020-11-06 深圳市雄帝科技股份有限公司 Face recognition method, face recognition device, electronic equipment and medium
CN112115911A (en) * 2020-09-28 2020-12-22 安徽大学 Light-weight SAR image target detection method based on deep learning
CN112733665A (en) * 2020-12-31 2021-04-30 中科院微电子研究所南京智能技术研究院 Face recognition method and system based on lightweight network structure design
CN113343801A (en) * 2021-05-26 2021-09-03 郑州大学 Automatic wireless signal modulation and identification method based on lightweight convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
US20180137406A1 (en) * 2016-11-15 2018-05-17 Google Inc. Efficient Convolutional Neural Networks and Techniques to Reduce Associated Computational Costs
CN109063666A (en) * 2018-08-14 2018-12-21 电子科技大学 The lightweight face identification method and system of convolution are separated based on depth
CN109800795A (en) * 2018-12-29 2019-05-24 广州市贺氏办公设备有限公司 A kind of fruit and vegetable recognition method and system
CN109948573A (en) * 2019-03-27 2019-06-28 厦门大学 A kind of noise robustness face identification method based on cascade deep convolutional neural networks


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LULU GUO et al.: "A Lightweight and Robust Face Recognition Network on Noisy Condition", 《PROCEEDINGS OF APSIPA ANNUAL SUMMIT AND CONFERENCE 2019》, pages 1964 - 1968 *
YANPENG SUN et al.: "An Object Detection Network for Embedded System", 《2019 IEEE INTERNATIONAL CONFERENCE ON UBIQUITOUS COMPUTING & COMMUNICATIONS (IUCC) AND DATA SCIENCE AND COMPUTATIONAL INTELLIGENCE (DSCI) AND SMART COMPUTING, NETWORKING AND SERVICES (SMARTCNS)》 *
JIANG Kun; ZHENG Lu; TIE Jun: "Intelligent Face Recognition Based on LBP and PCA Algorithms", Computer and Digital Engineering, no. 10 *


Also Published As

Publication number Publication date
CN110969089B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Tian et al. Deep learning on image denoising: An overview
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
Yang et al. BM3D-Net: A convolutional neural network for transform-domain collaborative filtering
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN112288011B (en) Image matching method based on self-attention deep neural network
WO2022000420A1 (en) Human body action recognition method, human body action recognition system, and device
Zhang et al. Adaptive residual networks for high-quality image restoration
Gai et al. New image denoising algorithm via improved deep convolutional neural network with perceptive loss
Kussul et al. Improved method of handwritten digit recognition tested on MNIST database
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
WO2022217746A1 (en) High-resolution hyperspectral calculation imaging method and system, and medium
CN106548159A (en) Reticulate pattern facial image recognition method and device based on full convolutional neural networks
CN105205453B (en) Human eye detection and localization method based on depth self-encoding encoder
CN111476249B (en) Construction method of multi-scale large-receptive-field convolutional neural network
CN109978077B (en) Visual recognition method, device and system and storage medium
CN111104941B (en) Image direction correction method and device and electronic equipment
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Chaurasiya et al. Deep dilated CNN based image denoising
Xu et al. AutoSegNet: An automated neural network for image segmentation
Liu et al. Iris recognition in visible spectrum based on multi-layer analogous convolution and collaborative representation
Uddin et al. A perceptually inspired new blind image denoising method using $ L_ {1} $ and perceptual loss
CN115995027A (en) Self-supervision scene change detection method, system and computer readable medium
CN110163095B (en) Loop detection method, loop detection device and terminal equipment
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
Rana et al. MSRD-CNN: Multi-scale residual deep CNN for general-purpose image manipulation detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant