CN114373205A - Face detection and recognition method based on convolution width network - Google Patents

Face detection and recognition method based on convolution width network

Info

Publication number
CN114373205A
Authority
CN
China
Prior art keywords
face
network
feature
net
width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111610869.4A
Other languages
Chinese (zh)
Inventor
陈俊龙
郭继凤
冯绮颖
刘竹琳
张通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111610869.4A
Publication of CN114373205A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face detection and recognition method based on a convolution width network, comprising the following steps: S1, acquiring video with a camera and extracting video frames at a fixed frequency; S2, performing face detection on the video frames with the deep network MTCNN; S3, feeding each detected face region into a convolution width face recognition network, which outputs the final general-purpose face features; and S4, comparing the obtained face features with the person features in an existing person library and outputting a face recognition result according to a threshold. The invention combines deep learning with width learning for face detection and recognition, addresses the large parameter count, heavy resource consumption, and long training time of existing methods, and can meet real-time requirements when deployed.

Description

Face detection and recognition method based on convolution width network
Technical Field
The invention belongs to the technical field of face detection and recognition, and particularly relates to a face detection and recognition method based on a convolution width network (a convolutional broad-learning network).
Background
With the development of science and imaging technology, artificial intelligence has entered many aspects of daily life, and face detection and recognition is one of its important application scenarios. Owing to its strong feature learning ability and recognition performance, deep learning is increasingly studied for face detection and recognition. The face detection method based on a cascaded residual network proposed by Pang et al. achieves the highest accuracy in binocular stereo matching. Fast-RCNN also performs well on face detection while shortening the learning time, and the three-layer cascaded network designed by Zhang et al. reaches an accuracy above 92%. Although these methods perform excellently in face detection and recognition, they are all based on deep neural networks and therefore suffer from large parameter counts, heavy resource consumption, and long training times. When deployed on resource-limited devices, they can hardly meet real-time requirements.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a face detection and recognition method based on a convolution width network.
In order to achieve the purpose, the invention adopts the following technical scheme:
A face detection and recognition method based on a convolution width network comprises the following steps:
S1, acquiring video with a camera and extracting video frames at a fixed frequency;
S2, performing face detection on the video frames with the deep network MTCNN;
S3, feeding the detected face regions into a convolution width face recognition network, which outputs the final general-purpose face features;
and S4, comparing the obtained face features with the face features in an existing person library, calculating the difference value between the obtained features and each feature in the library, and outputting a face recognition result according to a set threshold. A schematic code sketch of these four steps follows.
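For orientation, the four steps admit a compact sketch in Python (used for all code examples in this text); detect_faces, cbn_features and match_person are hypothetical stand-ins for the MTCNN detector of step S2, the convolution width network of step S3 and the library comparison of step S4, all detailed below:

    import cv2  # OpenCV, assumed available for camera capture

    cap = cv2.VideoCapture(0)                          # S1: open the camera
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1
        if frame_idx % 30:                             # S1: sample frames at a fixed frequency
            continue
        for x1, y1, x2, y2 in detect_faces(frame):     # S2: MTCNN detection (hypothetical helper)
            feat = cbn_features(frame[y1:y2, x1:x2])   # S3: convolution width features (hypothetical)
            print(match_person(feat))                  # S4: threshold comparison against the library
    cap.release()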
Further, the deep network MTCNN comprises three cascaded sub-networks, namely P-Net, R-Net and Q-Net.
Further, the specific structure of the P-Net is as follows:
The P-Net network takes a 12 × 12 × 3 image as input and applies three 3 × 3 convolutions. The first output branch judges whether the 12 × 12 image contains a face, with an output vector of size 1 × 1 × 2. The second branch outputs the offset of the current face box relative to an ideal face box, with an output vector of size 1 × 1 × 4 representing the relative offset of the top-left abscissa, the relative offset of the top-left ordinate, the error in box width, and the error in box height. The third branch outputs the positions of the 5 facial feature points, corresponding to the left and right eyes, the nose, and the left and right mouth corners; each point requires two coordinates, so the output vector size is 1 × 1 × 10.
Further, the specific structure of R-Net is as follows:
The R-Net network takes a 24 × 24 × 3 image as input and contains 3 convolutions, of sizes 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer. Its output has the same form as that of P-Net and comprises three parts: a 1 × 1 × 2 vector indicating whether a face is present, a 1 × 1 × 4 vector carrying face-box position offset information, and a 1 × 1 × 10 vector giving the 5 facial feature point positions.
Further, the specific structure of Q-Net is as follows:
The Q-Net network takes a 48 × 48 × 3 image as input and contains 4 convolutions, of sizes 3 × 3, 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer, which outputs the coordinate information of the bounding boxes and the feature point information.
Further, in step S2, performing face detection on the video frame using the deep network MTCNN specifically comprises:
transforming the image to different scales and constructing an image pyramid, so as to adapt to faces of different sizes;
using the face classifier in P-Net to judge whether each candidate region contains a face, while bounding-box regression and a facial-landmark locator give a preliminary proposal of face regions; this stage finally outputs many candidate regions that may contain faces and passes them to R-Net for further processing;
in R-Net, the candidates are refined and erroneous inputs are discarded; bounding-box regression and facial-landmark localization are applied again, and credible face regions are finally output;
Q-Net then performs face discrimination, face-region bounding-box regression, and facial feature localization once more, and finally outputs the coordinate information of the face region together with its five feature points.
Further, in step S3, the convolution width face recognition network specifically comprises:
S31, initialize the convolution width network parameters; the model parameters comprise the number n of mapping feature groups, the number k of features per group, the number m of enhancement nodes, and the convolution kernel Kernel corresponding to each feature;
S32, initialize the mapping-feature node groups in width learning with random convolution kernels: using the model input X and the randomly initialized convolution kernels Kernel(k, θ1), compute the feature mapping nodes Zn ≡ [Z1, Z2, …, Zn], where the ith group of mapping features Zi is computed as in formula (1) and each group contains k mapping features;
Zi = X * Kernel(k, θ1), i = 1, 2, …, n (1)
S33, from the mapping nodes Zn, compute the enhancement nodes Hm ≡ [H1, H2, …, Hm] using the randomly initialized convolution kernels Kernel(m, θ2), where the enhancement feature Hj is computed as in formula (2):
Hj ≡ Zn * Kernel(m, θ2), j = 1, 2, …, m (2)
S34, combine the mapping features and the enhancement-node features into the feature layer A = α[Z | H], which is connected to the model output layer Y; the connection weight between the feature layer and the output layer is W. Here α is a vector whose elements sum to 1. The relation between the real output Y and the feature layer A is given by formula (3):
Y=WA (3)
S35, optimize the model parameters, comprising the last-layer connection weights and the feature-layer convolution kernels, with a batch gradient descent algorithm until a stopping condition is reached. Let the loss function of the convolution width network be formula (4):
L(θ1, θ2, W) = (1/N) ∑i (yi − zi)^2, i = 1, 2, …, N (4)
where N represents the number of data, zi is the predicted output corresponding to the ith data, and yi the corresponding real output;
the partial derivative of the loss function expressed by equation (4) is:
∂L/∂θj = −(2/N) ∑i (yi − zi) ∂zi/∂θj (5)
where j = 1, 2 indexes the feature-layer parameter sets θ1 and θ2; similarly, the partial derivative with respect to W is formula (6):
∂L/∂W = −(2/N) ∑i (yi − zi) Ai (6)
at each iteration, the parameters θ1, θ2 and W are updated using formula (7) and formula (8);
θj = θj − α ∂L/∂θj, j = 1, 2 (7)
W = W − α ∂L/∂W (8)
wherein α is the learning rate;
the above steps are repeated until the stopping condition is reached, namely that the value of the loss function does not change significantly over 5 consecutive iterations;
and S36, obtain the face recognition features A from the feature layer of the optimized model.
Further, in step S4, the difference value between the current face and each person's face features in the person library is calculated by formula (9):
di = ∑j (Aj − Aij)^2 (9)
where i indexes the persons in the library, Aij denotes the jth feature of the ith person, and Aj the jth feature of the current face;
based on the difference values, the person corresponding to the minimum difference is found; if this minimum is smaller than the set acceptable threshold, that person's information is output; if it is greater than the threshold, the output is that the person is not in the library.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention makes full use of the features extracted by deep convolution within width learning and strengthens them through the enhancement layer, thereby providing effective face features to the subsequent face recognition module. Moreover, owing to the characteristics of width learning, the framework achieves high recognition accuracy with fewer parameters.
2. The invention combines deep learning with width learning for face detection and recognition, addresses the large parameter count, heavy resource consumption and long training time of existing face recognition models, and can meet real-time requirements in practical applications.
3. Compared with existing deep face recognition methods, the method is fast and effective.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the structure of P-Net in the present invention;
FIG. 3 is a schematic diagram of the structure of R-Net in the present invention;
FIG. 4 is a schematic diagram of the structure of Q-Net in the present invention;
FIG. 5 is a schematic diagram of a convolution width face recognition network structure in the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in fig. 1, the present invention provides a face detection and recognition method based on a convolutional width network, which comprises the following steps:
S1, acquiring video with a camera and extracting video frames at a fixed frequency;
S2, performing face detection on the video frames with the deep network MTCNN, which comprises three cascaded sub-networks, P-Net, R-Net and Q-Net; the specific steps are as follows:
transforming the image to different scales and constructing an image pyramid, so as to adapt to faces of different sizes;
using the face classifier in P-Net to judge whether each candidate region contains a face, while bounding-box regression and a facial-landmark locator give a preliminary proposal of face regions; this stage finally outputs many candidate regions that may contain faces and passes them to R-Net for further processing;
in R-Net, the candidates are refined and most erroneous inputs are discarded; bounding-box regression and facial-landmark localization are applied again, and credible face regions are finally output;
Q-Net receives richer input features, and the end of its network structure is a larger 256-unit fully connected layer that retains more image features; it then performs face judgment, face-region bounding-box regression, and facial feature localization, and finally outputs the top-left and bottom-right coordinates of the face region together with its five feature points.
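As a concrete illustration of running this cascade, a sketch using the third-party facenet-pytorch package follows; the package choice is an assumption made for illustration, and any MTCNN implementation with an image-pyramid front end would serve:

    from PIL import Image
    from facenet_pytorch import MTCNN  # assumed third-party MTCNN implementation

    mtcnn = MTCNN(keep_all=True)       # keep every face detected in the frame
    img = Image.open("frame.jpg")      # a video frame captured in step S1
    # boxes: N x 4 corner coordinates; probs: face confidences; points: N x 5 x 2 landmarks
    boxes, probs, points = mtcnn.detect(img, landmarks=True)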
As shown in fig. 2, the specific structure of P-Net is as follows: the P-Net network takes a 12 × 12 × 3 image as input and applies three 3 × 3 convolutions. The first output branch judges whether the 12 × 12 image contains a face, with an output vector of size 1 × 1 × 2. The second branch outputs the offset of the current face box relative to an ideal face box, with an output vector of size 1 × 1 × 4 representing the relative offset of the top-left abscissa, the relative offset of the top-left ordinate, the error in box width, and the error in box height. The third branch outputs the positions of the 5 facial feature points, corresponding to the left and right eyes, the nose, and the left and right mouth corners; each point requires two coordinates, so the output vector size is 1 × 1 × 10.
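A minimal PyTorch rendering of this structure follows; the channel widths and the pooling layer are assumptions borrowed from the common MTCNN P-Net layout, since the text fixes only the 12 × 12 × 3 input, the three 3 × 3 convolutions and the three output heads:

    import torch
    import torch.nn as nn

    class PNet(nn.Module):
        """Fully convolutional P-Net: 12x12x3 input, three 3x3 convolutions, three heads."""
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 10, 3), nn.PReLU(10), nn.MaxPool2d(2, 2),
                nn.Conv2d(10, 16, 3), nn.PReLU(16),
                nn.Conv2d(16, 32, 3), nn.PReLU(32))
            self.cls = nn.Conv2d(32, 2, 1)         # 1x1x2: face / non-face
            self.box = nn.Conv2d(32, 4, 1)         # 1x1x4: corner offsets and size errors
            self.landmark = nn.Conv2d(32, 10, 1)   # 1x1x10: five (x, y) feature points

        def forward(self, x):
            f = self.backbone(x)
            return self.cls(f), self.box(f), self.landmark(f)

    # On a 12x12 crop each head yields a 1x1 map, matching the output sizes above.
    cls, box, lmk = PNet()(torch.randn(1, 3, 12, 12))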
As shown in FIG. 3, the specific structure of R-Net is as follows: the R-Net network takes a 24 × 24 × 3 image as input and contains 3 convolutions, of sizes 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer. Its output has the same form as that of P-Net and comprises three parts: a 1 × 1 × 2 vector indicating whether a face is present, a 1 × 1 × 4 vector carrying face-box position offset information, and a 1 × 1 × 10 vector giving the 5 facial feature point positions.
As shown in fig. 4, the specific structure of Q-Net is as follows: the Q-Net network takes a 48 × 48 × 3 image as input and contains 4 convolutions, of sizes 3 × 3, 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer, which outputs the coordinate information of the bounding boxes and the feature point information.
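In the same spirit, a hedged PyTorch sketch of Q-Net; the pooling placement and channel widths are assumptions borrowed from the common MTCNN third-stage layout (the text fixes the 48 × 48 × 3 input, the four convolutions, the 256-unit fully connected layer and the outputs):

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Q-Net: 48x48x3 input, four convolutions, 256-d fully connected layer, three heads."""
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3), nn.PReLU(32), nn.MaxPool2d(3, 2, ceil_mode=True),
                nn.Conv2d(32, 64, 3), nn.PReLU(64), nn.MaxPool2d(3, 2, ceil_mode=True),
                nn.Conv2d(64, 64, 3), nn.PReLU(64), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 2), nn.PReLU(128), nn.Flatten(),
                nn.Linear(128 * 3 * 3, 256), nn.PReLU(256))
            self.cls = nn.Linear(256, 2)           # face / non-face
            self.box = nn.Linear(256, 4)           # bounding-box coordinates
            self.landmark = nn.Linear(256, 10)     # five facial feature points

        def forward(self, x):
            f = self.backbone(x)
            return self.cls(f), self.box(f), self.landmark(f)

    cls, box, lmk = QNet()(torch.randn(1, 3, 48, 48))  # shapes (1, 2), (1, 4), (1, 10)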
S3, input the detected face regions into the convolution width face recognition network for training, and output the final general-purpose face features; specifically:
S31, initialize the model parameters, comprising the number n of mapping feature groups, the number k of features per group, the number m of enhancement nodes, and the convolution kernel Kernel corresponding to each feature;
S32, initialize the mapping-feature node groups in width learning with random convolution kernels: using the model input X and the randomly initialized convolution kernels Kernel(k, θ1), compute the feature mapping nodes Zn ≡ [Z1, Z2, …, Zn], where the ith group of mapping features Zi is computed as in formula (1) and each group contains k mapping features;
Zi = X * Kernel(k, θ1), i = 1, 2, …, n (1)
S33, from the mapping nodes Zn, compute the enhancement nodes Hm ≡ [H1, H2, …, Hm] using the randomly initialized convolution kernels Kernel(m, θ2), where the enhancement feature Hj is computed as in formula (2):
Hj ≡ Zn * Kernel(m, θ2), j = 1, 2, …, m (2)
S34, combine the mapping features and the enhancement-node features into the feature layer A = α[Z | H], which is connected to the model output layer Y; the connection weight between the feature layer and the output layer is W. Here α is a vector whose elements sum to 1. The relation between the real output Y and the feature layer A is given by formula (3):
Y=WA (3)
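A tensor-level sketch of steps S32-S34 follows; the shapes, the flattening of feature maps into row vectors and the uniform choice of α are illustrative assumptions, since the text fixes only the grouped random convolutions and the linear read-out of formula (3):

    import torch
    import torch.nn.functional as F

    n, k, m = 4, 8, 16                 # mapping groups, features per group, enhancement nodes
    X = torch.randn(32, 3, 24, 24)     # a batch of detected face crops

    theta1 = torch.randn(n * k, 3, 3, 3) * 0.1     # Kernel(k, θ1): random mapping kernels
    theta2 = torch.randn(m, n * k, 3, 3) * 0.1     # Kernel(m, θ2): random enhancement kernels

    Z = F.conv2d(X, theta1, padding=1)             # formula (1): mapping nodes Zn = [Z1, ..., Zn]
    H = F.conv2d(Z, theta2, padding=1)             # formula (2): enhancement nodes Hm = [H1, ..., Hm]
    A = torch.cat([Z.flatten(1), H.flatten(1)], 1) # feature layer A = α[Z | H]
    A = A / A.size(1)                              # a uniform α whose entries sum to 1

    W = torch.randn(10, A.size(1)) * 0.01          # feature-to-output connection weights
    Y = A @ W.t()                                  # formula (3): Y = WA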
S35, optimize the model parameters, comprising the last-layer connection weights and the feature-layer convolution kernels, with a batch gradient descent algorithm until a stopping condition is reached. Let the loss function of the convolution width network be formula (4):
L(θ1, θ2, W) = (1/N) ∑i (yi − zi)^2, i = 1, 2, …, N (4)
where N represents the number of data, zi is the predicted output corresponding to the ith data, and yi the corresponding real output;
the partial derivative of the loss function expressed by equation (4) is:
∂L/∂θj = −(2/N) ∑i (yi − zi) ∂zi/∂θj (5)
where j = 1, 2 indexes the feature-layer parameter sets θ1 and θ2; similarly, the partial derivative with respect to W is formula (6):
∂L/∂W = −(2/N) ∑i (yi − zi) Ai (6)
at each iteration, the parameters θ1, θ2 and W are updated using formula (7) and formula (8);
θj = θj − α ∂L/∂θj, j = 1, 2 (7)
W = W − α ∂L/∂W (8)
wherein α is the learning rate;
the above steps are repeated until the stopping condition is reached, namely that the value of the loss function does not change significantly over 5 consecutive iterations;
and S36, obtain the face recognition features A from the feature layer of the optimized model.
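A runnable sketch of the optimization in S35 follows; it uses autograd in place of the hand-derived gradients (5)-(6), the squared-error loss assumed in the reconstruction of formula (4), and folds the α scaling into W:

    import torch
    import torch.nn.functional as F

    def train_cbn(X, Y_true, n=4, k=8, m=16, lr=1e-3, iters=200, tol=1e-5):
        """Batch gradient descent over θ1, θ2 and W, following formulas (4)-(8)."""
        theta1 = (torch.randn(n * k, X.size(1), 3, 3) * 0.1).requires_grad_()
        theta2 = (torch.randn(m, n * k, 3, 3) * 0.1).requires_grad_()
        W, history = None, []
        for _ in range(iters):
            Z = F.conv2d(X, theta1, padding=1)             # formula (1)
            H = F.conv2d(Z, theta2, padding=1)             # formula (2)
            A = torch.cat([Z.flatten(1), H.flatten(1)], 1)
            if W is None:
                W = (torch.randn(Y_true.size(1), A.size(1)) * 0.01).requires_grad_()
            loss = ((Y_true - A @ W.t()) ** 2).mean()      # formula (4): squared-error loss
            loss.backward()                                # formulas (5)-(6) via autograd
            with torch.no_grad():
                for p in (theta1, theta2, W):              # formulas (7)-(8): gradient steps
                    p -= lr * p.grad
                    p.grad = None
            history.append(loss.item())
            # stopping condition: loss roughly unchanged over 5 consecutive iterations
            if len(history) > 5 and abs(history[-6] - history[-1]) < tol:
                break
        return theta1, theta2, W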
Fig. 5 is a schematic diagram of a convolutional width face recognition network structure.
S4, compare the obtained face features with the face features in the existing person library, calculate the difference value between the obtained features and each face feature in the library, and output the face recognition result according to the set threshold; the difference value between the current face and each face feature in the person library is calculated by formula (9):
di = ∑j (Aj − Aij)^2 (9)
where i indexes the persons in the library, Aij denotes the jth feature of the ith person, and Aj the jth feature of the current face;
based on the difference values, the person corresponding to the minimum difference is found; if this minimum is smaller than the set acceptable threshold, that person's information is output; if it is greater than the threshold, the output is that the person is not in the library.
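A small sketch of this decision rule; the library layout and the threshold value are assumptions:

    def identify(feat, library, threshold=0.6):
        """Return the enrolled person minimizing the difference value of formula (9), or None."""
        best_person, best_d = None, float("inf")
        for person, ref in library.items():                    # ref holds features Ai1, ..., Aij
            d = sum((a - b) ** 2 for a, b in zip(feat, ref))   # formula (9)
            if d < best_d:
                best_person, best_d = person, d
        return best_person if best_d < threshold else None     # above threshold: not in library

    print(identify([0.1, 0.9], {"alice": [0.1, 0.8], "bob": [0.9, 0.2]}))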
In this example, the model was tested on the public CASIA-WebFace data set, which contains 494,414 images of 10,575 individuals. The same data split was used when comparing with the deep learning method FaceNet; the results are shown in Table 1 below. The experimental results show that the training accuracy of the convolution-width-based face detection and recognition model is 2.09% higher than that of the FaceNet model and the test accuracy is 1.21% higher, while the training time and parameter count are greatly reduced and the time needed to process the same number of pictures is shorter, which matters for achieving real-time operation.
Table 1: comparison with the FaceNet model on CASIA-WebFace (the table is reproduced as an image in the original publication and its figures are not recoverable here).
The invention combines deep and width network models and provides a face detection and recognition system based on a convolution width network. A model built on this framework is efficient and effective, and readily meets the real-time requirements of deployed face recognition.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A face detection and recognition method based on a convolution width network, characterized by comprising the following steps:
S1, acquiring video with a camera and extracting video frames at a fixed frequency;
S2, performing face detection on the video frames with the deep network MTCNN;
S3, feeding the detected face regions into a convolution width face recognition network, which outputs the final general-purpose face features;
and S4, comparing the obtained face features with the face features in an existing person library, calculating the difference value between the obtained features and each feature in the library, and outputting a face recognition result according to a set threshold.
2. The face detection and recognition method based on the convolution width network as claimed in claim 1, wherein the deep network MTCNN comprises three cascaded sub-networks: P-Net, R-Net and Q-Net.
3. The face detection and recognition method based on the convolutional width network as claimed in claim 2, wherein the specific structure of P-Net is as follows:
the P-Net network takes a 12 × 12 × 3 image as input and applies three 3 × 3 convolutions; the first output branch judges whether the 12 × 12 image contains a face, with an output vector of size 1 × 1 × 2; the second branch outputs the offset of the current face box relative to an ideal face box, with an output vector of size 1 × 1 × 4 representing the relative offset of the top-left abscissa, the relative offset of the top-left ordinate, the error in box width, and the error in box height; the third branch outputs the positions of the 5 facial feature points, corresponding to the left and right eyes, the nose, and the left and right mouth corners; each point requires two coordinates, so the output vector size is 1 × 1 × 10.
4. The face detection and recognition method based on the convolutional width network as claimed in claim 2, wherein R-Net specifically is:
the R-Net network takes a 24 × 24 × 3 image as input and contains 3 convolutions, of sizes 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer; its output has the same form as that of P-Net and comprises three parts: a 1 × 1 × 2 vector indicating whether a face is present, a 1 × 1 × 4 vector carrying face-box position offset information, and a 1 × 1 × 10 vector giving the 5 facial feature point positions.
5. The face detection and recognition method based on the convolutional width network as claimed in claim 2, wherein the specific structure of Q-Net is as follows:
the Q-Net network takes a 48 × 48 × 3 image as input and contains 4 convolutions, of sizes 3 × 3, 3 × 3, 3 × 3 and 2 × 2, followed by a fully connected layer, which outputs the coordinate information of the bounding boxes and the feature point information.
6. The face detection and recognition method based on the convolutional width network as claimed in claim 2, wherein in step S2, the performing face detection on the video frame using the deep network MTCNN specifically comprises:
transforming the image to different scales and constructing an image pyramid, so as to adapt to faces of different sizes;
using the face classifier in P-Net to judge whether each candidate region contains a face, while bounding-box regression and a facial-landmark locator give a preliminary proposal of face regions; this stage finally outputs many candidate regions that may contain faces and passes them to R-Net for further processing;
in R-Net, the candidates are refined and erroneous inputs are discarded; bounding-box regression and facial-landmark localization are applied again, and credible face regions are finally output;
and Q-Net then performs face discrimination, face-region bounding-box regression, and facial feature localization once more, and finally outputs the coordinate information of the face region together with its five feature points.
7. The face detection and recognition method based on the convolution width network as claimed in claim 1, wherein in step S3 the convolution width face recognition network specifically comprises:
S31, initialize the convolution width network parameters; the model parameters comprise the number n of mapping feature groups, the number k of features per group, the number m of enhancement nodes, and the convolution kernel Kernel corresponding to each feature;
S32, initialize the mapping-feature node groups in width learning with random convolution kernels: using the model input X and the randomly initialized convolution kernels Kernel(k, θ1), compute the feature mapping nodes Zn ≡ [Z1, Z2, …, Zn], where the ith group of mapping features Zi is computed as in formula (1) and each group contains k mapping features;
Zi = X * Kernel(k, θ1), i = 1, 2, …, n (1)
S33, from the mapping nodes Zn, compute the enhancement nodes Hm ≡ [H1, H2, …, Hm] using the randomly initialized convolution kernels Kernel(m, θ2), where the enhancement feature Hj is computed as in formula (2):
Hj ≡ Zn * Kernel(m, θ2), j = 1, 2, …, m (2)
S34, combine the mapping features and the enhancement-node features into the feature layer A = α[Z | H], which is connected to the model output layer Y; the connection weight between the feature layer and the output layer is W. Here α is a vector whose elements sum to 1. The relation between the real output Y and the feature layer A is given by formula (3):
Y=WA (3)
S35, optimize the model parameters, comprising the last-layer connection weights and the feature-layer convolution kernels, with a batch gradient descent algorithm until a stopping condition is reached. Let the loss function of the convolution width network be formula (4):
L(θ1, θ2, W) = (1/N) ∑i (yi − zi)^2, i = 1, 2, …, N (4)
where N represents the number of data, zi is the predicted output corresponding to the ith data, and yi the corresponding real output;
the partial derivative of the loss function expressed by equation (4) is:
∂L/∂θj = −(2/N) ∑i (yi − zi) ∂zi/∂θj (5)
where j = 1, 2 indexes the feature-layer parameter sets θ1 and θ2; similarly, the partial derivative with respect to W is formula (6):
∂L/∂W = −(2/N) ∑i (yi − zi) Ai (6)
at each iteration, the parameters θ1, θ2 and W are updated using formula (7) and formula (8);
θj = θj − α ∂L/∂θj, j = 1, 2 (7)
W = W − α ∂L/∂W (8)
wherein α is the learning rate;
the above steps are repeated until the stopping condition is reached, namely that the value of the loss function does not change significantly over 5 consecutive iterations;
and S36, obtain the face recognition features A from the feature layer of the optimized model.
8. The face detection and recognition method based on the convolution width network as claimed in claim 1, wherein in step S4 the difference value between the current face and each face feature in the person library is calculated by formula (9):
di = ∑j (Aj − Aij)^2 (9)
where i indexes the persons in the library, Aij denotes the jth feature of the ith person, and Aj the jth feature of the current face;
based on the difference values, the person corresponding to the minimum difference is found; if this minimum is smaller than the set acceptable threshold, that person's information is output; if it is greater than the threshold, the output is that the person is not in the library.
CN202111610869.4A 2021-12-27 2021-12-27 Face detection and recognition method based on convolution width network Pending CN114373205A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111610869.4A CN114373205A (en) 2021-12-27 2021-12-27 Face detection and recognition method based on convolution width network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111610869.4A CN114373205A (en) 2021-12-27 2021-12-27 Face detection and recognition method based on convolution width network

Publications (1)

Publication Number Publication Date
CN114373205A true CN114373205A (en) 2022-04-19

Family

ID=81143021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111610869.4A Pending CN114373205A (en) 2021-12-27 2021-12-27 Face detection and recognition method based on convolution width network

Country Status (1)

Country Link
CN (1) CN114373205A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238018A (en) * 2023-09-20 2023-12-15 华南理工大学 Multi-granularity-based incremental deep and wide network living body detection method, medium and equipment

Similar Documents

Publication Publication Date Title
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN112766160B (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN106529447B (en) Method for identifying face of thumbnail
JP7386545B2 (en) Method for identifying objects in images and mobile device for implementing the method
CN111783748B (en) Face recognition method and device, electronic equipment and storage medium
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN110728209A (en) Gesture recognition method and device, electronic equipment and storage medium
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110263768A (en) A kind of face identification method based on depth residual error network
CN110163111A (en) Method, apparatus of calling out the numbers, electronic equipment and storage medium based on recognition of face
CN110069992B (en) Face image synthesis method and device, electronic equipment and storage medium
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN109241814A (en) Pedestrian detection method based on YOLO neural network
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
CN109447175A (en) In conjunction with the pedestrian of deep learning and metric learning recognition methods again
Basu et al. Indoor home scene recognition using capsule neural networks
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN114220143A (en) Face recognition method for wearing mask
CN112906520A (en) Gesture coding-based action recognition method and device
CN111797705A (en) Action recognition method based on character relation modeling
CN114373205A (en) Face detection and recognition method based on convolution width network
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN109948422A (en) A kind of indoor environment adjusting method, device, readable storage medium storing program for executing and terminal device
CN114170686A (en) Elbow bending behavior detection method based on human body key points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination