CN109934115B - Face recognition model construction method, face recognition method and electronic equipment

Publication number: CN109934115B (granted); earlier publication: CN109934115A (in Chinese)
Application number: CN201910120174.4A
Authority: CN (China)
Legal status: Active
Inventors: 蔡啸, 肖潇, 晋兆龙
Assignees: Suzhou Keda Special Video Co ltd; Suzhou Keyuan Software Technology Development Co ltd; Suzhou Keda Technology Co Ltd

Abstract

The invention relates to the technical field of image processing, and in particular to a method for constructing a face recognition model, a face recognition method and electronic equipment. The construction method comprises: obtaining a sample image with annotation information; inputting the sample image into a detection network to obtain the position of a face candidate frame; based on the position of the face candidate frame, intercepting the image of the corresponding position region to obtain a candidate frame screenshot; fixing the parameters of the detection network, inputting the candidate frame screenshot into a face recognition network, and training the face recognition network; and releasing the parameters of the detection network, and training the detection network and the face recognition network after parameter adjustment, so as to construct the face recognition model. The candidate face screenshot process couples the detection network and the face recognition network, so the recognition result of the face recognition network can be transmitted to the detection network through back propagation; this improves the detection precision of the candidate frames and improves the accuracy of face recognition while avoiding the accumulation and transmission of errors.

Description

Face recognition model construction method, face recognition method and electronic equipment
Technical Field
The invention relates to the technical field of image processing, in particular to a construction method of a face recognition model, a face recognition method and electronic equipment.
Background
The process of face recognition can be roughly divided into three steps: first, determining "where" the face is; second, locating the key points of the facial contour, such as the eyebrows, ears and nose; and finally, determining "who" this is through face recognition based on big data. Compared with the recognition of a single image, face recognition for an image sequence adds a time dimension. An image sequence is a series of consecutive images obtained by continuously capturing a scene or a person. In short, the task is to detect when and where the information that best helps to determine identity appears (i.e., to locate the best face in three dimensions) and to recognize it.
Face detection is the basis of recognition: accurately locating the face in the picture is a prerequisite for all subsequent recognition. The main technical route of early face detection algorithms was to manually design various face feature templates and to perform image matching based on basic image processing methods and expert experience. The cascade detection structures that appeared at the beginning of the 21st century greatly improved the accuracy and real-time performance of detection and greatly broadened the application scenarios of face detection algorithms.
Since the deep learning methods represented by convolutional neural networks achieved their breakthrough in the 2012 ImageNet competition, face recognition technology has reached nearly one-hundred-percent recognition rates on some public databases, and face detection methods based on deep learning have become mainstream. However, various errors inevitably exist in face detection, including missed detections, false alarms and position deviations. Since deep-learning-based face recognition operates in a forward-propagation manner, these errors are passed on and affect pose analysis, expression extraction and other subsequent tasks, thereby reducing the accuracy of the recognition task.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for constructing a face recognition model, a face recognition method, and an electronic device, so as to solve the problem of low face recognition accuracy.
According to a first aspect, an embodiment of the present invention provides a method for constructing a face recognition model, including:
acquiring a sample image with annotation information; the annotation information is used for representing the position of the real human face in the sample image;
inputting the sample image into a detection network to obtain the position of a face candidate frame; the detection network is obtained by utilizing a first convolution network for training;
based on the position of the face candidate frame, intercepting the image of the region corresponding to the face candidate frame to obtain a candidate frame screenshot; the candidate frame screenshot is obtained by passing the position of the face candidate frame through a candidate face screenshot module, and the candidate face screenshot module has a back-propagation property;
fixing parameters of the detection network, inputting the screenshot of the candidate frame into a face recognition network, and training the face recognition network to adjust the parameters of the face recognition network; the face recognition network is constructed by utilizing a second convolutional network;
and releasing the parameters of the detection network, and training the detection network and the face recognition network after the parameters are adjusted to construct a face recognition model.
According to the method for constructing a face recognition model provided by the embodiment of the invention, since the candidate face screenshot module has the back-propagation property, the candidate frame screenshot can be obtained from the face candidate frame and, conversely, the face candidate frame can also be recovered from the candidate frame screenshot. The candidate face screenshot process couples the detection network and the face recognition network, so the recognition result of the face recognition network can be transmitted to the detection network through back propagation; this improves the detection precision of the candidate frames and improves the accuracy of face recognition while avoiding the accumulation and transmission of errors.
With reference to the first aspect, in a first implementation manner of the first aspect, the releasing parameters of the detection network, and performing joint training on the detection network and the face recognition network after adjusting the parameters to construct a face recognition model includes:
fixing parameters of the face recognition network, and training the detection network to adjust the parameters of the detection network;
and releasing parameters of the face recognition network, and performing joint training on the detection network and the face recognition network after the parameters are adjusted so as to optimize the parameters of each network in the face recognition model.
According to the construction method of the face recognition model provided by the embodiment of the invention, the detection network is trained by using the recognition result of the face recognition network, so that the detection precision of the detection network is improved; and then the detection network and the face recognition network are jointly trained, the face recognition network helps to improve the precision of the detection network, the face recognition network provides a basis for the detection network, and the detection and the recognition are mutually coupled, so that the accuracy of the face recognition is improved.
With reference to the first aspect, in a second implementation manner of the first aspect, the output of the face recognition network is a multitask result of face recognition, where the multitask result includes coordinates of feature points of a human face and a classification of attributes of the human face; the loss function of the face recognition network is a function obtained by adding the loss functions of all the face recognition tasks.
The construction method of the face recognition model provided by the embodiment of the invention is based on the multitask correlation of face recognition, adopts a mode of weighting and summing a plurality of loss functions to realize a face recognition network, improves the integral capturing capability of deep features and improves the accuracy of recognition tasks.
With reference to the first aspect, the first implementation manner of the first aspect, or the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the inputting the sample image into a detection network to obtain the position of a face candidate frame includes:

inputting the sample image into the detection network to obtain the binary classification confidences and the position offsets of a plurality of anchor boxes; wherein a position offset is the offset of an anchor box relative to the position of the real face;

judging whether the binary classification confidence of each anchor box is smaller than a preset value;

when the binary classification confidence of an anchor box is smaller than the preset value, discarding that anchor box;

and performing non-maximum suppression on the remaining anchor boxes to obtain the positions of the face candidate frames.
According to a second aspect, an embodiment of the present invention further provides a face recognition method, including:
acquiring an image sequence to be identified;
inputting the image sequence to be recognized into the face recognition model constructed by the method for constructing a face recognition model according to the first aspect or any one of its implementation manners, to obtain a face recognition result; the face recognition result comprises the coordinates of the face feature points;
calculating a numerical value of face quality evaluation based on the coordinates of the face feature points;
outputting an optimal face image according to the value of the face quality evaluation; the optimal face image is a face image that is centered in position, of proper size and in a frontal pose.
According to the face recognition method provided by the embodiment of the invention, the optimal face is the candidate face whose projection onto the matrix subspace of the two-dimensional image carries the most energy; the energy projected outside that subspace can be understood as three-dimensional information of the face, such as the pitch angle and the yaw angle, which cannot be described by a two-dimensional plane. Therefore, calculating the face quality evaluation value from the coordinates of the face feature points has high reliability.
With reference to the second aspect, in a first embodiment of the second aspect, the calculating a value of the face quality evaluation based on the coordinates of the face feature points includes:
extracting coordinates of standard face characteristic points;
calculating a constant matrix based on the coordinates of the standard face feature points;
and calculating the numerical value of the human face quality evaluation by using the coordinates of the human face characteristic points and the constant matrix.
With reference to the first embodiment of the second aspect, in the second embodiment of the second aspect, the following formula is used to calculate the value of the face evaluation:
Y = |P × B|²;

wherein P is the matrix of face feature point coordinates, expressed as

$$P=\begin{pmatrix}x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N\end{pmatrix}$$

N is the number of face feature points; B is the constant matrix, whose size is N × N.
With reference to the second aspect, the first embodiment of the second aspect, or the second embodiment of the second aspect, in a third embodiment of the second aspect, the outputting an optimal face image according to the value of the face quality evaluation includes:
respectively sequencing the numerical values of the face quality evaluation corresponding to the same face;
and outputting the optimal face image of the same face based on the sequencing result.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the method for constructing a face recognition model according to the first aspect or any one of the embodiments of the first aspect, or to perform the method for face recognition according to the second aspect or any one of the embodiments of the second aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the method for constructing a face recognition model according to the first aspect or any one of the embodiments of the first aspect, or execute the method for face recognition according to the second aspect or any one of the embodiments of the second aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method of constructing a face recognition model according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a convolution structure of a detection network according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of constructing a face recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a detection network according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method of constructing a face recognition model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a face detection model according to an embodiment of the present invention;
FIG. 7 is a flow chart of a face recognition method according to an embodiment of the invention;
FIG. 8 is a flow chart of a face recognition method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a face recognition model according to an embodiment of the present invention;
FIG. 10 is a block diagram of an apparatus for constructing a face recognition model according to an embodiment of the present invention;

FIG. 11 is a block diagram of a face recognition apparatus according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for constructing a face recognition model, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that described herein.
The embodiment of the invention provides a method for constructing a face recognition model, and a face recognition method based on the face recognition model constructed by that method. The face recognition model comprises a detection network and a face recognition network, wherein the detection network is used for marking the position of a face candidate frame in an input image. A candidate frame screenshot is obtained by intercepting the image region corresponding to the face candidate frame from the image; in the following description this process is expressed as a candidate face screenshot module, i.e., the position information of the face candidate frame is input into the candidate face screenshot module to obtain the corresponding candidate frame screenshot. The face recognition network is used for performing multi-task recognition on the input candidate frame screenshot, i.e., recognizing face feature points, face attributes and the like in the candidate frame screenshot; the specific outputs of the face recognition network can be set according to the actual situation and are not limited here.
The face recognition method is a method for recognizing the optimal face from an input image sequence, wherein the optimal face is a face image which is centered in position, proper in size and in front posture in the whole image sequence. Specifically, an image sequence to be recognized is input into a face recognition model, and then the output result of the face recognition network is evaluated by combining a face quality evaluation function, so that the optimal face is obtained.
Hereinafter, a construction method of the face recognition model and a face recognition method will be described in detail.
In this embodiment, a method for constructing a face recognition model is provided, which can be used in electronic devices such as a computer, a mobile phone or a tablet computer. Fig. 1 is a flowchart of a method for constructing a face recognition model according to an embodiment of the present invention; as shown in fig. 1, the flow includes the following steps:
and S11, acquiring the sample image with the annotation information.
And the annotation information is used for representing the position of the real human face in the sample image. Before inputting the sample image into the detection network, the user needs to mark the position of the real face on the sample image, which is convenient for verifying the position of the face detected by the detection network subsequently.
Specifically, during labeling, an image feature tag selection tool (e.g., LabelImg) can be used to manually label the picture to be queried, and the annotation information is (X_p, Y_p, L_p, W_p), where (X_p, Y_p) is the coordinate of the top-left corner of the feature and (L_p, W_p) is the length and width, in pixels, occupied by the feature; other manual labeling approaches may also be used.
And S12, inputting the sample image into the detection network to obtain the position of the face candidate frame.
The detection network is obtained by training through a first convolutional network. Specifically, the detection network comprises a convolution structure and a full connection layer, wherein the convolution structure is used for carrying out feature extraction on an input sample image and constructing a feature map; and the full connection layer is used for carrying out deep feature extraction on the feature map, and subsequently training parameters in the detection network by combining the set loss function.
As shown in fig. 2, the convolution structure can take the first convolutional layer (conv1) through the fifth pooling layer (pool5) of the AlexNet convolutional network. Optionally, additional convolution-pooling structures and the like may be added after pool5.
It should be noted that the detection network may be trained in advance, or may be trained when a face recognition model needs to be constructed, and no limitation is imposed herein, and it is only required to ensure that the detection network can mark a face candidate box on an input sample image.
After the electronic equipment acquires the sample image with the annotation information, the sample image is input into a detection network, and then the position of the candidate face frame can be obtained.
And S13, based on the position of the face candidate frame, intercepting the image of the position area corresponding to the face candidate frame to obtain the screenshot of the candidate frame.
The candidate frame screenshot is obtained by a candidate face screenshot module when the position of the face candidate frame passes through, and the candidate face screenshot module has a back propagation characteristic.
After obtaining the position of the face candidate frame, the electronic device intercepts the image corresponding to the position region of the face candidate frame based on that position. As described above, this process is expressed as a candidate face screenshot module; specifically, the candidate face screenshot module can be understood as a program segment implementing the image interception, whose input is the position of the face candidate frame and whose output is the candidate face image. The electronic device can use the candidate face screenshot module to intercept the image corresponding to the face candidate frame region, so as to obtain the candidate frame screenshot. The specific screenshot principle can be affine mapping, or other interception methods can be used; it is only necessary to ensure that the candidate face screenshot module intercepts the image corresponding to the face candidate frame region and has the back-propagation property. The candidate face screenshot module is described in detail later.
And S14, fixing parameters of the detection network, inputting the screenshot of the candidate frame into the face recognition network, and training the face recognition network to adjust the parameters of the face recognition network.
Wherein the face recognition network is constructed using a second convolutional network.
Before the electronic device trains the face recognition model, the acquired detection network, the face screenshot module and the face recognition network may be cascaded in sequence to form an initial face recognition model. The face recognition network is a recognition network of a convolution structure, the input is a face candidate frame screenshot (i.e., the candidate frame screenshot obtained in S13), and the output is a result of face recognition. The result of the face recognition may be a multi-task recognition result, that is, the result includes a face feature point positioning task and a plurality of face attribute classification tasks. The face attribute classification includes gender, race, age, and the like. The specific multitask identification can be specifically set according to actual conditions, and is not limited herein.
After the networks are cascaded, the electronic device initializes the parameters of the face recognition network; these initial parameters may be set manually. The electronic device then fixes the parameters of the detection network and trains the face recognition network separately to adjust the parameters of the face recognition network.
Specifically, by setting the number of iterations (i.e., the number of times S11 to S14 are performed), the parameters of the face recognition network are continuously adjusted until the number of iterations reaches a preset value. Alternatively, a threshold for the loss function may be set, and the iteration ends when the loss function of the face recognition network satisfies the threshold condition.
In summary, the electronic device first fixes the parameters of the detection network and trains the face recognition network alone in a forward-propagation manner, as sketched below.
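By way of illustration only, the staged training of S14 might look like the following PyTorch sketch, assuming detection_net, recognition_net and crop_module are torch.nn.Module instances and loader yields (images, labels) pairs; all names and hyper-parameters are assumptions for illustration, not part of the patent.

```python
import torch

def train_recognition_stage(detection_net, crop_module, recognition_net,
                            loader, loss_fn, num_iters=10000, lr=1e-3):
    # Fix (freeze) the detection network: its parameters receive no gradients.
    for p in detection_net.parameters():
        p.requires_grad = False

    optimizer = torch.optim.SGD(recognition_net.parameters(), lr=lr)
    it = 0
    while it < num_iters:                       # iterate until the preset count
        for images, labels in loader:
            boxes = detection_net(images)       # positions of face candidate frames
            crops = crop_module(images, boxes)  # candidate frame screenshots
            loss = loss_fn(recognition_net(crops), labels)
            optimizer.zero_grad()
            loss.backward()                     # gradients stop at the frozen detector
            optimizer.step()
            it += 1
            if it >= num_iters:
                break
```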
And S15, releasing the parameters of the detection network, and training the detection network and the face recognition network after the parameters are adjusted to construct a face recognition model.
After the training of the face recognition network is completed, the electronic device releases the parameters of the detection network, i.e., the detection network is trained together with the face recognition network whose parameters were adjusted in S14. The detection network and the face recognition network can be trained together; alternatively, the parameters of the face recognition network can be fixed and the detection network trained alone, after which the parameters of the face recognition network are released and the two networks are jointly trained. This step is described in detail later.
After the electronic equipment trains the detection network and the face recognition network with the adjusted parameters, a face recognition model can be constructed.
In the method for constructing a face recognition model provided by this embodiment, since the candidate face screenshot module has the back-propagation property (that is, the screenshot expression represented by the candidate face screenshot module is differentiable), a candidate frame screenshot can be obtained from the face candidate frame and, conversely, a face candidate frame can also be recovered from the candidate frame screenshot. The candidate face screenshot process couples the detection network and the face recognition network, so the recognition result of the face recognition network can be transmitted to the detection network through back propagation; this improves the detection precision of the candidate frames and improves the accuracy of face recognition while avoiding the accumulation and transmission of errors.
This embodiment further provides a method for constructing a face recognition model, which can be used in electronic devices such as a computer, a mobile phone or a tablet computer. Fig. 3 is a flowchart of a method for constructing a face recognition model according to an embodiment of the present invention; as shown in fig. 3, the flow includes the following steps:
and S21, acquiring the sample image with the annotation information.
And the annotation information is used for representing the position of the real human face in the sample image. Please refer to S11 in fig. 1, which is not described herein again.
And S22, inputting the sample image into the detection network to obtain the position of the face candidate frame.
The detection network is obtained by training through a first convolutional network. Please refer to S12 in fig. 1, which is not described herein again.
And S23, based on the position of the face candidate frame, intercepting the image of the position area corresponding to the face candidate frame to obtain the screenshot of the candidate frame.
The candidate frame screenshot is obtained by the candidate face screenshot module when the position of the face candidate frame passes through the candidate face screenshot module, and the candidate face screenshot module has a back propagation characteristic.
As shown above, the electronic device expresses the implementation process of this step as a candidate face screenshot module, and specifically, the candidate face screenshot module performs coordinate operation according to the position of the face candidate frame to obtain a candidate frame screenshot.
Optionally, the candidate face capture module is constructed as follows:
(1) Two discrete lattices U and V are constructed, where V represents the coordinates of each pixel point on the output image (i.e., the candidate frame screenshot), and U represents the coordinates, on the original image (i.e., the sample image), of the pixel point corresponding to each point of V. All coordinate points on U and V are normalized to [-1, 1], and the number of coordinate points is determined by the width and height parameters of the output image.
(2) The coordinates on lattice U are placed in one-to-one correspondence with those on lattice V, and the correspondence is expressed by an affine transformation:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ 1 \end{pmatrix}$$

The values of the six parameters θ_{11}, ..., θ_{23} can be calculated from the position of the face candidate frame.
(3) The lattice U is calculated, and the candidate face screenshot is obtained through bilinear interpolation, as sketched below.
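The three steps above coincide with the affine grid sampling used in spatial transformer networks, which PyTorch exposes directly; the sketch below is one possible realization, where the (cx, cy, w, h) box format normalized to [-1, 1] and the output size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def crop_candidates(image, boxes, out_h=128, out_w=128):
    """Differentiably crop face candidate boxes from `image`.

    image: (1, C, H, W) tensor; boxes: (K, 4) tensor of (cx, cy, w, h)
    normalized to [-1, 1] (the box format is an assumption for illustration).
    """
    K = boxes.shape[0]
    cx, cy, w, h = boxes.unbind(dim=1)
    zeros = torch.zeros_like(cx)
    # Six affine parameters per box: scale by the box size and translate to
    # the box center (theta maps the output lattice V onto the input lattice U).
    theta = torch.stack([
        torch.stack([w / 2, zeros, cx], dim=1),
        torch.stack([zeros, h / 2, cy], dim=1),
    ], dim=1)                                    # (K, 2, 3)
    grid = F.affine_grid(theta, size=(K, image.shape[1], out_h, out_w),
                         align_corners=False)    # lattice U for each box
    crops = F.grid_sample(image.expand(K, -1, -1, -1), grid,
                          mode='bilinear', align_corners=False)
    return crops                                 # gradients flow back to `boxes`
```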
And S24, fixing parameters of the detection network, inputting the screenshot of the candidate frame into the face recognition network, and training the face recognition network to adjust the parameters of the face recognition network.
Wherein the face recognition network is constructed using a second convolutional network. Referring to fig. 4, the loss function of the face recognition network is a multi-task joint loss function, that is, a weighted sum of the loss functions of the individual face recognition tasks. Specifically, the face recognition network is constructed as follows:
(1) A suitable convolutional network is selected as the main convolution structure of the face recognition network, and fully connected layers are used to extract depth features;

(2) the loss functions of the multiple recognition tasks are connected to form a multi-task recognition network.
The multiple face recognition tasks include, but are not limited to, a face feature point positioning task and several face attribute classification tasks. The loss function of the feature point positioning task can be the Euclidean distance loss (Euclidean Loss), and the loss function of each face attribute classification task can be the softmax loss (SoftmaxLoss). The corresponding loss weight coefficients are adjusted flexibly according to factors such as importance, accuracy and relevance; training multiple loss functions together enriches the angles from which depth features are captured and improves the recognition precision of each task. In other words, based on the correlation among face recognition tasks, the face recognition network is realized by a weighted summation of multiple loss functions, which improves the overall capturing capability of deep features and the accuracy of the recognition tasks.
Furthermore, the face recognition network can output a multi-task result. By sharing features among the multiple face recognition tasks, the required memory and computation are reduced, the multi-angle advantages of the tasks are exploited, and the feature-capture capability of the network is improved as the tasks promote one another. A sketch of such a joint loss follows.
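A minimal sketch of a weighted multi-task joint loss, assuming one landmark-regression head and two attribute-classification heads (gender and age) with dict-based outputs; the head names and weights are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn.functional as F

# Illustrative loss weights, tuned per task importance, accuracy and relevance.
WEIGHTS = {'landmarks': 1.0, 'gender': 0.5, 'age': 0.5}

def multitask_loss(outputs, targets):
    """outputs/targets are dicts keyed by task name (an assumed convention).

    Landmark localization uses a Euclidean (L2) loss; each attribute
    classification task uses a softmax cross-entropy loss.
    """
    loss = WEIGHTS['landmarks'] * F.mse_loss(outputs['landmarks'],
                                             targets['landmarks'])
    for task in ('gender', 'age'):
        loss = loss + WEIGHTS[task] * F.cross_entropy(outputs[task],
                                                      targets[task])
    return loss
```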
For the rest, please refer to S24 in the embodiment shown in fig. 1, which is not described herein again.
And S25, releasing the parameters of the detection network, and training the detection network and the face recognition network after the parameters are adjusted to construct a face recognition model.
After fixing the parameters of the detection network in S24 to train the face recognition network, the electronic device first fixes the parameters of the face recognition network and trains the detection network; it then releases the parameters of the face recognition network and jointly trains the detection network and the face recognition network. Specifically, the method comprises the following steps:
and S251, fixing parameters of the face recognition network, and training the detection network to adjust the parameters of the detection network.
The electronic device may lock the parameters of the face recognition network trained in S24, and train the detection network by setting the number of iterations, using the multi-task joint loss function in fig. 4, to update the parameters of the detection network.
And S252, releasing the parameters of the face recognition network, and performing joint training on the detection network and the face recognition network after the parameters are adjusted, so as to optimize the parameters of each network in the face recognition model.
The electronic device releases parameters of the face recognition network, that is, simultaneously releases parameters of the detection network and the face recognition network, performs joint training on the detection network and the face recognition network after the parameters are adjusted, and optimizes the parameters of the detection network and the parameters in the face recognition network, wherein the parameter optimization in this step is still performed based on the multi-task joint loss function described in fig. 4.
In this step, the electronic device uses the rich recognition results of the face recognition network to refine the position offsets of the anchor boxes formed in the detection network.
Compared with the embodiment shown in fig. 1, the method for constructing the face recognition model provided by the embodiment trains the detection network by using the recognition result of the face recognition network, so that the detection precision of the detection network is improved; and then the detection network and the face recognition network are jointly trained, the face recognition network helps to improve the precision of the detection network, the face recognition network provides a basis for the detection network, and the detection and the recognition are mutually coupled, so that the accuracy of the face recognition is improved.
This embodiment further provides a method for constructing a face recognition model, which can be used in electronic devices such as a computer, a mobile phone or a tablet computer. Fig. 5 is a flowchart of a method for constructing a face recognition model according to an embodiment of the present invention; as shown in fig. 5, the flow includes the following steps:
and S31, acquiring the sample image with the annotation information.
And the annotation information is used for representing the position of the real human face in the sample image. Please refer to S21 in the embodiment shown in fig. 3 for details, which are not described herein.
And S32, inputting the sample image into the detection network to obtain the position of the face candidate frame.
The detection network is obtained by training through a first convolutional network. Specifically, the structure of the detection network is shown in fig. 6, and the construction process of the detection network specifically considers two aspects of classification confidence and position offset as follows:
(1) A collection of anchor boxes is generated on the feature map. The anchor boxes on the feature map correspond one-to-one to anchor boxes on the input original image: the coordinate position of the center point of each anchor box on the feature map can be mapped back correspondingly, and the size of an anchor box on the feature map multiplied by the down-sampling factor of the convolutional layers is its size on the original image. With proper parameter design, anchor boxes of different sizes and specifications are generated, and they overlap so as to cover the entire original image.
(2) An anchor box is used to intercept a feature region on the feature map output by the convolution structure, and a fully connected layer then extracts the depth features of that region.
(3) The binary classification confidence and the position offset of each anchor box are calculated from the depth features. The binary classification labels are obtained by judging the overlap ratio between each anchor box and the real face boxes, and the position-offset labels can be calculated from the position on the original image corresponding to each anchor box. An offset label comprises four real values (d_x, d_y, d_w, d_h) corresponding to the abscissa x of the upper-left corner of the anchor box, the ordinate y of the upper-left corner, the width w and the height h; it represents the change of the anchor box relative to the real face rectangle (x_0, y_0, w_0, h_0) and, in the standard box-regression parameterization, is defined as:

$$d_x=\frac{x_0-x}{w},\quad d_y=\frac{y_0-y}{h},\quad d_w=\log\frac{w_0}{w},\quad d_h=\log\frac{h_0}{h}$$
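As a hedged illustration, the offset labels can be computed as below; since the original formula is an image in the patent and the form above is reconstructed from the standard box-regression parameterization, this sketch should be read as an assumption consistent with the surrounding definitions rather than the patent's exact formula.

```python
import torch

def encode_offsets(anchors, gt_boxes):
    """Compute (dx, dy, dw, dh) labels for matched anchor/ground-truth pairs.

    Both tensors are (K, 4) as (x, y, w, h) with (x, y) the top-left corner.
    """
    x, y, w, h = anchors.unbind(dim=1)
    x0, y0, w0, h0 = gt_boxes.unbind(dim=1)
    dx = (x0 - x) / w          # horizontal shift, normalized by anchor width
    dy = (y0 - y) / h          # vertical shift, normalized by anchor height
    dw = torch.log(w0 / w)     # log-scale width change
    dh = torch.log(h0 / h)     # log-scale height change
    return torch.stack([dx, dy, dw, dh], dim=1)
```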
further, the construction process of the detection network can also be understood as the generation process of the face candidate frame. Specifically, the method comprises the following steps:
s321, inputting the sample image into a detection network to obtain the two classification confidence degrees and the position offset of the anchor point frames.
Wherein the position offset is an offset of the anchor point frame relative to the position of the real face.
S322, judging whether the two classification confidence degrees of the anchor point frames are smaller than a preset value.
And a preset value is set in the detection network, and when the two classification confidence degrees of the anchor point frame are smaller than the preset value, the anchor point frame is discarded. And when the confidence degrees of the two classifications of the anchor point frame are larger than or equal to the preset value, reserving the anchor point frame. Continuously repeating the steps until all anchor point frames are judged completely; then, S323 is performed.
And S323, performing non-maximum suppression on the remaining anchor boxes to obtain the positions of the face candidate frames.

The remaining anchor boxes screened in step S322 are sorted, taking the total loss value of each anchor box as the scoring basis. The best-scoring anchor box is then retained as a candidate frame, and the other anchor boxes that have a large overlap area with it are removed. This operation is repeated on the remaining anchor boxes until all boxes have been traversed, thereby obtaining the positions of the face candidate frames; a sketch of this selection procedure follows.
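The candidate-frame generation of S321 to S323 can be sketched as follows; the thresholds are illustrative, and ranking by confidence (rather than by the total loss value mentioned above) is a simplifying assumption.

```python
import torch

def select_candidates(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence anchor boxes, then apply greedy NMS.

    boxes: (K, 4) tensor as (x1, y1, x2, y2); scores: (K,) face confidences.
    """
    keep_mask = scores >= score_thresh              # S322: discard low-confidence boxes
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best, rest = order[0], order[1:]
        keep.append(best.item())                    # retain the best-scoring box
        if rest.numel() == 0:
            break
        # Overlap of the retained box with the remaining boxes
        x1 = torch.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[best, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]             # remove heavily overlapping boxes
    return boxes[keep]
```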
And S33, based on the position of the face candidate frame, intercepting the image of the position area corresponding to the face candidate frame to obtain the screenshot of the candidate frame.
The candidate frame screenshot is obtained by the candidate face screenshot module when the position of the face candidate frame passes through the candidate face screenshot module, and the candidate face screenshot module has a back propagation characteristic. Please refer to S23 in fig. 3 for details, which are not described herein.
And S34, fixing parameters of the detection network, inputting the screenshot of the candidate frame into the face recognition network, and training the face recognition network to adjust the parameters of the face recognition network.
Wherein the face recognition network is constructed using a second convolutional network. Please refer to S24 in fig. 3 for details, which are not described herein.
And S35, releasing the parameters of the detection network, and training the detection network and the face recognition network after the parameters are adjusted to construct a face recognition model. Please refer to S25 in fig. 3 for details, which are not described herein.
Compared with the embodiment shown in fig. 3, the method for constructing the face recognition model provided by the embodiment is based on the multitask correlation of face recognition, and the face recognition network is realized by adopting a mode of weighting and summing a plurality of loss functions, so that the overall capturing capability of deep features is improved, and the accuracy of recognition tasks is improved.
In accordance with an embodiment of the present invention, there is provided a face recognition method embodiment, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The present embodiment provides a face recognition method, which can be used in the electronic devices, such as a computer, a mobile phone, a tablet computer, and the like. The face recognition method in this embodiment is performed based on the face recognition model constructed by the face recognition model construction method provided in the embodiments shown in fig. 1, 3, and 5. Fig. 7 is a flowchart of a face recognition method according to an embodiment of the present invention, and as shown in fig. 7, the flowchart includes the following steps:
and S41, acquiring the image sequence to be recognized.
The image sequence to be recognized may be stored in the electronic device in advance, or may be acquired by the electronic device from an image acquisition device in real time; it is only necessary that the electronic device obtains the image sequence to be recognized, and the specific acquisition manner is not limited.
And S42, inputting the image sequence to be recognized into the face recognition model constructed according to the construction method of the face recognition model, and obtaining the face recognition result.
And the face recognition result comprises the coordinates of the face characteristic points. The electronic equipment inputs the image sequence to be recognized into the face recognition model, and then a face recognition result can be obtained. For example, the feature point coordinates of each face image are output. Optionally, the output of the face recognition model may also be a multitasking result of face recognition. For example, the multitasking result includes coordinates of the face feature points and a classification of the face attributes, and the like.
S43, a numerical value of the face quality evaluation is calculated based on the coordinates of the face feature points.
The aim of face recognition here is to identify the best face belonging to each person from the images to be recognized. For example, the image sequence may consist of face images, in various poses, captured on a road segment through which several people pass; the face recognition model is then used to identify, from the image sequence, the best face image belonging to each person. The optimal face image is the face image that is centered in position, of proper size and in a frontal pose. Mathematically, this criterion can also be described as picking the candidate face whose projection onto the matrix subspace of the two-dimensional image carries the most energy. The energy projected outside the subspace can be understood as three-dimensional information that cannot be described by the two-dimensional plane, such as the pitch angle, the yaw angle, the facial contour and the linkage relations of the face.
The face feature points are the basis for measuring the best face. Based on the coordinates of the face feature points obtained in S42, the projection energy of each image in the image sequence to be recognized on the matrix subspace of the two-dimensional image can be calculated, and this projection energy indicates whether a candidate face is the best face.
And S44, outputting the optimal face image according to the numerical value of the face quality evaluation.
The electronic device calculates, in S43, the face quality evaluation values corresponding to the same face, and determines the optimal face image for that face by comparing the magnitudes of these values; alternatively, to facilitate the comparison, the face quality evaluation values are first sorted and the optimal face image corresponding to the same face is then output. For example, after the image sequence passes through the face recognition network, the network marks the same face with the same ID; when outputting the optimal face image, the optimal face image corresponding to a given ID, i.e., the optimal face image of the same face, can be determined by comparing the face quality evaluation values sharing that ID.
In the face recognition method provided by this embodiment, the optimal face is the candidate face whose projection onto the matrix subspace of the two-dimensional image carries the most energy; the energy projected outside that subspace can be understood as stereoscopic information, such as the pitch angle and the yaw angle of the face, that cannot be described by a two-dimensional plane. Therefore, calculating the face quality evaluation value from the coordinates of the face feature points has high reliability.
In this embodiment, a face recognition method is provided, which can be used in electronic devices such as a computer, a mobile phone or a tablet computer. Fig. 8 is a flowchart of a face recognition method according to an embodiment of the present invention; as shown in fig. 8, the flow includes the following steps:
and S51, acquiring the image sequence to be recognized. Please refer to S41 in fig. 7 for details, which are not described herein.
And S52, inputting the image sequence to be recognized into the face recognition model constructed according to the construction method of the face recognition model, and obtaining the face recognition result.
Wherein the result of the face recognition comprises the coordinates of the face feature points. Specifically, after the image sequence passes through the face recognition model, the electronic device classifies all candidate frames of the entire image sequence, that is, candidate frames belonging to the same face are grouped into the same class.
Please refer to S42 in fig. 7 for details, which are not described herein.
Optionally, in the image sequence to be recognized, if the overlap area of the candidate-frame coordinates in two adjacent images (i.e., a preceding and a following image in the sequence) exceeds a threshold, the face candidate frames are regarded as describing the same target and are grouped into one class; a sketch of this overlap test follows.
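A small sketch of this adjacency test, using intersection-over-union as the overlap measure (the patent only specifies an overlap-area threshold, so IoU is an assumption):

```python
def same_target(box_a, box_b, iou_thresh=0.5):
    """Link candidate boxes across adjacent frames.

    Boxes are (x1, y1, x2, y2) tuples; returns True when the two boxes are
    regarded as describing the same target.
    """
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) >= iou_thresh
```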
S53, a numerical value of the face quality evaluation is calculated based on the coordinates of the face feature points.
Specifically, the method comprises the following steps:
and S531, extracting the coordinates of the standard human face characteristic points.
The standard face is a face that is centered in position, of proper size and in a frontal pose. The coordinates of a set of standard face feature points can be stored in the electronic device in advance; when the face quality evaluation value is calculated, the coordinates of the standard face feature points can be extracted directly from the electronic device.
And S532, calculating a constant matrix based on the coordinates of the standard face characteristic points.
The number of face feature points output by the face recognition network of the electronic device is N; for example, N may be 5, 68, and so on. A feature point matrix composed of the coordinates of the face feature points can then be obtained, expressed as:

$$P=\begin{pmatrix}x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N\end{pmatrix}$$

where N is the number of face feature points; the face feature point matrix is a 2 × N matrix in which each column is the coordinate of one feature point.

The constant matrix is obtained by calculation from the coordinates of the standard face feature points, and is specifically an N × N constant matrix B.
And S533, calculating the numerical value of the human face quality evaluation by using the coordinates of the human face characteristic points and the constant matrix.
The feature point matrix formed by the coordinates of the face feature points is mapped by the constant matrix, so as to select the candidate face with the maximum projection energy on the matrix subspace of the two-dimensional image. The energy projected outside the subspace can be understood as three-dimensional information that cannot be described by the two-dimensional plane, such as the pitch angle, the yaw angle, the facial contour and the linkage relations of the face.
Specifically, the face evaluation value is calculated by the following formula:

Y = |P × B|²;

wherein P is the face feature point matrix defined above,

$$P=\begin{pmatrix}x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N\end{pmatrix}$$

N is the number of face feature points; B is the constant matrix, whose size is N × N.
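As an illustration, the evaluation value can be computed as below, reading |·|² as the squared Frobenius norm of the 2 × N product matrix (an assumption consistent with the dimensions above):

```python
import numpy as np

def face_quality(P, B):
    """Y = |P x B|^2 with P the 2xN feature-point matrix and B the NxN
    constant matrix derived from the standard face feature points; the
    squared Frobenius norm is assumed as the meaning of |.|^2."""
    return float(np.linalg.norm(P @ B, ord='fro') ** 2)

# Illustrative usage with N = 5 feature points:
# P = np.array([[x1, x2, x3, x4, x5],
#               [y1, y2, y3, y4, y5]])
# Y = face_quality(P, B)   # B precomputed from the standard face
```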
And S54, outputting the optimal face image according to the numerical value of the face quality evaluation.
The optimal face image is a face image with a centered position, a proper size and a front posture. Specifically, the method comprises the following steps:
and S541, sequencing the numerical values of the face quality evaluation corresponding to the same face respectively.
In S533, the electronic device calculates the face quality evaluation values corresponding to the same face; these values are then sorted, for example from small to large or from large to small.
And S542, outputting the optimal face image of the same face based on the sorting result.
The electronic device takes the candidate face at the top of the ranking (the one with the largest or smallest face quality evaluation value, depending on the sort order) as the best face image of the same face; a sketch of this selection follows.
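For illustration, selecting the best face per identity might look like this; the (face_id, quality, image) record format is an assumption, and the highest evaluation value is taken as best.

```python
from collections import defaultdict

def best_faces(candidates):
    """Pick the best face image per identity (sketch of S541-S542).

    `candidates` is an iterable of (face_id, quality, image) tuples; for each
    ID, the image with the highest quality evaluation value is returned.
    """
    by_id = defaultdict(list)
    for face_id, quality, image in candidates:
        by_id[face_id].append((quality, image))
    return {face_id: max(entries, key=lambda e: e[0])[1]
            for face_id, entries in by_id.items()}
```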
As an alternative implementation of this embodiment, fig. 9 illustrates the process from an image sequence to the three-dimensionally located recognition information of the optimal face using the face recognition model. Specifically: the image sequence to be recognized is input into the detection network to obtain face candidate frames; the face candidate frames pass through the candidate face screenshot module to obtain candidate frame screenshots; the candidate frame screenshots are input into the face recognition network to obtain multi-task recognition results; the multi-task recognition results are processed by the face quality evaluation module (i.e., the face evaluation value is calculated), and the best face is obtained from the face evaluation values. Finally, after the best face is obtained, its position of appearance in the whole image sequence to be recognized is used to obtain the recognition information of the best face located in three dimensions.
As a specific application example from training of a face recognition model to face recognition, the whole process includes:
1. training preparation for detecting network
(1) Input image preparation
The input images are resized to 227 × 227, and the real face detection frames are obtained by manual calibration.
(2) Network fabric preparation
The convolution structure corresponds to fig. 2, taking the conv1 layer through the pool5 layer of the AlexNet network.
(3) Anchor box placement rules
Considering the shape of the human face, the aspect ratio of the anchor boxes is selected as 1:1. By designing appropriate convolutional-layer parameters, the anchor boxes can be made to overlap over the entire feature map. Specifically, the side lengths of the boxes vary by multiples of 4, 8 and 16, and the minimum offset between boxes is 1/4 of the width and height, i.e., the maximum overlap area is 9/16. A sketch of such an anchor layout follows.
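An illustrative sketch of such a 1:1 anchor layout; the concrete side lengths and the cell-center mapping are assumptions, since the patent only fixes the aspect ratio and the multiples of 4, 8 and 16.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes=(32, 64, 128)):
    """Generate square (1:1) anchor boxes centered at every feature-map cell.

    Returns an (feat_h * feat_w * len(sizes), 4) array of (x1, y1, x2, y2)
    boxes in original-image coordinates.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Map the feature-map cell back to a center on the original image.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                anchors.append([cx - s / 2, cy - s / 2,
                                cx + s / 2, cy + s / 2])
    return np.array(anchors)
```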
(4) Anchor point frame calibration
The calibration criteria are as follows: a. for each real face detection frame, the anchor box with the maximum overlap ratio with it is regarded as a positive sample; b. among the remaining anchor boxes, if the overlap ratio with some real face frame is greater than 0.7, the anchor box is regarded as a positive sample; c. if the overlap ratio is less than 0.3, it is regarded as a negative sample; d. the anchor boxes that remain are discarded; e. anchor boxes crossing the image boundary are discarded.
(5) Anchor box input
Each mini-batch contains 256 anchor boxes, in which the ratio of positive samples to negative samples (i.e., face boxes to background boxes) is 1:3.
2. Training preparation of face recognition network
(1) Input image preparation
The parameters of the candidate face screenshot structure are set so that the output images have a uniform width and height.
(2) Network fabric preparation
The convolution structure is fine-tuned according to the actual requirements of the recognition tasks.
(3) Multitask loss weight setting
The weights are set flexibly according to the importance of each task, the cleanliness of the data, and so on; the general principle is to keep the loss values of the tasks at approximately the same order of magnitude.
3. Joint training strategy for recognition network and detection network
a. Fix (freeze) the parameters of the recognition network and train the detection network;
b. fix the parameters of the detection network and train the recognition network;
c. repeat steps a and b;
d. train the detection network and the recognition network simultaneously to fine-tune the parameters; a sketch of this alternation follows.
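The alternating strategy can be sketched as follows, where train_fn is an assumed helper that runs some iterations of training and updates only the parameters whose requires_grad flag is True.

```python
def set_trainable(net, flag):
    # Toggle gradient computation for all parameters of a torch.nn.Module.
    for p in net.parameters():
        p.requires_grad = flag

def joint_training(detection_net, recognition_net, train_fn, rounds=3):
    for _ in range(rounds):                       # c. repeat steps a and b
        set_trainable(recognition_net, False)     # a. freeze recognition net,
        set_trainable(detection_net, True)        #    train detection net
        train_fn(detection_net, recognition_net)
        set_trainable(detection_net, False)       # b. freeze detection net,
        set_trainable(recognition_net, True)      #    train recognition net
        train_fn(detection_net, recognition_net)
    set_trainable(detection_net, True)            # d. release both networks
    set_trainable(recognition_net, True)          #    and fine-tune jointly
    train_fn(detection_net, recognition_net)
```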
4. Embodiments of the face quality evaluation Module
(1) Input image sequence preparation
The image sequence comes from continuous shots of pedestrians' walking tracks captured by a street-scene monitoring camera.
(2) Preparation of network architecture
The detection network, the candidate face screenshot module, the recognition network and the face quality evaluation module are integrally cascaded.
(3) Integral testing
All candidate face recognition results of the whole image sequence are input into the face quality evaluation module, and it is tested whether the output optimal face is centered in position, of proper size and in a frontal pose.
In this embodiment, an apparatus for constructing a face recognition model is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a face recognition model constructing apparatus, as shown in fig. 10, including:
a first obtaining module 61, configured to obtain a sample image with annotation information; the annotation information is used for representing the position of the real human face in the sample image.
The detection module 62 is configured to input the sample image into a detection network to obtain a position of a face candidate frame; the detection network is obtained by training through a first convolutional network.
An intercepting module 63, configured to intercept, based on the position of the face candidate frame, the image of the position area corresponding to the face candidate frame to obtain a candidate frame screenshot; the candidate frame screenshot is obtained by passing the position of the face candidate frame through a candidate face screenshot module, which has a back propagation characteristic.
A first training module 64, configured to fix parameters of the detection network, input the screenshot of the candidate frame into a face recognition network, and train the face recognition network to adjust the parameters of the face recognition network; wherein the face recognition network is constructed using a second convolutional network.
A second training module 65, configured to release the parameters of the detection network, and train the detection network and the face recognition network after adjusting the parameters, so as to construct a face recognition model.
In the face recognition model construction device provided by this embodiment, since the candidate face screenshot module has a back propagation characteristic, a candidate frame screenshot can be obtained from the position of the face candidate frame and, conversely, the position of the face candidate frame can also be recovered from the candidate frame screenshot. The candidate face screenshot process thus couples the detection network and the face recognition network: the recognition result of the face recognition network can be transmitted to the detection network through back propagation, which improves the detection precision of the candidate frame and improves the accuracy of face recognition while avoiding accumulated error propagation.
The present embodiment further provides a face recognition apparatus, as shown in fig. 11, including:
a second obtaining module 71, configured to obtain an image sequence to be identified.
A recognition module 72, configured to input the image sequence to be recognized into the face recognition model constructed by the above-described construction method, so as to obtain a multitask result of face recognition; the multitask result includes coordinates of the face feature points and a classification of the face attributes.
An evaluation module 73, configured to calculate a numerical value of the face quality evaluation based on the coordinates of the face feature points.
An output module 74, configured to output an optimal face image corresponding to each type of the face attributes according to the face quality evaluation value; the optimal face image is a face image with a centered position, a proper size and a front posture.
In the face recognition apparatus provided in this embodiment, the optimal face is the candidate face whose landmark matrix has the largest projection energy onto the matrix subspace of the two-dimensional image; the energy projected outside that subspace can be understood as stereoscopic information, such as the face pitch angle and yaw angle, that a two-dimensional plane cannot describe. The face quality value calculated from the coordinates of the face feature points therefore has high reliability.
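Under this subspace reading, the quality score could be sketched as the projection energy of the landmark matrix onto the column space of a constant matrix B. The shapes used here (P as N×2 landmark coordinates, B as N×k) and the least-squares projection are assumptions, since the patent's formula is reproduced only as an image.

```python
# Sketch: face quality as the energy of P's projection onto span(B).
import numpy as np

def quality_score(P, B):
    # Least-squares coefficients of P in the column space of B, then the
    # squared Frobenius norm of the projection; energy left outside span(B)
    # corresponds to out-of-plane (pitch/yaw) information.
    coeffs, *_ = np.linalg.lstsq(B, P, rcond=None)
    projection = B @ coeffs
    return float(np.sum(projection ** 2))
```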
The face recognition apparatus in this embodiment is presented in the form of functional units, where a unit may be an ASIC, a processor and memory executing one or more software or firmware programs, and/or other devices capable of providing the above functions.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which includes the above-mentioned face recognition model construction apparatus shown in fig. 10 or the face recognition apparatus shown in fig. 11.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a terminal according to an alternative embodiment of the present invention. As shown in fig. 12, the terminal may include: at least one processor 81, such as a CPU (Central Processing Unit); at least one communication interface 83; a memory 84; and at least one communication bus 82, used to enable connection and communication between these components. The communication interface 83 may include a display and a keyboard, and may optionally also include a standard wired interface and a standard wireless interface. The memory 84 may be a high-speed volatile random-access memory (RAM) or a non-volatile memory, such as at least one disk memory; optionally, it may be at least one storage device located remotely from the processor 81. The processor 81 may be connected with the apparatus described in fig. 12; an application program is stored in the memory 84, and the processor 81 calls the program code stored in the memory 84 to perform any of the above-mentioned method steps.
The communication bus 82 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 82 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.
The memory 84 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 84 may also comprise a combination of the above types of memory.
The processor 81 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of CPU and NP.
The processor 81 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 84 is also used to store program instructions. The processor 81 may call program instructions to implement a method of constructing a face recognition model as shown in the embodiments of fig. 1, 3 and 5 of the present application or a face recognition method as shown in the embodiments of fig. 7 to 8.
The embodiment of the invention also provides a non-transitory computer storage medium, wherein the computer storage medium stores computer-executable instructions that can perform the construction method of the face recognition model or the face recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above types of memory.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (7)

1. A face recognition method, comprising:
acquiring an image sequence to be identified;
inputting the image sequence to be recognized into a face recognition model to obtain a face recognition result; the result of the face recognition comprises the coordinates of the face characteristic points;
calculating a numerical value of face quality evaluation based on the coordinates of the face feature points;
outputting an optimal face image according to the numerical value of the face quality evaluation; the optimal face image is a face image with a centered position, a proper size and a front posture;
calculating the value of the face quality evaluation by adopting the following formula:

[the formula is reproduced only as an image in the original publication]

wherein P is the matrix of the coordinates of the face feature points (its explicit form is likewise given as an image), N is the number of the face feature points, and B is a constant matrix whose size is given as an image;
The construction of the face recognition model comprises the following steps:
acquiring a sample image with marking information; the annotation information is used for representing the position of a real human face in the sample image;
inputting the sample image into a detection network to obtain the position of a face candidate frame; the detection network is obtained by utilizing a first convolution network for training;
based on the position of the face candidate frame, intercepting an image corresponding to the face candidate frame region to obtain a candidate frame screenshot; the candidate frame screenshot is obtained by a candidate face screenshot module through the position of the face candidate frame, the candidate face screenshot module has a back propagation characteristic, and the candidate face screenshot module is established based on affine transformation;
fixing parameters of the detection network, inputting the screenshot of the candidate frame into a face recognition network, and training the face recognition network to adjust the parameters of the face recognition network; the face recognition network is constructed by utilizing a second convolutional network;
and releasing the parameters of the detection network, and training the detection network and the face recognition network after the parameters are adjusted to construct a face recognition model.
2. The method according to claim 1, wherein outputting an optimal face image according to the value of the face quality evaluation comprises:
respectively sequencing the numerical values of the face quality evaluation corresponding to the same face;
and outputting the optimal face image of the same face based on the sequencing result.
3. The method according to claim 1, wherein the releasing the parameters of the detection network, and training the detection network and the face recognition network after adjusting the parameters to construct the face recognition model comprises:
fixing parameters of the face recognition network, and training the detection network to adjust the parameters of the detection network;
and releasing parameters of the face recognition network, and performing joint training on the detection network and the face recognition network after the parameters are adjusted so as to optimize the parameters of each network in the face recognition model.
4. The method of claim 1, wherein the output of the face recognition network is a multitask result of face recognition, the multitask result comprising coordinates of face feature points and a classification of face attributes; the loss function of the face recognition network is a function obtained by adding the loss functions of all the face recognition tasks.
5. The method according to any one of claims 1-4, wherein the inputting the sample image into a detection network to obtain the position of the face candidate frame comprises:
inputting the sample image into the detection network to obtain the binary classification confidence and the position offset of each of a plurality of anchor point frames; wherein the position offset is the offset of the anchor point frame relative to the position of the real face;
judging whether the binary classification confidence of each anchor point frame is smaller than a preset value;
when the binary classification confidence of an anchor point frame is smaller than the preset value, discarding that anchor point frame;
and carrying out non-maximum suppression on the rest anchor point frames to obtain the positions of the face candidate frames.
6. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the face recognition method of any one of claims 1-5.
7. A computer-readable storage medium storing computer instructions for causing a computer to perform the face recognition method of any one of claims 1-5.
CN201910120174.4A 2019-02-18 2019-02-18 Face recognition model construction method, face recognition method and electronic equipment Active CN109934115B (en)
