CN109902546B - Face recognition method, face recognition device and computer readable medium

Info

Publication number
CN109902546B
Authority
CN
China
Prior art keywords
face
face feature
feature matrix
network
matrix
Prior art date
Legal status
Active
Application number
CN201810523102.XA
Other languages
Chinese (zh)
Other versions
CN109902546A (en)
Inventor
遇冰
冯柏岚
胡一博
赫然
孙哲南
Current Assignee
Huawei Technologies Co Ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Huawei Technologies Co Ltd
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Institute of Automation of Chinese Academy of Science
Priority to CN201810523102.XA
Priority to PCT/CN2019/088697
Publication of CN109902546A
Application granted
Publication of CN109902546B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition


Abstract

The embodiments of the invention relate to the technical field of face recognition and disclose a face recognition method, a face recognition device, and a computer readable medium. The method includes: inputting n frames of images contained in first video data into a feature extraction network for face feature extraction, to obtain n face feature matrices in one-to-one correspondence with the n frames of images; fusing the n face feature matrices to obtain a target face feature matrix of the face to be recognized; and performing face recognition on the face to be recognized using the target face feature matrix to obtain a face recognition result, where n ≥ 2. Because face recognition is performed with a face feature matrix obtained by fusing multiple face feature matrices extracted from video data, the accuracy of face recognition can be improved.

Description

Face recognition method, face recognition device and computer readable medium
Technical Field
The present invention relates to the field of face recognition technologies, and in particular, to a face recognition method, a face recognition device, and a computer readable medium.
Background
As an important biometric recognition technology, face recognition is a very active research topic in the fields of pattern recognition and computer vision. Compared with other biometric technologies such as fingerprint and iris recognition, it is direct, friendly, convenient, fast, unobtrusive in operation, non-intrusive, and highly interactive, and therefore has very broad application prospects. As face recognition technology matures, it has been widely applied in public security, banking, customs, airports, intelligent video surveillance, medical care, and other areas, and shows strong vitality.
Video face recognition is a technology for identifying the face features contained in video data. Video-based face recognition can use not only the feature information (spatial information) in each frame of face image, but also the information (temporal information) contained in the video sequence. Compared with image-based face recognition, video-based face recognition has richer available information and is more conducive to identity recognition.
One existing video-based face recognition scheme works as follows: the face pose in the video is estimated using face key points, and face images are clustered according to the pose estimation information; the face image closest to each cluster center is extracted as a key frame image for feature extraction, and then the similarity between face features is compared to determine the identity information corresponding to the faces in the video. In this scheme, extraction of face features from the video depends on key frame extraction, the quality of the key frames directly affects recognition accuracy, and the resulting face recognition accuracy is low. Therefore, a video-based face recognition scheme with higher accuracy needs to be developed.
Disclosure of Invention
The application provides a face recognition method, a face recognition device and a computer readable medium, which can improve the accuracy of face recognition.
In a first aspect, the present application provides a face recognition method, including:
inputting n frames of images contained in first video data into a feature extraction network for face feature extraction, to obtain n face feature matrices in one-to-one correspondence with the n frames of images, where the first video data contains T frames of images and the n frames of images are any n frames, among the T frames, that contain the face to be recognized; the feature extraction network is obtained by training with a first generative adversarial learning method, which includes: training a first generation network and a first adversarial network, and using the converged first generation network as the feature extraction network; the training data used to train the first generation network includes first positive examples and first negative examples, the first positive examples include m face feature matrices extracted from second video data, the first negative examples include f face feature matrices extracted from f pictures, the m face feature matrices are in one-to-one correspondence with m frames of images contained in the second video data, the f face feature matrices are in one-to-one correspondence with the f pictures, and the second video data and the f pictures both contain a first face; the first adversarial network is used to perform positive/negative discrimination on the face feature matrices extracted by the first generation network to obtain a first discrimination result; the first discrimination result is used, together with the first positive examples and the first negative examples, to calculate a first loss value, and the first loss value is used to update the parameters of the first generation network and the parameters of the first adversarial network until the first generation network converges;
fusing the n face feature matrices to obtain a target face feature matrix of the face to be recognized;
performing face recognition on the face to be recognized using the target face feature matrix to obtain a face recognition result; where T ≥ 2, 2 ≤ n ≤ T, m ≥ 2, and f ≥ 1.
The execution subject of the present application is a face recognition apparatus, which may be a mobile phone, a notebook computer, a desktop computer, a tablet computer, a wearable device, a server, or the like. The first video data may contain at least two frames of images. Each of the n face feature matrices describes the face feature information contained in its corresponding image frame. The framework of adversarial learning includes a generation network and a discrimination network. The generation network is the model to be trained; its goal is to produce one type of data (positive examples) that is indistinguishable from another known type of data (negative examples). The discrimination network needs to distinguish positive examples from negative examples; its goal is to oppose the generation network, thereby forcing the generation network to produce data that is harder to distinguish. The aim of adversarial learning is to train a generation network that can generate artificial data consistent with known data. In this application, the first generation network is trained with the first adversarial learning method to obtain the feature extraction network, so that the face feature matrices the feature extraction network extracts from the image frames of video data become difficult to distinguish from face feature matrices extracted from pictures.
Feature fusion organically combines distinguishing and complementary features into a unified feature. It is a common technique in the field of biometric recognition and can be implemented in many ways. The information contained in fused features is more accurate and richer; it can be understood that the target face feature matrix contains more accurate and richer information than any single one of the n face feature matrices, so face recognition using the target face feature matrix is more accurate. In practice, feature extraction can be performed on two or more single-frame images contained in video data to obtain multiple face feature matrices, which are then fused into one face feature matrix. Optionally, the first video data is decomposed in time order to obtain the n frames of images, and feature extraction is performed on each of the n frames to obtain the n face feature matrices in one-to-one correspondence with them. For example, the face recognition apparatus decomposes an input video stream into a time-ordered sequence of single-frame images, extracts a face feature matrix for each single-frame image, and fuses these face feature matrices into a single face feature matrix.
In this application, face recognition is performed with a face feature matrix obtained by fusing multiple face feature matrices extracted from video data, so the accuracy of face recognition can be improved.
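As an illustration of this pipeline, the following NumPy sketch extracts one face feature matrix per frame and fuses the n matrices by a weighted sum. The function names and the assumption that a callable feature extraction network is available are illustrative, not taken from the patent.

```python
# Illustrative NumPy sketch (not the patented implementation): extract one face
# feature matrix per frame, then fuse the n matrices into one target matrix.
import numpy as np

def extract_features(frames, feature_net):
    """Apply a feature extraction network to each of the n frames, in frame order,
    yielding n face feature matrices (feature_net is a placeholder callable)."""
    return [feature_net(frame) for frame in frames]

def fuse_features(feature_mats, weights):
    """Fuse n same-shaped face feature matrices into one target face feature
    matrix as a weighted sum."""
    fused = np.zeros_like(feature_mats[0], dtype=float)
    for w, f in zip(weights, feature_mats):
        fused += w * f
    return fused
```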
In an alternative implementation, the parameters of the first generation network are further updated with a weighted sum of the first loss value and a second loss value, where the second loss value represents the likelihood that face recognition performed with the m face feature matrices produces an incorrect face recognition result.
The first loss value is calculated by using a first loss function. Optionally, the first loss function is as follows:
L_AFL = -(1/N) Σ_{i=1}^{N} [d_i·log(d̂_i) + (1 − d_i)·log(1 − d̂_i)]

where d_i is the actual positive/negative label of the i-th face feature matrix; N is the total number of positive and negative examples, i.e. the total number of face feature matrices input to the first discrimination network; d̂_i is the positive/negative discrimination result for the i-th face feature matrix; both d̂_i and d_i take the value 0 or 1, where one of 0 and 1 denotes a positive example and the other denotes a negative example. The loss value L_AFL calculated with the first loss function represents the difference between the first discrimination result and the first actual result. In the first actual result, the m face feature matrices extracted from the second video data are all positive examples, and the f face feature matrices extracted from the f pictures are all negative examples.
The first discrimination network is used for discriminating whether the input face feature matrix is a positive example or a negative example, that is, the output first discrimination result indicates a positive example or a negative example. Specifically, if the first discrimination network discriminates that the input face feature matrix is a positive example, a first discrimination result indicating the positive example is output; otherwise, a first discrimination result indicating a negative example is output. The first discrimination result is 0 or 1; wherein 0 represents a positive case, and 1 represents a negative case; alternatively, 0 represents a negative example and 1 represents a positive example.
The second loss value is calculated by using a second loss function. Optionally, the second loss function is as follows:
L_CL1 = -(1/M) Σ_{i=1}^{M} Σ_{j=1}^{T} 1{y_i = j}·log(s_ij)

where y_i is the actual class label corresponding to the i-th face feature matrix; s_ij is the computed probability that the i-th face feature matrix belongs to the j-th class, i.e. the probability that the class label of the i-th face feature matrix is that of the j-th class; T is the total number of classes, each class corresponding to one class label; and M is the number of input face feature matrices, all of which are contained in the m face feature matrices.
The weighted sum of the first loss value and the second loss value is calculated as:

L1 = α·L_AFL + β·L_CL1

where α and β are constants corresponding to the weight of the first loss function and the weight of the second loss function, respectively.
In this implementation, updating the parameters of the first generation network according to the result of face recognition performed with the face feature matrices it extracts optimizes the first generation network's ability to extract face feature information.
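The following NumPy sketch shows one way the two losses and their weighted sum L1 = α·L_AFL + β·L_CL1 could be computed from the quantities defined above; the array shapes and the default weights are assumptions for illustration.

```python
# NumPy sketch of the losses above; shapes and the default alpha/beta are assumptions.
import numpy as np

def adversarial_loss(d_hat, d):
    """L_AFL: binary cross-entropy between discrimination results d_hat and
    actual positive/negative labels d (arrays of length N with values in {0, 1})."""
    d_hat = np.clip(d_hat, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -np.mean(d * np.log(d_hat) + (1 - d) * np.log(1 - d_hat))

def classification_loss(s, y):
    """L_CL1: cross-entropy over class probabilities s (shape M x T) and the
    actual class labels y (length M, integer class indices)."""
    s = np.clip(s, 1e-7, 1.0)
    return -np.mean(np.log(s[np.arange(len(y)), y]))

def combined_loss(d_hat, d, s, y, alpha=1.0, beta=1.0):
    """L1 = alpha * L_AFL + beta * L_CL1."""
    return alpha * adversarial_loss(d_hat, d) + beta * classification_loss(s, y)
```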
In an optional implementation, the method further includes:
inputting the m frames of images contained in the second video data into the first generation network for feature extraction, to obtain the m face feature matrices;
inputting the f pictures into the first generation network for feature extraction, to obtain the f face feature matrices;
inputting the m face feature matrices and the f face feature matrices into the first discrimination network for positive/negative discrimination, to obtain the first discrimination result;
updating the parameters of the first discrimination network with the first loss value using a back propagation algorithm, where the first loss value is calculated with the first loss function from the first discrimination result and first preset information; the first preset information specifies that the m face feature matrices are all positive examples and the f face feature matrices are all negative examples; the first loss value represents the difference between the first discrimination result and the first preset information;
updating the parameters of the first generation network with the weighted sum of the first loss value and the second loss value using a back propagation algorithm, where the second loss value is obtained with the second loss function and represents the likelihood that face recognition performed with the m face feature matrices produces an incorrect face recognition result.
In this application, the feature extraction network is obtained through adversarial learning, which is simple to implement, and the face feature matrices it extracts from video data can be used for face recognition with high accuracy.
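A schematic PyTorch-style training step under this adversarial scheme might look as follows. The module names (gen, disc, classifier), the sigmoid output assumed for the discriminator, and the label flipping used for the generator's adversarial term are interpretations for illustration, not details taken from the patent.

```python
# Schematic sketch of one training step for the first generation network (gen) and
# the first discrimination network (disc), plus an auxiliary classifier for L_CL1.
import torch

def train_step(gen, disc, classifier, video_frames, pictures, labels,
               opt_g, opt_d, alpha=1.0, beta=1.0):
    bce = torch.nn.BCELoss()
    ce = torch.nn.CrossEntropyLoss()

    feat_video = gen(video_frames)    # m face feature matrices (positive examples)
    feat_pics = gen(pictures)         # f face feature matrices (negative examples)

    # Update the first discrimination network with the first loss value (L_AFL).
    d_pos = disc(feat_video.detach())
    d_neg = disc(feat_pics.detach())
    loss_d = bce(d_pos, torch.ones_like(d_pos)) + bce(d_neg, torch.zeros_like(d_neg))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Update the generation network with alpha*L_AFL + beta*L_CL1; flipped labels
    # push video-frame features toward being indistinguishable from picture features.
    d_pos = disc(feat_video)
    adv = bce(d_pos, torch.zeros_like(d_pos))
    cls = ce(classifier(feat_video), labels)      # second loss value
    loss_g = alpha * adv + beta * cls
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```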
In an optional implementation, the n face feature matrices are matrices of the same shape, and fusing the n face feature matrices to obtain the target face feature matrix of the face to be recognized includes:
calculating a weighted sum of the n face feature matrices according to the weight values or weight matrices respectively corresponding to the n face feature matrices, to obtain the target face feature matrix of the face to be recognized.
In this implementation, feature fusion is performed by calculating a weighted sum of two or more face feature matrices; the calculation is simple and makes full use of the complementarity of the different features.
In an optional implementation, before calculating the weighted sum of the n face feature matrices according to the weight values or weight matrices respectively corresponding to them, the method further includes:
inputting the n face feature matrices in sequence into a trained feature weight network, where the feature weight network is a recurrent neural network and the order in which the face feature matrices are input to the feature weight network is the same as the order, in the first video data, of the image frames to which they correspond;
the feature weight network determines the weight value or weight matrix corresponding to each of the n face feature matrices according to its state vector at the time that face feature matrix is input, where the state vector of the feature weight network at the current time depends on its state vector at the previous time and on the input face feature matrix.
Optionally, a formula for calculating a weight value or a weight matrix corresponding to each face feature matrix in the n face feature matrices is as follows:
s_t = U·f_t + W·s_{t−1}
o_t = V·s_t

where U, V, and W are parameters of the feature weight network; f_t is the face feature matrix input to the feature weight network; s_t is the state vector of the feature weight network at the current time t; s_{t−1} is the state vector of the feature weight network at the previous time t−1, i.e. its state vector when the face feature matrix f_t is input; and o_t is the computed weight value or weight matrix.
In this implementation, the weight of each face feature matrix is determined from the state vector of the feature weight network at the time that face feature matrix is input, so the temporal information of the single-frame images in the video data is fully used and the computed weights are more accurate.
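A minimal NumPy sketch of this recurrence follows; flattening each face feature matrix into a vector and the parameter shapes are assumptions for illustration.

```python
# NumPy sketch of s_t = U·f_t + W·s_{t-1}, o_t = V·s_t over the n face feature
# matrices, producing one weight per matrix in frame order.
import numpy as np

def feature_weights(feature_mats, U, W, V, s0=None):
    """Run the feature weight network over the n face feature matrices in frame
    order and return one weight (o_t) per input face feature matrix."""
    s = np.zeros(W.shape[0]) if s0 is None else s0
    weights = []
    for f_t in feature_mats:
        s = U @ f_t.ravel() + W @ s   # s_t depends on the current input and s_{t-1}
        o_t = V @ s                   # weight value (or weight matrix) for f_t
        weights.append(o_t)
    return weights
```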
In an optional implementation, the feature weight network is obtained by training with a second generative adversarial learning method, which includes: training a second generation network and a second adversarial network, and using the converged second generation network as the feature weight network. The training data used to train the second generation network includes second positive examples and second negative examples: the second positive examples include k face feature matrices extracted from k frames of images contained in third video data and in one-to-one correspondence with those k frames, and the second negative examples include h simulated face feature matrices generated from a reference face feature matrix, where the reference face feature matrix is the weighted sum of the k face feature matrices calculated with the weights that the second generation network determines for them. The second adversarial network is used to perform positive/negative discrimination on the h simulated face feature matrices and the k face feature matrices to obtain a second discrimination result; the second discrimination result is used, together with the second positive examples and the second negative examples, to calculate a third loss value, and the third loss value is used to update the parameters of the second generation network and the parameters of the second adversarial network until the second generation network converges; where k ≥ 2 and h ≥ 2.
In this implementation, the feature weight network is obtained by generative adversarial training, which is efficient.
In an optional implementation, the parameters of the second generation network are further updated with a weighted sum of the third loss value and a fourth loss value, where the fourth loss value represents the likelihood that face recognition performed with the reference face feature produces an incorrect face recognition result.
And the third loss value is calculated by using a third loss function. Optionally, the third loss function is as follows:
L_AAL = -(1/N1) Σ_{i=1}^{N1} [b_i·log(b̂_i) + (1 − b_i)·log(1 − b̂_i)]

where b_i is the actual positive/negative label of the i-th face feature matrix; N1 is the total number of positive and negative examples, i.e. the total number of face feature matrices input to the second discrimination network; b̂_i is the positive/negative discrimination result for the i-th face feature matrix; both b̂_i and b_i take the value 0 or 1, where one of 0 and 1 denotes a positive example and the other denotes a negative example. The loss value obtained with the third loss function represents the difference between the second discrimination result and the second actual result. In the second actual result, the h simulated face feature matrices generated from the reference face feature matrix are all negative examples, and the k face feature matrices extracted from the k frames of images contained in the third video data are all positive examples.
And the fourth loss value is calculated by using a fourth loss function. Optionally, the fourth loss function is as follows:
L_CL2 = -Σ_{j=1}^{T} 1{y = j}·log(s_j)

where y is the actual class label of the third video data; s_j is the computed probability that the third video data belongs to the j-th class, with j the class label of the j-th class; and T is the total number of classes, each class corresponding to one class label. The loss value L_CL2 calculated with the fourth loss function represents the likelihood that face recognition performed with the reference face feature produces an incorrect face recognition result.
The weighted sum of the third loss value and the fourth loss value is calculated as:

L2 = a·L_AAL + b·L_CL2

where a and b are constants corresponding to the weights of the third loss function and the fourth loss function, respectively.
In this implementation, the result of face recognition performed with the reference face feature matrix is used to update the parameters of the second generation network, which optimizes the second generation network's ability to determine the weight value or weight matrix corresponding to each face feature matrix.
In an optional implementation, the method further includes:
inputting the k face feature matrices into the second generation network, and determining k weight values or weight matrices in one-to-one correspondence with the k face feature matrices;
calculating the weighted sum of the k face feature matrices according to the k weight values or weight matrices, to obtain the reference face feature matrix;
generating the h simulated face feature matrices corresponding to the reference face feature matrix;
inputting the h simulated face feature matrices and the k face feature matrices into the second discrimination network for positive/negative discrimination, to obtain the second discrimination result;
updating the parameters of the second discrimination network with the third loss value using a back propagation algorithm, where the third loss value is calculated with the third loss function from the second discrimination result and second preset information; the second preset information specifies that the k face feature matrices are all positive examples and the h simulated face feature matrices are all negative examples; the third loss value represents the difference between the second discrimination result and the second preset information;
updating the parameters of the second generation network with the weighted sum of the third loss value and the fourth loss value using a back propagation algorithm, where the fourth loss value is calculated with the fourth loss function and represents the likelihood that face recognition performed with the reference face feature produces an incorrect face recognition result.
In this implementation, the feature weight network is obtained by generative adversarial training, so training efficiency is high.
In an optional implementation manner, the h simulated face feature matrices are face feature matrices generated according to normal distribution parameters corresponding to the reference face features.
Optionally, a formula for calculating a normal distribution parameter corresponding to the reference face feature is as follows:
u = f_v
σ = (1/N2) Σ_{i=1}^{N2} (f_i − u)²

where f_v is the reference face feature matrix, f_i is the i-th face feature matrix, and N2 is the number of face feature matrices. The h simulated face feature matrices are all face feature matrices conforming to the normal distribution N(u, σ).
This implementation exploits the property that the face feature matrices conform to a normal distribution, so simulated face feature matrices can be generated quickly from the face feature matrices corresponding to the video data.
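A short NumPy sketch of this generation step follows, under the reading that u is the reference face feature matrix f_v and σ is the element-wise variance of the extracted face feature matrices; this reading is an interpretation of the formulas above.

```python
# NumPy sketch of generating h simulated face feature matrices from N(u, sigma).
import numpy as np

def simulate_features(feature_mats, f_v, h):
    feats = np.stack(feature_mats)              # k face feature matrices
    u = f_v                                     # mean: reference face feature matrix
    sigma = np.mean((feats - u) ** 2, axis=0)   # element-wise variance
    return [np.random.normal(u, np.sqrt(sigma)) for _ in range(h)]
```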
In an optional implementation, the method further includes:
inputting g frames of images contained in fourth video data into the feature extraction network for face feature extraction, to obtain g face feature matrices in one-to-one correspondence with the g frames of images;
fusing the g face feature matrices to obtain a first face feature matrix;
where performing face recognition on the face to be recognized using the target face feature matrix to obtain the face recognition result includes:
calculating the similarity between the target face feature matrix and the first face feature matrix;
when the similarity between the target face feature matrix and the first face feature matrix exceeds a threshold, obtaining a face recognition result indicating that the first video data and the fourth video data correspond to the same person;
when the similarity between the target face feature matrix and the first face feature matrix does not exceed the threshold, obtaining a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
In this implementation, whether two pieces of video data contain the same face can be determined quickly and accurately.
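A minimal sketch of this verification step; cosine similarity and the threshold value are illustrative choices, since the text does not fix a particular similarity measure here.

```python
# Sketch: compare the fused face feature matrices of two videos against a threshold.
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def same_person(target_feature, first_feature, threshold=0.8):
    """True if the fused features of the two videos are similar enough to be
    judged as belonging to the same person."""
    return cosine_similarity(target_feature, first_feature) >= threshold
```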
In an optional implementation, performing face recognition on the face to be recognized using the target face feature matrix to obtain the face recognition result includes:
calculating the similarity between the target face feature matrix and the face feature matrices in a face feature database, where the face feature database contains at least one face feature matrix and the identity information corresponding to each of them;
determining, in the face feature database, a second face feature matrix with the highest similarity to the target face feature matrix and the identity information corresponding to the second face feature matrix, where the face recognition result includes the identity information corresponding to the second face feature matrix.
In this implementation, the identity information corresponding to the target face feature matrix can be determined accurately.
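A minimal sketch of identification against a face feature database, again using cosine similarity as an illustrative measure; the dictionary-based database is an assumption for illustration.

```python
# Sketch: return the identity whose stored matrix is most similar to the target.
import numpy as np

def identify(target_feature, feature_db):
    """feature_db maps identity information -> stored face feature matrix."""
    def sim(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    best_id = max(feature_db, key=lambda ident: sim(target_feature, feature_db[ident]))
    return best_id, feature_db[best_id]
```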
In an optional implementation, the method further includes:
acquiring identity information corresponding to a second face to be registered and fifth video data containing the second face;
extracting a face feature matrix of the second face according to the fifth video data;
and storing the identity information corresponding to the second face and the face feature matrix of the second face in the face feature database.
In this implementation, the user's identity information can be registered quickly and the user's face feature information can be stored.
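A minimal sketch of the registration flow, reusing the per-frame extraction and fusion helpers sketched earlier; feature_net, fuse, and feature_db are placeholders, not names from the patent.

```python
# Sketch: extract and fuse the second face's features from the fifth video data,
# then store the fused matrix under the corresponding identity information.
def register(identity_info, frames, feature_net, fuse, feature_db):
    feats = [feature_net(frame) for frame in frames]   # per-frame face feature matrices
    feature_db[identity_info] = fuse(feats)            # store fused face feature matrix
    return feature_db
```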
In a second aspect, the present application provides a face recognition apparatus, which includes means for performing the method of the first aspect and any implementation manner of the first aspect.
In a third aspect, an embodiment of the present invention provides another face recognition apparatus, including a processor and a memory, where the processor and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to the first aspect and any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the first aspect and the method of any implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the embodiments or background of the present invention will be described below.
Fig. 1 is a schematic diagram of an application scenario of face recognition provided in the present application;
fig. 2 is a schematic structural diagram of a face recognition apparatus provided in the present application;
fig. 3 is a schematic structural diagram of another face recognition apparatus provided in the present application;
FIG. 4 is a schematic structural diagram of a convolutional neural network provided in the present application;
FIG. 5 is a schematic flow chart of a method for training a feature extraction network according to the present application;
FIG. 6 is a schematic flow chart illustrating a method for training a feature weight network according to the present application;
fig. 7 is a flowchart illustrating a method for registering feature information and identity information according to the present application;
fig. 8 is a schematic flow chart of a face recognition method provided in the present application;
fig. 9 is a schematic structural diagram of another face recognition apparatus provided in the present application;
fig. 10 is a schematic structural diagram of a neural network processor provided in the present application.
Detailed Description
Fig. 1 is a schematic diagram of an application scenario provided in the present application, as shown in fig. 1, 101 denotes input video data (video stream), 102 denotes a face recognition apparatus, and 103 denotes an output face recognition result; the video data 101 includes a face to be recognized, and the face recognition device 102 is configured to extract face feature information of the face to be recognized from the video data 101, perform face recognition by using the face feature information, and output a face recognition result. The face recognition device 102 in fig. 1 may adopt the face recognition device provided in the present application, so as to improve the accuracy of face recognition.
Fig. 2 is a schematic structural diagram of a face recognition device according to the present application. As shown in fig. 2, the face recognition apparatus 200 includes: a first input unit 201, a feature extraction unit 202, a feature fusion unit 203, and a face recognition unit 204;
a first input unit 201 for inputting n-frame images contained in first video data to a feature extraction network;
a feature extraction unit 202, configured to perform face feature extraction on the n frames of images contained in the first video data using the feature extraction network, to obtain n face feature matrices in one-to-one correspondence with the n frames of images;
the feature fusion unit 203 is configured to fuse the n face feature matrices to obtain a target face feature matrix corresponding to a face to be recognized;
and the face recognition unit 204 is configured to perform face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result.
The first video data contains T frames of images, and the n frames of images are any n frames, among the T frames, that contain the face to be recognized. The feature extraction network is obtained by training with a first generative adversarial learning method, which includes: training a first generation network and a first adversarial network, and using the converged first generation network as the feature extraction network. The training data used to train the first generation network includes first positive examples and first negative examples: the first positive examples include m face feature matrices extracted from second video data and in one-to-one correspondence with the m frames of images contained in the second video data, and the first negative examples include f face feature matrices extracted from f pictures and in one-to-one correspondence with those pictures; the second video data and the f pictures both contain a first face. The first adversarial network is used to perform positive/negative discrimination on the face feature matrices extracted by the first generation network to obtain a first discrimination result; the first discrimination result is used, together with the first positive examples and the first negative examples, to calculate a first loss value, and the first loss value is used to update the parameters of the first generation network and the parameters of the first adversarial network until the first generation network converges. Here T ≥ 2, 2 ≤ n ≤ T, m ≥ 2, and f ≥ 1.
In the application, the face recognition device performs face recognition by using the face feature matrix obtained by fusing a plurality of face feature matrices extracted from video data, so that the accuracy of face recognition can be improved.
Fig. 3 is a schematic structural diagram of another face recognition apparatus provided in the present application, which is based on the apparatus shown in fig. 2 and is further detailed. As shown in fig. 3, the face recognition apparatus 300 includes: a first input unit 301, a feature extraction unit 302, a feature weighting unit 303, a feature fusion unit 304, and a face recognition unit 305. The face recognition device can execute the following face recognition operations:
3001. the first input unit 301 inputs the n-frame image included in the first video data to the feature extraction unit 302.
Optionally, the first input unit 301 disassembles the first video data into T-frame images in time sequence, and outputs any n frames of images containing faces to be recognized in the T-frame images to the feature extraction unit 302.
3002. Feature extraction section 302 extracts face features from each of the n frames of images, and outputs the extracted n face feature matrices to feature weighting section 303.
The n face feature matrices are in one-to-one correspondence with the n frames of images. The order in which the face feature matrices are input to the feature weight network is the same as the order, in the first video data, of the image frames to which they correspond.
3003. Feature extraction section 302 outputs the n-person face feature matrix to feature fusion section 304.
3004. The feature weighting unit 303 sequentially determines weight values or weight matrices corresponding to the n face feature matrices, and outputs the determined weight values or weight matrices of the face feature matrices to the feature fusion unit 304.
The function of the feature weight unit 303 may be implemented by a feature weight network. The feature weighting unit 303 may determine weight values or weight matrices corresponding to the n face feature matrices, respectively, in sequence according to the state vector of the feature weighting network when each face feature matrix in the n face feature matrices is input; the state vector of the feature weight network at the current moment is related to the state vector of the feature weight network at the previous moment of the current moment and the input face feature matrix. Specifically, the following formula may be adopted to calculate the weight value or the weight matrix of the face feature matrix:
s_t = U·f_t + W·s_{t−1}
o_t = V·s_t

where U, V, and W are parameters of the feature weight network; f_t is the face feature matrix input to the feature weight network; s_t is the state vector of the feature weight network at the current time t; s_{t−1} is the state vector of the feature weight network at the previous time t−1, i.e. its state vector when the face feature matrix f_t is input; and o_t is the computed weight value or weight matrix.
3005. The feature fusion unit 304 fuses the face feature matrices input by the feature extraction unit 302, using the weight values or weight matrices input by the feature weight unit 303, to obtain the target face feature matrix, and outputs the target face feature matrix to the face recognition unit 305.
The feature fusion unit 304 is specifically configured to calculate the weighted sum of the n face feature matrices according to the weight values or weight matrices respectively corresponding to them, to obtain the target face feature matrix of the face to be recognized. Optionally, the fused face feature matrix is calculated with the following formula:
f_v = Σ_{t=1}^{n} f_t·Q_t

where f_t is the t-th face feature matrix, Q_t is the weight value or weight matrix of the t-th face feature matrix, and f_v is the fused face feature matrix, i.e. the target face feature matrix. It can be understood that if Q_t is a weight value of the t-th face feature matrix, then f_t·Q_t is the matrix obtained by multiplying f_t by the scalar Q_t; if Q_t is a weight matrix of the t-th face feature matrix, then f_t·Q_t is the matrix obtained by element-wise (dot) multiplication of Q_t and f_t.
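A short NumPy sketch of this fusion formula, covering both cases described above: Q_t as a scalar weight value and Q_t as a weight matrix applied element-wise.

```python
# NumPy sketch of f_v = sum_t f_t * Q_t for scalar or same-shaped matrix weights.
import numpy as np

def fuse(feature_mats, weights):
    f_v = np.zeros_like(feature_mats[0], dtype=float)
    for f_t, q_t in zip(feature_mats, weights):
        f_v += f_t * q_t   # scalar q_t: ordinary scaling; matrix q_t: element-wise product
    return f_v
```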
3006. The face recognition unit 305 performs face recognition according to the target face feature matrix to obtain a face recognition result.
According to the face recognition method and device of this application, the fused face features extracted from the video data are used for face recognition, and therefore the accuracy of face recognition can be improved.
In an optional implementation, the feature extraction unit 302 is further configured to perform face feature extraction on the g frames of images contained in fourth video data, to obtain g face feature matrices in one-to-one correspondence with the g frames of images, where g ≥ 2;
the feature fusion unit 304 is further configured to fuse the g face feature matrices to obtain a first face feature matrix;
the face recognition unit 305 is configured to calculate the similarity between the target face feature matrix and the first face feature matrix; when the similarity exceeds a threshold, obtain a face recognition result indicating that the first video data and the fourth video data correspond to the same person; and when the similarity does not exceed the threshold, obtain a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
The threshold may be 80%, 90%, 99%, 99.5%, etc.
In this implementation, whether two pieces of video data contain the same face can be determined quickly and accurately.
In an alternative implementation manner, the face recognition unit 305 is specifically configured to calculate a similarity between the target face feature matrix and a face feature matrix in a face feature database; the face feature database comprises at least one face feature matrix and identity information corresponding to the at least one face feature matrix; determining a second face feature matrix with the highest similarity to the target face feature matrix in the face feature database and identity information corresponding to the second face feature matrix; the face recognition result comprises identity information corresponding to the second face feature matrix.
The storage device of the face recognition apparatus 300 stores the face feature database, or the face recognition apparatus 300 acquires the face feature database from a corresponding server by using a transceiver.
In this implementation, the identity information corresponding to the target face feature matrix can be determined accurately.
In an alternative implementation manner, as shown in fig. 3, the face recognition apparatus 300 further includes:
an obtaining unit 306, configured to obtain identity information corresponding to a second face to be registered and fifth video data including the second face;
a feature extraction unit 302, configured to extract a face feature matrix corresponding to the second face according to the fifth video data;
a storage unit 307, configured to store the identity information corresponding to the second face and the face feature matrix corresponding to the second face in the face feature database.
In this implementation, the user's identity information can be registered quickly and the user's face features can be stored.
In this application, the function of the feature extraction unit 302 is implemented by a feature extraction network, and the function of the feature weighting unit 303 is implemented by a feature weighting network. The feature extraction network may be a convolutional neural network. The feature weight network is a recurrent neural network.
Convolutional Neural Network (CNN) is a deep neural network with a Convolutional structure, and is a deep learning (deep learning) architecture. The deep learning architecture refers to an architecture in which learning is performed at a plurality of levels at different abstraction levels by a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.
As shown in fig. 4, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
Convolutional layer/pooling layer 120:
and (3) rolling layers:
as shown in FIG. 4, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, it may include many convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined; during the convolution operation on an image, the weight matrix is typically moved over the input image pixel by pixel in the horizontal direction (or two pixels at a time, depending on the value of the stride), thereby extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends across the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases multiple weight matrices of the same dimension are used instead of a single one. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps they extract also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract more complex features, such as features with high-level semantics; features with higher semantics are more suitable for the problem to be solved.
A pooling layer:
since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce pooling layers after the convolutional layer, i.e. the layers 121-126 as illustrated by 120 in fig. 4, may be one convolutional layer followed by one pooling layer, or may be multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a particular range to produce an average. The max pooling operator may take the pixel with the largest value within a particular range as a result of the max pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (class information or other relevant information as desired), the convolutional neural network 100 needs to generate one or a set of desired outputs using the neural network layer 130. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 4) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the plurality of hidden layers in the neural network layer 130, that is, the last layer of the entire convolutional neural network 100, that is, the output layer 140 has a loss function similar to the class cross entropy, and is specifically used for calculating the prediction error, once the forward propagation (i.e., the propagation from 110 to 140 in fig. 4) of the entire convolutional neural network 100 is completed, the backward propagation (i.e., the propagation from 140 to 110 in fig. 4 is the backward propagation) starts to update the weight values and the bias of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 4 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
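For illustration, a minimal PyTorch sketch of a network with the layer types described above (convolution, pooling, and fully connected layers producing a feature vector) follows; the channel counts and the feature dimension are assumptions, not taken from the patent.

```python
# Minimal sketch of an input -> conv/pool -> fully connected feature extractor.
import torch.nn as nn

class FaceFeatureCNN(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv_pool = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, feature_dim),
        )

    def forward(self, x):   # x: a batch of face images, shape (B, 3, H, W)
        return self.head(self.conv_pool(x))
```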
Recurrent Neural Networks (RNNs) are used to process sequence data. In a traditional neural network model, the layers from the input layer through the hidden layers to the output layer are fully connected, while the nodes within each layer are not connected to each other. Although such ordinary neural networks solve many problems, they are still inadequate for many others. For example, to predict the next word in a sentence, the previous words are usually needed, because the words in a sentence are not independent. RNNs are called recurrent neural networks because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes of the hidden layer are connected to each other, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN: the error back-propagation algorithm is also used, with one difference — if the RNN is unrolled, its parameters, such as W, are shared, which is not the case in the conventional neural networks described above. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the network states of the previous steps. This learning algorithm is called Back Propagation Through Time (BPTT).
Why is a recurrent neural network needed when convolutional neural networks already exist? The reason is simple: a convolutional neural network assumes that its elements are independent of each other, as are its inputs and outputs, such as cats and dogs. In the real world, however, many elements are interconnected, for example a stock price changing over time, or the sentence "I like traveling; my favorite place is Yunnan; when I have the chance I will definitely go to ____." To fill in the blank, humans all know to answer "Yunnan", because humans infer from the context. How can a machine do this? This is why RNNs were created: to give machines a memory like humans. The output of an RNN therefore depends on the current input information and on the memorized historical information.
The convolutional layers of the feature extraction network perform convolution operations on each frame of image contained in the input video data, thereby extracting the face features of each frame, and the extracted face features are input to the pooling layer; the pooling layer inputs the processed face features to the neural network layer; and the neural network layer 130 generates one or a group of required face feature matrices from the input face features. In this application, face features are extracted with a convolutional neural network, which is efficient.
The feature weight network uses a recurrent neural network with a memory function, which can introduce the temporal relationship among the face feature matrices when determining their corresponding weight values or weight matrices, so that the face feature information contained in the video data is extracted better by exploiting the temporal relationship among the frames of face images.
In the present application, a feature extraction network and a feature weight network need to be trained, and both are obtained by training with generative adversarial learning. Fig. 5 and fig. 6 below provide methods of training the feature extraction network and the feature weight network, respectively.
Fig. 5 is a flowchart illustrating a method for training a feature extraction network according to the present application. As shown in fig. 5, the method may include:
501. and inputting m frames of images contained in second video data into a first generation network for feature extraction respectively to obtain the m personal face feature matrixes corresponding to the m frames of images one by one.
Optionally, the second video data is split into single-frame images in time order, and the m frames of images included in the second video data are input to the first generation network.
502. Inputting f pictures into the first generation network for feature extraction respectively, to obtain f face feature matrices corresponding to the f pictures one to one.
The second video data and the f pictures both comprise a first face.
503. Inputting, by the first generation network, the m face feature matrices and the f face feature matrices into a first discrimination network for positive and negative example discrimination, to obtain a first discrimination result.
504. Calculating the first adversarial loss to obtain a first loss value.
Calculating the first adversarial loss may be calculating a difference between the first discrimination result and a first actual result. The first actual result refers to the actual positive and negative example results of the face feature matrices, that is, the actual positive or negative example result of each of the m face feature matrices and the f face feature matrices. The first loss value is calculated by using a first loss function. Optionally, the first loss function may take the following cross-entropy form:

L_AFL = -(1/N) Σ_{i=1}^{N} [ d_i·log(d̂_i) + (1 - d_i)·log(1 - d̂_i) ]

where d_i is the positive/negative example type of the i-th face feature matrix, that is, the actual positive or negative example result of the i-th face feature matrix; N is the total number of positive and negative examples, that is, the total number of face feature matrices input to the first discrimination network; d̂_i is the positive/negative example discrimination result of the i-th face feature matrix; d̂_i is 0 or 1 and d_i is 0 or 1, where one of 0 and 1 represents a positive example and the other represents a negative example. The loss value L_AFL calculated by the first loss function is used to characterize the difference between the first discrimination result and the first actual result. In the first actual result, the m face feature matrices extracted from the second video data are positive examples, and the f face feature matrices extracted from the f pictures are negative examples.
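As a minimal illustration only, and assuming the cross-entropy form given above (the function and variable names are not from this application):

```python
import numpy as np

def first_adversarial_loss(d_hat, d):
    """Cross-entropy between discrimination results d_hat and actual labels d (1 = positive example)."""
    d_hat = np.clip(d_hat, 1e-7, 1 - 1e-7)   # numerical safety only, not part of the formula
    return -np.mean(d * np.log(d_hat) + (1 - d) * np.log(1 - d_hat))

# The m feature matrices from video frames are positive examples, the f from pictures are negative.
m, f = 4, 3
d = np.concatenate([np.ones(m), np.zeros(f)])
d_hat = np.random.rand(m + f)                # stand-in discriminator outputs for the m + f matrices
first_loss_value = first_adversarial_loss(d_hat, d)
```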
505. Calculating the first classification loss to obtain a second loss value.
Calculating the second loss value may be performing face recognition by using the m face feature matrices and calculating, according to the face recognition result, the probability of obtaining an incorrect face recognition result. The second loss value is calculated by using a second loss function. Optionally, the second loss function may take the following softmax cross-entropy form:

L_CL1 = -(1/M) Σ_{i=1}^{M} Σ_{j=1}^{T} 1{y_i = j}·log(s_ij)

where y_i is the actual category label corresponding to the i-th face feature matrix; s_ij is the calculated probability that the i-th face feature matrix belongs to the j-th category, that is, the calculated probability that the category label corresponding to the i-th face feature matrix is that of the j-th category; T is the total number of categories, each category corresponding to one category label; M is the number of input face feature matrices, the input face feature matrices all being contained in the m face feature matrices; and 1{·} is the indicator function.
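A corresponding sketch of the classification loss, under the same caveat that the names and the simple indexing are assumptions:

```python
import numpy as np

def first_classification_loss(s, y):
    """s[i, j]: probability that the i-th face feature matrix belongs to class j; y[i]: its actual class."""
    M = s.shape[0]
    correct_class_probs = np.clip(s[np.arange(M), y], 1e-7, 1.0)   # probability given to the true class
    return -np.mean(np.log(correct_class_probs))
```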
506. Updating the first discrimination network.
The first discrimination network is updated by using a back-propagation algorithm. Optionally, the following formula is adopted to update the parameter of the first discrimination network:

w'_1 = w_1 - ρ·(∂L_AFL/∂w_1)

where w_1 represents the parameter of the first discrimination network before the update, w'_1 represents the parameter obtained by updating w_1 using the first loss function, ∂L_AFL/∂w_1 represents the gradient value of the first loss function with respect to w_1, and ρ represents the learning rate. The learning rate may be set empirically or may be adjusted based on some strategy, such as adaptive dynamic adjustment.
507. Updating the first generation network.
The first generation network is updated by using a back-propagation algorithm with the following loss function:

L1 = α·L_AFL + β·L_CL1

where α and β are constants corresponding to the weights of the first adversarial loss and the first classification loss, respectively, L_AFL is the first loss function, and L_CL1 is the second loss function. Optionally, the following formula is adopted to update the parameters of the first generation network:

w'_2 = w_2 - θ·(∂L1/∂w_2)

where w_2 represents the parameter of the first generation network before the update, w'_2 represents the parameter obtained by updating w_2 using the loss function L1, ∂L1/∂w_2 represents the gradient value of the loss function L1 with respect to w_2, and θ represents the learning rate.
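Illustratively, steps 506 and 507 reduce to a weighted loss and a plain gradient-descent step; the helper names below are hypothetical, and in practice the gradients would come from automatic differentiation:

```python
def combined_generator_loss(l_afl, l_cl1, alpha=1.0, beta=1.0):
    """L1 = alpha * L_AFL + beta * L_CL1, the loss used to update the first generation network."""
    return alpha * l_afl + beta * l_cl1

def gradient_step(w, grad_w, lr):
    """Generic back-propagation update w' = w - lr * dL/dw, as used for both networks."""
    return w - lr * grad_w
```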
508. Determining whether a weighted sum of the first classification loss and the first adversarial loss converges.
If yes, proceed to 509; otherwise, perform 501. Specifically, it is determined whether the loss function L1 converges. The convergence condition may be that the number of iterations reaches a first threshold, or that the absolute value of the difference between the loss value obtained at the N-th iteration of the loss function L1 and the loss value obtained at the (N+K)-th iteration is smaller than a second threshold. The first threshold may be one hundred thousand, ten thousand, one hundred, or the like. The second threshold may be 0.01, 0.001, 0.0001, or the like.
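A sketch of the stopping test in step 508, following the reading of the two thresholds given above (the default values are illustrative assumptions):

```python
def has_converged(loss_history, K=10, eps=0.001, max_iters=100000):
    """Stop when |L1(N) - L1(N+K)| < eps (second threshold) or after max_iters iterations (first threshold)."""
    if len(loss_history) >= max_iters:
        return True
    if len(loss_history) <= K:
        return False
    return abs(loss_history[-1] - loss_history[-1 - K]) < eps
```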
509. The training is stopped.
In the present application, the feature extraction network is obtained by training a generative adversarial network, so that the face feature matrix extracted by the trained feature extraction network from each frame image contained in video data has a high similarity to the face feature matrix extracted from a still picture.
After the feature extraction network is obtained by training, that is, after the feature extraction unit 302 finishes training, the feature weight network is further obtained by training using an adversarial learning method. Fig. 6 is a flowchart illustrating a method for training a feature weight network according to the present application. As shown in fig. 6, the method may include:
601. The feature extraction network performs feature extraction on k frames of images contained in third video data respectively, to obtain k face feature matrices corresponding to the k frames of images one to one, and outputs the k face feature matrices to the second generation network.
602. The second generation network determines k weight values or weight matrices corresponding to the k face feature matrices one to one.
Optionally, the formula used by the second generation network to calculate the weight value or weight matrix corresponding to each face feature matrix in the k face feature matrices is as follows:

s_t = U·f_t + W·s_{t-1}
o_t = V·s_t

where U, V and W are parameters of the feature weight network; f_t is the face feature matrix input to the feature weight network; s_t is the state vector of the feature weight network at the current moment (t); s_{t-1} is the state vector of the feature weight network at the moment (t-1) previous to the current moment, that is, the state vector of the feature weight network when the face feature matrix f_t is input; and o_t is the calculated weight value or weight matrix.
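A sketch of steps 602 and 603 under the formulas above; treating the face feature matrices as flattened vectors and applying the weights elementwise is one possible reading, stated here as an assumption:

```python
import numpy as np

def compute_weights_and_fuse(feature_matrices, U, W, V, s0):
    """Run the feature weight network over the per-frame features in frame order, then fuse them."""
    s = s0
    weights = []
    for f_t in feature_matrices:
        s = U @ f_t + W @ s          # state update s_t = U*f_t + W*s_{t-1}
        weights.append(V @ s)        # weight o_t = V*s_t
    reference = sum(o_t * f_t for o_t, f_t in zip(weights, feature_matrices))  # weighted sum (step 603)
    return weights, reference
```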
603. Fusing the k face feature matrices to obtain a reference face feature matrix.
The specific implementation is the same as the previously provided way of fusing multiple face feature matrices.
604. Generating h simulated face feature matrices corresponding to the reference face feature matrix.
Generating the h simulated face feature matrices corresponding to the reference face feature matrix may be determining normal distribution parameters N(u, σ) corresponding to the reference face feature matrix, and generating the h simulated face feature matrices according to the normal distribution parameters.
The formulas for calculating the normal distribution parameters are as follows:

u = f_v
σ² = (1/N2) Σ_{i=1}^{N2} (f_i - f_v)²

where f_v is the above-mentioned reference face feature matrix, f_i is the i-th face feature matrix, and N2 is the number of extracted face feature matrices, that is, k.
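A sketch of step 604 under the formulas above; the per-element standard deviation and the sampling call are the straightforward reading, stated here as an assumption:

```python
import numpy as np

def simulate_face_features(f_v, feature_matrices, h):
    """Draw h simulated face feature matrices from N(u, sigma) with u = f_v (the reference feature)."""
    feats = np.stack(feature_matrices)                      # shape (N2, ...) with N2 = k
    sigma = np.sqrt(((feats - f_v) ** 2).mean(axis=0))      # per-element standard deviation around f_v
    return [np.random.normal(f_v, sigma) for _ in range(h)]
```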
605. Inputting the h simulated face feature matrices and the k face feature matrices into a second discrimination network for positive and negative example discrimination, to obtain a second discrimination result.
606. Calculating the second adversarial loss to obtain a third loss value.
Calculating the second adversarial loss may be calculating a difference between the second discrimination result and a second actual result. The second actual result is the actual positive and negative example results of the face feature matrices, that is, the actual positive or negative example result of each face feature matrix in the h simulated face feature matrices and the k face feature matrices. The third loss value is calculated by using a third loss function. Optionally, the third loss function may take the following cross-entropy form:

L_AAL = -(1/N1) Σ_{i=1}^{N1} [ b_i·log(b̂_i) + (1 - b_i)·log(1 - b̂_i) ]

where b_i is the actual positive or negative example result of the i-th face feature matrix, that is, the actual positive/negative example type of the i-th face feature matrix; N1 is the total number of positive and negative examples, that is, the total number of face feature matrices input to the second discrimination network; b̂_i is the positive/negative example discrimination result of the i-th face feature matrix; b̂_i is 0 or 1 and b_i is 0 or 1, where one of 0 and 1 represents a positive example and the other represents a negative example. The loss value calculated by the third loss function is used to characterize the difference between the second discrimination result and the second actual result. In the second actual result, the h simulated face feature matrices generated from the reference face feature matrix are negative examples, and the k face feature matrices extracted from the k frames of images contained in the third video data are positive examples.
607. Calculating the second classification loss to obtain a fourth loss value.
The fourth loss value is calculated by using a fourth loss function and is used to characterize the probability of obtaining an incorrect face recognition result when face recognition is performed using the reference face feature. Optionally, the fourth loss function may take the following form:

L_CL2 = -Σ_{j=1}^{T} 1{y = j}·log(s_j)

where y is the actual category label of the third video data; s_j is the calculated probability, computed using the reference face feature, that the third video data belongs to the j-th category, j being the category label of the j-th category; T is the total number of categories, each category corresponding to one category label; and 1{·} is the indicator function.
608. Updating the second discrimination network.
The second discrimination network is updated by using a back-propagation algorithm. Optionally, the following formula is adopted to update the parameter of the second discrimination network:

w'_3 = w_3 - λ·(∂L_AAL/∂w_3)

where w_3 represents the parameter of the second discrimination network before the update, w'_3 represents the parameter obtained by updating w_3 using the third loss function, ∂L_AAL/∂w_3 represents the gradient value of the third loss function with respect to w_3, and λ represents the learning rate.
609. Updating the second generation network.
The second generation network is updated by using a back-propagation algorithm with the following loss function:

L2 = a·L_AAL + b·L_CL2

where a and b are constants corresponding to the weights of the second adversarial loss and the second classification loss, respectively, L_AAL is the third loss function, and L_CL2 is the fourth loss function. Optionally, the following formula is adopted to update the parameters of the second generation network:

w'_4 = w_4 - e·(∂L2/∂w_4)

where w_4 represents the weight parameter of the second generation network before the update, w'_4 represents the weight parameter obtained by updating w_4 using the loss function L2, ∂L2/∂w_4 represents the gradient value of the loss function L2 with respect to w_4, and e represents the learning rate.
610. Determining whether a weighted sum of the second classification loss and the second adversarial loss converges.
If yes, 611 is executed; otherwise, 602 is performed. Specifically, it is determined whether the loss function L2 converges.
611. The training is stopped.
In this application, the feature weight network is obtained by training a generative adversarial network, so that the training efficiency is high.
In the face recognition method, a face feature matrix extracted from video data is compared with a face feature matrix in a face feature database, and identity information of people contained in the video data is determined according to a comparison result. The face feature database contains face feature information of at least one person. It will be appreciated that a database of facial features may need to be obtained prior to face recognition. Fig. 7 is a flowchart illustrating a method for registering feature information and identity information according to the present application. As shown in fig. 7, the method may include:
701. The face recognition device obtains fifth video data containing a second face.
702. Extracting the face feature matrix corresponding to the fifth video data.
Optionally, a face feature matrix corresponding to the fifth video data is extracted in a manner shown in fig. 2.
703. Obtaining the identity information corresponding to the second face.
704. Registering the identity information of the second face and the face feature matrix corresponding to the fifth video data.
In this application, the feature information and the identity information can be quickly registered, so as to provide a face feature database for face recognition.
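As a trivial illustration of steps 701-704, a face feature database can be as simple as a keyed store (the in-memory dictionary here is an assumption for the sketch, not the storage form used by this application):

```python
face_feature_db = {}   # identity information -> fused face feature matrix

def register_face(identity, feature_matrix):
    """Store the identity information together with the face feature matrix extracted from the video."""
    face_feature_db[identity] = feature_matrix
```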
The face recognition device in the application can also determine whether two or more video data contain the same person. Fig. 8 is a schematic flow chart of another face recognition method provided in the present application, and as shown in fig. 8, the method may include:
801. The face recognition device obtains first video data and fourth video data.
802. Extracting a target face feature matrix corresponding to the first video data and a first face feature matrix corresponding to the fourth video data.
The extraction of the face feature matrix from the video data is performed in the same manner as in fig. 2 and will not be described in detail here.
803. Calculating the similarity between the target face feature matrix and the first face feature matrix.
804. Judging whether the similarity exceeds a threshold value;
if the similarity exceeds the threshold, 805 is performed; otherwise, 806 is performed.
805. Outputting a first recognition result indicating that the first video data and the fourth video data correspond to the same person.
806. Outputting a second recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
In the application, the face recognition device can accurately determine whether two or more video data contain the same person.
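A sketch of steps 803-806; cosine similarity and the 0.8 threshold are illustrative choices, since the application does not fix a particular similarity measure here:

```python
import numpy as np

def same_person(target_feature, first_feature, threshold=0.8):
    """Return whether the two fused face feature matrices are judged to belong to the same person."""
    a, b = target_feature.ravel(), first_feature.ravel()
    similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return similarity > threshold, similarity
```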
Referring to fig. 9, a schematic block diagram of another face recognition apparatus provided in the present application is shown. As shown in fig. 9, the face recognition apparatus may include: one or more processors 901, one or more input devices 902, one or more output devices 903, a memory 904, and a receiver 905. The processor 901, the input device 902, the output device 903, the memory 904, and the receiver 905 are connected via a bus 906. The memory 904 is used for storing a computer program comprising program instructions, and the processor 901 is used for executing the program instructions stored in the memory 904. The processor 901 is configured to call the program instructions to execute: inputting n frames of images contained in first video data into a feature extraction network to respectively extract face features, to obtain n face feature matrices corresponding to the n frames of images one to one; fusing the n face feature matrices to obtain a target face feature matrix of the face to be recognized; and performing face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result, where n is greater than or equal to 2. The input device 902 may include a camera. The face recognition apparatus may collect the first video data through the camera, obtain the first video data from the memory 904, or obtain the first video data from a database through the receiver 905, where the database may be located at a server or in the cloud.
It should be understood that, in the embodiment of the present invention, the processor 901 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The processor 901 can implement the functions of the feature extraction unit 302, the feature fusion unit 304, the feature weighting unit 303, and the face recognition unit 307 shown in fig. 3.
The memory 904 includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), or a portable Read-Only Memory (CD-ROM), and can be used to store related instructions and data. The memory 904 may implement the functions of the storage unit 108.
In a specific implementation, the processor 901, the input device 902, the output device 903, and the memory 904 described in the embodiment of the present invention may execute the implementation described in the face recognition method provided in the embodiment of the present invention, and may also execute the implementation of the face recognition apparatus described in the embodiment of the present invention, which is not described herein again. The input device 902 may implement the functions of the acquisition unit 307 and the first input unit 301 in fig. 3.
It should be understood that the face recognition apparatus according to the embodiment of the present invention may correspond to the apparatus for implementing face recognition shown in fig. 1 in the embodiment of the present invention, and may correspond to a corresponding main body for implementing the method for implementing face recognition in the embodiment of the present invention, and functions of each unit in the face recognition apparatus are respectively corresponding to processes of each method in the foregoing embodiments, and are not described herein again for brevity.
In another embodiment of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, the computer program comprising program instructions that when executed by a processor implement: inputting n frames of images contained in first video data into a feature extraction network to respectively extract human face features to obtain n personal face feature matrixes which are in one-to-one correspondence with the n frames of images; fusing the n personal face feature matrixes to obtain a target face feature matrix of the face to be recognized; carrying out face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result; wherein n is more than or equal to 2.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described above in accordance with the embodiments of the invention may be generated, in whole or in part, when the computer program instructions described above are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable devices. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a Solid State Drive (SSD).
The neural network based algorithm in the foregoing embodiment may be implemented in the NPU chip shown in fig. 10.
The neural network processor 50 may be any processor suitable for large-scale exclusive-or operation processing, such as an NPU, a TPU, or a GPU. Taking the NPU as an example: the NPU can be mounted on a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core portion of the NPU is the arithmetic circuit 1003, and the controller 1004 controls the arithmetic circuit 1003 to extract matrix data from the memories (1001 and 1002) and perform multiply-add operations.
In some implementations, the arithmetic circuit 1003 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1003 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1002 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1001 and performs a matrix operation with the matrix B, and a partial result or the final result of the obtained matrix is stored in an accumulator 1008.
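A software analogue of the accumulation described above (the matrix sizes are arbitrary assumptions; this only mirrors the arithmetic, not the hardware):

```python
import numpy as np

A = np.random.rand(4, 8)    # input matrix, as if read from the input memory
B = np.random.rand(8, 5)    # weight matrix, as if read from the weight memory
C = np.zeros((4, 5))
for k in range(A.shape[1]):                 # accumulate partial (rank-1) results, like the accumulator
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)                # the final result equals the full matrix product
```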
The unified memory 1006 is used for storing input data and output data. The weight data is directly transferred to the weight memory 1002 through a Direct Memory Access Controller (DMAC) 1005. The input data is also carried into the unified memory 1006 by the DMAC.
The Bus Interface Unit (BIU) 1010 is used for the interaction among the DMAC, the instruction fetch memory 1009 and the external memory: through the bus interface unit 1010, the instruction fetch memory 1009 obtains instructions from the external memory, and the storage unit access controller 1005 obtains the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1006 or to transfer weight data into the weight memory 1002 or to transfer input data into the input memory 1001.
The vector calculation unit 1007 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for network computation at non-convolutional layers or fully connected layers in a neural network, such as pooling and normalization.
In some implementations, the vector calculation unit 1007 can store the processed output vector to the unified memory 1006. For example, the vector calculation unit 1007 may apply a non-linear function to the output of the arithmetic circuit 1003, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 1007 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 1003, for example, for use in subsequent layers in a neural network.
An instruction fetch buffer 1009 connected to the controller 1004, for storing instructions used by the controller 1004; the unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch memory 1009 are On-Chip memories. The external memory is independent of the NPU hardware architecture.
The operation of each layer in the neural network may be performed by the operation circuit 1003 or the vector calculation unit 1007.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (28)

1. A face recognition method, comprising:
inputting n frames of images contained in first video data into a feature extraction network to respectively extract human face features to obtain n personal face feature matrixes corresponding to the n frames of images one to one, wherein the first video data contains T frames of images, and the n frames of images are any n frames of images containing human faces to be recognized in the T frames of images; the feature extraction network is obtained by training by adopting a first generative adversarial learning method, and the first generative adversarial learning method comprises the following steps: training a first generation network and a first adversarial network, and taking the converged first generation network as the feature extraction network; the training data used for training the first generation network comprises a first positive example and a first negative example, the first positive example comprises an m-personal-face feature matrix extracted from second video data, the first negative example comprises an f-personal-face feature matrix extracted from f pictures, the m-personal-face feature matrix is in one-to-one correspondence with m frames of images contained in the second video data, the f-personal-face feature matrix is in one-to-one correspondence with the f pictures, and the second video data and the f pictures both contain a first face; the first adversarial network is used for judging whether the face feature matrix extracted by the first generation network is positive or negative to obtain a first judgment result; the first judgment result is used for calculating with the first positive example and the first negative example to obtain a first loss value, and the first loss value is used for updating the parameters of the first generation network and the parameters of the first adversarial network until the first generation network converges;
fusing the n personal face feature matrixes to obtain a target face feature matrix of the face to be recognized;
carrying out face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result; wherein T is more than or equal to 2, n is more than or equal to 2 and less than or equal to T, m is more than or equal to 2, and f is more than or equal to 1.
2. The face recognition method of claim 1, wherein the first generation network is further configured to update parameters of the first generation network with a weighted sum of the first loss value and the second loss value; and the second loss value represents the possibility of carrying out face recognition by using the m personal face feature matrixes to obtain wrong face recognition results.
3. The face recognition method according to claim 1, wherein the n personal face feature matrices are homomorphic matrices, and the obtaining of the target face feature matrix of the face to be recognized by fusing the n personal face feature matrices includes:
and calculating the weighted sum of the n personal face feature matrices according to the weighted values or the weighted matrices corresponding to the n personal face feature matrices respectively to obtain a target face feature matrix of the face to be recognized.
4. The method according to claim 3, wherein before calculating a weighted sum of the n face feature matrices according to weight values or weight matrices corresponding to the n face feature matrices, and obtaining a target face feature matrix of the face to be recognized, the method further comprises:
sequentially inputting the n personal face feature matrixes to a feature weight network obtained by training; the characteristic weight network is a recurrent neural network; the sequence of inputting each face feature matrix in the n face feature matrices to the feature weight network is the same as the sequence of the image frames corresponding to each face feature matrix in the n face feature matrices in the first video data;
the feature weight network determines weight values or weight matrixes corresponding to the n face feature matrixes respectively according to state vectors of the feature weight network when the face feature matrixes in the n face feature matrixes are input; and the state vector of the feature weight network at the current moment is related to the state vector of the feature weight network at the previous moment of the current moment and the input human face feature matrix.
5. The face recognition method of claim 4, wherein the feature weight network is trained by a second generative adversarial learning method, the second generative adversarial learning method comprising: training a second generation network and a second adversarial network, and taking the obtained converged second generation network as the feature weight network; the training data used for training the second generation network comprises a second positive example and a second negative example, the second positive example comprises k personal face feature matrixes extracted from k frames of images contained in third video data, the second negative example comprises h simulated face feature matrixes generated according to a reference face feature matrix, the k personal face feature matrixes correspond to the k frames of images one by one, and the reference face feature matrix is obtained by calculating a weighted sum of the k personal face feature matrixes according to the weights respectively corresponding to the k personal face feature matrixes determined by the second generation network; the second adversarial network is used for judging the positive and negative examples of the h simulated face feature matrixes and the k personal face feature matrixes to obtain a second judgment result; the second judgment result is used for calculating with the second positive example and the second negative example to obtain a third loss value, and the third loss value is used for updating the parameters of the second generation network and the parameters of the second adversarial network until the second generation network converges; wherein k is more than or equal to 2, and h is more than or equal to 2.
6. The face recognition method of claim 5, wherein the second generation network is further configured to update parameters of the second generation network with a weighted sum of the third loss value and a fourth loss value; and the fourth loss value represents the possibility of carrying out face recognition by using the reference face features to obtain an incorrect face recognition result.
7. The face recognition method according to claim 5 or 6, wherein the h simulated face feature matrices are face feature matrices generated according to normal distribution parameters corresponding to the reference face features.
8. The face recognition method according to any one of claims 1 to 6, wherein the method further comprises:
inputting g frames of images contained in fourth video data into the feature extraction network to respectively extract human face features, and obtaining g personal face feature matrixes corresponding to the g frames of images one by one;
fusing the g personal face feature matrixes to obtain a first personal face feature matrix;
the face recognition of the face to be recognized through the target face feature matrix to obtain a face recognition result comprises the following steps:
calculating the similarity of the target face feature matrix and the first face feature matrix;
under the condition that the similarity of the target face feature matrix and the first face feature matrix exceeds a threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data correspond to the same person;
and under the condition that the similarity of the target face feature matrix and the first face feature matrix does not exceed the threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
9. The method of claim 7, further comprising:
inputting g frames of images contained in fourth video data into the feature extraction network to respectively extract human face features, and obtaining g personal face feature matrixes corresponding to the g frames of images one by one;
fusing the g personal face feature matrixes to obtain a first personal face feature matrix;
the face recognition of the face to be recognized through the target face feature matrix to obtain a face recognition result comprises the following steps:
calculating the similarity of the target face feature matrix and the first face feature matrix;
under the condition that the similarity of the target face feature matrix and the first face feature matrix exceeds a threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data correspond to the same person;
and under the condition that the similarity of the target face feature matrix and the first face feature matrix does not exceed the threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
10. The face recognition method according to any one of claims 1 to 6, wherein the performing face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result comprises:
calculating the similarity between the target face feature matrix and a face feature matrix in a face feature database; the face feature database comprises at least one face feature matrix and identity information corresponding to the at least one face feature matrix;
determining a second face feature matrix with the highest similarity to the target face feature matrix in the face feature database and identity information corresponding to the second face feature matrix; and the face recognition result comprises identity information corresponding to the second face feature matrix.
11. The method of claim 7, wherein the performing face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result comprises:
calculating the similarity between the target face feature matrix and a face feature matrix in a face feature database; the face feature database comprises at least one face feature matrix and identity information corresponding to the at least one face feature matrix;
determining a second face feature matrix with the highest similarity to the target face feature matrix in the face feature database and identity information corresponding to the second face feature matrix; and the face recognition result comprises identity information corresponding to the second face feature matrix.
12. The method of claim 10, further comprising:
acquiring identity information corresponding to a second face to be registered and fifth video data containing the second face;
extracting a face feature matrix of the second face according to the fifth video data;
and storing the identity information corresponding to the second face and the face feature matrix of the second face in the face feature database.
13. The method of claim 11, further comprising:
acquiring identity information corresponding to a second face to be registered and fifth video data containing the second face;
extracting a face feature matrix of the second face according to the fifth video data;
and storing the identity information corresponding to the second face and the face feature matrix of the second face in the face feature database.
14. A face recognition apparatus, comprising:
a first input unit configured to input n frames of images included in first video data to a feature extraction network;
the feature extraction unit is configured to perform face feature extraction on n frames of images included in the first video data respectively by using the feature extraction network to obtain n personal face feature matrices corresponding to the n frames of images one to one, where the first video data includes T frames of images, and the n frames of images are any n frames of images, of the T frames of images, that include the face to be recognized; the feature extraction network is obtained by training by adopting a first generative adversarial learning method, and the first generative adversarial learning method comprises the following steps: training a first generation network and a first adversarial network, and taking the converged first generation network as the feature extraction network; the training data used for training the first generation network comprises a first positive example and a first negative example, the first positive example comprises an m-personal-face feature matrix extracted from second video data, the first negative example comprises an f-personal-face feature matrix extracted from f pictures, the m-personal-face feature matrix is in one-to-one correspondence with m frames of images contained in the second video data, the f-personal-face feature matrix is in one-to-one correspondence with the f pictures, and the second video data and the f pictures both contain a first face; the first adversarial network is used for judging whether the face feature matrix extracted by the first generation network is positive or negative to obtain a first judgment result; the first judgment result is used for calculating with the first positive example and the first negative example to obtain a first loss value, and the first loss value is used for updating the parameters of the first generation network and the parameters of the first adversarial network until the first generation network converges;
the feature fusion unit is used for fusing the n personal face feature matrixes to obtain a target face feature matrix of the face to be recognized;
the face recognition unit is used for carrying out face recognition on the face to be recognized through the target face feature matrix to obtain a face recognition result; wherein T is more than or equal to 2, n is more than or equal to 2 and less than or equal to T, m is more than or equal to 2, and f is more than or equal to 1.
15. The face recognition apparatus of claim 14, wherein the first generation network is further configured to update parameters of the first generation network with a weighted sum of the first loss value and the second loss value; and the second loss value represents the possibility of carrying out face recognition by using the m personal face feature matrixes to obtain wrong face recognition results.
16. The face recognition device of claim 14, wherein the n-person face feature matrices are homomorphic matrices;
the feature fusion unit is specifically configured to calculate a weighted sum of the n face feature matrices according to weight values or weight matrices corresponding to the n face feature matrices, respectively, so as to obtain a target face feature matrix of the face to be recognized.
17. The face recognition apparatus of claim 16,
the first input unit is further configured to sequentially input the n personal face feature matrices to a trained feature weight network; the characteristic weight network is a recurrent neural network; the sequence of inputting each face feature matrix in the n face feature matrices to the feature weight network is the same as the sequence of the image frames corresponding to each face feature matrix in the n face feature matrices in the first video data; the face recognition apparatus further includes:
the feature weight unit is used for determining weight values or weight matrixes corresponding to the n face feature matrixes respectively according to the state vector of the feature weight network when each face feature matrix in the n face feature matrixes is input; and the state vector of the feature weight network at the current moment is related to the state vector of the feature weight network at the previous moment of the current moment and the input human face feature matrix.
18. The face recognition apparatus of claim 17, wherein the feature weight network is trained by a second generative adversarial learning method, the second generative adversarial learning method comprising: training a second generation network and a second adversarial network, and taking the obtained converged second generation network as the feature weight network; the training data used for training the second generation network comprises a second positive example and a second negative example, the second positive example comprises k personal face feature matrixes extracted from k frames of images contained in third video data, the second negative example comprises h simulated face feature matrixes generated according to a reference face feature matrix, the k personal face feature matrixes correspond to the k frames of images one by one, and the reference face feature matrix is obtained by calculating a weighted sum of the k personal face feature matrixes according to the weights respectively corresponding to the k personal face feature matrixes determined by the second generation network; the second adversarial network is used for judging the positive and negative examples of the h simulated face feature matrixes and the k personal face feature matrixes to obtain a second judgment result; the second judgment result is used for calculating with the second positive example and the second negative example to obtain a third loss value, and the third loss value is used for updating the parameters of the second generation network and the parameters of the second adversarial network until the second generation network converges; wherein k is more than or equal to 2, and h is more than or equal to 2.
19. The face recognition apparatus of claim 18, wherein the second generation network is further configured to update parameters of the second generation network with a weighted sum of the third loss value and a fourth loss value; and the fourth loss value represents the possibility of carrying out face recognition by using the reference face features to obtain an incorrect face recognition result.
20. The face recognition apparatus according to claim 18 or 19, wherein the h simulated face feature matrices are face feature matrices generated according to normal distribution parameters corresponding to the reference face features.
21. The face recognition apparatus according to any one of claims 14 to 19,
the feature extraction unit is further configured to perform face feature extraction on g frames of images included in fourth video data respectively to obtain g personal face feature matrices corresponding to the g frames of images one to one;
the feature fusion unit is further used for fusing the g individual face feature matrixes to obtain a first face feature matrix;
the face recognition unit is specifically configured to calculate a similarity between the target face feature matrix and the first face feature matrix; under the condition that the similarity of the target face feature matrix and the first face feature matrix exceeds a threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data correspond to the same person; and under the condition that the similarity of the target face feature matrix and the first face feature matrix does not exceed the threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
22. The face recognition apparatus of claim 20,
the feature extraction unit is further configured to perform face feature extraction on g frames of images included in fourth video data respectively to obtain g personal face feature matrices corresponding to the g frames of images one to one;
the feature fusion unit is further used for fusing the g individual face feature matrixes to obtain a first face feature matrix;
the face recognition unit is specifically configured to calculate a similarity between the target face feature matrix and the first face feature matrix; under the condition that the similarity of the target face feature matrix and the first face feature matrix exceeds a threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data correspond to the same person; and under the condition that the similarity of the target face feature matrix and the first face feature matrix does not exceed the threshold value, obtaining a face recognition result indicating that the first video data and the fourth video data do not correspond to the same person.
23. The face recognition apparatus according to any one of claims 14 to 19,
the face recognition unit is specifically used for calculating the similarity between the target face feature matrix and a face feature matrix in a face feature database; the face feature database comprises at least one face feature matrix and identity information corresponding to the at least one face feature matrix; determining a second face feature matrix with the highest similarity to the target face feature matrix in the face feature database and identity information corresponding to the second face feature matrix; and the face recognition result comprises identity information corresponding to the second face feature matrix.
24. The face recognition apparatus of claim 20,
the face recognition unit is specifically used for calculating the similarity between the target face feature matrix and a face feature matrix in a face feature database; the face feature database comprises at least one face feature matrix and identity information corresponding to the at least one face feature matrix; determining a second face feature matrix with the highest similarity to the target face feature matrix in the face feature database and identity information corresponding to the second face feature matrix; and the face recognition result comprises identity information corresponding to the second face feature matrix.
25. The face recognition apparatus of claim 23, wherein the face recognition apparatus further comprises:
the device comprises an acquisition unit, a registration unit and a display unit, wherein the acquisition unit is used for acquiring identity information corresponding to a second face to be registered and fifth video data containing the second face;
the feature extraction unit is further configured to extract a face feature matrix of the second face according to the fifth video data;
and the storage unit is used for storing the identity information corresponding to the second face and the face feature matrix of the second face into the face feature database.
26. The face recognition apparatus of claim 24, wherein the face recognition apparatus further comprises:
the device comprises an acquisition unit, a registration unit and a display unit, wherein the acquisition unit is used for acquiring identity information corresponding to a second face to be registered and fifth video data containing the second face;
the feature extraction unit is further configured to extract a face feature matrix of the second face according to the fifth video data;
and the storage unit is used for storing the identity information corresponding to the second face and the face feature matrix of the second face into the face feature database.
27. A face recognition apparatus comprising a processor and a memory, the processor and memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any one of claims 1 to 13.
28. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-13.
CN201810523102.XA 2018-05-28 2018-05-28 Face recognition method, face recognition device and computer readable medium Active CN109902546B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810523102.XA CN109902546B (en) 2018-05-28 2018-05-28 Face recognition method, face recognition device and computer readable medium
PCT/CN2019/088697 WO2019228317A1 (en) 2018-05-28 2019-05-28 Face recognition method and device, and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810523102.XA CN109902546B (en) 2018-05-28 2018-05-28 Face recognition method, face recognition device and computer readable medium

Publications (2)

Publication Number Publication Date
CN109902546A CN109902546A (en) 2019-06-18
CN109902546B true CN109902546B (en) 2020-11-06

Family

ID=66943254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810523102.XA Active CN109902546B (en) 2018-05-28 2018-05-28 Face recognition method, face recognition device and computer readable medium

Country Status (2)

Country Link
CN (1) CN109902546B (en)
WO (1) WO2019228317A1 (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502975B (en) * 2019-07-09 2023-06-23 平安科技(深圳)有限公司 Batch processing system for pedestrian re-identification
CN111797881B (en) * 2019-07-30 2024-06-28 华为技术有限公司 Image classification method and device
CN110532891B (en) * 2019-08-05 2022-04-05 北京地平线机器人技术研发有限公司 Target object state identification method, device, medium and equipment
CN111160555B (en) * 2019-12-26 2023-12-01 北京迈格威科技有限公司 Processing method and device based on neural network and electronic equipment
CN113128304B (en) * 2019-12-31 2024-01-05 深圳云天励飞技术有限公司 Image processing method and electronic equipment
CN111259738B (en) * 2020-01-08 2023-10-27 科大讯飞股份有限公司 Face recognition model construction method, face recognition method and related device
CN111275780B (en) * 2020-01-09 2023-10-17 北京搜狐新媒体信息技术有限公司 Character image generation method and device
CN111339832B (en) * 2020-02-03 2023-09-12 中国人民解放军国防科技大学 Face synthetic image detection method and device
CN111539263B (en) * 2020-04-02 2023-08-11 江南大学 Video face recognition method based on aggregation countermeasure network
CN111553838A (en) * 2020-05-08 2020-08-18 深圳前海微众银行股份有限公司 Model parameter updating method, device, equipment and storage medium
CN111768286B (en) * 2020-05-14 2024-02-20 北京旷视科技有限公司 Risk prediction method, apparatus, device and storage medium
CN111783658B (en) * 2020-07-01 2023-08-25 河北工业大学 Two-stage expression animation generation method based on dual-generation reactance network
CN111968152B (en) * 2020-07-15 2023-10-17 桂林远望智能通信科技有限公司 Dynamic identity recognition method and device
CN111814735A (en) * 2020-07-24 2020-10-23 深圳市爱深盈通信息技术有限公司 Ticket taking method, device and equipment based on face recognition and storage medium
CN112070664B (en) * 2020-07-31 2023-11-03 华为技术有限公司 Image processing method and device
CN111985360A (en) * 2020-08-05 2020-11-24 上海依图网络科技有限公司 Face recognition method, device, equipment and medium
CN112001324B (en) * 2020-08-25 2024-04-05 北京影谱科技股份有限公司 Method, device and equipment for identifying player actions of basketball game video
CN112183227B (en) * 2020-09-08 2023-12-22 瑞芯微电子股份有限公司 Intelligent face region coding method and device
CN112132030B (en) * 2020-09-23 2024-05-28 湖南快乐阳光互动娱乐传媒有限公司 Video processing method and device, storage medium and electronic equipment
CN112257672A (en) * 2020-11-17 2021-01-22 中国科学院深圳先进技术研究院 Face recognition method, system, terminal and storage medium
CN112560669B (en) * 2020-12-14 2024-07-26 杭州趣链科技有限公司 Face pose estimation method and device and electronic equipment
CN112560753B (en) * 2020-12-23 2024-06-28 平安银行股份有限公司 Face recognition method, device, equipment and storage medium based on feature fusion
CN112560772B (en) * 2020-12-25 2024-05-14 北京百度网讯科技有限公司 Face recognition method, device, equipment and storage medium
CN112669212B (en) * 2020-12-30 2024-03-26 杭州趣链科技有限公司 Face image super-resolution reconstruction method, device, computer equipment and medium
CN113221645B (en) * 2021-04-07 2023-12-12 深圳数联天下智能科技有限公司 Target model training method, face image generating method and related device
CN113158948B (en) * 2021-04-29 2024-08-02 宜宾中星技术智能系统有限公司 Information generation method, device and terminal equipment
CN113408348B (en) * 2021-05-14 2022-08-19 桂林电子科技大学 Video-based face recognition method and device and storage medium
CN113296994B (en) * 2021-05-18 2024-08-06 北京计算机技术及应用研究所 Fault diagnosis system and method based on domestic computing platform
CN113239875B (en) * 2021-06-01 2023-10-17 恒睿(重庆)人工智能技术研究院有限公司 Method, system and device for acquiring face characteristics and computer readable storage medium
CN113569627B (en) * 2021-06-11 2024-06-14 北京旷视科技有限公司 Human body posture prediction model training method, human body posture prediction method and device
CN113705376B (en) * 2021-08-11 2024-02-06 中国科学院信息工程研究所 Personnel positioning method and system based on RFID and camera
CN113656422A (en) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Method and device for updating human face base
CN114187626B (en) * 2021-11-22 2024-08-23 深圳云天励飞技术股份有限公司 Training method, recognition method, device, equipment and medium for face recognition model
CN114387656B (en) * 2022-01-14 2024-09-13 平安科技(深圳)有限公司 Face changing method, device, equipment and storage medium based on artificial intelligence
CN114241459B (en) * 2022-02-24 2022-06-17 深圳壹账通科技服务有限公司 Driver identity verification method and device, computer equipment and storage medium
CN117196937B (en) * 2023-09-08 2024-05-14 天翼爱音乐文化科技有限公司 Video face changing method, device and storage medium based on face recognition model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013228765A (en) * 2012-04-24 2013-11-07 General Electric Co <Ge> Optimal gradient pursuit for image alignment
CN105825176A (en) * 2016-03-11 2016-08-03 东华大学 Identification method based on multi-mode non-contact identity characteristics
CN107154023A (en) * 2017-05-17 2017-09-12 电子科技大学 Face super-resolution reconstruction method based on generation confrontation network and sub-pix convolution
CN107977629A (en) * 2017-12-04 2018-05-01 电子科技大学 A kind of facial image aging synthetic method of feature based separation confrontation network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2007102021A (en) * 2007-01-19 2008-07-27 Корпораци "Самсунг Электроникс Ко., Лтд." (KR) METHOD AND SYSTEM OF IDENTITY RECOGNITION
CN101526997A (en) * 2009-04-22 2009-09-09 无锡名鹰科技发展有限公司 Embedded infrared face image identifying method and identifying device
CN107977634A (en) * 2017-12-06 2018-05-01 北京飞搜科技有限公司 A kind of expression recognition method, device and equipment for video
CN107958244B (en) * 2018-01-12 2020-07-10 成都视观天下科技有限公司 Face recognition method and device based on video multi-frame face feature fusion

Also Published As

Publication number Publication date
CN109902546A (en) 2019-06-18
WO2019228317A1 (en) 2019-12-05

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
US11704907B2 (en) Depth-based object re-identification
CN113705769B (en) Neural network training method and device
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
WO2019100724A1 (en) Method and device for training multi-label classification model
CN110070107B (en) Object recognition method and device
CN110222717B (en) Image processing method and device
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN110309856A (en) Image classification method, the training method of neural network and device
WO2021190296A1 (en) Dynamic gesture recognition method and device
WO2021232985A1 (en) Facial recognition method and apparatus, computer device, and storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
JP2022141931A (en) Method and device for training living body detection model, method and apparatus for living body detection, electronic apparatus, storage medium, and computer program
CN110222718B (en) Image processing method and device
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN111695673B (en) Method for training neural network predictor, image processing method and device
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN111133453A (en) Artificial neural network
Yu et al. Human action recognition using deep learning methods
CN109325440A (en) Human motion recognition method and system
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
CN112529146A (en) Method and device for training neural network model
CN113128285A (en) Method and device for processing video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant