CN106909905A - Multi-modal face recognition method based on deep learning - Google Patents

Multi-modal face recognition method based on deep learning

Info

Publication number
CN106909905A
CN106909905A (application CN201710122193.1A)
Authority
CN
China
Prior art keywords
mode
modal
similarity
face
feature
Prior art date
Legal status
Granted
Application number
CN201710122193.1A
Other languages
Chinese (zh)
Other versions
CN106909905B (en)
Inventor
张浩
韩琥
山世光
陈熙霖
Current Assignee
In Extension (beijing) Technology Co Ltd
Original Assignee
In Extension (beijing) Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by In Extension (Beijing) Technology Co Ltd
Priority to CN201710122193.1A
Publication of CN106909905A
Application granted
Publication of CN106909905B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal face recognition method based on deep learning, comprising: (1) performing face detection and alignment on RGB face images and, using the mapping relations between modalities, cropping the images to produce the face data sets S0, S1, S2, ... of the RGB modality and the other modalities; (2) designing a multi-modal fusion deep convolutional neural network N1 and training it; (3) designing a modality-shared deep convolutional neural network N2 and training it; (4) a feature extraction stage; (5) a similarity computation stage; and (6) a similarity fusion stage. The present invention adopts a multi-modal system: by collecting data of several face modalities and exploiting the respective strengths of the modalities, the fusion strategy overcomes some of the weaknesses of single-modality systems while making full use of the multi-modal information, effectively improving the performance of the face recognition system and making face recognition faster and more accurate.

Description

Multi-modal face recognition method based on deep learning
Technical field
The present invention relates to a face recognition method, and more particularly to a multi-modal face recognition method based on deep learning.
Background technology
Compared with two-dimensional face recognition, three-dimensional face recognition has the advantages of being robust to illumination and less affected by factors such as pose and expression. Therefore, after three-dimensional data acquisition technology developed rapidly and the quality and precision of three-dimensional data improved greatly, many researchers devoted themselves to this field.
Images of the different face modalities are each easily affected by different factors, and these factors impair, to some extent, the stability and accuracy of single-modality face recognition systems. CN104778441A proposes a multi-modal face recognition device and method that fuses gray-level information and depth information. Its core method is to extract multi-modal face features (the features used in that invention are hand-designed), splice them into a feature pool, construct a weak classifier for each feature in the pool, use the AdaBoost algorithm to select the features in the pool that are most effective for classification, and finally compute matching scores with a nearest-neighbour classifier on the features obtained by multi-modal feature-level fusion, thereby realizing multi-modal face recognition. However, the features used in that invention are hand-engineered and their expressive power is limited; feature fusion and feature selection with AdaBoost are inefficient; and the method is designed for two specific modalities, which limits its applicability.
Summary of the invention
To overcome the above shortcomings, the present invention provides a multi-modal face recognition method based on deep learning.
To solve the above technical problems, the technical solution adopted by the present invention is a multi-modal face recognition method based on deep learning, comprising the following steps:
(1) Perform face detection, facial landmark localization, alignment and cropping on the RGB face images to produce the cropped RGB-modality face data set S0; according to the coordinate mapping relations between the RGB modality and the other modalities, locate the facial landmarks of the other modalities and crop them to produce the face data sets S1, S2, ... of the other modalities;
(2) Design a multi-modal fusion deep convolutional neural network N1. In this structure, the first half consists of several independent neural network branches, each branch taking one modality as input; the modality branches are then fused into a single branch by a specific network structure, followed by a series of further neural network structural units. S0, S1, S2, ... are fed into the corresponding branches of N1 and N1 is trained; the trained model is denoted M1;
The above neural network structural units include, but are not limited to, convolutional layers, normalization layers, non-linear layers, pooling layers, fully connected layers and distribution-normalization layers. The modalities include, but are not limited to, the RGB, depth and near-infrared modalities. In the specific network structure, each branch has its own classification loss serving as the supervisory signal of its modality, and the structural fusion method includes, but is not limited to, simple feature concatenation;
(3) Design a modality-shared deep convolutional neural network N2; feed S0, S1, S2, ... into N2 together without distinguishing modalities and train N2; the trained model is denoted M2;
(4) Feature extraction stage: the modality range of the gallery (registration) set and query set images lies within the modality range of the training set. The different modalities of an image are denoted I0, I1, I2, ...; features are extracted from I0, I1, I2, ... with models M1 and M2 respectively and denoted F0, F0C, F1, F1C, F2, F2C, ..., where the suffix C indicates a feature extracted with M2;
(5) Compute the similarity S00 between GF0 and PF0, the similarity S11 between GF1 and PF1, and the similarity S22 between GF2 and PF2; compute the similarity S01 between GF0C and PF1C, and similarly compute the cross-modal similarities S02, S03, S12, S13, S23, ...;
Here GF0 denotes the F0 feature of a gallery-set image, PF0 the F0 feature of a query-set image, GF0C the F0C feature of a gallery-set image, and PF1C the F1C feature of a query-set image;
(6) Apply weighted-sum fusion to all gallery-to-query similarities to obtain the final fused similarity S, and perform face identification and face verification on the similarity matrix formed by the fused similarities S.
In step (2), when training the N1 network, the loss layer may use softmax-with-loss or other loss layers.
In step (3), when S0, S1, S2, ... are fed into N2 without distinguishing modalities, if their channel numbers differ, they may all be converted to single-channel images so that the channel numbers match, or a single channel may be replicated until the channel numbers match, before they are fed into the network structure for training.
In step (4), the modality range of the gallery-set and query-set images lies within the modality range of the training set.
The present invention adopts a multi-modal system: by collecting data of several face modalities and exploiting the respective strengths of the modalities, the fusion strategy overcomes some of the weaknesses of single-modality systems while making full use of the multi-modal information, effectively improving the performance of the face recognition system and making face recognition faster and more accurate.
Brief description of the drawings
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of the face recognition algorithm of the present invention.
Fig. 2 is a structural block diagram of the multi-modal fusion deep convolutional neural network of the present invention.
Fig. 3 is a detailed block diagram of Loss4 in Fig. 2.
Specific embodiment
As shown in Fig. 1, the present invention specifically comprises the following steps:
(1) Perform face detection, facial landmark localization, alignment and cropping on the RGB face images to produce the cropped RGB-modality face data set S0; then, according to the coordinate mapping relations between the RGB modality and the other modalities (such as depth information and near-infrared information), locate the facial landmarks of the other modalities and crop them to produce the face data sets S1, S2, ... of the other modalities.
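As an illustration only, the following minimal Python sketch shows one way step (1) can be realized, assuming the sensors are calibrated so that a fixed 2x3 affine mapping relates RGB pixel coordinates to those of the other modality; the 5-point template, the function names and the crop size are illustrative choices, not prescribed by the invention (face detection and landmark localization themselves are performed beforehand by any detector):

    import cv2
    import numpy as np

    # Canonical positions of 5 landmarks (eyes, nose tip, mouth corners) in a 128x128 crop.
    TEMPLATE_5PTS = np.float32([[38, 52], [90, 52], [64, 76], [44, 100], [84, 100]])

    def align_and_crop(image, landmarks, size=128):
        """Similarity-align a face to the 5-point template and crop it to size x size."""
        M, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE_5PTS)
        return cv2.warpAffine(image, M, (size, size))

    def map_landmarks(rgb_landmarks, rgb_to_other):
        """Transfer RGB landmark coordinates into another modality through a known
        2x3 affine coordinate mapping (the inter-modality mapping relation)."""
        pts = np.hstack([np.float32(rgb_landmarks),
                         np.ones((len(rgb_landmarks), 1), np.float32)])
        return pts @ rgb_to_other.T   # (N, 3) x (3, 2) -> (N, 2)

Cropping an RGB image and, say, its depth image would then call align_and_crop twice, once with the RGB landmarks and once with the mapped landmarks.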
(2) Design a multi-modal fusion deep neural network N1. In this structure, the first half consists of several independent neural network branches, each branch taking one modality as input (e.g. the RGB, depth or near-infrared modality). The modality branches are then fused by a specific network structure into a single combined branch (for example by concatenating the features, stacking them along the channel dimension, or using another connection structure such as the one in Fig. 3), followed by a series of further neural network structural units (such as convolutional layers, normalization layers, non-linear layers, pooling layers and fully connected layers). S0, S1, S2, ... are then fed into the corresponding branches of N1 and N1 is trained; the trained model is denoted M1.
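The following PyTorch sketch illustrates the N1 idea only: independent per-modality branches, channel-wise fusion in the middle of the network, a shared trunk on top, and a classification loss per branch in addition to the loss after fusion. The tiny convolutional blocks, channel counts and identity count are placeholders and not the Inception-v2-based structure of Fig. 2:

    import torch
    import torch.nn as nn

    def conv_block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                             nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    class MultiModalN1(nn.Module):
        def __init__(self, in_channels=(3, 1, 1), num_ids=500):
            super().__init__()
            # one independent branch per modality (RGB, depth, NIR, ...)
            self.branches = nn.ModuleList(
                [nn.Sequential(conv_block(c, 32), conv_block(32, 64)) for c in in_channels])
            # one classification head per branch: the per-modality supervisory signal
            self.branch_heads = nn.ModuleList([nn.Linear(64, num_ids) for _ in in_channels])
            fused = 64 * len(in_channels)          # fusion by stacking feature maps along channels
            self.trunk = nn.Sequential(conv_block(fused, 128), conv_block(128, 256),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Linear(256, num_ids)    # classification loss after fusion

        def forward(self, inputs):                 # inputs: one tensor per modality
            feats = [b(x) for b, x in zip(self.branches, inputs)]
            branch_logits = [h(f.mean(dim=(2, 3))) for h, f in zip(self.branch_heads, feats)]
            fused = torch.cat(feats, dim=1)        # channel-wise fusion
            emb = self.trunk(fused)                # embedding later usable as a face feature
            return self.head(emb), branch_logits, emb

A training step would sum a cross-entropy loss over the fused head and over each branch head, so that every modality keeps its own supervisory signal.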
(3) Design a modality-shared deep neural network N2; feed S0, S1, S2, ... into N2 together without distinguishing modalities and train N2; the trained model is denoted M2.
(4) Feature extraction stage. The modality range of the gallery-set and query-set images lies within the modality range of the training set. The different modalities of an image are denoted I0, I1, I2, ...; features are extracted from I0, I1, I2, ... with models M1 and M2 respectively and denoted F0, F0C, F1, F1C, F2, F2C, ..., where the suffix C indicates a feature extracted with M2.
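How exactly a single-modality image yields a feature from the fusion model M1 is not pinned down above; the sketch below assumes, for illustration, that the pooled output of that modality's branch in the MultiModalN1 sketch is used as F{k}, and that M2 is any embedding network that handles its own channel normalization (the helper name and dictionary layout are ours, not the invention's):

    import torch

    @torch.no_grad()
    def extract_features(images_by_modality, m1, m2):
        """images_by_modality: {k: tensor of shape (1, C_k, H, W)} with k = 0 (RGB), 1, 2, ..."""
        m1.eval(); m2.eval()
        feats = {}
        for k, img in images_by_modality.items():
            branch_out = m1.branches[k](img)                          # modality branch of M1
            feats[f"F{k}"] = branch_out.mean(dim=(2, 3)).squeeze(0)   # pooled branch feature
            feats[f"F{k}C"] = m2(img).squeeze(0)                      # cross-modal feature from M2
        return feats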
(5) Compute the similarity S00 between GF0 and PF0 (the F0 features of a gallery-set image and of a query-set image, respectively), the similarity S11 between GF1 and PF1, and the similarity S22 between GF2 and PF2; compute the similarity S01 between GF0C and PF1C (the F0C feature of a gallery-set image and the F1C feature of a query-set image, respectively), and similarly compute the cross-modal similarities S02, S03, S12, S13, S23, ....
(6) Apply weighted-sum fusion to all gallery-to-query similarities to obtain the final fused similarity S, and perform face identification and face verification on the similarity matrix formed by the fused similarities S.
The weighted-sum fusion can be expressed as S = p1*S1 + p2*S2 + p3*S3 + ..., where the ratio p1:p2:p3 may be taken as the inverse ratio (1/r1 : 1/r2 : 1/r3) of the recognition accuracies (r1, r2, r3) obtained when face recognition experiments are run with S1, S2 and S3 individually.
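A minimal numpy sketch of this weighted-sum fusion, using the weight rule quoted above (weights in the ratio 1/r1 : 1/r2 : ..., normalized to sum to one, which is our convention for illustration):

    import numpy as np

    def fuse_similarities(sim_matrices, recognition_rates):
        """sim_matrices: list of (num_gallery, num_query) similarity matrices S1, S2, ...
        recognition_rates: accuracy r_i of each similarity used on its own."""
        w = 1.0 / np.asarray(recognition_rates, dtype=np.float64)
        w /= w.sum()                       # normalize the ratio 1/r1 : 1/r2 : ... to weights p_i
        return sum(wi * s for wi, s in zip(w, sim_matrices))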
In the deep neural network structure N1 of step (2), the network structural units may include, but are not limited to, convolutional layers, pooling layers, non-linear function layers, fully connected layers, distribution-normalization layers and the like, and any combination of these layers. The network structure may be a newly designed combination of such units or may be modified from an existing published network structure. A simple combination of network structural units is not in itself the scope of protection; any structural form that satisfies the description of the present invention falls within the scope of protection, namely one whose first half consists of several independent neural network branches that are fused into a single combined branch by a specific structure, in which each modality branch within that structure has its own independent supervisory signal, and which connects a series of further neural network structural units after the combined branch.
Fig. 2 shows a network structure modified according to the above rules from Google Inception-v2 (a deep network structure proposed by Google). In this network structure, if the Loss4 loss structure is removed, the depth-modality and near-infrared-modality branches are removed, and the remaining RGB branch is connected directly to the 3*Inception structure (where * denotes several similar structures in series, here three Inception structures in series; an Inception structure is a sub-network structure defined by Google, illustrated in Figure 3 of [2]), then what remains is exactly the Google Inception-v2 network structure itself. With respect to Figure 5 of [1], the 2*(convolution + pooling) of Fig. 2 corresponds to the four layers convolution (convolutional layer), max pool (max pooling layer), convolution, max pool of Figure 5; the first 3*Inception from the bottom corresponds to inception (3a), inception (3b), inception (3c) in [1]; the 4*Inception above it corresponds to inception (4a), inception (4b), inception (4c), inception (4d) in [1]; the 3*Inception above that corresponds to inception (4e), inception (5a), inception (5b) in [1]; and the Loss3 pooling layer corresponds to the avg pool (average pooling layer) in [1].
In our network structure, the data of the different modalities (RGB, depth information, near-infrared information, etc.) first pass through a series of neural network structures to learn modality-specific features (the per-modality supervisory signals in the fusion structure ensure and promote the learning of such modality-specific features); the branches are then fused and connected, followed by a further series of neural network structures, to help the network learn modality-complementary features. The fusion connection may stack the feature maps along the channel dimension. The Loss4 structure is specially designed: besides promoting network convergence, it also promotes the learning of modality-complementary features. One form of Loss4 is shown in Fig. 3, whose three branches correspond to the three modalities and are taken from the three branches of Fig. 2. In the structure, FC denotes a fully connected layer and the number after or below FC is the number of nodes of that layer; the average pooling layer uses a 5x5 pooling window with a stride of 3; the convolutional layer uses 1x1 kernels with a stride of 1; the total number of persons in the training set is the number of label categories. The '+' denotes that the three 512-node fully connected layers fed into it are averaged node by node (fc3[i] = (fcc3[i] + fcd3[i] + fcn3[i]) / 3, i = 1, 2, ..., 512), which clearly yields another 512-node fully connected layer, namely fc3. FC2048 is the 2048-dimensional fully connected layer obtained by concatenating the three per-modality 512-node fully connected layers with the 512-node fusion layer fc3.
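The following PyTorch sketch of a Loss4-style auxiliary head follows this description: each modality branch is average-pooled with a 5x5 window and stride 3, reduced by a 1x1 convolution, mapped to a 512-node fully connected layer, the three 512-d vectors are averaged into fc3, concatenated with them into a 2048-d layer, and classified over the training identities. The channel counts of the incoming feature maps and of the 1x1 convolution are placeholders, since Fig. 3 is not reproduced here:

    import torch
    import torch.nn as nn

    class Loss4Head(nn.Module):
        def __init__(self, in_channels=576, spatial=14, num_ids=500):
            super().__init__()
            pooled = (spatial - 5) // 3 + 1                  # spatial size after 5x5/stride-3 pooling
            self.pool = nn.AvgPool2d(kernel_size=5, stride=3)
            self.reduce = nn.ModuleList([nn.Conv2d(in_channels, 128, 1) for _ in range(3)])
            self.fc512 = nn.ModuleList([nn.Linear(128 * pooled * pooled, 512) for _ in range(3)])
            self.classifier = nn.Linear(4 * 512, num_ids)    # 3 modality FC512s + fc3 = 2048 dims

        def forward(self, branch_maps):                      # three per-modality feature maps
            fcs = [fc(torch.flatten(r(self.pool(x)), 1))
                   for fc, r, x in zip(self.fc512, self.reduce, branch_maps)]
            fc3 = sum(fcs) / len(fcs)                        # node-by-node average of the 512-d vectors
            fc2048 = torch.cat(fcs + [fc3], dim=1)           # concatenation into the 2048-d layer
            return self.classifier(fc2048)                   # logits over the training identities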
When training the N1 network, the present invention may use softmax-with-loss (a classical classification loss layer) or other loss layers as the loss layer. Training uses the back-propagation algorithm: the error of the loss layer is propagated backwards to update the parameters of every layer, so that the network parameters are updated and finally converge. As a concrete example, take the data set used in our experiments (non-public, collected in cooperation with a partner): its training set contains about 500,000 samples of about 500 persons. We used a mini-batch size of 32, a base learning rate of 0.045 multiplied by 0.9 every 6400 iterations, a weight decay of 0.0002 and a momentum of 0.9, and trained for about 400,000 iterations.
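Written as a PyTorch optimizer and scheduler, that recipe might look as follows; model is assumed to be the MultiModalN1 sketch above and multimodal_batches an iterator yielding (list-of-modality-tensors, labels) mini-batches of size 32, and the equal weighting of the main and branch losses is our assumption:

    import torch

    def train_n1(model, multimodal_batches, num_iters=400_000):
        opt = torch.optim.SGD(model.parameters(), lr=0.045,
                              momentum=0.9, weight_decay=2e-4)
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=6400, gamma=0.9)
        ce = torch.nn.CrossEntropyLoss()
        for _, (inputs, labels) in zip(range(num_iters), multimodal_batches):
            logits, branch_logits, _ = model(inputs)          # fused head + per-branch heads
            loss = ce(logits, labels) + sum(ce(bl, labels) for bl in branch_logits)
            opt.zero_grad(); loss.backward(); opt.step(); sched.step()
        return model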
Likewise, in the deep neural network structure N2 of the present invention, the network structural units may include, but are not limited to, convolutional layers, pooling layers, non-linear function layers, fully connected layers, distribution-normalization layers and the like, and any combination of these layers. In this step the network structure itself is not the scope of protection; the scope of protection is the method of training a multi-modal common feature space by not distinguishing between modalities. The network structure may be newly designed or may use a network structure published by the academic community, such as AlexNet, GoogleNet or ResNet. When S0, S1, S2, etc. are fed into N2 together without distinguishing modalities, if their channel numbers differ, they may all be converted to single-channel images so that the channel numbers match (e.g. converting three-channel RGB to a single-channel gray-scale image), or a single channel may be replicated until the channel numbers match (e.g. repeating a single-channel near-infrared image three times to obtain three channels), before they are fed into the network structure for training.
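A small numpy sketch of this channel normalization, with the usual luminance weights as an illustrative choice for the gray-scale conversion:

    import numpy as np

    def to_single_channel(img):
        """H x W x 3 RGB array or H x W single-channel array -> H x W gray-scale array."""
        if img.ndim == 3 and img.shape[2] == 3:
            gray = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
            return gray.astype(img.dtype)
        return img

    def to_three_channels(img):
        """H x W single-channel array (e.g. NIR or depth) -> H x W x 3 by repeating the channel."""
        if img.ndim == 2:
            return np.repeat(img[..., None], 3, axis=2)
        return img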
This embodiment takes RGB, depth information and near-infrared information as an example. Since the depth information and near-infrared information are single-channel, the RGB images of all samples can be converted to single-channel gray-scale images and then fed into the N2 network for training. Taking Google Inception-v2 [1] as an example, all samples of the 500,000-sample training set, regardless of modality, are regarded as 1,500,000 samples (RGB, depth and near-infrared, with RGB converted to gray-scale here) and fed into Inception-v2 for training, using a mini-batch size of 32, a base learning rate of 0.045 multiplied by 0.9 every 19,200 iterations, a weight decay of 0.0002 and a momentum of 0.9, for about 1,200,000 iterations. The resulting model is denoted M2. The features of this model can then be regarded as cross-modal features: the RGB (converted to gray-scale) feature of a sample can be compared with its depth-information feature, and the RGB feature of a query-set image can be compared for similarity with the depth-information feature of a gallery-set image. This is the basic algorithm of the cross-modal recognition part of the present invention.
In the feature extraction stage, the modality range of the gallery-set and query-set images may be the full set or a proper subset of the training-set modality range, but may not exceed the training-set modality range. Features may be extracted at the top layer of models M1 and M2, or at a non-top layer (e.g. an intermediate layer).
In step (5) above, the distance metric used to compute the similarity may be the cosine distance, the Euclidean distance, or another distance metric such as the Mahalanobis distance. Taking the cosine distance as an example, for two feature vectors x1 and x2 the cosine distance is d = x1' * x2 / (|x1| * |x2|), where x1 and x2 are assumed to be column vectors, x1' is the transpose of x1, x1' * x2 is the dot product of x1' and x2, and |x1|, |x2| are the norms of x1 and x2, the norm satisfying |x1|^2 = x1' * x1.
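The cosine similarity above, and the gallery-versus-query similarity matrix on which steps (5) and (6) operate, can be written with numpy as follows (the matrix layout with gallery features as rows is our convention):

    import numpy as np

    def cosine_similarity(x1, x2):
        return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

    def similarity_matrix(gallery_feats, query_feats):
        """gallery_feats: (G, d) array, query_feats: (Q, d) array -> (G, Q) cosine matrix."""
        g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        return g @ q.T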
In the face identification experiments of the present invention, a gallery set containing a series of different face images is given; when a query image is given, the task is to find which individual in the gallery set the query image belongs to, and the method may take the gallery person whose image has the highest similarity with the query image. The face verification experiments are given two query images and must determine whether the two images show the same person; the method may fix a threshold and regard the two images as the same person if their similarity is higher than the threshold, and as different persons otherwise.
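A sketch of these two protocols on the fused similarity matrix S; the threshold and the identity bookkeeping are illustrative:

    import numpy as np

    def identify(fused_sim, gallery_ids):
        """fused_sim: (G, Q) fused similarity matrix -> predicted identity for each query."""
        best_gallery = np.argmax(fused_sim, axis=0)       # most similar gallery image per query
        return [gallery_ids[i] for i in best_gallery]

    def verify(similarity, threshold):
        """True if the two compared images are judged to show the same person."""
        return similarity > threshold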
The present invention has several key technical points. 1) Multi-modal fusion is carried out by the network structure, with the fusion located in the middle of the network, neither at the bottom input position nor at the top loss-layer position. The number of fused modalities can be two or more, and each modality is given its own supervisory signal in the fusion structure so that the characteristic features of each modality are fully exploited and not drowned out by the features of the other modalities. In this way the feature representation and fusion effect with the richest generalization ability for face recognition is obtained; the method is not limited to two modalities but is suitable for multiple modalities and is therefore more flexible. 2) For learning the cross-modal common feature space, the face image data of multiple modalities are fed into the deep neural network for training as input without distinguishing modalities. The number of modalities can be two or more, the label of an image is the label of the person without distinguishing modalities, and the common feature space obtained by training may be taken from the top layer of the network or from a non-top layer. The technical effect is cross-modal recognition between the various face modalities. 3) All intra-modal and cross-modal similarities are fused by weighted summation to obtain the final multi-modal fused similarity of two images. The modalities of the test images may be complete or incomplete; their set of modality types may be the full set or a proper subset of the training-set modality types. The technical effect is that multi-modal face recognition gains great flexibility and applicability and can be applied to many multi-modal face recognition scenarios (for example, a multi-modal gallery set with single-modality query images).
Therefore, by designing multi-modal and cross-modal deep learning networks, the present invention fully mines and learns the complementary features among the multiple face modalities, greatly improves face recognition performance, and provides great flexibility and applicability for multi-modal and cross-modal face recognition applications.
The above embodiments do not limit the present invention, and the present invention is not limited to the above examples. Changes, modifications, additions or substitutions made by those skilled in the art within the scope of the technical solution of the present invention also fall within the scope of protection of the present invention.

Claims (5)

1. A multi-modal face recognition method based on deep learning, characterized by comprising the following steps:
(1) performing face detection, facial landmark localization, alignment and cropping on RGB face images to produce the cropped RGB-modality face data set S0; according to the coordinate mapping relations between the RGB modality and the other modalities, locating the facial landmarks of the other modalities and cropping them to produce the face data sets S1, S2, ... of the other modalities;
(2) designing a multi-modal fusion deep convolutional neural network N1, in which the first half consists of several independent neural network branches, each branch taking one modality as input, the modality branches then being fused into a single branch by a specific network structure, followed by a series of further neural network structural units; feeding S0, S1, S2, ... into the corresponding branches of N1 and training N1, the trained model being denoted M1;
the modalities including, but not limited to, the RGB, depth and near-infrared modalities; in the specific network structure, each branch having its own classification loss as the supervisory signal of its modality, and the structural fusion method including, but not limited to, simple feature concatenation;
(3) designing a modality-shared deep convolutional neural network N2, feeding S0, S1, S2, ... into N2 together without distinguishing modalities and training N2, the trained model being denoted M2;
(4) a feature extraction stage, in which the modality range of the gallery (registration) set and query set images lies within the modality range of the training set, the different modalities of an image being denoted I0, I1, I2, ..., features being extracted from I0, I1, I2, ... with models M1 and M2 respectively and denoted F0, F0C, F1, F1C, F2, F2C, ..., where the suffix C indicates a feature extracted with M2;
(5) computing the similarity S00 between GF0 and PF0, the similarity S11 between GF1 and PF1, and the similarity S22 between GF2 and PF2; computing the similarity S01 between GF0C and PF1C, and similarly computing the cross-modal similarities S02, S03, S12, S13, S23, ...;
GF0 denoting the F0 feature of a gallery-set image, PF0 the F0 feature of a query-set image, GF0C the F0C feature of a gallery-set image, and PF1C the F1C feature of a query-set image;
(6) applying weighted-sum fusion to all gallery-to-query similarities to obtain the final fused similarity S, and performing face identification and face verification on the similarity matrix formed by the fused similarities S.
2. The multi-modal face recognition method based on deep learning according to claim 1, characterized in that the neural network structural units include, but are not limited to, convolutional layers, normalization layers, non-linear layers, pooling layers, fully connected layers and distribution-normalization layers.
3. The multi-modal face recognition method based on deep learning according to claim 1, characterized in that in step (2), when training the N1 network, the loss layer may use softmax-with-loss or other loss layers.
4. The multi-modal face recognition method based on deep learning according to claim 1, characterized in that in step (3), when S0, S1, S2, ... are fed into N2 without distinguishing modalities, if their channel numbers differ, they may all be converted to single-channel images so that the channel numbers match, or a single channel may be replicated until the channel numbers match, before they are fed into the network structure for training.
5. The multi-modal face recognition method based on deep learning according to claim 1, characterized in that in step (4), the modality range of the gallery-set and query-set images lies within the modality range of the training set.
CN201710122193.1A 2017-03-02 2017-03-02 Multi-mode face recognition method based on deep learning Active CN106909905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710122193.1A CN106909905B (en) 2017-03-02 2017-03-02 Multi-mode face recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710122193.1A CN106909905B (en) 2017-03-02 2017-03-02 Multi-mode face recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN106909905A (en) 2017-06-30
CN106909905B (en) 2020-02-14

Family

ID=59186347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710122193.1A Active CN106909905B (en) 2017-03-02 2017-03-02 Multi-mode face recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN106909905B (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273872A (en) * 2017-07-13 2017-10-20 北京大学深圳研究生院 The depth discrimination net model methodology recognized again for pedestrian in image or video
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN107506822A (en) * 2017-07-26 2017-12-22 天津大学 A kind of deep neural network method based on Space integration pond
CN107844744A (en) * 2017-10-09 2018-03-27 平安科技(深圳)有限公司 With reference to the face identification method, device and storage medium of depth information
CN108197587A (en) * 2018-01-18 2018-06-22 中科视拓(北京)科技有限公司 A kind of method that multi-modal recognition of face is carried out by face depth prediction
CN108229497A (en) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program and electronic equipment
CN108229440A (en) * 2018-02-06 2018-06-29 北京奥开信息科技有限公司 One kind is based on Multi-sensor Fusion indoor human body gesture recognition method
CN108875777A (en) * 2018-05-03 2018-11-23 浙江大学 Kinds of fibers and blending rate recognition methods in textile fabric based on two-way neural network
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model
CN109284597A (en) * 2018-11-22 2019-01-29 北京旷视科技有限公司 A kind of face unlocking method, device, electronic equipment and computer-readable medium
CN109299639A (en) * 2017-07-25 2019-02-01 虹软(杭州)多媒体信息技术有限公司 A kind of method and apparatus for Expression Recognition
CN109583387A (en) * 2018-11-30 2019-04-05 龙马智芯(珠海横琴)科技有限公司 Identity identifying method and device
CN109815965A (en) * 2019-02-13 2019-05-28 腾讯科技(深圳)有限公司 A kind of image filtering method, device and storage medium
CN109934198A (en) * 2019-03-22 2019-06-25 北京市商汤科技开发有限公司 Face identification method and device
CN110009003A (en) * 2019-03-14 2019-07-12 北京旷视科技有限公司 Training method, the device and system of image procossing and image comparison model
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model
CN110110120A (en) * 2018-06-11 2019-08-09 北方工业大学 A kind of image search method and device based on deep learning
WO2019169942A1 (en) * 2018-03-09 2019-09-12 华南理工大学 Anti-angle and occlusion interference fast face recognition method
CN110458828A (en) * 2019-08-12 2019-11-15 广东工业大学 A kind of laser welding defect identification method and device based on multi-modal fusion network
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111316290A (en) * 2017-11-03 2020-06-19 通用电气公司 System and method for interactive representation learning migration through deep learning of feature ontologies
CN111369567A (en) * 2018-12-26 2020-07-03 腾讯科技(深圳)有限公司 Method and device for segmenting target object in three-dimensional image and electronic equipment
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN111611909A (en) * 2020-05-18 2020-09-01 桂林电子科技大学 Multi-subspace-domain self-adaptive face recognition method
CN111881706A (en) * 2019-11-27 2020-11-03 马上消费金融股份有限公司 Living body detection, image classification and model training method, device, equipment and medium
CN111937005A (en) * 2020-06-30 2020-11-13 北京小米移动软件有限公司 Biological feature recognition method, device, equipment and storage medium
CN112016524A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and medium
CN112149635A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Cross-modal face recognition model training method, device, equipment and storage medium
CN112464741A (en) * 2020-11-05 2021-03-09 马上消费金融股份有限公司 Face classification method, model training method, electronic device and storage medium
CN112926557A (en) * 2021-05-11 2021-06-08 北京的卢深视科技有限公司 Method for training multi-mode face recognition model and multi-mode face recognition method
CN113378984A (en) * 2021-07-05 2021-09-10 国药(武汉)医学实验室有限公司 Medical image classification method, system, terminal and storage medium
CN113903053A (en) * 2021-09-26 2022-01-07 厦门大学 Cross-modal pedestrian re-identification method based on unified intermediate modality
CN114627431A (en) * 2022-02-22 2022-06-14 安徽新识智能科技有限公司 Intelligent environment monitoring method and system based on Internet of things
CN114724041A (en) * 2022-06-02 2022-07-08 浙江天铂云科光电股份有限公司 Power equipment infrared chart identification method and system based on deep learning
CN118552794A (en) * 2024-07-25 2024-08-27 湖南军芃科技股份有限公司 Ore sorting identification method based on multichannel training and ore sorting machine

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530648A (en) * 2013-10-14 2014-01-22 四川空港知觉科技有限公司 Face recognition method based on multi-frame images
CN104966093A (en) * 2015-05-25 2015-10-07 苏州珂锐铁电气科技有限公司 Dynamic texture identification method based on deep neural networks
CN106446754A (en) * 2015-08-11 2017-02-22 阿里巴巴集团控股有限公司 Image identification method, metric learning method, image source identification method and devices
CN105913025A (en) * 2016-04-12 2016-08-31 湖北工业大学 Deep learning face identification method based on multiple-characteristic fusion
CN106339702A (en) * 2016-11-03 2017-01-18 北京星宇联合投资管理有限公司 Multi-feature fusion based face identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENGYI LIU 等: "Exploiting Feature Hierarchies with Convolutional Neural Networks for Cultural Event Recognition", 《2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS》 *
SHIQING ZHANG 等: "Multimodal Deep Convolutional Neural Network for Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition", 《ICMR"16》 *

Cited By (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273872B (en) * 2017-07-13 2020-05-05 北京大学深圳研究生院 Depth discrimination network model method for re-identification of pedestrians in image or video
CN107273872A (en) * 2017-07-13 2017-10-20 北京大学深圳研究生院 The depth discrimination net model methodology recognized again for pedestrian in image or video
WO2019010950A1 (en) * 2017-07-13 2019-01-17 北京大学深圳研究生院 Depth discrimination network model method for pedestrian re-recognition in image or video
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN109299639A (en) * 2017-07-25 2019-02-01 虹软(杭州)多媒体信息技术有限公司 A kind of method and apparatus for Expression Recognition
CN112861760A (en) * 2017-07-25 2021-05-28 虹软科技股份有限公司 Method and device for facial expression recognition
US11023715B2 (en) * 2017-07-25 2021-06-01 Arcsoft Corporation Limited Method and apparatus for expression recognition
CN107480206B (en) * 2017-07-25 2020-06-12 杭州电子科技大学 Multi-mode low-rank bilinear pooling-based image content question-answering method
CN107506822A (en) * 2017-07-26 2017-12-22 天津大学 A kind of deep neural network method based on Space integration pond
CN107506822B (en) * 2017-07-26 2021-02-19 天津大学 Deep neural network method based on space fusion pooling
CN108229497A (en) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program and electronic equipment
CN107844744A (en) * 2017-10-09 2018-03-27 平安科技(深圳)有限公司 With reference to the face identification method, device and storage medium of depth information
CN111316290B (en) * 2017-11-03 2024-01-12 通用电气公司 System and method for interactive representation learning migration through deep learning of feature ontologies
CN111316290A (en) * 2017-11-03 2020-06-19 通用电气公司 System and method for interactive representation learning migration through deep learning of feature ontologies
CN108197587B (en) * 2018-01-18 2021-08-03 中科视拓(北京)科技有限公司 Method for performing multi-mode face recognition through face depth prediction
CN108197587A (en) * 2018-01-18 2018-06-22 中科视拓(北京)科技有限公司 A kind of method that multi-modal recognition of face is carried out by face depth prediction
CN108229440A (en) * 2018-02-06 2018-06-29 北京奥开信息科技有限公司 One kind is based on Multi-sensor Fusion indoor human body gesture recognition method
US11417147B2 (en) 2018-03-09 2022-08-16 South China University Of Technology Angle interference resistant and occlusion interference resistant fast face recognition method
WO2019169942A1 (en) * 2018-03-09 2019-09-12 华南理工大学 Anti-angle and occlusion interference fast face recognition method
CN108875777B (en) * 2018-05-03 2022-03-15 浙江大学 Method for identifying fiber types and blending proportion in textile fabric based on double-path neural network
CN108875777A (en) * 2018-05-03 2018-11-23 浙江大学 Kinds of fibers and blending rate recognition methods in textile fabric based on two-way neural network
CN110110120B (en) * 2018-06-11 2021-05-25 北方工业大学 Image retrieval method and device based on deep learning
CN110110120A (en) * 2018-06-11 2019-08-09 北方工业大学 A kind of image search method and device based on deep learning
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model
CN108960337B (en) * 2018-07-18 2020-07-17 浙江大学 Multi-modal complex activity recognition method based on deep learning model
CN109284597A (en) * 2018-11-22 2019-01-29 北京旷视科技有限公司 A kind of face unlocking method, device, electronic equipment and computer-readable medium
CN109583387A (en) * 2018-11-30 2019-04-05 龙马智芯(珠海横琴)科技有限公司 Identity identifying method and device
CN111369567A (en) * 2018-12-26 2020-07-03 腾讯科技(深圳)有限公司 Method and device for segmenting target object in three-dimensional image and electronic equipment
CN111401107A (en) * 2019-01-02 2020-07-10 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN111401107B (en) * 2019-01-02 2023-08-18 上海大学 Multi-mode face recognition method based on feature fusion neural network
CN109815965B (en) * 2019-02-13 2021-07-06 腾讯科技(深圳)有限公司 Image filtering method and device and storage medium
CN109815965A (en) * 2019-02-13 2019-05-28 腾讯科技(深圳)有限公司 A kind of image filtering method, device and storage medium
CN110009003A (en) * 2019-03-14 2019-07-12 北京旷视科技有限公司 Training method, the device and system of image procossing and image comparison model
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model
WO2020192112A1 (en) * 2019-03-22 2020-10-01 北京市商汤科技开发有限公司 Facial recognition method and apparatus
CN109934198A (en) * 2019-03-22 2019-06-25 北京市商汤科技开发有限公司 Face identification method and device
CN109934198B (en) * 2019-03-22 2021-05-14 北京市商汤科技开发有限公司 Face recognition method and device
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN110458828B (en) * 2019-08-12 2023-02-10 广东工业大学 Laser welding defect identification method and device based on multi-mode fusion network
CN110458828A (en) * 2019-08-12 2019-11-15 广东工业大学 A kind of laser welding defect identification method and device based on multi-modal fusion network
CN111881706A (en) * 2019-11-27 2020-11-03 马上消费金融股份有限公司 Living body detection, image classification and model training method, device, equipment and medium
CN111611909A (en) * 2020-05-18 2020-09-01 桂林电子科技大学 Multi-subspace-domain self-adaptive face recognition method
WO2022000334A1 (en) * 2020-06-30 2022-01-06 北京小米移动软件有限公司 Biological feature recognition method and apparatus, and device and storage medium
CN111937005A (en) * 2020-06-30 2020-11-13 北京小米移动软件有限公司 Biological feature recognition method, device, equipment and storage medium
CN112016524A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and medium
CN112016524B (en) * 2020-09-25 2023-08-08 北京百度网讯科技有限公司 Model training method, face recognition device, equipment and medium
CN112149635A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Cross-modal face recognition model training method, device, equipment and storage medium
CN112464741B (en) * 2020-11-05 2021-11-26 马上消费金融股份有限公司 Face classification method, model training method, electronic device and storage medium
CN112464741A (en) * 2020-11-05 2021-03-09 马上消费金融股份有限公司 Face classification method, model training method, electronic device and storage medium
CN112926557A (en) * 2021-05-11 2021-06-08 北京的卢深视科技有限公司 Method for training multi-mode face recognition model and multi-mode face recognition method
CN113378984B (en) * 2021-07-05 2023-05-02 国药(武汉)医学实验室有限公司 Medical image classification method, system, terminal and storage medium
CN113378984A (en) * 2021-07-05 2021-09-10 国药(武汉)医学实验室有限公司 Medical image classification method, system, terminal and storage medium
CN113903053A (en) * 2021-09-26 2022-01-07 厦门大学 Cross-modal pedestrian re-identification method based on unified intermediate modality
CN113903053B (en) * 2021-09-26 2024-06-07 厦门大学 Cross-mode pedestrian re-identification method based on unified intermediate mode
CN114627431A (en) * 2022-02-22 2022-06-14 安徽新识智能科技有限公司 Intelligent environment monitoring method and system based on Internet of things
CN114724041A (en) * 2022-06-02 2022-07-08 浙江天铂云科光电股份有限公司 Power equipment infrared chart identification method and system based on deep learning
CN118552794A (en) * 2024-07-25 2024-08-27 湖南军芃科技股份有限公司 Ore sorting identification method based on multichannel training and ore sorting machine

Also Published As

Publication number Publication date
CN106909905B (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN106909905A (en) A kind of multi-modal face identification method based on deep learning
CN110111340A (en) The Weakly supervised example dividing method cut based on multichannel
JP6891351B2 (en) How to generate a human hairstyle based on multi-feature search and deformation
CN109145939B (en) Semantic segmentation method for small-target sensitive dual-channel convolutional neural network
CN113657349B (en) Human behavior recognition method based on multi-scale space-time diagram convolutional neural network
CN107766850B (en) Face recognition method based on combination of face attribute information
CN108182441B (en) Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN105069400B (en) Facial image gender identifying system based on the sparse own coding of stack
CN109711281A (en) A kind of pedestrian based on deep learning identifies again identifies fusion method with feature
CN111310668B (en) Gait recognition method based on skeleton information
CN105095870B (en) Pedestrian based on transfer learning recognition methods again
CN105373777A (en) Face recognition method and device
CN113297955B (en) Sign language word recognition method based on multi-mode hierarchical information fusion
CN111291604A (en) Face attribute identification method, device, storage medium and processor
CN113255728A (en) Depression classification method based on map embedding and multi-modal brain network
CN109815826A (en) The generation method and device of face character model
CN109299701A (en) Expand the face age estimation method that more ethnic group features cooperate with selection based on GAN
KR20200075114A (en) System and Method for Matching Similarity between Image and Text
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN110046550A (en) Pedestrian's Attribute Recognition system and method based on multilayer feature study
CN109409240A (en) A kind of SegNet remote sensing images semantic segmentation method of combination random walk
CN110321862B (en) Pedestrian re-identification method based on compact ternary loss
CN109558902A (en) A kind of fast target detection method
CN104240256A (en) Image salient detecting method based on layering sparse modeling
CN114613013A (en) End-to-end human behavior recognition method and model based on skeleton nodes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant