CN114398611A - Bimodal identity authentication method, device and storage medium - Google Patents

Bimodal identity authentication method, device and storage medium

Info

Publication number
CN114398611A
CN114398611A (application CN202111640915.5A)
Authority
CN
China
Prior art keywords
face
identity authentication
feature vector
bimodal
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111640915.5A
Other languages
Chinese (zh)
Inventor
蔡晓东 (Cai Xiaodong)
周青松 (Zhou Qingsong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin Topintelligent Communication Technology Co ltd
Original Assignee
Guilin Topintelligent Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin Topintelligent Communication Technology Co ltd filed Critical Guilin Topintelligent Communication Technology Co ltd
Priority to CN202111640915.5A
Publication of CN114398611A
Pending legal status (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31 User authentication
    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches

Abstract

The invention provides a bimodal identity authentication method, device and storage medium in the technical field of image processing. The bimodal identity authentication method comprises the following steps: S1: importing face images and voice data; S2: performing picture feature analysis on each face image to obtain face feature vectors; S3: performing voice feature analysis on each piece of voice data to obtain voiceprint feature vectors; S4: constructing a training model and training it on all face feature vectors and all voiceprint feature vectors to obtain a bimodal identity authentication model; S5: importing the face image and voice data to be authenticated and performing identity authentication on them through the bimodal identity authentication model to obtain an identity authentication result. The method lets the feature information of the two modalities complement each other, effectively compensating for the susceptibility of single-modality biometric authentication to spoofing attacks, environmental noise and the like, and further improving recognition accuracy.

Description

Bimodal identity authentication method, device and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a bimodal identity authentication method, device and storage medium.
Background
Existing face recognition and speech recognition technologies are mature, but these single-modality authentication techniques still have many limitations: face recognition is easily affected by occlusion, viewing angle, illumination and posture changes, while voice recognition is easily affected by ambient noise and by changes in the user's physical condition, so single-modality authentication performs poorly in certain scenarios. More challenging still, many spoofing and jamming techniques now target face recognition or voiceprint recognition, and common single-modality identity authentication methods often cannot withstand such targeted attacks; once a system is attacked or counterfeited by criminals, serious losses to people's lives and property can easily result.
Disclosure of Invention
The invention provides a bimodal identity authentication method, a bimodal identity authentication device and a storage medium to address the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a bimodal identity authentication method comprises the following steps:
s1: importing a plurality of training data, wherein each training data comprises a face image and voice data;
s2: respectively carrying out picture characteristic analysis on the face images in the training data to obtain face characteristic vectors;
s3: respectively carrying out voice feature analysis on voice data in the training data to obtain voiceprint feature vectors;
s4: constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
s5: and importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and authenticating the identity of the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
Another technical solution of the present invention for solving the above technical problems is as follows: a bimodal identity authentication apparatus comprising:
the data import module is used for importing a plurality of training data, and each training data comprises a face image and voice data;
the image feature analysis module is used for respectively carrying out image feature analysis on the face images in the training data to obtain face feature vectors;
the voice feature analysis module is used for respectively carrying out voice feature analysis on the voice data in the training data to obtain voiceprint feature vectors;
the model training module is used for constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
and the identity authentication result obtaining module is used for importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and performing identity authentication on the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
Another technical solution of the present invention for solving the above technical problems is as follows: a bimodal identity authentication device comprises a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the bimodal identity authentication method described above is implemented.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a bimodal identity authentication method as described above.
The invention has the following beneficial effects: face feature vectors are obtained through picture feature analysis of the face images in the training data, voiceprint feature vectors are obtained through voice feature analysis of the voice data in the training data, a bimodal identity authentication model is obtained by training the training model on all face feature vectors and all voiceprint feature vectors, and an identity authentication result is obtained by authenticating the face image and voice data to be authenticated through the bimodal identity authentication model. The feature information of the two modalities complements each other, which effectively compensates for the susceptibility of single-modality biometric authentication to spoofing attacks, environmental noise and the like, and further improves recognition accuracy.
Drawings
Fig. 1 is a schematic flowchart of a bimodal identity authentication method according to an embodiment of the present invention;
fig. 2 is a block diagram of a bimodal identity authentication apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a bimodal identity authentication method according to an embodiment of the present invention.
As shown in fig. 1, a bimodal identity authentication method includes the following steps:
s1: importing a plurality of training data, wherein each training data comprises a face image and voice data;
s2: respectively carrying out picture characteristic analysis on the face images in the training data to obtain face characteristic vectors;
s3: respectively carrying out voice feature analysis on voice data in the training data to obtain voiceprint feature vectors;
s4: constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
s5: and importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and authenticating the identity of the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
It should be understood that the face image may be picture data containing a human face, and the voice data may be audio recordings of the user's speech.
It should be understood that the face feature vector and the voiceprint feature vector represent face and voiceprint feature information, respectively.
In the above embodiment, the face feature vectors are obtained by picture feature analysis of the face image in each training datum, the voiceprint feature vectors are obtained by voice feature analysis of the voice data in each training datum, the bimodal identity authentication model is obtained by training the training model on all face feature vectors and all voiceprint feature vectors, and the identity authentication result is obtained by authenticating the face image and voice data to be authenticated through the bimodal identity authentication model.
Optionally, as an embodiment of the present invention, the step S2 process includes:
respectively carrying out face detection on the face images in the training data based on an MTCNN model to obtain detected face images corresponding to the face images;
and respectively carrying out picture feature extraction on each detected face picture based on a FaceNet model to obtain the face feature vector corresponding to each face picture.
It should be understood that MTCNN stands for Multi-Task Cascaded Convolutional Neural Network and consists of three cascaded lightweight CNNs: P-Net, R-Net and O-Net. Image data passes through the three networks in sequence, and face detection results and facial key point detection results are finally output.
It should be understood that FaceNet is a face recognition model whose main idea is to map a face image into a multidimensional embedding space and to represent facial similarity by spatial distance: images of the same face lie close together, while images of different faces lie far apart. Face recognition can therefore be realized through this spatial mapping of face images.
It should be understood that the FaceNet model may be replaced with other face recognition models such as InsightFace.
Specifically, the picture containing a face (i.e., the face image) first undergoes face detection and alignment through the existing MTCNN model and is then fed into the existing FaceNet model for feature extraction, which yields a feature vector $e_f \in \mathbb{R}^{d_1}$ (i.e., the face feature vector) representing the face feature information, where $d_1$ denotes the vector dimension.
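By way of illustration, a minimal sketch of this detection-alignment-and-embedding pipeline is given below. It assumes the open-source facenet-pytorch package, which bundles an MTCNN implementation and a FaceNet-style InceptionResnetV1 backbone; the package choice, the pretrained weights and the file name face.jpg are illustrative assumptions, not details fixed by the invention.

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image

# Face detector/aligner (the P-Net / R-Net / O-Net cascade) and the embedding network.
mtcnn = MTCNN(image_size=160)
facenet = InceptionResnetV1(pretrained='vggface2').eval()

img = Image.open('face.jpg')  # hypothetical input file
aligned = mtcnn(img)          # detected and aligned face crop, or None if no face is found
if aligned is not None:
    with torch.no_grad():
        e_f = facenet(aligned.unsqueeze(0))  # face feature vector e_f (512-dimensional here)
```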
In the above embodiment, performing picture feature analysis on the face image in each training datum yields the face feature vectors, so accurate face features can be extracted, providing a basis for subsequent data processing and improving recognition accuracy, while also offering strong extensibility and a wide range of applications.
Optionally, as an embodiment of the present invention, the process of step S3 includes:
respectively preprocessing the voice data in the training data to obtain processed voice data corresponding to the face images;
and respectively carrying out voice feature extraction on each processed voice data based on an x-vector model to obtain a voiceprint feature vector corresponding to each face image.
It should be understood that the x-vector model is currently a mainstream baseline framework in the field of voiceprint recognition; thanks to the statistics pooling layer in its network, it can accept input of any length and convert it into a fixed-length feature representation. In addition, a data augmentation strategy with added noise and reverberation is applied during training, making the model more robust to interference such as noise and reverberation.
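By way of illustration, a minimal sketch of such a statistics pooling layer is given below, assuming frame-level features arranged as a (batch, channels, time) tensor; the layout and the PyTorch implementation are illustrative assumptions, not details fixed by the invention.

```python
import torch

def statistics_pooling(frames):
    # frames: (batch, channels, time) frame-level features of arbitrary duration.
    mean = frames.mean(dim=2)              # per-channel mean over time
    std = frames.std(dim=2)                # per-channel standard deviation over time
    return torch.cat([mean, std], dim=1)   # fixed-length utterance-level vector
```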
It should be understood that the x-vector model can be replaced by other voiceprint recognition models such as i-vector.
Specifically, the audio containing the voiceprint information (i.e., the voice data) first undergoes preprocessing such as framing, windowing and pre-emphasis, which removes adverse effects such as aliasing, higher-harmonic distortion and high-frequency attenuation introduced by the human vocal organs and by the equipment that captures the speech signal. It is then fed into the existing x-vector model for feature extraction, which yields a feature vector $e_v$ (i.e., the voiceprint feature vector) representing the voiceprint feature information and corresponding to the picture.
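By way of illustration, a minimal NumPy sketch of this preprocessing stage is given below; the frame length, hop size and pre-emphasis coefficient are assumed values (25 ms frames with a 10 ms hop at 16 kHz and a coefficient of 0.97 are common defaults) and are not fixed by the invention.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis: boost high frequencies attenuated by vocal organs and equipment.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping short-time frames
    # (assumes the signal is at least frame_len samples long).
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)
```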
In the above embodiment, preprocessing each piece of voice data yields the processed voice data, and extracting voice features from each processed piece based on the x-vector model yields the voiceprint feature vectors, providing a basis for subsequent data processing and improving recognition accuracy, while also offering strong extensibility and a wide range of applications.
Optionally, as an embodiment of the present invention, the process of step S4 includes:
s41: constructing a training model, and respectively carrying out fusion analysis on each face feature vector and the voiceprint feature vector corresponding to each face image to obtain a global feature vector corresponding to each face image;
s42: respectively carrying out normalization processing on each global feature vector to obtain a predicted value corresponding to each face image;
s43: importing the picture real values corresponding to the face images, and calculating a loss value between each predicted value and the corresponding picture real value to obtain the loss value corresponding to each face image;
s44: and updating the parameters of the training model by using a back-propagation algorithm, a gradient descent algorithm and the loss values, and returning to the step S1 until a preset number of iterations is reached, finally obtaining the bimodal identity authentication model.
It should be understood that the model obtained by training after fusing the information of the two modalities (i.e., the bimodal identity authentication model) is more robust and fault-tolerant than a single-modality identity authentication model and has more accurate recognition capability.
Specifically, each predicted value $\hat{y}$ is compared against the corresponding imported sample ground truth $y$ (i.e., the picture real value) by the cross-entropy loss function to obtain the loss values; all learnable parameters involved in steps S1 to S4 are then iteratively updated through the back-propagation mechanism and gradient descent until the loss is minimized, which completes the training of the whole model.
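By way of illustration, a schematic PyTorch version of this optimization loop is given below; the stand-in model architecture, the feature dimension of 512, the 100 identity classes and the iteration count are placeholder assumptions, and the synthetic tensors exist only to make the sketch runnable.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins: fused feature vectors and their identity labels.
features = torch.randn(64, 512)
labels = torch.randint(0, 100, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=16)

# Placeholder for the training model mapping features to per-identity scores.
model = nn.Sequential(nn.Linear(512, 256), nn.Tanh(), nn.Linear(256, 100))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent
loss_fn = nn.CrossEntropyLoss()  # cross-entropy between prediction and ground truth

for epoch in range(10):  # preset number of iterations (assumed value)
    for x, y in loader:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()    # back-propagation of the loss
        optimizer.step()   # parameter update to minimize the loss
```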
In this embodiment, the bimodal identity authentication model is obtained by training the training model on all face feature vectors and all voiceprint feature vectors; it is more robust and fault-tolerant than a single-modality authentication model, has more accurate recognition capability, lets the feature information of the two modalities complement each other, and effectively compensates for the susceptibility of single-modality biometric authentication to spoofing attacks, environmental noise and the like.
Optionally, as an embodiment of the present invention, in S41, the process of performing fusion analysis on each face feature vector and the voiceprint feature vector corresponding to each face image respectively to obtain a global feature vector corresponding to each face image includes:
calculating the face hidden feature vector of each face feature vector through a first formula to obtain the face hidden feature vector corresponding to each face image, wherein the first formula is:
$h_f = \tanh(w_f e_f + b_f)$,
where $h_f$ is the face hidden feature vector, $\tanh$ is the tanh activation function, $w_f$ is the learnable weight matrix that transforms the face feature vector $e_f$, $b_f$ is the corresponding bias term, and $e_f$ is the face feature vector;
calculating the voiceprint hidden feature vector of each voiceprint feature vector through a second formula to obtain the voiceprint hidden feature vector corresponding to each face image, wherein the second formula is:
$h_v = \tanh(w_v e_v + b_v)$,
where $h_v$ is the voiceprint hidden feature vector, $\tanh$ is the tanh activation function, $w_v$ is the learnable weight matrix that transforms the voiceprint feature vector $e_v$, $e_v$ is the voiceprint feature vector, and $b_v$ is the corresponding bias term;
calculating the gating vector from each face feature vector and the voiceprint feature vector corresponding to each face image through a third formula to obtain the gating vector corresponding to each face image, wherein the third formula is:
$z = \sigma(w_1 e_f + w_2 e_v)$,
where $z$ is the gating vector, $\sigma$ is the sigmoid activation function, $w_1$ is the learnable weight matrix for the face feature vector $e_f$, $w_2$ is the learnable weight matrix for the voiceprint feature vector $e_v$, $e_f$ is the face feature vector, and $e_v$ is the voiceprint feature vector;
calculating the global feature vector from each face feature vector, the voiceprint feature vector corresponding to each face image and the gating vector corresponding to each face image through a fourth formula to obtain the global feature vector corresponding to each face image, wherein the fourth formula is:
$h_G = z h_f + (1 - z) h_v$,
where $h_G$ is the global feature vector, $z$ is the gating vector, $h_f$ is the face hidden feature vector, and $h_v$ is the voiceprint hidden feature vector.
It should be understood that, according to formula (1) (i.e., the first formula), the face feature vector $e_f$ undergoes a nonlinear transformation that maps it from its original vector space into a new vector space $S_P$. Formula (1) is $h_f = \tanh(w_f e_f + b_f)$, where $h_f$ is the hidden feature vector after the nonlinear transformation, $\tanh$ is the tanh activation function, $w_f$ is the learnable weight matrix that transforms $e_f$, and $b_f$ is the corresponding bias term.
It should be understood that, according to formula (2) (i.e., the second formula), the voiceprint feature vector $e_v$ undergoes a nonlinear transformation that maps it from its original vector space into the same vector space $S_P$ as the face hidden feature vector. Formula (2) is $h_v = \tanh(w_v e_v + b_v)$, where $h_v$ is the hidden feature vector after the nonlinear transformation, $w_v$ is the learnable weight matrix that transforms $e_v$, and $b_v$ is the corresponding bias term.
Specifically, according to formula (3) (i.e., the third formula), the face feature vector $e_f$ and the voiceprint feature vector $e_v$ undergo transformation, addition and activation operations. Formula (3) is $z = \sigma(w_1 e_f + w_2 e_v)$, where $z$ is the resulting gating vector (its values lie in the range 0 to 1), $\sigma$ is the sigmoid activation function, and $w_1$ and $w_2$ are the respective learnable weight matrices.
Specifically, according to formula (4) (i.e., the fourth formula), the gating vector $z$ fuses the hidden feature vectors $h_f$ (i.e., the face hidden feature vector) and $h_v$ (i.e., the voiceprint hidden feature vector) within the same vector space $S_P$. Formula (4) is $h_G = z h_f + (1 - z) h_v$, where $h_G$ is the global feature vector, representing the fused joint face-and-voiceprint features of the same person.
It should be understood that the gating vector $z$ controls the contribution of the different features to the overall output; that is, it adaptively adjusts the weights of the face and voiceprint features within the global feature, complementing the information of the two modalities to obtain a more discriminative global feature and thereby judge the user's identity more accurately. Even if one modality's information fails, the other modality can still work smoothly, with $z$ degenerating toward an all-0 or all-1 vector.
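By way of illustration, the gated fusion of formulas (1) to (4) can be written as a single module; the PyTorch sketch below follows the formulas directly, with the dimensions of the face, voiceprint and shared spaces left as constructor arguments since the invention does not fix their values.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d1, d2, dp):
        super().__init__()
        self.wf = nn.Linear(d1, dp)              # w_f and b_f of formula (1)
        self.wv = nn.Linear(d2, dp)              # w_v and b_v of formula (2)
        self.w1 = nn.Linear(d1, dp, bias=False)  # w_1 of formula (3)
        self.w2 = nn.Linear(d2, dp, bias=False)  # w_2 of formula (3)

    def forward(self, e_f, e_v):
        h_f = torch.tanh(self.wf(e_f))                  # formula (1): face hidden vector
        h_v = torch.tanh(self.wv(e_v))                  # formula (2): voiceprint hidden vector
        z = torch.sigmoid(self.w1(e_f) + self.w2(e_v))  # formula (3): gating vector
        return z * h_f + (1 - z) * h_v                  # formula (4): global feature h_G

# Example: fuse a 512-d face vector with a 512-d voiceprint vector into a 256-d h_G.
fusion = GatedFusion(d1=512, d2=512, dp=256)
h_G = fusion(torch.randn(4, 512), torch.randn(4, 512))
```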
In this embodiment, the global feature vectors are obtained by fusing each face feature vector with the corresponding voiceprint feature vector, so that the feature information of the two modalities complements each other, the user's identity is judged accurately, and the susceptibility of single-modality biometric authentication to spoofing attacks, environmental noise and the like is effectively compensated for.
Optionally, as an embodiment of the present invention, the process of step S42 includes:
performing normalization calculation on each global feature vector through a fifth formula to obtain the predicted value corresponding to each face image, wherein the fifth formula is:
$\hat{y} = \mathrm{AMSoftmax}(h_G)$,
where $h_G$ is the global feature vector and $\hat{y}$ is the predicted value.
It should be understood that AMSoftmax stands for Additive Margin softmax, an improvement on the original softmax function that makes intra-class samples more compact and inter-class samples more separated during classification, achieving a better classification effect.
In the above embodiment, performing normalization calculation on each global feature vector through the fifth formula yields the predicted values, making intra-class samples more compact and inter-class samples more separated during classification and thereby achieving a better classification effect.
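By way of illustration, a common form of the additive-margin softmax classification head is sketched below; the scale of 30.0 and margin of 0.35 are typical values from the AM-softmax literature, not values stated in the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    def __init__(self, dp, n_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dp, n_classes))
        self.s, self.m = s, m

    def forward(self, h_g, labels):
        # Cosine similarity between the normalized global feature and class weights.
        cos = F.normalize(h_g, dim=1) @ F.normalize(self.weight, dim=0)
        # Subtract the additive margin m from the target-class cosine only.
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return self.s * (cos - margin)  # scaled logits, fed to cross-entropy loss
```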
Optionally, as an embodiment of the present invention, in step S43, the process of calculating the loss value between each predicted value and the corresponding picture real value to obtain the loss value corresponding to each face image includes:
calculating the loss value between each predicted value and the corresponding picture real value through a sixth formula to obtain the loss value corresponding to each face image, wherein the sixth formula is:
$L = -\sum_i y_i \log \hat{y}_i$,
where $y$ is the picture real value, $\hat{y}$ is the predicted value, and $L$ is the loss value.
It should be understood that the sixth formula is a loss function used to evaluate the degree of inconsistency between the model's predicted value and the true value; training or optimizing the model is precisely the process of minimizing this loss function, and the smaller the loss, the closer the model's predictions are to the true values and the more robust the model.
In this embodiment, the loss values are obtained by evaluating each predicted value against the corresponding picture real value through the sixth formula, so that better parameters can be learned and the recognition accuracy of the model improved.
Optionally, as another embodiment, the invention comprises an information acquisition module, an information processing module, an information base module and a control module.
The information acquisition module is used for acquiring the face image and the speaking voice of the verified user and sending the face image and the speaking voice into the information processing module;
the information processing module is used for processing the face image and the speaking voice acquired by the information acquisition module, and particularly, the method of the invention is used for carrying out data preprocessing, feature extraction and feature fusion operation to obtain a global feature vector representing the face and voice print bimodal information of the user and then sending the global feature vector into the information base module;
the information base module is a database and is mainly used for storing the global feature vector and the corresponding user real identity information which are obtained by the information acquisition module and the information processing module when each user registers an account. Each user has one and only one global feature vector, which can be regarded as an electronic tag for identifying the identity ID of the user in the system. And the module supports feature vector similarity comparison, when the identity of a user is verified, the global feature vector obtained by the user through the information acquisition module and the information processing module and all feature vectors in the database are subjected to cosine similarity calculation to obtain a similarity score, and after the feature vector with the highest similarity to the global feature vector in the database is found, whether the user is a registered user is further judged according to a set threshold value. If the similarity score is larger than the threshold value, representing by a verification signal 1; below the threshold value is indicated by a verification signal 0. Finally, transmitting a verification signal 1 or 0 to the control module;
the control module is used for receiving the verification signal of the information base, making a decision and controlling whether the corresponding equipment allows the verification user to pass. A value of 1 indicates that the identity of the authenticated user is valid, the pass is allowed, and a value of 0 indicates that the pass and subsequent operations are not allowed.
Fig. 2 is a block diagram of a bimodal identity authentication apparatus according to an embodiment of the present invention.
Optionally, as another embodiment of the present invention, as shown in fig. 2, a bimodal identity authentication apparatus includes:
the data import module is used for importing a plurality of training data, and each training data comprises a face image and voice data;
the image feature analysis module is used for respectively carrying out image feature analysis on the face images in the training data to obtain face feature vectors;
the voice feature analysis module is used for respectively carrying out voice feature analysis on the voice data in the training data to obtain voiceprint feature vectors;
the model training module is used for constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
and the identity authentication result obtaining module is used for importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and performing identity authentication on the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
Optionally, another embodiment of the present invention provides a bimodal identity authentication device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the bimodal identity authentication method described above is implemented. The device may be a computer or the like.
Optionally, another embodiment of the invention provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the bimodal identity authentication method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A bimodal identity authentication method is characterized by comprising the following steps:
s1: importing a plurality of training data, wherein each training data comprises a face image and voice data;
s2: respectively carrying out picture characteristic analysis on the face images in the training data to obtain face characteristic vectors;
s3: respectively carrying out voice feature analysis on voice data in the training data to obtain voiceprint feature vectors;
s4: constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
s5: and importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and authenticating the identity of the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
2. The bimodal identity authentication method according to claim 1, wherein the step S2 procedure comprises:
respectively carrying out face detection on the face images in the training data based on an MTCNN model to obtain detected face images corresponding to the face images;
and respectively carrying out picture feature extraction on each detected face picture based on a FaceNet model to obtain the face feature vector corresponding to each face picture.
3. The bimodal identity authentication method according to claim 2, wherein the process of the step S3 includes:
respectively preprocessing the voice data in the training data to obtain processed voice data corresponding to the face images;
and respectively carrying out voice feature extraction on each processed voice data based on an x-vector model to obtain a voiceprint feature vector corresponding to each face image.
4. The bimodal identity authentication method according to claim 3, wherein the process of the step S4 includes:
s41: constructing a training model, and respectively carrying out fusion analysis on each face feature vector and the voiceprint feature vector corresponding to each face image to obtain a global feature vector corresponding to each face image;
s42: respectively carrying out normalization processing on each global feature vector to obtain a predicted value corresponding to each face image;
s43: importing the picture real values corresponding to the face images, and calculating a loss value between each predicted value and the corresponding picture real value to obtain the loss value corresponding to each face image;
s44: and updating the parameters of the training model by using a back-propagation algorithm, a gradient descent algorithm and the loss values, and returning to the step S1 until a preset number of iterations is reached, finally obtaining the bimodal identity authentication model.
5. The bimodal identity authentication method according to claim 4, wherein in the step S41, the process of performing fusion analysis on each face feature vector and the voiceprint feature vector corresponding to each face image to obtain the global feature vector corresponding to each face image comprises:
calculating the face hidden feature vector of each face feature vector through a first formula to obtain the face hidden feature vector corresponding to each face image, wherein the first formula is:
$h_f = \tanh(w_f e_f + b_f)$,
where $h_f$ is the face hidden feature vector, $\tanh$ is the tanh activation function, $w_f$ is the learnable weight matrix that transforms the face feature vector $e_f$, $b_f$ is the corresponding bias term, and $e_f$ is the face feature vector;
calculating the voiceprint hidden feature vector of each voiceprint feature vector through a second formula to obtain the voiceprint hidden feature vector corresponding to each face image, wherein the second formula is:
$h_v = \tanh(w_v e_v + b_v)$,
where $h_v$ is the voiceprint hidden feature vector, $\tanh$ is the tanh activation function, $w_v$ is the learnable weight matrix that transforms the voiceprint feature vector $e_v$, $e_v$ is the voiceprint feature vector, and $b_v$ is the corresponding bias term;
calculating the gating vector from each face feature vector and the voiceprint feature vector corresponding to each face image through a third formula to obtain the gating vector corresponding to each face image, wherein the third formula is:
$z = \sigma(w_1 e_f + w_2 e_v)$,
where $z$ is the gating vector, $\sigma$ is the sigmoid activation function, $w_1$ is the learnable weight matrix for the face feature vector $e_f$, $w_2$ is the learnable weight matrix for the voiceprint feature vector $e_v$, $e_f$ is the face feature vector, and $e_v$ is the voiceprint feature vector;
calculating the global feature vector from each face feature vector, the voiceprint feature vector corresponding to each face image and the gating vector corresponding to each face image through a fourth formula to obtain the global feature vector corresponding to each face image, wherein the fourth formula is:
$h_G = z h_f + (1 - z) h_v$,
where $h_G$ is the global feature vector, $z$ is the gating vector, $h_f$ is the face hidden feature vector, and $h_v$ is the voiceprint hidden feature vector.
6. The bimodal identity authentication method according to claim 4, wherein the process of the step S42 includes:
performing normalization calculation on each global feature vector through a fifth formula to obtain the predicted value corresponding to each face image, wherein the fifth formula is:
$\hat{y} = \mathrm{AMSoftmax}(h_G)$,
where $h_G$ is the global feature vector and $\hat{y}$ is the predicted value.
7. The bimodal identity authentication method according to claim 4, wherein in step S43, the process of calculating the loss value between each predicted value and the corresponding picture real value to obtain the loss value corresponding to each face image comprises:
calculating the loss value between each predicted value and the corresponding picture real value through a sixth formula to obtain the loss value corresponding to each face image, wherein the sixth formula is:
$L = -\sum_i y_i \log \hat{y}_i$,
where $y$ is the picture real value, $\hat{y}$ is the predicted value, and $L$ is the loss value.
8. A bimodal identity authentication apparatus, comprising:
the data import module is used for importing a plurality of training data, and each training data comprises a face image and voice data;
the image feature analysis module is used for respectively carrying out image feature analysis on the face images in the training data to obtain face feature vectors;
the voice feature analysis module is used for respectively carrying out voice feature analysis on the voice data in the training data to obtain voiceprint feature vectors;
the model training module is used for constructing a training model, and training the training model according to all the face feature vectors and all the voiceprint feature vectors to obtain a bimodal identity authentication model;
and the identity authentication result obtaining module is used for importing data to be authenticated, wherein the data to be authenticated comprises a face image to be authenticated and voice data to be authenticated, and performing identity authentication on the face image to be authenticated and the voice data to be authenticated through the bimodal identity authentication model to obtain an identity authentication result.
9. A bimodal identity authentication system comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that when said processor executes said computer program, the bimodal identity authentication method as claimed in any one of claims 1 to 7 is implemented.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the bimodal identity authentication method as claimed in any one of claims 1 to 7.
CN202111640915.5A 2021-12-29 2021-12-29 Bimodal identity authentication method, device and storage medium Pending CN114398611A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111640915.5A CN114398611A (en) 2021-12-29 2021-12-29 Bimodal identity authentication method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111640915.5A CN114398611A (en) 2021-12-29 2021-12-29 Bimodal identity authentication method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114398611A 2022-04-26

Family

ID=81228583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111640915.5A Pending CN114398611A (en) 2021-12-29 2021-12-29 Bimodal identity authentication method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114398611A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397089A (en) * 2019-08-19 2021-02-23 中国科学院自动化研究所 Method and device for identifying identity of voice speaker, computer equipment and storage medium
CN112397089B (en) * 2019-08-19 2023-07-04 中国科学院自动化研究所 Speech generator identity recognition method, device, computer equipment and storage medium
CN117576763A (en) * 2024-01-11 2024-02-20 杭州世平信息科技有限公司 Identity recognition method and system based on voiceprint information and face information in cloud environment

Similar Documents

Publication Title
KR102239129B1 (en) End-to-end speaker recognition using deep neural network
Gonzalez-Rodriguez et al. Bayesian analysis of fingerprint, face and signature evidences with automatic biometric systems
WO2017215558A1 (en) Voiceprint recognition method and device
Chen et al. Robust deep feature for spoofing detection—The SJTU system for ASVspoof 2015 challenge
Aronowitz et al. Multi-modal biometrics for mobile authentication
AU2019200711B2 (en) Biometric verification
Khoury et al. Bi-modal biometric authentication on mobile phones in challenging conditions
CN114398611A (en) Bimodal identity authentication method, device and storage medium
CN111932269B (en) Equipment information processing method and device
CN108269575B (en) Voice recognition method for updating voiceprint data, terminal device and storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
WO2019200744A1 (en) Self-updated anti-fraud method and apparatus, computer device and storage medium
CN108875463B (en) Multi-view vector processing method and device
CN113488073B (en) Fake voice detection method and device based on multi-feature fusion
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
Zhou et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features
CN110111798B (en) Method, terminal and computer readable storage medium for identifying speaker
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
Sholokhov et al. Voice biometrics security: Extrapolating false alarm rate via hierarchical Bayesian modeling of speaker verification scores
CN115293235A (en) Method for establishing risk identification model and corresponding device
CN111028847B (en) Voiceprint recognition optimization method based on back-end model and related device
Zhang et al. A highly stealthy adaptive decay attack against speaker recognition
CN114168788A (en) Audio audit processing method, device, equipment and storage medium
Bartuzi et al. Mobibits: Multimodal mobile biometric database
CN110991228A (en) Improved PCA face recognition algorithm resistant to illumination influence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination