CN117852007A - Identity authentication method, device, equipment and storage medium integrating human face and voiceprint - Google Patents
- Publication number
- CN117852007A CN117852007A CN202311814587.5A CN202311814587A CN117852007A CN 117852007 A CN117852007 A CN 117852007A CN 202311814587 A CN202311814587 A CN 202311814587A CN 117852007 A CN117852007 A CN 117852007A
- Authority
- CN
- China
- Prior art keywords
- face
- voiceprint
- score
- authenticated
- authentication
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Abstract
The invention provides an identity authentication method, device, equipment and storage medium that fuse a human face and a voiceprint, wherein the method comprises the following steps: acquiring a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated; determining a face authentication score of the face image to be authenticated and a voiceprint authentication score of the voiceprint to be authenticated; and inputting the face authentication score and the voiceprint authentication score into a fully trained score fusion deep learning model to obtain a fusion score, and determining, based on the fusion score, whether the person to be authenticated passes authentication. Because the invention determines the face authentication score and the voiceprint authentication score independently, it removes the dependence between the two biometric features during authentication and improves the precision and accuracy of identity authentication. Fusing the two scores with the score fusion deep learning model further exploits the complementary strengths of the different biometric features and further improves the precision and accuracy of identity authentication.
Description
Technical Field
The invention relates to the technical field of identity authentication, and in particular to an identity authentication method, device, equipment and storage medium that fuse a human face and a voiceprint.
Background
In recent years, artificial intelligence technologies and their applications have developed rapidly, and identity authentication backed by artificial intelligence has become widespread, with research results applied in fields such as access control, company clock-in and sign-in, and criminal investigation by public security departments. Among these, face recognition and voiceprint recognition are the most representative technologies. However, identification based on a single biometric feature meets significant resistance in practical deployment because of safety and applicability issues. As a result, identification techniques that fuse multiple biometric features have become a research hotspot.
A typical multi-biometric fusion technique in the prior art works as follows: P1 and P2 are two biometric images, and R1 and R2 are the biometric templates corresponding to P1 and P2 respectively. During identification, P1 is compared with R1 to obtain a comparison score S1. If S1 is larger than a threshold T1, the comparison threshold corresponding to P2 is lowered to obtain an adjusted comparison threshold T2; conversely, if S1 is smaller than T1, the comparison threshold T2 is raised. This prior art has the following technical problem: the two biometric recognition processes depend on each other, whereas the features of different biometrics are in fact independent, so the accuracy of identity authentication is low.
Therefore, there is a need for an identity authentication method, device, equipment and storage medium that fuse a face and a voiceprint, so as to solve the above technical problems.
Disclosure of Invention
In view of the foregoing, it is necessary to provide an identity authentication method, device, equipment and storage medium that fuse a face and a voiceprint, so as to solve the technical problem of low identity authentication accuracy in the prior art.
In one aspect, the invention provides an identity authentication method for fusing a human face and voiceprints, which comprises the following steps:
acquiring a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated;
determining a face authentication score of the face image to be authenticated, and determining a voiceprint authentication score of the voiceprint to be authenticated;
and inputting the face authentication score and the voiceprint authentication score into a fully trained score fusion deep learning model to obtain a fusion score, and determining, based on the fusion score, whether the person to be authenticated passes authentication.
In some possible implementations, before inputting the face authentication score and the voiceprint authentication score into the fully trained score fusion deep learning model, the method further includes:
acquiring a training set, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a face authentication sample and a voiceprint authentication sample corresponding to the face authentication sample;
and training an initial score fusion deep learning model based on the training set and a preset loss function; when the loss value of the loss function is smaller than a loss threshold, training of the initial score fusion deep learning model is complete, and the score fusion deep learning model is obtained.
In some possible implementations, the loss function is:
in the method, in the process of the invention,loss value for the t-th iteration; s is(s) it Face authentication score or voiceprint authentication score for the t-th iteration, face authentication score when i=1, voiceprint authentication score when i=2; n=2; a is a bias term; b i Is a coefficient.
In some possible implementations, the acquiring the face image to be authenticated of the person to be authenticated includes:
acquiring a test face video of a person to be authenticated;
sampling the test face video with a preset sampling step length to obtain a plurality of face video frames;
and inputting each face video frame into a Retinaface model to obtain the face image to be authenticated.
In some possible implementations, the obtaining the voiceprint to be authenticated of the person to be authenticated includes:
acquiring test audio of the personnel to be authenticated;
removing a non-voice part in the test audio based on a voice activity detector to obtain test voice audio;
extracting logarithmic Mel frequency spectrum characteristics of the test voice audio based on a Mel filter bank, and carrying out short-time cepstrum average normalization processing on the logarithmic Mel frequency spectrum characteristics based on a preset sliding window to obtain the voiceprint to be authenticated.
In some possible implementations, the determining the face authentication score of the face image to be authenticated includes:
extracting face feature vectors to be authenticated of the face images to be authenticated based on a face vector extractor with complete training;
acquiring a plurality of registered face feature vectors in a registered face feature library, and determining a plurality of cosine similarities between the face feature vector to be authenticated and the plurality of registered face feature vectors;
and determining the maximum cosine similarity in the cosine similarities, and taking the maximum cosine similarity as the face authentication score.
In some possible implementations, the determining the voiceprint authentication score of the voiceprint to be authenticated includes:
extracting to-be-authenticated voiceprint features of the to-be-authenticated voiceprint based on a time delay neural network;
and acquiring a plurality of registered voiceprint features in a registered voiceprint feature library, and determining the voiceprint authentication score based on a Gaussian probability linear discriminant model.
In another aspect, the invention also provides an identity authentication device fusing a face and a voiceprint, comprising:
the image and voiceprint acquisition unit is used for acquiring a face image to be authenticated and voiceprints to be authenticated of a person to be authenticated;
an authentication score determining unit, configured to determine a face authentication score of the face image to be authenticated, and determine a voiceprint authentication score of the voiceprint to be authenticated;
and a score fusion unit, configured to input the face authentication score and the voiceprint authentication score into a fully trained score fusion deep learning model to obtain a fusion score, and to determine, based on the fusion score, whether the person to be authenticated passes authentication.
In another aspect, the present invention also provides an authentication device, including a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory, and is configured to execute the program stored in the memory, so as to implement the steps in the identity authentication method of fusing the face and the voiceprint in any one of the possible implementation manners.
In another aspect, the present invention further provides a computer readable storage medium, configured to store a computer readable program or instructions, where the program or instructions, when executed by a processor, implement the steps in the identity authentication method for fusing a face and a voiceprint described in any one of the possible implementations.
The beneficial effects of adopting the embodiment are as follows: the identity authentication method fusing a face and a voiceprint determines the face authentication score of the face image to be authenticated and the voiceprint authentication score of the voiceprint to be authenticated independently, so that the voiceprint authentication score need not be determined on the premise of the face authentication score. This removes the dependence between the two biometric features during authentication and improves the accuracy and precision of identity authentication.
Furthermore, the invention does not simply apply linear weighting to the face authentication score and the voiceprint authentication score; instead, it fuses them with the score fusion deep learning model to obtain the fusion score, which fully exploits the strengths of the different biometric features and further improves the precision and accuracy of identity authentication.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly explain the drawings needed in the description of the embodiments, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an embodiment of an identity authentication method for fusing a face with a voiceprint;
FIG. 2 is a schematic flow chart of an embodiment of an initial score fusion deep learning model training process provided by the present invention;
FIG. 3 is a flowchart illustrating an embodiment of acquiring a face image to be authenticated in S101 of FIG. 1 according to the present invention;
FIG. 4 is a flowchart illustrating an embodiment of obtaining a voiceprint to be authenticated in S101 of FIG. 1 according to the present invention;
FIG. 5 is a flowchart illustrating an embodiment of determining the face feature score in S102 of FIG. 1 according to the present invention;
FIG. 6 is a flowchart illustrating an embodiment of determining a voiceprint authentication score in S102 of FIG. 1 according to the present invention;
FIG. 7 is a schematic diagram of an embodiment of an identity authentication device with a face and voiceprint fusion function according to the present invention;
fig. 8 is a schematic structural diagram of an embodiment of an authentication device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present invention. It should be appreciated that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flow diagrams by those skilled in the art under the direction of the present disclosure. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor systems and/or microcontroller systems.
The descriptions of "first," "second," and the like in the embodiments of the present invention are for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implying an order of magnitude of the indicated features. Thus, a technical feature defining "first" and "second" may explicitly or implicitly include at least one such feature.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention provides an identity authentication method, device, equipment and storage medium integrating a human face and voiceprints, which are respectively described below.
Fig. 1 is a schematic flow chart of an embodiment of an identity authentication method for fusing a face and a voiceprint, where, as shown in fig. 1, the identity authentication method for fusing a face and a voiceprint includes:
s101, acquiring a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated;
s102, determining a face authentication score of a face image to be authenticated, and determining a voiceprint authentication score of a voiceprint to be authenticated;
s103, inputting the face authentication score and the voiceprint authentication score into a score fusion deep learning model with complete training, obtaining a fusion score, and determining whether the personnel to be authenticated pass authentication or not based on the fusion score.
Compared with the prior art, the identity authentication method fusing a face and a voiceprint according to this embodiment of the invention determines the face authentication score of the face image to be authenticated and the voiceprint authentication score of the voiceprint to be authenticated independently, so that the voiceprint authentication score need not be determined on the premise of the face authentication score. This removes the dependence between the two biometric features during authentication and improves the accuracy and precision of identity authentication.
Furthermore, this embodiment of the invention does not simply apply linear weighting to the face authentication score and the voiceprint authentication score; instead, it fuses them based on the score fusion deep learning model to obtain the fusion score, which fully exploits the strengths of the different biometric features and further improves the precision and accuracy of identity authentication.
In step S103, whether the person to be authenticated passes authentication is determined based on the fusion score as follows: when the fusion score is greater than or equal to a score threshold, the person to be authenticated passes authentication; when the fusion score is less than the score threshold, the person does not pass authentication.
It should be understood that: the score threshold should be set or adjusted according to the actual application scenario or the empirical value, and is not particularly limited herein.
In some embodiments of the present invention, before step S103, training is performed on the constructed initial score fusion deep learning model, as shown in fig. 2, where the training process of the initial score fusion deep learning model includes:
s201, acquiring a training set, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a face authentication sample and a voiceprint authentication sample corresponding to the face authentication sample;
s202, training the initial score fusion deep learning model based on a training set and a preset loss function, and when the loss value of the loss function is smaller than a loss threshold value, finishing training the initial score fusion deep learning model to obtain the score fusion deep learning model.
The initial score fusion deep learning model is a logistic regression model, and the functional expression of the logistic regression model is as follows:
g(z) = 1 / (1 + e^(−z)), z = θ_1·x_1t + θ_2·x_2t

where g(z) is the output value of the initial score fusion deep learning model; x_1t is the face authentication score of the t-th sample pair; x_2t is the voiceprint authentication score of the t-th sample pair; θ_1 and θ_2 are the weights; and z is the predicted target value for the sample pair.
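A minimal sketch of this logistic-regression fusion, assuming learned weights and an optional bias term; the weight and bias values shown are illustrative stand-ins for the learned parameters:

```python
import math

def fuse_scores(face_score: float, voice_score: float,
                w_face: float = 1.0, w_voice: float = 1.0,
                bias: float = 0.0) -> float:
    """Logistic-regression score fusion: a weighted combination of the face
    and voiceprint authentication scores passed through a sigmoid.

    In the patent the weights are learned during training; the defaults here
    are placeholders for illustration.
    """
    z = w_face * face_score + w_voice * voice_score + bias
    return 1.0 / (1.0 + math.exp(-z))
```
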
In a specific embodiment of the invention, the loss function is:
in the method, in the process of the invention,loss value for the t-th iteration; s is(s) it Face authentication score or voiceprint authentication score for the t-th iteration, face authentication score when i=1, voiceprint authentication score when i=2; n=2; a is a bias term; b i Is a coefficient.
It should be noted that in some embodiments the score fusion deep learning model is integrated with the BOSARIS toolkit, with the following specific steps:
1. Prepare data: prepare the face authentication score and voiceprint authentication score data and store them in a binary format (e.g., HDF5) or a text format supported by the BOSARIS toolkit.
2. Create an index: use the index object of the BOSARIS toolkit to create an index file describing the trial list to be fused (i.e., the data to be fused).
3. Load score data: load the face authentication score and the voiceprint authentication score using the score objects of the BOSARIS toolkit.
4. Fuse scores: fuse the face authentication score and the voiceprint authentication score with the score fusion deep learning model integrated through the BOSARIS toolkit to obtain the fusion score.
To ensure the accuracy of the score fusion deep learning model, in some embodiments of the present invention, after the fully trained score fusion deep learning model is obtained, a test set may also be acquired, and performance indicators of the model, such as the EER (equal error rate), may be determined on the test set.
When the EER is smaller than an error-rate threshold, step S103 is executed; if the EER is greater than or equal to the threshold, the initial score fusion deep learning model is retrained, so as to ensure the effectiveness and accuracy of the fully trained score fusion deep learning model.
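The EER check described above can be sketched as a threshold sweep over the genuine and impostor score lists; this coarse version is a stand-in for a full DET-curve computation:

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the equal error rate (EER): sweep a threshold over all observed
    scores and return the operating point where the false-accept rate (FAR)
    and false-reject rate (FRR) are closest.

    A coarse sketch for validating the fusion model, not a calibrated
    implementation (no interpolation between thresholds).
    """
    best_gap, best_eer = 1.0, 1.0
    for thr in sorted(set(genuine_scores) | set(impostor_scores)):
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```
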
In some embodiments of the present invention, as shown in fig. 3, the step S101 of acquiring a face image to be authenticated of a person to be authenticated includes:
s301, acquiring a test face video of a person to be authenticated;
s302, sampling the tested face video with a preset sampling step length to obtain a plurality of face video frames;
s303, inputting each face video frame into a Retinaface model to obtain a face image to be authenticated.
Step S301, obtaining the test face video of the person to be authenticated, may specifically be: acquiring the test face video in real time with a camera, or retrieving it from a storage medium in which the test face video is stored.
In some embodiments of the present invention, step S302 is specifically: FFmpeg is used to extract one frame image every 10 frames as a face video frame.
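One way to realize this 1-in-10 sampling is FFmpeg's `select` filter; the exact invocation below is an assumption, since the patent only names the tool and the sampling ratio:

```python
def ffmpeg_frame_sampling_cmd(video_path: str, out_pattern: str, step: int = 10):
    """Build an FFmpeg command line that keeps one frame out of every `step`
    frames, matching the sampling described for step S302.

    The select-filter expression is one common way to do this; the patent does
    not specify the actual FFmpeg invocation.
    """
    return [
        "ffmpeg", "-i", video_path,
        # keep frames whose index n is a multiple of `step`
        # (the backslash escapes the comma for FFmpeg's filter parser)
        "-vf", f"select=not(mod(n\\,{step}))",
        "-vsync", "vfr",  # drop timestamps of discarded frames
        out_pattern,
    ]
```
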
In some embodiments of the present invention, step S303 is specifically: face clipping is created using coordinates of the face bounding box detected by RetinaFace. The face cuts are aligned using five facial marker points located in the top regions of the eyes, nose tips and lips. All face cuts are adjusted to a 112x112 pixel size.
In some embodiments of the present invention, as shown in fig. 4, the step S101 of obtaining a voiceprint to be authenticated of a person to be authenticated includes:
s401, acquiring test audio of a person to be authenticated;
s402, removing non-voice parts in the test audio based on a voice activity detector (SAD) to obtain the test voice audio;
s403, extracting logarithmic Mel frequency spectrum characteristics of the test voice audio based on the Mel filter bank, and carrying out short-time cepstrum average normalization (short-time cepstral mean subtraction) processing on the logarithmic Mel frequency spectrum characteristics based on a preset sliding window to obtain the voiceprint to be authenticated.
Step S401, obtaining the test audio of the person to be authenticated, may specifically be: acquiring the test audio in real time with a recording device such as a voice recorder, or retrieving it from a storage medium in which the test audio is stored.
In step S403, the Mel filter bank has 64 channels covering the frequency range 80 Hz to 3800 Hz, and 64-dimensional log-Mel spectrum features are extracted with a frame shift of 10 ms and a frame length of 25 ms. The length of the preset sliding window is 3 s.
Applying short-time cepstral mean normalization to the log-Mel spectrum features removes the bias introduced by channel noise in the cepstral domain and the convolutional effects introduced in the time domain, such as channel distortion. This improves the accuracy of the resulting voiceprint to be authenticated and, in turn, the accuracy of identity verification.
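The sliding-window normalization of step S403 can be sketched as follows, assuming (num_frames × num_bins) log-Mel features and noting that the 3 s window corresponds to roughly 300 frames at a 10 ms frame shift:

```python
import numpy as np

def sliding_cmn(features: np.ndarray, window_frames: int = 300) -> np.ndarray:
    """Short-time cepstral mean normalization: subtract from each frame the
    mean of the frames inside a window centred on it.

    `features` is a (num_frames, num_bins) array of log-Mel vectors; the
    300-frame default approximates the patent's 3 s window at a 10 ms shift.
    """
    half = window_frames // 2
    out = np.empty_like(features, dtype=float)
    for t in range(features.shape[0]):
        lo, hi = max(0, t - half), min(features.shape[0], t + half + 1)
        out[t] = features[t] - features[lo:hi].mean(axis=0)
    return out
```
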
In some embodiments of the present invention, as shown in fig. 5, determining a face authentication score of a face image to be authenticated in step S102 includes:
s501, extracting a face feature vector to be authenticated of a face image to be authenticated based on a training complete face vector extractor;
s502, acquiring a plurality of registered face feature vectors in a registered face feature library, and determining a plurality of cosine similarities between the face feature vector to be authenticated and the registered face feature vectors;
s503, determining the maximum cosine similarity in the cosine similarities, and taking the maximum cosine similarity as a face authentication score.
The network structure of the face vector extractor is a ResNet101-IR-SE-AN network: it builds on ResNet101 and combines improved residual (IR) units, Squeeze-and-Excitation (SE), and attentive normalization (AN), which improves the accuracy of feature extraction and hence the accuracy and precision of identity authentication.
Specifically, the ResNet101-IR-SE-AN network compresses features along the spatial dimension through global pooling to obtain a global receptive field; models the correlation among feature channels through an activation function; and multiplies the resulting channel weights back onto the features channel by channel to recalibrate the original features along the channel dimension. This selectively amplifies valuable feature channels using global information, suppresses useless feature channels, and strengthens the feature extraction capability.
In some embodiments of the present invention, as shown in fig. 6, determining a voiceprint authentication score of the voiceprint to be authenticated in step S102 includes:
S601, extracting the to-be-authenticated voiceprint features of the voiceprint to be authenticated based on a time-delay neural network (Time-Delay Neural Network, TDNN);
S602, acquiring a plurality of registered voiceprint features in a registered voiceprint feature library, and determining the voiceprint authentication score based on a Gaussian probabilistic linear discriminant analysis (Probabilistic Linear Discriminant Analysis, PLDA) model.
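A PLDA verification score is a log-likelihood ratio between the "same speaker" and "different speakers" hypotheses. The sketch below uses the simplified two-covariance Gaussian form with known between- and within-speaker covariances; the patent's trained PLDA parameters are not specified, so this is only an illustration of the scoring rule.

```python
import numpy as np

def _gauss_logpdf(x: np.ndarray, cov: np.ndarray) -> float:
    """Log density of a zero-mean Gaussian N(0, cov) at x."""
    d = x.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def plda_score(x1, x2, sigma_b, sigma_w) -> float:
    """Two-covariance PLDA log-likelihood ratio:
    log p(x1, x2 | same speaker) - log p(x1, x2 | different speakers).
    sigma_b / sigma_w: between- and within-speaker covariances (in practice
    estimated from training data; here assumed given)."""
    st = sigma_b + sigma_w
    x = np.concatenate([x1, x2])
    same = np.block([[st, sigma_b], [sigma_b, st]])        # embeddings correlated
    diff = np.block([[st, np.zeros_like(st)],
                     [np.zeros_like(st), st]])             # embeddings independent
    return _gauss_logpdf(x, same) - _gauss_logpdf(x, diff)
```

A positive score favors the same-speaker hypothesis; identical embeddings score higher than opposed ones.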
Specifically, the TDNN includes 7 hidden layers, a statistics pooling layer between the 5th and 6th hidden layers, a parametric rectified linear unit (PReLU) activation function layer after the 7th hidden layer, and an output layer.
The statistics pooling layer computes the mean and standard deviation over all frames output by the 5th hidden layer for the input segment. It can extract speaker-specific information without an additional voice separation network, which reduces system redundancy and improves the efficiency of voiceprint feature extraction.
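The mean/standard-deviation aggregation performed by the statistics pooling layer can be sketched as below; frame features are assumed to be a (T, d) array, turning a variable-length sequence into a fixed-size segment vector.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Statistics pooling: concatenate the per-dimension mean and standard
    deviation over all T frames (frames: (T, d)) into a (2d,) vector."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])
```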
The loss function of the output layer is a cosine loss function with an additive margin.
According to the embodiment of the invention, setting the output layer's loss function to an additive-margin cosine loss reduces intra-class variation and enlarges inter-class separation, further improving the accuracy of authenticating different persons.
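An additive-margin cosine loss (AM-Softmax / CosFace style) for a single sample can be sketched as follows. The margin `m` and scale `s` values are illustrative assumptions; the patent does not state them.

```python
import numpy as np

def am_softmax_loss(cosines: np.ndarray, target: int,
                    m: float = 0.35, s: float = 30.0) -> float:
    """Additive-margin cosine loss for one sample.

    cosines: (num_classes,) cosine similarities between the embedding and
             each class weight vector; target is the true class index.
    """
    logits = s * cosines                          # scaled cosine logits
    logits[target] = s * (cosines[target] - m)    # subtract the additive margin
    logits = logits - logits.max()                # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])
```

Subtracting the margin from the target logit forces the target cosine to exceed the competitors by at least `m`, which is what shrinks intra-class and widens inter-class distances.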
In order to better implement the identity authentication method of fusing a face and a voiceprint, the embodiment of the present invention correspondingly further provides an identity authentication device of fusing a face and a voiceprint. As shown in fig. 7, the identity authentication device 700 of fusing a face and a voiceprint includes:
an image and voiceprint acquiring unit 701, configured to acquire a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated;
an authentication score determining unit 702, configured to determine a face authentication score of a face image to be authenticated, and determine a voiceprint authentication score of a voiceprint to be authenticated;
a score fusion unit 703, configured to input the face authentication score and the voiceprint authentication score into a well-trained score fusion deep learning model to obtain a fusion score, and to determine, based on the fusion score, whether the person to be authenticated passes authentication.
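The bias term a and coefficients b_i named in claim 3 suggest a logistic-style fusion of the two scores. The sketch below is a minimal illustration under that assumption; the parameter values and threshold are invented for the example, not the trained ones.

```python
import numpy as np

def fuse_scores(face_score: float, voiceprint_score: float,
                a: float = -2.0, b: tuple = (2.5, 2.5)) -> float:
    """Minimal logistic score fusion with bias a and coefficients b_i,
    mirroring the parameters named in the patent's loss function.
    The values of a and b here are illustrative, not trained weights."""
    z = a + b[0] * face_score + b[1] * voiceprint_score
    return 1.0 / (1.0 + np.exp(-z))   # fused score in (0, 1)

def authenticate(face_score: float, voiceprint_score: float,
                 threshold: float = 0.5) -> bool:
    """Pass authentication when the fused score exceeds a threshold."""
    return fuse_scores(face_score, voiceprint_score) >= threshold
```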
The identity authentication device 700 for fusing a face and a voiceprint provided in the foregoing embodiment can implement the technical solution described in the foregoing method embodiment. For the specific implementation principle of each module or unit, refer to the corresponding content of that method embodiment, which is not repeated here.
As shown in fig. 8, the present invention also provides an identity authentication device 800 accordingly. The authentication device 800 includes a processor 801, a memory 802, and a display 803. Fig. 8 shows only some of the components of the identity authentication device 800, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
The processor 801 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 802 or to process data, for example the program implementing the identity authentication method of fusing a face and a voiceprint of the present invention.
In some embodiments, the processor 801 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, the processor 801 may be local or remote. In some embodiments, the processor 801 may be implemented in a cloud platform. In an embodiment, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
The memory 802 may, in some embodiments, be an internal storage unit of the authentication device 800, such as a hard disk or memory of the authentication device 800. In other embodiments, the memory 802 may also be an external storage device of the authentication device 800, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like, equipped on the authentication device 800.
Further, the memory 802 may also include both internal storage units and external storage devices of the authentication device 800. The memory 802 is used to store application software and various types of data for installing the authentication device 800.
The display 803 may, in some embodiments, be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 803 is used to display information on the authentication device 800 and to present a visual user interface. The components 801-803 of the authentication device 800 communicate with each other over a system bus.
In one embodiment, when the processor 801 executes the identity authentication procedure of fusing the face and the voiceprint in the memory 802, the following steps may be implemented:
acquiring a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated;
determining a face authentication score of a face image to be authenticated, and determining a voiceprint authentication score of a voiceprint to be authenticated;
and inputting the face authentication score and the voiceprint authentication score into a well-trained score fusion deep learning model to obtain a fusion score, and determining, based on the fusion score, whether the person to be authenticated passes authentication.
It should be understood that: the processor 801 may perform other functions in addition to the above functions when executing the identity authentication procedure of fusing a face and a voiceprint in the memory 802, and particularly, reference may be made to the foregoing description of the corresponding method embodiments.
Further, the embodiment of the present invention does not particularly limit the type of the authentication device 800, which may be a portable identity authentication device such as a mobile phone, a tablet computer, a personal digital assistant (personal digital assistant, PDA), a wearable device, or a laptop computer (laptop). Exemplary portable identity authentication devices include, but are not limited to, those running iOS, Android, Microsoft, or other operating systems. The portable identity authentication device may also be another device having a touch-sensitive surface (e.g., a touch panel), such as a laptop computer. It should also be understood that in other embodiments of the invention, the authentication device 800 may not be portable at all, but rather a desktop computer having a touch-sensitive surface (e.g., a touch panel).
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium for storing a computer-readable program or instructions which, when executed by a processor, implement the steps or functions of the identity authentication method for fusing a face and a voiceprint provided by the above method embodiments.
Those skilled in the art will appreciate that all or part of the flow of the methods in the embodiments described above may be accomplished by a computer program that instructs related hardware (e.g., a processor or controller) and is stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory, or a random access memory.
The identity authentication method, device, equipment, and storage medium for fusing a face and a voiceprint provided by the present invention have been described in detail above, with specific examples used to illustrate the principle and implementation of the invention; the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in light of the ideas of the present invention, and therefore the content of this description should not be construed as limiting the present invention.
Claims (10)
1. The identity authentication method integrating the face and the voiceprint is characterized by comprising the following steps of:
acquiring a face image to be authenticated and a voiceprint to be authenticated of a person to be authenticated;
determining a face authentication score of the face image to be authenticated, and determining a voiceprint authentication score of the voiceprint to be authenticated;
and inputting the face authentication score and the voiceprint authentication score into a score fusion deep learning model with complete training, obtaining a fusion score, and determining whether the personnel to be authenticated pass authentication or not based on the fusion score.
2. The identity authentication method of fusing a face and a voiceprint according to claim 1, further comprising, before said inputting the face authentication score and the voiceprint authentication score into a well-trained score fusion deep learning model to obtain a fusion score:
acquiring a training set, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a face authentication sample and a voiceprint authentication sample corresponding to the face authentication sample;
and training the initial score fusion deep learning model based on the training set and a preset loss function, and obtaining the score fusion deep learning model after the initial score fusion deep learning model is trained when the loss value of the loss function is smaller than a loss threshold value.
3. The identity authentication method of fusing a face and a voiceprint according to claim 2, wherein the loss function is:
in the method, in the process of the invention,loss value for the t-th iteration; s is(s) it Face authentication score or voiceprint authentication score for the t-th iteration, face authentication score when i=1, voiceprint authentication score when i=2; n=2; a is a bias term; b i Is a coefficient.
4. The identity authentication method of fusing a face and a voiceprint according to claim 1, wherein the acquiring a face image to be authenticated of a person to be authenticated includes:
acquiring a test face video of a person to be authenticated;
sampling the test face video with a preset sampling step length to obtain a plurality of face video frames;
and inputting each face video frame into a RetinaFace model to obtain the face image to be authenticated.
5. The identity authentication method for fusing a face and a voiceprint according to claim 1, wherein the obtaining the voiceprint to be authenticated of the person to be authenticated comprises:
acquiring test audio of the personnel to be authenticated;
removing a non-voice part in the test audio based on a voice activity detector to obtain test voice audio;
extracting logarithmic Mel frequency spectrum characteristics of the test voice audio based on a Mel filter bank, and carrying out short-time cepstrum average normalization processing on the logarithmic Mel frequency spectrum characteristics based on a preset sliding window to obtain the voiceprint to be authenticated.
6. The identity authentication method of fusing a face and voiceprints according to claim 1, wherein the determining a face authentication score of the face image to be authenticated comprises:
extracting face feature vectors to be authenticated of the face images to be authenticated based on a face vector extractor with complete training;
acquiring a plurality of registered face feature vectors in a registered face feature library, and determining a plurality of cosine similarities between the face feature vector to be authenticated and the plurality of registered face feature vectors;
and determining the maximum cosine similarity in the cosine similarities, and taking the maximum cosine similarity as the face authentication score.
7. The method for authenticating an identity of a fused face and voiceprint of claim 1, wherein determining a voiceprint authentication score for the voiceprint to be authenticated comprises:
extracting to-be-authenticated voiceprint features of the to-be-authenticated voiceprint based on a time delay neural network;
and acquiring a plurality of registered voiceprint features in a registered voiceprint feature library, and determining the voiceprint authentication score based on a Gaussian probability linear discriminant model.
8. An identity authentication device integrating a human face and voiceprints, comprising:
the image and voiceprint acquisition unit is used for acquiring a face image to be authenticated and voiceprints to be authenticated of a person to be authenticated;
an authentication score determining unit, configured to determine a face authentication score of the face image to be authenticated, and determine a voiceprint authentication score of the voiceprint to be authenticated;
and the scoring fusion unit is used for inputting the face authentication score and the voiceprint authentication score into a score fusion deep learning model with complete training, obtaining a fusion score, and determining whether the personnel to be authenticated pass authentication or not based on the fusion score.
9. An identity authentication device comprising a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory and is configured to execute the program stored in the memory, so as to implement the steps in the identity authentication method of fusing a face and a voiceprint according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer readable program or instructions which when executed by a processor is capable of carrying out the steps of the method of identity authentication of a fused face and voiceprint as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311814587.5A CN117852007A (en) | 2023-12-26 | 2023-12-26 | Identity authentication method, device, equipment and storage medium integrating human face and voiceprint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117852007A true CN117852007A (en) | 2024-04-09 |
Family
ID=90547413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311814587.5A Pending CN117852007A (en) | 2023-12-26 | 2023-12-26 | Identity authentication method, device, equipment and storage medium integrating human face and voiceprint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117852007A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107492379B (en) | Voiceprint creating and registering method and device | |
US10380332B2 (en) | Voiceprint login method and apparatus based on artificial intelligence | |
CN108564954B (en) | Deep neural network model, electronic device, identity verification method, and storage medium | |
US9792913B2 (en) | Voiceprint authentication method and apparatus | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN107610709B (en) | Method and system for training voiceprint recognition model | |
CN110956966B (en) | Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment | |
WO2018166187A1 (en) | Server, identity verification method and system, and a computer-readable storage medium | |
WO2019120115A1 (en) | Facial recognition method, apparatus, and computer apparatus | |
US7873189B2 (en) | Face recognition by dividing an image and evaluating a similarity vector with a support vector machine | |
CN112071322B (en) | End-to-end voiceprint recognition method, device, storage medium and equipment | |
CN107180628A (en) | Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model | |
CN107545241A (en) | Neural network model is trained and biopsy method, device and storage medium | |
CN108429619A (en) | Identity identifying method and system | |
CN109543516A (en) | Signing intention judgment method, device, computer equipment and storage medium | |
CN107221320A (en) | Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model | |
CN109034069B (en) | Method and apparatus for generating information | |
CN109947971B (en) | Image retrieval method, image retrieval device, electronic equipment and storage medium | |
CN113177850A (en) | Method and device for multi-party identity authentication of insurance | |
CN111613230A (en) | Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium | |
CN113436633B (en) | Speaker recognition method, speaker recognition device, computer equipment and storage medium | |
Wang et al. | Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition | |
CN103984415B (en) | A kind of information processing method and electronic equipment | |
CN117852007A (en) | Identity authentication method, device, equipment and storage medium integrating human face and voiceprint | |
CN113035230B (en) | Authentication model training method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||