CN111862990B - Speaker identity verification method and system - Google Patents

Speaker identity verification method and system

Info

Publication number
CN111862990B
CN111862990B (application CN202010705582.9A)
Authority
CN
China
Prior art keywords
embedding
feature embedding
feature
facial
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010705582.9A
Other languages
Chinese (zh)
Other versions
CN111862990A (en)
Inventor
钱彦旻 (Qian Yanmin)
陈正阳 (Chen Zhengyang)
王帅 (Wang Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010705582.9A priority Critical patent/CN111862990B/en
Publication of CN111862990A publication Critical patent/CN111862990A/en
Application granted granted Critical
Publication of CN111862990B publication Critical patent/CN111862990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification


Abstract

The invention discloses a speaker identity verification method, which comprises the following steps: acquiring audio data and facial image data of the speaker; extracting a voice feature embedding from the audio data and a facial feature embedding from the facial image data; and determining an identity feature embedding from the voice feature embedding and the facial feature embedding for speaker identity verification. The invention provides a scheme for verifying a person's identity using multi-modal information (from the face and the voice), which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.

Description

Speaker identity verification method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker identity verification method and a speaker identity verification system.
Background
The speaker identity verification methods in the prior art include voiceprint-based verification and face-recognition-based verification. These techniques use a single physiological characteristic of a person to verify that person's identity. However, a single physiological characteristic may, under certain conditions, be unable to distinguish a particular person. For example, in a noisy environment the voice of a particular person may not be heard clearly, and facial features may fail to distinguish a person when the head is turned away or the person is moving.
Disclosure of Invention
An embodiment of the present invention provides a method and a system for speaker identity verification, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker identity verification method, including:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
and determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
In a second aspect, an embodiment of the present invention provides a speaker authentication system, including:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
and the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification.
In a third aspect, an embodiment of the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any speaker authentication method of the present invention.
In a fourth aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speaker verification methods of the present invention described above.
In a fifth aspect, embodiments of the present invention further provide a computer program product, the computer program product including a computer program stored on a storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform any one of the above speaker authentication methods.
The embodiments of the invention have the following beneficial effects: a scheme for verifying a person's identity using multi-modal information (from the face and the voice) is provided, which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a speaker verification method of the present invention;
FIG. 2 is a flow chart of another embodiment of the speaker verification method of the present invention;
FIG. 3 is a functional block diagram of an embodiment of a speaker verification system of the present invention;
FIG. 4 is a functional block diagram of another embodiment of the speaker verification system of the present invention;
FIG. 5a is a schematic illustration of simple soft attention fusion as used in the present invention;
FIG. 5b is a schematic diagram of a compact bilinear pooling fusion used in the present invention;
FIG. 5c is a schematic diagram of the gated multi-modal fusion used in the present invention;
FIG. 6a is a plot of the distance distributions between positive and negative pairs under the original contrastive loss in the present invention;
FIG. 6b is a plot of the distance distributions between positive and negative pairs under the new contrastive loss in the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this application, the terms "module," "apparatus," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises that element.
The invention provides a speaker identity verification method, which can be applied to a terminal device that has both face recognition and voiceprint recognition capabilities. For example, the terminal device may be a smart phone, a tablet computer, a smart speaker, a vehicle-mounted terminal, a smart robot, and the like, which is not limited by the invention.
As shown in fig. 1, an embodiment of the present invention provides a speaker authentication method, including:
and S10, acquiring the audio data and the facial image data of the speaker.
Illustratively, the terminal device executing the method is a smartphone, and the audio data of the speaker can be acquired through a microphone of the smartphone, and meanwhile, the face image of the speaker can be acquired through a camera of the smartphone.
S20: extracting a voice feature embedding from the audio data, and extracting a facial feature embedding from the facial image data.
Illustratively, the voice feature embedding may be a voiceprint feature. A ResNet34 and an SE-ResNet50 may be used to extract the voiceprint feature embedding and the facial feature embedding, respectively.
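For illustration only, the following PyTorch sketch shows the shape contract of this step with simple placeholder encoders; the real systems use a ResNet34 on Fbank features and an SE-ResNet50 on aligned face crops, so all layer sizes and input shapes below are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class PlaceholderEncoder(nn.Module):
    """Stand-in for the ResNet34 speech encoder / SE-ResNet50 face encoder:
    maps a flattened input to a 512-dimensional identity embedding."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

# Illustrative input sizes: 40-dim Fbank over 300 frames, 3x112x96 face crops.
speech_encoder = PlaceholderEncoder(in_dim=40 * 300)
face_encoder = PlaceholderEncoder(in_dim=3 * 112 * 96)

audio = torch.randn(8, 40 * 300)
face = torch.randn(8, 3 * 112 * 96)
e_v = speech_encoder(audio)  # speech (voiceprint) feature embedding e_v, shape (8, 512)
e_f = face_encoder(face)     # facial feature embedding e_f, shape (8, 512)
```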
S30: determining an identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
The embodiment of the invention provides a scheme for verifying a person's identity using multi-modal information (from the face and the voice), which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.
Fig. 2 is a flow chart of another embodiment of the speaker verification method of the present invention, in which the determining identity feature embedding according to the voice feature embedding and the facial feature embedding includes:
and S31, inputting the voice feature embedding into a first embedding feature conversion layer to obtain the preprocessing voice feature embedding.
And S32, inputting the facial feature embedding into a second embedding feature conversion layer to obtain the preprocessed facial feature embedding.
Illustratively, the conversion layers f_trans_f and f_trans_v convert e_f and e_v into ê_f and ê_v, respectively:

ê_f = f_trans_f(e_f)
ê_v = f_trans_v(e_v)

After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
S33: fusing the preprocessed voice feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
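A minimal PyTorch sketch of this simple soft attention fusion is given below; the 512-dimensional embeddings and the single fully connected conversion and attention layers are assumptions, not the exact configuration of the patented system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSoftAttentionFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.trans_v = nn.Linear(dim, dim)   # first embedded feature conversion layer  f_trans_v
        self.trans_f = nn.Linear(dim, dim)   # second embedded feature conversion layer f_trans_f
        self.att = nn.Linear(2 * dim, 2)     # attention layer f_att, one score per modality

    def forward(self, e_f, e_v):
        ef_hat = self.trans_f(e_f)                        # preprocessed facial embedding
        ev_hat = self.trans_v(e_v)                        # preprocessed speech embedding
        scores = self.att(torch.cat([e_f, e_v], dim=-1))  # attention scores
        w = F.softmax(scores, dim=-1)                     # weighting coefficients [w_f, w_v]
        e_p = w[:, 0:1] * ef_hat + w[:, 1:2] * ev_hat     # fused identity embedding
        return e_p

fusion = SimpleSoftAttentionFusion()
e_p = fusion(torch.randn(8, 512), torch.randn(8, 512))   # (8, 512)
```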
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
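A comparable sketch of the gated fusion follows; the choice of gating the facial branch with z and the speech branch with (1 - z), as well as the layer sizes, are assumptions consistent with the formulas above.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.trans_v = nn.Linear(dim, dim)   # first embedded feature conversion layer  f_trans_v
        self.trans_f = nn.Linear(dim, dim)   # second embedded feature conversion layer f_trans_f
        self.att = nn.Linear(2 * dim, dim)   # produces the gate vector logits

    def forward(self, e_f, e_v):
        ev_hat = self.trans_v(e_v)
        ef_hat = self.trans_f(e_f)
        z = torch.sigmoid(self.att(torch.cat([e_f, e_v], dim=-1)))  # gate vector z
        e_p = z * ef_hat + (1.0 - z) * ev_hat                       # element-wise gating
        return e_p

fusion = GatedMultimodalFusion()
e_p = fusion(torch.randn(8, 512), torch.randn(8, 512))  # (8, 512)
```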
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 3, an embodiment of the present invention further provides a speaker identity verification system 300, which may be used in a terminal device, where the terminal device has both functions of face recognition and voiceprint recognition, for example, the terminal device may be a smart phone, a tablet computer, a smart speaker, a car terminal, a smart robot, and the like, which is not limited in this respect.
As shown in fig. 3, the speaker authentication system 300 includes:
an audio-visual data acquisition module 310 for acquiring audio data and facial image data of the speaker;
a feature extraction module 320 for extracting speech feature embeddings from the audio data and facial feature embeddings from the facial image data;
an identity feature embedding determination module 330, configured to determine identity feature embedding according to the voice feature embedding and the facial feature embedding, for speaker authentication.
Fig. 4 is a schematic block diagram of another embodiment of the speaker verification system of the present invention, in which the identity embedding determination module includes:
a first embedded feature conversion layer 331, configured to perform preprocessing on the voice feature embedding to obtain a preprocessed voice feature embedding;
a second embedded feature conversion layer 332, configured to perform preprocessing on the facial feature embedding to obtain a preprocessed facial feature embedding;
and a fusion module 333, configured to perform fusion processing on the preprocessed voice feature embedding and the preprocessed facial feature embedding to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above speaker identity verification methods of the present invention.
In some embodiments, the present invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speaker authentication methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speaker verification method.
In some embodiments, the present invention further provides a storage medium, on which a computer program is stored, where the program is executed by a processor to implement a speaker identification verification method.
The speaker identity verification system according to the embodiment of the present invention may be used to execute the speaker identity verification method according to the embodiment of the present invention, and accordingly achieves the technical effects achieved by the method embodiments, which are not repeated here. In the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.
To describe the technical solutions of the present invention more clearly, and to demonstrate more directly the feasibility and the benefits of the present invention over the prior art, the technical background, the technical solutions, and the experiments performed are described in more detail below.
Abstract
Information from different modalities usually complements each other. In the present invention, we use the audio-visual data in the VoxCeleb datasets for person verification. We explore different information fusion strategies and loss functions for audio-visual person verification systems at the embedding level. System performance is evaluated using the public test lists of the VoxCeleb1 dataset. Our best system fusing audio-visual knowledge at the embedding level achieves 0.585%, 0.427%, and 0.735% EER on the three public test lists of VoxCeleb1, which are the best reported results on this dataset. Furthermore, to mimic more complex test conditions in which a modality is corrupted or missing, we construct a noisy evaluation set based on the VoxCeleb1 dataset. We use a data augmentation strategy at the embedding level to help our audio-visual system distinguish noisy embeddings from clean ones. With this augmentation strategy, the proposed audio-visual person verification system is more robust on the noisy evaluation set.
1. Introduction
A variety of biometric traits may be used to verify a person's identity, and the voice and the face are two typical ones. Face verification and speaker verification are intensively studied topics in the field of biometric identification, and deep learning techniques have recently greatly improved the performance of both tasks. Over the past few years, researchers have studied different architectures and different loss functions, resulting in well-performing systems that can even be commercialized for real-world applications.
Despite the success in single-modality applications, multi-modality learning has attracted increasing attention in both academic and industrial areas. The motivation has two aspects:
1. complementary information from different modalities may improve system performance.
2. Models built from multiple modalities tend to be more robust and fault-tolerant, and can compensate for or suppress failures of a single modality.
In the present invention, cross-modal fusion is performed at the embedding level, using more powerful segment-level trained speaker embeddings. Different fusion strategies and loss functions are studied and compared in a multi-modal learning framework.
Furthermore, to mimic real-world scenarios, we construct a noisy evaluation set in which one modality is corrupted or missing. To compensate for the performance degradation, a new embedding-level data augmentation method, Noise Distribution Matching (NDM), is proposed, which greatly improves performance under noisy conditions.
All systems are evaluated on the standard VoxCeleb1 dataset, and our best multi-modal system achieves EERs of 0.585%, 0.427%, and 0.735% on the three test lists (VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H), respectively. To our knowledge, these are the best reported results on this dataset. Furthermore, the NDM-based multi-modal system shows the ability to select the more salient modality information when evaluated on the noisy evaluation set.
2. Methods
2.1 Embedding-level multi-modal fusion
In this section, we describe three methods for fusing the facial feature embedding e_f and the speech feature embedding e_v into the identity feature embedding e_p. As shown in Figs. 5a to 5c, the conversion layers f_trans_f and f_trans_v first convert e_f and e_v into ê_f and ê_v, respectively:

ê_f = f_trans_f(e_f),  ê_v = f_trans_v(e_v)

After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
2.1.1 simple Soft attention fusion
In this section, we first introduce Simple Soft Attention (SSA) fusion. As shown in Fig. 5a, given the face and speech feature embeddings e_f and e_v, the attention weights produced by the attention layer f_att(·) are defined as:

[w_f, w_v] = softmax(f_att([e_f, e_v]))

The fused embedding is then computed as a weighted sum:

e_p = w_f · ê_f + w_v · ê_v
2.1.2 compact bilinear pooling fusion
Bilinear pooling exploits the outer product operation to fully explore the relationship between two vectors and involves no trainable parameters. However, it is generally not feasible in practice due to the high dimensionality of the outer product. Multi-modal compact bilinear pooling (MCB) has been proposed in the prior art to approximate the outer-product result while reducing its dimensionality; notably, MCB also has no trainable parameters. As shown in Fig. 5b, we use compact bilinear pooling to directly fuse ê_f and ê_v into e_p. Implementation details of compact bilinear pooling can be found in the prior art, and the present invention is not limited in this regard.
2.1.3 Gated multi-modal fusion
In this section, we use gates to control the information flow from the face and speech modalities, which we call gated multi-modal fusion (GATE). As shown in Fig. 5c, given the face and speech feature embeddings e_f and e_v, a gate vector z ∈ R^D is computed as:

z = σ(f_att([e_f, e_v]))

Then we use the gate vector z to fuse ê_f and ê_v into e_p, where ⊙ denotes the element-wise product:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v
2.2 loss function
In this section, we will introduce a loss function for optimizing the proposed multimodal fusion system.
2.2.1 Contrastive loss with an aggressive sampling strategy
The original contrastive loss is defined as:

L_con = (1/(2N)) Σ_{y=1} D^2 + (1/(2M)) Σ_{y=0} max(m - D, 0)^2

where D is the distance between the two embeddings of a pair, N and M are the numbers of positive and negative pairs in a batch, y = 1 and y = 0 denote positive and negative pairs respectively, and m is the margin. In our experiments, we use cosine similarity to measure the distance between the embeddings of a pair.
Adjusting the margin m in the original contrastive loss makes the loss focus more on hard negative pairs, but hard positive pairs are not considered. Here we introduce a more aggressive sampling strategy: during training, after the forward pass of the neural network, we compute the loss using only the γN hardest positive pairs and the γM hardest negative pairs (γ ∈ (0, 1)). The contrastive loss with the new sampling strategy can be defined as:

L_con-new = (1/(2γN)) Σ_{y=1, D ≥ D_p_low} D^2 + (1/(2γM)) Σ_{y=0, D ≤ D_n_high} max(m - D, 0)^2

where D_p_low denotes the minimum distance among the selected "hardest" positive pairs and D_n_high denotes the maximum distance among the selected "hardest" negative pairs.
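A sketch of the contrastive loss with this aggressive sampling strategy is given below, using cosine distance as described above; the exact normalization over the selected subsets and the value of γ are assumptions consistent with the text.

```python
import torch

def contrastive_loss_hard(emb1, emb2, labels, margin=0.05, gamma=0.5):
    """emb1, emb2: (B, D) embedding pairs; labels: (B,) with 1 for positive
    (same person) and 0 for negative pairs. Only the hardest gamma fraction
    of positive and negative pairs contributes to the loss. Assumes the
    batch contains at least one positive and one negative pair."""
    d = 1.0 - torch.nn.functional.cosine_similarity(emb1, emb2, dim=-1)  # cosine distance

    pos_d = d[labels == 1]
    neg_d = d[labels == 0]

    # Hardest positives: largest distances. Hardest negatives: smallest distances.
    k_pos = max(1, int(gamma * pos_d.numel()))
    k_neg = max(1, int(gamma * neg_d.numel()))
    hard_pos = torch.topk(pos_d, k_pos, largest=True).values
    hard_neg = torch.topk(neg_d, k_neg, largest=False).values

    pos_term = (hard_pos ** 2).sum() / (2 * k_pos)
    neg_term = (torch.clamp(margin - hard_neg, min=0) ** 2).sum() / (2 * k_neg)
    return pos_term + neg_term

loss = contrastive_loss_hard(torch.randn(32, 512), torch.randn(32, 512),
                             torch.randint(0, 2, (32,)))
```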
2.2.2 Additive angular margin loss
In addition, we also tried the popular additive angular margin (AAM) softmax loss in our experiments. For a sample with person identity label y_s, the loss is defined as:

L_AAM = -log( exp(s·cos(θ_{y_s} + m)) / ( exp(s·cos(θ_{y_s} + m)) + Σ_{j≠y_s} exp(s·cos θ_j) ) )

where m is the additive margin and s is the scale parameter, which helps the model converge faster. In our experiments, s is set to 32 and m to 0.6 in the fusion systems.
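A sketch of the additive angular margin softmax loss with the reported values s = 32 and m = 0.6 follows; the weight normalization details follow the standard AAM-Softmax formulation and are not claimed to match the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim=512, n_classes=5994, s=32.0, m=0.6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and normalized class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin only to the target-class angle, then scale.
        target = F.one_hot(labels, cos.shape[1]).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

criterion = AAMSoftmaxLoss()
loss = criterion(torch.randn(8, 512), torch.randint(0, 5994, (8,)))
```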
2.3 Embedding-level augmentation for noisy evaluation
2.3.1 noisy evaluation set construction
Information from different modalities is not always available or significant enough to perform the verification task. In practical applications, a modality is often damaged or missing due to certain unavoidable external factors (e.g., ambient light, human movement, or background noise). To address this situation, we constructed a noisy evaluation set based on the VoxCeleb1 evaluation set.
For the image data, we use vertical and horizontal motion blur to mimic human motion in front of the lens, and Gaussian blur to mimic other noise. For the audio data, the three types of noise in MUSAN are mixed with the original data to generate corrupted audio samples. We also consider the complete absence of one modality by directly setting the corresponding extracted embedding to a zero vector. The detailed flow of constructing this dataset is given in Algorithm 1.
Algorithm 1: Noisy evaluation set construction
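The following snippet is not Algorithm 1 itself, but a rough sketch of the image-corruption operations described above (motion blur and Gaussian blur), using OpenCV; the kernel sizes and the input file name are illustrative assumptions.

```python
import cv2
import numpy as np

def motion_blur(img, ksize=9, horizontal=True):
    """Approximate subject/camera motion with a 1-D averaging kernel."""
    kernel = np.zeros((ksize, ksize), dtype=np.float32)
    if horizontal:
        kernel[ksize // 2, :] = 1.0 / ksize
    else:
        kernel[:, ksize // 2] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)

def gaussian_blur(img, ksize=9):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

img = cv2.imread("face.jpg")                       # an aligned face crop (assumed path)
corrupted = [motion_blur(img, horizontal=True),    # horizontal motion blur
             motion_blur(img, horizontal=False),   # vertical motion blur
             gaussian_blur(img)]                   # Gaussian blur
```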
2.3.2 Embedding-level augmentation
To build a system that is more robust to corrupted audio-visual data, an additional embedding-level augmentation strategy is proposed in this work. In our previous work, we used deep generative models such as the generative adversarial network (GAN) or the variational autoencoder (VAE) to model the distribution of noisy speaker embeddings. Here, a simple statistics-based distribution matching algorithm is used.
We randomly select 100,000 recordings from the training set (which contains 1,092,009 recordings in total) and generate the different types of corrupted data from them. Then, for each noise type, we assume that the difference between the noisy embedding and the original embedding can be described by a Gaussian distribution. After estimating the parameters of this noise distribution, we sample noise from it and add it directly to the original embeddings to generate noisy embeddings. We refer to this embedding-level augmentation method as Noise Distribution Matching (NDM). Compared with adding noise directly to the whole training set and re-extracting the augmented embeddings, NDM uses only a small portion of the training data and augments the embeddings directly, which saves time and disk space. Furthermore, we still use the zero vector to model the missing-modality case.
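A minimal sketch of the NDM idea under the simplifying assumption that the per-dimension noise is an independent Gaussian; the actual estimation and sampling details of the patented method may differ.

```python
import torch

def fit_ndm(clean_emb, noisy_emb):
    """Estimate a Gaussian over the difference (noisy - clean) for one noise type.
    clean_emb, noisy_emb: (N, D) embeddings of the same recordings."""
    diff = noisy_emb - clean_emb
    return diff.mean(dim=0), diff.std(dim=0)

def ndm_augment(clean_emb, mean, std):
    """Sample noise from the fitted distribution and add it to clean embeddings."""
    noise = torch.randn_like(clean_emb) * std + mean
    return clean_emb + noise

# Fit on a small subset (e.g. 100k recordings), then augment the full training set.
mean, std = fit_ndm(torch.randn(1000, 512), torch.randn(1000, 512))
augmented = ndm_augment(torch.randn(64, 512), mean, std)

# Missing-modality case: the embedding is simply replaced by a zero vector.
missing = torch.zeros(64, 512)
```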
3. Experimental setup
3.1 Datasets
In our experiments we use both the visual and the audio data from the VoxCeleb1 and VoxCeleb2 datasets. For training, we use the dev portion of the VoxCeleb2 dataset, which includes 5,994 speakers and 1,092,009 utterances. VoxCeleb1 is used as the evaluation set, and all three official test lists, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H, are used for evaluation. Note that the visual data of the official VoxCeleb1 dataset is incomplete; we downloaded the missing visual data from YouTube and published it.
3.2 Experimental setup
3.2.1 Single Modal System
For the audio data, 40-dimensional Fbank features are extracted with the Kaldi toolkit, and silence frames are removed with an energy-based voice activity detector. We then apply CMN to the Fbank features using a sliding window of size 300. For the video data, we extract 1 frame per second. We then use MTCNN to detect facial landmarks and use a similarity transformation to map the face regions to the same shape (3x112x96). Finally, we normalize the pixel values of each image to [0, 1] and subtract 0.5 to map the value range to [-0.5, 0.5].
During training, the Fbank features of an utterance are divided into chunks with chunk sizes from 200 to 400 frames. During testing, we extract one speech embedding for each recording, and average the face embeddings of a recording to obtain its face representation.
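A small sketch of the training-time chunking and the test-time face-embedding averaging described above; the chunk-size range is taken from the text, everything else is illustrative.

```python
import random
import torch

def random_chunk(fbank, min_len=200, max_len=400):
    """fbank: (T, 40) Fbank features of one utterance; returns a random training chunk."""
    chunk_len = random.randint(min_len, min(max_len, fbank.shape[0]))
    start = random.randint(0, fbank.shape[0] - chunk_len)
    return fbank[start:start + chunk_len]

def face_representation(face_embeddings):
    """face_embeddings: (num_frames, 512) face embeddings of one recording."""
    return face_embeddings.mean(dim=0)

chunk = random_chunk(torch.randn(1000, 40))             # 200-400 frame training chunk
face_repr = face_representation(torch.randn(30, 512))   # averaged test-time face embedding
```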
In our experiments, a 50-layer SE-ResNet is used for the face system and a 34-layer ResNet for the voice system. The embeddings of both systems are 512-dimensional. The AAM loss with margin m = 0.2 is used to optimize both systems.
3.2.2 multimodal System
Facial and speech feature embeddings are extracted with the single-modality systems for all recordings in the training set. Then, L2 normalization is applied to all embeddings to build a new training set for the audio-visual multi-modal system.
For the SSA fusion system, each translation layer consists of two fully connected layers with 512 units, and the attention layer is a fully connected layer with two units. For compact bilinear fusion and gated multi-modal fusion, each translation layer is a fully connected layer with 512 units. The attention layer in the gated multi-modal fusion system consists of two fully connected layers with 32 and 512 units, respectively. Between all adjacent fully connected layers above, we insert a batch normalization layer and a ReLU layer.
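The layer configurations above could be written as the following PyTorch module stacks; the batch normalization and ReLU placement between adjacent fully connected layers follows the description, while the attention-layer input sizes are assumptions.

```python
import torch.nn as nn

# SSA fusion system: each translation layer is two 512-unit FC layers
# with BatchNorm + ReLU in between; the attention layer has two units.
ssa_translation = nn.Sequential(
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 512),
)
ssa_attention = nn.Linear(2 * 512, 2)

# Compact bilinear / gated fusion: a single 512-unit translation layer.
gate_translation = nn.Linear(512, 512)

# Gated fusion attention layer: FC layers with 32 and 512 units.
gate_attention = nn.Sequential(
    nn.Linear(2 * 512, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 512),
)
```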
4. Results and analysis
4.1 evaluation of embedding layer multimodal fusion
In order to fuse information in face and voice modalities, different fusion strategies and different loss functions were explored and compared in our embedded-level fusion system. The results and analysis are presented in this section.
The results of the single-modality systems are shown at the top of Table 1. We find that the face and speech single-modality systems are roughly comparable. As shown in the third row of Table 1, simple score averaging of the two single-modality systems greatly surpasses either single-modality system, indicating a strong complementary effect between the audio and visual modalities.
4.1.1 loss function comparison
First, the SSA fusion strategy supervised by the contrastive loss is investigated. However, as shown in the middle of Table 1, the system based on the original contrastive loss does not converge well in our experiments, and the performance of the fused system is even much worse than the single-modality systems. To improve the contrastive loss, the revision with the more aggressive sampling strategy introduced in Section 2.2.1 is used, which gives much better results (SSA + Con-new). To demonstrate the effectiveness of the new strategy more intuitively, the distributions of the distances between positive and negative pairs are shown in Figs. 6a and 6b, where Fig. 6a plots the distances under the original contrastive loss and Fig. 6b plots the distances under the new contrastive loss. They show that the new contrastive loss enlarges the gap between the positive and negative distances. Furthermore, in addition to the contrastive losses, we also use the classification-based AAM-Softmax loss for multi-modal system optimization, which far outperforms the contrastive losses. AAM-Softmax and the new contrastive loss are mainly used in the following experiments.
4.1.2 fusion strategy comparison
In this section, the different fusion strategies introduced in Section 2.1 are compared, with the AAM-Softmax loss or the new contrastive loss providing the supervision signal. The results are shown in the middle part of Table 1. All three fusion strategies achieve significant improvements compared with the single-modality systems, and the gated multi-modal fusion architecture performs best among them. However, simple score averaging still performs best overall, which is not consistent with the findings in [23]. A possible reason is that we have more powerful single-modality systems in this work: using the same VoxCeleb2 test list, our face and speech EERs are 4.08% and 3.43%, respectively, while the corresponding numbers in [23] are 14.5% and 8.03%. The larger difference can also be attributed to different experimental settings: we adopt segment-level optimization in our systems, while the authors in [23] use a frame-level embedding extractor for online verification.
Further, when we use the AAM loss and the new contrastive loss together, a further improvement is obtained, and the performance on the VoxCeleb1-E and VoxCeleb1-H lists exceeds the score-averaging results, as shown in the penultimate row of Table 1. Surprisingly, we find that the proposed fusion system is complementary to the simple score-averaging system: when we further average the score of the GATE + AAM + Con-new fusion system with the average score of the single-modality systems, the best system performance is obtained. To our knowledge, this is also the best published result for person verification on the VoxCeleb1 evaluation dataset.
Table 1: Comparison of results using different fusion strategies and losses. Con-orig: the original contrastive loss. Con-new: the proposed contrastive loss with the more aggressive sampling strategy. m in Con-orig is set to 0.5 and m in Con-new is set to 0.05.
4.2 Evaluation with corrupted and missing modalities
To test the fusion systems under more complex realistic conditions in which one modality is corrupted or missing, they are evaluated on the noisy evaluation set described in Section 2.3.1, and the results are shown in Table 2. We find that the simple score-averaging operation can still significantly improve performance, and the proposed multi-modal fusion system trained with augmented embeddings achieves the best results in this case. Furthermore, the audio-visual fusion system trained only on clean embeddings cannot distinguish noisy embeddings from clean ones well, and its results are somewhat worse. Note that the results in parentheses indicate that the proposed fusion system trained with augmented embeddings still performs well on the clean evaluation set.
Table 2: Comparison of results (EER%) on the noisy evaluation set. The GATE + AAM + Con-new fusion system is used here. Train_Clean: the fusion system is trained with clean embeddings. Train_Noise: the fusion system is trained with augmented noisy embeddings. The results in parentheses are tested on the clean evaluation set.
5. Conclusion
In this work, we explore different multi-modal fusion strategies and loss functions for person verification systems that efficiently combine audio-visual information at the embedding level. Based on powerful single-modality systems, our best system achieves 0.585%, 0.427%, and 0.735% EER on the three official test lists of VoxCeleb1, which, to our knowledge, are the best published results on this dataset. We also introduce an embedding-level data augmentation method that helps the audio-visual multi-modal person verification system perform well when certain modalities are corrupted or missing.
Fig. 7 is a schematic hardware structure diagram of an electronic device for performing a speaker authentication method according to another embodiment of the present application, and as shown in fig. 7, the electronic device includes:
one or more processors 710 and a memory 720, one processor 710 being illustrated in fig. 7.
The apparatus performing the speaker authentication method may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or other means, such as the bus connection in fig. 7.
The memory 720, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speaker identification verification method in the embodiments of the present application. The processor 710 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 720, so as to implement the speaker identity authentication method of the above-described method embodiments.
The memory 720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speaker authentication apparatus, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 720 optionally includes memory located remotely from processor 710, which may be connected to the speaker authentication device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information and generate signals related to user settings and function control of the speaker authentication device. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 and, when executed by the one or more processors 710, perform the speaker identity verification method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functions and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions in essence or part contributing to the related art can be embodied in the form of a software product, which can be stored in a computer readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method according to various embodiments or some parts of embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (10)

1. A speaker identity verification method, comprising:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification;
wherein said determining identity feature embedding from said voice feature embedding and said facial feature embedding comprises:
inputting the speech feature embedding e_v into a first embedded feature conversion layer f_trans_v to obtain a preprocessed speech feature embedding ê_v;
inputting the facial feature embedding e_f into a second embedded feature conversion layer f_trans_f to obtain a preprocessed facial feature embedding ê_f;
whereby ê_v and ê_f lie in a common embedding space and are more suitable for the later fusion;
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding.
2. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
3. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
4. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
5. A speaker authentication system, comprising:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification;
wherein the identity feature embedding determination module comprises:
a first embedded feature conversion layer f_trans_v, configured to convert the speech feature embedding e_v into a preprocessed speech feature embedding ê_v;
a second embedded feature conversion layer f_trans_f, configured to convert the facial feature embedding e_f into a preprocessed facial feature embedding ê_f;
whereby ê_v and ê_f lie in a common embedding space and are more suitable for the later fusion;
and the fusion module is used for carrying out fusion processing on the preprocessed voice characteristic embedding and the preprocessed face characteristic embedding so as to obtain identity characteristic embedding.
6. The system of claim 5, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
7. The system of claim 5, wherein the fusing of the preprocessed speech feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding comprises:
fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding by compact bilinear pooling to obtain the identity feature embedding.
8. The system of claim 5, wherein the fusing of the preprocessed speech feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:
e = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN202010705582.9A 2020-07-21 2020-07-21 Speaker identity verification method and system Active CN111862990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010705582.9A CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Publications (2)

Publication Number Publication Date
CN111862990A CN111862990A (en) 2020-10-30
CN111862990B true CN111862990B (en) 2022-11-11

Family

ID=73000790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010705582.9A Active CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Country Status (1)

Country Link
CN (1) CN111862990B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273859B (en) * 2021-04-30 2024-05-28 清华大学 Safety testing method and device for voice verification device
CN114995657B (en) * 2022-07-18 2022-10-21 湖南大学 Multimode fusion natural interaction method, system and medium for intelligent robot
CN116504226B (en) * 2023-02-27 2024-01-02 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN117011924B (en) * 2023-10-07 2024-02-13 之江实验室 Method and system for estimating number of speakers based on voice and image
CN117155583B (en) * 2023-10-24 2024-01-23 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020046831A1 (en) * 2018-08-27 2020-03-05 TalkMeUp Interactive artificial intelligence analytical system
CN109910818B (en) * 2019-02-15 2021-10-08 东华大学 Vehicle anti-theft system based on human body multi-feature fusion identity recognition
CN110674483B (en) * 2019-08-14 2022-05-13 广东工业大学 Identity recognition method based on multi-mode information
CN111310648B (en) * 2020-02-13 2023-04-11 中国科学院西安光学精密机械研究所 Cross-modal biometric feature matching method and system based on disentanglement expression learning
CN111246256B (en) * 2020-02-21 2021-05-25 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning

Also Published As

Publication number Publication date
CN111862990A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111862990B (en) Speaker identity verification method and system
CN109637546B (en) Knowledge distillation method and apparatus
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
WO2021203880A1 (en) Speech enhancement method, neural network training method, and related device
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN112149615B (en) Face living body detection method, device, medium and electronic equipment
CN113537005A (en) On-line examination student behavior analysis method based on attitude estimation
CN113343898B (en) Mask shielding face recognition method, device and equipment based on knowledge distillation network
CN113361396B (en) Multi-mode knowledge distillation method and system
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN109214616B (en) Information processing device, system and method
CN114616565A (en) Living body detection using audio-visual disparity
CN111401259A (en) Model training method, system, computer readable medium and electronic device
CN111259759B (en) Cross-database micro-expression recognition method and device based on domain selection migration regression
CN110232927B (en) Speaker verification anti-spoofing method and device
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
WO2024131291A1 (en) Face liveness detection method and apparatus, device, and storage medium
CN113851113A (en) Model training method and device and voice awakening method and device
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN111414959A (en) Image recognition method and device, computer readable medium and electronic equipment
CN114596609B (en) Audio-visual falsification detection method and device
CN111898576B (en) Behavior identification method based on human skeleton space-time relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
GR01 Patent grant