WO2021257000A1 - Cross-modal speaker verification - Google Patents

Cross-modal speaker verification Download PDF

Info

Publication number
WO2021257000A1
WO2021257000A1 PCT/SG2021/050358
Authority
WO
WIPO (PCT)
Prior art keywords
face
speaker
embeddings
voice
neural network
Prior art date
Application number
PCT/SG2021/050358
Other languages
French (fr)
Inventor
Ruijie TAO
Rohan Kumar DAS
Haizhou Li
Original Assignee
National University Of Singapore
Priority date
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2021257000A1 publication Critical patent/WO2021257000A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method
    • G06V30/248Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G06V30/2552Combination of methods, e.g. classifiers, working on different input data, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building

Definitions

  • the present invention relates, in general terms, to methods for verifying (i.e. confirming the identity of) speakers using cross-modal authentication. More particularly, the present invention provides methods and systems that use both voice embeddings and face embeddings to verify speakers.
  • VFNet voice-face discriminative network
  • the present invention provides a method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
  • the cross-modal similarity may be expressed as a probability.
  • Training the neural network may further comprise negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
  • Training the neural network may involve transforming the embeddings into a transformed feature space. Training the neural network may involve applying cosine similarity scoring to the transformed embeddings.
  • the neural network may output a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker. p1 may be calculated according to: p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))], where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(Tv(ev), Tf(ef)) for the positive voice-face pairs and 1 - S(Tv(ev), Tf(ef)) for the negative voice-face pairs.
  • the neural network may output a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to: p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))].
  • Also disclosed herein is a method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
  • Determining a cross-modal similarity may comprise applying a neural network trained according to the method described above.
  • the method may further comprise determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability. Determining the probability may comprise calculating a probabilistic linear discriminant (PLDA) based likelihood score.
  • the method may further comprise determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
  • the face similarity score may comprise a cosine similarity.
  • a system for speaker verification comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
  • the at least one processor may be configured to determine a cross-modal similarity by applying a neural network trained according to the method set out above.
  • the at least one processor may further be configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
  • the at least one processor may further be configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
  • Also disclosed herein is a computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification as described above, or to perform the method for speaker verification as described above.
  • the invention enables cross-modal discriminative network assistive speaker recognition.
  • for a speaker recognition system, if an enrolled speaker's face is also available, the test speech can be used to find a general relation with the speaker's face and then assist the speaker recognition system.
  • the invention similarly enables cross-modal discriminative network assistive face recognition.
  • the test face can be used to find a general relation with the speaker's voice and then to assist the face recognition system.
  • the invention enables cross-modal discriminative network assistive audio-visual speaker recognition.
  • test face or voice can be used to find a general relation with the speaker's voice or face with cross-modal discriminative network and then to assist the audio-visual speaker recognition system.
  • Figure 1 shows a method for training a neural network for speaker verification, and a method for verifying a speaker
  • Figure 2 illustrates an architecture of the proposed cross-modal discrimination network, VFNet, that relates the voice and face of a person
  • Figure 3 is a block diagram of the proposed audio-visual (AV) speaker recognition framework with VFNet, where VFNet provides voice-face cross-modal verification information that strengthens the baseline audio-visual speaker recognition decision; and
  • Figure 4 is a schematic of a system on which the methods of Figure 1 can be implemented.
  • VFNet provides additional speaker discriminative information, enabling significant improvements for audio-visual speaker recognition over the standard fusion of separate audio and visual systems.
  • the cross-modal discriminative network can also be useful for improving both speaker and face recognition individual system performance.
  • the use of cross-modal information between voice and face can be used in various applications including:
  • FIG. 1 illustrates a method 100 for training the neural network for speaker verification.
  • the method 100 broadly comprises:
  • Step 102 receiving voice and face inputs
  • Step 104 extracting speaker (i.e. voice) embeddings
  • Step 106 extracting face embeddings
  • Step 108 training the neural network using the embeddings.
  • the above method yields VFNet, the trained neural network model for cross-modal speaker verification.
  • the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
  • the trained neural network i.e. the product of step 108 will be able to verify both of a voice and face of a further speaker based on the associations and the face embeddings and voice embeddings, respectively.
  • Figure 2 shows the architecture 200 of the neural network, showing consideration of two inputs: a voice waveform 202 and a human face 204.
  • the output 206 of the network 200 is a confidence score that describes at least one of the level of confidence that the voice and the face come from the same person and, where a threshold confidence level is provided, a positive or negative response indicating that the network 200 considers the voice and face to be of the same person or different people, respectively.
  • step 102 involves receiving a waveform 202 and a face image 204 for each of a plurality of speakers.
  • each waveform will be paired with a face image and vice versa, to form a voice-face pair or face-voice pair.
  • the method 100 will generally be applied to a plurality, often many thousands, of voice-face pairs.
  • Each face image comprises a face of a speaker and each waveform comprises a voice of a speaker.
  • the voice in the waveform will be for the same speaker as the face in the image.
  • negative training is being employed, the voice in the waveform will be for a different speaker to that whose face is in the image.
  • the waveform and face may be extracted from a database, received through a receiver portion of a transceiver, obtained through direct capture (e.g. a video feed capturing one or more face images of a speaker and a voice input) or obtained by any other suitable method.
  • Step 104 involves extracting one or more speaker embeddings from each voice waveform.
  • the speaker embeddings are low-dimensional representations that describe features of the voice waveform such that voice waveforms with similar, or close, embeddings indicate that the speaker's voice in each case is semantically similar.
  • the speaker embeddings are extracted using a speaker embedding extractor 208.
  • the inputs from which the speaker embeddings are extracted may be derived from any suitable corpus such as the VoxCeleb1-2 corpora.
  • the voice waveforms or speech waveforms may be taken from the audio channel of a video feed, and the face images may be selected from the image channel of the video feed - thus, both the voice waveform or waveforms, and face image or images, may be extracted from the same video feed.
  • an x-vector based system is used for speaker embeddings extraction.
  • Speech utterances, each of which comprises a voice of a speaker and is herein referred to as a voice waveform, are processed with energy based voice activity detection. This process removes silence regions. Therefore, while in some instances only part of a voice waveform may be used for extracting the speaker embeddings, in other instances, such as where the voice waveform is an audio channel of a video, the entire waveform may be used, with pre-processing removing silent and/or noisy regions, thereby reducing the size of the input from which the model trains.
  • Energy based voice activity detection may alternatively, or in addition, be used for extracting one or both of frequency and spectral features.
  • mel frequency cepstral coefficient (MFCC) features may be extracted.
  • the MFCC features may be any desired dimension, such as 30-dimensional MFCC features.
  • windowing of the input 202 and/or 204 may be employed. Windowing enables features to be extracted from portions of a waveform or face image, reduces computation load, and enables features in one portion of a waveform or face image to be associated with features in a different portion of the same waveform or face image, those associations being lower-level features when compared with the features that they associate. Normalisation may also be applied across each window. In an example, short-time cepstral mean normalization is applied over a 3-second sliding window.
  • Step 106 involves extracting one or more face embeddings from each face image.
  • Various models can be used for extracting face embeddings from images.
  • the face embedding extractor 210 may comprise the ResNet-50 RetinaFace model trained on a suitable database, such as the WIDER FACE database, to detect faces.
  • the face embeddings may then be aligned.
  • One method for aligning the face embeddings is to use a multi-task cascaded convolutional network (MTCNN). Faces are detected in the images, recognised and aligned using suitable functions such as those provided in the InsightFace library, to obtain highly discriminative features for face recognition. To facilitate this process, additive angular margin loss may be used, e.g. for feature extraction.
  • the face embedding extractor 210 may also include the ResNet-100 extractor model trained on the VGGFace2 and cleaned MS1MV2 database to extract the face embeddings.
  • the dimension of both speaker and face embeddings is 512 (212, 214).
  • the 512-dimensional embeddings 212, 214 are used in step 108 for training the neural network 200.
  • positive training will be used either alone, or with negative training, to train the neural network 200.
  • positive training is where, for each voice-face pair, the voice and face are of the same speaker.
  • negative training is where, for each voice-face pair, the voice and face are of different speakers.
  • the neural network 200 can learn one or more associations between the voice waveform or waveforms and the respective face in the face image or face images.
  • the speaker embeddings and face embeddings represent information from respectively different modalities.
  • the inputs (i.e. the 512-dimensional embeddings) are fed to a lower-dimensional layer, or successively lower-dimensional layers
  • the 512-dimensional input is fed to a 256-dimensional layer 216, 218 followed by a 128-dimensional layer 220, 222.
  • the 256-dimensional layer 216, 218 is a fully connected layer (FC1) with rectified linear unit (ReLU) activation.
  • the 128-dimensional layer 220, 222 is a fully connected layer (FC2) without the ReLU.
  • These layers 216, 218, 220, 222 are introduced to guide the speaker and face embeddings towards learning cross-modal identity information from each other. Further, they help to project the embeddings from both modalities into a new domain, where their relation can be established.
  • step 108 may further involve transforming the embeddings into a transformed feature space.
  • Tv(ev) and Tf(ef) are derived from VFNet.
  • the cosine similarity (224) is then determined to produce the cosine similarity score S(Tv(ev), Tf(ef)) between the embeddings or the transformed embeddings. 1 - S(Tv(ev), Tf(ef)) can then be used to represent a lack of cosine similarity. This is particularly applied, for example, during negative training on negative voice-face pairs.
  • the final output p1 is the score that describes the probability that the voice and the face belong to the same person
  • p2 is the score depicting the probability that the voice and the face do not belong to the same person.
  • p1 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to the same speaker
  • p2 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to different speakers.
  • the neural network 200 can output p1 and p2 using a softmax function, where ev and ef are the voice embeddings and face embeddings respectively, and Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively.
  • the loss is then propagated back through the layers 216, 218, 220, 222 to adjust weights applied by those layers to particular features found in the speaker embeddings and face embeddings, to cause the neural network 200 to learn.
  • the neural network 200 will be able to determine a cross-modal similarity for a particular voice-face input pair. In other words, the neural network 200 will be able to specify a likelihood or probability that the face visible in an input (e.g. video feed) corresponds to the voice audible in that input.
  • the speaker and face embeddings for the model 200 to perform cross- modal verification follow the same pipeline discussed above.
  • the NIST SRE audio-visual corpus was used for the speaker recognition application.
  • Manually marked diarization labels of voice and keyframe indices, along with bounding boxes that mark a face of the individual, are provided - i.e. the dataset may include, for some of the frames, a bounding box identifying an enrolled speaker or bounding boxes identifying enrolled speakers. This enables enrolment of the target speakers from the videos (i.e. the audio-visual corpus) during training of the model (i.e. network) 200.
  • the speaker and face embeddings of the target person may be extracted from the enrolment segments (i.e. audio-visual feed) of a development set (e.g. NIST SRE), and the model is then retrained based on the combination of the enrolment segments and a second database (e.g. VoxCeleb2).
  • one data set may be used to enrol specific individual speakers - e.g. employees of a company for whom voice and/or face recognition is to be used - and a second data set can then be used to refine or generalise the model 200.
  • Table 1: summary of VoxCeleb2 and 2019 NIST SRE audio-visual (AV) corpora
  • the neural network model 200 trained using the above method 100 results in a model that can be used to determine a cross-modal similarity between a voice and a face of a speaker.
  • the output of the model 200 can be used either independently, to verify that a face and voice are the same individual, or be fused with a speaker recognition system or face recognition system for enhanced speaker and face recognition, respectively.
  • the output of the model 200 can also be fused with the output of a baseline audio-visual recognition system as shown in Figure 3. The fused output can then be used in a final decision for verifying the identity of the speaker.
  • Figure 1 shows a method 110 for speaker verification.
  • the method 110 leverages a neural network trained according to the method 100.
  • Speaker verification method 110 broadly comprises:
  • Step 114 receiving a voice waveform and at least one face image of a speaker
  • Step 116 extracting one or more face embeddings from the face image or images
  • Step 118 extracting one or more voice embeddings from the voice waveform;
  • Step 120 determining a cross-modal similarity between the voice embedding or embeddings and the face embedding or embeddings; and Step 122: verifying the speaker.
  • a voice waveform and one or more face images are received through a receiver forming part of, for example, transceiver 412 of Figure 4.
  • the voice waveform and face image or images are those of a speaker.
  • speaker embeddings and face embeddings are extracted in the same manner as those embeddings are extracted at steps 104 and 106, respectively.
  • Step 120 involves determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings.
  • the cross- modal similarity is determined from a cross-modal similarity score that is the output from the trained model 200 developed using the method 100.
  • Step 122 involves verifying the speaker if the similarity score exceeds a predetermined threshold.
  • Verification can be a Yes/No verification in which the speaker is verified - i.e. the voice and face are deemed to match the same person - if the cross-modal similarity score is greater than the predetermined threshold, and the speaker is not verified - i.e. the voice and face are unlikely to be the same person - if the cross-modal similarity score is lower than or equal to the predetermined threshold.
  • verification may employ multiple thresholds.
  • a cross-modal similarity score below a first threshold is an indication of high confidence that the face and voice are not a match
  • between the first threshold and a second threshold is an indication that further information is required to confidently determine whether or not the face and voice are a match
  • above the second threshold indicates high confidence that the face and voice are a match.
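  • as a small illustration of such a two-threshold decision rule, the sketch below maps a cross-modal similarity score to one of three outcomes; the threshold values and return labels are invented for the example:

```python
def decide(score, reject_threshold=0.3, accept_threshold=0.7):
    """Map a cross-modal similarity score to a three-way decision (illustrative thresholds)."""
    if score < reject_threshold:
        return "no-match"             # high confidence the face and voice are not a match
    if score < accept_threshold:
        return "needs-more-evidence"  # defer to further information, e.g. speaker or face recognition
    return "match"                    # high confidence the face and voice are a match
```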
  • Figure 3 illustrates an architecture in which the cross-modal similarity is used to enhance voice and face verification.
  • the architecture 300 provides an audio-visual speaker recognition framework in which the left panel 302 represents the process for cross-modal speaker verification, and baseline recognition is set out in the right panel 304.
  • Voice segments of the target person are inputted (306) to the system 300 and are used in panel 302 to determine face embeddings corresponding to the speaker embeddings.
  • Embeddings are extracted using the x-vector system.
  • the speaker verification network 308 determines a speaker verification score 312 from embeddings in the voice segments 306 and the whole audio waveform 310, for speaker verification - i.e. a confidence score that the speaker is enrolled with the system 300.
  • InsightFace or another system extracts the face embeddings for the given faces of the target speakers from the enrolment videos and for all detected faces from the test videos 314, in both panels 302 and 304.
  • the cross-modal network provides an association score (i.e. cross-modal verification score) between the target speaker voice in the enrolment video and the faces detected from the test video. Matching pairs between voice and face will give rise to a high association score, while mismatches, such as discrepancies in age, gender, weight and ethnicity, will not.
  • the audio and visual systems 316, 318, respectively, run in parallel to verify the target person's identity by computing a match between the enrolment and the test embeddings.
  • the speaker recognition system 316 considers probabilistic linear discriminant (PLDA) based likelihood scores.
  • the face recognition system 318 computes cosine similarity scores.
  • the state-of-the-art baseline for present purposes is then a score level fusion 320 between the two parallel systems 316, 318.
  • the system 300 uses the cross-modal similarity score with the output of one or both of speaker verification model 316 and face verification model 318.
  • a probability that the voice waveform 306 corresponds to the speaker is determined based only on the speaker embeddings, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed probability - i.e. that the voice waveform 306 corresponds to the speaker.
  • a face similarity score is determined specifying a similarity between the face in the target faces 322 and the speaker, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed face similarity - i.e. that the face in the target faces corresponds to the speaker.
  • the face similarity score may be a cosine similarity.
  • Score level fusion 324 may be performed using various methods. In one embodiment, score level fusion is performed using logistic regression.
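  • for instance, score-level fusion 324 by logistic regression could be sketched as below; the three scores per trial (speaker recognition, face recognition and cross-modal) and the toy calibration data are placeholders rather than actual development trials:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds one trial's scores: [speaker recognition (PLDA), face recognition (cosine), VFNet cross-modal].
dev_scores = np.array([[ 2.1, 0.83, 0.91],
                       [-1.4, 0.22, 0.18],
                       [ 0.9, 0.65, 0.74],
                       [-2.2, 0.15, 0.09]])
dev_labels = np.array([1, 0, 1, 0])   # 1 = target (same person) trial, 0 = non-target trial

fusion = LogisticRegression()
fusion.fit(dev_scores, dev_labels)    # learns a weight per system plus a bias on development trials

test_scores = np.array([[1.7, 0.78, 0.88]])
fused_score = fusion.predict_proba(test_scores)[:, 1]  # fused score used for the final accept/reject decision
print(fused_score)
```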
  • an enrolment video (comprising audio and visual channels) provides the target individual's biometric information (voice and face) and the assignment asks the model to automatically determine whether the target person is present in a given test video. That determination is based on a cross-modal confidence score that specifies whether the voice is likely to match the face, and one or both of face recognition and speaker recognition.
  • the embedding extraction for voice and face produces a 512-dimensional output.
  • while the dimensions of the speaker (i.e. voice) and face embeddings are the same, the back-end scoring for the respective individual systems is different.
  • LDA Linear discriminant analysis
  • PLDA is used as a classifier to get the final speaker recognition score.
  • cosine similarities between face embeddings from the enrolment video and those from the detected faces in the test video are computed. The average of the top 20% of these scores over the face embeddings in the test video is taken to derive the final face recognition score.
  • the VFNet back-end computes the likelihood score between the speaker embedding of the target speaker and all the face embeddings of detected faces in the test video - this score is determined using Equation (1). Finally, the average of all scores, or of a predetermined set or proportion of scores - e.g. the top 20% - is taken. This average is combined with the scores generated from one or both of the audio and visual systems. That combination may be achieved using a variety of methods, including logistic regression. Notably, cross-modal verification can also be done by considering all the given faces in the enrolment video and the detected multiple speaker voices in the test audio.
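  • the pooling of per-face scores described above might be sketched as follows, assuming arrays of already computed scores (cosine scores for the face system, Equation (1) scores for VFNet); the helper name and the handling of the top 20% are illustrative assumptions:

```python
import numpy as np

def top_fraction_mean(scores, fraction=0.2):
    """Average the best-scoring fraction of per-face scores from a test video (top 20% by default)."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # best scores first
    k = max(1, int(np.ceil(fraction * len(ordered))))         # keep at least one score
    return ordered[:k].mean()

# face_scores  : cosine similarities between enrolment face embeddings and each detected test face
# vfnet_scores : cross-modal scores between the target speaker embedding and each detected test face
face_scores = [0.81, 0.77, 0.43, 0.12, 0.05]
vfnet_scores = [0.88, 0.72, 0.51, 0.30, 0.22]
final_face_score = top_fraction_mean(face_scores)
final_vfnet_score = top_fraction_mean(vfnet_scores)
```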
  • a speaker diarization module that detects the voice belonging to different speakers in the test audio.
  • the model may determine that a mouth of a particular face is moving in a manner corresponding to an audio channel feed.
  • Various processes and libraries can be used to fuse scores.
  • the Bosaris toolkit can be used to calibrate and fuse the scores of the different systems 302, 316, 318.
  • the performance of systems is reported in terms of equal error rate (EER), minimum detection cost function (minDCF) and actual detection cost function (actDCF) following the protocol of 2019 NIST SRE.
  • the model 200 trained according to the method 100 performed effectively for cross-modal verification.
  • the method 100 may comprise formulating and adding one or more shared-weight sub-branches to the neural network model for selection requirements.
  • Table 4 shows the changes in results when cross-modal audio visual speaker recognition is used as per the model trained according to method 100, when compared with speaker recognition, face recognition and audio-visual recognition systems that do not use cross-modal audio visual speaker recognition.
  • VFNet i.e. the trained model
  • the trained model is also able to enhance the audio-visual baseline system performance. This suggests a usefulness for associating audio and visual cues by cross-modal verification for audio-visual SRE.
  • the relative improvements in each case are 16.54%, 2.00% and 8.83% in terms of EER, minDCF and actDCF, respectively.
  • Figure 1 illustrates method 100 for training, and method 110 for using, a novel framework for audio-visual speaker recognition with a cross- modal discrimination network.
  • the VFNet based cross-modal discrimination network finds the relations between a given pair of human voice and face to generate a confidence score based on a confidence that the voice and face belong to the same person. While the trained model can perform comparably to existing state-of-the-art cross-modal verification systems, the proposed framework of audio-visual speaker recognition with cross-modal verification outperforms the baseline audio-visual system. This highlights the importance of cross-modal verification, in other words, the relation between audio and visual cues, for audio-visual speaker recognition.
  • FIG 4 is a block diagram showing an exemplary computer device 400, in which embodiments of the invention, particularly methods 100 and 110 of Figure 1, may be practiced.
  • the computer device 400 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhoneTM manufactured by AppleTM, Inc. or one manufactured by LGTM, HTCTM or SamsungTM, for example, or another device.
  • the mobile computer device 400 includes the following components in electronic communication via a bus 406:
  • (a) a display 402; (b) non-volatile (non-transitory) memory 404; (c) random access memory ("RAM") 408;
  • transceiver component 412 that includes N transceivers
  • Although the components depicted in Figure 4 represent physical components, Figure 4 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 4.
  • the display 402 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
  • non-volatile data storage 404 functions to store (e.g., persistently store) data and executable code.
  • the system architecture may be implemented in memory 404, or by instructions stored in memory 404 - e.g. memory 404 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 410 cause the processor(s) 410 to perform the methods 100 and/or 110 described with reference to Figure 1.
  • the non-volatile memory 404 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
  • the non-volatile memory 404 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well.
  • the executable code in the non-volatile memory 404 is typically loaded into RAM 408 and executed by one or more of the N processing components 410.
  • the N processing components 410 in connection with RAM 408 generally operate to execute the instructions stored in non-volatile memory 404.
  • the N processing components 410 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
  • the transceiver component 412 includes N transceiver chains, which may be used for communicating with external devices via wireless networks.
  • Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme.
  • each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
  • the system 400 of Figure 4 may be connected to any appliance 418, such as an external server, database, video feed or other source from which inputs may be obtained.
  • Non-transitory computer-readable medium 404 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available medium that can be accessed by a computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

Described is a method for training a neural network for speaker verification. The method involves receiving a voice waveform and a face image (face) for each of a plurality of speakers. From each voice waveform, one or more speaker embeddings are extracted. From each face image, one or more face embeddings are extracted. The neural network is then trained by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the voice waveform and the face.

Description

CROSS-MODAL SPEAKER VERIFICATION
Technical Field
The present invention relates, in general terms, to methods for verifying (i.e. confirming the identity of) speakers using cross-modal authentication. More particularly, the present invention provides methods and systems that use both voice embeddings and face embeddings to verify speakers.
Background
Automatic speaker recognition systems have witnessed major breakthroughs in the past decade. These breakthroughs have led to many real-world practical systems. Most of the existing works build systems that recognise the speaker's face and do a voice comparison to authenticate speech.
These systems are expected to perform effectively under adverse conditions. When creating an audio-visual speaker recognition evaluation (SRE), the simplest way to perform multimedia based speaker recognition is to have separate systems for audio and visual inputs, then authenticate the speaker based on the results of the speaker and face recognition processes. This separates the recognition task into two subtasks and is a straightforward approach to simplify the problem. However, the two subsystems performing the subtasks are disjoint and one does not consider knowledge from the other.
It would be desirable to overcome or reduce at least one of the above-described problems with existing speaker recognition systems, or at least to provide a useful alternative.
Summary
Investigations into human recognition systems show that humans associate the voice and the face of a person in memory. While listening to the voice of an individual, one can select, from two static faces, the static face of that individual at a higher-than-chance level, and vice versa. The associations learned by humans between audio and visual cues are general identity features (such as gender, age and ethnicity) and appearance features (such as a big nose, chubbiness and a double chin).
The present invention was developed on the understanding that general cross-modal discriminative features provide additional information in audio-visual speaker recognition. In particular, disclosed is a voice-face discriminative network (VFNet) that establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
Accordingly, the present invention provides a method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
Notably, the cross-modal similarity may be expressed as a probability.
Training the neural network may further comprise negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
Training the neural network may involve transforming the embeddings into a transformed feature space. Training the neural network may involve applying cosine similarity scoring to the transformed embeddings. The neural network may output a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker. p1 may be calculated according to:

p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]

where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(Tv(ev), Tf(ef)) for the positive voice-face pairs and 1 - S(Tv(ev), Tf(ef)) for the negative voice-face pairs. The neural network may output a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to:

p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]
Also disclosed herein is a method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
Determining a cross-modal similarity may comprise applying a neural network trained according to the method described above.
The method may further comprise determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability. Determining the probability may comprise calculating a probabilistic linear discriminant (PLDA) based likelihood score.
The method may further comprise determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
The face similarity score may comprise a cosine similarity.
Disclosed herein is a system for speaker verification, comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
The at least one processor may be configured to determine a cross-modal similarity by applying a neural network trained according to the method set out above. The at least one processor may further be configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
The at least one processor may further be configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
Also disclosed herein is a computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification as described above, or to perform the method for speaker verification as described above.
Advantageously, the invention enables cross-modal discriminative network assistive speaker recognition. For a speaker recognition system, if an enrolled speaker's face is also available, the test speech can be used to find a general relation with the speaker's face and then assist the speaker recognition system.
Advantageously, the invention similarly enables cross-modal discriminative network assistive face recognition. For a face recognition system, if an enrolled individual's voice is also available, the test face can be used to find a general relation with the speaker's voice and then to assist the face recognition system.
Advantageously, the invention enables cross-modal discriminative network assistive audio-visual speaker recognition. For an audio-visual speaker recognition system, either the test face or voice can be used to find a general relation with the speaker's voice or face with cross-modal discriminative network and then to assist the audio-visual speaker recognition system.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:
Figure 1 shows a method for training a neural network for speaker verification, and a method for verifying a speaker;
Figure 2 illustrates an architecture of the proposed cross-modal discrimination network, VFNet, that relates the voice and face of a person; Figure 3 is a block diagram of the proposed audio-visual (AV) speaker recognition framework with VFNet, where VFNet provides voice-face cross-modal verification information that strengthens the baseline audio-visual speaker recognition decision; and
Figure 4 is a schematic of a system on which the methods of Figure 1 can be implemented.
Detailed description
Described is a voice-face discriminative network (VFNet) that enables cross-modal similarity to be detected in a similar manner to the way a human can identify the face of an unknown speaker from a small number of people, based on the voice of the speaker. VFNet establishes a general relation between human voice and face.
Experiments show that VFNet provides additional speaker discriminative information, enabling significant improvements for audio-visual speaker recognition over the standard fusion of separate audio and visual systems. Further, the cross-modal discriminative network can also be useful for improving both speaker and face recognition individual system performance. The use of cross-modal information between voice and face can be used in various applications including:
(i) methods using a cross-modal relationship between a voice-face pair (i.e. a voice waveform and a face) to support audio-visual speaker recognition;
(ii) methods using a cross-modal relationship between a voice-face pair to support either speaker recognition or face recognition systems when both voice and face are available for the target speakers.
The discussion below describes these methods and a system for implementing the methods, and demonstrates that the cross-modal discriminative network finds a general relation between voice and face, such that a voice-face pair can be used to assist audio-visual speaker recognition with robust performance. The same improvement is achieved for the results of speaker and face recognition separately, with access to both the voice and face of enrolled individuals.
With reference to Figure 1, the basis for these methods relies on providing a model with learned features mapping particular voice characteristics with particular facial features. A neural network model is presently proposed, for which Figure 1 illustrates a method 100 for training the neural network for speaker verification. The method 100 broadly comprises:
Step 102: receiving voice and face inputs;
Step 104: extracting speaker (i.e. voice) embeddings;
Step 106: extracting face embeddings; and
Step 108: training the neural network using the embeddings.
The above method yields VFNet, the trained neural network model for cross-modal speaker verification. The neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings. In general, the trained neural network (i.e. the product of step 108) will be able to verify both of a voice and face of a further speaker based on the associations and the face embeddings and voice embeddings, respectively.
Figure 2 shows the architecture 200 of the neural network, showing consideration of two inputs: a voice waveform 202 and a human face 204. The output 206 of the network 200 is a confidence score that describes at least one of the level of confidence that the voice and the face come from the same person and, where a threshold confidence level is provided, a positive or negative response indicating that the network 200 considers the voice and face to be of the same person or different people, respectively.
In more detail, step 102 involves receiving a waveform 202 and a face image 204 for each of a plurality of speakers. In general, each waveform will be paired with a face image and vice versa, to form a voice-face pair or face-voice pair. Moreover, the method 100 will generally be applied to a plurality, often many thousands, of voice-face pairs. Each face image comprises a face of a speaker and each waveform comprises a voice of a speaker. Where positive training is being employed, the voice in the waveform will be for the same speaker as the face in the image. Where negative training is being employed, the voice in the waveform will be for a different speaker to that whose face is in the image.
The waveform and face may be extracted from a database, received through a receiver portion of a transceiver, obtained through direct capture (e.g. a video feed capturing one or more face images of a speaker and a voice input) or obtained by any other suitable method. Step 104 involves extracting one or more speaker embeddings from each voice waveform. The speaker embeddings are low-dimensional representations that describe features of the voice waveform such that voice waveforms with similar, or close, embeddings indicate that the speaker's voice in each case is semantically similar.
The speaker embeddings are extracted using a speaker embedding extractor 208. The inputs from which the speaker embeddings are extracted may be derived from any suitable corpus such as the VoxCeleb1-2 corpora. The voice waveforms or speech waveforms may be taken from the audio channel of a video feed, and the face images may be selected from the image channel of the video feed - thus, both the voice waveform or waveforms, and face image or images, may be extracted from the same video feed.
Various systems can be used for extracting speaker embeddings. In an example, an x-vector based system is used for speaker embeddings extraction. Speech utterances, each of which comprises a voice of a speaker and is herein referred to as a voice waveform, are processed with energy based voice activity detection. This process removes silence regions. Therefore, while in some instances only part of a voice waveform may be used for extracting the speaker embeddings, in other instances, such as where the voice waveform is an audio channel of a video, the entire waveform may be used with pre-processing removing silent and/or noisy regions thereby reducing the size of the input from which the model trains. Energy based voice activity detection may alternatively, or in addition, be used for extracting one or both of frequency and spectral features. For example, mel frequency cepstral coefficient (MFCC) features may be extracted. The MFCC features may be any desired dimension, such as 30-dimensional MFCC features. In addition, windowing of the input 202 and/or 204 may be employed. Windowing enables features to be extracted from portions of a waveform or face image, reduces computation load, and enables features in one portion of a waveform or face image to be associated with features in a different portion of the same waveform or face image, those associations being lower-level features when compared with the features that they associate. Normalisation may also be applied across each window. In an example, short-time cepstral mean normalization is applied over a 3-second sliding window.
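As a concrete illustration of this front end, the sketch below extracts 30-dimensional MFCC features with librosa, applies a simple energy-based voice activity mask and performs sliding-window cepstral mean normalisation; the library choice, frame settings and the percentile threshold are assumptions for the example rather than the exact configuration used in the experiments.

```python
import numpy as np
import librosa  # assumed front-end library for this sketch

def mfcc_frontend(wav_path, sr=16000, n_mfcc=30, cmn_window_s=3.0, energy_percentile=30):
    """30-dim MFCCs with energy-based VAD and sliding-window cepstral mean normalisation (illustrative)."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop, win = int(0.010 * sr), int(0.025 * sr)      # 10 ms hop, 25 ms frames (assumed settings)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=win, hop_length=hop)

    # Energy-based voice activity detection: drop frames whose energy falls below a percentile threshold.
    energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)[0]
    n = min(mfcc.shape[1], energy.shape[0])
    mfcc, energy = mfcc[:, :n], energy[:n]
    mfcc = mfcc[:, energy > np.percentile(energy, energy_percentile)]

    # Short-time cepstral mean normalisation over a sliding window (~3 s of 10 ms frames) of the kept frames.
    half = int(cmn_window_s / 0.010) // 2
    out = np.empty_like(mfcc)
    for t in range(mfcc.shape[1]):
        lo, hi = max(0, t - half), min(mfcc.shape[1], t + half + 1)
        out[:, t] = mfcc[:, t] - mfcc[:, lo:hi].mean(axis=1)
    return out  # shape (30, num_voiced_frames), ready for an x-vector style extractor
```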
Step 106 involves extracting one or more face embeddings from each face image. Various models can be used for extracting face embeddings from images. For example, the face embedding extractor 210 may comprise the ResNet-50 RetinaFace model trained on a suitable database, such as the WIDER FACE database, to detect faces.
The face embeddings may then be aligned. One method for aligning the face embeddings is to use a multi-task cascaded convolutional network (MTCNN). Faces are detected in the images, recognised and aligned using suitable functions such as those provided in the InsightFace library, to obtain highly discriminative features for face recognition. To facilitate this process, additive angular margin loss may be used, e.g. for feature extraction. The face embedding extractor 210 may also include the ResNet-100 extractor model trained on the VGGFace2 and cleaned MS1MV2 database to extract the face embeddings.
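A minimal sketch of such a face pipeline is given below using the InsightFace Python package, assuming its bundled detection and ArcFace recognition models as stand-ins for the RetinaFace/ResNet-100 combination described above; the model name and API usage should be treated as assumptions rather than the exact configuration of the described extractor 210.

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis  # assumed high-level detection + ArcFace embedding API

def extract_face_embeddings(image_path):
    """Detect, align and embed every face in an image; returns a list of 512-dim embeddings."""
    app = FaceAnalysis(name="buffalo_l")          # bundled detector and recogniser (assumed model pack)
    app.prepare(ctx_id=-1, det_size=(640, 640))   # ctx_id=-1 selects CPU execution
    img = cv2.imread(image_path)
    faces = app.get(img)                          # detection, alignment and embedding in one call
    return [np.asarray(f.normed_embedding) for f in faces]
```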
In the example shown in Figure 2, the dimension of both speaker and face embeddings is 512 (212, 214). The 512-dimensional embeddings 212, 214 are used in step 108 for training the neural network 200. In general, positive training will be used either alone, or with negative training, to train the neural network 200. In this sense, positive training is where, for each voice-face pair, the voice and face are of the same speaker. Conversely, negative training is where, for each voice-face pair, the voice and face are of different speakers.
By performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, the neural network 200 can learn one or more associations between the voice waveform or waveforms and the respective face in the face image or face images.
The speaker embeddings and face embeddings represent information from respectively different modalities. To learn associations between the two, the inputs (i.e. their 512-dimensional inputs) are fed to a lower dimensional layer, or successively lower dimensional layers. In the present embodiment, the 512-dimensional input is fed to a 256-dimensional layer 216, 218 followed by a 128-dimensional layer 220, 222. The 256-dimensional layer 216, 218 is a fully connected layer (FC1) with rectified linear unit (ReLU) activation. The 128-dimensional layer 220, 222 is a fully connected layer (FC2) without the ReLU. These layers 216, 218, 220, 222 are introduced to guide the speaker and face embeddings towards learning cross-modal identity information from each other. Further, they help to project the embeddings from both modalities into a new domain, where their relation can be established.
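For illustration, a minimal PyTorch sketch of these two transformation branches is shown below; the layer sizes follow the description (512 to 256 with ReLU, then 256 to 128 without activation), while the class names and the softmax-based output are written as assumptions consistent with the formulas that follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VFNetBranch(nn.Module):
    """One transformation branch: FC1 (512 -> 256, ReLU) followed by FC2 (256 -> 128, no activation)."""
    def __init__(self, in_dim=512, hidden_dim=256, out_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # FC1, used with ReLU activation
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # FC2, no ReLU

    def forward(self, e):
        return self.fc2(F.relu(self.fc1(e)))

class VFNet(nn.Module):
    """Cross-modal discrimination sketch: one branch per modality, cosine scoring, softmax over (S, 1 - S)."""
    def __init__(self):
        super().__init__()
        self.voice_branch = VFNetBranch()  # transforms speaker embeddings e_v into T_v(e_v)
        self.face_branch = VFNetBranch()   # transforms face embeddings e_f into T_f(e_f)

    def forward(self, e_v, e_f):
        t_v = self.voice_branch(e_v)
        t_f = self.face_branch(e_f)
        s = F.cosine_similarity(t_v, t_f, dim=-1)                      # S(T_v(e_v), T_f(e_f))
        p = torch.softmax(torch.stack([s, 1.0 - s], dim=-1), dim=-1)
        return p[..., 0], p[..., 1]                                    # p1 (same person), p2 (different)
```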
To determine similarity, step 108 may further involve transforming the embeddings into a transformed feature space. For a given pair of a speaker embedding ev and a face embedding ef, their transformed embeddings Tv(ev) and Tf(ef) are derived from VFNet. The cosine similarity (224) is then determined to produce the cosine similarity score S(Tv(ev), Tf(ef)) between the embeddings or the transformed embeddings. 1 - S(Tv(ev), Tf(ef)) can then be used to represent a lack of cosine similarity. This is particularly applied, for example, during negative training on negative voice-face pairs. Thus, the final output p1 is the score that describes the probability that the voice and the face belong to the same person, and p2 is the score depicting the probability that the voice and the face do not belong to the same person. In other words, p1 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to the same speaker, and p2 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to different speakers.
By using a softmax function based on S(Tv(ev), Tf(ef)) and 1 - S(Tv(ev), Tf(ef)), the neural network 200 can output p1 and p2, expressed as:

p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]
p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]

where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and S(Tv(ev), Tf(ef)) is the cosine similarity between the transformed embeddings:

S(Tv(ev), Tf(ef)) = (Tv(ev) · Tf(ef)) / (||Tv(ev)|| ||Tf(ef)||)
These probabilities constitute predictions of whether the face and voice belong to the same or different people. The predictions, along with the ground-truth verification labels y_i ∈ {0, 1}, are used to calculate the cross-entropy loss according to:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log p_1^{(i)} + (1 - y_i) \log p_2^{(i)} \,\right] \tag{3}$$

where N is the number of voice-face training pairs.
The loss is then propagated back through the layers 216, 218, 220, 222 to adjust weights applied by those layers to particular features found in the speaker embeddings and face embeddings, to cause the neural network 200 to learn. After being trained, the neural network 200 will be able to determine a cross-modal similarity for a particular voice-face input pair. In other words, the neural network 200 will be able to specify a likelihood or probability that the face visible in an input (e.g. video feed) corresponds to the voice audible in that input.
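By way of a hedged illustration, the scoring and loss described above can be combined into a single sketched training step in PyTorch; the function name, batch layout and use of binary cross-entropy over p1 (which is equivalent to the two-class cross-entropy over p1 and p2) are assumptions rather than the implementation disclosed here.

```python
import torch
import torch.nn.functional as F

def vfnet_training_step(e_v, e_f, labels, voice_branch, face_branch, optimizer):
    """One sketched training step.

    labels: 1 for positive (same-person) voice-face pairs, 0 for negative pairs.
    """
    t_v = voice_branch(e_v)                   # T_v(e_v), shape (batch, 128)
    t_f = face_branch(e_f)                    # T_f(e_f), shape (batch, 128)
    s = F.cosine_similarity(t_v, t_f, dim=1)  # S(T_v(e_v), T_f(e_f))
    # Softmax over [S, 1 - S] yields p1 (same person) and p2 (different persons).
    probs = F.softmax(torch.stack([s, 1.0 - s], dim=1), dim=1)
    p1 = probs[:, 0]
    # Binary cross-entropy against the ground-truth verification labels.
    loss = F.binary_cross_entropy(p1, labels.float())
    optimizer.zero_grad()
    loss.backward()   # propagates the loss back through the projection layers
    optimizer.step()
    return loss.item()
```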
The speaker and face embeddings used when the model 200 performs cross-modal verification follow the same pipeline discussed above.
In experiments, the NIST SRE audio-visual corpus was used for the speaker recognition application. Manually marked diarization labels of the voice, along with keyframe indices and bounding boxes that mark a face of the individual, are provided - i.e. the dataset may include, for some of the frames, a bounding box identifying an enrolled speaker or bounding boxes identifying enrolled speakers. This enables enrolment of the target speakers from the videos (i.e. the audio-visual corpus) during training of the model (i.e. network) 200.
To ensure the model 200 is agnostic to ethnicity, age and other biases of input data sets, the speaker and face embeddings of the target person may be extracted from the enrolment segments (i.e. audio-visual feed) of a development set (e.g. NIST SRE), and the model is then retrained based on the combination of the enrolment segments and a second database (e.g. VoxCeleb2). Thus, one data set may be used to enrol specific individual speakers - e.g. employees of a company for whom voice and/or face recognition is to be used - and a second data set can then be used to refine or generalise the model 200.
During testing processes, labels were omitted from the input.
A summary of the corpora used for training and testing is shown in Table 1.

Table 1: summary of the VoxCeleb2 and 2019 NIST SRE audio-visual (AV) corpora.
The neural network model 200 trained using the above method 100 results in a model that can be used to determine a cross-modal similarity between a voice and a face of a speaker. The output of the model 200 can be used either independently, to verify that a face and voice are the same individual, or be fused with a speaker recognition system or face recognition system for enhanced speaker and face recognition, respectively. The output of the model 200 can also be fused with the output of a baseline audio-visual recognition system as shown in Figure 2. The fused output can then be used in a final decision for verifying the identity of the speaker.
To that end, Figure 1 shows a method 110 for speaker verification. In some embodiments, as reflected by broken line 112, the method 110 leverages a neural network trained according to the method 100. Speaker verification method 110 broadly comprises:
Step 114: receiving a voice waveform and at least one face image of a speaker;
Step 116: extracting one or more face embeddings from the face image or images;
Step 118: extracting one or more voice embeddings from the voice waveform;
Step 120: determining a cross-modal similarity between the voice embedding or embeddings and the face embedding or embeddings; and
Step 122: verifying the speaker.
At step 114 a voice waveform and one or more face images are received through a receiver forming part of, for example, transceiver 412 of Figure 4. The voice waveform and face image or images are those of a speaker.
At steps 116 and 118, face embeddings and speaker embeddings are respectively extracted, in the same manner as those embeddings are extracted at steps 106 and 104 respectively.
Step 120 involves determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings. As reflected by the broken line 112, the cross-modal similarity is determined from a cross-modal similarity score that is the output from the trained model 200 developed using the method 100.
Step 122 involves verifying the speaker if the similarity score exceeds a predetermined threshold. Verification can be a Yes/No verification in which the speaker is verified - i.e. the voice and face are determined to match the same person - if the cross-modal similarity score is greater than the predetermined threshold, and the speaker is not verified - i.e. the voice and face are unlikely to be of the same person - if the cross-modal similarity score is lower than or equal to the predetermined threshold. In other embodiments, verification may employ multiple thresholds. For example, a cross-modal similarity score below a first threshold is an indication of high confidence that the face and voice are not a match, a score between the first threshold and a second threshold is an indication that further information is required to confidently determine whether or not the face and voice are a match, and a score above the second threshold indicates high confidence that the face and voice are a match.
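By way of illustration only, such threshold-based decisions might be sketched as follows; the function names and all threshold values are hypothetical choices, since the thresholds are left here as predetermined parameters.

```python
def verify_speaker(score, threshold=0.5):
    """Single-threshold Yes/No verification; the threshold value is illustrative."""
    return score > threshold

def verify_speaker_two_thresholds(score, low=0.3, high=0.7):
    """Two-threshold variant: below `low` is a confident reject, above `high`
    a confident accept, and anything in between requires further information.
    Threshold values are illustrative assumptions."""
    if score < low:
        return "no-match"
    if score > high:
        return "match"
    return "needs-more-information"
```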
Figure 3 illustrates an architecture in which the cross-modal similarity is used to enhance voice and face verification. The architecture 300 provides an audio-visual speaker recognition framework in which the left panel 302 represents the process for cross-modal speaker verification, and baseline recognition is set out in the right panel 304. Voice segments of the target person are input (306) to the system 300 and are used in panel 302 to determine face embeddings corresponding to the speaker embeddings. The speaker embeddings are extracted using the x-vector system.
The speaker verification network 308 determines a speaker verification score 312 from embeddings in the voice segments 306 and the whole audio waveform 310, for speaker verification - i.e. a confidence score that the speaker is enrolled with the system 300.
InsightFace or another system extracts the face embeddings for given faces of the target speakers from the enrolment videos and for all detected faces from the test videos 314, in both panels 302 and 304. In the left panel 302, the cross-modal network provides an association score (i.e. a cross-modal verification score) between the target speaker's voice in the enrolment video and the faces detected in the test video. Matching pairs of voice and face give rise to a high association score, while mismatches - for example, discrepancies in age, gender, weight or ethnicity - do not. In panel 304, the audio and visual systems 316 and 318 respectively run in parallel to verify the target person's identity by computing a match between the enrolment and the test embeddings. In the present embodiment, the speaker recognition system 316 considers probabilistic linear discriminant analysis (PLDA) based likelihood scores, whereas the face recognition system 318 computes cosine similarity scores. The state-of-the-art baseline for present purposes is then a score-level fusion 320 between the two parallel systems 316, 318.
Therefore, while the output of the model 200, namely the cross-modal similarity score, may be used on its own to verify a speaker, the system 300 uses the cross-modal similarity score together with the output of one or both of the speaker verification model 316 and the face verification model 318. In some embodiments, such as when the model 200 is used for speaker recognition, a probability that the voice waveform 306 corresponds to the speaker is determined based only on the speaker embeddings, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed probability - i.e. the probability that the voice waveform 306 corresponds to the speaker. In some embodiments, such as when the model 200 is used for face recognition, a face similarity score is determined, specifying a similarity between the face in the target faces 322 and the speaker, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed face similarity - i.e. that the face in the target faces 322 corresponds to the speaker. As mentioned above, the face similarity score may be a cosine similarity.
Score-level fusion 324 may be performed using various methods. In one embodiment, score-level fusion is performed using logistic regression. Thus, in each of the audio-visual SRE applications, an enrolment video (comprising audio and visual channels) provides the target individual's biometric information (voice and face), and the task is for the model to automatically determine whether the target person is present in a given test video. That determination is based on a cross-modal confidence score that specifies whether the voice is likely to match the face, and on one or both of face recognition and speaker recognition.
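As a hedged illustration of logistic-regression score fusion, a development set of per-trial scores from the parallel systems can be used to fit the fusion weights with scikit-learn; the score layout, variable names and numeric values below are placeholders rather than experimental data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the scores from the parallel systems for one development trial, e.g.
# [cross-modal VFNet score, speaker (PLDA) score, face (cosine) score].
dev_scores = np.array([[0.82,  1.4, 0.75],
                       [0.10, -2.1, 0.20],
                       [0.71,  0.8, 0.66],
                       [0.05, -1.5, 0.31]])
dev_labels = np.array([1, 0, 1, 0])  # 1 = target present, 0 = target absent

fusion = LogisticRegression()
fusion.fit(dev_scores, dev_labels)   # learn the fusion weights on the development set

test_scores = np.array([[0.64, 0.9, 0.55]])
fused_probability = fusion.predict_proba(test_scores)[:, 1]  # fused decision score
```

Calibration-and-fusion toolkits (such as the Bosaris toolkit mentioned below) can serve the same role; logistic regression is shown here only because it is the method named in this embodiment.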
Experiments
Experiments were conducted based on videos in the VoxCeleb2 corpus to derive a set with voices and faces for cross-modal verification. For each video, the entire audio channel is extracted to represent the voice of the speaker. For the face recognition application, face detection was performed on each video (i.e. a series of face images) and the most prominent faces representing an individual were selected. For cross-modal discriminative training, the positive trials are faces and voices that come from the same identity, whereas the negative trials are obtained by shuffling so that faces and voices belong to different persons. The model 200 learned the general association between voice and face from VoxCeleb2.
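A minimal sketch of how such positive and negative trials might be assembled is given below; the tuple layout and function name are assumptions made for illustration, not the corpus's actual file structure.

```python
import random

def build_trials(samples):
    """Build cross-modal trials from (speaker_id, voice, face) tuples.

    Positive trials pair a voice and face of the same identity (label 1);
    negative trials pair each voice with a face shuffled in from a
    different identity (label 0).
    """
    positives = [(voice, face, 1) for _, voice, face in samples]
    negatives = []
    for spk, voice, _ in samples:
        candidates = [s for s in samples if s[0] != spk]
        if not candidates:
            continue  # cannot form a negative trial with a single identity
        negatives.append((voice, random.choice(candidates)[2], 0))
    return positives + negatives
```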
As mentioned above, the embedding extraction for voice and face produces a 512-dimensional output. Although the dimensions of the speaker (i.e. voice) and face embeddings are the same, the back-end scoring for the respective individual systems is different. Linear discriminant analysis (LDA) was used on the speaker embeddings for channel/session compensation and to reduce the dimension of the x-vectors - presently to 150. Finally, PLDA is used as a classifier to obtain the final speaker recognition score. For face recognition, cosine similarities between face embeddings from the enrolment video and those from the detected faces in the test video are computed. The average of the top 20% of scores over the face embeddings in the test video is taken to derive the final face recognition score.

Turning now to the back-end of audio-visual SRE with VFNet: the VFNet back-end computes the likelihood score between the speaker embedding of the target speaker and all the face embeddings of detected faces in the test video - this score is determined using Equation (1). Finally, the average of all scores, or of a predetermined set or proportion of scores - e.g. the top 20% - is taken. This average is combined with the scores generated from one or both of the audio and visual systems. That combination may be achieved using a variety of methods, including logistic regression. Notably, cross-modal verification can also be performed by considering all the given faces in the enrolment video and the multiple speaker voices detected in the test audio. This can be achieved using a speaker diarization module that detects the voices belonging to different speakers in the test audio. For example, the model may determine that a mouth of a particular face is moving in a manner corresponding to an audio channel feed.

Various processes and libraries can be used to fuse scores. For example, the Bosaris toolkit can be used to calibrate and fuse the scores of the different systems 302, 316, 318. For experimental purposes, the performance of the systems is reported in terms of equal error rate (EER), minimum detection cost function (minDCF) and actual detection cost function (actDCF), following the protocol of the 2019 NIST SRE.
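As an illustration of the top-20% averaging described above, one possible sketch is the following; the function name and the default fraction argument are assumptions, with 20% following the proportion quoted in the text.

```python
import numpy as np

def top_fraction_average(scores, fraction=0.2):
    """Average the top `fraction` of per-face scores from a test video.

    `scores` holds one cross-modal (or cosine) score per detected face.
    """
    scores = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending order
    k = max(1, int(np.ceil(fraction * len(scores))))         # at least one score
    return float(scores[:k].mean())
```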
Notably, the model 200 trained according to the method 100 performed effectively for cross-modal verification. For some applications, such as where there are multiple speakers present in the test videos, each of which has to be matched with the target speaker in the enrolment video, the method 100 may comprise formulating and adding one or more shared-weight sub-branches to the neural network model to meet such selection requirements.
The results of cross-modal audio-visual speaker recognition, fused with the single-modality speaker and face recognition systems, are shown in Table 4. In particular, Table 4 shows the changes in results when cross-modal audio-visual speaker recognition is used, as per the model trained according to method 100, compared with speaker recognition, face recognition and audio-visual recognition systems that do not use cross-modal audio-visual speaker recognition.
Table 4: audio-visual speaker recognition results with and without cross-modal verification (VFNet).
When examining the effect on single-modality systems as reflected in Table 4 above, it is evident that the contribution of VFNet (i.e. the trained model) is greater for the speaker recognition system on the evaluation set. Further, the trained model is also able to enhance the performance of the audio-visual baseline system. This suggests that associating audio and visual cues through cross-modal verification is useful for audio-visual SRE. The relative improvements in each case are 16.54%, 2.00% and 8.83% in terms of EER, minDCF and actDCF, respectively.
Thus, Figure 1 illustrates a method 100 for training, and a method 110 for using, a novel framework for audio-visual speaker recognition with a cross-modal discrimination network. The VFNet-based cross-modal discrimination network finds the relations between a given pair of human voice and face to generate a confidence score based on the likelihood that the voice and face belong to the same person. While the trained model can perform comparably to existing state-of-the-art cross-modal verification systems, the proposed framework of audio-visual speaker recognition with cross-modal verification outperforms the baseline audio-visual system. This highlights the importance of cross-modal verification - in other words, of the relation between audio and visual cues - for audio-visual speaker recognition.
Figure 4 is a block diagram showing an exemplary computer device 400 in which embodiments of the invention, particularly the methods 100 and 110 of Figure 1, may be practiced. The computer device 400 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhoneTM manufactured by AppleTM, Inc. or a device manufactured by LGTM, HTCTM or SamsungTM, for example, or another device.
As shown, the mobile computer device 400 includes the following components in electronic communication via a bus 406:
(a) a display 402;
(b) non-volatile (non-transitory) memory 404;
(c) random access memory ("RAM") 408;
(d) N processing components 410;
(e) a transceiver component 412 that includes N transceivers; and
(f) user controls 414.
Although the components depicted in Figure 4 represent physical components, Figure 4 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 4.
The display 402 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
In general, the non-volatile data storage 404 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 404, or by instructions stored in memory 404 - e.g. memory 404 may be a computer-readable storage medium for storing instructions that, when executed by the processor(s) 410, cause the processor(s) 410 to perform the methods 100 and/or 110 described with reference to Figure 1.
In some embodiments, for example, the non-volatile memory 404 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components well known to those of ordinary skill in the art, which are not depicted or described for simplicity. In many implementations, the non-volatile memory 404 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 404, the executable code in the non-volatile memory 404 is typically loaded into RAM 408 and executed by one or more of the N processing components 410.
The N processing components 410 in connection with RAM 408 generally operate to execute the instructions stored in non-volatile memory 404. As one of ordinary skill in the art will appreciate, the N processing components 410 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
The transceiver component 412 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
The system 400 of Figure 4 may be connected to any appliance 418, such as an external server, database, video feed or other source from which inputs may be obtained.
It should be recognized that Figure 4 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 404. Non-transitory computer-readable medium 404 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims
1. A method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform comprising a voice of the respective speaker and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
2. The method of claim 1, wherein training the neural network further comprises negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
3. The method of claim 1 or 2, wherein training the neural network involves transforming the embeddings into a transformed feature space.
4. The method of claim 3, wherein training the neural network involves applying cosine similarity scoring to the transformed embeddings.
5. The method of claim 4 when dependent on claim 2, wherein the neural network outputs a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker.
6. The method of claim 5, wherein p1 is calculated according to:

$$p_1 = \frac{e^{S(T_v(e_v),\, T_f(e_f))}}{e^{S(T_v(e_v),\, T_f(e_f))} + e^{1 - S(T_v(e_v),\, T_f(e_f))}}$$

where e_v and e_f are the voice embeddings and face embeddings respectively, T_v(e_v) and T_f(e_f) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(T_v(e_v), T_f(e_f)) for the positive voice-face pairs and 1 - S(T_v(e_v), T_f(e_f)) for the negative voice-face pairs.
7. The method of claim 6, wherein the neural network outputs a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to:

$$p_2 = \frac{e^{1 - S(T_v(e_v),\, T_f(e_f))}}{e^{S(T_v(e_v),\, T_f(e_f))} + e^{1 - S(T_v(e_v),\, T_f(e_f))}}$$
8. A method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
9. The method of claim 8, wherein determining a cross-modal similarity comprises applying a neural network trained according to the method of any one of claims 1 to 7.
10. The method of claim 8 or 9, further comprising determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability.
11. The method of claim 10, wherein determining the probability comprises calculating a probabilistic linear discriminant (PLDA) based likelihood score.
12. The method of any one of claims 8 to 11, further comprising determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
13. The method of claim 12, wherein the face similarity score comprises a cosine similarity.
14. A system for speaker verification, comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
15. The system of claim 14, wherein the at least one processor is configured to determine a cross-modal similarity by applying a neural network trained according to the method of any one of claims 1 to 7.
16. The system of claim 14 or 15, wherein the at least one processor is further configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
17. The system of any one of claims 14 to 16, wherein the at least one processor is further configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
18. A computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification in accordance with any one of claims 1 to 7, or to perform the method for speaker verification in accordance with any one of claims 8 to 13.
PCT/SG2021/050358 2020-06-19 2021-06-21 Cross-modal speaker verification WO2021257000A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202005845Y 2020-06-19
SG10202005845Y 2020-06-19

Publications (1)

Publication Number Publication Date
WO2021257000A1 true WO2021257000A1 (en) 2021-12-23

Family

ID=79268734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2021/050358 WO2021257000A1 (en) 2020-06-19 2021-06-21 Cross-modal speaker verification

Country Status (1)

Country Link
WO (1) WO2021257000A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190313014A1 (en) * 2015-06-25 2019-10-10 Amazon Technologies, Inc. User identification based on voice and face
CN106790054A (en) * 2016-12-20 2017-05-31 四川长虹电器股份有限公司 Interactive authentication system and method based on recognition of face and Application on Voiceprint Recognition
US20190213399A1 (en) * 2018-01-08 2019-07-11 Samsung Electronics Co., Ltd. Apparatuses and methods for recognizing object and facial expression robust against change in facial expression, and apparatuses and methods for training
CN108446674A (en) * 2018-04-28 2018-08-24 平安科技(深圳)有限公司 Electronic device, personal identification method and storage medium based on facial image and voiceprint

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MEUTZNER HENDRIK; MA NING; NICKEL ROBERT; SCHYMURA CHRISTOPHER; KOLOSSA DOROTHEA: "Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 5320 - 5324, XP033259426, DOI: 10.1109/ICASSP.2017.7953172 *
NAGRANI ARSHA; CHUNG JOON SON; ALBANIE SAMUEL; ZISSERMAN ANDREW: "Disentangled Speech Embeddings Using Cross-Modal Self-Supervision", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 6829 - 6833, XP033793750, DOI: 10.1109/ICASSP40776.2020.9054057 *
SHON SUWON; OH TAE-HYUN; GLASS JAMES: "Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 3995 - 3999, XP033566026, DOI: 10.1109/ICASSP.2019.8683477 *
SOO-WHAN CHUNG; HONG GOO KANG; JOON SON CHUNG: "Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 April 2020 (2020-04-29), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081655046 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230215440A1 (en) * 2022-01-05 2023-07-06 CLIPr Co. System and method for speaker verification
WO2023132828A1 (en) * 2022-01-05 2023-07-13 CLIPr Co. System and method for speaker verification

Similar Documents

Publication Publication Date Title
WO2020073694A1 (en) Voiceprint identification method, model training method and server
Villalba et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
US10255922B1 (en) Speaker identification using a text-independent model and a text-dependent model
US10380332B2 (en) Voiceprint login method and apparatus based on artificial intelligence
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
US10068588B2 (en) Real-time emotion recognition from audio signals
Lozano-Diez et al. Analysis and Optimization of Bottleneck Features for Speaker Recognition.
US8416998B2 (en) Information processing device, information processing method, and program
WO2020155584A1 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2017162053A1 (en) Identity authentication method and device
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
JP6464650B2 (en) Audio processing apparatus, audio processing method, and program
Khoury et al. Bi-modal biometric authentication on mobile phones in challenging conditions
US20170294192A1 (en) Classifying Signals Using Mutual Information
Khoury et al. The 2013 speaker recognition evaluation in mobile environment
CN111199741A (en) Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
Ringeval et al. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion
EP2879130A1 (en) Methods and systems for splitting a digital signal
Ramos-Castro et al. Speaker verification using speaker-and test-dependent fast score normalization
TW202213326A (en) Generalized negative log-likelihood loss for speaker verification
US11437044B2 (en) Information processing apparatus, control method, and program
WO2021257000A1 (en) Cross-modal speaker verification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21825472

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21825472

Country of ref document: EP

Kind code of ref document: A1