WO2021257000A1 - Cross-modal speaker verification - Google Patents

Cross-modal speaker verification Download PDF

Info

Publication number
WO2021257000A1
WO2021257000A1 PCT/SG2021/050358
Authority
WO
WIPO (PCT)
Prior art keywords
face
speaker
embeddings
voice
neural network
Prior art date
Application number
PCT/SG2021/050358
Other languages
French (fr)
Inventor
Ruijie TAO
Rohan Kumar DAS
Haizhou Li
Original Assignee
National University Of Singapore
Priority date
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2021257000A1 publication Critical patent/WO2021257000A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method
    • G06V30/248Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G06V30/2552Combination of methods, e.g. classifiers, working on different input data, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building

Definitions

  • the present invention relates, in general terms, to methods for verifying (i.e. confirming the identity of) speakers using cross-modal authentication. More particularly, the present invention provides methods and systems that use both voice embeddings and face embeddings to verify speakers.
  • VFNet voice-face discriminative network
  • the present invention provides a method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
  • the cross-modal similarity may be expressed as a probability.
  • Training the neural network may further comprise negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
  • Training the neural network may involve transforming the embeddings into a transformed feature space. Training the neural network may involve applying cosine similarity scoring to the transformed embeddings.
  • the neural network may output a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker. p1 may be calculated according to: p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))], where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(Tv(ev), Tf(ef)) for the positive voice-face pairs and 1 - S(Tv(ev), Tf(ef)) for the negative voice-face pairs.
  • the neural network may output a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to: p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))].
  • Also disclosed herein is a method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
  • Determining a cross-modal similarity may comprise applying a neural network trained according to the method described above.
  • the method may further comprise determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability. Determining the probability may comprise calculating a probabilistic linear discriminant (PLDA) based likelihood score.
  • the method may further comprise determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
  • the face similarity score may comprise a cosine similarity.
  • a system for speaker verification comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
  • the at least one processor may be configured to determine a cross-modal similarity by applying a neural network trained according to the method set out above.
  • the at least one processor may further be configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
  • the at least one processor may further be configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
  • Also disclosed herein is a computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification as described above, or to perform the method for speaker verification as described above.
  • the invention enables cross-modal discriminative network assistive speaker recognition.
  • for a speaker recognition system, if an enrolled speaker's face is also available, the test speech can be used to find a general relation with the speaker's face and then assist the speaker recognition system.
  • the invention similarly enables cross-modal discriminative network assistive face recognition.
  • the test face can be used to find a general relation with the speaker's voice and then to assist the face recognition system.
  • the invention enables cross-modal discriminative network assistive audio-visual speaker recognition.
  • test face or voice can be used to find a general relation with the speaker's voice or face with cross-modal discriminative network and then to assist the audio-visual speaker recognition system.
  • Figure 1 shows a method for training a neural network for speaker verification, and a method for verifying a speaker
  • Figure 2 illustrates an architecture of the proposed cross-modal discrimination network, VFNet, that relates the voice and face of a person
  • Figure 3 is a block diagram of the proposed audio-visual (AV) speaker recognition framework with VFNet, where VFNet provides voice-face cross-modal verification information that strengthens the baseline audio-visual speaker recognition decision; and
  • Figure 4 is a schematic of a system on which the methods of Figure 1 can be implemented.
  • VFNet provides additional speaker discriminative information, enabling significant improvements for audio-visual speaker recognition over the standard fusion of separate audio and visual systems.
  • the cross-modal discriminative network can also be useful for improving both speaker and face recognition individual system performance.
  • the use of cross-modal information between voice and face can be used in various applications including:
  • FIG. 1 illustrates a method 100 for training the neural network for speaker verification.
  • the method 100 broadly comprises:
  • Step 102 receiving voice and face inputs
  • Step 104 extracting speaker (i.e. voice) embeddings
  • Step 106 extracting face embeddings
  • Step 108 training the neural network using the embeddings.
  • the above method yields VFNet, the trained neural network model for cross-modal speaker verification.
  • the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
  • the trained neural network i.e. the product of step 108 will be able to verify both of a voice and face of a further speaker based on the associations and the face embeddings and voice embeddings, respectively.
  • Figure 2 shows the architecture 200 of the neural network, showing consideration of two inputs: a voice waveform 202 and a human face 204.
  • the output 206 of the network 200 is a confidence score that describes at least one of the level of confidence that the voice and the face come from the same person and, where a threshold confidence level is provided, a positive or negative response indicating that the network 200 considers the voice and face to be of the same person or different people, respectively.
  • step 102 involves receiving a waveform 202 and a face image 204 for each of a plurality of speakers.
  • each waveform will be paired with a face image and vice versa, to form a voice-face pair or face-voice pair.
  • the method 100 will generally be applied to a plurality, often many thousands, of voice-face pairs.
  • Each face image comprises a face of a speaker and each waveform comprises a voice of a speaker.
  • the voice in the waveform will be for the same speaker as the face in the image.
  • negative training is being employed, the voice in the waveform will be for a different speaker to that whose face is in the image.
  • the waveform and face may be extracted from a database, received through a receiver portion of a transceiver, obtained through direct capture (e.g. a video feed capturing one or more face images of a speaker and a voice input) or obtained by any other suitable method.
  • Step 104 involves extracting one or more speaker embeddings from each voice waveform.
  • the speaker embeddings are low-dimensional representations that describe features of the voice waveform such that voice waveforms with similar, or close, embeddings indicate that the speaker's voice in each case is semantically similar.
  • the speaker embeddings are extracted using a speaker embedding extractor 208.
  • the inputs from which the speaker embeddings are extracted may be derived from any suitable corpus such as the VoxCeleb1-2 corpora.
  • the voice waveforms or speech waveforms may be taken from the audio channel of a video feed, and the face images may be selected from the image channel of the video feed - thus, both the voice waveform or waveforms, and face image or images, may be extracted from the same video feed.
  • an x-vector based system is used for speaker embeddings extraction.
  • Speech utterances, each of which comprises a voice of a speaker and is herein referred to as a voice waveform, are processed with energy based voice activity detection. This process removes silence regions. Therefore, while in some instances only part of a voice waveform may be used for extracting the speaker embeddings, in other instances, such as where the voice waveform is an audio channel of a video, the entire waveform may be used, with pre-processing removing silent and/or noisy regions, thereby reducing the size of the input from which the model trains.
  • Energy based voice activity detection may alternatively, or in addition, be used for extracting one or both of frequency and spectral features.
  • mel frequency cepstral coefficient (MFCC) features may be extracted.
  • the MFCC features may be any desired dimension, such as 30-dimensional MFCC features.
  • windowing of the input 202 and/or 204 may be employed. Windowing enables features to be extracted from portions of a waveform or face image, reduces computation load, and enables features in one portion of a waveform or face image to be associated with features in a different portion of the same waveform or face image, those associations being lower-level features when compared with the features that they associate. Normalisation may also be applied across each window. In an example, short-time cepstral mean normalization is applied over a 3-second sliding window.
  • Step 106 involves extracting one or more face embeddings from each face image.
  • Various models can be used for extracting face embeddings from images.
  • the face embedding extractor 210 may comprise the ResNet-50 RetinaFace model trained on a suitable database, such as the WIDER FACE database, to detect faces.
  • the face embeddings may then be aligned.
  • One method for aligning the face embeddings is to use a multi-task cascaded convolutional network (MTCNN). Faces are detected in the images, recognised and aligned using suitable functions such as those provided in the InsightFace library, to obtain highly discriminative features for face recognition. To facilitate this process, additive angular margin loss may be used, e.g. for feature extraction.
  • the face embedding extractor 210 may also include the ResNet-100 extractor model trained on the VGGFace2 and cleaned MS1MV2 database to extract the face embeddings.
  • the dimension of both speaker and face embeddings is 512 (212, 214).
  • the 512-dimensional embeddings 212, 214 are used in step 108 for training the neural network 200.
  • positive training will be used either alone, or with negative training, to train the neural network 200.
  • positive training is where, for each voice-face pair, the voice and face are of the same speaker.
  • negative training is where, for each voice-face pair, the voice and face are of different speakers.
  • the neural network 200 can learn one or more associations between the voice waveform or waveforms and the respective face in the face image or face images.
  • the speaker embeddings and face embeddings represent information from respectively different modalities.
  • the inputs (i.e. the 512-dimensional embeddings) are fed to a lower-dimensional layer, or successively lower-dimensional layers
  • the 512-dimensional input is fed to a 256-dimensional layer 216, 218 followed by a 128-dimensional layer 220, 222.
  • the 256-dimensional layer 216, 218 is a fully connected layer (FC1) with rectified linear unit (ReLU) activation.
  • the 128-dimensional layer 220, 222 is a fully connected layer (FC2) without the ReLU.
  • These layers 216, 218, 220, 222 are introduced to guide the speaker and face embeddings towards learning cross-modal identity information from each other. Further, they help to project the embeddings from both modalities into a new domain, where their relation can be established.
  • step 108 may further involve transforming the embeddings into a transformed feature space.
  • Tv(ev) and Tf(ef) are derived from VFNet.
  • the cosine similarity (224) is then determined to produce the cosine similarity score S(Tv(ev), Tf(ef)) between the embeddings or the transformed embeddings. 1 - S(Tv(ev), Tf(ef)) can then be used to represent a lack of cosine similarity. This is particularly applied, for example, during negative training on negative voice-face pairs.
  • the final output p1 is the score that describes the probability that the voice and the face belong to the same person
  • p2 is the score depicting the probability that the voice and the face do not belong to the same person.
  • p1 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to the same speaker
  • p2 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to different speakers.
  • the neural network 200 can output p1 and p2 using a softmax function, where ev and ef are the voice embeddings and face embeddings respectively, and Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively.
  • the loss is then propagated back through the layers 216, 218, 220, 222 to adjust weights applied by those layers to particular features found in the speaker embeddings and face embeddings, to cause the neural network 200 to learn.
  • the neural network 200 will be able to determine a cross-modal similarity for a particular voice-face input pair. In other words, the neural network 200 will be able to specify a likelihood or probability that the face visible in an input (e.g. video feed) corresponds to the voice audible in that input.
  • the speaker and face embeddings for the model 200 to perform cross- modal verification follow the same pipeline discussed above.
  • the NIST SRE audio-visual corpus was used for the speaker recognition application.
  • Manually marked diarization labels of voice and keyframe indices, along with bounding boxes that mark a face of the individual, are provided - i.e. the dataset may include, for some of the frames, a bounding box identifying an enrolled speaker or bounding boxes identifying enrolled speakers. This enables enrolment of the target speakers from the videos (i.e. the audio-visual corpus) during training of the model (i.e. network) 200.
  • the speaker and face embeddings of the target person may be extracted from the enrolment segments (i.e. audio-visual feed) of a development set (e.g. NIST SRE), and the model is then retrained based on the combination of the enrolment segments and a second database (e.g. VoxCeleb2).
  • one data set may be used to enrol specific individual speakers - e.g. employees of a company for whom voice and/or face recognition is to be used - and a second data set can then be used to refine or generalise the model 200.
  • Table 1: summary of VoxCeleb2 and 2019 NIST SRE audio-visual (AV) corpora
  • the neural network model 200 trained using the above method 100 results in a model that can be used to determine a cross-modal similarity between a voice and a face of a speaker.
  • the output of the model 200 can be used either independently, to verify that a face and voice are the same individual, or be fused with a speaker recognition system or face recognition system for enhanced speaker and face recognition, respectively.
  • the output of the model 200 can also be fused with the output of a baseline audio-visual recognition system as shown in Figure 3. The fused output can then be used in a final decision for verifying the identity of the speaker.
  • Figure 1 shows a method 110 for speaker verification.
  • the method 110 leverages a neural network trained according to the method 100.
  • Speaker verification method 110 broadly comprises:
  • Step 114 receiving a voice waveform and at least one face image of a speaker
  • Step 116 extracting one or more face embeddings from the face image or images
  • Step 118 extracting one or more voice embeddings from the voice waveform;
  • Step 120 determining a cross-modal similarity between the voice embedding or embeddings and the face embedding or embeddings; and Step 122: verifying the speaker.
  • a voice waveform and one or more face images are received through a receiver forming part of, for example, transceiver 412 of Figure 4.
  • the voice waveform and face image or images are those of a speaker.
  • speaker embeddings and face embeddings are extracted in the same manner as those embeddings are extracted at steps 104 and 106, respectively.
  • Step 120 involves determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings.
  • the cross- modal similarity is determined from a cross-modal similarity score that is the output from the trained model 200 developed using the method 100.
  • Step 122 involves verifying the speaker if the similarity score exceeds a predetermined threshold.
  • Verification can be a Yes/No verification in which the speaker is verified - i.e. the voice and face are deemed to match the same person - if the cross-modal similarity score is greater than the predetermined threshold, and the speaker is not verified - i.e. the voice and face are unlikely to be the same person - if the cross-modal similarity score is lower than or equal to the predetermined threshold.
  • verification may employ multiple thresholds.
  • a cross-modal similarity score below a first threshold is an indication of high confidence that the face and voice are not a match
  • between the first threshold and a second threshold is an indication that further information is required to confidently determine whether or not the face and voice are a match
  • above the second threshold indicates high confidence that the face and voice are a match.
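  • as a small illustration of such a two-threshold decision rule, the sketch below maps a cross-modal similarity score to one of three outcomes; the threshold values and return labels are invented for the example:

```python
def decide(score, reject_threshold=0.3, accept_threshold=0.7):
    """Map a cross-modal similarity score to a three-way decision (illustrative thresholds)."""
    if score < reject_threshold:
        return "no-match"             # high confidence the face and voice are not a match
    if score < accept_threshold:
        return "needs-more-evidence"  # defer to further information, e.g. speaker or face recognition
    return "match"                    # high confidence the face and voice are a match
```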
  • Figure 3 illustrates an architecture in which the cross-modal similarity is used to enhance voice and face verification.
  • the architecture 300 provides an audio-visual speaker recognition framework in which the left panel 302 represents the process for cross-modal speaker verification, and baseline recognition is set out in the right panel 304.
  • Voice segments of the target person are inputted (306) to the system 300 and are used in panel 302 to determine face embeddings corresponding to the speaker embeddings.
  • Embeddings are extracted using the x-vector system.
  • the speaker verification network 308 determines a speaker verification score 312 from embeddings in the voice segments 306 and the whole audio waveform 310, for speaker verification - i.e. a confidence score that the speaker is enrolled with the system 300.
  • InsightFace or another system extracts the face embeddings for the given faces of the target speakers from the enrolment videos and for all detected faces from the test videos 314, in both panels 302 and 304.
  • the cross-modal network provides an association score (i.e. cross-modal verification score) between the target speaker voice in the enrolment video and the faces detected from the test video. Matching pairs between voice and face will give rise to a high association score, while mismatches, such as discrepancies in age, gender, weight and ethnicity, will not.
  • the audio and visual systems 316, 318, respectively, run in parallel to verify the target person's identity by computing a match between the enrolment and the test embeddings.
  • the speaker recognition system 316 considers probabilistic linear discriminant (PLDA) based likelihood scores.
  • the face recognition system 318 computes cosine similarity scores.
  • the state-of-the-art baseline for present purposes is then a score level fusion 320 between the two parallel systems 316, 318.
  • the system 300 uses the cross-modal similarity score with the output of one or both of speaker verification model 316 and face verification model 318.
  • a probability that the voice waveform 306 corresponds to the speaker is determined based only on the speaker embeddings, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed probability - i.e. that the voice waveform 306 corresponds to the speaker.
  • a face similarity score is determined specifying a similarity between the face in the target faces 322 and the speaker, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed face similarity - i.e. that the face in the target faces corresponds to the speaker.
  • the face similarity score may be a cosine similarity.
  • Score level fusion 324 may be performed using various methods. In one embodiment, score level fusion is performed using logistic regression.
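  • for instance, score-level fusion 324 by logistic regression could be sketched as below; the three scores per trial (speaker recognition, face recognition and cross-modal) and the toy calibration data are placeholders rather than actual development trials:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds one trial's scores: [speaker recognition (PLDA), face recognition (cosine), VFNet cross-modal].
dev_scores = np.array([[ 2.1, 0.83, 0.91],
                       [-1.4, 0.22, 0.18],
                       [ 0.9, 0.65, 0.74],
                       [-2.2, 0.15, 0.09]])
dev_labels = np.array([1, 0, 1, 0])   # 1 = target (same person) trial, 0 = non-target trial

fusion = LogisticRegression()
fusion.fit(dev_scores, dev_labels)    # learns a weight per system plus a bias on development trials

test_scores = np.array([[1.7, 0.78, 0.88]])
fused_score = fusion.predict_proba(test_scores)[:, 1]  # fused score used for the final accept/reject decision
print(fused_score)
```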
  • an enrolment video (comprising audio and visual channels) provides the target individual's biometric information (voice and face) and the assignment asks the model to automatically determine whether the target person is present in a given test video. That determination is based on a cross-modal confidence score that specifies whether the voice is likely to match the face, and one or both of face recognition and speaker recognition.
  • the embedding extraction for voice and face produces a 512-dimensional output.
  • while the dimensions of the speaker (i.e. voice) and face embeddings are the same, the back-end scoring for the respective individual systems is different.
  • LDA Linear discriminant analysis
  • PLDA is used as a classifier to get the final speaker recognition score.
  • cosine similarities between face embeddings from the enrolment video and those from the detected faces in the test video are computed. The average of the top 20% of these scores over the face embeddings in the test video is taken to derive the final face recognition score.
  • the VFNet back-end computes the likelihood score between the speaker embedding of the target speaker and all the face embeddings of detected faces in the test video - this score is determined using Equation (1). Finally, the average of all scores, or of a predetermined set or proportion of scores - e.g. the top 20% - is taken. This average is combined with the scores generated from one or both of the audio and visual systems. That combination may be achieved using a variety of methods, including logistic regression. Notably, cross-modal verification can also be done by considering all the given faces in the enrolment video and the detected multiple speaker voices in the test audio.
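  • the pooling of per-face scores described above might be sketched as follows, assuming arrays of already computed scores (cosine scores for the face system, Equation (1) scores for VFNet); the helper name and the handling of the top 20% are illustrative assumptions:

```python
import numpy as np

def top_fraction_mean(scores, fraction=0.2):
    """Average the best-scoring fraction of per-face scores from a test video (top 20% by default)."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # best scores first
    k = max(1, int(np.ceil(fraction * len(ordered))))         # keep at least one score
    return ordered[:k].mean()

# face_scores  : cosine similarities between enrolment face embeddings and each detected test face
# vfnet_scores : cross-modal scores between the target speaker embedding and each detected test face
face_scores = [0.81, 0.77, 0.43, 0.12, 0.05]
vfnet_scores = [0.88, 0.72, 0.51, 0.30, 0.22]
final_face_score = top_fraction_mean(face_scores)
final_vfnet_score = top_fraction_mean(vfnet_scores)
```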
  • a speaker diarization module that detects the voice belonging to different speakers in the test audio.
  • the model may determine that a mouth of a particular face is moving in a manner corresponding to an audio channel feed.
  • Various processes and libraries can be used to fuse scores.
  • the Bosaris toolkit can be used to calibrate and fuse the scores of the different systems 302, 316, 318.
  • the performance of systems is reported in terms of equal error rate (EER), minimum detection cost function (minDCF) and actual detection cost function (actDCF) following the protocol of 2019 NIST SRE.
  • the model 200 trained according to the method 100 performed effectively for cross-modal verification.
  • the method 100 may comprise formulating and adding one or more shared-weight sub-branches to the neural network model for selection requirements.
  • Table 4 shows the changes in results when cross-modal audio visual speaker recognition is used as per the model trained according to method 100, when compared with speaker recognition, face recognition and audio-visual recognition systems that do not use cross-modal audio visual speaker recognition.
  • VFNet i.e. the trained model
  • the trained model is also able to enhance the audio-visual baseline system performance. This suggests a usefulness for associating audio and visual cues by cross-modal verification for audio-visual SRE.
  • the relative improvements in each case are 16.54%, 2.00% and 8.83% in terms of EER, minDCF and actDCF, respectively.
  • Figure 1 illustrates method 100 for training, and method 110 for using, a novel framework for audio-visual speaker recognition with a cross- modal discrimination network.
  • the VFNet based cross-modal discrimination network finds the relations between a given pair of human voice and face to generate a confidence score based on a confidence that the voice and face belong to the same person. While the trained model can perform comparably to existing state-of-the-art cross-modal verification systems, the proposed framework of audio-visual speaker recognition with cross-modal verification outperforms the baseline audio-visual system. This highlights the importance of cross-modal verification, in other words, the relation between audio and visual cues, for audio-visual speaker recognition.
  • FIG 4 is a block diagram showing an exemplary computer device 400, in which embodiments of the invention, particularly methods 100 and 110 of Figure 1, may be practiced.
  • the computer device 400 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhoneTM manufactured by AppleTM, Inc. or one manufactured by LGTM, HTCTM or SamsungTM, for example, or another device.
  • the mobile computer device 400 includes the following components in electronic communication via a bus 406:
  • (a) a display 402; (b) non-volatile (non-transitory) memory 404; (c) random access memory ("RAM") 408;
  • transceiver component 412 that includes N transceivers
  • Although the components depicted in Figure 4 represent physical components, Figure 4 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 4.
  • the display 402 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
  • non-volatile data storage 404 functions to store (e.g., persistently store) data and executable code.
  • the system architecture may be implemented in memory 404, or by instructions stored in memory 404 - e.g. memory 404 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 410 cause the processor(s) 410 to perform the methods 100 and/or 110 described with reference to Figure 1.
  • the non-volatile memory 404 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
  • the non-volatile memory 404 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well.
  • the executable code in the non-volatile memory 404 is typically loaded into RAM 408 and executed by one or more of the N processing components 410.
  • the N processing components 410 in connection with RAM 408 generally operate to execute the instructions stored in non-volatile memory 404.
  • the N processing components 410 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
  • the transceiver component 412 includes N transceiver chains, which may be used for communicating with external devices via wireless networks.
  • Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme.
  • each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
  • the system 400 of Figure 4 may be connected to any appliance 418, such as an external server, database, video feed or other source from which inputs may be obtained.
  • Non-transitory computer-readable medium 404 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available medium that can be accessed by a computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

Described is a method for training a neural network for speaker verification. The method involves receiving a voice waveform and a face image (face) for each of a plurality of speakers. From each voice waveform, one or more speaker embeddings are extracted. From each face image, one or more face embeddings are extracted. The neural network is then trained by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the voice waveform and the face.

Description

CROSS-MODAL SPEAKER VERIFICATION
Technical Field
The present invention relates, in general terms, to methods for verifying (i.e. confirming the identity of) speakers using cross-modal authentication. More particularly, the present invention provides methods and systems that use both voice embeddings and face embeddings to verify speakers.
Background
Automatic speaker recognition systems have witnessed major breakthroughs in the past decade. These breakthroughs have led to many real-world practical systems. Most of the existing works build systems that recognise the speaker's face and do a voice comparison to authenticate speech.
These systems are expected to perform effectively under adverse conditions. When creating an audio-visual speaker recognition evaluation (SRE), the simplest way to perform multimedia based speaker recognition is to have separate systems for audio and visual inputs, then authenticate the speaker based on the results of the speaker and face recognition processes. This separates the recognition task into two subtasks and is a straightforward approach to simplify the problem. However, the two subsystems performing the subtasks are disjoint and one does not consider knowledge from the other.
It would be desirable to overcome or reduce at least one of the above-described problems with existing speaker recognition systems, or at least to provide a useful alternative.
Summary
Investigations into human recognition systems show that humans associate the voice and the face of a person in memory. While listening to the voice of an individual, one can select, from two static faces, the static face of that individual at a higher-than-chance level, and vice versa. The associations learned by humans between audio and visual cues are general identity features (such as gender, age and ethnicity) and appearance features (such as a big nose, chubbiness and a double chin).
The present invention was developed on the understanding that general cross-modal discriminative features provide additional information in audio-visual speaker recognition. In particular, disclosed is a voice-face discriminative network (VFNet) that establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
Accordingly, the present invention provides a method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
Notably, the cross-modal similarity may be expressed as a probability.
Training the neural network may further comprise negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
Training the neural network may involve transforming the embeddings into a transformed feature space. Training the neural network may involve applying cosine similarity scoring to the transformed embeddings. The neural network may output a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker. p1 may be calculated according to:

p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]

where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(Tv(ev), Tf(ef)) for the positive voice-face pairs and 1 - S(Tv(ev), Tf(ef)) for the negative voice-face pairs. The neural network may output a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to:

p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]
Also disclosed herein is a method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
Determining a cross-modal similarity may comprise applying a neural network trained according to the method described above.
The method may further comprise determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability. Determining the probability may comprise calculating a probabilistic linear discriminant (PLDA) based likelihood score.
The method may further comprise determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
The face similarity score may comprise a cosine similarity.
Disclosed herein is a system for speaker verification, comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
The at least one processor may be configured to determine a cross-modal similarity by applying a neural network trained according to the method set out above. The at least one processor may further be configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
The at least one processor may further be configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
Also disclosed herein is a computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification as described above, or to perform the method for speaker verification as described above.
Advantageously, the invention enables cross-modal discriminative network assistive speaker recognition. For a speaker recognition system, if an enrolled speaker's face is also available, the test speech can be used to find a general relation with the speaker's face and then assist the speaker recognition system.
Advantageously, the invention similarly enables cross-modal discriminative network assistive face recognition. For a face recognition system, if an enrolled individual's voice is also available, the test face can be used to find a general relation with the speaker's voice and then to assist the face recognition system.
Advantageously, the invention enables cross-modal discriminative network assistive audio-visual speaker recognition. For an audio-visual speaker recognition system, either the test face or voice can be used to find a general relation with the speaker's voice or face with cross-modal discriminative network and then to assist the audio-visual speaker recognition system.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:
Figure 1 shows a method for training a neural network for speaker verification, and a method for verifying a speaker;
Figure 2 illustrates an architecture of the proposed cross-modal discrimination network, VFNet, that relates the voice and face of a person; Figure 3 is a block diagram of the proposed audio-visual (AV) speaker recognition framework with VFNet, where VFNet provides voice-face cross-modal verification information that strengthens the baseline audio-visual speaker recognition decision; and
Figure 4 is a schematic of a system on which the methods of Figure 1 can be implemented.
Detailed description
Described is a voice-face discriminative network (VFNet) that enables cross-modal similarity to be detected in a similar manner to the way a human can identify the face of an unknown speaker from a small number of people, based on the voice of the speaker. VFNet establishes a general relation between human voice and face.
Experiments show that VFNet provides additional speaker discriminative information, enabling significant improvements for audio-visual speaker recognition over the standard fusion of separate audio and visual systems. Further, the cross-modal discriminative network can also be useful for improving both speaker and face recognition individual system performance. The use of cross-modal information between voice and face can be used in various applications including:
(i) methods using a cross-modal relationship between a voice-face pair (i.e. a voice waveform and a face) to support audio-visual speaker recognition;
(ii) methods using a cross-modal relationship between a voice-face pair to support either speaker recognition or face recognition systems when both voice and face are available for the target speakers.
The discussion below describes these methods and a system for implementing the methods, and demonstrates that the cross-modal discriminative network finds a general relation between voice and face, such that a voice-face pair can be used to assist audio-visual speaker recognition with robust performance. The same improvement is achieved for the results of speaker and face recognition separately, with access to both the voice and face of enrolled individuals.
With reference to Figure 1, the basis for these methods relies on providing a model with learned features mapping particular voice characteristics with particular facial features. A neural network model is presently proposed, for which Figure 1 illustrates a method 100 for training the neural network for speaker verification. The method 100 broadly comprises:
Step 102: receiving voice and face inputs;
Step 104: extracting speaker (i.e. voice) embeddings;
Step 106: extracting face embeddings; and
Step 108: training the neural network using the embeddings.
The above method yields VFNet, the trained neural network model for cross-modal speaker verification. The neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings. In general, the trained neural network (i.e. the product of step 108) will be able to verify both of a voice and face of a further speaker based on the associations and the face embeddings and voice embeddings, respectively.
Figure 2 shows the architecture 200 of the neural network, showing consideration of two inputs: a voice waveform 202 and a human face 204. The output 206 of the network 200 is a confidence score that describes at least one of the level of confidence that the voice and the face come from the same person and, where a threshold confidence level is provided, a positive or negative response indicating that the network 200 considers the voice and face to be of the same person or different people, respectively.
In more detail, step 102 involves receiving a waveform 202 and a face image 204 for each of a plurality of speakers. In general, each waveform will be paired with a face image and vice versa, to form a voice-face pair or face-voice pair. Moreover, the method 100 will generally be applied to a plurality, often many thousands, of voice-face pairs. Each face image comprises a face of a speaker and each waveform comprises a voice of a speaker. Where positive training is being employed, the voice in the waveform will be for the same speaker as the face in the image. Where negative training is being employed, the voice in the waveform will be for a different speaker to that whose face is in the image.
The waveform and face may be extracted from a database, received through a receiver portion of a transceiver, obtained through direct capture (e.g. a video feed capturing one or more face images of a speaker and a voice input) or obtained by any other suitable method. Step 104 involves extracting one or more speaker embeddings from each voice waveform. The speaker embeddings are low-dimensional representations that describe features of the voice waveform such that voice waveforms with similar, or close, embeddings indicate that the speaker's voice in each case is semantically similar.
The speaker embeddings are extracted using a speaker embedding extractor 208. The inputs from which the speaker embeddings are extracted may be derived from any suitable corpus such as the VoxCeleb1-2 corpora. The voice waveforms or speech waveforms may be taken from the audio channel of a video feed, and the face images may be selected from the image channel of the video feed - thus, both the voice waveform or waveforms, and face image or images, may be extracted from the same video feed.
Various systems can be used for extracting speaker embeddings. In an example, an x-vector based system is used for speaker embeddings extraction. Speech utterances, each of which comprises a voice of a speaker and is herein referred to as a voice waveform, are processed with energy based voice activity detection. This process removes silence regions. Therefore, while in some instances only part of a voice waveform may be used for extracting the speaker embeddings, in other instances, such as where the voice waveform is an audio channel of a video, the entire waveform may be used with pre-processing removing silent and/or noisy regions thereby reducing the size of the input from which the model trains. Energy based voice activity detection may alternatively, or in addition, be used for extracting one or both of frequency and spectral features. For example, mel frequency cepstral coefficient (MFCC) features may be extracted. The MFCC features may be any desired dimension, such as 30-dimensional MFCC features. In addition, windowing of the input 202 and/or 204 may be employed. Windowing enables features to be extracted from portions of a waveform or face image, reduces computation load, and enables features in one portion of a waveform or face image to be associated with features in a different portion of the same waveform or face image, those associations being lower-level features when compared with the features that they associate. Normalisation may also be applied across each window. In an example, short-time cepstral mean normalization is applied over a 3-second sliding window.
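As a concrete illustration of this front end, the sketch below extracts 30-dimensional MFCC features with librosa, applies a simple energy-based voice activity mask and performs sliding-window cepstral mean normalisation; the library choice, frame settings and the percentile threshold are assumptions for the example rather than the exact configuration used in the experiments.

```python
import numpy as np
import librosa  # assumed front-end library for this sketch

def mfcc_frontend(wav_path, sr=16000, n_mfcc=30, cmn_window_s=3.0, energy_percentile=30):
    """30-dim MFCCs with energy-based VAD and sliding-window cepstral mean normalisation (illustrative)."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop, win = int(0.010 * sr), int(0.025 * sr)      # 10 ms hop, 25 ms frames (assumed settings)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=win, hop_length=hop)

    # Energy-based voice activity detection: drop frames whose energy falls below a percentile threshold.
    energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)[0]
    n = min(mfcc.shape[1], energy.shape[0])
    mfcc, energy = mfcc[:, :n], energy[:n]
    mfcc = mfcc[:, energy > np.percentile(energy, energy_percentile)]

    # Short-time cepstral mean normalisation over a sliding window (~3 s of 10 ms frames) of the kept frames.
    half = int(cmn_window_s / 0.010) // 2
    out = np.empty_like(mfcc)
    for t in range(mfcc.shape[1]):
        lo, hi = max(0, t - half), min(mfcc.shape[1], t + half + 1)
        out[:, t] = mfcc[:, t] - mfcc[:, lo:hi].mean(axis=1)
    return out  # shape (30, num_voiced_frames), ready for an x-vector style extractor
```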
Step 106 involves extracting one or more face embeddings from each face image. Various models can be used for extracting face embeddings from images. For example, the face embedding extractor 210 may comprise the ResNet-50 RetinaFace model trained on a suitable database, such as the WIDER FACE database, to detect faces.
The face embeddings may then be aligned. One method for aligning the face embeddings is to use a multi-task cascaded convolutional network (MTCNN). Faces are detected in the images, recognised and aligned using suitable functions such as those provided in the InsightFace library, to obtain highly discriminative features for face recognition. To facilitate this process, additive angular margin loss may be used, e.g. for feature extraction. The face embedding extractor 210 may also include the ResNet-100 extractor model trained on the VGGFace2 and cleaned MS1MV2 database to extract the face embeddings.
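A minimal sketch of such a face pipeline is given below using the InsightFace Python package, assuming its bundled detection and ArcFace recognition models as stand-ins for the RetinaFace/ResNet-100 combination described above; the model name and API usage should be treated as assumptions rather than the exact configuration of the described extractor 210.

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis  # assumed high-level detection + ArcFace embedding API

def extract_face_embeddings(image_path):
    """Detect, align and embed every face in an image; returns a list of 512-dim embeddings."""
    app = FaceAnalysis(name="buffalo_l")          # bundled detector and recogniser (assumed model pack)
    app.prepare(ctx_id=-1, det_size=(640, 640))   # ctx_id=-1 selects CPU execution
    img = cv2.imread(image_path)
    faces = app.get(img)                          # detection, alignment and embedding in one call
    return [np.asarray(f.normed_embedding) for f in faces]
```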
In the example shown in Figure 2, the dimension of both speaker and face embeddings is 512 (212, 214). The 512-dimensional embeddings 212, 214 are used in step 108 for training the neural network 200. In general, positive training will be used either alone, or with negative training, to train the neural network 200. In this sense, positive training is where, for each voice-face pair, the voice and face are of the same speaker. Conversely, negative training is where, for each voice-face pair, the voice and face are of different speakers.
By performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, the neural network 200 can learn one or more associations between the voice waveform or waveforms and the respective face in the face image or face images.
The speaker embeddings and face embeddings represent information from respectively different modalities. To learn associations between the two, the inputs (i.e. their 512-dimensional inputs) are fed to a lower dimensional layer, or successively lower dimensional layers. In the present embodiment, the 512-dimensional input is fed to a 256-dimensional layer 216, 218 followed by a 128-dimensional layer 220, 222. The 256-dimensional layer 216, 218 is a fully connected layer (FC1) with rectified linear unit (ReLU) activation. The 128-dimensional layer 220, 222 is a fully connected layer (FC2) without the ReLU. These layers 216, 218, 220, 222 are introduced to guide the speaker and face embeddings towards learning cross-modal identity information from each other. Further, they help to project the embeddings from both modalities into a new domain, where their relation can be established.
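For illustration, a minimal PyTorch sketch of these two transformation branches is shown below; the layer sizes follow the description (512 to 256 with ReLU, then 256 to 128 without activation), while the class names and the softmax-based output are written as assumptions consistent with the formulas that follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VFNetBranch(nn.Module):
    """One transformation branch: FC1 (512 -> 256, ReLU) followed by FC2 (256 -> 128, no activation)."""
    def __init__(self, in_dim=512, hidden_dim=256, out_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # FC1, used with ReLU activation
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # FC2, no ReLU

    def forward(self, e):
        return self.fc2(F.relu(self.fc1(e)))

class VFNet(nn.Module):
    """Cross-modal discrimination sketch: one branch per modality, cosine scoring, softmax over (S, 1 - S)."""
    def __init__(self):
        super().__init__()
        self.voice_branch = VFNetBranch()  # transforms speaker embeddings e_v into T_v(e_v)
        self.face_branch = VFNetBranch()   # transforms face embeddings e_f into T_f(e_f)

    def forward(self, e_v, e_f):
        t_v = self.voice_branch(e_v)
        t_f = self.face_branch(e_f)
        s = F.cosine_similarity(t_v, t_f, dim=-1)                      # S(T_v(e_v), T_f(e_f))
        p = torch.softmax(torch.stack([s, 1.0 - s], dim=-1), dim=-1)
        return p[..., 0], p[..., 1]                                    # p1 (same person), p2 (different)
```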
To determine similarity, step 108 may further involve transforming the embeddings into a transformed feature space. For a given pair of a speaker embedding ev and a face embedding ef, their transformed embeddings Tv(ev) and Tf(ef) are derived from VFNet. The cosine similarity (224) is then determined to produce the cosine similarity score S(Tv(ev), Tf(ef)) between the embeddings or the transformed embeddings. 1 - S(Tv(ev), Tf(ef)) can then be used to represent a lack of cosine similarity. This is particularly applied, for example, during negative training on negative voice-face pairs. Thus, the final output p1 is the score that describes the probability that the voice and the face belong to the same person, and p2 is the score depicting the probability that the voice and the face do not belong to the same person. In other words, p1 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to the same speaker, and p2 is the probability that the voice in the voice waveform or waveforms and the face in the face image or images belong to different speakers.
By using a softmax function based on S(Tv(ev), Tf(ef)) and 1 - S(Tv(ev), Tf(ef)), the neural network 200 can output p1 and p2, expressed as:

p1 = exp(S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]
p2 = exp(1 - S(Tv(ev), Tf(ef))) / [exp(S(Tv(ev), Tf(ef))) + exp(1 - S(Tv(ev), Tf(ef)))]

where ev and ef are the voice embeddings and face embeddings respectively, Tv(ev) and Tf(ef) are the transformed voice embeddings and face embeddings respectively, and S(Tv(ev), Tf(ef)) is the cosine similarity between the transformed embeddings:

S(Tv(ev), Tf(ef)) = (Tv(ev) · Tf(ef)) / (||Tv(ev)|| ||Tf(ef)||)
These probabilities constitute predictions of whether the face and voice belong to the same or different people. The predictions, along with the ground-truth verification labels y_i ∈ {0, 1}, are used to calculate the cross-entropy loss according to:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log p_1^{(i)} + (1 - y_i) \log p_2^{(i)} \,\right] \tag{3}$$

where N is the number of voice-face training pairs.
The loss is then propagated back through the layers 216, 218, 220, 222 to adjust weights applied by those layers to particular features found in the speaker embeddings and face embeddings, to cause the neural network 200 to learn. After being trained, the neural network 200 will be able to determine a cross-modal similarity for a particular voice-face input pair. In other words, the neural network 200 will be able to specify a likelihood or probability that the face visible in an input (e.g. video feed) corresponds to the voice audible in that input.
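By way of a hedged illustration, the scoring and loss described above can be combined into a single sketched training step in PyTorch; the function name, batch layout and use of binary cross-entropy over p1 (which is equivalent to the two-class cross-entropy over p1 and p2) are assumptions rather than the implementation disclosed here.

```python
import torch
import torch.nn.functional as F

def vfnet_training_step(e_v, e_f, labels, voice_branch, face_branch, optimizer):
    """One sketched training step.

    labels: 1 for positive (same-person) voice-face pairs, 0 for negative pairs.
    """
    t_v = voice_branch(e_v)                   # T_v(e_v), shape (batch, 128)
    t_f = face_branch(e_f)                    # T_f(e_f), shape (batch, 128)
    s = F.cosine_similarity(t_v, t_f, dim=1)  # S(T_v(e_v), T_f(e_f))
    # Softmax over [S, 1 - S] yields p1 (same person) and p2 (different persons).
    probs = F.softmax(torch.stack([s, 1.0 - s], dim=1), dim=1)
    p1 = probs[:, 0]
    # Binary cross-entropy against the ground-truth verification labels.
    loss = F.binary_cross_entropy(p1, labels.float())
    optimizer.zero_grad()
    loss.backward()   # propagates the loss back through the projection layers
    optimizer.step()
    return loss.item()
```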
The speaker and face embeddings used when the model 200 performs cross-modal verification follow the same pipeline discussed above.
In experiments, the NIST SRE audio-visual corpus was used for the speaker recognition application. Manually marked diarization labels of the voice, along with keyframe indices and bounding boxes that mark a face of the individual, are provided - i.e. the dataset may include, for some of the frames, a bounding box identifying an enrolled speaker or bounding boxes identifying enrolled speakers. This enables enrolment of the target speakers from the videos (i.e. the audio-visual corpus) during training of the model (i.e. network) 200.
To ensure the model 200 is agnostic to ethnicity, age and other biases of input data sets, the speaker and face embeddings of the target person may be extracted from the enrolment segments (i.e. audio-visual feed) of a development set (e.g. NIST SRE), and the model is then retrained based on the combination of the enrolment segments and a second database (e.g. VoxCeleb2). Thus, one data set may be used to enrol specific individual speakers - e.g. employees of a company for whom voice and/or face recognition is to be used - and a second data set can then be used to refine or generalise the model 200.
During testing processes, labels were omitted from the input.
A summary of the corpora used for training and testing is shown in Table 1.

Table 1: summary of the VoxCeleb2 and 2019 NIST SRE audio-visual (AV) corpora.
The neural network model 200 trained using the above method 100 results in a model that can be used to determine a cross-modal similarity between a voice and a face of a speaker. The output of the model 200 can be used either independently, to verify that a face and voice are the same individual, or be fused with a speaker recognition system or face recognition system for enhanced speaker and face recognition, respectively. The output of the model 200 can also be fused with the output of a baseline audio-visual recognition system as shown in Figure 2. The fused output can then be used in a final decision for verifying the identity of the speaker.
To that end, Figure 1 shows a method 110 for speaker verification. In some embodiments, as reflected by broken line 112, the method 110 leverages a neural network trained according to the method 100. Speaker verification method 110 broadly comprises:
Step 114: receiving a voice waveform and at least one face image of a speaker;
Step 116: extracting one or more face embeddings from the face image or images;
Step 118: extracting one or more voice embeddings from the voice waveform;
Step 120: determining a cross-modal similarity between the voice embedding or embeddings and the face embedding or embeddings; and
Step 122: verifying the speaker.
At step 114 a voice waveform and one or more face images are received through a receiver forming part of, for example, transceiver 412 of Figure 4. The voice waveform and face image or images are those of a speaker.
At steps 116 and 118, face embeddings and speaker embeddings are respectively extracted, in the same manner as those embeddings are extracted at steps 106 and 104 respectively.
Step 120 involves determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings. As reflected by the broken line 112, the cross-modal similarity is determined from a cross-modal similarity score that is the output from the trained model 200 developed using the method 100.
Step 122 involves verifying the speaker if the similarity score exceeds a predetermined threshold. Verification can be a Yes/No verification in which the speaker is verified - i.e. the voice and face are determined to match the same person - if the cross-modal similarity score is greater than the predetermined threshold, and the speaker is not verified - i.e. the voice and face are unlikely to be of the same person - if the cross-modal similarity score is lower than or equal to the predetermined threshold. In other embodiments, verification may employ multiple thresholds. For example, a cross-modal similarity score below a first threshold is an indication of high confidence that the face and voice are not a match, a score between the first threshold and a second threshold is an indication that further information is required to confidently determine whether or not the face and voice are a match, and a score above the second threshold indicates high confidence that the face and voice are a match.
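By way of illustration only, such threshold-based decisions might be sketched as follows; the function names and all threshold values are hypothetical choices, since the thresholds are left here as predetermined parameters.

```python
def verify_speaker(score, threshold=0.5):
    """Single-threshold Yes/No verification; the threshold value is illustrative."""
    return score > threshold

def verify_speaker_two_thresholds(score, low=0.3, high=0.7):
    """Two-threshold variant: below `low` is a confident reject, above `high`
    a confident accept, and anything in between requires further information.
    Threshold values are illustrative assumptions."""
    if score < low:
        return "no-match"
    if score > high:
        return "match"
    return "needs-more-information"
```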
Figure 3 illustrates an architecture in which the cross-modal similarity is used to enhance voice and face verification. The architecture 300 provides an audio-visual speaker recognition framework in which the left panel 302 represents the process for cross-modal speaker verification, and baseline recognition is set out in the right panel 304. Voice segments of the target person are input (306) to the system 300 and are used in panel 302 to determine face embeddings corresponding to the speaker embeddings. The speaker embeddings are extracted using the x-vector system.
The speaker verification network 308 determines a speaker verification score 312 from embeddings in the voice segments 306 and the whole audio waveform 310, for speaker verification - i.e. a confidence score that the speaker is enrolled with the system 300.
InsightFace or another system extracts the face embeddings for given faces of the target speakers from the enrolment videos and for all detected faces from the test videos 314, in both panels 302 and 304. In the left panel 302, the cross-modal network provides an association score (i.e. a cross-modal verification score) between the target speaker's voice in the enrolment video and the faces detected in the test video. Matching pairs of voice and face give rise to a high association score, while mismatches - for example, discrepancies in age, gender, weight or ethnicity - do not. In panel 304, the audio and visual systems 316 and 318 respectively run in parallel to verify the target person's identity by computing a match between the enrolment and the test embeddings. In the present embodiment, the speaker recognition system 316 considers probabilistic linear discriminant analysis (PLDA) based likelihood scores, whereas the face recognition system 318 computes cosine similarity scores. The state-of-the-art baseline for present purposes is then a score-level fusion 320 between the two parallel systems 316, 318.
Therefore, while the output of the model 200, namely the cross-modal similarity score, may be used on its own to verify a speaker, the system 300 uses the cross-modal similarity score together with the output of one or both of the speaker verification model 316 and the face verification model 318. In some embodiments, such as when the model 200 is used for speaker recognition, a probability that the voice waveform 306 corresponds to the speaker is determined based only on the speaker embeddings, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed probability - i.e. the probability that the voice waveform 306 corresponds to the speaker. In some embodiments, such as when the model 200 is used for face recognition, a face similarity score is determined, specifying a similarity between the face in the target faces 322 and the speaker, and the similarity score outputted from the system 300 is then based on the cross-modal similarity score of panel 302 (model 200) and on the computed face similarity - i.e. that the face in the target faces 322 corresponds to the speaker. As mentioned above, the face similarity score may be a cosine similarity.
Score-level fusion 324 may be performed using various methods. In one embodiment, score-level fusion is performed using logistic regression. Thus, in each of the audio-visual SRE applications, an enrolment video (comprising audio and visual channels) provides the target individual's biometric information (voice and face), and the task is for the model to automatically determine whether the target person is present in a given test video. That determination is based on a cross-modal confidence score that specifies whether the voice is likely to match the face, and on one or both of face recognition and speaker recognition.
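As a hedged illustration of logistic-regression score fusion, a development set of per-trial scores from the parallel systems can be used to fit the fusion weights with scikit-learn; the score layout, variable names and numeric values below are placeholders rather than experimental data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the scores from the parallel systems for one development trial, e.g.
# [cross-modal VFNet score, speaker (PLDA) score, face (cosine) score].
dev_scores = np.array([[0.82,  1.4, 0.75],
                       [0.10, -2.1, 0.20],
                       [0.71,  0.8, 0.66],
                       [0.05, -1.5, 0.31]])
dev_labels = np.array([1, 0, 1, 0])  # 1 = target present, 0 = target absent

fusion = LogisticRegression()
fusion.fit(dev_scores, dev_labels)   # learn the fusion weights on the development set

test_scores = np.array([[0.64, 0.9, 0.55]])
fused_probability = fusion.predict_proba(test_scores)[:, 1]  # fused decision score
```

Calibration-and-fusion toolkits (such as the Bosaris toolkit mentioned below) can serve the same role; logistic regression is shown here only because it is the method named in this embodiment.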
Experiments
Experiments were conducted based on videos in the VoxCeleb2 corpus to derive a set with voices and faces for cross-modal verification. For each video, the entire audio channel is extracted to represent the voice of the speaker. For the face recognition application, face detection was performed on each video (i.e. a series of face images) and the most prominent faces representing an individual were selected. For cross-modal discriminative training, the positive trials are faces and voices that come from the same identity, whereas the negative trials are obtained by shuffling so that faces and voices belong to different persons. The model 200 learned the general association between voice and face from VoxCeleb2.
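A minimal sketch of how such positive and negative trials might be assembled is given below; the tuple layout and function name are assumptions made for illustration, not the corpus's actual file structure.

```python
import random

def build_trials(samples):
    """Build cross-modal trials from (speaker_id, voice, face) tuples.

    Positive trials pair a voice and face of the same identity (label 1);
    negative trials pair each voice with a face shuffled in from a
    different identity (label 0).
    """
    positives = [(voice, face, 1) for _, voice, face in samples]
    negatives = []
    for spk, voice, _ in samples:
        candidates = [s for s in samples if s[0] != spk]
        if not candidates:
            continue  # cannot form a negative trial with a single identity
        negatives.append((voice, random.choice(candidates)[2], 0))
    return positives + negatives
```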
As mentioned above, the embedding extraction for voice and face produces a 512-dimensional output. Although the dimensions of the speaker (i.e. voice) and face embeddings are the same, the back-end scoring for the respective individual systems is different. Linear discriminant analysis (LDA) was used on the speaker embeddings for channel/session compensation and to reduce the dimension of the x-vectors - presently to 150. Finally, PLDA is used as a classifier to obtain the final speaker recognition score. For face recognition, cosine similarities between face embeddings from the enrolment video and those from the detected faces in the test video are computed. The average of the top 20% of scores over the face embeddings in the test video is taken to derive the final face recognition score.

Turning now to the back-end of audio-visual SRE with VFNet: the VFNet back-end computes the likelihood score between the speaker embedding of the target speaker and all the face embeddings of detected faces in the test video - this score is determined using Equation (1). Finally, the average of all scores, or of a predetermined set or proportion of scores - e.g. the top 20% - is taken. This average is combined with the scores generated from one or both of the audio and visual systems. That combination may be achieved using a variety of methods, including logistic regression. Notably, cross-modal verification can also be performed by considering all the given faces in the enrolment video and the multiple speaker voices detected in the test audio. This can be achieved using a speaker diarization module that detects the voices belonging to different speakers in the test audio. For example, the model may determine that a mouth of a particular face is moving in a manner corresponding to an audio channel feed.

Various processes and libraries can be used to fuse scores. For example, the Bosaris toolkit can be used to calibrate and fuse the scores of the different systems 302, 316, 318. For experimental purposes, the performance of the systems is reported in terms of equal error rate (EER), minimum detection cost function (minDCF) and actual detection cost function (actDCF), following the protocol of the 2019 NIST SRE.
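As an illustration of the top-20% averaging described above, one possible sketch is the following; the function name and the default fraction argument are assumptions, with 20% following the proportion quoted in the text.

```python
import numpy as np

def top_fraction_average(scores, fraction=0.2):
    """Average the top `fraction` of per-face scores from a test video.

    `scores` holds one cross-modal (or cosine) score per detected face.
    """
    scores = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending order
    k = max(1, int(np.ceil(fraction * len(scores))))         # at least one score
    return float(scores[:k].mean())
```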
Notably, the model 200 trained according to the method 100 performed effectively for cross-modal verification. For some applications, such as where there are multiple speakers present in the test videos, each of which has to be matched with the target speaker in the enrolment video, the method 100 may comprise formulating and adding one or more shared-weight sub-branches to the neural network model to meet such selection requirements.
The results of cross-modal audio-visual speaker recognition, fused with the single-modality speaker and face recognition systems, are shown in Table 4. In particular, Table 4 shows the changes in results when cross-modal audio-visual speaker recognition is used, as per the model trained according to method 100, compared with speaker recognition, face recognition and audio-visual recognition systems that do not use cross-modal audio-visual speaker recognition.
Table 4: audio-visual speaker recognition results with and without cross-modal verification (VFNet).
When examining the effect on single-modality systems as reflected in Table 4 above, it is evident that the contribution of VFNet (i.e. the trained model) is greater for the speaker recognition system on the evaluation set. Further, the trained model is also able to enhance the performance of the audio-visual baseline system. This suggests that associating audio and visual cues through cross-modal verification is useful for audio-visual SRE. The relative improvements in each case are 16.54%, 2.00% and 8.83% in terms of EER, minDCF and actDCF, respectively.
Thus, Figure 1 illustrates a method 100 for training, and a method 110 for using, a novel framework for audio-visual speaker recognition with a cross-modal discrimination network. The VFNet-based cross-modal discrimination network finds the relations between a given pair of human voice and face to generate a confidence score based on the likelihood that the voice and face belong to the same person. While the trained model can perform comparably to existing state-of-the-art cross-modal verification systems, the proposed framework of audio-visual speaker recognition with cross-modal verification outperforms the baseline audio-visual system. This highlights the importance of cross-modal verification - in other words, of the relation between audio and visual cues - for audio-visual speaker recognition.
Figure 4 is a block diagram showing an exemplary computer device 400 in which embodiments of the invention, particularly the methods 100 and 110 of Figure 1, may be practiced. The computer device 400 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhoneTM manufactured by AppleTM, Inc. or a device manufactured by LGTM, HTCTM or SamsungTM, for example, or another device.
As shown, the mobile computer device 400 includes the following components in electronic communication via a bus 406:
(a) a display 402;
(b) non-volatile (non-transitory) memory 404;
(c) random access memory ("RAM") 408;
(d) N processing components 410;
(e) a transceiver component 412 that includes N transceivers; and
(f) user controls 414.
Although the components depicted in Figure 4 represent physical components, Figure 4 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 4.
The display 402 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
In general, the non-volatile data storage 404 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 404, or by instructions stored in memory 404 - e.g. memory 404 may be a computer-readable storage medium for storing instructions that, when executed by the processor(s) 410, cause the processor(s) 410 to perform the methods 100 and/or 110 described with reference to Figure 1.
In some embodiments, for example, the non-volatile memory 404 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components well known to those of ordinary skill in the art, which are not depicted or described for simplicity. In many implementations, the non-volatile memory 404 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 404, the executable code in the non-volatile memory 404 is typically loaded into RAM 408 and executed by one or more of the N processing components 410.
The N processing components 410 in connection with RAM 408 generally operate to execute the instructions stored in non-volatile memory 404. As one of ordinary skill in the art will appreciate, the N processing components 410 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
The transceiver component 412 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
The system 400 of Figure 4 may be connected to any appliance 418, such as an external server, database, video feed or other source from which inputs may be obtained.
It should be recognized that Figure 4 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 404. Non-transitory computer-readable medium 404 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims
1. A method for training a neural network for speaker verification, comprising: receiving, for each of a plurality of speakers, at least one voice waveform comprising a voice of the respective speaker and at least one face image comprising a face of the respective speaker; extracting, from each voice waveform, one or more speaker embeddings; extracting, from each face image, one or more face embeddings; and training the neural network by performing positive training using positive voice-face pairs, each positive voice-face pair comprising speaker embeddings and face embeddings of the same speaker, to learn one or more associations between the at least one voice waveform and the respective face in the at least one face image, so that the neural network can verify at least one of: a face of a further speaker based on the associations and speaker embeddings; and a voice of a further speaker based on the associations and face embeddings.
2. The method of claim 1, wherein training the neural network further comprises negatively training the neural network on negative voice-face pairs, each negative voice-face pair comprising speaker embeddings and face embeddings of different speakers.
3. The method of claim 1 or 2, wherein training the neural network involves transforming the embeddings into a transformed feature space.
4. The method of claim 3, wherein training the neural network involves applying cosine similarity scoring to the transformed embeddings.
5. The method of claim 4 when dependent on claim 2, wherein the neural network outputs a probability p1 that the at least one voice waveform and the face in the at least one face image belong to the same speaker.
6. The method of claim 5, wherein p1 is calculated according to:

$$p_1 = \frac{e^{S(T_v(e_v),\, T_f(e_f))}}{e^{S(T_v(e_v),\, T_f(e_f))} + e^{1 - S(T_v(e_v),\, T_f(e_f))}}$$

where e_v and e_f are the voice embeddings and face embeddings respectively, T_v(e_v) and T_f(e_f) are the transformed voice embeddings and face embeddings respectively, and the cosine similarity score is S(T_v(e_v), T_f(e_f)) for the positive voice-face pairs and 1 - S(T_v(e_v), T_f(e_f)) for the negative voice-face pairs.
7. The method of claim 6, wherein the neural network outputs a probability p2 that the voice waveform and the face in the at least one face image belong to different speakers, wherein p2 is calculated according to:

$$p_2 = \frac{e^{1 - S(T_v(e_v),\, T_f(e_f))}}{e^{S(T_v(e_v),\, T_f(e_f))} + e^{1 - S(T_v(e_v),\, T_f(e_f))}}$$
8. A method for speaker verification, comprising: receiving, via a receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extracting, from the voice waveform using a speaker embedding extractor, one or more speaker embeddings; extracting, from each face image, using a face embedding extractor, one or more face embeddings; determining a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verifying the speaker if the similarity score exceeds a predetermined threshold.
9. The method of claim 8, wherein determining a cross-modal similarity comprises applying a neural network trained according to the method of any one of claims 1 to 7.
10. The method of claim 8 or 9, further comprising determining, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, wherein the similarity score is also based on said probability.
11. The method of claim 10, wherein determining the probability comprises calculating a probabilistic linear discriminant (PLDA) based likelihood score.
12. The method of any one of claims 8 to 11, further comprising determining, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, wherein the similarity score is also based on said face similarity score.
13. The method of claim 12, wherein the face similarity score comprises a cosine similarity.
14. A system for speaker verification, comprising: memory; a receiver; a speaker embedding extractor; a face embedding extractor; and at least one processor, wherein the memory comprises instructions that, when executed by the at least one processor, cause the system to: receive, via the receiver, for a speaker: a voice waveform; and at least one face image comprising a face of the speaker; extract, from the voice waveform using the speaker embedding extractor, one or more speaker embeddings; extract, from each face image, using the face embedding extractor, one or more face embeddings; determine a similarity score based on a cross-modal similarity between the one or more speaker embeddings and the one or more face embeddings; and verify the speaker if the similarity score exceeds a predetermined threshold.
15. The system of claim 14, wherein the at least one processor is configured to determine a cross-modal similarity by applying a neural network trained according to the method of any one of claims 1 to 7.
16. The system of claim 14 or 15, wherein the at least one processor is further configured to determine, based only on the speaker embeddings, a probability that the voice waveform corresponds to the speaker, and to determine the similarity score also based on said probability.
17. The system of any one of claims 14 to 16, wherein the at least one processor is further configured to determine, based only on the face embeddings, a face similarity score specifying a similarity between the face in the at least one face image and the speaker, and to determine the similarity score also based on said face similarity score.
18. A computer-readable medium having instructions stored thereon that, when executed by at least one processor of a computer system, cause the computer system to perform the method for training a neural network for speaker verification in accordance with any one of claims 1 to 7, or to perform the method for speaker verification in accordance with any one of claims 8 to 13.
PCT/SG2021/050358 2020-06-19 2021-06-21 Cross-modal speaker verification WO2021257000A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202005845Y 2020-06-19
SG10202005845Y 2020-06-19

Publications (1)

Publication Number Publication Date
WO2021257000A1 true WO2021257000A1 (en) 2021-12-23

Family

ID=79268734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2021/050358 WO2021257000A1 (en) 2020-06-19 2021-06-21 Cross-modal speaker verification

Country Status (1)

Country Link
WO (1) WO2021257000A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190313014A1 (en) * 2015-06-25 2019-10-10 Amazon Technologies, Inc. User identification based on voice and face
CN106790054A (en) * 2016-12-20 2017-05-31 四川长虹电器股份有限公司 Interactive authentication system and method based on recognition of face and Application on Voiceprint Recognition
US20190213399A1 (en) * 2018-01-08 2019-07-11 Samsung Electronics Co., Ltd. Apparatuses and methods for recognizing object and facial expression robust against change in facial expression, and apparatuses and methods for training
CN108446674A (en) * 2018-04-28 2018-08-24 平安科技(深圳)有限公司 Electronic device, personal identification method and storage medium based on facial image and voiceprint

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MEUTZNER HENDRIK; MA NING; NICKEL ROBERT; SCHYMURA CHRISTOPHER; KOLOSSA DOROTHEA: "Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 5320 - 5324, XP033259426, DOI: 10.1109/ICASSP.2017.7953172 *
NAGRANI ARSHA; CHUNG JOON SON; ALBANIE SAMUEL; ZISSERMAN ANDREW: "Disentangled Speech Embeddings Using Cross-Modal Self-Supervision", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 6829 - 6833, XP033793750, DOI: 10.1109/ICASSP40776.2020.9054057 *
SHON SUWON; OH TAE-HYUN; GLASS JAMES: "Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 3995 - 3999, XP033566026, DOI: 10.1109/ICASSP.2019.8683477 *
SOO-WHAN CHUNG; HONG GOO KANG; JOON SON CHUNG: "Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 April 2020 (2020-04-29), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081655046 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230215440A1 (en) * 2022-01-05 2023-07-06 CLIPr Co. System and method for speaker verification
WO2023132828A1 (en) * 2022-01-05 2023-07-13 CLIPr Co. System and method for speaker verification

Similar Documents

Publication Publication Date Title
WO2020073694A1 (en) Voiceprint identification method, model training method and server
Villalba et al. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
US10255922B1 (en) Speaker identification using a text-independent model and a text-dependent model
US10380332B2 (en) Voiceprint login method and apparatus based on artificial intelligence
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
US10068588B2 (en) Real-time emotion recognition from audio signals
Lozano-Diez et al. Analysis and Optimization of Bottleneck Features for Speaker Recognition.
US8416998B2 (en) Information processing device, information processing method, and program
WO2020155584A1 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2017162053A1 (en) Identity authentication method and device
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
JP6464650B2 (en) Audio processing apparatus, audio processing method, and program
Khoury et al. Bi-modal biometric authentication on mobile phones in challenging conditions
US20170294192A1 (en) Classifying Signals Using Mutual Information
Khoury et al. The 2013 speaker recognition evaluation in mobile environment
CN111199741A (en) Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
Ringeval et al. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion
EP2879130A1 (en) Methods and systems for splitting a digital signal
Ramos-Castro et al. Speaker verification using speaker-and test-dependent fast score normalization
TW202213326A (en) Generalized negative log-likelihood loss for speaker verification
US11437044B2 (en) Information processing apparatus, control method, and program
WO2021257000A1 (en) Cross-modal speaker verification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21825472

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21825472

Country of ref document: EP

Kind code of ref document: A1