CN111081255B - Speaker confirmation method and device - Google Patents


Info

Publication number
CN111081255B
CN111081255B (application CN201911412555.6A)
Authority
CN
China
Prior art keywords
speaker
text
embedding
network
data
Prior art date
Legal status
Active
Application number
CN201911412555.6A
Other languages
Chinese (zh)
Other versions
CN111081255A (en
Inventor
俞凯
钱彦旻
杨叶新
王帅
龚勋
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911412555.6A priority Critical patent/CN111081255B/en
Publication of CN111081255A publication Critical patent/CN111081255A/en
Application granted granted Critical
Publication of CN111081255B publication Critical patent/CN111081255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speaker verification method and apparatus. The method comprises: inputting audio data into a general feature extractor to extract preset features; inputting the extracted features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, where a speaker embedding is extracted by the speaker classification sub-network and a text embedding is extracted by the phoneme distribution prediction sub-network; merging the speaker embedding and the text embedding through a merging sub-network to obtain a single embedding of the two; and performing speaker verification based on the single embedding.

Description

Speaker confirmation method and device
Technical Field
The invention belongs to the technical field of speaker confirmation, and particularly relates to a speaker confirmation method and device.
Background
In the prior art, speaker verification (SV) aims to verify the claimed identity of a user from his or her speech. Depending on the constraints placed on the speech content, speaker verification falls into two categories: text-dependent and text-independent. The former requires the enrollment and test utterances to share the same phonetic content, while the latter does not, giving the user greater flexibility.
For the text-independent speaker verification task, it is beneficial to train the speaker embedding extractor on a large amount of unconstrained speech data, implicitly normalizing out the text information, since the final speaker embedding should be free of phonetic variability. Although such a model performs well on text-independent tasks, applying it directly to text-dependent tasks, for which the text information is important, remains problematic. A common way to address this performance degradation is to collect training data with the same phonetic content as the evaluation data, an approach widely used by companies for wake-word-based speaker verification. However, re-collecting application-specific training data is expensive and inflexible.
In practical applications, the challenge comes not only from the text mismatch between the training data and the evaluation data, but also from the text mismatch between the enrollment and test data within the evaluation. For example, it is common for a user to want to wake a smart device with several keywords: Google devices accept both "OK Google" and "Hey Google", and some applications involve even more keywords.
Existing text-dependent speaker verification systems on the market require the enrollment text to be consistent with the test text, and must judge whether the text content matches while judging whether the speaker is correct. Such scenarios typically arise in wake-word-based speaker verification, yet a text mismatch between the pre-collected data and the actual test data can greatly degrade system performance. For a text mismatch between training data and evaluation data, the common practice is to collect data with the target speech content and train the speaker verification model on it. For a text mismatch between enrollment data and test data, however, no mature solution exists on the market.
Prior-art solutions, in the training phase, train a deep neural network on a speaker classification task using data with the target speech content, so that the network implicitly learns the text information by default. In the testing phase, a speaker is first enrolled: the enrollment speech is fed into the neural network and a vector is extracted from an intermediate layer as the speaker embedding that models the enrollee. The enrolled speaker embedding and the embedding of the actual test speech are then scored with cosine similarity or PLDA, and the score is used to judge whether the enrolled and test speakers, and their texts, are consistent.
In the course of implementing the present application, the inventors found that the existing scheme has at least the following defect:
a large amount of data with the target speech content must be collected, which typically requires a substantial expenditure of manpower and material resources; and if the target text changes, new data must be collected all over again, which is inflexible.
Disclosure of Invention
The embodiments of the invention provide a speaker verification method and a speaker verification apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker verification method, including: inputting the audio data to a general feature extractor to extract a preset feature; inputting the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained by extraction of the speaker classification sub-network, and text embedding is obtained by extraction of the phoneme distribution prediction sub-network; combining the speaker embedding and the text embedding through a combining sub-network to obtain a single embedding of the speaker embedding and the text embedding; and performing speaker verification based on the single embedding.
In a second aspect, an embodiment of the present invention provides a speaker verification apparatus, including: an extraction module configured to input the audio data to the general feature extractor to extract a preset feature; the parallel sub-network module is configured to input the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained through extraction of the speaker classification sub-network, and text embedding is obtained through extraction of the phoneme distribution prediction sub-network; the merging module is configured to merge the speaker embedding and the text embedding through a merging sub-network to obtain a single embedding of the speaker embedding and the text embedding; and a confirmation module configured to perform speaker confirmation based on the single embedding.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the steps of the speaker verification method according to any of the embodiments of the invention.
In a fourth aspect, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the steps of the speaker verification method according to any one of the embodiments of the present invention.
The method and apparatus not only provide a brand-new idea of decomposing speaker information and text information for text-dependent speaker recognition, but also improve the original text-independent speaker recognition performance while greatly improving text-dependent speaker recognition performance; in addition, the scheme introduces a brand-new text-customized speaker recognition task, which has broad prospects in wake-word-based speaker recognition, and provides a solution for it.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a speaker verification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an embodiment of a speaker verification scheme according to the present invention;
FIG. 3 is a block diagram of a speaker verification device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a speaker verification method according to the present application is shown. The speaker verification method of this embodiment may be applied to devices requiring speaker verification, such as voiceprint recognition in various intelligent voice devices; the present application is not limited in this respect.
As shown in fig. 1, in step 101, audio data is input to a general feature extractor to extract a preset feature;
in step 102, the extracted preset features are respectively input into a speaker classification sub-network and a phoneme distribution prediction sub-network, wherein speaker embedding is obtained by extraction of the speaker classification sub-network, and text embedding is obtained by extraction of the phoneme distribution prediction sub-network;
in step 103, the speaker embedding and the text embedding are merged by a merging sub-network to obtain a single embedding of the two;
in step 104, speaker verification is performed based on the single embedding.
In this embodiment, for step 101, the speaker verification apparatus extracts the preset features from the audio data by inputting the audio data to the general feature extractor. Then, for step 102, it inputs the extracted features into two parallel sub-networks, where the speaker classification sub-network extracts the speaker embedding and the phoneme distribution prediction sub-network extracts the text embedding. Thereafter, for step 103, the extracted speaker embedding and text embedding are input to the merging sub-network, which merges them into a single embedding. Finally, subsequent speaker verification can be performed based on this single embedding.
The method of this embodiment not only provides a brand-new idea of decomposing speaker information and text information for text-dependent speaker recognition, but also improves the original text-independent speaker recognition performance while greatly improving text-dependent speaker recognition performance; in addition, the scheme introduces a brand-new text-customized speaker recognition task, which has broad prospects in wake-word-based speaker recognition, and provides a solution for it.
In some alternative embodiments, the general feature extractor, the speaker classification sub-network, the phoneme distribution prediction sub-network and the merging sub-network are trained jointly.
In some alternative embodiments, the training pairs used to train the merging sub-network include training pairs from the same sentence and training pairs from different sentences.
In some optional embodiments, the audio data comprises training data and evaluation data, the evaluation data being subdivided into enrollment data and test data, and the method further comprises: in text-dependent speaker verification, if a text mismatch occurs between the training data and the evaluation data, the text embedding used for coupling with the speaker embedding is obtained from the actual enrollment data and test data; if a text mismatch occurs between the enrollment data and the test data, the text embedding is computed from pre-collected data; and the enrolled speaker embedding is text-adapted according to the text embedding.
In some alternative embodiments, the speaker classification sub-network and the phoneme distribution prediction sub-network each comprise two time-delay layers, one statistics pooling layer and two linear layers, and the merging sub-network comprises two linear layers and two output layers.
In a further alternative embodiment, the phoneme distribution prediction sub-network predicts the normalized number of occurrences of each phoneme class in a sentence.
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
The inventors have discovered in the course of practicing the present application that the deficiencies in the prior art are primarily due to the fact that the prior art does not explicitly separate speaker information from textual information.
To overcome these drawbacks, practitioners in the industry usually adopt a multi-task training framework, i.e., performing a speaker classification task and a phoneme recognition task simultaneously during neural network training; however, this method cannot solve the problem of text mismatch between the enrollment data and the test data. The present scheme is not easily conceived because sentence-level text information is difficult to define and no framework for decomposing speaker information and text information existed before.
The solution of the present application proposes a speaker-text decomposition network that decomposes the input speech into a speaker embedding and a text embedding, which are then synthesized into a single embedded representation at a later stage. Given a small amount of speaker-independent adaptation speech with the target content, we can extract the text embedding of the target speech content and combine it with the text-independent speaker embedding to generate a text-customized speaker embedding. This allows text information to be explicitly added to the speaker information without collecting large amounts of matched data. Moreover, when the content of the enrollment utterance differs from the content of the utterance actually tested, a text-customized speaker embedding can still be extracted, which solves the text mismatch problem and offers great flexibility.
The model consists of four parts: a general feature extractor Mf; two parallel sub-networks Ms and Mt, used for speaker classification and phoneme distribution prediction respectively; and a merging sub-network Mc that merges the speaker information and the phoneme information. Mt predicts the normalized number of occurrences of each phoneme class in a sentence, while Ms predicts the traditional speaker class. The speaker embedding ebds extracted by Ms and the text embedding ebdt extracted by Mt are passed through the merging sub-network Mc to recover the speaker class and the phoneme distribution. The entire network is trained jointly.
To better decouple the speaker information and the phoneme information, the training pairs [ebds, ebdt] fed into the merging sub-network Mc may come from the same utterance or from two utterances of different speakers.
The speaker embedding extracted from the speaker sub-network can be used directly for text-independent speaker verification. For text-dependent speaker verification, if a text mismatch occurs between the training data and the evaluation data, the text embedding coupled with the speaker embedding comes from the actual enrollment and test data; if a text mismatch occurs between the enrollment data and the test data, the text embedding may be computed from a minimal amount of pre-collected data (e.g., 10 utterances from other speakers) and then used to text-adapt the enrolled speaker embedding.
In carrying out the present application, the inventors also tried the following alternative: obtaining the final text-customized speaker embedding by directly concatenating the speaker embedding with the text embedding, without using a final merging sub-network. The advantages are reduced computation and simpler operation; the disadvantages are that the regularizing effect of the merging sub-network on the upstream network is lost, the results may worsen, and the speaker embedding obtained by concatenation has too high a dimension.
The scheme of the embodiments of the present application not only provides a brand-new idea of decomposing speaker information and text information for text-dependent speaker recognition, but also improves the original text-independent speaker recognition performance while greatly improving text-dependent speaker recognition performance; in addition, it introduces a brand-new text-customized speaker recognition task, which has broad prospects in wake-word-based speaker recognition, and provides a solution for it.
The experimental procedures of the inventors are described below to better support the effects of the embodiments of the present application.
Text mismatches between pre-collected data (either training data or enrollment data) and actual test data can severely compromise the performance of text-dependent speaker verification systems. While this problem can be addressed by carefully collecting data with targeted voice content, such data collection is often expensive and inflexible. In this document, we propose a novel text adaptation framework to address the text mismatch problem. Here we propose a speaker-text decomposition network that decomposes the input speech into speaker-embedding and text-embedding, and then synthesizes them into a single embedded representation at a later stage. Given a small amount of speaker-independent target content adaptive speech, we can extract the text embedding of the target speech content and combine it with the text-independent speaker embedding to generate a text-customized speaker embedding. Experiments on RSR2015 show that the text adaptive system we propose can significantly improve the performance of the speaker verification system under text mismatch conditions.
In this work, to avoid re-collecting a large amount of application-specific training data, we propose a "text adaptation framework" that can handle both kinds of text mismatch (training text versus evaluation text, and enrollment text versus test text). The framework adapts text-independent speaker embeddings into text-customized speaker embeddings. To this end we propose a speaker-text decomposition network that contains four parts: a generic feature learner, a speaker subnet for text-independent "speaker" embedding extraction, a text subnet for "text" embedding extraction, and a "combination" subnet that learns a text-adapted representation based on the information provided by the "text" embedding.
We customized different evaluation sets from the RSR2015 data set to cover the different types of text mismatch. For the traditional text-dependent task, where the text mismatch exists only between training and evaluation data, the "text" embedding is computed from the same sentence as the "speaker" embedding. For the case where a text mismatch also occurs between enrollment data and test data, we collected a very small number of utterances from arbitrary speakers (not overlapping with the utterances in the evaluation set) to compute the text embedding and adapt the enrollment speaker embedding. The experimental results show a clear performance improvement in both cases. In addition, the "speaker" embedding (without text adaptation) extracted from the speaker subnet also outperforms the original x-vector baseline on the standard Voxceleb evaluation set.
Related work
X-VECTOR
The x-vector is a speaker embedding learning framework based on the time-delay neural network (TDNN). The model contains several frame-level time-delay layers, followed by a statistics pooling layer that aggregates the frame-level representations into a single segment-level representation. One or more embedding layers following the pooling layer can then be used at the segment level to extract the speaker embedding.
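As a rough illustration of the structure just described, the following minimal PyTorch sketch stacks frame-level time-delay layers (implemented as dilated 1-D convolutions), a statistics pooling layer, and segment-level embedding layers; the layer sizes and speaker count are assumptions for illustration, not the exact configuration used in the patent.

import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    """Minimal x-vector-style extractor: TDNN frame layers -> stats pooling -> embedding."""
    def __init__(self, feat_dim=40, emb_dim=512, num_speakers=5994):
        super().__init__()
        # Frame-level time-delay layers as dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        # Segment-level layers after statistics pooling (mean + std -> 3000 dims).
        self.embedding = nn.Linear(3000, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def stats_pool(self, x):
        # x: (batch, channels, frames) -> per-channel mean and std, concatenated.
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) Fbank features.
        h = self.frame_layers(feats.transpose(1, 2))
        pooled = self.stats_pool(h)
        emb = self.embedding(pooled)            # the speaker embedding is taken here
        return self.classifier(torch.relu(emb)), emb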
Segment-level phoneme label definition
Researchers have studied integrating phonetic information into the speaker modeling process, mostly following a frame-level multi-task learning paradigm. In our previous work, we proposed a framework that considers phonetic information at the segment level, which is more compatible with the segment-level training of x-vectors. The key question is how to define a phonetic label when a speech segment covers multiple phonemes. We adopt a simple approach: for a given segment x with N frames, the corresponding segment-level phoneme label is yt = {y1, y2, ..., yC} with

yc = Nc / N,

where C is the size of the selected phone set and Nc denotes the number of occurrences of the c-th phoneme in x.
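A minimal sketch of this label construction, assuming per-frame phoneme indices are already available from a phoneme recognizer (the function name is illustrative):

import numpy as np

def segment_phoneme_label(frame_phonemes, num_phones):
    """Segment-level phoneme label: y[c] = Nc / N, the fraction of frames labelled phoneme c."""
    counts = np.bincount(np.asarray(frame_phonemes), minlength=num_phones).astype(np.float64)
    return counts / max(len(frame_phonemes), 1)

# Example: a 6-frame segment over a 4-phone set.
y = segment_phoneme_label([0, 0, 2, 2, 2, 3], num_phones=4)
# y == [2/6, 0, 3/6, 1/6]; this distribution is the target of the phoneme-distribution subnet.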
Text adaptation framework
A typical deep speaker verification task involves three phases:
training: the speaker embedding extractor is trained using a large amount of pre-collected data.
Evaluation:
- Enrollment: a new speaker is enrolled by extracting speaker embeddings with the trained extractor.
- Testing: each test utterance is scored for verification against the enrollment model of the claimed identity.
For text-independent tasks, we do not have any requirement for text matching between training and evaluation data or between enrollment and test data.
For the traditional text-dependent task, a system trained for text-independent speaker verification and applied directly typically performs very poorly because of the text mismatch between training and evaluation data. Current state-of-the-art text-dependent speaker verification systems share the same modeling approach as text-independent systems but collect training data customized to the application. For example, training speech segments may be collected from a large number of speakers all sharing the same content, such as "OK Google" or "Hey Cortana". Although good results have been achieved with this approach, re-collecting training data for every different phrase is expensive and inflexible.
Furthermore, real-world applications often do not follow the standard text-dependent regime. In some cases there may also be a text mismatch between the enrollment and test utterances. For example, it is common for a user to wish to wake a smart device with multiple keywords: Google devices support both "OK Google" and "Hey Google", and some applications may require even more keywords. Can the user be allowed to enroll with only one of them and use a different keyword at test time?
Text adaptation of speaker embeddings
To summarize the above problem, we want to solve the following two text mismatch cases:
there is a text mismatch between the training and evaluation data, which is a traditional text-related task.
Text mismatches exist not only between training and assessment data, but also between enrollment and test data.
In the case where the pre-collected training data shares the same text as the assessment data, the text information is modeled implicitly in the speaker extractor. However, to address the two types of text mismatch described above without recollecting a large amount of training data, modeling text information should be explicitly considered. In this context, we propose a framework in which text independent speaker embedding can be adapted to text dependent speaker embedding, and the textual information can be customized according to the input.
Speaker text decomposition network
FIG. 2 illustrates the proposed speaker-text decomposition network.
As shown, the proposed model contains four parts: a general feature extractor Mf; two parallel subnets Ms and Mt for speaker classification and phoneme distribution learning, respectively; and a "combination" subnet Mc for synthesizing the speaker and phonetic information. The phoneme classifier Mt predicts the distribution of phonemes in an input segment, and the speaker classifier Ms is a standard network predicting speaker classes. The speaker embedding ebds and the phoneme-based text embedding ebdt extracted from Ms and Mt are then concatenated as input to the combination network Mc to recover the speaker identity and the phonetic information. The model is trained jointly. Given the features of a training segment pair [xs, xt] and the corresponding speaker label ys and phoneme label yt, the loss is defined as Ltotal = Ls1 + Lt1 + Ls2 + Lt2, where
Ls1=CE(Ms(Mf(xs)),ys)
Lt1=KLD(Mt(Mf(xt)),yt)
Ls2=CE(Mc([ebds,ebdt]),ys)
Lt2=KLD(Mc([ebds,ebdt]),yt)
The embedding ebds is computed from xs with the speaker subnet, and ebdt is computed from xt with the text subnet. To better separate the speaker and phonetic information, training pairs [xs, xt] are randomly sampled from the training data; a pair may come from the same utterance or from two segments of speech spoken by two different speakers.
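A minimal PyTorch sketch of this joint objective is given below; it assumes Mf, Ms, Mt and Mc are already defined as modules (Ms and Mt returning a (logits, embedding) pair, and Mc returning speaker and phoneme logits from the concatenated embeddings), and that yt is the normalized phoneme-count distribution defined earlier.

import torch
import torch.nn.functional as F

def decomposition_loss(Mf, Ms, Mt, Mc, xs, xt, ys, yt):
    """Joint loss Ltotal = Ls1 + Lt1 + Ls2 + Lt2 for one training pair [xs, xt]."""
    # Shared front-end features for the two (possibly different) segments.
    fs, ft = Mf(xs), Mf(xt)

    spk_logits, ebd_s = Ms(fs)   # speaker subnet: speaker-class logits + speaker embedding
    phn_logits, ebd_t = Mt(ft)   # text subnet: phoneme-distribution logits + text embedding

    Ls1 = F.cross_entropy(spk_logits, ys)
    Lt1 = F.kl_div(F.log_softmax(phn_logits, dim=1), yt, reduction="batchmean")

    # Merge the two embeddings and recover both speaker identity and phoneme distribution.
    spk_logits2, phn_logits2 = Mc(torch.cat([ebd_s, ebd_t], dim=1))
    Ls2 = F.cross_entropy(spk_logits2, ys)
    Lt2 = F.kl_div(F.log_softmax(phn_logits2, dim=1), yt, reduction="batchmean")

    return Ls1 + Lt1 + Ls2 + Lt2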
The "speaker" embedding extracted from the speaker subnet can be used directly for text independent tasks because text adaptation is not required. For the traditional text-dependent speaker verification task, the "text" embedding and the "speaker" embedding come from the same piece of audio, since it can accurately compute the text embedding for enrollment and test data. For other cases where there is a text mismatch between the enrollment and test voices in the target test, we use a very small amount of pre-collected data and the target text (e.g., 10 words from other speakers) to calculate the "text" embedding, which is then used to accommodate the "speaker" embedding calculated from the enrollment voices, and the "text" embedding used by the test voices remains unchanged.
Experimental setup
Data
The Voxceleb and RSR2015 data sets were used in our experiments. All phoneme labels were generated by a phoneme recognizer; for more details, reference may be made to [2]. Details of the training and evaluation data preparation for the text adaptation evaluation and the text-independent evaluation are given below, and the trial definitions are provided online so that our results on the customized RSR2015 evaluation sets can be reproduced.
Training set
For our experiments, the Voxceleb2 development set was used to train the neural networks and the probabilistic linear discriminant analysis (PLDA) back-end. This set contains 5994 speakers with 1,092,009 utterances. To train the neural network, we followed the data preparation process of the Kaldi Voxceleb recipe, which cuts the speech into segments varying in length from 2 s to 4 s. Note that, unlike the official recipe, we did not use any data augmentation.
Text adaptive evaluation set
From the different text mismatches, two separate evaluation sets were created:
Mismatch between training and evaluation data: when the mismatch exists only between the training and evaluation data, the task becomes a traditional text-dependent task. The evaluation set is derived from the evaluation portion of RSR2015 Part I. It contains 30 fixed phrases of 3-4 s from 106 speakers (57 male, 49 female). Each speaker speaks 9 phrases, 3 of which are used for enrollment and the rest for testing. As shown in Table 1, there are four possible trial types for the standard text-dependent task, where TC denotes target-correct and TW, IC and IW denote the other three types (target-wrong, impostor-correct and impostor-wrong).
TABLE 1. Trial types for the text-dependent task

                 Correct content      Wrong content
Target           TAR-correct (TC)     TAR-wrong (TW)
Impostor         IMP-correct (IC)     IMP-wrong (IW)
Since about 90% of the trials in the original trial definition belong to the very easy IW case, we generated our own trial list with the ratio TC : TW : IC : IW = 1 : 3 : 3 : 3.
Enrollment data does not match test data: as previously described, the RSR2015 Part I evaluation set contains 30 different fixed phrases. We randomly selected ten of them and generated ten evaluation subsets. Two enrollment conditions are considered: text-independent and text-dependent. In the former case, the texts of the three enrollment utterances are selected at random; in the latter, the text is shared among the enrollment utterances and does not overlap with the texts of the test utterances. The text embedding used for adaptation is computed from 10 random utterances of the target text taken from the development set, i.e., from speakers other than those in the evaluation set.
To better demonstrate the text-awareness of our speaker-text decomposition network, we used a standard x-vector system with five time-delay layers and two linear layers as our baseline system. Our proposed decomposition network is modified from the baseline, with three time-delay layers extracting the common features. Ms and Mt have the same structure, each with two time-delay layers, one statistics pooling layer and two linear layers, except that one is used for speaker classification and the other for phoneme distribution prediction. Mc has two shared linear layers and two output layers serving the two tasks. Notably, when only the speaker embedding is extracted, the decomposition network has exactly the same structure as the baseline TDNN model.
40-dimensional Fbank features are used for model training. The neural network was trained on 4 GPUs with a batch size of 256 and a learning rate of 0.01 per GPU. The momentum is 0.9 and the weight decay is 1e-4, and the model is optimized with stochastic gradient descent. Batch normalization is applied after the ReLU activation function.
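For reference, those optimizer settings correspond to something like the following sketch (data loading and the 4-GPU setup are omitted; XVectorSketch refers to the earlier illustrative model, not the actual training code):

import torch

model = XVectorSketch(feat_dim=40)   # 40-dimensional Fbank features, as stated above
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate per GPU
    momentum=0.9,
    weight_decay=1e-4,
)
# Batch size 256 per GPU; batch normalization follows each ReLU inside the model.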
PLDA scoring is applied on the Voxceleb1 evaluation set to verify the correctness of our system; for the other evaluation sets we focus on the properties of the learned embeddings, free of the influence of PLDA compensation, and use simple cosine scoring. All architectures are implemented in PyTorch. We report model performance in terms of equal error rate (EER) and minimum detection cost (minDCF), with Ptar set to 0.01.
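Equal error rate over a list of trial scores can be computed with a simple threshold sweep; the sketch below is a generic helper, not the evaluation code actually used in the experiments.

import numpy as np

def equal_error_rate(scores, labels):
    """EER from verification scores and binary labels (1 = target trial, 0 = impostor trial)."""
    order = np.argsort(scores)[::-1]               # sort trials by score, highest first
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_imp = labels.sum(), (1 - labels).sum()
    fa = np.cumsum(1 - labels) / n_imp             # false-accept rate as the threshold drops
    fr = 1.0 - np.cumsum(labels) / n_tar           # false-reject rate at the same thresholds
    idx = np.argmin(np.abs(fa - fr))               # point where the two error rates cross
    return (fa[idx] + fr[idx]) / 2.0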
Verification results for text-independent tasks
To show the correctness of the baseline model and the proposed model on text-independent tasks, we first report results for the x-vector baseline on the standard Voxceleb1 evaluation in Table 2. Given that we used only clean training data, the baseline results are quite strong. As shown in Table 2, the embeddings extracted from the speaker subnet of the proposed model reduce the EER on Voxceleb O, Voxceleb E and Voxceleb H from 2.888%, 3.055% and 5.026% to 2.595%, 2.784% and 4.703%, respectively. Similar improvements are observed for minDCF. Similar gains from the speaker-subnet embeddings are also observed in the experiments on the RSR2015 text-independent evaluation set.
TABLE 2 verification experiment of Voxceleb1 evaluation set
Table 3. experiment of RSR2015 text-independent evaluation set
Text adaptation to solve the text mismatch problem
Mismatch between training and evaluation data: as shown in Table 4, synthesizing the text information into the embedding significantly improves performance when a mismatch between training and evaluation data occurs. The EER of the system is reduced from 6.671% to 1.542%, and minDCF from 0.5234 to 0.1246.
Table 4.RSR2015 text adaptive evaluation set experiments (training and evaluation data mismatch)
Table 5 breaks the results down by error type. The TW and IW errors caused by wrong text are greatly reduced, and the speaker error (IC) also decreases from 1.919% to 1.101%.
TABLE 5 RSR2015 text adaptive evaluation set EER related to different error types (mismatch between training and evaluation data)
Registration and test data mismatch
As shown in Table 6, most systems fail at the task when mismatches occur both between training and test data and between enrollment and test data. However, by using the target-text embedding instead of the true text embedding to adapt the enrolled "speaker" embedding, system performance is significantly improved, which demonstrates the effectiveness of the text-customized speaker embeddings generated by our decomposition network.
Table 6. RSR2015 text adaptation evaluation set, text-independent / text-dependent enrollment (introduced above), EER (%) (mismatch between enrollment and test data)
Conclusion and future work
Text mismatches between training data and evaluation data cause a significant performance degradation of text-dependent speaker verification systems. One common solution is to collect application-specific training data that shares the same text content as the evaluation data. To avoid this expensive and inflexible data collection process and to exploit the large amount of unconstrained speech data, we propose a "text adaptive speaker verification" framework in which text-independent speaker embeddings are adapted into text-customized speaker embeddings. A speaker-text decomposition network is presented that first decomposes a speech segment into a text-independent speaker embedding and a speaker-independent text embedding, and then recombines them into a single embedding containing both kinds of information. We first validated the proposed method, without text adaptation, on the standard text-independent Voxceleb evaluation set and observed consistent performance improvements on all three test lists. The results on the three customized evaluation sets derived from the RSR2015 data set indicate that the proposed method with text adaptation can greatly reduce the errors caused by text mismatch between training and evaluation data and between enrollment and test data.
In future work, we will devote more effort to letting the model use simple plain text, instead of text embeddings computed from specific audio, to text-adapt the speaker embedding.
Referring to fig. 3, a block diagram of a speaker verification apparatus according to an embodiment of the invention is shown.
As shown in fig. 3, the speaker verification apparatus 300 includes an extraction module 310, a parallel sub-network module 320, a merging module 330, and a verification module 340.
Wherein, the extracting module 310 is configured to input the audio data to the general feature extractor to extract the preset feature; the parallel subnetwork module 320 is configured to input the extracted preset features into a speaker classification subnetwork and a phoneme distribution prediction subnetwork respectively, wherein speaker embedding is obtained through extraction of the speaker classification subnetwork, and text embedding is obtained through extraction of the phoneme distribution prediction subnetwork; a merging module 330 configured to merge the speaker embedding and the text embedding through a merging subnetwork to obtain a single embedding of the speaker embedding and the text embedding; and a confirmation module 340 configured to perform speaker confirmation based on the single embedding.
In some alternative embodiments, the general feature extractor, the classification sub-network, the phoneme distribution prediction sub-network and the merging sub-network are trained jointly.
It should be understood that the modules depicted in fig. 3 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 3, and are not described again here.
It should be noted that the modules in the embodiments of the present application are not intended to limit the scope of the present application; for example, a screening module might be described as a module that, in response to a rule configured by a user, screens out from at least one article list a first article meeting the rule. In addition, the related functional modules may also be implemented by a hardware processor; for example, the screening module may likewise be implemented by a processor, which is not described again here.
In other embodiments, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the speaker verification method in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting the audio data to a general feature extractor to extract a preset feature;
inputting the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained by extraction of the speaker classification sub-network, and text embedding is obtained by extraction of the phoneme distribution prediction sub-network;
combining the speaker embedding and the text embedding through a combining sub-network to obtain a single embedding of the speaker embedding and the text embedding;
speaker verification is performed based on the single embedding.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speaker verification apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, and these remote memories may be connected to the speaker verification device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes any one of the speaker verification methods described above.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device includes: one or more processors 410 and a memory 420, with one processor 410 being an example in fig. 4. The apparatus of the speaker verification method may further include: an input device 430 and an output device 440. The processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or other means, such as the bus connection in fig. 4. The memory 420 is a non-volatile computer-readable storage medium as described above. The processor 410 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 420, so as to implement the speaker verification method of the above-mentioned method embodiment. The input device 430 can receive input numeric or character information and generate key signal inputs related to user settings and function controls of the speaker verification device. The output device 440 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a speaker verification apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
inputting the audio data to a general feature extractor to extract a preset feature;
inputting the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained by extraction of the speaker classification sub-network, and text embedding is obtained by extraction of the phoneme distribution prediction sub-network;
combining the speaker embedding and the text embedding through a combining sub-network to obtain a single embedding of the speaker embedding and the text embedding;
speaker verification is performed based on the single embedding.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.
(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A speaker verification method, comprising:
inputting the audio data to a general feature extractor to extract a preset feature;
inputting the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained by extraction of the speaker classification sub-network, and text embedding is obtained by extraction of the phoneme distribution prediction sub-network;
combining the speaker embedding and the text embedding through a combining sub-network to obtain a single embedding of the speaker embedding and the text embedding;
speaker verification is performed based on the single embedding,
wherein the audio data comprises training data and assessment data, the assessment data comprising enrollment data and test data, the method further comprising:
in text-dependent speaker verification, if a text mismatch occurs between the training data and the assessment data, obtaining the text embedding used for coupling with the speaker embedding from the actual enrollment data and the test data;
if a text mismatch occurs between the enrollment data and the test data, computing the text embedding from pre-collected data;
and performing text adaptation on the enrolled speaker embedding according to the text embedding.
2. The method of claim 1, wherein the generic feature extractor, the classification sub-network, the phoneme distribution prediction sub-network, and the merging sub-network are trained using joint training.
3. The method of claim 2, wherein the training pairs for training the merging subnetwork comprise training pairs from the same sentence and training pairs from different sentences.
4. The method of any of claims 1-3, wherein the speaker classification sub-network and the phoneme distribution prediction sub-network each comprise two time-delay layers, one statistics pooling layer, and two linear layers, and the merging sub-network comprises two linear layers and two output layers.
5. The method of claim 4, wherein the phoneme distribution prediction subnetwork predicts the number of occurrences of each phoneme class in the normalized sentence.
6. A speaker verification apparatus, comprising:
an extraction module configured to input the audio data to the general feature extractor to extract a preset feature;
the parallel sub-network module is configured to input the extracted preset features into a speaker classification sub-network and a phoneme distribution prediction sub-network respectively, wherein speaker embedding is obtained through extraction of the speaker classification sub-network, and text embedding is obtained through extraction of the phoneme distribution prediction sub-network;
the merging module is configured to merge the speaker embedding and the text embedding through a merging sub-network to obtain a single embedding of the speaker embedding and the text embedding;
a confirmation module configured to perform speaker confirmation based on the single embedding,
wherein the audio data comprises training data and assessment data, the assessment data comprising enrollment data and test data, the validation module further configured to:
in text-dependent speaker verification, if a text mismatch occurs between the training data and the assessment data, obtain the text embedding used for coupling with the speaker embedding from the actual enrollment data and the test data;
if a text mismatch occurs between the enrollment data and the test data, compute the text embedding from pre-collected data;
and perform text adaptation on the enrolled speaker embedding according to the text embedding.
7. The apparatus of claim 6, wherein the generic feature extractor, the classification sub-network, the phoneme distribution prediction sub-network, and the merging sub-network are trained using joint training.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the steps of the method of any one of claims 1 to 5.
CN201911412555.6A 2019-12-31 2019-12-31 Speaker confirmation method and device Active CN111081255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911412555.6A CN111081255B (en) 2019-12-31 2019-12-31 Speaker confirmation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911412555.6A CN111081255B (en) 2019-12-31 2019-12-31 Speaker confirmation method and device

Publications (2)

Publication Number Publication Date
CN111081255A CN111081255A (en) 2020-04-28
CN111081255B (en) 2022-06-03

Family

ID=70320603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911412555.6A Active CN111081255B (en) 2019-12-31 2019-12-31 Speaker confirmation method and device

Country Status (1)

Country Link
CN (1) CN111081255B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581386A (en) * 2020-05-08 2020-08-25 深圳市第五空间网络科技有限公司 Construction method, device, equipment and medium of multi-output text classification model
CN111429923B (en) * 2020-06-15 2020-09-29 深圳市友杰智新科技有限公司 Training method and device of speaker information extraction model and computer equipment
CN111862934B (en) * 2020-07-24 2022-09-27 思必驰科技股份有限公司 Method for improving speech synthesis model and speech synthesis method and device
CN111599382B (en) * 2020-07-27 2020-10-27 深圳市声扬科技有限公司 Voice analysis method, device, computer equipment and storage medium
CN111785284A (en) * 2020-08-19 2020-10-16 科大讯飞股份有限公司 Method, device and equipment for recognizing text-independent voiceprint based on phoneme assistance

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5839103A (en) * 1995-06-07 1998-11-17 Rutgers, The State University Of New Jersey Speaker verification system using decision fusion logic
CN1547191A (en) * 2003-12-12 2004-11-17 北京大学 Semantic and sound groove information combined speaking person identity system
CN104732978B (en) * 2015-03-12 2018-05-08 上海交通大学 The relevant method for distinguishing speek person of text based on combined depth study
CN106340298A (en) * 2015-07-06 2017-01-18 南京理工大学 Voiceprint unlocking method integrating content recognition and speaker recognition
CN107274890B (en) * 2017-07-04 2020-06-02 清华大学 Voiceprint spectrum extraction method and device
CN108417217B (en) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 Speaker recognition network model training method, speaker recognition method and system
CN110188338B (en) * 2018-02-23 2023-02-21 富士通株式会社 Text-dependent speaker verification method and apparatus
CN110517695A (en) * 2019-09-11 2019-11-29 国微集团(深圳)有限公司 Verification method and device based on vocal print

Also Published As

Publication number Publication date
CN111081255A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111081255B (en) Speaker confirmation method and device
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN104143326B (en) A kind of voice command identification method and device
CN102646416B (en) Speaker authentication
Das et al. Development of multi-level speech based person authentication system
TWI527023B (en) A voiceprint recognition method and apparatus
CN103221996B (en) For verifying the equipment of the password modeling of speaker and method and speaker verification's system
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
US20140214417A1 (en) Method and device for voiceprint recognition
EP2770502A1 (en) Method and apparatus for automated speaker parameters adaptation in a deployed speaker verification system
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
KR20170001548A (en) Method and device for voiceprint indentification
CN108766445A (en) Method for recognizing sound-groove and system
KR20160011709A (en) Method, apparatus and system for payment validation
CN110827833A (en) Segment-based speaker verification using dynamically generated phrases
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
Prasad et al. Intelligent chatbot for lab security and automation
CN112017694A (en) Voice data evaluation method and device, storage medium and electronic device
CN110379433A (en) Method, apparatus, computer equipment and the storage medium of authentication
Chakroun et al. Robust text-independent speaker recognition with short utterances using Gaussian mixture models
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN109859747A (en) Voice interactive method, equipment and storage medium
CN111081256A (en) Digital string voiceprint password verification method and system
CN113362829B (en) Speaker verification method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant
GR01 Patent grant