CN113362829B - Speaker verification method, electronic device and storage medium - Google Patents


Info

Publication number
CN113362829B
CN113362829B (application CN202110623701.0A)
Authority
CN
China
Prior art keywords
training
speaker
data
extractor
text
Prior art date
Legal status
Active
Application number
CN202110623701.0A
Other languages
Chinese (zh)
Other versions
CN113362829A (en)
Inventor
钱彦旻
韩冰
陈正阳
周之恺
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202110623701.0A priority Critical patent/CN113362829B/en
Publication of CN113362829A publication Critical patent/CN113362829A/en
Application granted granted Critical
Publication of CN113362829B publication Critical patent/CN113362829B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a speaker verification method comprising the following steps: preprocessing the training data in a training sample set; performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples; processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and completing speaker verification based on the speaker embedding features. By fine-tuning the pre-trained speaker embedding extractor on the preprocessed training data, the invention improves the performance of text-dependent speaker verification.

Description

Speaker verification method, electronic device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker verification method, electronic equipment and a storage medium.
Background
With the development of deep learning, speaker verification systems have improved greatly. Different architectures, loss functions and training strategies have been proposed in the prior art to improve system performance under different conditions. However, some challenges remain unsolved when deploying speaker verification systems, such as short audio duration and cross-language conditions.
Disclosure of Invention
An embodiment of the present invention provides a speaker verification method, an electronic device and a storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker verification method, including:
preprocessing training data in a training sample set;
performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples;
processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
completing speaker verification based on the speaker embedding features.
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes:
performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes:
performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor.
In a second aspect, the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any speaker verification method of the present invention.
In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speaker verification methods of the present invention as described above.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, which includes a computer program stored on a storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to perform any one of the speaker verification methods described above.
Embodiments of the invention preprocess the training data in a training sample set; perform fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples; process the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and complete speaker verification based on the speaker embedding features, thereby improving speaker verification performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below illustrate some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flow chart of one embodiment of a speaker verification method of the present invention;
FIG. 2 is a flow chart of another embodiment of a speaker verification method of the present invention;
FIG. 3a is a diagram illustrating an embodiment of fine tuning of a speaker embedding extractor according to the present invention;
FIG. 3b is a diagram illustrating another embodiment of fine-tuning a speaker embedding extractor according to the present invention;
FIG. 4a is a schematic diagram of another embodiment of fine tuning of a speaker embedding extractor in accordance with the present invention;
FIG. 4b is a schematic diagram of another embodiment of fine tuning of a speaker embedding extractor in accordance with the present invention;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, an embodiment of the present invention provides a speaker verification method, including:
s10, preprocessing training data in the training sample set;
s20, carrying out fine tuning training based on text information on the speaker embedding extractor obtained by pre-training based on the training sample after pre-processing; illustratively, the speaker embedding extractor is pre-trained based on large-scale text-independent data.
And S30, processing the audio of the speaker to be verified by adopting the speaker embedding extractor obtained by fine tuning training to obtain speaker embedding characteristics.
And S40, finishing the speaker verification based on the speaker embedding characteristics.
The embodiment of the invention preprocesses the training data in the training sample set; performing fine tuning training based on text information on a speaker embedding extractor obtained by pre-training large-scale text-independent data based on the preprocessed training sample; processing the audio of the speaker to be verified by adopting a speaker embedded extractor obtained by fine tuning training to obtain speaker embedded characteristics; speaker verification is completed based on the speaker embedding features, and performance of speaker verification is improved.
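For concreteness, the verification in step S40 can be implemented by scoring the speaker embedding of the test audio against the enrollment embeddings of the claimed speaker. The patent does not fix a particular back-end at this point (both cosine scoring and PLDA appear later), so the following is only a minimal cosine-scoring sketch, and the decision threshold is illustrative.

```python
import numpy as np

def cosine_score(enroll_embs: np.ndarray, test_emb: np.ndarray) -> float:
    """Average the enrollment embeddings into a speaker model and score the
    test embedding against it with cosine similarity."""
    model = enroll_embs.mean(axis=0)
    model = model / np.linalg.norm(model)
    test = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(model, test))

def verify(enroll_embs: np.ndarray, test_emb: np.ndarray, threshold: float = 0.5) -> bool:
    # Accept the trial if the cosine score exceeds a tuned decision threshold.
    return cosine_score(enroll_embs, test_emb) >= threshold
```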
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning. Illustratively, two classifiers are used: one for speaker classification and one for text (phrase) classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification. Illustratively, one classifier is used to classify the "speaker x text" joint labels.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor. Illustratively, 10 classifiers may be used to classify the speakers of the different text phrases, respectively. Alternatively, a contrastive-learning strategy is added, which increases the similarity between utterances of the same speaker on the same text and decreases the similarity between different speakers on the same text.
As shown in fig. 2, an embodiment of the present invention provides a speaker verification system, including:
a data preprocessing program module 10, configured to preprocess training data in a training sample set;
an extractor fine-tuning program module 20, configured to perform text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples;
a feature extraction program module 30, configured to process the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
a verification program module 40, configured to complete speaker verification based on the speaker embedding features.
The embodiment of the invention preprocesses the training data in the training sample set; performs text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples; processes the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and completes speaker verification based on the speaker embedding features, thereby improving speaker verification performance.
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the system is further configured to: pre-train with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning. Illustratively, two classifiers are used: one for speaker classification and one for text (phrase) classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification. Illustratively, one classifier is used to classify the speaker x text joint labels.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor. Illustratively, 10 classifiers may be used to classify the speakers of the different text phrases, respectively. Alternatively, a contrastive-learning strategy is added, which increases the similarity between utterances of the same speaker on the same text and decreases the similarity between different speakers on the same text.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In order to describe the technical solutions of the present invention more clearly and to demonstrate more directly the effectiveness and benefits of the present invention over the prior art, the technical background, the technical solutions and the experiments performed are described in more detail below.
Abstract: This document describes the systems for text-dependent and text-independent speaker verification that we submitted to the Short-duration Speaker Verification (SdSV) Challenge 2021. In this challenge, we explore the effect of different embedding extractors for extracting robust speaker embeddings. For the text-independent task, adaptive score normalization is adopted, which improves system performance under the cross-language verification condition. For the text-dependent task, we focus mainly on strategies for fine-tuning models pre-trained on large-scale out-of-domain data using the in-domain data. To improve the discrimination between different speakers who speak the same phrase, we propose several novel fine-tuning strategies using text information, as well as a neural probabilistic linear discriminant analysis (PLDA). With these training strategies, system performance is further improved. Finally, the scores of systems based on different training strategies are fused to obtain a fusion system, which reaches a minDCF of 0.0473 on Task1 and 0.0581 on Task2.
1. Introduction
In the present invention, the SdSV Challenge 2021 includes two tasks. Task1 is a text-dependent task, in which the speaker verification system should verify both the identity of the test speaker and the spoken phrase. Task2 is a text-independent task, in which the system should only consider the speaker identity. In particular, the SdSV Challenge introduces a new and challenging verification condition in Task2, namely cross-language verification, where a speaker may speak different languages during the enrollment and test phases.
SdSV 2021 is the second challenge of the SdSV series, and many competitive systems were proposed in the previous challenge. For the text-independent task of the previous challenge, Jenthe et al. proposed a new data mining strategy (HPM) and introduced adaptive score normalization to improve the robustness of the system under cross-language verification. Peng et al. introduced a greedy fusion algorithm to further improve the performance of the fusion system. For the text-dependent task, teams primarily focused on back-end optimization.
In this challenge, we first explore different network structures that perform well and train them on all available data. Then, we focus on in-domain fine-tuning strategies to further improve system performance. To address the cross-language verification problem in the text-independent task, we train an additional language identification network and introduce the language information into an adaptive score normalization process. For the text-dependent task, we implement different approaches to increase the separation between target trials and the different types of non-target verification pairs. We use an ASR system to classify the spoken phrases during the test phase and directly filter out phrase-mismatch verification pairs (where the speaker uttered the wrong verification phrase). To better distinguish different speakers who speak the same phrase, we propose several novel phrase-aware fine-tuning strategies and a phrase-aware neural PLDA. With these training strategies, the performance of our systems is further improved.
The rest of the text is organized as follows: Section 2 introduces the datasets used in this challenge. Section 3 introduces our embedding extractor network structures and the proposed fine-tuning strategies. The experimental results and corresponding analyses are given in Section 4. Finally, conclusions are drawn in Section 5.
2. Data set
The SdSV Challenge imposes the constraint that systems may be trained only on specified datasets. The primary training and evaluation data for the SdSV Challenge is the DeepMine dataset, recorded in real environments in Iran. The acquisition protocol of the dataset was designed so that various real noises are present during recording. The primary language of the dataset is Persian, and most participants also took part in the English portion.
Task1 in-domain data: this dataset can be used to construct a text-dependent speaker verification system. It contains 101k utterances from 963 different speakers. The content of all utterances is fixed, covering five Persian phrases and five English phrases.
Task2 in-domain data: the content of the utterances in this dataset is unconstrained. It contains 125k utterances collected from 588 speakers, some of whom have only Persian utterances.
In addition to intra-domain training data, other open datasets that are allowed to be used in the training process are as follows:
voxceleb: voxceleb 1&2 contains over one million audios from 7245 speakers, which were collected from videos uploaded to YouTube.
Librispeech: a data set containing 281k voices from 2338 speakers. It comes from audio books, most of the language is american english.
Common Voice Farsi: this is a set of transcribed speech that is composed of multiple languages. Only a portion of the gaussian is used in this challenge.
2.2 evaluation
The evaluation data for Tasks 1&2 are all drawn from the DeepMine in-domain data.
In Task1, each verification pair includes a test segment and a model identifier that indicates three enrollment audios and the phrase ID spoken in those utterances. The trials fall into four basic types: target-correct (TC), target-wrong (TW), impostor-correct (IC) and impostor-wrong (IW). A text-dependent speaker verification system should accept the TC verification pairs and reject the other three types.
In Task2, the enrollment data consist of one to several variable-length Persian utterances, while the test utterances may come from another language (English). For this task, the system should accept a trial whenever the enrollment and test utterances are from the same speaker, regardless of any language mismatch.
The main metric adopted by SdSV is the normalized minimum detection cost function (minDCF), defined as a weighted sum of the false-alarm and miss probabilities.
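For reference, the normalized minimum detection cost has the standard form below; the exact operating point (costs and target prior) is set by the challenge evaluation plan and is not restated in this text, so the values quoted in the comment are an assumption.

```latex
% Detection cost at threshold \theta and its normalized minimum
% (a typical operating point, e.g. P_{tar}=0.01, C_{miss}=10, C_{fa}=1, is assumed here):
C_{\mathrm{Det}}(\theta) = C_{\mathrm{miss}}\,P_{\mathrm{miss}}(\theta)\,P_{\mathrm{tar}}
  + C_{\mathrm{fa}}\,P_{\mathrm{fa}}(\theta)\,\bigl(1 - P_{\mathrm{tar}}\bigr),
\qquad
\mathrm{minDCF} = \min_{\theta}\;
  \frac{C_{\mathrm{Det}}(\theta)}{\min\bigl(C_{\mathrm{miss}}P_{\mathrm{tar}},\;C_{\mathrm{fa}}(1-P_{\mathrm{tar}})\bigr)}
```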
3. Methods
In this section we introduce the embedding extractors, fine-tuning strategies and several post-processing methods used in our systems. In our experiments, the embedding extractors are first trained in a text-independent manner on all available data for Task1 and Task2. We then fine-tune the pre-trained models using the in-domain data. Finally, system performance is further improved with post-processing methods.
3.1 speaker embedding extractor
To construct a robust speaker verification system for the SdSV Challenge, all datasets (including VoxCeleb, Common Voice Farsi, Librispeech, and the DeepMine in-domain data) were combined to train the speaker embedding extractors. To reduce the duration mismatch between training data and test data, all utterances were randomly divided into 2-second segments during the training phase. We use 40-dimensional filterbank (Fbank) acoustic features with a frame length of 25 ms and a shift of 10 ms. To increase the amount and diversity of training data, we apply online data augmentation during training. Additive noise from the MUSAN corpus and room impulse responses (RIRs) are used for augmentation.
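A minimal sketch of this preprocessing pipeline is given below: random 2-second chunking, additive-noise mixing at a chosen SNR, and 40-dimensional Fbank extraction with 25 ms frames and 10 ms shift. Loading MUSAN noise files, the SNR sampling ranges and RIR convolution are omitted, and the helper names are illustrative rather than the authors' actual code.

```python
import random
import torch
import torchaudio

SAMPLE_RATE = 16000
SEG_SAMPLES = 2 * SAMPLE_RATE  # 2-second training chunks

def random_chunk(wav: torch.Tensor) -> torch.Tensor:
    """Randomly crop (or zero-pad) a waveform of shape (1, T) to 2 seconds."""
    if wav.shape[-1] <= SEG_SAMPLES:
        return torch.nn.functional.pad(wav, (0, SEG_SAMPLES - wav.shape[-1]))
    start = random.randint(0, wav.shape[-1] - SEG_SAMPLES)
    return wav[..., start:start + SEG_SAMPLES]

def add_noise(wav: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a noise clip (e.g. from MUSAN) into the speech at a target SNR."""
    noise = random_chunk(noise)
    speech_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = (speech_power / (10 ** (snr_db / 10) * noise_power)).sqrt()
    return wav + scale * noise

def extract_fbank(wav: torch.Tensor) -> torch.Tensor:
    """40-dimensional filterbank features, 25 ms frame length, 10 ms shift."""
    return torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=40, frame_length=25.0, frame_shift=10.0,
        sample_frequency=SAMPLE_RATE)
```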
In our system, we mainly use three different speaker verification architectures: ResNet34, ECAPA-TDNN and DPN68.
ResNet34: ResNet achieves excellent performance in speaker verification thanks to its efficient modeling of complex data structures. We use the ResNet34 introduced in prior work as our ResNet-based network structure. In this structure, the input features are processed by an initial convolutional layer and four residual blocks, after which a statistics pooling layer aggregates the frame-level features into a segment-level representation. Finally, a 256-dimensional fully connected layer converts it into a fixed-dimensional vector representing the speaker.
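The statistics pooling step mentioned above can be sketched as follows; this shows only the pooling layer, with the residual trunk and the final 256-dimensional embedding layer omitted.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Aggregate frame-level features (batch, channels, frames) into a
    segment-level vector by concatenating per-channel mean and std."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1)
        std = x.std(dim=-1).clamp(min=1e-7)  # floor for numerical stability
        return torch.cat([mean, std], dim=1)  # (batch, 2 * channels)
```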
ECAPA-TDNN: ECAPA-TDNN has achieved very good results in speaker verification and was used in the VoxSRC2020 winning system. In our experiments, we set the number of channels of ECAPA-TDNN to 1024. We tried channel attention both with and without global context; the variant with global context is denoted Ecapa-Glob and the plain variant Ecapa.
DPN68: the dual-path network (DPN), which combines the advantages of ResNet and DenseNet, has previously been applied to the speaker verification task. Here we use the DPN68 architecture as one of the embedding extractors in our system.
Additive angular margin softmax (AAM-softmax) loss is used to optimize all embedding extractors. The scale and margin of the AAM loss are set to 32 and 0.2, respectively. Each model is trained for 165 epochs, and the learning rate decays exponentially from 0.1 to 1e-5 during training.
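A minimal sketch of an additive angular margin softmax head with the quoted scale and margin (s = 32, m = 0.2) is shown below; it follows the common ArcFace-style formulation and is not necessarily identical to the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax head (ArcFace-style), returning the loss."""
    def __init__(self, emb_dim: int, num_classes: int, s: float = 32.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)  # add the angular margin on the target class
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.s * (one_hot * target + (1 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)
```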
3.2, in-Domain Fine tuning
In this section, we will introduce our fine-tuning strategy based on the pre-trained model introduced in the previous section to further improve system performance on the intra-domain evaluation set.
3.2.1 Text-dependent fine-tuning for Task1
To encode the phrase information into the speaker embeddings and further enlarge the distance between different speakers who speak the same phrase, we fine-tune the embedding extractor in a text-dependent manner for Task1. The strategies are introduced as follows:
Speaker + phrase: as shown in FIG. 3a, in the fine-tuning stage there are two separate heads for speaker and phrase classification, and we fine-tune the embedding extractor in a multi-task manner.
Speaker x phrase: speech in different phrases spoken by the same speaker are considered to be in different categories. As shown in fig. 3b, there is only one classification header, but both speaker and phrase information are considered.
Since phrase classification requires the full information of a sentence, the inputs are not chunked during this training stage, and variable-length inputs within the same batch are zero-padded to the same length.
3.2.2 Text-independent fine-tuning for Task1 and Task2
In our experiments, phrase-mismatch verification pairs (IW and TW) in Task1 can be filtered out by the ASR system described in Section 3.3.3. The model then only needs to verify the speaker identity of the two audio segments, so Task1 can also be treated as a text-independent task. For the conventional fine-tuning of Task1 and Task2, we optimize the pre-trained model on the in-domain data using AAM-softmax.
In particular, for Task1, to enhance the model's ability to distinguish speakers uttering the same phrase, we propose two phrase-aware text-independent fine-tuning strategies: phrase-aware multi-head training (PMT) and phrase-aware contrastive training (PCT).
PMT: for task1, all phrases were extracted from a fixed set of 10 phrases consisting of 5 Persian and 5 English phrases. As shown in fig. 4a, different speaker headers are used for utterances of different phrases, and the distance between different speakers in the same phrase can be extended by this training strategy.
PCT: as shown in fig. 4b, in this fine tuning strategy we introduce a comparative learning penalty that can be co-optimized with the AAM softmax penalty. In our experiment, we used generalized end-to-end loss to calculate the contrast loss, and we sampled two utterances for each speaker in the training batch. To improve the distinction of different speakers speaking the same phrase, we restrict all utterances in the same batch to be from the same phrase.
3.3, post-processing
3.3.1 Language-adaptive score normalization
The most difficult condition in Task2 is cross-language verification. To minimize the impact of language mismatch between different utterances, we introduce language information into adaptive score normalization (AS-Norm), as defined by Equation 1; a reconstruction of the standard AS-Norm form is given below. The cohort of each enrollment model is selected so that the language of the cohort matches the language of the test utterance. To detect the language of the test utterance, a TDNN-based language identification model is trained on the Task2 in-domain data.
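Since the original Equation (1) is only available as an image, the standard adaptive s-norm form it appears to describe is reconstructed below, with the cohort restricted to the language of the test utterance; the notation is ours.

```latex
% Adaptive score normalization with a language-matched cohort (reconstruction of Eq. 1).
% \mathcal{C}_e^{\ell(t)}, \mathcal{C}_t^{\ell(t)}: top-scoring cohort utterances for the
% enrollment model e and test utterance t, restricted to the test-utterance language \ell(t).
s_{\mathrm{norm}}(e,t) = \frac{1}{2}\left(
  \frac{s(e,t)-\mu\!\left(S_e\!\left(\mathcal{C}_e^{\ell(t)}\right)\right)}
       {\sigma\!\left(S_e\!\left(\mathcal{C}_e^{\ell(t)}\right)\right)}
  +
  \frac{s(e,t)-\mu\!\left(S_t\!\left(\mathcal{C}_t^{\ell(t)}\right)\right)}
       {\sigma\!\left(S_t\!\left(\mathcal{C}_t^{\ell(t)}\right)\right)}
\right)
```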
3.3.2 phrase-aware neural PLDA
Neural PLDA (NPLDA) was successfully applied in the NICT system for the SdSV Challenge 2020. To strengthen NPLDA on the text-dependent task, we constrain the NPLDA input pairs to lie within the same phrase, which improves the discrimination of different speakers within the same phrase. Our NPLDA is initialized from a phrase-dependent PLDA model trained on the Task2 in-domain data, with parameters set to the default configuration except that the learning rate is 5e-5 and the number of training epochs is 5.
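A sketch of an NPLDA-style scoring head is shown below: a quadratic Gaussian-backend scoring function with trainable parameters that can be optimized discriminatively. Initialization from the generative phrase-dependent PLDA and the same-phrase pair constraint described above are not shown, and this parameterization is a common NPLDA form rather than the authors' exact model.

```python
import torch
import torch.nn as nn

class NPLDAScore(nn.Module):
    """Trainable quadratic scoring function between enrollment and test embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.P = nn.Parameter(torch.eye(dim))          # cross (between-pair) term
        self.Q = nn.Parameter(torch.zeros(dim, dim))   # self (within-vector) terms
        self.c = nn.Parameter(torch.zeros(dim))
        self.k = nn.Parameter(torch.zeros(1))

    def forward(self, xe: torch.Tensor, xt: torch.Tensor) -> torch.Tensor:
        cross = (xe @ self.P * xt).sum(-1)
        self_e = (xe @ self.Q * xe).sum(-1)
        self_t = (xt @ self.Q * xt).sum(-1)
        bias = (xe + xt) @ self.c
        return cross + self_e + self_t + bias + self.k
```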
3.3.3 ASR System
For the text-dependent Task1, an ASR system is trained to filter trials whose enrollment and test utterances come from different phrases. We use the ESPnet LibriSpeech Conformer-based joint CTC-attention automatic speech recognition (ASR) model. The ASR system is first trained on the Librispeech dataset and then fine-tuned on the Task1 in-domain data. During evaluation, we use the ASR system to recognize the phrases and classify the textual content of each speech segment according to its Levenshtein edit distance to the reference phrases. Based on the phrase labels generated by the ASR system, we filter out IW and TW verification pairs directly by setting their scores to a very low value. Note that all Task1 results reported in the experiments incorporate this ASR-based filtering.
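The filtering step can be sketched as follows: the ASR hypothesis for each segment is assigned to the closest of the 10 reference phrases by edit distance, and any trial whose enrollment and test phrase labels disagree is forced to a very low score. The ESPnet Conformer model itself is not shown, and the score floor is illustrative.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify_phrase(hypothesis: str, reference_phrases: List[str]) -> int:
    """Assign the recognized text to the closest of the reference phrases."""
    return min(range(len(reference_phrases)),
               key=lambda i: levenshtein(hypothesis, reference_phrases[i]))

def filtered_score(score: float, enroll_phrase_id: int, test_phrase_id: int,
                   floor: float = -1000.0) -> float:
    """Force phrase-mismatch trials (IW / TW) to a very low score."""
    return score if enroll_phrase_id == test_phrase_id else floor
```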
4. Experiments
4.1, task 1: text dependent speaker verification
4.1.1 Pre-training model
All the embedding extractors introduced in Section 3.1 were first pre-trained on all available datasets, and the corresponding results are listed in Table 1. The results show that the best performance is obtained by ResNet34 with cosine similarity scoring. DPN68 also outperforms the ECAPA-TDNN-based models, but is inferior to ResNet34. Furthermore, in most cases cosine scoring outperforms PLDA, so we only provide cosine results for analysis in the following sections.
Table 1: comparison of results for the Pre-trained model of task1
4.1.2, Intra-Domain Fine tuning
Text-related mode: table 2 illustrates a comparison of different text dependent pattern fine tuning strategies introduced in 3.2.1. As can be seen from the experimental results based on ResNet34, fine tuning will improve text-related tasks. In particular, the "speaker + phrase" performed best in these results.
Table 2: text-dependent mode hinting for task1
Table 3: the main result of task 2. The data in the domain of task2 is fine-tuned using AAM softmax.
Text independent mode: this section studied the phrase-aware hinting strategy for our proposed text-independent model, with the results shown in table 4. It is clear that the performance of all the fine tuning systems is superior to the pre-trained model. In this table we also list the fine tuning results for PCT (phrase-aware contrast training) and PMT (phrase-aware multi-head training) for all models. Both PCT and PMT achieved excellent performance improvements in both EER and minDCF compared to conventional intra-domain data trimming methods. Furthermore, ResNet34 using the PCT strategy performed best in all models.
Table 4: text-independent phrase-aware hinting of task1 conventional hinting without PCT and PMT was also performed on task1 for comparison, where data within the domain was trained using AAM softmax in a text-independent mode.
4.1.3 phrase-aware neural PLDA
As shown in Table 1, PLDA fails to provide satisfactory performance compared to cosine similarity. To further improve the fusion system, we also trained the NPLDA described in Section 3.3.2 to improve the PLDA results used in the final fusion. We investigated the model fine-tuned with the PCT strategy that obtained the best results in Section 4.1.2; the corresponding NPLDA results are shown in Table 5 and are comparable to the cosine back-end results in terms of minDCF.
Table 5: results of NPLDA task1
4.2, task 2: text independent speaker verification
The main results for Task2 are listed in Table 3. For Task2, all embedding extractors are likewise first pre-trained on all available datasets and then fine-tuned on the in-domain data. The results show that DPN68 achieves the best results among all models. Furthermore, fine-tuning on in-domain data significantly improves performance compared to the pre-trained models. In addition, we performed experiments with AS-Norm; the best minDCF of 0.0752 was obtained by DPN68 with AS-Norm.
4.3 fusion results
Table 6: fused results on Dev and Eval sets
Finally, the scores from all systems (covering different models, back-ends and fine-tuning strategies) are combined by a weighted sum to obtain the fused system, with the fusion weights tuned on the development set. The results of the fusion system on the development and evaluation sets are shown in Table 6. As seen from the table, fusion further improves performance. Our primary submission was generated by the fusion system, reaching a minDCF of 0.0473 on Task1 (rank 3) and 0.0581 on Task2 (rank 8).
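A minimal weighted-sum fusion sketch is given below; the per-system score normalization before weighting and the way the weights are chosen (e.g. grid search or logistic regression on the development set) are assumptions, as the text only states that the weights are tuned on the development set.

```python
import numpy as np

def fuse_scores(system_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted-sum fusion of trial scores.

    system_scores: shape (num_systems, num_trials); weights: shape (num_systems,).
    Each system's scores are z-normalized before weighting so the weights are
    comparable across systems."""
    mu = system_scores.mean(axis=1, keepdims=True)
    sigma = system_scores.std(axis=1, keepdims=True)
    normed = (system_scores - mu) / sigma
    return weights @ normed  # fused score per trial
```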
5. Conclusion
Herein, we have presented in detail the systems for Task1 and Task2 that we submitted to the SdSV Challenge 2021. We explored several powerful embedding extractors in the experiments. For the text-independent task, a language identification model is used to introduce language information into AS-Norm. For the text-dependent task, we filter IW and TW verification pairs with an ASR system. We propose several text-information-based fine-tuning strategies and post-processing methods to enhance the model's ability to distinguish different speakers uttering the same phrase. Based on these strong systems, our final fusion system ranked third on Task1 and eighth on Task2.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the speaker verification methods of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speaker verification methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speaker verification method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the speaker verification method.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device for performing a speaker verification method according to another embodiment of the present application, and as shown in fig. 5, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 5.
The apparatus for performing the speaker verification method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speaker verification method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 520, so as to implement the speaker verification method of the above-described method embodiment.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speaker verification apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, and these remote memories may be connected to the speaker verification device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and functional control of the speaker verification device. The output device 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the speaker verification method in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, feature phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the essence of the above technical solutions, or the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A speaker verification method, comprising:
preprocessing training data in a training sample set;
performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples;
processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
completing speaker verification based on the speaker embedding features,
wherein performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples comprises:
performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning; or
fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification; or
performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor.
2. The method of claim 1, wherein the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
3. The method of claim 1, wherein preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
4. The method of any one of claims 1 to 3, wherein the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
5. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
6. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN202110623701.0A 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium Active CN113362829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110623701.0A CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110623701.0A CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113362829A CN113362829A (en) 2021-09-07
CN113362829B true CN113362829B (en) 2022-05-24

Family

ID=77532167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110623701.0A Active CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113362829B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782034A (en) * 2021-09-27 2021-12-10 镁佳(北京)科技有限公司 Audio identification method and device and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443508B2 (en) * 2013-09-11 2016-09-13 Texas Instruments Incorporated User programmable voice command recognition based on sparse features
US10706856B1 (en) * 2016-09-12 2020-07-07 Oben, Inc. Speaker recognition using deep learning neural network
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN107680600B (en) * 2017-09-11 2019-03-19 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN110310647B (en) * 2017-09-29 2022-02-25 腾讯科技(深圳)有限公司 Voice identity feature extractor, classifier training method and related equipment
CN108281146B (en) * 2017-12-29 2020-11-13 歌尔科技有限公司 Short voice speaker identification method and device
CN108417217B (en) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 Speaker recognition network model training method, speaker recognition method and system
CN111243602B (en) * 2020-01-06 2023-06-06 天津大学 Voiceprint recognition method based on gender, nationality and emotion information
CN111429923B (en) * 2020-06-15 2020-09-29 深圳市友杰智新科技有限公司 Training method and device of speaker information extraction model and computer equipment

Also Published As

Publication number Publication date
CN113362829A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN108109613B (en) Audio training and recognition method for intelligent dialogue voice platform and electronic equipment
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
EP2410514B1 (en) Speaker authentication
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108766445A (en) Method for recognizing sound-groove and system
Zhang et al. Seq2seq attentional siamese neural networks for text-dependent speaker verification
WO2017162053A1 (en) Identity authentication method and device
CN110706692A (en) Training method and system of child voice recognition model
CN111081255B (en) Speaker confirmation method and device
CN110299142A (en) A kind of method for recognizing sound-groove and device based on the network integration
CN110223678A (en) Audio recognition method and system
CN110379433A (en) Method, apparatus, computer equipment and the storage medium of authentication
CN111191787B (en) Training method and device of neural network for extracting speaker embedded features
CN113362829B (en) Speaker verification method, electronic device and storage medium
Lee et al. Imaginary voice: Face-styled diffusion model for text-to-speech
CN110232928B (en) Text-independent speaker verification method and device
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium
CN116705008A (en) Training method and device for intention understanding model for protecting privacy
Han et al. The sjtu system for short-duration speaker verification challenge 2021
Chakroun et al. An improved approach for text-independent speaker recognition
CN114267334A (en) Speech recognition model training method and speech recognition method
CN111081256A (en) Digital string voiceprint password verification method and system
Kalantari et al. Cross database training of audio-visual hidden Markov models for phone recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant