CN113362829B - Speaker verification method, electronic device and storage medium - Google Patents


Info

Publication number
CN113362829B
CN113362829B (application CN202110623701.0A)
Authority
CN
China
Prior art keywords
training
speaker
data
extractor
text
Prior art date
Legal status
Active
Application number
CN202110623701.0A
Other languages
Chinese (zh)
Other versions
CN113362829A (en)
Inventor
钱彦旻
韩冰
陈正阳
周之恺
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202110623701.0A priority Critical patent/CN113362829B/en
Publication of CN113362829A publication Critical patent/CN113362829A/en
Application granted granted Critical
Publication of CN113362829B publication Critical patent/CN113362829B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a speaker verification method comprising the following steps: preprocessing the training data in a training sample set; performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples; processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and completing speaker verification based on the speaker embedding features. By fine-tuning the pre-trained speaker embedding extractor on the preprocessed training data, the invention improves the performance of text-dependent speaker verification.

Description

Speaker verification method, electronic device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker verification method, electronic equipment and a storage medium.
Background
With the development of deep learning, speaker verification systems have improved greatly. Different architectures, loss functions and training strategies have been proposed in the prior art to improve system performance under different conditions. However, some challenges remain unsolved when deploying speaker verification systems, such as short audio duration and cross-language conditions.
Disclosure of Invention
An embodiment of the present invention provides a speaker verification method, an electronic device and a storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker verification method, including:
preprocessing training data in a training sample set;
performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples;
processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
completing speaker verification based on the speaker embedding features.
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes:
performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes:
performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor.
In a second aspect, the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any speaker verification method of the present invention.
In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speaker verification methods of the present invention as described above.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, which includes a computer program stored on a storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to perform any one of the speaker verification methods described above.
Embodiments of the invention preprocess the training data in a training sample set; perform fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples; process the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and complete speaker verification based on the speaker embedding features, thereby improving speaker verification performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below illustrate some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flow chart of one embodiment of a speaker verification method of the present invention;
FIG. 2 is a flow chart of another embodiment of a speaker verification method of the present invention;
FIG. 3a is a diagram illustrating an embodiment of fine tuning of a speaker embedding extractor according to the present invention;
FIG. 3b is a diagram illustrating another embodiment of fine-tuning a speaker embedding extractor according to the present invention;
FIG. 4a is a schematic diagram of another embodiment of fine tuning of a speaker embedding extractor in accordance with the present invention;
FIG. 4b is a schematic diagram of another embodiment of fine tuning of a speaker embedding extractor in accordance with the present invention;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, an embodiment of the present invention provides a speaker verification method, including:
s10, preprocessing training data in the training sample set;
s20, carrying out fine tuning training based on text information on the speaker embedding extractor obtained by pre-training based on the training sample after pre-processing; illustratively, the speaker embedding extractor is pre-trained based on large-scale text-independent data.
And S30, processing the audio of the speaker to be verified by adopting the speaker embedding extractor obtained by fine tuning training to obtain speaker embedding characteristics.
And S40, finishing the speaker verification based on the speaker embedding characteristics.
The embodiment of the invention preprocesses the training data in the training sample set; performing fine tuning training based on text information on a speaker embedding extractor obtained by pre-training large-scale text-independent data based on the preprocessed training sample; processing the audio of the speaker to be verified by adopting a speaker embedded extractor obtained by fine tuning training to obtain speaker embedded characteristics; speaker verification is completed based on the speaker embedding features, and performance of speaker verification is improved.
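For concreteness, the verification in step S40 can be implemented by scoring the speaker embedding of the test audio against the enrollment embeddings of the claimed speaker. The patent does not fix a particular back-end at this point (both cosine scoring and PLDA appear later), so the following is only a minimal cosine-scoring sketch, and the decision threshold is illustrative.

```python
import numpy as np

def cosine_score(enroll_embs: np.ndarray, test_emb: np.ndarray) -> float:
    """Average the enrollment embeddings into a speaker model and score the
    test embedding against it with cosine similarity."""
    model = enroll_embs.mean(axis=0)
    model = model / np.linalg.norm(model)
    test = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(model, test))

def verify(enroll_embs: np.ndarray, test_emb: np.ndarray, threshold: float = 0.5) -> bool:
    # Accept the trial if the cosine score exceeds a tuned decision threshold.
    return cosine_score(enroll_embs, test_emb) >= threshold
```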
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning. Illustratively, two classifiers are used: one for speaker classification and one for text (phrase) classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification. Illustratively, one classifier is used to classify the "speaker x text" joint labels.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor. Illustratively, 10 classifiers may be used to classify the speakers of the different text phrases, respectively. Alternatively, a contrastive-learning strategy is added, which increases the similarity between utterances of the same speaker on the same text and decreases the similarity between different speakers on the same text.
As shown in fig. 2, an embodiment of the present invention provides a speaker verification system, including:
a data preprocessing program module 10, configured to preprocess training data in a training sample set;
an extractor fine-tuning program module 20, configured to perform text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples;
a feature extraction program module 30, configured to process the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
a verification program module 40, configured to complete speaker verification based on the speaker embedding features.
The embodiment of the invention preprocesses the training data in the training sample set; performs text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples; processes the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features; and completes speaker verification based on the speaker embedding features, thereby improving speaker verification performance.
In some embodiments, the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
In some embodiments, preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
In some embodiments, the system is further configured to: pre-train with text-independent data to obtain the speaker embedding extractor.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning. Illustratively, two classifiers are used: one for speaker classification and one for text (phrase) classification.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification. Illustratively, one classifier is used to classify the speaker x text joint labels.
In some embodiments, performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples includes: performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor. Illustratively, 10 classifiers may be used to classify the speakers of the different text phrases, respectively. Alternatively, a contrastive-learning strategy is added, which increases the similarity between utterances of the same speaker on the same text and decreases the similarity between different speakers on the same text.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In order to describe the technical solutions of the present invention more clearly and to demonstrate more directly the effectiveness and benefits of the present invention over the prior art, the technical background, the technical solutions and the experiments performed are described in more detail below.
Abstract: This document describes the systems for text-dependent and text-independent speaker verification that we submitted to the Short-duration Speaker Verification (SdSV) Challenge 2021. In this challenge, we explore the effect of different embedding extractors for extracting robust speaker embeddings. For the text-independent task, adaptive score normalization is adopted, which improves system performance under the cross-language verification condition. For the text-dependent task, we focus mainly on strategies for fine-tuning models pre-trained on large-scale out-of-domain data using the in-domain data. To improve the discrimination between different speakers who speak the same phrase, we propose several novel fine-tuning strategies using text information, as well as a neural probabilistic linear discriminant analysis (PLDA). With these training strategies, system performance is further improved. Finally, the scores of systems based on different training strategies are fused to obtain a fusion system, which reaches a minDCF of 0.0473 on Task1 and 0.0581 on Task2.
1. Introduction
In the present invention, the SdSV Challenge 2021 includes two tasks. Task1 is a text-dependent task, in which the speaker verification system should verify both the identity of the test speaker and the spoken phrase. Task2 is a text-independent task, in which the system should only consider the speaker identity. In particular, the SdSV Challenge introduces a new and challenging verification condition in Task2, namely cross-language verification, where a speaker may speak different languages during the enrollment and test phases.
SdSV 2021 is the second challenge of the SdSV series, and many competitive systems were proposed in the previous challenge. For the text-independent task of the previous challenge, Jenthe et al. proposed a new data mining strategy (HPM) and introduced adaptive score normalization to improve the robustness of the system under cross-language verification. Peng et al. introduced a greedy fusion algorithm to further improve the performance of the fusion system. For the text-dependent task, teams primarily focused on back-end optimization.
In this challenge, we first explore different network structures that perform well and train them on all available data. Then, we focus on in-domain fine-tuning strategies to further improve system performance. To address the cross-language verification problem in the text-independent task, we train an additional language identification network and introduce the language information into an adaptive score normalization process. For the text-dependent task, we implement different approaches to increase the separation between target trials and the different types of non-target verification pairs. We use an ASR system to classify the spoken phrases during the test phase and directly filter out phrase-mismatch verification pairs (where the speaker uttered the wrong verification phrase). To better distinguish different speakers who speak the same phrase, we propose several novel phrase-aware fine-tuning strategies and a phrase-aware neural PLDA. With these training strategies, the performance of our systems is further improved.
The rest of the text is organized as follows: Section 2 introduces the datasets used in this challenge. Section 3 introduces our embedding extractor network structures and the proposed fine-tuning strategies. The experimental results and corresponding analyses are given in Section 4. Finally, conclusions are drawn in Section 5.
2. Data set
The SdSV Challenge imposes the constraint that systems may be trained only on specified datasets. The primary training and evaluation data for the SdSV Challenge is the DeepMine dataset, recorded in real environments in Iran. The acquisition protocol of the dataset was designed so that various real noises are present during recording. The primary language of the dataset is Persian, and most participants also took part in the English portion.
Task1 in-domain data: this dataset can be used to construct a text-dependent speaker verification system. It contains 101k utterances from 963 different speakers. The content of all utterances is fixed, covering five Persian phrases and five English phrases.
Task2 in-domain data: the content of the utterances in this dataset is unconstrained. It contains 125k utterances collected from 588 speakers, some of whom have only Persian utterances.
In addition to intra-domain training data, other open datasets that are allowed to be used in the training process are as follows:
voxceleb: voxceleb 1&2 contains over one million audios from 7245 speakers, which were collected from videos uploaded to YouTube.
Librispeech: a data set containing 281k voices from 2338 speakers. It comes from audio books, most of the language is american english.
Common Voice Farsi: this is a set of transcribed speech that is composed of multiple languages. Only a portion of the gaussian is used in this challenge.
2.2 evaluation
The evaluation data for Tasks 1&2 are all drawn from the DeepMine in-domain data.
In Task1, each verification pair includes a test segment and a model identifier that indicates three enrollment audios and the phrase ID spoken in those utterances. The trials fall into four basic types: target-correct (TC), target-wrong (TW), impostor-correct (IC) and impostor-wrong (IW). A text-dependent speaker verification system should accept the TC verification pairs and reject the other three types.
In Task2, the enrollment data consist of one to several variable-length Persian utterances, while the test utterances may come from another language (English). For this task, the system should accept a trial whenever the enrollment and test utterances are from the same speaker, regardless of any language mismatch.
The main metric adopted by SdSV is the normalized minimum detection cost function (minDCF), defined as a weighted sum of the false-alarm and miss probabilities.
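For reference, the normalized minimum detection cost has the standard form below; the exact operating point (costs and target prior) is set by the challenge evaluation plan and is not restated in this text, so the values quoted in the comment are an assumption.

```latex
% Detection cost at threshold \theta and its normalized minimum
% (a typical operating point, e.g. P_{tar}=0.01, C_{miss}=10, C_{fa}=1, is assumed here):
C_{\mathrm{Det}}(\theta) = C_{\mathrm{miss}}\,P_{\mathrm{miss}}(\theta)\,P_{\mathrm{tar}}
  + C_{\mathrm{fa}}\,P_{\mathrm{fa}}(\theta)\,\bigl(1 - P_{\mathrm{tar}}\bigr),
\qquad
\mathrm{minDCF} = \min_{\theta}\;
  \frac{C_{\mathrm{Det}}(\theta)}{\min\bigl(C_{\mathrm{miss}}P_{\mathrm{tar}},\;C_{\mathrm{fa}}(1-P_{\mathrm{tar}})\bigr)}
```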
3. Methods
In this section we introduce the embedding extractors, fine-tuning strategies and several post-processing methods used in our systems. In our experiments, the embedding extractors are first trained in a text-independent manner on all available data for Task1 and Task2. We then fine-tune the pre-trained models using the in-domain data. Finally, system performance is further improved with post-processing methods.
3.1 speaker embedding extractor
To construct a robust speaker verification system for the SdSV Challenge, all datasets (including VoxCeleb, Common Voice Farsi, Librispeech, and the DeepMine in-domain data) were combined to train the speaker embedding extractors. To reduce the duration mismatch between training data and test data, all utterances were randomly divided into 2-second segments during the training phase. We use 40-dimensional filterbank (Fbank) acoustic features with a frame length of 25 ms and a shift of 10 ms. To increase the amount and diversity of training data, we apply online data augmentation during training. Additive noise from the MUSAN corpus and room impulse responses (RIRs) are used for augmentation.
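A minimal sketch of this preprocessing pipeline is given below: random 2-second chunking, additive-noise mixing at a chosen SNR, and 40-dimensional Fbank extraction with 25 ms frames and 10 ms shift. Loading MUSAN noise files, the SNR sampling ranges and RIR convolution are omitted, and the helper names are illustrative rather than the authors' actual code.

```python
import random
import torch
import torchaudio

SAMPLE_RATE = 16000
SEG_SAMPLES = 2 * SAMPLE_RATE  # 2-second training chunks

def random_chunk(wav: torch.Tensor) -> torch.Tensor:
    """Randomly crop (or zero-pad) a waveform of shape (1, T) to 2 seconds."""
    if wav.shape[-1] <= SEG_SAMPLES:
        return torch.nn.functional.pad(wav, (0, SEG_SAMPLES - wav.shape[-1]))
    start = random.randint(0, wav.shape[-1] - SEG_SAMPLES)
    return wav[..., start:start + SEG_SAMPLES]

def add_noise(wav: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a noise clip (e.g. from MUSAN) into the speech at a target SNR."""
    noise = random_chunk(noise)
    speech_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = (speech_power / (10 ** (snr_db / 10) * noise_power)).sqrt()
    return wav + scale * noise

def extract_fbank(wav: torch.Tensor) -> torch.Tensor:
    """40-dimensional filterbank features, 25 ms frame length, 10 ms shift."""
    return torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=40, frame_length=25.0, frame_shift=10.0,
        sample_frequency=SAMPLE_RATE)
```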
In our system, we mainly use three different speaker verification architectures: ResNet34, ECAPA-TDNN and DPN68.
ResNet34: ResNet achieves excellent performance in speaker verification thanks to its efficient modeling of complex data structures. We use the ResNet34 introduced in prior work as our ResNet-based network structure. In this structure, the input features are processed by an initial convolutional layer and four residual blocks, after which a statistics pooling layer aggregates the frame-level features into a segment-level representation. Finally, a 256-dimensional fully connected layer converts it into a fixed-dimensional vector representing the speaker.
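The statistics pooling step mentioned above can be sketched as follows; this shows only the pooling layer, with the residual trunk and the final 256-dimensional embedding layer omitted.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Aggregate frame-level features (batch, channels, frames) into a
    segment-level vector by concatenating per-channel mean and std."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1)
        std = x.std(dim=-1).clamp(min=1e-7)  # floor for numerical stability
        return torch.cat([mean, std], dim=1)  # (batch, 2 * channels)
```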
ECAPA-TDNN: ECAPA-TDNN has achieved very good results in speaker verification and was used in the VoxSRC2020 winning system. In our experiments, we set the number of channels of ECAPA-TDNN to 1024. We tried channel attention both with and without global context; the variant with global context is denoted Ecapa-Glob and the plain variant Ecapa.
DPN68: the dual-path network (DPN), which combines the advantages of ResNet and DenseNet, has previously been applied to the speaker verification task. Here we use the DPN68 architecture as one of the embedding extractors in our system.
Additive angular margin softmax (AAM-softmax) loss is used to optimize all embedding extractors. The scale and margin of the AAM loss are set to 32 and 0.2, respectively. Each model is trained for 165 epochs, and the learning rate decays exponentially from 0.1 to 1e-5 during training.
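A minimal sketch of an additive angular margin softmax head with the quoted scale and margin (s = 32, m = 0.2) is shown below; it follows the common ArcFace-style formulation and is not necessarily identical to the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax head (ArcFace-style), returning the loss."""
    def __init__(self, emb_dim: int, num_classes: int, s: float = 32.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)  # add the angular margin on the target class
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.s * (one_hot * target + (1 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)
```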
3.2, in-Domain Fine tuning
In this section, we will introduce our fine-tuning strategy based on the pre-trained model introduced in the previous section to further improve system performance on the intra-domain evaluation set.
3.2.1 Text-dependent fine-tuning for Task1
To encode the phrase information into the speaker embeddings and further enlarge the distance between different speakers who speak the same phrase, we fine-tune the embedding extractor in a text-dependent manner for Task1. The strategies are introduced as follows:
Speaker + phrase: as shown in FIG. 3a, in the fine-tuning stage there are two separate heads for speaker and phrase classification, and we fine-tune the embedding extractor in a multi-task manner.
Speaker x phrase: speech in different phrases spoken by the same speaker are considered to be in different categories. As shown in fig. 3b, there is only one classification header, but both speaker and phrase information are considered.
Since phrase classification requires the full information of a sentence, the inputs are not chunked during this training stage, and variable-length inputs within the same batch are zero-padded to the same length.
3.2.2 Text-independent fine-tuning for Task1 and Task2
In our experiments, phrase-mismatch verification pairs (IW and TW) in Task1 can be filtered out by the ASR system described in Section 3.3.3. The model then only needs to verify the speaker identity of the two audio segments, so Task1 can also be treated as a text-independent task. For the conventional fine-tuning of Task1 and Task2, we optimize the pre-trained model on the in-domain data using AAM-softmax.
In particular, for Task1, to enhance the model's ability to distinguish speakers uttering the same phrase, we propose two phrase-aware text-independent fine-tuning strategies: phrase-aware multi-head training (PMT) and phrase-aware contrastive training (PCT).
PMT: for task1, all phrases were extracted from a fixed set of 10 phrases consisting of 5 Persian and 5 English phrases. As shown in fig. 4a, different speaker headers are used for utterances of different phrases, and the distance between different speakers in the same phrase can be extended by this training strategy.
PCT: as shown in fig. 4b, in this fine tuning strategy we introduce a comparative learning penalty that can be co-optimized with the AAM softmax penalty. In our experiment, we used generalized end-to-end loss to calculate the contrast loss, and we sampled two utterances for each speaker in the training batch. To improve the distinction of different speakers speaking the same phrase, we restrict all utterances in the same batch to be from the same phrase.
3.3, post-processing
3.3.1 Language-adaptive score normalization
The most difficult condition in Task2 is cross-language verification. To minimize the impact of language mismatch between different utterances, we introduce language information into adaptive score normalization (AS-Norm), as defined by Equation 1; a reconstruction of the standard AS-Norm form is given below. The cohort of each enrollment model is selected so that the language of the cohort matches the language of the test utterance. To detect the language of the test utterance, a TDNN-based language identification model is trained on the Task2 in-domain data.
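Since the original Equation (1) is only available as an image, the standard adaptive s-norm form it appears to describe is reconstructed below, with the cohort restricted to the language of the test utterance; the notation is ours.

```latex
% Adaptive score normalization with a language-matched cohort (reconstruction of Eq. 1).
% \mathcal{C}_e^{\ell(t)}, \mathcal{C}_t^{\ell(t)}: top-scoring cohort utterances for the
% enrollment model e and test utterance t, restricted to the test-utterance language \ell(t).
s_{\mathrm{norm}}(e,t) = \frac{1}{2}\left(
  \frac{s(e,t)-\mu\!\left(S_e\!\left(\mathcal{C}_e^{\ell(t)}\right)\right)}
       {\sigma\!\left(S_e\!\left(\mathcal{C}_e^{\ell(t)}\right)\right)}
  +
  \frac{s(e,t)-\mu\!\left(S_t\!\left(\mathcal{C}_t^{\ell(t)}\right)\right)}
       {\sigma\!\left(S_t\!\left(\mathcal{C}_t^{\ell(t)}\right)\right)}
\right)
```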
3.3.2 phrase-aware neural PLDA
Neural PLDA (NPLDA) was successfully applied in the NICT system for the SdSV Challenge 2020. To strengthen NPLDA on the text-dependent task, we constrain the NPLDA input pairs to lie within the same phrase, which improves the discrimination of different speakers within the same phrase. Our NPLDA is initialized from a phrase-dependent PLDA model trained on the Task2 in-domain data, with parameters set to the default configuration except that the learning rate is 5e-5 and the number of training epochs is 5.
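A sketch of an NPLDA-style scoring head is shown below: a quadratic Gaussian-backend scoring function with trainable parameters that can be optimized discriminatively. Initialization from the generative phrase-dependent PLDA and the same-phrase pair constraint described above are not shown, and this parameterization is a common NPLDA form rather than the authors' exact model.

```python
import torch
import torch.nn as nn

class NPLDAScore(nn.Module):
    """Trainable quadratic scoring function between enrollment and test embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.P = nn.Parameter(torch.eye(dim))          # cross (between-pair) term
        self.Q = nn.Parameter(torch.zeros(dim, dim))   # self (within-vector) terms
        self.c = nn.Parameter(torch.zeros(dim))
        self.k = nn.Parameter(torch.zeros(1))

    def forward(self, xe: torch.Tensor, xt: torch.Tensor) -> torch.Tensor:
        cross = (xe @ self.P * xt).sum(-1)
        self_e = (xe @ self.Q * xe).sum(-1)
        self_t = (xt @ self.Q * xt).sum(-1)
        bias = (xe + xt) @ self.c
        return cross + self_e + self_t + bias + self.k
```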
3.3.3 ASR System
For the text-dependent Task1, an ASR system is trained to filter trials whose enrollment and test utterances come from different phrases. We use the ESPnet LibriSpeech Conformer-based joint CTC-attention automatic speech recognition (ASR) model. The ASR system is first trained on the Librispeech dataset and then fine-tuned on the Task1 in-domain data. During evaluation, we use the ASR system to recognize the phrases and classify the textual content of each speech segment according to its Levenshtein edit distance to the reference phrases. Based on the phrase labels generated by the ASR system, we filter out IW and TW verification pairs directly by setting their scores to a very low value. Note that all Task1 results reported in the experiments incorporate this ASR-based filtering.
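The filtering step can be sketched as follows: the ASR hypothesis for each segment is assigned to the closest of the 10 reference phrases by edit distance, and any trial whose enrollment and test phrase labels disagree is forced to a very low score. The ESPnet Conformer model itself is not shown, and the score floor is illustrative.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify_phrase(hypothesis: str, reference_phrases: List[str]) -> int:
    """Assign the recognized text to the closest of the reference phrases."""
    return min(range(len(reference_phrases)),
               key=lambda i: levenshtein(hypothesis, reference_phrases[i]))

def filtered_score(score: float, enroll_phrase_id: int, test_phrase_id: int,
                   floor: float = -1000.0) -> float:
    """Force phrase-mismatch trials (IW / TW) to a very low score."""
    return score if enroll_phrase_id == test_phrase_id else floor
```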
4. Experiments
4.1, task 1: text dependent speaker verification
4.1.1 Pre-training model
All the embedding extractors introduced in Section 3.1 were first pre-trained on all available datasets, and the corresponding results are listed in Table 1. The results show that the best performance is obtained by ResNet34 with cosine similarity scoring. DPN68 also outperforms the ECAPA-TDNN-based models, but is inferior to ResNet34. Furthermore, in most cases cosine scoring outperforms PLDA, so we only provide cosine results for analysis in the following sections.
Table 1: comparison of results for the Pre-trained model of task1
4.1.2, Intra-Domain Fine tuning
Text-related mode: table 2 illustrates a comparison of different text dependent pattern fine tuning strategies introduced in 3.2.1. As can be seen from the experimental results based on ResNet34, fine tuning will improve text-related tasks. In particular, the "speaker + phrase" performed best in these results.
Table 2: text-dependent mode hinting for task1
Table 3: the main result of task 2. The data in the domain of task2 is fine-tuned using AAM softmax.
Text independent mode: this section studied the phrase-aware hinting strategy for our proposed text-independent model, with the results shown in table 4. It is clear that the performance of all the fine tuning systems is superior to the pre-trained model. In this table we also list the fine tuning results for PCT (phrase-aware contrast training) and PMT (phrase-aware multi-head training) for all models. Both PCT and PMT achieved excellent performance improvements in both EER and minDCF compared to conventional intra-domain data trimming methods. Furthermore, ResNet34 using the PCT strategy performed best in all models.
Table 4: text-independent phrase-aware hinting of task1 conventional hinting without PCT and PMT was also performed on task1 for comparison, where data within the domain was trained using AAM softmax in a text-independent mode.
4.1.3 phrase-aware neural PLDA
As shown in Table 1, PLDA fails to provide satisfactory performance compared to cosine similarity. To further improve the fusion system, we also trained the NPLDA described in Section 3.3.2 to improve the PLDA results used in the final fusion. We investigated the model fine-tuned with the PCT strategy that obtained the best results in Section 4.1.2; the corresponding NPLDA results are shown in Table 5 and are comparable to the cosine back-end results in terms of minDCF.
Table 5: results of NPLDA task1
4.2, task 2: text independent speaker verification
The main results for Task2 are listed in Table 3. For Task2, all embedding extractors are likewise first pre-trained on all available datasets and then fine-tuned on the in-domain data. The results show that DPN68 achieves the best results among all models. Furthermore, fine-tuning on in-domain data significantly improves performance compared to the pre-trained models. In addition, we performed experiments with AS-Norm; the best minDCF of 0.0752 was obtained by DPN68 with AS-Norm.
4.3 fusion results
Table 6: fused results on Dev and Eval sets
Finally, the scores from all systems (covering different models, back-ends and fine-tuning strategies) are combined by a weighted sum to obtain the fused system, with the fusion weights tuned on the development set. The results of the fusion system on the development and evaluation sets are shown in Table 6. As seen from the table, fusion further improves performance. Our primary submission was generated by the fusion system, reaching a minDCF of 0.0473 on Task1 (rank 3) and 0.0581 on Task2 (rank 8).
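A minimal weighted-sum fusion sketch is given below; the per-system score normalization before weighting and the way the weights are chosen (e.g. grid search or logistic regression on the development set) are assumptions, as the text only states that the weights are tuned on the development set.

```python
import numpy as np

def fuse_scores(system_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted-sum fusion of trial scores.

    system_scores: shape (num_systems, num_trials); weights: shape (num_systems,).
    Each system's scores are z-normalized before weighting so the weights are
    comparable across systems."""
    mu = system_scores.mean(axis=1, keepdims=True)
    sigma = system_scores.std(axis=1, keepdims=True)
    normed = (system_scores - mu) / sigma
    return weights @ normed  # fused score per trial
```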
5. Conclusion
Herein, we have presented in detail the systems for Task1 and Task2 that we submitted to the SdSV Challenge 2021. We explored several powerful embedding extractors in the experiments. For the text-independent task, a language identification model is used to introduce language information into AS-Norm. For the text-dependent task, we filter IW and TW verification pairs with an ASR system. We propose several text-information-based fine-tuning strategies and post-processing methods to enhance the model's ability to distinguish different speakers uttering the same phrase. Based on these strong systems, our final fusion system ranked third on Task1 and eighth on Task2.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the speaker verification methods of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speaker verification methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speaker verification method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the speaker verification method.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device for performing a speaker verification method according to another embodiment of the present application, and as shown in fig. 5, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 5.
The apparatus for performing the speaker verification method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speaker verification method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 520, so as to implement the speaker verification method of the above-described method embodiment.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speaker verification apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, and these remote memories may be connected to the speaker verification device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and functional control of the speaker verification device. The output device 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the speaker verification method in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, feature phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the essence of the above technical solutions, or the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A speaker verification method, comprising:
preprocessing training data in a training sample set;
performing text-information-based fine-tuning training on a pre-trained speaker embedding extractor using the preprocessed training samples;
processing the audio of the speaker to be verified with the fine-tuned speaker embedding extractor to obtain speaker embedding features;
completing speaker verification based on the speaker embedding features,
wherein performing text-information-based fine-tuning training on the pre-trained speaker embedding extractor using the preprocessed training samples comprises:
performing speaker classification and phrase classification on the speaker embedding extractor in a multi-task manner to complete the fine-tuning; or
fine-tuning the pre-trained speaker embedding extractor with joint labels of speaker classification and phrase classification; or
performing phrase-aware multi-head training and phrase-aware contrastive training on the pre-trained speaker embedding extractor.
2. The method of claim 1, wherein the training sample set includes training data and test data, and preprocessing the training data in the training sample set includes: segmenting the training data and the test data into audio segments of a preset length.
3. The method of claim 1, wherein preprocessing the training data in the training sample set further comprises: performing online data augmentation on the training data in the training sample set.
4. The method of any one of claims 1 to 3, wherein the method further comprises: pre-training with text-independent data to obtain the speaker embedding extractor.
5. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
6. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN202110623701.0A 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium Active CN113362829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110623701.0A CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110623701.0A CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113362829A CN113362829A (en) 2021-09-07
CN113362829B true CN113362829B (en) 2022-05-24

Family

ID=77532167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110623701.0A Active CN113362829B (en) 2021-06-04 2021-06-04 Speaker verification method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113362829B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782034A (en) * 2021-09-27 2021-12-10 镁佳(北京)科技有限公司 Audio identification method and device and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443508B2 (en) * 2013-09-11 2016-09-13 Texas Instruments Incorporated User programmable voice command recognition based on sparse features
US10706856B1 (en) * 2016-09-12 2020-07-07 Oben, Inc. Speaker recognition using deep learning neural network
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN107680600B (en) * 2017-09-11 2019-03-19 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN110310647B (en) * 2017-09-29 2022-02-25 腾讯科技(深圳)有限公司 Voice identity feature extractor, classifier training method and related equipment
CN108281146B (en) * 2017-12-29 2020-11-13 歌尔科技有限公司 Short voice speaker identification method and device
CN108417217B (en) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 Speaker recognition network model training method, speaker recognition method and system
CN111243602B (en) * 2020-01-06 2023-06-06 天津大学 Voiceprint recognition method based on gender, nationality and emotion information
CN111429923B (en) * 2020-06-15 2020-09-29 深圳市友杰智新科技有限公司 Training method and device of speaker information extraction model and computer equipment

Also Published As

Publication number Publication date
CN113362829A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN108109613B (en) Audio training and recognition method for intelligent dialogue voice platform and electronic equipment
CN108417217B (en) Speaker recognition network model training method, speaker recognition method and system
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
EP2410514B1 (en) Speaker authentication
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108766445A (en) Method for recognizing sound-groove and system
Zhang et al. Seq2seq attentional siamese neural networks for text-dependent speaker verification
WO2017162053A1 (en) Identity authentication method and device
CN110706692A (en) Training method and system of child voice recognition model
CN111081255B (en) Speaker confirmation method and device
CN110299142A (en) A kind of method for recognizing sound-groove and device based on the network integration
CN110223678A (en) Audio recognition method and system
CN110379433A (en) Method, apparatus, computer equipment and the storage medium of authentication
CN111191787B (en) Training method and device of neural network for extracting speaker embedded features
CN113362829B (en) Speaker verification method, electronic device and storage medium
Lee et al. Imaginary voice: Face-styled diffusion model for text-to-speech
CN110232928B (en) Text-independent speaker verification method and device
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium
CN116705008A (en) Training method and device for intention understanding model for protecting privacy
Han et al. The sjtu system for short-duration speaker verification challenge 2021
Chakroun et al. An improved approach for text-independent speaker recognition
CN114267334A (en) Speech recognition model training method and speech recognition method
CN111081256A (en) Digital string voiceprint password verification method and system
Kalantari et al. Cross database training of audio-visual hidden Markov models for phone recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant