CN110232928B - Text-independent speaker verification method and device

Text-independent speaker verification method and device

Info

Publication number
CN110232928B
CN110232928B (application CN201910511775.8A)
Authority
CN
China
Prior art keywords
phase
speaker
amplitude
feature
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910511775.8A
Other languages
Chinese (zh)
Other versions
CN110232928A (en)
Inventor
俞凯 (Kai Yu)
钱彦旻 (Yanmin Qian)
杨叶新 (Yexin Yang)
王帅 (Shuai Wang)
黄厚军 (Houjun Huang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd
Priority to CN201910511775.8A
Publication of CN110232928A
Application granted
Publication of CN110232928B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention discloses a text-independent speaker verification method and device. The method comprises: extracting amplitude features of the speech to be verified and phase features corresponding to the amplitude features; processing the amplitude features and the phase features to obtain phase-aware features; performing speaker classification on the phase-aware features to obtain a speaker embedding; and performing probabilistic linear discriminant analysis (PLDA) on the speaker embedding to obtain a speaker verification result for the speech to be verified. By combining amplitude and phase features in deep speaker embedding learning, the scheme improves the noise robustness of the speaker verification system. Furthermore, the scheme not only provides a new approach to noise-robust speaker verification, but also demonstrates the potential of using phase features to improve performance.

Description

Text-independent speaker verification method and device
Technical Field
The invention belongs to the technical field of speaker verification, and particularly relates to a text-independent speaker verification method and device.
Background
In the related art, existing speaker verification systems fall roughly into two categories: 1) systems based on the traditional i-vector model; and 2) systems based on a deep learning framework. However, speaker verification systems currently on the market usually require the training and testing environments to be consistent; if the testing environment is noisy, their performance degrades sharply. Noise-robust speaker verification systems do exist on the market, but most of them are trained on specially constructed noisy data sets. Existing speaker verification systems that incorporate phase information are likewise built on traditional frameworks (Gaussian mixture models, etc.).
The conventional i-vector system models a speaker with a GMM (Gaussian mixture model) and obtains the speaker embedding through factor analysis, whereas a speaker verification system based on a deep learning framework uses neural networks to model the speaker embedding. Systems that incorporate phase information concatenate the phase features with the amplitude features and model them with a traditional speaker verification model.
In the process of implementing the present application, the inventors found that the existing schemes have at least the following defects:
Speaker verification systems that are not specifically optimized for noisy environments typically require the training and testing environments to be consistent, and performance degrades significantly if the testing environment is noisy. Reconstructing a noisy training set, in turn, requires considerable labor and time to record new audio. Systems that combine phase information with a traditional speaker verification framework are inferior in performance to deep learning based frameworks. These defects stem mainly from the model capacity, the data set, and similar factors.
Disclosure of Invention
An embodiment of the invention provides a text-independent speaker verification method and device to address at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a text-independent speaker verification method, including: extracting amplitude features of the speech to be verified and phase features corresponding to the amplitude features; processing the amplitude features and the phase features to obtain phase-aware features; performing speaker classification on the phase-aware features to obtain a speaker embedding; and performing probabilistic linear discriminant analysis (PLDA) on the speaker embedding to obtain a speaker verification result for the speech to be verified.
In a second aspect, an embodiment of the present invention provides a text-independent speaker verification device, including: an extraction module configured to extract amplitude features of the speech to be verified and phase features corresponding to the amplitude features; a processing module configured to process the amplitude features and the phase features to obtain phase-aware features; a classification module configured to perform speaker classification on the phase-aware features to obtain a speaker embedding; and a verification module configured to perform probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the text-independent speaker verification method according to any embodiment of the present invention.
In a fourth aspect, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform the steps of the text-independent speaker verification method according to any embodiment of the present invention.
The method and device provided by the application process the extracted amplitude features and corresponding phase features, obtain the speaker embedding from the resulting phase-aware features, and then verify the speech to be verified according to that embedding; combining amplitude and phase features in deep speaker embedding learning improves the noise robustness of the speaker verification system. Furthermore, the scheme not only provides a new approach to noise-robust speaker verification, but also demonstrates the potential of using phase features to improve performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the following drawings depict only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a text-independent speaker verification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another text-independent speaker verification method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another text-independent speaker verification method according to an embodiment of the present invention;
FIG. 4 is a flowchart of yet another text-independent speaker verification method according to an embodiment of the present invention;
FIG. 5 is a system architecture diagram of an embodiment of the text-independent speaker verification scheme of the present invention;
FIG. 6 is a flowchart illustrating the extraction of amplitude and phase features in the text-independent speaker verification scheme according to an embodiment of the present invention;
FIG. 7 is a DET plot evaluated on the VoxCeleb1 test set under the "babble" noise condition, according to an embodiment of the present invention;
FIG. 8 is a block diagram of a text-independent speaker verification device according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of the text-independent speaker verification method of the present application is shown. The method of this embodiment can be applied to terminals with speaker verification functions, such as smart TVs, smart speakers, smart dialogue toys, and other existing smart terminals.
As shown in fig. 1, in step 101, amplitude features of the speech to be verified and phase features corresponding to the amplitude features are extracted;
in step 102, the amplitude features and the phase features are processed to obtain phase-aware features;
in step 103, speaker classification is performed on the phase-aware features to obtain a speaker embedding;
in step 104, probabilistic linear discriminant analysis is performed on the speaker embedding to obtain a speaker verification result for the speech to be verified.
In this embodiment, for step 101, after the text-independent speaker verification device receives the speech submitted by a user for identity verification, it extracts at least one amplitude feature of that speech and the phase feature corresponding to the at least one amplitude feature. For example, an Fbank (filter-bank) amplitude feature may be extracted first, followed by the corresponding phase feature; the specific extraction methods, such as the Fourier transform, are known in the prior art and are not repeated here. Further, the phase features may include the phase obtained directly from the Fourier transform as well as its sine- and cosine-processed forms, Phase(Sin) and Phase(Cos), and the present application is not limited in this respect.
For step 102, the text-independent speaker verification device processes the amplitude features and the corresponding phase features, e.g., through sine and cosine operations, through convolutional and/or residual layers, through addition, and so on, to obtain the phase-aware features. Then, for step 103, the device performs speaker classification learning on the processed phase-aware features to obtain the speaker embedding corresponding to those features.
Finally, for step 104, the text-independent speaker verification device performs probabilistic linear discriminant analysis on the speaker embedding of the speech to be verified to obtain a speaker verification result. If verification passes, the speech belongs to the same person as the user who submitted it, and subsequent operations, such as allowing the user to log in, may proceed; if verification fails, the speaker is not the same person.
The scheme of this embodiment processes the extracted amplitude features and corresponding phase features, obtains the speaker embedding from the resulting phase-aware features, and then verifies the speech according to that embedding; combining amplitude and phase features in deep speaker embedding learning improves the noise robustness of the speaker verification system. Furthermore, the scheme not only provides a new approach to noise-robust speaker verification, but also demonstrates the potential of using phase features to improve performance. In addition, by changing the input, the architecture can combine not only amplitude and phase features but also various other features, such as Fbank and MFCC (mel-frequency cepstral coefficients).
With further reference to FIG. 2, a flowchart of another text-independent speaker verification method provided by an embodiment of the present application is shown. This flowchart mainly details the steps that further define step 102 in fig. 1.
As shown in fig. 2, in step 201, the amplitude feature, the sine of the phase feature, and the cosine of the phase feature are spliced into a three-channel input;
in step 202, the three-channel input is passed through convolutional and residual layers to fuse the amplitude and phase features into the phase-aware feature.
In this embodiment, for step 201, the text-independent speaker verification device computes the sine and cosine of the phase feature and splices them with the amplitude feature to generate the three-channel input. For step 202, the spliced three-channel input is fed into at least one convolutional layer and at least one residual layer, so that the amplitude feature and the sine and cosine of the phase feature are fused into the final phase-aware feature, as sketched below.
In this scheme, the sine and cosine of the phase feature are computed first and spliced with the amplitude feature into a three-channel input, which convolutional and residual layers then fuse into the phase-aware feature. Phase information is thereby integrated into deep speaker embedding learning: noisy training data need not be re-recorded, and the robustness of the system to noise is improved. The scheme is also simple to implement and does not require major modification of existing systems.
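As an illustration, the following PyTorch sketch shows the channel-splicing step of Arch #1 under assumed shapes (a batch of 8 utterances, 257 frequency bins, 400 frames); the channel count and the omitted residual blocks are placeholders, not the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

# Assumed shapes: each feature map is (batch, freq_bins, frames).
magnitude = torch.randn(8, 257, 400)
phase_sin = torch.randn(8, 257, 400)
phase_cos = torch.randn(8, 257, 400)

# Arch #1: splice the three features as channels of a single input,
# then fuse amplitude and phase with shared conv/residual layers.
x = torch.stack([magnitude, phase_sin, phase_cos], dim=1)  # (8, 3, 257, 400)

fuse = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # shared first conv layer
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
    # ...shared residual blocks would follow here...
)
phase_aware = fuse(x)  # fused phase-aware feature map, (8, 16, 257, 400)
```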
With further reference to FIG. 3, a flowchart of another text-independent speaker verification method provided by an embodiment of the present application is shown. This flowchart mainly details the steps that further define step 102 in fig. 1.
In step 301, the amplitude feature, the sine of the phase feature, and the cosine of the phase feature are each passed through an independent convolutional layer and an independent residual layer to obtain processed features;
in step 302, the processed features are added together to obtain the phase-aware feature.
In this embodiment, for step 301, the text-independent speaker verification device computes the sine and cosine of the phase feature and then passes the amplitude feature, the sine, and the cosine each through at least one independent convolutional layer and at least one independent residual layer, yielding a processed amplitude feature, a processed sine feature, and a processed cosine feature. Then, for step 302, the independently processed features are added together to obtain the phase-aware feature.
This embodiment offers a processing scheme different from that of fig. 2: the sine and cosine of the phase feature are computed first, each feature is then processed independently by its own convolutional and residual layers, and finally the independently processed features are fused into the phase-aware feature. Phase information is again integrated into deep speaker embedding learning, noisy training data need not be re-recorded, and the robustness of the system to noise is improved. The scheme is also simple to implement and does not require major modification of existing systems.
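A corresponding sketch of this per-feature variant (Arch #2) follows; the channel count is again illustrative, and each branch stands in for the independent convolutional and residual layers described above.

```python
import torch
import torch.nn as nn

class PhasePerceptronArch2(nn.Module):
    """Arch #2 sketch: each feature has its own branch (standing in for
    an independent convolutional layer plus residual layer), and the
    branch outputs are merged by element-wise addition."""
    def __init__(self, channels: int = 16):
        super().__init__()
        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(1, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.mag = branch()
        self.sin = branch()
        self.cos = branch()

    def forward(self, magnitude, phase_sin, phase_cos):
        # Each input: (batch, 1, freq_bins, frames).
        return self.mag(magnitude) + self.sin(phase_sin) + self.cos(phase_cos)

# Usage with the assumed shapes from the previous sketch:
net = PhasePerceptronArch2()
m = torch.randn(8, 1, 257, 400)
out = net(m, m, m)  # (8, 16, 257, 400) phase-aware feature map
```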
In some alternative embodiments, the amplitude features may include: perceptual linear prediction (PLP), mel-frequency cepstral coefficients (MFCC), and filter-bank features (Fbank).
In some alternative embodiments, performing speaker classification on the phase-aware features to obtain the speaker embedding includes: performing speaker classification task learning on the phase-aware features through a preset number of residual layers, wherein the features output by an intermediate layer serve as the speaker embedding.
With further reference to FIG. 4, a flowchart of yet another text-independent speaker verification method provided by an embodiment of the present application is shown. This flowchart mainly details the steps that further define step 104 in fig. 1.
As shown in fig. 4, in step 401, probabilistic linear discriminant analysis is performed on the obtained speaker embedding to generate a score;
in step 402, if the score is greater than or equal to a preset threshold, the speaker of the speech to be verified passes verification;
in step 403, if the score is smaller than the preset threshold, the speaker of the speech to be verified fails verification.
In this embodiment, for step 401, the text-independent speaker verification device performs probabilistic linear discriminant analysis on the obtained speaker embedding and generates a score. For step 402, if the score is greater than or equal to the preset threshold, the speaker of the speech to be verified passes verification, and the client is then allowed to perform operations such as logging in; for step 403, if the score is smaller than the preset threshold, the speaker fails verification and such operations are blocked.
In this way, probabilistic linear discriminant analysis of the speaker embedding, followed by a threshold decision on the resulting score, determines whether speaker verification passes and whether the user may proceed, as the sketch below illustrates.
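A minimal sketch of this decision step follows; the PLDA scorer itself is abstracted away, and the threshold value is a hypothetical placeholder that would be calibrated on development data.

```python
def verify_speaker(plda_score: float, threshold: float) -> bool:
    """Accept the claimed identity if the PLDA score clears the threshold."""
    return plda_score >= threshold

# Example: with a (hypothetical) calibrated threshold of 0.0, a trial
# scoring 1.3 passes verification and a trial scoring -0.7 fails.
assert verify_speaker(1.3, 0.0) is True
assert verify_speaker(-0.7, 0.0) is False
```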
The following description presents some of the problems encountered by the inventors in implementing the present application and a specific embodiment of the finally adopted solution, so that those skilled in the art can better understand the present disclosure.
In some optional embodiments, extracting the amplitude features of the speech to be verified and the corresponding phase features includes: segmenting the signal of the speech to be verified into a sequence of frame signals using a sliding window; and applying a short-time Fourier transform or a fast Fourier transform to the sequence of frame signals, the output of which is the amplitude features and the corresponding phase features. The amplitude features and corresponding phase features of the speech signal can thus be extracted in this manner.
To remedy the defects of the prior art, those skilled in the art might adopt the following approach: to enhance the robustness of a speaker verification system, an industry practitioner usually records audio in the specific noise scenario and then trains a model on the noisy data. Because phase information is generally ignored by humans and its robustness to noise had only been verified on conventional models, phase-aware deep speaker embedding learning would not normally occur to practitioners.
The scheme of the application creatively integrates phase information with deep speaker embedding learning, improving the robustness of the system to noise without re-recording noisy training data. The model is also simple to implement and does not require major modification of existing systems.
Referring to fig. 5, a flowchart of a specific embodiment of the solution of the present application is shown. It should be noted that although some specific examples are mentioned below, the solution of the present application is not limited to them.
As shown in fig. 5, the detailed steps are as follows:
the system is divided into two parts, namely a phase information sensor and a speaker embedded learning device. The phase information sensor can be realized in two ways, namely, the first one (Arch #1) directly splices amplitude characteristics and phase characteristics (Sin and Cos) into a three-channel input, and fuses amplitude and phase information together through a convolution layer and a residual layer. The second (Arch #2) passes the amplitude and phase features separately through separate convolutional and residual layers, and adds them together to obtain the phase-aware feature. The speaker embedding learning device carries out speaker classification task learning on the phase-sensed characteristics through a plurality of residual error neural network layers, wherein the characteristics of the middle layer are the embedding of the speaker to be obtained. And finally, scoring the obtained speaker embedding by using a PLDA (probabilistic linear discriminant analysis), and judging through a threshold value to complete the speaker verification task.
The inventors have also adopted the following alternatives in the course of carrying out the present application and summarized the advantages and disadvantages of the alternatives.
Instead of representing the phase feature with Sin and Cos, it can be represented directly by the imaginary part of the Fourier transform and trained on. This simplifies the computation, but training is then harder to converge.
The scheme demonstrates that phase features contain a large amount of speaker information, and that combining amplitude and phase features in deep speaker embedding learning improves the noise robustness of the speaker verification system. The scheme not only provides a new approach to noise-robust speaker verification, but also demonstrates the potential of using phase features to improve performance. In addition, by changing the input, the framework can combine not only amplitude and phase features but also various other features (such as Fbank and MFCC).
The procedures and experiments performed by the inventors are described below to enable those skilled in the art to better understand the scheme of the present application.
Speaker embedding learning based on deep neural networks has become the main modeling strategy for speaker recognition. Deep speaker embedding learning is typically performed only on amplitude features, such as PLP (perceptual linear prediction), MFCC (mel-frequency cepstral coefficients), and Fbank, and ignores all phase information in the speech. In this application, we design a new architecture that uses phase information to improve deep speaker embeddings. We first show that phase features do encode speaker identity information to a great extent, and then propose a phase-aware deep speaker embedding learning framework that exploits amplitude and phase features simultaneously. Experiments were performed on the standard VoxCeleb Part 1 data set, and several noisy test sets were additionally constructed. The results show that the proposed system is more robust under different noise conditions.
Speaker verification (SV) aims at verifying the claimed identity of a client by the speaker's voice. SV systems are typically used for security purposes, such as access control, which usually requires the system to be robust to different application environments. A typical SV system comprises three main steps: front-end feature processing, speaker modeling, and back-end scoring. Over decades of SV development, many different features have been explored, the most popular being short-term spectral features. The speech signal is typically divided into short frames of about 20-30 milliseconds and smoothed with a windowing function. A short-time Fourier transform (STFT) is then applied to each window, decomposing the speech signal into its frequency components, which can be further decomposed into a magnitude spectrum and a phase spectrum. Although the magnitude spectrum values may be used directly as features, they are typically processed further into features such as filter banks (Fbank) and mel-frequency cepstral coefficients (MFCC).
For speaker modeling, embedded representations have become the dominant method over the past decade. The i-vector uses factor analysis (FA) to project a Gaussian mixture model (GMM) supervector into a more compact, speaker-discriminative embedding, where the speaker and channel factors are modeled in the same low-dimensional total variability space. Probabilistic linear discriminant analysis (PLDA) is typically used as the back end for channel compensation and scoring in the i-vector space, and the i-vector/PLDA paradigm represented the state of the art for several years. The success of deep learning (DL) in areas such as image recognition and speech recognition has prompted researchers to apply deep neural networks to the speaker verification task, and in recent years many studies have addressed DNN-based speaker embedding learning. In a typical deep speaker embedding system, a speaker-discriminative neural network is first trained on data from a large number of speakers, and speaker embeddings are then extracted from a particular layer of the trained network. A pooling layer is usually included to aggregate frame-level representations into an utterance-level representation. Different architectures, loss functions, and pooling functions have been studied to enhance deep speaker embeddings. To our knowledge, no prior work has incorporated phase information into deep speaker embedding learning; most work uses only amplitude features.
In this application, we study the integration of phase information into deep speaker embedding learning. 257-dimensional spectral amplitude features and their corresponding phase features are extracted. To represent the phase information, the sine and cosine of the phase are calculated to obtain two further 257-dimensional features, called Phase(Sin) and Phase(Cos). Two architectures based on ResNet (residual neural network) are proposed to combine amplitude and phase features simultaneously. In the first architecture, the three features form three input channels, and the parameters of all layers are shared. In the second architecture, each feature is modeled by its own set of parameters in the first residual layer, after which the output feature maps are aggregated and the rest of the network is shared. The system was trained on the VoxCeleb Part 1 training set and evaluated on its noise-corrupted test sets. Experimental results show that the system achieves better performance under noisy conditions.
Amplitude and phase extraction
As shown in fig. 6, the speech signal is first segmented into frames using a sliding window, and the STFT (or FFT) is then applied to the windowed speech signal sequence. The output of the STFT can be decomposed into a magnitude spectrum and a phase spectrum.
For the t-th frame, the n-th element of the short-term spectrum, S(n, t), is obtained by the STFT of the input speech signal sequence in the corresponding frame window:
$$S(n,t) = \sum_{m=0}^{N-1} x_t(m)\, w(m)\, e^{-j 2\pi m n / N} = |S(n,t)|\, e^{j\theta(n,t)}$$
where $x_t(m)$ is the t-th frame of the signal, $w(m)$ is the window function, and $N$ is the FFT size.
For conventional features such as spectrograms and MFCCs, only the power spectrum, which carries the amplitude information, is used:
$$P(n,t) = |S(n,t)|^2$$
The phase information θ(n, t) is usually ignored. In this application, we instead compute PhaseCos(n, t) and PhaseSin(n, t) as follows:
$$\mathrm{PhaseCos}(n,t) = \cos\theta(n,t)$$
$$\mathrm{PhaseSin}(n,t) = \sin\theta(n,t)$$
We thus obtain three feature vectors for each frame, whose dimensionality is determined by the number of points used in the FFT. In this example, a 512-point FFT was used in all experiments, with one additional dimension for the DC component, so the final dimensionality of all three feature types is 257 per frame.
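For illustration, the extraction described above can be sketched in Python with numpy and librosa (the toolkit choice is an assumption; the embodiment does not prescribe one): a 512-point FFT over 25 ms windows with a 10 ms shift yields the three 257-dimensional features per frame.

```python
import numpy as np
import librosa

def extract_features(wav: np.ndarray, sr: int = 16000, n_fft: int = 512):
    """Return (magnitude, phase_sin, phase_cos), each of shape (257, frames)."""
    win = int(0.025 * sr)   # 25 ms window (400 samples at 16 kHz)
    hop = int(0.010 * sr)   # 10 ms shift (160 samples)
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win)
    power = np.abs(spec) ** 2      # power spectrum P(n, t) = |S(n, t)|^2
    theta = np.angle(spec)         # phase spectrum theta(n, t)
    return power, np.sin(theta), np.cos(theta)

# 512-point FFT -> n_fft // 2 + 1 = 257 bins, DC component included.
feats = extract_features(np.random.randn(16000).astype(np.float32))
print([f.shape for f in feats])  # [(257, T), (257, T), (257, T)]
```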
Phase-aware deep speaker embedding
Deep neural networks are known for their powerful representation learning capabilities and have been successfully applied to tasks such as image recognition, speech recognition, and speaker recognition. In recent years, DNN-based speaker embedding learning has attracted the attention of many researchers, and different approaches have been studied to learn discriminative speaker embeddings, including different network architectures, such as TDNN and ResNet, and various loss functions. However, most works exploit only amplitude-related features and ignore the phase information encoded in the speech signal.
Deep speaker embedding learning
In the deep speaker embedding learning framework, a neural network is trained on a large number of utterances from many speakers in the training set, with the optimization goal of distinguishing between different speakers. Training may be performed at the frame level or the utterance level; the latter is used more often and yields better results.
In the inference phase, deep speaker embeddings are extracted from a particular layer of the trained network and scored using probabilistic linear discriminant analysis (PLDA).
The embodiments of the present application adopt the ResNet architecture for its stable and powerful performance. The network is a 34-layer ResNet trained with the cross-entropy criterion. With an utterance (a sequence of frames) as input, convolutions are performed along the time and frequency axes. After several residual blocks, the feature maps of each utterance are aggregated into a single vector, which serves as the speaker embedding in the inference phase. Detailed parameter settings can be found in the experiments below.
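This trunk-plus-pooling structure can be sketched as follows; the global average pooling and the layer sizes are assumptions for illustration, and the classification head is used only during training.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingNet(nn.Module):
    """Sketch: a ResNet-style trunk, pooling that aggregates the per-
    utterance feature maps into a single vector, an embedding layer whose
    output is taken at inference time, and a cross-entropy training head."""
    def __init__(self, trunk: nn.Module, trunk_channels: int,
                 emb_dim: int, num_speakers: int):
        super().__init__()
        self.trunk = trunk                  # conv + residual layers
        self.embed = nn.Linear(trunk_channels, emb_dim)
        self.head = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):                   # x: (batch, channels, freq, time)
        h = self.trunk(x)
        h = h.mean(dim=(2, 3))               # aggregate over freq and time
        emb = self.embed(h)                  # speaker embedding (inference output)
        return self.head(emb), emb           # logits for cross-entropy, embedding
```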
Phase-aware deep speaker embedding learning
The inventors believe that phase information can facilitate robust speaker modeling. To integrate phase information into the deep speaker modeling process, we modify ResNet to combine amplitude and phase features. Continuing with fig. 5, the overall framework can be divided into two main parts: the phase information perceptron and the speaker embedding learner. The phase information perceptron fuses amplitude and phase information, and the speaker embedding learner extracts a phase-aware embedding from the perceptron's output.
Fig. 5 shows the framework of the phase-aware deep speaker embedding extractor. For the phase information perceptron, two approaches are proposed. The first (denoted Arch #1) concatenates the amplitude, phase sine, and phase cosine into a 3-channel feature map, followed by a residual layer that extracts a high-level representation. In the second (denoted Arch #2), the different features first pass through separate residual blocks, and the resulting feature maps are then aggregated for subsequent speaker embedding learning.
The speaker embedding learner receives the output of the phase information perceptron and encodes it into a more speaker-discriminative embedding using several further residual blocks. Since the phase information perceptron encodes knowledge from both amplitude and phase, the finally learned speaker embedding should be more robust than the original one.
Experiments
Data set
The experiments were conducted on VoxCeleb1, a large-scale text-independent speaker recognition data set published by Oxford University. VoxCeleb1 contains over 150,000 utterances from 1,251 different celebrities, collected from YouTube. Following the standard setup, 148,642 utterances from 1,211 speakers are used as the training set, while the remaining 4,874 utterances from 40 celebrities are retained for evaluation. The verification trial list contains 37,720 pairs.
To simulate speech in noisy conditions, additive noise and reverberation were added to the evaluation audio. In addition to the standard clean VoxCeleb1 test set, there are four additional test sets with different noise types: babble, music, noise, and reverberation.
Table 1: Detailed configuration of the proposed models. Conv Layer denotes a convolutional layer, ResLayer a residual layer, and Blocks residual blocks; all filter sizes are 3 × 3, and the numbers after a layer give its configuration, e.g., ResLayer(16 → 32, 4 Blocks) denotes a residual layer composed of 4 residual blocks with 16 input channels and 32 output channels. Base denotes the base model; Arch #1 and Arch #2 denote the first and second methods, respectively.
[Table 1 appears in the original as an image; the layer-by-layer configuration is not recoverable from the text extraction.]
System setup
As mentioned above, a ResNet containing 16 residual blocks ({3, 4, 6, 3}) is used as the base model for learning speaker embeddings. For phase-aware speaker embedding learning, the first two convolutional layers and the first residual layer are modified; the detailed configuration is shown in Table 1. All models are 34 layers deep.
As described above, all amplitude and phase features are 257-dimensional, computed with a 25 ms frame length and a 10 ms frame shift. Although the training utterances vary in length, they are all cut to a common length of 400 frames to simplify training. The neural network was trained on 4 GPUs with a batch size of 64. Stochastic gradient descent with a learning rate of 0.01, momentum of 0.9, and weight decay of 1e-4 was used to optimize the model.
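The stated optimizer settings translate into the sketch below; the tiny stand-in model and single synthetic batch (with a reduced batch size) are assumptions that merely make the loop runnable.

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs: a tiny model and one synthetic batch
# (batch size reduced from 64 to 4 purely for this toy example).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1211),          # 1,211 training speakers in VoxCeleb1
)
loader = [(torch.randn(4, 3, 257, 400), torch.randint(0, 1211, (4,)))]

# Optimizer settings as stated above: SGD with lr 0.01, momentum 0.9,
# weight decay 1e-4, and the cross-entropy training criterion.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for feats, speaker_ids in loader:
    logits = model(feats)
    loss = criterion(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```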
Standard PLDA was used to score the learned embeddings. Results are reported as the equal error rate (EER) and the minimum of the normalized detection cost function (minDCF) with Ptar = 0.01 and Cmiss = Cfa = 1.0.
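For reference, the EER reported throughout can be computed from trial scores with a simple threshold sweep; this is a minimal sketch, not the exact scoring tool used in the experiments.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (impostor
    trials accepted) equals the false-rejection rate (target trials
    rejected). labels: 1 = same-speaker trial, 0 = impostor trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for th in np.unique(scores):
        far = np.mean(scores[labels == 0] >= th)  # impostors accepted
        frr = np.mean(scores[labels == 1] < th)   # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy check: well-separated scores give an EER of 0.
print(equal_error_rate([2.0, 1.5, -1.0, -2.0], [1, 1, 0, 0]))  # 0.0
```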
Speaker identity encoded in phase
To explore whether the phase encodes any speaker identity-related knowledge, experiments were first conducted on the VoxCeleb1 data set, training the Base system described in Table 1 on each feature individually.
As shown in Table 2, where Magnitude denotes the amplitude feature and Phase the phase feature, an EER of 5.016% is obtained when amplitude is used as the input feature. Switching the input from the conventional spectral amplitude feature to the phase features yields promising results: EERs of 7.906% and 8.659% are achieved with the sine and cosine phase features, respectively, which are competitive scores, especially considering that the input contains only phase information.
Table 2: performance comparisons of amplitude and phase characteristics.
Feature        EER (%)
Magnitude      5.016
Phase (Sin)    7.906
Phase (Cos)    8.659
[The minDCF column appears only in the original image and is not recoverable.]
Table 3: Performance comparison of the different systems. The first three rows are results of the base model with only the amplitude, the phase sine, or the phase cosine as input. The five test conditions are Clean, Babble, Music, Noise, and Reverb: babble is pure human-voice noise, i.e., multiple people speaking in the background; music is musical noise; noise covers various mixed noises such as animal sounds and car sounds; reverb is reverberation, such as echo. Fusion denotes a score-level fusion of the three base systems. Arch #1 and Arch #2 are the phase-aware noise-robust systems described above.
[Table 3 appears in the original as an image; only the values quoted in the analysis below are recoverable.]
As can be seen from Table 2, a significant portion of the speaker characteristics is retained in the speech phase. Integrating phase knowledge may therefore help learn more speaker-discriminative embeddings, which is the motivation for building the phase-aware deep speaker embedding learning framework.
Results and analysis
The results of the different systems are given in Table 3. Using amplitude as the input feature, an EER of 5.016% is achieved under the clean condition. For noisy test utterances, however, a large performance degradation is found: 13.61%, 9.099%, 12.83%, and 10.52% under the babble, music, noise, and reverberation conditions, respectively. With phase features, performance degrades relative to amplitude features, yet the phase features still retain speaker-discriminative information to a large extent, which motivates our approach. Fusing the amplitude and phase scores achieves the lowest clean-condition EER of 4.979%. Under noisy conditions, however, the improvement from score fusion is not consistent: as Table 3 shows, score fusion achieves EERs of 12.97% and 12.6% under the babble and noise conditions, indicating that incorporating phase information can improve noise robustness, but it fails under the music and reverberation conditions. The reason may be that the fusion system is hampered by the poor performance of the phase-only system in those scenarios.
The proposed phase-aware systems outperform the amplitude-only and fusion systems in noisy environments, while performance under the clean condition remains comparable. Unlike the fusion system, our system learns speaker identity from phase information at the model level, which makes better use of that information to optimize the network. Combining the amplitude and phase features into a 3-channel feature map as input significantly improves the robustness of the speaker verification system, with EERs reduced to 12.77%, 8.977%, and 12.03% under the babble, music, and noise conditions, respectively. Attaching a residual layer to each feature channel can further improve performance, consistently improving the EER under noisy conditions compared to the amplitude-only and fusion systems. This result is reasonable, because the per-channel residual layers extract higher-level representations of the amplitude/phase features and merge them together more effectively.
Fig. 7 shows the DET curves evaluated on the VoxCeleb1 test set under the "babble" noise condition.
Conclusion
In recent years, deep speaker embedding learning has achieved remarkable speaker verification performance. However, it has always operated only on the amplitude feature space and has not used phase information. The embodiments of the present application explore the speaker characteristics encoded in phase features for speaker verification and design a phase-aware speaker embedding learning framework. Two ways of combining the phase are proposed: 1) the amplitude and phase features are combined into a 3-channel feature map whose three channels share one residual layer; 2) each feature is connected to its own residual layer, learned independently, and merged by element-wise addition. Experiments were performed on VoxCeleb1 with four additional noisy test sets. The results show that the proposed framework consistently improves system performance under noisy conditions.
Referring to FIG. 8, a block diagram of a text-independent speaker verification device according to an embodiment of the invention is shown.
As shown in FIG. 8, the text-independent speaker verification device 800 includes an extraction module 810, a processing module 820, a classification module 830, and a verification module 840.
The extraction module 810 is configured to extract amplitude features of the speech to be verified and phase features corresponding to the amplitude features; the processing module 820 is configured to process the amplitude and phase features to obtain phase-aware features; the classification module 830 is configured to perform speaker classification on the phase-aware features to obtain a speaker embedding; and the verification module 840 is configured to perform probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified.
It should be understood that the modules recited in fig. 8 correspond to the steps of the methods described with reference to figs. 1, 2, 3, and 4. The operations and features described above for the method, and the corresponding technical effects, therefore also apply to the modules in fig. 8 and are not repeated here.
It should be noted that the modules in the embodiments of the present application do not limit the scheme of the present application; for example, a template generating module may be described as a module that extracts the Gaussian posterior features of the speech segment corresponding to each word and generates a feature template of the entire enrollment speech from them. The related functional modules may also be implemented by a hardware processor; for example, the template generating module may be implemented by a processor, which is not described again here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can perform the text-independent speaker verification method of any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
extract amplitude features of the speech to be verified and phase features corresponding to the amplitude features;
process the amplitude features and the phase features to obtain phase-aware features;
perform speaker classification on the phase-aware features to obtain a speaker embedding;
and perform probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the text-independent speaker verification device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory and non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, it optionally includes memory located remotely from the processor, connected to the text-independent speaker verification device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform any of the above text-independent speaker verification methods.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 9, the electronic device includes one or more processors 910 and a memory 920, with one processor 910 illustrated in fig. 9. The device for the text-independent speaker verification method may further include an input device 930 and an output device 940. The processor 910, memory 920, input device 930, and output device 940 may be connected by a bus or other means; fig. 9 illustrates a bus connection. The memory 920 is a non-volatile computer-readable storage medium as described above. The processor 910 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 920, thereby implementing the text-independent speaker verification method of the above method embodiments. The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the text-independent speaker verification device. The output device 940 may include a display device such as a display screen.
This product can execute the method provided by the embodiments of the present invention and has the corresponding functional modules and beneficial effects. For technical details not described in this embodiment, refer to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device is applied to a text-independent speaker verification device and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
extract amplitude features of the speech to be verified and phase features corresponding to the amplitude features;
process the amplitude features and the phase features to obtain phase-aware features;
perform speaker classification on the phase-aware features to obtain a speaker embedding;
and perform probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capabilities, with voice and data communication as the primary goal. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., iPads.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Other electronic devices with data interaction functions.
The above-described device embodiments are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the various embodiments or parts thereof.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, without departing from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A text-independent speaker verification method, comprising:
extracting amplitude features of the speech to be verified and phase features corresponding to the amplitude features;
processing the amplitude features and the phase features to obtain phase-aware features;
performing speaker classification on the phase-aware features to obtain a speaker embedding;
performing probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified;
wherein processing the amplitude features and the phase features to obtain the phase-aware features comprises: splicing the amplitude feature, the sine of the phase feature, and the cosine of the phase feature into a three-channel input; and feeding the three-channel input into a deep neural network with a residual network structure, in which the amplitude and phase features are fused through convolutional and residual layers to obtain the phase-aware feature.
2. The method of claim 1, wherein processing the amplitude features and the phase features to obtain the phase-aware features comprises:
passing the amplitude feature, the sine of the phase feature, and the cosine of the phase feature each through a deep neural network with a residual network structure, via an independent convolutional layer and an independent residual layer, to obtain processed features;
and adding the processed features to obtain the phase-aware feature.
3. The method of claim 1 or 2, wherein the amplitude features comprise: perceptual linear prediction, mel-frequency cepstral coefficients, and filter-bank features.
4. The method of claim 3, wherein performing speaker classification on the phase-aware features to obtain the speaker embedding comprises:
performing speaker classification task learning on the phase-aware features through a residual network with a preset number of layers, and, after training of the residual network is finished, inputting audio features containing phase information, an intermediate layer of the residual network outputting the speaker embedding.
5. The method of claim 4, wherein performing probabilistic linear discriminant analysis on the speaker embedding to obtain the speaker verification result for the speech to be verified comprises:
performing probabilistic linear discriminant analysis on the obtained speaker embedding to generate a score;
if the score is greater than or equal to a preset threshold, the speaker of the speech to be verified passes verification;
and if the score is smaller than the preset threshold, the speaker of the speech to be verified fails verification.
6. The method of claim 5, wherein extracting the amplitude features of the speech to be verified and the phase features corresponding to the amplitude features comprises:
segmenting the signal of the speech to be verified into a sequence of frame signals using a sliding window;
and applying a short-time Fourier transform or a fast Fourier transform to the sequence of frame signals, the output of which is the amplitude features and the phase features corresponding to the amplitude features.
7. A text-independent speaker verification apparatus, comprising:
an extraction module configured to extract an amplitude feature of the speech to be verified and a phase feature corresponding to the amplitude feature;
a processing module configured to process the amplitude feature and the phase feature to obtain a phase-aware feature;
a classification module configured to perform speaker classification on the phase-aware feature to obtain a speaker embedding;
a verification module configured to perform probabilistic linear discriminant analysis on the speaker embedding to obtain a speaker verification result for the speech to be verified;
wherein the processing module is further configured to stack the amplitude feature, the sine of the phase feature, and the cosine of the phase feature into a three-channel input, feed the three-channel input into a deep neural network with a residual network structure, and fuse the amplitude feature and the phase feature through a convolutional layer and residual layers to obtain the phase-aware feature.
8. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the steps of the method of any one of claims 1 to 6.
9. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
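
The sketches below are editorial illustrations of claims 1 to 6 and are not part of the claims. All layer sizes, window lengths, thresholds, and identifiers in them are assumptions rather than the patented configuration; they use PyTorch, NumPy, and librosa.

Sketch 1: a minimal PyTorch sketch of the three-channel fusion recited in claim 1, assuming one input convolution followed by a single residual block; the actual network depth is not specified by the claim.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # One basic residual block: two 3x3 convolutions plus a skip connection.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)

class ThreeChannelFusion(nn.Module):
    # Stacks amplitude, sin(phase), and cos(phase) as three input channels,
    # then fuses them through a convolutional layer and a residual layer.
    def __init__(self, channels=32):
        super().__init__()
        self.conv_in = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.res = ResidualBlock(channels)

    def forward(self, amplitude, phase):
        # amplitude, phase: (batch, freq, time) -> stacked (batch, 3, freq, time)
        x = torch.stack([amplitude, torch.sin(phase), torch.cos(phase)], dim=1)
        return self.res(self.conv_in(x))  # phase-aware feature maps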
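
Sketch 2: the alternative fusion of claim 2, in which each of the three inputs passes through its own convolutional and residual layers and the branch outputs are summed. It reuses the ResidualBlock defined in sketch 1.

class BranchSumFusion(nn.Module):
    # One independent conv + residual branch per input; outputs added together.
    def __init__(self, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, channels, kernel_size=3, padding=1),
                          ResidualBlock(channels))
            for _ in range(3)
        ])

    def forward(self, amplitude, phase):
        inputs = [amplitude, torch.sin(phase), torch.cos(phase)]
        outs = [branch(x.unsqueeze(1))  # add the single input-channel dimension
                for branch, x in zip(self.branches, inputs)]
        return outs[0] + outs[1] + outs[2]  # element-wise sum of the branches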
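
Sketch 3: extracting two of the amplitude features named in claim 3 with librosa. PLP is not provided by librosa and is omitted; the file name and analysis parameters are placeholders, not values fixed by the patent.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# 40-dimensional log Mel filterbank features (25 ms window, 10 ms hop at 16 kHz)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160,
                                     win_length=400, n_mels=40)
fbank = np.log(mel + 1e-10)

# 20 MFCCs computed with the same framing
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=512, hop_length=160,
                            win_length=400, n_mfcc=20)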
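
Sketch 4: the embedding extraction of claim 4. The network is trained as a speaker classifier; after training, the output of an intermediate layer (here, the layer just before the classifier) is taken as the speaker embedding. The pooling shape, embedding size, and speaker count are assumptions.

class SpeakerNet(nn.Module):
    def __init__(self, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.frontend = ThreeChannelFusion()       # phase-aware feature maps
        self.pool = nn.AdaptiveAvgPool2d((8, 8))   # handles variable-length input
        self.embedding = nn.Linear(32 * 8 * 8, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, amplitude, phase, return_embedding=False):
        h = self.pool(self.frontend(amplitude, phase)).flatten(1)
        emb = self.embedding(h)          # intermediate-layer output
        if return_embedding:
            return emb                   # speaker embedding, used for verification
        return self.classifier(emb)      # logits, used for classification training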
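
Sketch 5: the score-and-threshold decision of claim 5. A full PLDA model is beyond a short sketch, so cosine similarity stands in for the PLDA log-likelihood-ratio score; the threshold value is illustrative only.

def verify(enroll_emb: np.ndarray, test_emb: np.ndarray, threshold: float = 0.5):
    # Stand-in similarity score; a real system would compute the PLDA
    # log-likelihood ratio between the two embeddings here.
    score = float(np.dot(enroll_emb, test_emb) /
                  (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))
    accepted = score >= threshold  # claim 5: pass iff score >= preset threshold
    return accepted, score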
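
Sketch 6: the sliding-window framing and Fourier analysis of claim 6, producing the amplitude feature and its corresponding phase feature. The window length, hop, and FFT size are typical values, not values fixed by the patent.

def amplitude_and_phase(signal: np.ndarray, frame_len=400, hop=160, n_fft=512):
    # Slice the waveform into overlapping windowed frames ...
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # ... and take the one-sided FFT of each frame.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum), np.angle(spectrum)  # amplitude, phase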
CN201910511775.8A 2019-06-13 2019-06-13 Text-independent speaker verification method and device Active CN110232928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511775.8A CN110232928B (en) 2019-06-13 2019-06-13 Text-independent speaker verification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910511775.8A CN110232928B (en) 2019-06-13 2019-06-13 Text-independent speaker verification method and device

Publications (2)

Publication Number Publication Date
CN110232928A CN110232928A (en) 2019-09-13
CN110232928B true CN110232928B (en) 2021-05-25

Family

ID=67859112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910511775.8A Active CN110232928B (en) 2019-06-13 2019-06-13 Text-independent speaker verification method and device

Country Status (1)

Country Link
CN (1) CN110232928B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12002473B2 (en) * 2020-04-28 2024-06-04 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method, apparatus and device, and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837674A (en) * 2019-11-22 2021-05-25 阿里巴巴集团控股有限公司 Speech recognition method, device and related system and equipment
CN111179959B (en) * 2020-01-06 2022-08-05 北京大学 Competitive speaker number estimation method and system based on speaker embedding space

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160284343A1 * 2013-03-15 2016-09-29 Kevin M. Short Method and system for generating advanced feature discrimination vectors for use in speech recognition
CN106251874A * 2016-07-27 2016-12-21 Voice access control and quiet-environment monitoring method and system
US20180254046A1 * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108694949A * 2018-03-27 2018-10-23 Speaker recognition method and device based on reordered supervectors and residual networks
CN108922546A * 2018-07-06 2018-11-30 Method and device for identifying a speaker's identity
CN109767776A * 2019-01-14 2019-05-17 Spoofed-speech detection method based on dense neural networks
CN109841219A * 2019-03-15 2019-06-04 Method for detecting voice replay-attack spoofing using speech amplitude information and multiple phase features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients; Todisco M et al.; Odyssey; 2016-06-24; full text *
End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention; Tom F et al.; Interspeech 2018; 2018-09-06; full text *
Gammatone filterbank and symbiotic combination of amplitude and phase-based spectra for robust speaker verification under noisy conditions and compression artifacts; Fedila M et al.; Multimedia Tools and Applications; 2017-09-30; vol. 77, no. 13; full text *
Research on Text-Independent Speaker Recognition Technology; Yao Mingqiu; China Master's Theses Full-text Database, Information Science and Technology; 2013-07-15; full text *

Also Published As

Publication number Publication date
CN110232928A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN108922518B (en) Voice data amplification method and system
Barker et al. The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
CN108109613B (en) Audio training and recognition method for intelligent dialogue voice platform and electronic equipment
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
Koluguri et al. SpeakerNet: 1D depth-wise separable convolutional network for text-independent speaker recognition and verification
CN104700843A (en) Method and device for identifying ages
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
CN110706692A (en) Training method and system of child voice recognition model
CN110246489B (en) Voice recognition method and system for children
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN108986798A (en) Processing method, device and the equipment of voice data
CN110232928B (en) Text-independent speaker verification method and device
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN110232927B (en) Speaker verification anti-spoofing method and device
Reimao Synthetic speech detection using deep neural networks
CN113362829B (en) Speaker verification method, electronic device and storage medium
Tuasikal et al. Voice activation using speaker recognition for controlling humanoid robot
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN112634859B (en) Data enhancement method and system for text-related speaker recognition
Hansen et al. Speech under stress and Lombard effect: impact and solutions for forensic speaker recognition
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
CN114360507A (en) Voice recognition network, method, device and medium based on cross-layer connection attention
Hong et al. Generalization ability improvement of speaker representation and anti-interference for speaker verification
Komlen et al. Text independent speaker recognition using LBG vector quantization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Ltd.

GR01 Patent grant