CN111191787A - Training method and device for neural network for extracting speaker embedded features - Google Patents

Training method and device for neural network for extracting speaker embedded features

Info

Publication number
CN111191787A
Authority
CN
China
Prior art keywords
level
segment
layer
speaker
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911391244.6A
Other languages
Chinese (zh)
Other versions
CN111191787B (en)
Inventor
钱彦旻
俞凯
陈正阳
王帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN201911391244.6A
Publication of CN111191787A
Application granted
Publication of CN111191787B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a training method and device for a neural network for extracting speaker embedded features, wherein the neural network comprises a plurality of frame-level layers, a statistical pooling layer and a plurality of segment-level layers, and the method comprises the following steps: receiving and processing an input audio segment via the plurality of frame-level layers; aggregating, via the statistical pooling layer, frame-level spectral features into segment-level spectral features; splitting a first multi-layer linear layer on the basis of the statistical pooling layer to calculate a first channel loss of the segment-level spectral features; merging the segment-level spectral features into utterance-level spectral features via the plurality of segment-level layers and calculating a speaker loss for the utterance-level spectral features; re-splitting a second multi-layer linear layer on the basis of the plurality of segment-level layers to calculate a second channel loss of the utterance-level spectral features; and training the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss. A neural network trained with the scheme of the application can extract channel-independent speaker embedding features.

Description

Training method and device for neural network for extracting speaker embedded features
Technical Field
The invention belongs to the technical field of neural network training, and particularly relates to a training method and a training device for a neural network for extracting speaker embedded features.
Background
The purpose of speaker verification (SV) is to verify a user's claimed identity based on a segment of his or her voice. Recently, deep neural network (DNN) based speaker embedding learning has become the mainstream approach in this field. Researchers have studied different architectures, different loss functions, and different model compensation methods, which have greatly improved the performance of SV systems.
Although deep learning techniques have enjoyed great success in SV research, it is still very difficult to construct SV systems for practical use. It is well known that speaker verification is more vulnerable than speech recognition in terms of system robustness. To improve the robustness of SV systems, two sources of variability need to be addressed: phonetic content and channel variability. For text-independent speaker verification, which requires that two utterances from the same speaker with different phonetic content be grouped together, it is important to cope with phoneme variation in the speaker modeling process. For text-dependent and text-independent speaker verification tasks in the real world, which use different devices and recording environments, system performance drops dramatically due to channel mismatch.
In the related art, channel characteristics or noise that are unrelated to the speaker are generally removed either by front-end processing or by adversarial training.
These techniques all aim, in different ways, to eliminate the differences or noise that different channels introduce into the characteristics of the same speaker.
In the course of implementing the present application, the inventors found that although the speaker-independent characteristics caused by channel differences ultimately need to be eliminated, channel information can still be exploited, and exploiting it can help the neural network extract better acoustic features. None of the above related art makes good use of channel information.
Disclosure of Invention
The embodiment of the invention provides a training method and a training device for a neural network for extracting speaker embedded features, which are used for solving at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a training method for a neural network for extracting speaker embedded features, where the neural network includes a plurality of frame-level layers, a statistical pooling layer, and a plurality of segment-level layers, and the method includes: receiving and processing an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features; aggregating, via the statistical pooling layer, the frame-level spectral features into segment-level spectral features; splitting a first multi-layer linear layer on the basis of the statistical pooling layer for calculating a first channel loss of the segment-level spectral features; merging, via the plurality of segment-level layers, the segment-level spectral features into utterance-level spectral features and calculating a speaker loss for the utterance-level spectral features; re-splitting a second multi-layer linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features; and training the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
In a second aspect, an embodiment of the present invention provides a training apparatus for a neural network for extracting speaker embedded features, wherein the neural network includes a plurality of frame-level layers, a statistical pooling layer and a plurality of segment-level layers, and the apparatus includes: a receive processing module configured to receive and process an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features; an aggregation module configured to aggregate the frame-level spectral features into segment-level spectral features via the statistical pooling layer; a first branching module configured to split a first multi-layer linear layer based on the statistical pooling layer for calculating a first channel loss of the segment-level spectral features; a merging module configured to merge the segment-level spectral features into utterance-level spectral features via the plurality of segment-level layers and to compute a speaker loss for the utterance-level spectral features; a second branching module configured to re-split a second multi-layer linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features; and a training module configured to train the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the training method for a neural network for extracting speaker embedded features according to any embodiment of the invention.
In a fourth aspect, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to execute the steps of the training method for extracting a neural network of speaker-embedded features according to any one of the embodiments of the present invention.
The method and device provided by the application can help the neural network extract better acoustic features while eliminating the influence of channel characteristics on the speaker embedding features, and achieve a better effect than previous methods that directly eliminate the channel characteristics.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a flowchart of a training method for a neural network for extracting speaker embedded features according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a neural network for extracting channel-independent speaker embedding features according to an embodiment of the present invention;
FIG. 3 is a diagram of a network structure combining multitask learning and adversarial training for a text-independent speaker verification task according to an embodiment of the present invention;
FIG. 4 is a diagram of the proposed structure applying channel-level multitask and adversarial training at different positions of the model according to an embodiment of the present invention;
FIG. 5 is a block diagram of a training apparatus for a neural network for extracting speaker embedded features according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which shows a flowchart of an embodiment of a training method for a neural network for extracting speaker-embedded features according to the present application, the training method for a neural network for extracting speaker-embedded features of the present embodiment may be applied to training a neural network for extracting speaker-embedded features, where the neural network may include a plurality of frame-level layers, a statistical pooling layer, and a plurality of segment-level layers.
As shown in fig. 1, in step 101, an input audio segment is received and processed via the plurality of frame-level layers;
aggregating, in step 102, the frame-level spectral features into segment-level spectral features via the statistical pooling layer;
in step 103, splitting a first multi-layer linear layer based on the statistical pooling layer for calculating a first channel loss of the segment-level spectral feature;
in step 104, merging the segment-level spectral features into utterance-level spectral features via the plurality of segment-level layers and calculating a speaker loss for the utterance-level spectral features;
in step 105, re-splitting a second multi-layered linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features;
in step 106, the neural network is trained by controlling the sum of the first channel loss, the second channel loss, and the speaker loss.
In this embodiment, for step 101, the training apparatus of the neural network for extracting speaker embedded features receives and processes the input audio segment via a plurality of frame-level layers in the neural network, wherein the plurality of frame-level layers are used for extracting frame-level spectral features. Thereafter, for step 102, the frame-level spectral features are aggregated into segment-level spectral features via the statistical pooling layer. For step 103, a first multi-layer linear layer is split off on the basis of the statistical pooling layer to calculate a first channel loss of the segment-level spectral features.
Then, for step 104, the segment-level spectral features are merged into utterance-level spectral features via the plurality of segment-level layers, and the speaker loss of the utterance-level spectral features is calculated at that layer. Thereafter, for step 105, a second multi-layer linear layer is split off on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features. Finally, for step 106, the neural network is trained by controlling the sum of the first channel loss, the second channel loss, and the speaker loss.
The method of this embodiment computes different losses by splitting off branches at different layers during audio processing, and channel-independent speaker embedding features can then be trained by controlling these losses.
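For concreteness, the following is a minimal PyTorch sketch of such a branched architecture (the experiments described later state that PyTorch was used). The layer sizes, TDNN depth, and class counts are illustrative assumptions rather than the configuration actually used; only the overall branch structure follows the method described above.

```python
# Minimal sketch of an x-vector-style network with two channel-classification
# branches: one split off after statistics pooling (first channel loss) and one
# split off after the segment-level layers (second channel loss).
# Layer sizes and class counts are illustrative assumptions.
import torch
import torch.nn as nn


class SpeakerChannelNet(nn.Module):
    def __init__(self, feat_dim=40, num_speakers=2000, num_channels=5, emb_dim=256):
        super().__init__()
        # Frame-level layers: a small TDNN realised as dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        # Segment-level layers: linear embedding layers applied after pooling.
        self.segment_layers = nn.Sequential(
            nn.Linear(2 * 1500, emb_dim), nn.ReLU(), nn.BatchNorm1d(emb_dim),
            nn.Linear(emb_dim, emb_dim),
        )
        self.speaker_head = nn.Linear(emb_dim, num_speakers)
        # First multi-layer linear branch, split off from the pooled statistics.
        self.channel_branch_1 = nn.Sequential(
            nn.Linear(2 * 1500, 256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, num_channels),
        )
        # Second multi-layer linear branch, split off from the segment-level output.
        self.channel_branch_2 = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, num_channels),
        )

    def forward(self, x):            # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)     # frame-level features
        # Statistics pooling: concatenate mean and std over the time axis.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        channel_logits_1 = self.channel_branch_1(stats)
        emb = self.segment_layers(stats)            # utterance-level embedding
        speaker_logits = self.speaker_head(emb)
        channel_logits_2 = self.channel_branch_2(emb)
        return speaker_logits, channel_logits_1, channel_logits_2
```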
In some optional embodiments, the method further comprises: inserting a gradient inversion layer before the first multi-layer linear layer for adversarial training; and/or inserting a gradient inversion layer before the second multi-layer linear layer for adversarial training. For example, a gradient inversion layer may be inserted just before the second multi-layer linear layer to help the network eliminate channel information during adversarial training, thereby extracting channel-independent speaker embedding features. Alternatively, the gradient inversion layer may be inserted only before the first multi-layer linear layer, or before both the first and the second multi-layer linear layers.
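As an illustration, a gradient inversion (gradient reversal) layer can be implemented in PyTorch as an identity in the forward pass whose gradient is negated in the backward pass. The sketch below is an assumption-level example: the scaling factor lambd and the branch sizes are chosen arbitrarily and are not taken from the text.

```python
# Sketch of a gradient reversal layer (GRL): identity in the forward pass,
# negated (and optionally scaled) gradient in the backward pass.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the preceding layers.
        return -ctx.lambd * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambd=1.0):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        return GradReverse.apply(x, self.lambd)


# Example: prepend the GRL to the second channel branch so that minimising the
# channel loss pushes the embedding towards channel independence.
adversarial_channel_branch = nn.Sequential(
    GradientReversalLayer(lambd=1.0),
    nn.Linear(256, 256), nn.ReLU(), nn.BatchNorm1d(256),
    nn.Linear(256, 5),   # assumed number of channel classes
)
```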
In fact, the inventors conducted two separate experiments: one with the branch for the speaker loss plus channel loss 1 (corresponding to the second channel loss), with the branch for channel loss 2 (corresponding to the first channel loss) removed, and one with the branch for the speaker loss plus channel loss 2, with the channel loss 1 branch removed.
In these two separate experiments, it was found to be better to insert a gradient inversion layer before channel loss 1 and to add no gradient inversion before channel loss 2.
We therefore expect that, when all three losses coexist, the combination of the speaker loss, channel loss 1 with a gradient inversion layer inserted, and channel loss 2 without a gradient inversion layer is also better, and this is indeed the case.
Since the separate experiment with the speaker loss plus channel loss 2 had already verified that channel loss 2 works better without gradient inversion, the combination of the speaker loss, channel loss 1 with a gradient inversion layer, and channel loss 2 with a gradient inversion layer was not tested when the three losses were combined. We speculate that its results would not be the best, but also not too bad.
Therefore, in the shallow part of the network, the scheme of the embodiment can help the network to learn the information of the channel, so that better acoustic features can be extracted, and in the embedding feature level of the speaker, namely the deep layer of the neural network, the scheme of the embodiment can help the neural network to eliminate the features of the channel, so that the embedding features of the speaker irrelevant to the channel are extracted.
In some optional embodiments, the first channel loss and the second channel loss are calculated using cross entropy.
In other alternative embodiments, the speaker loss is calculated using an additive angular margin loss. This additive angular margin loss imposes a more stringent constraint that forces the similarity of the correct class to be greater than the similarities of the incorrect classes by a margin m.
Further optionally, the plurality of frame-level layers include a time-delay neural network feature extractor, and the plurality of segment-level layers include a linear embedding layer. Therefore, the TDNN (Time Delay neural network) feature extractor is adopted, and the frame-level features can be extracted better. The speaker embedding features can be better extracted by utilizing the linear embedding layer.
Further optionally, the neural network comprises a deep neural network; the deep neural network serves as the main method for speaker embedding learning and can better extract speaker embedding features.
The inventors have found that the drawbacks of the prior art stem from the way these techniques are implemented: they aim to eliminate channel differences and do not realize that channel information can be utilized.
Those skilled in the art face this deficiency because the characteristics of the channel are detrimental to the final speaker embedding feature, so the common practice is simply to eliminate the effects of the channel characteristics.
Since the characteristics of the channel are detrimental to the final speaker embedding feature, eliminating them is the most straightforward idea; the scheme of the present application, which exploits channel information, is therefore not easily conceivable.
In the scheme of the embodiment of the application, when the neural network is used for extracting the speaker embedded features, the neural network can be helped to learn the channel information at the bottom layer of the network, so that the network can learn better acoustic features. Then, at the speaker embedding feature level, the network is helped to eliminate the characteristics related to the channel, so that the extracted speaker features are independent of the channel.
FIG. 2 illustrates a flow diagram of a neural network for extracting channel independent speaker embedding features.
As shown in fig. 2, a head for classifying channels is first introduced at the statistical pooling layer, i.e. in a shallow part of the neural network, to generate channel loss 2, which helps the network learn channel information and thereby extract better acoustic features.
Channel loss 1 is also generated by classifying channels, but a gradient inversion layer is added in front of it to help the network eliminate channel information, thereby extracting channel-independent speaker embedding features.
Speaker loss helps the network learn speaker related information.
That is, in the shallow part of the neural network, our design will help the network to learn the information of the channel, so as to extract better acoustic features, and in the speaker embedded feature level (deep layer of the neural network), our design will help the neural network to eliminate the characteristics of the channel, so as to extract the speaker embedded features irrelevant to the channel.
The embodiment of the application can directly achieve the following effects: the method can help the neural network to extract better acoustic characteristics and eliminate the influence of the channel characteristics on the embedding characteristics of the speaker, and has better effect than the previous method for directly eliminating the channel characteristics.
The embodiment of the application can realize the deeper effect: from this experiment we can conclude that even though some acoustic properties are not useful in the final task, we can still use this existing information to help the network learn better acoustic features.
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
Recently, much work has been done to mitigate interfering properties in certain tasks. Adversarial training has been used in the speech domain to suppress speaker information in speech recognition, and, for example, domain adversarial training has been used for speaker anti-spoofing. GAN- and GRL-based strategies with adversarial training have also been used to cope with channel variations in SV tasks.
All of the above prior art aims at eliminating interfering information in the main task, and adversarial training is the most common method. However, it should be appreciated that there are two methods that can take advantage of existing interfering information, namely multitask learning and adversarial learning. Our previous work shows the possibility of combining the two approaches to better exploit speech information in text-independent SV tasks: speech information is encouraged in the early frame-level layers, while such information is suppressed in the later speaker embedding layers.
In the present application example, we follow a similar idea to the previous work. We assume that even if we wish to obtain channel-independent embeddings, the available channel information can be used for better generic acoustic feature learning in the shallow model layers and then suppressed in the later speaker embedding. As our experiments demonstrate, it is beneficial to apply multitask learning before the embedding extraction layer and adversarial training at the speaker embedding layer. When combining multitask learning with adversarial training, two training strategies were designed, including a joint mode and a progressive mode. Experiments were performed on a wake-up-word-based TD-SV data set. The best systems achieve a relative improvement of 10.77% and 9.37% when using the recording environment and the recording device as channel information, respectively.
Related work
Deep Neural Networks (DNNs) are known for their powerful modeling capabilities and flexibility, and DNN-based speaker embedding has become the dominant speaker recognition method. The x-vector is a typical method used by many researchers. In the x-vector framework, a time-delay neural network (TDNN) is trained to distinguish the different speakers in the training data. The frame-level spectral features first go through several frame-level layers, followed by a statistics pooling layer that aggregates the frame-level representations into a single segment-level representation. One or more embedding layers are then included among the utterance-level layers to extract the speaker embedding. In our experiments, an x-vector extractor is used as the baseline and as the backbone of our proposed framework.
Multitask and adversarial training
Our previous work has successfully combined multitask learning and adversarial training to better use speech information in a text-independent speaker verification task. The whole system is shown in fig. 3.
FIG. 3 illustrates a structure that combines multitask learning and adversarial training to better utilize phonetic knowledge in a text-independent speaker verification task.
The main idea is to integrate speech information in the shallow layers of the model, which facilitates generic feature learning, and to suppress phoneme variation in the final speaker embedding layer. As shown in fig. 3, in addition to the main speaker branch of the original x-vector framework, our proposed architecture also includes a frame-level multitask phoneme branch and a segment-level adversarial phoneme branch. By combining the three branch supervisory signals, we observe a significant improvement in performance on the text-independent speaker verification task.
Description of the model
Inspired by the success of the previous work, we want to employ a similar strategy to learn channel-independent speaker embeddings, where channel refers to the recording device and environment in our experiments. With this new architecture, we wish to enhance the learning of channel variability across different training utterances in the shallow layers of the neural network, and then suppress it in the later embedding layers, ultimately yielding better channel-independent speaker embeddings.
Model architecture
Unlike speech information, channel information is at the segment level, so unlike fig. 3, we will explore two locations for multitask and adversarial learning at the segment level.
Fig. 4 shows the proposed structure applying channel-level multitask and adversarial training at different positions of the model. For the first type, we do not perform the multitask/adversarial training on the output of the last frame-level layer, but rather split off the branch after the statistics pooling layer. The second type is the same as in fig. 3, applied directly on the embedding layer. When adversarial training is employed, a gradient reversal layer (GRL) is inserted into the normal multitask branch to reverse the sign of the computed gradient.
Loss function
For an input segment x with speaker label y_s and channel label y_c, the total loss optimized by the model is the sum of the speaker loss L_s and the channel losses L_{c1} and L_{c2}:

L_{total} = L_s + L_{c1} + L_{c2}

L_{c1} and L_{c2} denote the channel losses of the branches inserted at the x-vector embedding layer and at the statistics pooling layer, respectively. Cross entropy is used for the channel classification branches:

L_{c_i} = -\sum_{k=1}^{l} \mathbb{1}(y_c = k)\,\log p_k^{(i)}, \quad i \in \{1, 2\},

where l denotes the number of channel classes and p_k^{(i)} is the posterior probability of channel class k predicted by branch i.

For the speaker classification module, the key component of the model, we use the recently proposed additive angular margin loss as our main speaker loss. The additive angular margin loss imposes a more stringent constraint that forces the similarity of the correct class to exceed the similarities of the incorrect classes by a margin m:

L_s = -\log \frac{e^{s\,\cos(\theta_{y_s} + m)}}{e^{s\,\cos(\theta_{y_s} + m)} + \sum_{j=1,\, j \neq y_s}^{n} e^{s\,\cos\theta_j}},

where \cos\theta_j = \hat{w}_j^{\top}\hat{x}, \hat{x} is the normalized output of the second segment-level linear layer of the x-vector architecture, \hat{w}_j is the normalized j-th column of the weight matrix W, and n represents the number of speakers. The additive angular margin loss also introduces a scale parameter s, which helps the model converge faster. For all experiments we chose m = 0.2 and s = 30.
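For readers who prefer code, the following is a minimal PyTorch sketch of an additive angular margin loss consistent with the formula above. The weight initialization and numerical clamping details are assumptions; only m = 0.2 and s = 30 are taken from the text.

```python
# Sketch of an additive angular margin (AAM) speaker loss.
# Initialization and clamping details are assumptions; m = 0.2 and s = 30
# are the values stated in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AAMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        # cos(theta_j) = w_j_hat . x_hat
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target class.
        target_mask = F.one_hot(labels, cosine.size(1)).bool()
        cosine_m = torch.cos(theta + self.m)
        logits = torch.where(target_mask, cosine_m, cosine) * self.s
        return F.cross_entropy(logits, labels)
```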
Training strategy
Assuming that channel information facilitates generic feature learning in the shallow layers of the model, but that this information is not required for the final speaker embedding, we investigated two different training strategies to combine multitask learning and adversarial training and obtain the final channel-independent speaker embedding.
Joint multi-task adversarial training
In this strategy, the entire architecture and all parameters are optimized simultaneously using multitask learning and adversarial training, and the three loss functions are used jointly for model training. Multitask learning is applied at the statistics pooling layer and adversarial training is applied at the embedding layer.
Progressive multi-task adversarial training
In this case, we divide the model optimization into two phases: in the first phase, multitask learning is applied for several training epochs; the multitask learning branch is then discarded and the adversarial training branch is used in the second phase.
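The sketch below illustrates, under stated assumptions, how the joint and progressive strategies could be organised around a model that returns the speaker logits and the two channel logits (as in the architecture sketch earlier). Plain cross entropy stands in for the additive angular margin speaker loss, and the number of epochs per phase is an arbitrary choice.

```python
# Sketch of the joint and progressive multi-task adversarial training strategies.
# Assumes model(feats) returns (speaker_logits, channel_logits_1, channel_logits_2),
# where the second channel branch is preceded by a gradient reversal layer.
import torch.nn.functional as F


def train_epoch(model, loader, optimizer, use_multitask=True, use_adversarial=True):
    model.train()
    for feats, spk_label, chn_label in loader:
        spk_logits, chn_logits_1, chn_logits_2 = model(feats)
        loss = F.cross_entropy(spk_logits, spk_label)            # speaker loss
        if use_multitask:                                        # branch at the pooling layer
            loss = loss + F.cross_entropy(chn_logits_1, chn_label)
        if use_adversarial:                                      # GRL branch at the embedding layer
            loss = loss + F.cross_entropy(chn_logits_2, chn_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Joint mode: all three losses are optimised simultaneously in every epoch.
def train_joint(model, loader, optimizer, num_epochs=10):
    for _ in range(num_epochs):
        train_epoch(model, loader, optimizer, use_multitask=True, use_adversarial=True)


# Progressive mode: multitask learning first, then drop it and keep only the
# adversarial branch (the number of epochs per phase is an assumption).
def train_progressive(model, loader, optimizer, epochs_phase1=5, epochs_phase2=5):
    for _ in range(epochs_phase1):
        train_epoch(model, loader, optimizer, use_multitask=True, use_adversarial=False)
    for _ in range(epochs_phase2):
        train_epoch(model, loader, optimizer, use_multitask=False, use_adversarial=True)
```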
Experimental setup
Data set
The embodiment of the application uses a data set based on wake-up words. The average duration of all segments is about 1.0 second. Each speaker is required to repeat the wake-up word using a particular device in different environments. The data set is labeled with the device, but the environment label is not well annotated for all devices. We selected 1.6M utterances from 2k different speakers as the training set. To produce more training data, utterances recorded in quiet environments were augmented with noise from the MUSAN dataset, resulting in a 5.2M-utterance training set.
For experiments using channel information, we consider the different recording devices and environments as the available channel information. The environment represents the scene where recordings are collected, such as quiet, office and car. Furthermore, we also treat the augmentation noise as a separate environment type. Experiments are performed using either the device or the environment as channel information.
Data preparation using device type as channel information
For experiments that consider the device type as channel variability, all of the above training data was used for training, covering a total of 5 device types. Another 20543 utterances from 94 speakers, not included in the training set, were used for enrollment and testing. For each speaker, we selected 4 clean utterances as enrollment data and used the remaining utterances of that speaker to generate the target trials. In addition, for each enrolled speaker, all utterances of the other speakers were used to generate non-target trials. Finally, we obtained 20167 target and 636798 non-target trials.
Table 1 shows the basic x-vector extractor configuration. (The layer-by-layer configuration table is provided as an image in the original publication and is not reproduced here.)
Data preparation using recording environment category as channel information
Since the environment label is not consistent across all devices, we only performed this experiment on the data recorded by certain specific devices. We selected the data recorded by two devices from all training data for experimentation (hereinafter referred to as the Device1 and Device2 datasets, respectively). The number of environment types in both the Device1 and Device2 datasets is 6. The Device1 dataset consists of 352 speakers and 594583 utterances, while the Device2 dataset consists of 512 speakers and 841450 utterances.
The present embodiment uses the same strategy as before to generate test trials. Finally, the Device1 test set contained 35 speakers, 8732 target trials and 324888 non-target trials. Device2 test set contained 29 speakers, 5555 target trials and 158788 non-target trials. The results on both devices will be reported separately.
System configuration
The basic speaker embedding extractor is an x-vector system with fewer parameters than the original one; a more detailed configuration is provided in Table 1. All architectures are implemented using PyTorch. The 40-dimensional Fbank features were extracted using the Kaldi toolkit, and silence frames were removed using an energy-based voice activity detector. The extracted embeddings are first length-normalized and then scored using PLDA.
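As a hedged sketch of this front-end, the snippet below uses torchaudio's Kaldi-compatible Fbank extraction, a crude energy-based voice activity detector, and length normalization. The VAD threshold is an assumption, and cosine similarity is shown only as a simple stand-in for the PLDA scoring actually used.

```python
# Sketch of the front-end described above: 40-dimensional Fbank features,
# a simple energy-based VAD, and length-normalised embeddings.
import torch
import torchaudio


def extract_fbank(wav_path):
    waveform, sr = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=40, sample_frequency=sr)      # (frames, 40)
    # Energy-based VAD: keep frames whose mean log-mel value (a rough energy
    # proxy) exceeds an assumed threshold.
    energy = feats.mean(dim=1)
    voiced = energy > (energy.mean() - 0.5 * energy.std())
    return feats[voiced]


def length_normalize(embedding):
    return embedding / embedding.norm(p=2)


def score(enroll_emb, test_emb):
    # The experiments use PLDA scoring; cosine similarity is shown here
    # only as a minimal stand-in.
    return torch.dot(length_normalize(enroll_emb), length_normalize(test_emb))
```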
Baseline system
Our baseline system is a plain x-vector, whose structure is shown in Table 1, and only the speaker classification loss is used as the optimization objective. The margin m of the additive angular margin loss is increased linearly from 0.0 to 0.2 over the training iterations. We use the SGD optimizer to train our network and set the momentum and learning rate to 0.9 and 0.0001, respectively.
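A small sketch of this optimisation set-up is given below; the total number of iterations used for the linear margin warm-up is an assumed placeholder, since the text does not state it.

```python
# Sketch of the baseline optimisation set-up: SGD with momentum 0.9 and
# learning rate 1e-4, and a linear warm-up of the AAM margin m from 0.0 to 0.2.
import torch


def make_optimizer(model):
    return torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)


def margin_at_iteration(step, total_steps):
    # Linear increase of the margin from 0.0 to 0.2 over training;
    # total_steps is an assumption, not a value given in the text.
    return 0.2 * min(1.0, step / max(1, total_steps))
```

With the AAMSoftmaxLoss sketch above, one would set speaker_criterion.m = margin_at_iteration(step, total_steps) at each training step.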
The proposed system
The channel classification block consists of three linear layers with batch normalization layers inserted. The three linear layers have the size (input dim) x (number of channel classes). When a multitask or adversarial branch is added to our baseline network, the speaker and channel classification tasks are jointly trained from scratch.
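A possible realisation of this channel classification block is sketched below; the hidden width and the exact placement of the batch normalization and activations are assumptions, since the text only specifies three linear layers with batch normalization and the overall (input dim) x (number of channel classes) mapping.

```python
# Sketch of the channel classification block: three linear layers with batch
# normalisation inserted between them. Hidden width and activation placement
# are assumptions.
import torch.nn as nn


def make_channel_classifier(input_dim, num_channel_classes, hidden_dim=256):
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, num_channel_classes),
    )
```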
Results and discussion
Using environmental tags as channel information
Exploring environmental information at different locations of a model
We first explore the proper locations of the multitask and adversarial branches. As shown in Table 2, performing adversarial training at the embedding layer improves the performance of the SV task; furthermore, multi-task training at the statistics pooling layer gives better results than adversarial training there, which verifies our previous assumption that channel information can contribute to generic feature learning in the shallower model layers.
Table 2 shows a comparison of the results of multitask or adversarial training performed at different locations of the model using environment information, where STA-MT and STA-ADV denote multitask or adversarial training performed at the statistics pooling layer, and EMB-MT and EMB-ADV denote the corresponding learning performed at the embedding layer.
(The per-system, per-dataset EER (%) values of Table 2 are provided as an image in the original publication and are not reproduced here.)
Joint and progressive multi-task confrontation training using environmental information
The preceding experimental results show that encouraging environment information in the shallow layers of the model (namely at the statistics pooling layer) and suppressing it in the later embedding layer improves model performance. We then combine multitask learning and adversarial training in the single architecture presented herein. The two training strategies were run and compared, with the results shown in Table 3. It can be seen that the proposed new architecture brings further improvement and is consistently better under all conditions. For both training strategies, the progressive mode appears to be better than the joint mode. On average, the optimal system achieves a relative improvement of 10.77% over the baseline.
Table 3 shows a comparison of the two training strategies using environment information for the proposed architecture, where JOINT denotes the joint multi-task adversarial training mode and PROGRESSIVE denotes the progressive multi-task adversarial training mode.
Table 3. Comparison of the two training strategies using environment information for the proposed architecture. (The EER values are provided as an image in the original publication and are not reproduced here.)
Using device tags as channel information
In the present embodiment, we describe the results obtained when the device tag is used as channel information. In this embodiment, similar experiments as in the previous embodiments will be performed, and the device tag for each utterance will be used as a channel tag instead of an environment tag.
Table 4. Comparison of different systems using device information.

System        EER (%)
Baseline      4.27
EMB-MT        4.12
STA-MT        4.09
EMB-ADV       4.10
STA-ADV       4.23
JOINT         3.93
PROGRESSIVE   3.87
The results of multitask training and adversarial training using device information are shown in Table 4. As can be seen from the middle part of the table, it is preferable to place multitask training at the statistics pooling layer and adversarial training at the embedding layer, matching the conclusion of the previous embodiment. Furthermore, as in the previous embodiments, the architecture that integrates multitask learning and adversarial training brings further improvement. For both training strategies, the progressive mode is still slightly better, with a relative improvement of 9.37% over the baseline.
From the above results, consistent observations can be obtained in both environment-based and device-based channel-independent training. It demonstrates the effectiveness of our proposed new framework and better channel independent speaker embedding can be achieved by the new approach.
In the embodiment of the application, a framework combining multitask and adversarial training at different positions of the model is provided, so as to better utilize channel information. Two independent experiments were performed to validate the proposed model, considering different devices or recording environments as the channel labels. Consistent performance improvements were observed under both experimental conditions. The results show that enhancing channel information in the shallower layers of the model helps generic feature learning, while suppressing such information in the higher layers helps to learn better channel-independent speaker embeddings. Two training strategies were designed to optimize the entire model, and the new framework achieves better performance. The progressive learning mode is slightly better than the joint learning mode. Compared to the baseline, the optimal system achieves a relative improvement in EER of about 10.0% for both the environment and device labels.
Referring to fig. 5, a block diagram of a training apparatus for a neural network for extracting speaker-embedded features is shown, wherein the neural network includes a plurality of frame-level layers, a statistical pooling layer and a plurality of segment-level layers.
As shown in fig. 5, the training apparatus 500 for extracting a neural network of speaker-embedded features includes a receiving processing module 510, an aggregation module 520, a first branch module 530, a merging module 540, a second branch module 550, and a training module 560.
Wherein the receiving processing module 510 is configured to receive and process the input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used for extracting frame-level spectral features; the aggregation module 520 is configured to aggregate the frame-level spectral features into segment-level spectral features via the statistical pooling layer; the first branching module 530 is configured to split a first multi-layer linear layer based on the statistical pooling layer for calculating a first channel loss of the segment-level spectral features; the merging module 540 is configured to merge the segment-level spectral features into utterance-level spectral features via the plurality of segment-level layers and to calculate a speaker loss for the utterance-level spectral features; the second branching module 550 is configured to further split a second multi-layer linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features; and the training module 560 is configured to train the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
In some optional embodiments, the apparatus further comprises: a gradient inversion layer inserted before the first multi-layer linear layer for adversarial training; and/or a gradient inversion layer inserted before the second multi-layer linear layer for adversarial training.
It should be understood that the modules recited in fig. 5 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 5, and are not described again here.
It should be noted that the modules in the embodiments of the present application are not intended to limit the scheme of the present application, and for example, the receiving processing module may be described as a module that receives and processes the input audio clip via the plurality of frame-level layers. In addition, the related functional modules may also be implemented by a hardware processor, for example, the receiving module may also be implemented by a processor, which is not described herein again.
In still other embodiments, an embodiment of the present invention further provides a non-transitory computer storage medium, where computer-executable instructions are stored, where the computer-executable instructions may perform a training method for a neural network for extracting speaker-embedded features in any of the above method embodiments, where the neural network includes a plurality of frame-level layers, a statistical pooling layer, and a plurality of segment-level layers;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
receiving and processing an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features;
aggregating, via the statistical pooling layer, the frame-level spectral features into segment-level spectral features;
splitting a first multilayer linear layer on the basis of the statistical pooling layer for calculating a first channel loss of the segment-level spectral feature;
merging, via the plurality of segment-level layers, the segment-level spectral features into utterance-level spectral features and calculating a speaker loss for the utterance-level spectral features;
re-splitting a second multi-layered linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features;
training the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of a training device that extracts a neural network of speaker-embedded features, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, and these remote memories may be connected over a network to a training device of a neural network that extracts speaker-embedded features. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any one of the above-mentioned methods for training a neural network for extracting speaker-embedded features.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device includes: one or more processors 610 and a memory 620, with one processor 610 being an example in fig. 6. The device for the training method of the neural network for extracting the speaker embedded feature may further include: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means, such as the bus connection in fig. 6. The memory 620 is a non-volatile computer-readable storage medium as described above. The processor 610 executes various functional applications of the server and data processing by running nonvolatile software programs, instructions and modules stored in the memory 620, namely, implementing the training method of the neural network for extracting speaker embedded features of the above method embodiments. The input device 630 may receive input numeric or character information and generate key signal inputs related to user settings and functional controls of the training device of the neural network that extracts speaker-embedded features. The output device 640 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied in a training apparatus of a neural network for extracting speaker-embedded features, wherein the neural network includes a plurality of frame-level layers, a statistical pooling layer and a plurality of segment-level layers, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
receiving and processing an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features;
aggregating, via the statistical pooling layer, the frame-level spectral features into segment-level spectral features;
splitting a first multilayer linear layer on the basis of the statistical pooling layer for calculating a first channel loss of the segment-level spectral feature;
merging, via the plurality of segment-level layers, the segment-level spectral features into utterance-level spectral features and calculating a speaker loss for the utterance-level spectral features;
re-splitting a second multi-layered linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features;
training the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.
(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a neural network for extracting speaker-embedded features, wherein the neural network comprises a plurality of frame-level layers, a statistical pooling layer, and a plurality of segment-level layers, the method comprising:
receiving and processing an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features;
aggregating, via the statistical pooling layer, the frame-level spectral features into segment-level spectral features;
splitting a first multilayer linear layer on the basis of the statistical pooling layer for calculating a first channel loss of the segment-level spectral feature;
merging, via the plurality of segment-level layers, the segment-level spectral features into utterance-level spectral features and calculating a speaker loss for the utterance-level spectral features;
re-splitting a second multi-layered linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features;
training the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
2. The method of claim 1, wherein the method further comprises: inserting a gradient inversion layer before the first multi-layer linear layer for adversarial training; and/or
inserting a gradient inversion layer before the second multi-layer linear layer for adversarial training.
3. The method of claim 1, wherein the first channel penalty and the second channel penalty comprise using cross entropy calculations.
4. The method of claim 1, wherein the speaker loss comprises using an additional angular margin loss calculation.
5. The method of any of claims 1-4, wherein the plurality of frame-level layers comprise a time-delay neural network feature extractor and the plurality of segment-level layers comprise a linear embedding layer.
6. The method of claim 5, wherein the neural network comprises a deep neural network.
7. A training apparatus for a neural network for extracting speaker-embedded features, wherein the neural network includes a plurality of frame-level layers, a statistical pooling layer, and a plurality of segment-level layers, the apparatus comprising:
a receive processing module configured to receive and process an input audio segment via the plurality of frame-level layers, wherein the plurality of frame-level layers are used to extract frame-level spectral features;
an aggregation module configured to aggregate the frame-level spectral features into segment-level spectral features via the statistical pooling layer;
a first branching module configured to further split a first multi-layered linear layer based on the statistical pooling layer for calculating a first channel loss of the segment-level spectral feature;
a merging module configured to merge the segment-level spectral features into utterance-level spectral features via the plurality of segment-level layers and to compute a speaker loss for the utterance-level spectral features;
a second branching module configured to re-split a second multi-layered linear layer on the basis of the plurality of segment-level layers for calculating a second channel loss of the utterance-level spectral features;
a training module configured to train the neural network by controlling a sum of the first channel loss, the second channel loss, and the speaker loss.
8. The apparatus of claim 7, further comprising:
a gradient inversion layer inserted before the first multi-layer linear layer for adversarial training; and/or
a gradient inversion layer inserted before the second multi-layer linear layer for adversarial training.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.
10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 6.
CN201911391244.6A 2019-12-30 2019-12-30 Training method and device of neural network for extracting speaker embedded features Active CN111191787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911391244.6A CN111191787B (en) 2019-12-30 2019-12-30 Training method and device of neural network for extracting speaker embedded features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911391244.6A CN111191787B (en) 2019-12-30 2019-12-30 Training method and device of neural network for extracting speaker embedded features

Publications (2)

Publication Number Publication Date
CN111191787A true CN111191787A (en) 2020-05-22
CN111191787B CN111191787B (en) 2022-07-15

Family

ID=70707899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911391244.6A Active CN111191787B (en) 2019-12-30 2019-12-30 Training method and device of neural network for extracting speaker embedded features

Country Status (1)

Country Link
CN (1) CN111191787B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN108766440A (en) * 2018-05-28 2018-11-06 平安科技(深圳)有限公司 Speaker's disjunctive model training method, two speaker's separation methods and relevant device
CN110222841A (en) * 2019-06-17 2019-09-10 苏州思必驰信息科技有限公司 Neural network training method and device based on spacing loss function

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816167A (en) * 2020-07-22 2020-10-23 苏州思必驰信息科技有限公司 Speaker embedding learning method, speaker identification method and system
CN111816167B (en) * 2020-07-22 2022-08-26 思必驰科技股份有限公司 Speaker embedding learning method, speaker identification method and system
CN111863003A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Voice data enhancement method and device
CN113129900A (en) * 2021-04-29 2021-07-16 科大讯飞股份有限公司 Voiceprint extraction model construction method, voiceprint identification method and related equipment

Also Published As

Publication number Publication date
CN111191787B (en) 2022-07-15


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant