CN114357414B - Emotion speaker authentication method based on cross-gradient training - Google Patents

Emotion speaker authentication method based on cross-gradient training

Info

Publication number
CN114357414B
CN114357414B (application CN202111483807.1A)
Authority
CN
China
Prior art keywords
emotion
speaker
classification
training
network
Prior art date
Legal status
Active
Application number
CN202111483807.1A
Other languages
Chinese (zh)
Other versions
CN114357414A (en)
Inventor
贺前华
危卓
田颖慧
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111483807.1A priority Critical patent/CN114357414B/en
Publication of CN114357414A publication Critical patent/CN114357414A/en
Application granted granted Critical
Publication of CN114357414B publication Critical patent/CN114357414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses an emotion speaker authentication method based on cross-gradient training, which comprises the following steps: building a network model based on an x-vector system combined with multi-task learning; extracting acoustic features from the training speech; randomly selecting a group of feature sequences of training speech samples as network input, performing emotion classification and speaker classification simultaneously, and adjusting the network parameters through the joint loss of the two tasks; updating the feature sequences using the loss function of the emotion classification part; performing cross-gradient training and readjusting the network parameters of the speaker classification part; and, after network training is finished, setting an authentication threshold to authenticate speakers. Aiming at the performance degradation of speaker authentication systems when the emotion of the enrollment speech does not match that of the test speech, the invention combines multi-task learning with cross-gradient training to expand the emotion information of the training data, improving speaker authentication performance on emotional speech and alleviating overfitting on training sets with a small amount of data.

Description

Emotion speaker authentication method based on cross-gradient training
Technical Field
The invention relates to the technical field of biometric recognition, and in particular to an emotion speaker authentication method based on cross-gradient training.
Background
At present, many biometric identification technologies are relatively mature. Common biometric features include speech, fingerprints, retina, iris, face, signature, etc. Among them, speech is the most natural and one of the most direct ways people communicate in daily life. Speech signals also have the advantages of convenient collection, low cost and multiple acquisition channels. In a laboratory environment the performance of speaker authentication is close to ideal, but many problems remain in practical applications; for example, emotional factors can significantly degrade the performance of a speaker authentication system. During speaker enrollment, neutral speech is generally used for the sake of user friendliness, but during authentication the speaker may not be in a neutral state on certain occasions or at certain moments, so the emotion mismatch between the enrollment speech and the speech to be authenticated leads to a rapid drop in system performance.
At present, there are two main classes of methods for emotional speaker authentication. One is based on traditional machine learning methods such as GMM, GMM-UBM, HMM and i-vector, for example the paper "Atom Aligned Sparse Representation Approach for Indonesian Emotional Speaker Recognition System" by Kusuma et al. (2020 7th International Conference on Advanced Informatics: Concepts, Theory and Applications, 2020). These methods generally build a model for each specific emotion, so in practical applications not only must multiple sets of model parameters be stored, but the speech must also be accurately classified by emotion; once the emotion recognition accuracy is low, the speaker authentication performance is also affected. The other class is based on deep neural networks such as AANN, CNN and RNN, for example Meftah et al., who used an x-vector system for speaker authentication in the paper "X-vectors Meet Emotions: A Study on Dependencies Between Emotion and Speaker Recognition" (ICASSP 2020). These methods generally apply a common network model from the speaker authentication task directly, and when the test speech is emotional, the generalization ability of the speaker authentication system is insufficient.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides an emotion speaker authentication method based on cross-gradient training. Because speaker information and emotion information are partially correlated and cannot be separated directly, the method uses multi-task learning to model the speaker's voiceprint information comprehensively together with emotion information, and then takes task-dependent uncertainty into account to balance the loss values of the emotion classification and speaker classification tasks; at the same time, based on the continuity of the spatial distribution of speech features, an emotion-domain perturbation is added to the speech samples to enrich the emotion information.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The invention provides an emotion speaker authentication method based on cross-gradient training, which comprises the following steps:
based on an x-vector system architecture, building a network model for extracting speaker voiceprint features using a TDNN-BiLSTM network combined with multi-task learning;
extracting acoustic features from the speech samples required for network training;
randomly selecting emotionally colored speech samples and inputting them into the network model for training, performing the emotion classification and speaker classification training tasks, balancing the losses of the two training tasks, constructing the joint loss for multi-task learning, and adjusting the parameters of the network model;
performing cross-gradient training, namely obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample, inputting the new feature sequence into the network model, and adjusting the network parameters of the speaker classification task again;
repeating the above training steps until training is finished; after the network model is trained, retaining the network structure of the speaker classification task for speaker authentication, selecting the output of an intermediate layer of that network structure as the speaker's voiceprint feature, and computing the cosine similarity between the enrolled speaker's voiceprint template and the tester's voiceprint feature for speaker authentication.
As a preferred technical solution, the network structure of the network model is specifically as follows:
the first three layers are shared TDNN layers that process frame-level features, the first and second layers processing a context of 2 frames before and after the current frame and the third layer a context of 3 frames before and after the current frame;
the network then splits into an emotion classification branch and a speaker classification branch, each using two Bi-LSTM layers whose outputs at every time step are retained, and a statistics pooling layer aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation;
the statistics pooling layer of each branch is followed by three fully connected layers, and the output dimension of the last fully connected layer matches the number of classes of the corresponding task.
As a preferred technical solution, balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
after a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
the joint loss for multi-task learning is:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss.
As a preferred technical solution, balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the parameter μ = log δ² is learned automatically by the network to construct the joint loss function, which is as follows:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task.
As a preferred technical solution, obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy loss of the emotion classification branch:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_e is the emotion label of x, f^W(x) is the output predicted by the network model for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the original input sample, and superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) denotes the loss value of x at the emotion classification part when x is the network input, and ε denotes the perturbation coefficient.
As a preferred technical solution, the total loss of the network model is the sum of the joint loss of emotion classification and speaker classification with the feature sequence x of the original speech sample as network input and the speaker classification loss with the updated feature sequence as network input, specifically expressed as:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, TLoss(x) denotes the total loss, MLoss(x) denotes the joint loss of multi-task learning, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention adopts multi-task learning, which improves the robustness of the model by attending to information from different tasks, alleviates the overfitting caused by the small amount of data in emotional speech datasets, and uses emotion information to model the speaker's acoustic space more comprehensively, improving system performance; meanwhile, assigning different weights to the loss values of the two tasks prevents either task from dominating the training process and can accelerate convergence during training.
(2) The invention adopts cross-gradient training, which avoids the drawback of needing different models for speaker authentication on different categories of emotional speech; while keeping the speaker label unchanged, it enriches the emotion-domain information and improves the generalization ability of the speaker authentication network across different emotion domains.
Drawings
FIG. 1 is a schematic flow chart of an emotion speaker authentication method based on cross-gradient training;
FIG. 2 is a block diagram of a network architecture under multitasking of the present invention;
FIG. 3 is a schematic flow chart of the cross-gradient training of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
As shown in fig. 1, the embodiment provides a method for authenticating emotion speakers based on cross-gradient training, which includes the following steps:
s1, building a network model for extracting voice print characteristics of a speaker by utilizing a TDNN-BiLSTM network based on an x-vector system architecture in combination with multi-task learning;
s2, extracting acoustic features from voice samples required by network training;
s3, randomly selecting a group of feature sequences of training voice samples as network input, simultaneously carrying out two tasks of emotion classification and speaker classification, balancing loss values of the two tasks so as to construct joint loss of multi-task learning, and adjusting network model parameters;
s4, cross-gradient training, namely, obtaining a new feature sequence by superposing disturbance related to emotion field on the feature sequence of the original voice sample, inputting the new feature sequence obtained in the step S4 into a neural network to adjust the network parameters of the speaker classification part again, wherein the emotion classification part does not participate in the training process of the new feature sequence;
s5, randomly selecting a group of feature sequences of training voice samples to train, and repeating the step of cross-gradient training until the training is finished;
and S6, during authentication, removing the network structure of the emotion classification branch part, only reserving the complete structure of the speaker classification part, selecting the output of the network middle layer of the speaker classification part as the voiceprint characteristic of the speaker, and calculating the cosine similarity of the voiceprint template of the registrant and the voiceprint characteristic of the tester to perform speaker authentication.
As shown in fig. 2, the first three layers of the deep neural network in this embodiment are shared TDNN layers that process frame-level features: the first and second layers consider a context of 2 frames before and after the current frame, and the third layer a context of 3 frames before and after the current frame. The network then splits into two branches, one for speaker classification and one for emotion classification. The two branches have similar structures, each using two Bi-LSTM layers whose outputs at every time step are retained; the first hidden layer has 256 nodes and the second has 512 nodes. A statistics pooling layer then aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation. Finally, three fully connected layers are used: the first has 512 output nodes and the second has 256 output nodes. The only difference between the two branch structures is the output dimension of the last fully connected layer, which matches the number of classes of the corresponding task: the last layer of the speaker classification part has 65 output nodes, consistent with the number of speakers in the training data, and the last layer of the emotion classification part has 6 output nodes, consistent with the number of emotion categories in the training data.
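For illustration, a minimal PyTorch sketch of such a shared-TDNN, dual Bi-LSTM multi-task network is given below; the TDNN channel width, the class names and the use of Conv1d kernels of size 5 and 7 to realize the 2-frame and 3-frame contexts are assumptions made for this sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One task branch: two Bi-LSTM layers -> statistics pooling -> three fully connected layers."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, 256, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * 256, 512, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * 2 * 512, 512)   # mean + std of the 1024-dim Bi-LSTM output
        self.fc2 = nn.Linear(512, 256)           # this 256-dim output serves as the voiceprint embedding
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, x):                        # x: (batch, frames, in_dim)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=1)   # statistics pooling
        emb = self.fc2(torch.relu(self.fc1(stats)))
        return self.fc3(torch.relu(emb)), emb


class MultiTaskXVector(nn.Module):
    """Shared TDNN front end followed by speaker and emotion classification branches."""
    def __init__(self, feat_dim=40, n_speakers=65, n_emotions=6, tdnn_dim=512):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time:
        # kernel 5 gives +/-2 frames of context, kernel 7 gives +/-3 frames.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.spk_branch = Branch(tdnn_dim, n_speakers)
        self.emo_branch = Branch(tdnn_dim, n_emotions)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)
        spk_logits, spk_emb = self.spk_branch(h)
        emo_logits, _ = self.emo_branch(h)
        return spk_logits, emo_logits, spk_emb
```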
In this embodiment, the training data set is the emotional speech dataset CREMA-D, which contains 6 emotions: neutral, happy, sad, disgust, anger and fear. The dataset comprises 91 speakers and 7442 utterances in total, of which 5311 utterances from 65 speakers are selected as training data. For each speech sample, 40-dimensional fbank features are extracted as the network input; during training, the batch size is set to 128 and the number of epochs to 100.
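A minimal sketch of the feature extraction step, assuming Kaldi-compatible fbank features computed with torchaudio (the function name and file handling are illustrative):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path):
    # Load one utterance and compute 40-dimensional log-Mel filterbank features,
    # matching the 40-dimensional fbank input described above.
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = kaldi.fbank(waveform, num_mel_bins=40, sample_frequency=sample_rate)
    return feats            # shape: (num_frames, 40)
```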
In the training process, the original feature sequences of each batch of speech samples are first taken as input, and the cross-entropy losses of the speaker classification branch and the emotion classification branch are computed respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network; W denotes the parameters of the whole network model; the speaker label of x is y_s ∈ Y_S and its emotion label is y_e ∈ Y_E, where Y_S is the set of speaker classes in the training set, N is the number of speakers in the training set (65 in this embodiment), Y_E is the set of emotion classes in the training set, and M is the number of emotion classes in the training set (6 in this embodiment); f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x. In this embodiment, the emotion classification result influences the parameter adjustment of the shared network layers, and the emotion classification loss is used to compute the gradient with respect to the input as the emotion-domain perturbation.
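The loss formulas themselves appear only as images in the source text. A standard cross-entropy form consistent with the definitions above (an assumed reconstruction, not a verbatim copy of the patent's equations) would be:

```latex
\mathrm{CELoss}_s(x) = -\log\frac{\exp\!\left(f^{W}_{s}(x)[y_s]\right)}{\sum_{n=1}^{N}\exp\!\left(f^{W}_{s}(x)[n]\right)},
\qquad
\mathrm{CELoss}_e(x) = -\log\frac{\exp\!\left(f^{W}_{e}(x)[y_e]\right)}{\sum_{m=1}^{M}\exp\!\left(f^{W}_{e}(x)[m]\right)}
```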
After a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
where, letting the speaker label of x be specifically y_sn, we have:
The loss values of the two parts are thus balanced to construct the joint loss function:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss.
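A plausible concrete form of this δ-weighted joint loss, assuming the usual homoscedastic-uncertainty weighting of classification losses (the patent's exact constants are not legible in this text):

```latex
\mathrm{MLoss}(x) \approx \frac{1}{\delta_s^{2}}\,\mathrm{CELoss}_s(x)
                 + \frac{1}{\delta_e^{2}}\,\mathrm{CELoss}_e(x)
                 + \log\delta_s + \log\delta_e
```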
In this embodiment, instead of using the introduced δ parameters directly, the parameter μ = log δ² can be learned automatically by the network, avoiding complex manual tuning; the joint loss function then becomes:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task.
For the automatically learned parameters, a separate network layer is created, appropriate initial values are set, and the parameters are marked as learnable so that they are continuously adjusted during training like the other network parameters. The initial value of μ is set to 0, which makes the weighted loss equal to the original loss in the initial state, so that both losses play an equal role in network training at the start. The μ parameterization also avoids the situation where the network automatically adjusts a δ parameter to a value that causes a division by zero when computing the loss.
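A minimal PyTorch sketch of this automatically weighted joint loss, under the same uncertainty-weighting assumption as above (the class name and exact constants are illustrative):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Joint multi-task loss with learnable per-task parameters mu = log(delta^2).

    A sketch assuming the usual uncertainty-weighting scheme; the patent's exact
    constants are not legible in this text and may differ.
    """
    def __init__(self):
        super().__init__()
        self.mu_s = nn.Parameter(torch.zeros(1))   # initialized to 0, as described above
        self.mu_e = nn.Parameter(torch.zeros(1))
        self.ce = nn.CrossEntropyLoss()

    def forward(self, spk_logits, emo_logits, y_s, y_e):
        loss_s = self.ce(spk_logits, y_s)          # CELoss_s(x)
        loss_e = self.ce(emo_logits, y_e)          # CELoss_e(x)
        # exp(-mu) = 1/delta^2 rescales each task; 0.5*mu = log(delta) acts as a
        # regularizer that keeps the learned weights from collapsing to zero.
        return (torch.exp(-self.mu_s) * loss_s
                + torch.exp(-self.mu_e) * loss_e
                + 0.5 * (self.mu_s + self.mu_e)).squeeze()
```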
For each batch of training data, after the joint loss of speaker classification and emotion classification is computed, one backward pass of the gradients is performed and all network parameters are adjusted.
As shown in fig. 3, for an original input sample, the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the input sample. Superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) is the loss value of x at the emotion classification part when x is the network input, and ε is the perturbation coefficient. To avoid complex manual tuning of network parameters, the perturbation coefficient ε is learned automatically by the network rather than set by hand, and its initial value is set to 1.
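A minimal PyTorch sketch of this perturbation step; the model interface is the hypothetical MultiTaskXVector from the sketch above:

```python
import torch
import torch.nn.functional as F

def emotion_perturbation(model, x, y_e, eps):
    """Return x_e = x + eps * d(CELoss_e)/dx, the emotion-domain perturbed features."""
    x = x.clone().detach().requires_grad_(True)
    _, emo_logits, _ = model(x)
    loss_e = F.cross_entropy(emo_logits, y_e)        # CELoss_e(x)
    grad_x, = torch.autograd.grad(loss_e, x)         # gradient (Jacobian) of the loss w.r.t. the input
    return (x + eps * grad_x).detach()               # new feature sequence x_e
```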
As shown in fig. 3, for a feature sequence x of a speech sample, the overall loss function of the neural network consists of two parts: the joint loss of emotion classification and speaker classification with the feature sequence of the original speech sample from step S2 as network input, and the speaker classification loss with the updated feature sequence from step S4 as network input:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function. To avoid complex manual tuning of network parameters, the coefficient α is learned automatically by the network rather than set by hand, and its initial value is set to 0.5.
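Putting the pieces together, one training step might look like the following sketch; here eps and alpha are passed in as given scalars, whereas the patent learns them automatically, and all other names come from the hypothetical sketches above:

```python
import torch
import torch.nn.functional as F

def train_step(model, joint_loss, x, y_s, y_e, eps, alpha, optimizer):
    # Joint multi-task loss on the original feature sequence x.
    spk_logits, emo_logits, _ = model(x)
    mloss = joint_loss(spk_logits, emo_logits, y_s, y_e)             # MLoss(x)

    # Cross-gradient training: speaker loss on the perturbed sequence x_e;
    # the emotion branch does not participate in training on x_e.
    x_e = emotion_perturbation(model, x, y_e, eps)
    spk_logits_e, _, _ = model(x_e)
    celoss_s = F.cross_entropy(spk_logits_e, y_s)                    # CELoss_s(x_e)

    total = mloss + alpha * celoss_s                                 # TLoss(x)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```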
After 100 epochs of training, when the trained network model is used for speaker authentication, the shallow shared TDNN layers and the speaker classification branch are retained, the emotion classification part and the last fully connected layer of the speaker classification branch are removed, and the output of the second fully connected layer is used as the speaker's voiceprint feature; the voiceprint feature corresponding to an utterance is therefore a 256-dimensional feature vector.
A test set of 2131 utterances from 26 speakers is drawn from the CREMA-D dataset (these 26 speakers do not overlap with the 65 speakers in the training set), and the authentication threshold for emotional speech is obtained from this test set as follows (a code sketch follows the list):
1. Traverse all speakers in the test set; for each speaker, select one neutral utterance as enrollment data and the remaining utterances as test data, and pair enrollment and test utterances to form sample pairs; pairs from the same speaker are positive pairs, and pairs from different speakers are negative pairs.
2. Compute the voiceprint features of each sample pair and their cosine similarity.
3. Compute the equal error rate (EER) from the cosine similarities of the positive and negative pairs, and take the threshold corresponding to the EER as the authentication threshold.
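A minimal sketch of the EER-threshold computation, assuming the pairwise cosine scores and same-speaker labels have been collected into arrays (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(scores, labels):
    """Return (EER, threshold) given cosine scores and 1/0 same-speaker labels."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where false accept ~= false reject
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, thresholds[idx]
```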
During speaker authentication, the cosine similarity between the enrolled speaker's voiceprint template and the voiceprint feature of the test utterance is computed and compared with the authentication threshold; if the similarity is greater than the threshold, the test utterance is judged to belong to the enrolled speaker, otherwise it is not.
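And a sketch of the final verification decision, reusing the names from the earlier hypothetical sketches:

```python
import torch.nn.functional as F

def authenticate(enroll_emb, test_emb, threshold):
    """Accept the test utterance if cosine similarity exceeds the authentication threshold."""
    score = F.cosine_similarity(enroll_emb, test_emb, dim=-1).item()
    return score > threshold
```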
The invention adopts multi-task learning, which improves the robustness of the model by attending to information from different tasks, alleviates the overfitting caused by the small amount of data in emotional speech datasets, and uses emotion information to model the speaker's acoustic space more comprehensively, improving system performance. At the same time, assigning different weights to the loss values of the two tasks prevents either task from dominating the training process.
The cross-gradient training adopted by the invention enriches the emotion-domain information while keeping the speaker label unchanged, improves the generalization ability of the speaker authentication network across different emotion domains, and avoids the drawback of needing different models when performing speaker authentication on different categories of emotional speech.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.

Claims (3)

1. An emotion speaker authentication method based on cross-gradient training, characterized by comprising the following steps:
based on an x-vector system architecture, building a network model for extracting speaker voiceprint features using a TDNN-BiLSTM network combined with multi-task learning;
extracting acoustic features from the speech samples required for network training;
randomly selecting emotionally colored speech samples and inputting them into the network model for training, performing the emotion classification and speaker classification training tasks, balancing the losses of the two training tasks, constructing the joint loss for multi-task learning, and adjusting the parameters of the network model;
wherein balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
after a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
when the parameters δ are introduced, the joint loss for multi-task learning is:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss;
the parameter μ = log δ² is learned automatically by the network, and the final joint loss function is constructed as follows:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task;
performing cross-gradient training, namely obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample, inputting the new feature sequence into the network model, and adjusting the network parameters of the speaker classification task again;
wherein obtaining the new feature sequence by superimposing the emotion-domain perturbation on the feature sequence of the original speech sample comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy loss of the emotion classification branch:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_e is the emotion label of x, f^W(x) is the output predicted by the network model for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the original input sample, and superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) denotes the loss value of x at the emotion classification part when x is the network input, and ε denotes the perturbation coefficient;
repeating the above training steps until training is finished; after the network model is trained, retaining the network structure of the speaker classification task for speaker authentication, selecting the output of an intermediate layer of that network structure as the speaker's voiceprint feature, and computing the cosine similarity between the enrolled speaker's voiceprint template and the tester's voiceprint feature for speaker authentication.
2. The emotion speaker authentication method based on cross-gradient training of claim 1, wherein the network structure of the network model is specifically:
the first three layers are shared TDNN layers that process frame-level features, the first and second layers processing a context of 2 frames before and after the current frame and the third layer a context of 3 frames before and after the current frame;
the network then splits into an emotion classification branch and a speaker classification branch, each using two Bi-LSTM layers whose outputs at every time step are retained, and a statistics pooling layer aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation;
the statistics pooling layer of each branch is followed by three fully connected layers, and the output dimension of the last fully connected layer matches the number of classes of the corresponding task.
3. The emotion speaker authentication method based on cross-gradient training according to claim 1, wherein the total loss of the network model is the sum of the joint loss of emotion classification and speaker classification with the feature sequence x of the original speech sample as network input and the speaker classification loss with the updated feature sequence as network input, specifically expressed as:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, TLoss(x) denotes the total loss, MLoss(x) denotes the joint loss of multi-task learning, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function.
CN202111483807.1A 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training Active CN114357414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111483807.1A CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111483807.1A CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Publications (2)

Publication Number Publication Date
CN114357414A CN114357414A (en) 2022-04-15
CN114357414B true CN114357414B (en) 2024-04-02

Family

ID=81096484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111483807.1A Active CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Country Status (1)

Country Link
CN (1) CN114357414B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272687A (en) * 2012-03-07 2015-01-07 Actiwave AB Signal conversion system and method
CN111402929A (en) * 2020-03-16 2020-07-10 南京工程学院 Small sample speech emotion recognition method based on domain invariance
CN111797844A (en) * 2020-07-20 2020-10-20 苏州思必驰信息科技有限公司 Adaptive model training method for antagonistic domain and adaptive model for antagonistic domain
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A neural network dependency parsing model combining global vector features; 王衡军, 司念文, 宋玉龙, 单义栋; Journal on Communications (通信学报); 2018-02-25 (Issue 02); full text *

Also Published As

Publication number Publication date
CN114357414A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110289003A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
KR20040037180A (en) System and method of face recognition using portions of learned model
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
Hsu et al. Scalable factorized hierarchical variational autoencoder training
CN109147774A (en) A kind of improved Delayed Neural Networks acoustic model
CN111401105B (en) Video expression recognition method, device and equipment
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN114937182B (en) Image emotion distribution prediction method based on emotion wheel and convolutional neural network
Yulita et al. Fuzzy Hidden Markov Models for Indonesian Speech Classification.
CN109150538A (en) A kind of fingerprint merges identity identifying method with vocal print
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
JPH09507921A (en) Speech recognition system using neural network and method of using the same
CN114357414B (en) Emotion speaker authentication method based on cross-gradient training
Zhao et al. Transferring age and gender attributes for dimensional emotion prediction from big speech data using hierarchical deep learning
CN109119073A (en) Audio recognition method, system, speaker and storage medium based on multi-source identification
CN110598737B (en) Online learning method, device, equipment and medium of deep learning model
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
Lingenfelser et al. Age and gender classification from speech using decision level fusion and ensemble based techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant