CN114357414B - Emotion speaker authentication method based on cross-gradient training - Google Patents

Emotion speaker authentication method based on cross-gradient training

Info

Publication number
CN114357414B
CN114357414B (application CN202111483807.1A)
Authority
CN
China
Prior art keywords
emotion
speaker
classification
training
network
Prior art date
Legal status
Active
Application number
CN202111483807.1A
Other languages
Chinese (zh)
Other versions
CN114357414A (en)
Inventor
贺前华
危卓
田颖慧
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111483807.1A priority Critical patent/CN114357414B/en
Publication of CN114357414A publication Critical patent/CN114357414A/en
Application granted granted Critical
Publication of CN114357414B publication Critical patent/CN114357414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses an emotion speaker authentication method based on cross-gradient training, which comprises the following steps: building a network model based on an x-vector system combined with multi-task learning; extracting acoustic features from the training speech; randomly selecting a group of feature sequences of training speech samples as network input, performing emotion classification and speaker classification simultaneously, and adjusting the network parameters through the joint loss of the two tasks; updating the feature sequences using the loss function of the emotion classification part; performing cross-gradient training and readjusting the network parameters of the speaker classification part; and, after network training is finished, setting an authentication threshold to authenticate speakers. Aiming at the performance degradation of speaker authentication systems when the emotion of the enrollment speech does not match that of the test speech, the invention combines multi-task learning with cross-gradient training to expand the emotion information of the training data, improving speaker authentication performance on emotional speech and alleviating overfitting on training sets with a small amount of data.

Description

Emotion speaker authentication method based on cross-gradient training
Technical Field
The invention relates to the technical field of biometric recognition, and in particular to an emotion speaker authentication method based on cross-gradient training.
Background
At present, many biometric identification technologies are relatively mature. Common biometric features include speech, fingerprints, retina, iris, face, signature, etc. Among them, speech is the most natural and one of the most direct ways people communicate in daily life. Speech signals also have the advantages of convenient collection, low cost and multiple acquisition channels. In a laboratory environment the performance of speaker authentication is close to ideal, but many problems remain in practical applications; for example, emotional factors can significantly degrade the performance of a speaker authentication system. During speaker enrollment, neutral speech is generally used for the sake of user friendliness, but during authentication the speaker may not be in a neutral state on certain occasions or at certain moments, so the emotion mismatch between the enrollment speech and the speech to be authenticated leads to a rapid drop in system performance.
At present, there are two main classes of methods for emotional speaker authentication. One is based on traditional machine learning methods such as GMM, GMM-UBM, HMM and i-vector, for example the paper "Atom Aligned Sparse Representation Approach for Indonesian Emotional Speaker Recognition System" by Kusuma et al. (2020 7th International Conference on Advanced Informatics: Concepts, Theory and Applications, 2020). These methods generally build a model for each specific emotion, so in practical applications not only must multiple sets of model parameters be stored, but the speech must also be accurately classified by emotion; once the emotion recognition accuracy is low, the speaker authentication performance is also affected. The other class is based on deep neural networks such as AANN, CNN and RNN, for example Meftah et al., who used an x-vector system for speaker authentication in the paper "X-vectors Meet Emotions: A Study on Dependencies Between Emotion and Speaker Recognition" (ICASSP 2020). These methods generally apply a common network model from the speaker authentication task directly, and when the test speech is emotional, the generalization ability of the speaker authentication system is insufficient.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides an emotion speaker authentication method based on cross-gradient training. Because speaker information and emotion information are partially correlated and cannot be separated directly, the method uses multi-task learning to model the speaker's voiceprint information comprehensively together with emotion information, and then takes task-dependent uncertainty into account to balance the loss values of the emotion classification and speaker classification tasks; at the same time, based on the continuity of the spatial distribution of speech features, an emotion-domain perturbation is added to the speech samples to enrich the emotion information.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The invention provides an emotion speaker authentication method based on cross-gradient training, which comprises the following steps:
based on an x-vector system architecture, building a network model for extracting speaker voiceprint features using a TDNN-BiLSTM network combined with multi-task learning;
extracting acoustic features from the speech samples required for network training;
randomly selecting emotionally colored speech samples and inputting them into the network model for training, performing the emotion classification and speaker classification training tasks, balancing the losses of the two training tasks, constructing the joint loss for multi-task learning, and adjusting the parameters of the network model;
performing cross-gradient training, namely obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample, inputting the new feature sequence into the network model, and adjusting the network parameters of the speaker classification task again;
repeating the above training steps until training is finished; after the network model is trained, retaining the network structure of the speaker classification task for speaker authentication, selecting the output of an intermediate layer of that network structure as the speaker's voiceprint feature, and computing the cosine similarity between the enrolled speaker's voiceprint template and the tester's voiceprint feature for speaker authentication.
As a preferred technical solution, the network structure of the network model is specifically as follows:
the first three layers are shared TDNN layers that process frame-level features, the first and second layers processing a context of 2 frames before and after the current frame and the third layer a context of 3 frames before and after the current frame;
the network then splits into an emotion classification branch and a speaker classification branch, each using two Bi-LSTM layers whose outputs at every time step are retained, and a statistics pooling layer aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation;
the statistics pooling layer of each branch is followed by three fully connected layers, and the output dimension of the last fully connected layer matches the number of classes of the corresponding task.
As a preferred technical solution, balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
after a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
the joint loss for multi-task learning is:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss.
As a preferred technical solution, balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the parameter μ = log δ² is learned automatically by the network to construct the joint loss function, which is as follows:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task.
As a preferred technical solution, obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy loss of the emotion classification branch:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_e is the emotion label of x, f^W(x) is the output predicted by the network model for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the original input sample, and superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) denotes the loss value of x at the emotion classification part when x is the network input, and ε denotes the perturbation coefficient.
As a preferred technical solution, the total loss of the network model is the sum of the joint loss of emotion classification and speaker classification with the feature sequence x of the original speech sample as network input and the speaker classification loss with the updated feature sequence as network input, specifically expressed as:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, TLoss(x) denotes the total loss, MLoss(x) denotes the joint loss of multi-task learning, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention adopts multi-task learning, which improves the robustness of the model by attending to information from different tasks, alleviates the overfitting caused by the small amount of data in emotional speech datasets, and uses emotion information to model the speaker's acoustic space more comprehensively, improving system performance; meanwhile, assigning different weights to the loss values of the two tasks prevents either task from dominating the training process and can accelerate convergence during training.
(2) The invention adopts cross-gradient training, which avoids the drawback of needing different models for speaker authentication on different categories of emotional speech; while keeping the speaker label unchanged, it enriches the emotion-domain information and improves the generalization ability of the speaker authentication network across different emotion domains.
Drawings
FIG. 1 is a schematic flow chart of an emotion speaker authentication method based on cross-gradient training;
FIG. 2 is a block diagram of a network architecture under multitasking of the present invention;
FIG. 3 is a schematic flow chart of the cross-gradient training of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
As shown in fig. 1, the embodiment provides a method for authenticating emotion speakers based on cross-gradient training, which includes the following steps:
s1, building a network model for extracting voice print characteristics of a speaker by utilizing a TDNN-BiLSTM network based on an x-vector system architecture in combination with multi-task learning;
s2, extracting acoustic features from voice samples required by network training;
s3, randomly selecting a group of feature sequences of training voice samples as network input, simultaneously carrying out two tasks of emotion classification and speaker classification, balancing loss values of the two tasks so as to construct joint loss of multi-task learning, and adjusting network model parameters;
s4, cross-gradient training, namely, obtaining a new feature sequence by superposing disturbance related to emotion field on the feature sequence of the original voice sample, inputting the new feature sequence obtained in the step S4 into a neural network to adjust the network parameters of the speaker classification part again, wherein the emotion classification part does not participate in the training process of the new feature sequence;
s5, randomly selecting a group of feature sequences of training voice samples to train, and repeating the step of cross-gradient training until the training is finished;
and S6, during authentication, removing the network structure of the emotion classification branch part, only reserving the complete structure of the speaker classification part, selecting the output of the network middle layer of the speaker classification part as the voiceprint characteristic of the speaker, and calculating the cosine similarity of the voiceprint template of the registrant and the voiceprint characteristic of the tester to perform speaker authentication.
As shown in fig. 2, the first three layers of the deep neural network in this embodiment are shared TDNN layers that process frame-level features: the first and second layers consider a context of 2 frames before and after the current frame, and the third layer a context of 3 frames before and after the current frame. The network then splits into two branches, one for speaker classification and one for emotion classification. The two branches have similar structures, each using two Bi-LSTM layers whose outputs at every time step are retained; the first hidden layer has 256 nodes and the second has 512 nodes. A statistics pooling layer then aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation. Finally, three fully connected layers are used: the first has 512 output nodes and the second has 256 output nodes. The only difference between the two branch structures is the output dimension of the last fully connected layer, which matches the number of classes of the corresponding task: the last layer of the speaker classification part has 65 output nodes, consistent with the number of speakers in the training data, and the last layer of the emotion classification part has 6 output nodes, consistent with the number of emotion categories in the training data.
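For illustration, a minimal PyTorch sketch of such a shared-TDNN, dual Bi-LSTM multi-task network is given below; the TDNN channel width, the class names and the use of Conv1d kernels of size 5 and 7 to realize the 2-frame and 3-frame contexts are assumptions made for this sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One task branch: two Bi-LSTM layers -> statistics pooling -> three fully connected layers."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, 256, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * 256, 512, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * 2 * 512, 512)   # mean + std of the 1024-dim Bi-LSTM output
        self.fc2 = nn.Linear(512, 256)           # this 256-dim output serves as the voiceprint embedding
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, x):                        # x: (batch, frames, in_dim)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=1)   # statistics pooling
        emb = self.fc2(torch.relu(self.fc1(stats)))
        return self.fc3(torch.relu(emb)), emb


class MultiTaskXVector(nn.Module):
    """Shared TDNN front end followed by speaker and emotion classification branches."""
    def __init__(self, feat_dim=40, n_speakers=65, n_emotions=6, tdnn_dim=512):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time:
        # kernel 5 gives +/-2 frames of context, kernel 7 gives +/-3 frames.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.spk_branch = Branch(tdnn_dim, n_speakers)
        self.emo_branch = Branch(tdnn_dim, n_emotions)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)
        spk_logits, spk_emb = self.spk_branch(h)
        emo_logits, _ = self.emo_branch(h)
        return spk_logits, emo_logits, spk_emb
```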
In this embodiment, the training data set is the emotional speech dataset CREMA-D, which contains 6 emotions: neutral, happy, sad, disgust, anger and fear. The dataset comprises 91 speakers and 7442 utterances in total, of which 5311 utterances from 65 speakers are selected as training data. For each speech sample, 40-dimensional fbank features are extracted as the network input; during training, the batch size is set to 128 and the number of epochs to 100.
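A minimal sketch of the feature extraction step, assuming Kaldi-compatible fbank features computed with torchaudio (the function name and file handling are illustrative):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path):
    # Load one utterance and compute 40-dimensional log-Mel filterbank features,
    # matching the 40-dimensional fbank input described above.
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = kaldi.fbank(waveform, num_mel_bins=40, sample_frequency=sample_rate)
    return feats            # shape: (num_frames, 40)
```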
In the training process, the original feature sequences of each batch of speech samples are first taken as input, and the cross-entropy losses of the speaker classification branch and the emotion classification branch are computed respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network; W denotes the parameters of the whole network model; the speaker label of x is y_s ∈ Y_S and its emotion label is y_e ∈ Y_E, where Y_S is the set of speaker classes in the training set, N is the number of speakers in the training set (65 in this embodiment), Y_E is the set of emotion classes in the training set, and M is the number of emotion classes in the training set (6 in this embodiment); f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x. In this embodiment, the emotion classification result influences the parameter adjustment of the shared network layers, and the emotion classification loss is used to compute the gradient with respect to the input as the emotion-domain perturbation.
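The loss formulas themselves appear only as images in the source text. A standard cross-entropy form consistent with the definitions above (an assumed reconstruction, not a verbatim copy of the patent's equations) would be:

```latex
\mathrm{CELoss}_s(x) = -\log\frac{\exp\!\left(f^{W}_{s}(x)[y_s]\right)}{\sum_{n=1}^{N}\exp\!\left(f^{W}_{s}(x)[n]\right)},
\qquad
\mathrm{CELoss}_e(x) = -\log\frac{\exp\!\left(f^{W}_{e}(x)[y_e]\right)}{\sum_{m=1}^{M}\exp\!\left(f^{W}_{e}(x)[m]\right)}
```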
After a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
where, letting the speaker label of x be specifically y_sn, we have:
The loss values of the two parts are thus balanced to construct the joint loss function:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss.
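A plausible concrete form of this δ-weighted joint loss, assuming the usual homoscedastic-uncertainty weighting of classification losses (the patent's exact constants are not legible in this text):

```latex
\mathrm{MLoss}(x) \approx \frac{1}{\delta_s^{2}}\,\mathrm{CELoss}_s(x)
                 + \frac{1}{\delta_e^{2}}\,\mathrm{CELoss}_e(x)
                 + \log\delta_s + \log\delta_e
```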
In this embodiment, instead of using the introduced δ parameters directly, the parameter μ = log δ² can be learned automatically by the network, avoiding complex manual tuning; the joint loss function then becomes:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task.
For the automatically learned parameters, a separate network layer is created, appropriate initial values are set, and the parameters are marked as learnable so that they are continuously adjusted during training like the other network parameters. The initial value of μ is set to 0, which makes the weighted loss equal to the original loss in the initial state, so that both losses play an equal role in network training at the start. The μ parameterization also avoids the situation where the network automatically adjusts a δ parameter to a value that causes a division by zero when computing the loss.
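A minimal PyTorch sketch of this automatically weighted joint loss, under the same uncertainty-weighting assumption as above (the class name and exact constants are illustrative):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Joint multi-task loss with learnable per-task parameters mu = log(delta^2).

    A sketch assuming the usual uncertainty-weighting scheme; the patent's exact
    constants are not legible in this text and may differ.
    """
    def __init__(self):
        super().__init__()
        self.mu_s = nn.Parameter(torch.zeros(1))   # initialized to 0, as described above
        self.mu_e = nn.Parameter(torch.zeros(1))
        self.ce = nn.CrossEntropyLoss()

    def forward(self, spk_logits, emo_logits, y_s, y_e):
        loss_s = self.ce(spk_logits, y_s)          # CELoss_s(x)
        loss_e = self.ce(emo_logits, y_e)          # CELoss_e(x)
        # exp(-mu) = 1/delta^2 rescales each task; 0.5*mu = log(delta) acts as a
        # regularizer that keeps the learned weights from collapsing to zero.
        return (torch.exp(-self.mu_s) * loss_s
                + torch.exp(-self.mu_e) * loss_e
                + 0.5 * (self.mu_s + self.mu_e)).squeeze()
```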
For each batch of training data, after the joint loss of speaker classification and emotion classification is computed, one backward pass of the gradients is performed and all network parameters are adjusted.
As shown in fig. 3, for an original input sample, the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the input sample. Superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) is the loss value of x at the emotion classification part when x is the network input, and ε is the perturbation coefficient. To avoid complex manual tuning of network parameters, the perturbation coefficient ε is learned automatically by the network rather than set by hand, and its initial value is set to 1.
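A minimal PyTorch sketch of this perturbation step; the model interface is the hypothetical MultiTaskXVector from the sketch above:

```python
import torch
import torch.nn.functional as F

def emotion_perturbation(model, x, y_e, eps):
    """Return x_e = x + eps * d(CELoss_e)/dx, the emotion-domain perturbed features."""
    x = x.clone().detach().requires_grad_(True)
    _, emo_logits, _ = model(x)
    loss_e = F.cross_entropy(emo_logits, y_e)        # CELoss_e(x)
    grad_x, = torch.autograd.grad(loss_e, x)         # gradient (Jacobian) of the loss w.r.t. the input
    return (x + eps * grad_x).detach()               # new feature sequence x_e
```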
As shown in fig. 3, for a feature sequence x of a speech sample, the overall loss function of the neural network consists of two parts: the joint loss of emotion classification and speaker classification with the feature sequence of the original speech sample from step S2 as network input, and the speaker classification loss with the updated feature sequence from step S4 as network input:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function. To avoid complex manual tuning of network parameters, the coefficient α is learned automatically by the network rather than set by hand, and its initial value is set to 0.5.
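Putting the pieces together, one training step might look like the following sketch; here eps and alpha are passed in as given scalars, whereas the patent learns them automatically, and all other names come from the hypothetical sketches above:

```python
import torch
import torch.nn.functional as F

def train_step(model, joint_loss, x, y_s, y_e, eps, alpha, optimizer):
    # Joint multi-task loss on the original feature sequence x.
    spk_logits, emo_logits, _ = model(x)
    mloss = joint_loss(spk_logits, emo_logits, y_s, y_e)             # MLoss(x)

    # Cross-gradient training: speaker loss on the perturbed sequence x_e;
    # the emotion branch does not participate in training on x_e.
    x_e = emotion_perturbation(model, x, y_e, eps)
    spk_logits_e, _, _ = model(x_e)
    celoss_s = F.cross_entropy(spk_logits_e, y_s)                    # CELoss_s(x_e)

    total = mloss + alpha * celoss_s                                 # TLoss(x)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```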
After 100 epochs of training, when the trained network model is used for speaker authentication, the shallow shared TDNN layers and the speaker classification branch are retained, the emotion classification part and the last fully connected layer of the speaker classification branch are removed, and the output of the second fully connected layer is used as the speaker's voiceprint feature; the voiceprint feature corresponding to an utterance is therefore a 256-dimensional feature vector.
A test set of 2131 utterances from 26 speakers is drawn from the CREMA-D dataset (these 26 speakers do not overlap with the 65 speakers in the training set), and the authentication threshold for emotional speech is obtained from this test set as follows (a code sketch follows the list):
1. Traverse all speakers in the test set; for each speaker, select one neutral utterance as enrollment data and the remaining utterances as test data, and pair enrollment and test utterances to form sample pairs; pairs from the same speaker are positive pairs, and pairs from different speakers are negative pairs.
2. Compute the voiceprint features of each sample pair and their cosine similarity.
3. Compute the equal error rate (EER) from the cosine similarities of the positive and negative pairs, and take the threshold corresponding to the EER as the authentication threshold.
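A minimal sketch of the EER-threshold computation, assuming the pairwise cosine scores and same-speaker labels have been collected into arrays (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(scores, labels):
    """Return (EER, threshold) given cosine scores and 1/0 same-speaker labels."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where false accept ~= false reject
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, thresholds[idx]
```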
During speaker authentication, the cosine similarity between the enrolled speaker's voiceprint template and the voiceprint feature of the test utterance is computed and compared with the authentication threshold; if the similarity is greater than the threshold, the test utterance is judged to belong to the enrolled speaker, otherwise it is not.
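And a sketch of the final verification decision, reusing the names from the earlier hypothetical sketches:

```python
import torch.nn.functional as F

def authenticate(enroll_emb, test_emb, threshold):
    """Accept the test utterance if cosine similarity exceeds the authentication threshold."""
    score = F.cosine_similarity(enroll_emb, test_emb, dim=-1).item()
    return score > threshold
```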
The invention adopts multi-task learning, which improves the robustness of the model by attending to information from different tasks, alleviates the overfitting caused by the small amount of data in emotional speech datasets, and uses emotion information to model the speaker's acoustic space more comprehensively, improving system performance. At the same time, assigning different weights to the loss values of the two tasks prevents either task from dominating the training process.
The cross-gradient training adopted by the invention enriches the emotion-domain information while keeping the speaker label unchanged, improves the generalization ability of the speaker authentication network across different emotion domains, and avoids the drawback of needing different models when performing speaker authentication on different categories of emotional speech.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.

Claims (3)

1. An emotion speaker authentication method based on cross-gradient training, characterized by comprising the following steps:
based on an x-vector system architecture, building a network model for extracting speaker voiceprint features using a TDNN-BiLSTM network combined with multi-task learning;
extracting acoustic features from the speech samples required for network training;
randomly selecting emotionally colored speech samples and inputting them into the network model for training, performing the emotion classification and speaker classification training tasks, balancing the losses of the two training tasks, constructing the joint loss for multi-task learning, and adjusting the parameters of the network model;
wherein balancing the losses of the two training tasks and constructing the joint loss for multi-task learning comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy losses of the speaker classification branch and the emotion classification branch respectively:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_s is the speaker label of x and y_e is its emotion label, f^W(x) is the output predicted by the network model for input x, f_s^W(x) is the speaker classification result predicted for input x, and f_e^W(x) is the emotion classification result predicted for input x;
after a task-specific parameter δ is introduced for each task to rescale the outputs of the two branches of the neural network, we have:
when the parameters δ are introduced, the joint loss for multi-task learning is:
where δ_s is the weight parameter of the speaker classification loss and δ_e is the weight parameter of the emotion classification loss;
the parameter μ = log δ² is learned automatically by the network, and the final joint loss function is constructed as follows:
where μ_s is the parameter associated with the speaker classification task and μ_e is the parameter associated with the emotion classification task;
performing cross-gradient training, namely obtaining a new feature sequence by superimposing an emotion-domain perturbation on the feature sequence of the original speech sample, inputting the new feature sequence into the network model, and adjusting the network parameters of the speaker classification task again;
wherein obtaining the new feature sequence by superimposing the emotion-domain perturbation on the feature sequence of the original speech sample comprises the following specific steps:
taking the original feature sequence of an emotionally colored speech sample as input, computing the cross-entropy loss of the emotion classification branch:
where x is the feature sequence obtained by extracting fbank features from the speech sample, i.e., the input of the neural network, W denotes the parameters of the whole network model, y_e is the emotion label of x, f^W(x) is the output predicted by the network model for input x, and f_e^W(x) is the emotion classification result predicted for input x;
the emotion-domain perturbation is obtained by computing the Jacobian of the emotion classification cross-entropy loss with respect to the feature sequence of the original input sample, and superimposing this perturbation on the feature sequence x of the original input sample yields the new feature sequence x_e:
where CELoss_e(x) denotes the loss value of x at the emotion classification part when x is the network input, and ε denotes the perturbation coefficient;
repeating the above training steps until training is finished; after the network model is trained, retaining the network structure of the speaker classification task for speaker authentication, selecting the output of an intermediate layer of that network structure as the speaker's voiceprint feature, and computing the cosine similarity between the enrolled speaker's voiceprint template and the tester's voiceprint feature for speaker authentication.
2. The emotion speaker authentication method based on cross-gradient training of claim 1, wherein the network structure of the network model is specifically:
the first three layers are shared TDNN layers that process frame-level features, the first and second layers processing a context of 2 frames before and after the current frame and the third layer a context of 3 frames before and after the current frame;
the network then splits into an emotion classification branch and a speaker classification branch, each using two Bi-LSTM layers whose outputs at every time step are retained, and a statistics pooling layer aggregates the frame-level representations output by the Bi-LSTM layers into a segment-level representation;
the statistics pooling layer of each branch is followed by three fully connected layers, and the output dimension of the last fully connected layer matches the number of classes of the corresponding task.
3. The emotion speaker authentication method based on cross-gradient training according to claim 1, wherein the total loss of the network model is the sum of the joint loss of emotion classification and speaker classification with the feature sequence x of the original speech sample as network input and the speaker classification loss with the updated feature sequence as network input, specifically expressed as:
TLoss(x) = MLoss(x) + α·CELoss_s(x_e)
where α denotes the proportion of the gradient back-propagated from the speaker classification loss when the new feature sequence x_e is the network input, TLoss(x) denotes the total loss, MLoss(x) denotes the joint loss of multi-task learning, and CELoss_s(x_e) is the loss value of x_e at the speaker classification part, computed with the cross-entropy loss function.
CN202111483807.1A 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training Active CN114357414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111483807.1A CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111483807.1A CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Publications (2)

Publication Number Publication Date
CN114357414A CN114357414A (en) 2022-04-15
CN114357414B true CN114357414B (en) 2024-04-02

Family

ID=81096484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111483807.1A Active CN114357414B (en) 2021-12-07 2021-12-07 Emotion speaker authentication method based on cross-gradient training

Country Status (1)

Country Link
CN (1) CN114357414B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272687A (en) * 2012-03-07 2015-01-07 Actiwave AB Signal conversion system and method
CN111402929A (en) * 2020-03-16 2020-07-10 南京工程学院 Small sample speech emotion recognition method based on domain invariance
CN111797844A (en) * 2020-07-20 2020-10-20 苏州思必驰信息科技有限公司 Adaptive model training method for antagonistic domain and adaptive model for antagonistic domain
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A neural network dependency parsing model combining global vector features; 王衡军, 司念文, 宋玉龙, 单义栋; Journal on Communications (通信学报); 2018-02-25 (Issue 02); full text *

Also Published As

Publication number Publication date
CN114357414A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
Liu et al. GMM and CNN hybrid method for short utterance speaker recognition
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110289003A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
KR20040037180A (en) System and method of face recognition using portions of learned model
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
Hsu et al. Scalable factorized hierarchical variational autoencoder training
CN109147774A (en) A kind of improved Delayed Neural Networks acoustic model
CN111401105B (en) Video expression recognition method, device and equipment
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN108877812B (en) Voiceprint recognition method and device and storage medium
CN114937182B (en) Image emotion distribution prediction method based on emotion wheel and convolutional neural network
Yulita et al. Fuzzy Hidden Markov Models for Indonesian Speech Classification.
CN109150538A (en) A kind of fingerprint merges identity identifying method with vocal print
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
JPH09507921A (en) Speech recognition system using neural network and method of using the same
CN114357414B (en) Emotion speaker authentication method based on cross-gradient training
Zhao et al. Transferring age and gender attributes for dimensional emotion prediction from big speech data using hierarchical deep learning
CN109119073A (en) Audio recognition method, system, speaker and storage medium based on multi-source identification
CN110598737B (en) Online learning method, device, equipment and medium of deep learning model
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
Lingenfelser et al. Age and gender classification from speech using decision level fusion and ensemble based techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant