WO2022178942A1 - Emotion recognition method and apparatus, computer device, and storage medium - Google Patents

Emotion recognition method and apparatus, computer device, and storage medium Download PDF

Info

Publication number
WO2022178942A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
feature
feature vector
emotion
classification model
Prior art date
Application number
PCT/CN2021/084252
Other languages
French (fr)
Chinese (zh)
Inventor
顾艳梅
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022178942A1 publication Critical patent/WO2022178942A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an emotion recognition method, device, computer equipment and storage medium.
  • the inventors realized that the existing emotion recognition models generally predict emotion categories by analyzing and recognizing speech signals.
  • the emotional state expressed by humans is often affected by various factors such as culture, country, and population.
  • the existing emotion recognition models cannot effectively avoid the influence of these factors, so the accuracy of emotion recognition is low.
  • the present application provides an emotion recognition method, device, computer equipment and storage medium: by back-propagating the predicted feature vector output by the speaker classification model to the feature generator to generate an emotion feature vector from which speaker features are eliminated, and training the emotion classification model according to that emotion feature vector, the influence of different speakers on the emotion classification model can be eliminated and the accuracy of emotion recognition improved.
  • the present application provides an emotion recognition method, the method comprising:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the present application also provides an emotion recognition device, the device comprising:
  • a training data acquisition module used for acquiring training data
  • the training data includes emotional feature information and marked emotional category labels, and speaker feature information and marked speaker category labels;
  • a model calling module used for calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
  • a first feature generation module configured to input the emotional feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotional feature vector group and speaker feature vector group;
  • the first training module is used to input the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training until convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model;
  • the second feature generation module is used for back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotional feature vector group that eliminates speaker features;
  • the second training module is used for inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category label into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model;
  • the emotion recognition module is used for acquiring the speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the present application also provides a computer device, the computer device comprising a memory and a processor;
  • the memory is used for storing a computer program;
  • the processor is configured to execute the computer program and, when executing the computer program, to implement:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the embodiments of the present application have the following beneficial effects: by acquiring the training data, emotional feature information with labeled emotion category labels and speaker feature information with labeled speaker category labels can be obtained; by calling the emotion recognition model to be trained, the emotion classification model and the speaker classification model in the emotion recognition model can be trained separately to obtain the trained emotion recognition model; by inputting the emotion feature information and speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained; by inputting the speaker feature vector group and the labeled speaker category labels into the speaker classification model and iteratively training it to convergence, the predicted feature vector can be obtained from the trained speaker classification model; by back-propagating the predicted feature vector to the feature generator for feature generation, the speaker feature vectors can be unified and the emotion feature vector group from which speaker features are eliminated can then be obtained; and by inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained, thereby improving the accuracy of emotion recognition.
  • FIG. 1 is a schematic flowchart of a method for emotion recognition provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a sub-step of acquiring training data provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a feature generated by a feature generator provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a sub-step of training a speaker classification model provided by an embodiment of the present application
  • FIG. 6 is a schematic interaction diagram for obtaining an emotion feature vector group that eliminates speaker features provided by an embodiment of the present application
  • FIG. 7 is a schematic flowchart of a sub-step of acquiring an emotion feature vector group that eliminates speaker features according to an embodiment of the present application
  • FIG. 8 is a schematic interaction diagram of invoking an emotion recognition model for emotion recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of an emotion recognition device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide an emotion recognition method, apparatus, computer device, and storage medium.
  • the emotion recognition method can be applied to a server or a terminal: by back-propagating the predicted feature vector output by the speaker classification model to the feature generator, an emotion feature vector from which speaker features are eliminated is generated, and training the emotion classification model on that emotion feature vector eliminates the influence of different speakers on the emotion recognition model and improves the accuracy of emotion recognition.
  • the server may be an independent server or a server cluster.
  • Terminals can be electronic devices such as smart phones, tablet computers, notebook computers, and desktop computers.
  • the emotion recognition method includes steps S101 to S107.
  • Step S101 Acquire training data, where the training data includes emotional feature information and labeled emotional category labels, and speaker feature information and labeled speaker category labels.
  • by acquiring the training data, emotional feature information with labeled emotion category labels and speaker feature information with labeled speaker category labels can be obtained; the speaker feature vector group and the labeled speaker category labels are then used to train the speaker classification model and obtain the predicted feature vector corresponding to the trained speaker classification model, so that an emotion feature vector from which speaker features are eliminated can be generated according to the predicted feature vector and used to train the emotion classification model.
  • FIG. 2 is a schematic flowchart of sub-steps of acquiring training data in step S101 , which may specifically include the following steps S1011 to S1014 .
  • Step S1011 Obtain sample voice signals corresponding to a preset number of sample users, and extract useful voice signals in the sample voice signals, wherein the sample voice signals are stored in the blockchain.
  • sample voice signals corresponding to a preset number of sample users may be obtained from the blockchain.
  • the sample users include different speakers.
  • the voices of test subjects from different regions, cultures or age groups and in different emotional states can be collected.
  • the obtained sample speech signal includes speech signals of different emotional categories corresponding to multiple speakers.
  • the emotion categories may include positive emotions and negative emotions.
  • positive emotions may include, but are not limited to, calm, optimistic, happy, etc.
  • negative emotions may include, but are not limited to, complaints, blame, abuse, and the like.
  • the above-mentioned sample voice signal may also be stored in a node of a blockchain.
  • the sample speech signal may also be stored in a local database or an external storage device, which is not specifically limited.
  • since the sample speech signal may include useless signals, in order to improve the recognition accuracy of subsequent speaker categories and emotion categories, it is necessary to extract the useful speech signals in the sample speech signals.
  • the useless signals may include but are not limited to footsteps, silence, horns, and machine noises.
  • the useful speech signal in the sample speech signal may be extracted based on the speech activity endpoint detection model.
  • voice activity detection (Voice Activity Detection, VAD) can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
  • extracting the useful speech signal in the sample speech signal based on the voice activity endpoint detection model may include: segmenting the sample speech signal to obtain at least one segmented speech signal corresponding to the sample speech signal; determining the short-term energy of each segmented speech signal; and splicing the segmented speech signals whose short-term energy is greater than a preset energy amplitude to obtain the useful speech signal.
  • the preset energy amplitude can be set according to the actual situation, and the specific value is not limited here.
  • the spectral energy and zero-crossing rate of the sample speech signal may also be used for the judgment; the specific process is not limited here.
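As an illustrative sketch only (not the application's reference implementation), the short-term-energy based extraction of useful speech described above might look as follows in Python; the frame length and energy threshold are assumed values chosen for the example.

```python
import numpy as np

def extract_useful_speech(samples: np.ndarray, frame_len: int = 400,
                          energy_threshold: float = 1e-3) -> np.ndarray:
    """Keep only segments whose short-term energy exceeds a preset amplitude.

    frame_len (in samples) and energy_threshold are illustrative values; the
    application leaves both to be chosen according to the actual situation.
    """
    useful_segments = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        segment = samples[start:start + frame_len]
        short_term_energy = float(np.sum(segment ** 2))  # short-term energy of this segment
        if short_term_energy > energy_threshold:         # drop silence, footsteps, machine noise, etc.
            useful_segments.append(segment)
    # splice the retained segments back into one useful speech signal
    return np.concatenate(useful_segments) if useful_segments else np.array([])
```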
  • Step S1012 perform feature extraction on the useful speech signal to obtain corresponding feature information, where the feature information includes emotional feature information and speaker feature information.
  • the emotional feature information may include, but is not limited to, energy, fundamental frequency, speech rate, frequency spectrum, formant frequency, etc.; speaker feature information may include voiceprint features.
  • pre-emphasis processing, framing and windowing can be performed on the useful speech signal to obtain window data corresponding to the useful speech signal; characteristic parameters of the window data are then calculated, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, frequency spectrum and formant frequency, and the characteristic parameters are determined as the emotional feature information.
  • a windowing function, such as a rectangular window, a Hanning window, or a Hamming window, can be used to apply windowing to the framed signals.
  • the energy, fundamental frequency, speech rate, frequency spectrum, and formant frequency may be calculated according to respective calculation formulas corresponding to energy, fundamental frequency, speech rate, frequency spectrum, and formant frequency.
  • the specific calculation process is not limited here.
  • mel spectral data of the window data may be calculated, and the mel spectral data may be determined as speaker characteristic information.
  • the process of calculating the Mel spectrum data of the window data may include: performing fast Fourier transform processing and squaring on the window data to obtain the spectral line energy corresponding to the window data, and then filtering the spectral line energy through a Mel filter bank to obtain the Mel spectrum data corresponding to the window data.
  • the window data may include multiple pieces, so that Mel spectrum data corresponding to each window data can be obtained.
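A hedged sketch of this feature-extraction step, assuming librosa is available; the sampling rate, frame and hop sizes, pre-emphasis coefficient (0.97) and number of Mel bands are illustrative choices rather than values fixed by the application.

```python
import numpy as np
import librosa

def extract_features(useful_speech: np.ndarray, sr: int = 16000):
    """Return (emotional feature info, speaker feature info) for one useful speech signal."""
    # pre-emphasis to boost high-frequency components
    emphasized = np.append(useful_speech[0], useful_speech[1:] - 0.97 * useful_speech[:-1])

    # framing and Hamming windowing (25 ms frames, 10 ms hop at 16 kHz, assumed)
    frame_len, hop_len = 400, 160
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop_len)
    windows = frames * np.hamming(frame_len)[:, None]

    # emotional feature information: e.g. per-window energy and fundamental frequency
    energy = np.sum(windows ** 2, axis=0)
    f0 = librosa.yin(emphasized, fmin=50, fmax=400, sr=sr,
                     frame_length=frame_len, hop_length=hop_len)
    n = min(energy.shape[0], f0.shape[0])
    emotion_features = np.stack([energy[:n], f0[:n]], axis=1)

    # speaker feature information: Mel spectrum data of the windowed signal
    mel = librosa.feature.melspectrogram(y=emphasized, sr=sr, n_fft=frame_len,
                                         hop_length=hop_len, n_mels=40)
    speaker_features = mel.T        # one Mel spectrum vector per window
    return emotion_features, speaker_features
```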
  • Step S1013 Label the feature information according to the identity information and emotion information of the sample user, and obtain the labeled speaker category label and the labeled emotion category label.
  • for example, if the identity information of sample user 1 is A and the emotional information is positive, the feature information of sample user 1 can be labeled: the emotional feature information of sample user 1 is labeled "positive" and the speaker feature information is labeled "A", so as to obtain the labeled speaker category label and the labeled emotion category label of sample user 1.
  • similarly, if the identity information of sample user 2 is B and the emotional information is negative, the feature information of sample user 2 can be labeled: the emotional feature information of sample user 2 is labeled "negative" and the speaker feature information is labeled "B", so as to obtain the labeled speaker category label and the labeled emotion category label of sample user 2.
  • Step S1014 Determine the emotional feature information, speaker feature information, the marked emotional category label, and the marked speaker category label as the training data.
  • in this way, the emotional feature information, the speaker feature information and their labels are used as training data.
  • the training data includes data sets corresponding to multiple sample users.
  • the training data may include a data set of sample user 1, the data set including emotion feature information, speaker feature information, annotated emotion category label "positive", and annotated speaker category label "A”.
  • the training data may also include a data set of sample user 2, including emotion feature information, speaker feature information, annotated emotion category label "Negative”, and annotated speaker category label "B”.
  • Step S102 calling the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model.
  • the emotion recognition model may include a Generative Adversarial Network (GAN).
  • the generative adversarial network mainly includes a feature generator and a feature discriminator; the feature generator is used to generate text, image, video and other data from the input data.
  • the feature discriminator is equivalent to a classifier, which is used to judge the authenticity of the input data.
  • FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application.
  • the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model.
  • the emotion classification model and the speaker classification model are both feature discriminators.
  • the feature generator may use an MLP (Multi-Layer Perceptron) network or a deep neural network to represent the generating function.
  • the emotion classification model and the speaker classification model may include, but are not limited to, convolutional neural networks, restricted Boltzmann machines, or recurrent neural networks, among others.
  • the feature vector required for training can be generated by the feature generator, and then the speaker classification model and the emotion classification model can be trained to convergence according to the feature vector.
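To make the structure of FIG. 3 concrete, a minimal PyTorch sketch is given below; the layer widths, the class counts, the FEATURE_DIM constant, and the use of simple MLP heads for both feature discriminators are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """MLP feature generator: maps feature information to a feature vector."""
    def __init__(self, in_dim: int, hidden: int = 128, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Classifier(nn.Module):
    """Feature discriminator, used as both the emotion and the speaker classification model."""
    def __init__(self, in_dim: int = 64, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)   # fully connected layer (its output is reused in step S104)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor, return_fc: bool = False):
        h = torch.relu(self.fc(x))
        logits = self.out(h)
        return (logits, h) if return_fc else logits

FEATURE_DIM = 40                                   # assumed dimension of the extracted feature information
generator = FeatureGenerator(in_dim=FEATURE_DIM)   # shared feature generator
emotion_classifier = Classifier(n_classes=2)       # positive / negative emotions
speaker_classifier = Classifier(n_classes=10)      # assumed number of sample speakers
```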
  • Step S103 Input the emotional feature information and the speaker feature information into the feature generator for feature generation, and obtain a corresponding emotional feature vector group and speaker feature vector group.
  • FIG. 4 is a schematic diagram of generating a feature by a feature generator according to an embodiment of the present application.
  • the emotion feature information and the speaker feature information are input into the feature generator, and the feature generator generates an emotion feature vector group according to the emotion feature information, and generates a speaker feature vector group according to the speaker feature information.
  • the emotion feature vector group includes at least one emotion feature vector; the speaker feature vector group includes at least one speaker feature vector.
  • the feature generator may generate a corresponding feature vector according to the feature information by using a generating function.
  • the corresponding feature vector can be generated according to the feature information through a deep neural network.
  • the specific feature generation process is not limited here.
  • the corresponding emotional feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can be input into the speaker classification model for training subsequently.
  • Step S104 Input the speaker feature vector group and the marked speaker category label into the speaker classification model for iterative training until convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model.
  • FIG. 5 is a schematic flowchart of sub-steps of training a speaker classification model provided by an embodiment of the present application, which may specifically include the following steps S1041 to S1044.
  • Step S1041 Determine the training sample data for each round of training according to one of the speaker feature vectors in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector.
  • one of the speaker feature vectors and the speaker category label corresponding to the speaker feature vector may be sequentially selected from the speaker feature vector group, and determined as the training sample data for each round of training.
  • Step S1042 Input the current round of training sample data into the speaker classification model for speaker classification training, and obtain the speaker classification prediction result corresponding to the current round of training sample data.
  • the speaker classification prediction result may include the prediction probability corresponding to the speaker prediction category and the speaker prediction category.
  • Step S1043 Determine a loss function value corresponding to the current round according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
  • the loss function value corresponding to the current round may be determined based on the preset loss function, according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
  • a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function can be used to calculate the loss function value.
  • Step S1044 If the loss function value is greater than the preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold; then end the training and obtain the trained speaker classification model.
  • the preset loss value threshold may be set according to the actual situation, and the specific value is not limited herein.
  • a convergence algorithm such as a gradient descent algorithm, a Newton algorithm, a conjugate gradient method, or a Cauchy-Newton method can be used to adjust the parameters of the speaker classification model.
  • the speaker classification model can be rapidly converged, thereby improving the training efficiency and accuracy of the speaker classification model.
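A hedged sketch of steps S1041 to S1044, reusing the generator and speaker_classifier names assumed in the architecture sketch above and assuming speaker_training_rounds yields one (speaker feature information, speaker category label) tensor pair per round; the cross-entropy loss, the Adam optimiser and the loss threshold are illustrative choices among the loss functions and convergence algorithms the application allows.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()                               # one permitted choice of loss function
optimizer = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-3)
loss_threshold = 0.05                                         # preset loss value threshold (assumed)

for speaker_info, speaker_label in speaker_training_rounds:   # one sample pair per training round
    speaker_vec = generator(speaker_info).detach()            # speaker feature vector from the generator
    logits = speaker_classifier(speaker_vec)                  # speaker classification prediction result
    loss = loss_fn(logits, speaker_label)                     # loss function value of the current round
    if loss.item() <= loss_threshold:                         # converged: end the training
        break
    optimizer.zero_grad()
    loss.backward()                                           # back-propagate the error
    optimizer.step()                                          # adjust the speaker classification model
```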
  • by inputting the speaker feature vector group and the speaker category labels into the speaker classification model and iteratively training it to convergence, the speaker classification model learns the speaker features; the learned speaker features can then be back-propagated to the feature generator to generate an emotion feature vector that eliminates the speaker features.
  • obtaining the predicted feature vector corresponding to the trained speaker classification model may include: inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vector output by the fully connected layer of the speaker classification model; and determining the mean value of all the obtained feature vectors as the predicted feature vector.
  • the speaker classification model includes at least a fully connected layer.
  • the speaker classification model may be a convolutional neural network model, including a convolutional layer, a pooling layer, a fully connected layer, a normalization layer, and the like.
  • the feature vector output by the fully connected layer of the speaker classification model can be obtained.
  • one feature vector is output corresponding to the training sample data of each round of training, so multiple feature vectors can be obtained.
  • the mean value of all the obtained feature vectors may be determined as the predicted feature vector; the predicted feature vector can be understood as the speaker features learned by the trained speaker classification model.
  • in this way, the predicted feature vector representing the speaker features learned by the speaker classification model can be obtained.
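A brief sketch of how the predicted feature vector could then be collected from the fully connected layer of the trained speaker classification model, using the return_fc hook assumed in the architecture sketch above.

```python
import torch

fc_outputs = []
with torch.no_grad():
    for speaker_info, _ in speaker_training_rounds:                  # training sample data of each round
        speaker_vec = generator(speaker_info)
        _, fc_out = speaker_classifier(speaker_vec, return_fc=True)  # fully connected layer output
        fc_outputs.append(fc_out)

# the mean of all fully-connected-layer feature vectors is the predicted feature vector
predicted_feature_vector = torch.cat(fc_outputs, dim=0).mean(dim=0)
```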
  • Step S105 back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are eliminated.
  • FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group with speaker features removed according to an embodiment of the present application.
  • the predicted feature vector is back-propagated to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated; the emotion feature vector group from which speaker features are eliminated is then sent to the emotion classification model for training.
  • FIG. 7 is a schematic flowchart of the sub-steps of step S105 .
  • the specific step S105 may include the following steps S1051 and S1052 .
  • Step S1051 Adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, wherein each speaker feature vector in the adjusted speaker feature vector group is the same.
  • the speaker feature vector may be represented by a first distribution function, and the speaker feature vector group includes at least one first distribution function. It can be understood that, since the speaker feature vector includes speaker feature information of multiple sample users, the speaker feature vector group corresponds to a plurality of different first distribution functions.
  • the first distribution function may be a normal distribution function, which may be expressed as f(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²)), where μ represents the mean and σ² represents the variance.
  • adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group may include: determining a second distribution function corresponding to the predicted feature vector, and obtaining the mean and variance of the second distribution function; and updating the mean and variance in each first distribution function according to the obtained mean and variance to obtain updated first distribution functions.
  • the second distribution function corresponding to the predicted feature vector can be expressed as F(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²)); according to the mean μ and the variance σ² of the second distribution function F(x), each first distribution function can be updated, and each updated first distribution function is denoted f'(x).
  • each updated first distribution function f'(x) has the same mean and the same variance, that is, each speaker feature vector in the adjusted speaker feature vector group is the same.
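A minimal sketch of step S1051, under the assumption that each speaker feature vector in the generator is parameterised by the mean and variance of its normal distribution and reusing the predicted_feature_vector computed in the earlier sketch; the concrete numbers and the dictionary representation are illustrative only.

```python
# each first distribution function f_i(x) is described by its own (mean, variance)
speaker_distributions = [
    {"mean": 0.3,  "variance": 1.2},   # speaker A (assumed values)
    {"mean": -0.7, "variance": 0.8},   # speaker B (assumed values)
]

# second distribution function F(x): statistics of the predicted feature vector
predicted_mean = float(predicted_feature_vector.mean())
predicted_variance = float(predicted_feature_vector.var())

# update every first distribution function with F(x)'s mean and variance, so that
# every speaker feature vector in the adjusted group becomes identical
adjusted_distributions = [{"mean": predicted_mean, "variance": predicted_variance}
                          for _ in speaker_distributions]
```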
  • Step S1052 based on the adjusted speaker feature vector group, generate the emotion feature vector group from which the speaker feature is eliminated through the generating function.
  • a generating function in the feature generator can generate an emotion feature vector group that eliminates the speaker feature.
  • the output of the generating function is the emotion feature vector group and the speaker feature vector group, wherein each speaker feature vector in the speaker feature vector group is the same, so the speaker features no longer influence the emotion feature vectors; that is, the obtained emotion feature vector group is the emotion feature vector group from which speaker features are eliminated.
  • by generating the emotion feature vector group from which speaker features are eliminated, the influence of different speaker features on the emotion classification model can be eliminated when the emotion classification model is trained on that vector group.
  • Step S106 Input the emotion feature vector group from which the speaker feature is eliminated and the labeled emotion category label into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained.
  • in the prior art, the emotion feature information and the speaker feature information are generally input into the feature generator, and the feature generator generates the emotion feature vector group and the speaker feature vector group; the emotion feature vector group and the speaker feature vector group are then input into the feature discriminator for emotion classification training to obtain a trained feature discriminator. Therefore, the emotion recognition models of the prior art cannot eliminate the influence of different speakers on emotion recognition.
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training until the emotion classification model converges.
  • the training process may include: determining the training sample data for each round of training according to the emotion feature vector group from which speaker features are eliminated and the emotion category labels; inputting the current round of training sample data into the emotion classification model for emotion classification training, and obtaining the emotion classification prediction result corresponding to the current round of training sample data; determining the loss function value according to the emotion category label corresponding to the current round of training sample data and the emotion classification prediction result; and if the loss function value is greater than the preset loss value threshold, adjusting the parameters of the emotion classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, ending the training and obtaining the trained emotion classification model.
  • the preset loss value threshold may be set according to the actual situation, and the specific value is not limited herein.
  • a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function can be used to calculate the loss function value.
  • convergence algorithms such as the gradient descent algorithm, the Newton algorithm, the conjugate gradient method, or the Cauchy-Newton method can be used to adjust the parameters of the emotion classification model.
  • since the emotion recognition model includes the emotion classification model and the speaker classification model, when the emotion classification model converges, it means that the emotion recognition model also converges, and a trained emotion recognition model is obtained.
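The second training stage mirrors the speaker-classifier loop; a hedged sketch is given below, reusing the emotion_classifier, loss_fn and loss_threshold names from the earlier sketches and assuming emotion_training_rounds yields (speaker-feature-eliminated emotion feature vector, emotion category label) tensor pairs produced by the adjusted generator.

```python
import torch

optimizer = torch.optim.Adam(emotion_classifier.parameters(), lr=1e-3)

for emotion_vec, emotion_label in emotion_training_rounds:   # speaker features already eliminated
    logits = emotion_classifier(emotion_vec)                 # emotion classification prediction result
    loss = loss_fn(logits, emotion_label)                    # same loss function as before (assumed)
    if loss.item() <= loss_threshold:                        # emotion classification model has converged
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```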
  • the trained emotion recognition model is not affected by speaker characteristics.
  • the above-mentioned trained emotion recognition model may also be stored in a node of a blockchain.
  • the trained emotion recognition model needs to be used, it can be obtained from the nodes of the blockchain.
  • an emotion recognition model that is not affected by the speaker features can be obtained, thereby improving the accuracy of emotion recognition.
  • Step S107 obtain the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the voice data to be recognized may be a voice signal collected in advance and stored in a database, or may be generated according to a voice signal collected in real time.
  • the voice signal input by the user at the robot terminal may be collected by a voice acquisition device, noise reduction processing is then performed on the voice signal, and the voice signal after noise reduction processing is determined as the voice signal to be recognized.
  • the voice acquisition device may include electronic devices that collect voice, such as a voice recorder and a microphone.
  • the noise reduction processing of the speech signal can be realized according to a spectral subtraction algorithm, a Wiener filtering algorithm, a minimum mean square error algorithm, or a wavelet transform algorithm.
  • the accuracy of subsequent recognition of the emotion category corresponding to the speech signal can be improved.
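For illustration only, a basic spectral-subtraction denoiser is sketched below (the application equally allows Wiener filtering, minimum mean square error or wavelet-transform methods); treating the first few frames as noise-only is an assumption of this example.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy: np.ndarray, sr: int = 16000,
                         noise_frames: int = 10) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from the noisy signal."""
    _, _, spec = stft(noisy, fs=sr, nperseg=512)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # estimate the noise spectrum from the first `noise_frames` frames (assumed noise-only)
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned_mag = np.maximum(magnitude - noise_estimate, 0.0)     # floor negative values at zero
    _, cleaned = istft(cleaned_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return cleaned
```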
  • before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the method may further include: extracting the useful speech signal in the speech signal, and performing feature extraction on the useful speech signal to obtain the emotional feature information and speaker feature information corresponding to the speech signal.
  • the useful voice signal in the voice signal can be extracted based on the voice activity endpoint detection model.
  • for the specific process of extracting the useful speech signal, reference may be made to the detailed description of the foregoing embodiment, which will not be repeated here.
  • performing feature extraction on the useful voice signal to obtain the emotional feature information and speaker feature information corresponding to the voice signal may include: performing pre-emphasis processing, framing and windowing on the useful voice signal to obtain window data corresponding to the useful voice signal; calculating characteristic parameters of the window data, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, frequency spectrum and formant frequency, and determining the characteristic parameters as the emotional feature information; and calculating the Mel spectrum data of the window data and determining the Mel spectrum data as the speaker feature information.
  • inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal may include: inputting the emotion feature information and speaker feature information into the emotion recognition model for emotion recognition, and obtaining an emotion recognition result corresponding to the speech signal. Emotion recognition results.
  • the emotion recognition model is a pre-trained model, which can be stored in the blockchain or in a local database.
  • FIG. 8 is a schematic interaction diagram of invoking an emotion recognition model for emotion recognition provided by an embodiment of the present application.
  • the trained emotion recognition model can be called from the blockchain, and the emotion feature information and speaker feature information can be input into the emotion recognition model for emotion recognition, and the emotion recognition result corresponding to the speech signal can be obtained.
  • the emotion recognition result may include the prediction probability corresponding to the emotion prediction category and the emotion prediction category.
  • the sentiment prediction category can be positive or negative.
  • for example, the emotion recognition result may be "positive, 90%".
  • by inputting the emotion feature information and speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition can be eliminated, and the accuracy of emotion recognition can be improved.
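Putting the inference path together, a hedged end-to-end sketch that reuses the helper functions and model objects assumed in the earlier sketches; the label order and the assumption that the pooled emotion feature vector matches the generator's input dimension are illustrative.

```python
import torch

def recognize_emotion(raw_speech, sr: int = 16000):
    """Return (predicted emotion category, prediction probability) for one speech signal."""
    denoised = spectral_subtraction(raw_speech, sr=sr)              # noise reduction
    useful = extract_useful_speech(denoised)                        # keep the useful speech signal
    emotion_info, speaker_info = extract_features(useful, sr=sr)    # feature extraction

    # pool frame-level features to one utterance-level vector; it is assumed here that
    # its dimension matches the in_dim the feature generator was built with
    x = torch.tensor(emotion_info, dtype=torch.float32).mean(dim=0, keepdim=True)
    with torch.no_grad():
        logits = emotion_classifier(generator(x))                   # trained emotion recognition model
        probs = torch.softmax(logits, dim=-1)

    categories = ["negative", "positive"]                           # assumed label order
    idx = int(probs.argmax())
    return categories[idx], float(probs[0, idx])                    # e.g. ("positive", 0.90)
```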
  • the emotion recognition method provided by the above embodiment can improve the recognition accuracy of subsequent speaker categories and emotion categories by extracting the useful speech signals in the sample speech signals; pre-emphasis, framing and windowing of the useful speech signals can boost high-frequency components and reduce leakage in the frequency domain, thereby improving subsequent feature extraction; by calling the emotion recognition model to be trained, the feature vectors required for training can be generated by the feature generator, and the speaker classification model and the emotion classification model can then be trained to convergence according to these feature vectors; by inputting the emotion feature information and speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can subsequently be input into the speaker classification model for training; by updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm, the speaker classification model can converge quickly, which improves its training efficiency and accuracy; by inputting the speaker feature vector group and the speaker category labels into the speaker classification model and iteratively training it to convergence, the speaker classification model learns the speaker features, and the learned speaker features can later be back-propagated to the feature generator; by obtaining the feature vectors output by the fully connected layer of the trained speaker classification model, the predicted feature vector representing the learned speaker features can be obtained; based on the adjusted speaker feature vector group, the emotion feature vector group from which speaker features are eliminated is generated, so that training the emotion classification model on this vector group eliminates the influence of different speaker features on the emotion classification model; by inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained; and by inputting the emotion feature information and speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition is eliminated and the accuracy of emotion recognition is improved.
  • FIG. 9 is a schematic block diagram of an emotion recognition apparatus 1000 further provided by an embodiment of the present application, and the emotion recognition apparatus is used for executing the foregoing emotion recognition method.
  • the emotion recognition device may be configured in a server or a terminal.
  • the emotion recognition device 1000 includes: a training data acquisition module 1001 , a model calling module 1002 , a first feature generation module 1003 , a first training module 1004 , a second feature generation module 1005 , and a second training module 1006 and emotion recognition module 1007.
  • the training data acquisition module 1001 is configured to acquire training data, where the training data includes emotion feature information and labeled emotion category labels, and speaker feature information and labeled speaker category labels.
  • the model invoking module 1002 is used for invoking an emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model.
  • the first feature generation module 1003 is configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, and obtain a corresponding emotion feature vector group and speaker feature vector group.
  • the first training module 1004 is used to input the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training to convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model.
  • the second feature generation module 1005 is configured to back-propagate the predicted feature vector to the feature generator for feature generation, and obtain an emotion feature vector group from which speaker features are eliminated.
  • the second training module 1006 is configured to input the emotion feature vector group from which speaker features are eliminated and the labeled emotion category label into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model.
  • the emotion recognition module 1007 is configured to acquire the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  • the above-mentioned apparatus can be implemented in the form of a computer program that can be executed on a computer device as shown in FIG. 10 .
  • FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be a server or a terminal.
  • the computer device includes a processor and a memory connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium; the computer program, when executed by the processor, causes the processor to execute any of the emotion recognition methods.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated circuits) Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • acquire training data, where the training data includes emotional feature information and labeled emotion category labels, as well as speaker feature information and labeled speaker category labels; call the emotion recognition model to be trained, the emotion recognition model including a feature generator, an emotion classification model and a speaker classification model; input the emotional feature information and the speaker feature information into the feature generator for feature generation to obtain the corresponding emotion feature vector group and speaker feature vector group; input the speaker feature vector group and the labeled speaker category labels into the speaker classification model for iterative training to convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model; back-propagate the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated; input the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model; and acquire the speech signal to be recognized and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the speaker feature vector group includes at least one speaker feature vector; when the processor implements the inputting of the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training to convergence, the processor is configured to:
  • the speaker classification model includes at least a fully-connected layer; when the processor obtains the prediction feature vector corresponding to the trained speaker classification model, the processor is configured to:
  • Input the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtain the feature vector output by the fully connected layer of the speaker classification model;
  • determine the mean value of all the obtained feature vectors as the predicted feature vector.
  • the feature generator includes a generating function; when the processor implements back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated, the processor is configured to:
  • adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, wherein each speaker feature vector in the adjusted speaker feature vector group is the same; and, based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are eliminated.
  • the speaker feature vector group includes at least one first distribution function; when the processor implements adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, the processor is configured to:
  • before implementing the inputting of the voice signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the voice signal, the processor is further configured to:
  • when the processor implements the inputting of the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is configured to:
  • the emotion feature information and the speaker feature information are input into the emotion recognition model for emotion recognition, and the emotion recognition result corresponding to the speech signal is obtained.
  • when the processor acquires the training data, the processor is configured to:
  • obtain sample voice signals corresponding to a preset number of sample users, and extract the useful voice signals in the sample voice signals, wherein the sample voice signals are stored in the blockchain; perform feature extraction on the useful voice signals to obtain corresponding feature information, the feature information including emotional feature information and speaker feature information; label the feature information according to the identity information and emotional information of the sample users to obtain the labeled speaker category labels and the labeled emotion category labels; and determine the emotional feature information, the speaker feature information, the labeled emotion category labels, and the labeled speaker category labels as the training data.
  • the embodiments of the present application further provide a computer-readable storage medium
  • the computer-readable storage medium may be non-volatile or volatile
  • the computer-readable storage medium stores a computer program
  • the computer program includes program instructions
  • the processor executes the program instructions to implement any emotion recognition method provided by the embodiments of the present application.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart media card (Smart Media Card, SMC), a secure digital card (Secure Digital Card, SD Card), a flash card (Flash Card), etc.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

The present application relates to the field of artificial intelligence, implements elimination of the effect of different speakers on emotion recognition, thereby improving the accuracy of emotion recognition, and relates to an emotion recognition method and apparatus, a device, and a medium. The method comprises: calling an emotion recognition model to be trained, inputting emotion feature information and speaker feature information into a feature generator for feature generation, to obtain an emotion feature vector group and a speaker feature vector group; inputting the speaker feature vector group and a speaker category label into a speaker classification model for training, and acquiring a prediction feature vector corresponding to the speaker classification model; backpropagating the prediction feature vector to the feature generator for feature generation, and inputting the emotion feature vector group of which speaker features are eliminated and an emotion category label into an emotion classification model for training; and acquiring a speech signal to be recognized, and inputting said speech signal into a trained emotion classification model to obtain an emotion recognition result. In addition, the present application further relates to blockchain technology, and the emotion recognition model can be stored in a blockchain.

Description

情绪识别方法、装置、计算机设备和存储介质Emotion recognition method, device, computer equipment and storage medium
本申请要求于2021年02月26日提交中国专利局、申请号为202110218668.3,发明名称为“情绪识别方法、装置、计算机设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110218668.3 and the invention titled "Emotion Recognition Method, Apparatus, Computer Equipment and Storage Medium" filed with the China Patent Office on February 26, 2021, the entire contents of which are incorporated by reference in this application.
技术领域technical field
本申请涉及人工智能领域,尤其涉及一种情绪识别方法、装置、计算机设备和存储介质。The present application relates to the field of artificial intelligence, and in particular, to an emotion recognition method, device, computer equipment and storage medium.
背景技术Background technique
随着人工智能的快速发展,人机交互技术受到人们的高度重视。在人机交互过程中,需要对不同用户、不同任务、不同场景给予不同的情感反馈和支持,并对人的情感做出友好、灵敏以及智能的反应。因此需要训练计算机进行情绪识别,以使计算机学习人类的理解、察觉和反馈情感特征的能力。With the rapid development of artificial intelligence, human-computer interaction technology has been highly valued by people. In the process of human-computer interaction, it is necessary to give different emotional feedback and support to different users, different tasks, and different scenarios, and to make friendly, sensitive and intelligent responses to human emotions. Therefore, it is necessary to train computers for emotion recognition, so that computers can learn the ability of human beings to understand, perceive and feedback emotional characteristics.
技术问题technical problem
综上,发明人意识到,现有的情绪识别模型,一般通过对语音信号进行分析与识别,进而预测情绪类别。但是在实际场景中,人类表达的情感状态常常受到文化、国家、人群等多种因素,现有的情绪识别模型并不能有效地规避这些因素的影响,从而情绪识别的准确度较低。To sum up, the inventors realized that the existing emotion recognition models generally predict emotion categories by analyzing and recognizing speech signals. However, in actual scenarios, the emotional state expressed by humans is often affected by various factors such as culture, country, and population. The existing emotion recognition models cannot effectively avoid the influence of these factors, so the accuracy of emotion recognition is low.
因此如何提高情绪识别模型的准确性成为亟需解决的问题。Therefore, how to improve the accuracy of emotion recognition model has become an urgent problem to be solved.
Technical Solutions
The present application provides an emotion recognition method and apparatus, a computer device, and a storage medium. The predicted feature vector output by the speaker classification model is back-propagated to the feature generator to generate emotion feature vectors from which speaker features are eliminated, and the emotion classification model is trained on these speaker-eliminated emotion feature vectors, which eliminates the influence of different speakers on the emotion classification model and improves the accuracy of emotion recognition.
In a first aspect, the present application provides an emotion recognition method, the method comprising:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a second aspect, the present application further provides an emotion recognition apparatus, the apparatus comprising:
a training data acquisition module, configured to acquire training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
a model calling module, configured to call an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
a first feature generation module, configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
a first training module, configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to acquire a predicted feature vector corresponding to the trained speaker classification model;
a second feature generation module, configured to back-propagate the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
a second training module, configured to input the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
an emotion recognition module, configured to acquire a speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a third aspect, the present application further provides a computer device, the computer device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: by acquiring training data, emotion feature information with annotated emotion category labels and speaker feature information with annotated speaker category labels can be obtained; by calling the emotion recognition model to be trained, the emotion classification model and the speaker classification model in the emotion recognition model can be trained separately to obtain a trained emotion recognition model; by inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained; by inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, the predicted feature vector can be acquired from the trained speaker classification model; by back-propagating the predicted feature vector to the feature generator for feature generation, the speaker feature vectors can be unified, so that an emotion feature vector group from which speaker features are eliminated is obtained; by inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training, an emotion recognition model that eliminates the influence of different speakers can be obtained; and by inputting the speech signal to be recognized into the trained emotion recognition model for emotion recognition, the accuracy of emotion recognition is improved.
Description of Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of sub-steps of acquiring training data provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of generating features by a feature generator provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of sub-steps of training a speaker classification model provided by an embodiment of the present application;
FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of sub-steps of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application;
FIG. 8 is a schematic interaction diagram of calling an emotion recognition model for emotion recognition, provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of an emotion recognition apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
Embodiments of the Present Invention
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The flowcharts shown in the drawings are merely illustrative; they do not necessarily include all contents and operations/steps, nor do the operations/steps have to be performed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
It should be understood that the terms used in the specification of the present application are only for the purpose of describing particular embodiments and are not intended to limit the present application. As used in the specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in the specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Embodiments of the present application provide an emotion recognition method and apparatus, a computer device, and a storage medium. The emotion recognition method can be applied to a server or a terminal. By back-propagating the predicted feature vector output by the speaker classification model to the feature generator to generate emotion feature vectors from which speaker features are eliminated, and training the emotion classification model on the speaker-eliminated emotion feature vectors, the influence of different speakers on the emotion recognition model can be eliminated and the accuracy of emotion recognition is improved.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a smartphone, a tablet computer, a notebook computer, or a desktop computer.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in those embodiments may be combined with each other without conflict.
As shown in FIG. 1, the emotion recognition method includes steps S101 to S107.
Step S101: acquire training data, the training data including emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels.
In the embodiments of the present application, by acquiring training data, emotion feature information with annotated emotion category labels and speaker feature information with annotated speaker category labels can be obtained. The speaker classification model can then be trained on the speaker feature information and the annotated speaker category labels, and the predicted feature vector corresponding to the trained speaker classification model can be acquired. Emotion feature vectors from which speaker features are eliminated can then be generated according to the predicted feature vector and input into the emotion classification model for training, which eliminates the influence of different speakers on the emotion classification model and improves the accuracy of emotion recognition.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the sub-steps of acquiring training data in step S101, which may specifically include the following steps S1011 to S1014.
Step S1011: acquire sample speech signals corresponding to a preset number of sample users, and extract the useful speech signals from the sample speech signals, where the sample speech signals are stored in a blockchain.
Exemplarily, the sample speech signals corresponding to the preset number of sample users may be obtained from the blockchain.
The sample users include different speakers. For example, the speech of test subjects from different regions, cultures, or age groups under different emotions may be collected, so that the obtained sample speech signals include speech signals of different emotion categories corresponding to multiple speakers.
Exemplarily, the emotion categories may include positive emotions and negative emotions. For example, positive emotions may include, but are not limited to, calmness, optimism, happiness, and the like; negative emotions may include, but are not limited to, complaining, blaming, verbal abuse, filing complaints, and the like.
It should be emphasized that, to further ensure the privacy and security of the sample speech signals, the sample speech signals may also be stored in a node of a blockchain. Of course, the sample speech signals may also be stored in a local database or an external storage device, which is not specifically limited here.
It should be noted that, since a sample speech signal may contain useless signals, the useful speech signal needs to be extracted from the sample speech signal in order to improve the accuracy of subsequent speaker category and emotion category recognition. The useless signals may include, but are not limited to, footsteps, silence, horn sounds, machine noise, and the like.
In the embodiments of the present application, the useful speech signal in the sample speech signal may be extracted based on a voice activity detection model. It should be noted that, in speech signal processing, voice activity detection (VAD) is used to detect whether speech is present, so as to separate the speech segments from the non-speech segments of a signal. VAD can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
In some embodiments, extracting the useful speech signal from the sample speech signal based on the voice activity detection model may include: segmenting the sample speech signal to obtain at least one segmented speech signal corresponding to the sample speech signal; determining the short-time energy of each segmented speech signal; and splicing the segmented speech signals whose short-time energy is greater than a preset energy amplitude to obtain the useful speech signal.
The preset energy amplitude may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, when extracting the useful speech signal from the sample speech signal based on the voice activity detection model, features such as the spectral energy and the zero-crossing rate of the sample speech signal may also be used for the decision in addition to the short-time energy; the specific process is not limited here.
By extracting the useful speech signal from the sample speech signal, the accuracy of subsequent speaker category and emotion category recognition can be improved.
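For illustration only, the following is a minimal sketch of the short-time-energy-based extraction described above, assuming a mono speech signal held in a NumPy array; the frame length and energy threshold are assumed example values, not parameters specified by the present application.

```python
import numpy as np

def extract_useful_speech(signal: np.ndarray,
                          frame_len: int = 400,        # 25 ms at 16 kHz (assumed)
                          energy_threshold: float = 1e-3) -> np.ndarray:
    """Segment the signal, keep segments whose short-time energy exceeds
    the preset energy amplitude, and splice the kept segments together."""
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        segment = signal[start:start + frame_len].astype(np.float64)
        short_time_energy = float(np.sum(segment ** 2)) / frame_len
        if short_time_energy > energy_threshold:
            kept.append(segment)
    return np.concatenate(kept) if kept else np.array([])
```

Spectral energy or zero-crossing rate could be added to the per-segment decision in the same way.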
Step S1012: perform feature extraction on the useful speech signal to obtain corresponding feature information, the feature information including emotion feature information and speaker feature information.
It should be noted that, in the embodiments of the present application, the emotion feature information may include, but is not limited to, energy, fundamental frequency, speech rate, spectrum, formant frequency, and the like; the speaker feature information may include voiceprint features.
In some embodiments, pre-emphasis, framing, and windowing may be performed on the useful speech signal to obtain window data corresponding to the useful speech signal; characteristic parameters of the window data are then calculated, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, spectrum, and formant frequency, and the characteristic parameters are determined as the emotion feature information.
Exemplarily, the windowing of each framed signal may be implemented with a window function such as a rectangular window, a Hanning window, or a Hamming window.
It can be understood that performing pre-emphasis, framing, and windowing on the useful speech signal boosts the high-frequency components and reduces leakage in the frequency domain, thereby improving subsequent feature extraction.
Exemplarily, the energy, fundamental frequency, speech rate, spectrum, and formant frequency may be calculated according to their respective calculation formulas; the specific calculation process is not limited here.
In some embodiments, the Mel spectrum data of the window data may be calculated and determined as the speaker feature information.
Exemplarily, the process of calculating the Mel spectrum data of the window data is as follows: a fast Fourier transform and a squaring operation are performed on the window data to obtain the spectral line energy corresponding to the window data; the spectral line energy is then processed by a Mel filter bank to obtain the Mel spectrum data corresponding to the window data. There may be multiple pieces of window data, so that the Mel spectrum data corresponding to each piece of window data can be obtained.
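As an illustrative sketch only, and not the application's implementation, the pre-emphasis, framing, windowing, and Mel-spectrum steps described above could be wired together as follows; the use of librosa, the 0.97 pre-emphasis coefficient, and the 25 ms / 10 ms frame settings are assumptions.

```python
import numpy as np
import librosa

def extract_feature_information(useful: np.ndarray, sr: int = 16000) -> dict:
    # Pre-emphasis boosts the high-frequency components.
    emphasized = np.append(useful[0], useful[1:] - 0.97 * useful[:-1])

    # Framing (25 ms frames, 10 ms hop) and Hamming windowing give the window data.
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop).T
    windows = frames * np.hamming(frame_len)

    # Emotion feature information: short-time energy is shown here as one example
    # characteristic parameter; F0, speech rate, spectrum, and formants would be
    # computed from the same window data by their own formulas.
    energy = np.sum(windows ** 2, axis=1)

    # Speaker feature information: Mel spectrum data (FFT -> power -> Mel filter bank).
    mel = librosa.feature.melspectrogram(y=emphasized, sr=sr,
                                         n_fft=frame_len, hop_length=hop)
    return {"emotion_features": energy, "speaker_features": mel}
```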
Step S1013: annotate the feature information according to the identity information and emotion information of the sample users to obtain the annotated speaker category labels and the annotated emotion category labels.
Exemplarily, for sample user 1, if the identity information of sample user 1 is A and the emotion information is positive, the feature information of sample user 1 can be annotated; for example, the emotion feature information of sample user 1 is annotated as "positive" and the speaker feature information as "A", thereby obtaining the annotated speaker category label and the annotated emotion category label of sample user 1.
Exemplarily, for sample user 2, if the identity information of sample user 2 is B and the emotion information is negative, the feature information of sample user 2 can be annotated; for example, the emotion feature information of sample user 2 is annotated as "negative" and the speaker feature information as "B", thereby obtaining the annotated speaker category label and the annotated emotion category label of sample user 2.
Step S1014: determine the emotion feature information, the speaker feature information, the annotated emotion category labels, and the annotated speaker category labels as the training data.
Exemplarily, the emotion feature information, the speaker feature information, the annotated emotion category labels, and the annotated speaker category labels are used as the training data, where the training data includes data sets corresponding to multiple sample users.
For example, the training data may include a data set of sample user 1, which includes emotion feature information, speaker feature information, the annotated emotion category label "positive", and the annotated speaker category label "A". The training data may also include a data set of sample user 2, which includes emotion feature information, speaker feature information, the annotated emotion category label "negative", and the annotated speaker category label "B".
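Purely as a hypothetical illustration of how one such annotated data set might be organised (the field names and numbers below are invented for readability and are not the application's data format):

```python
sample_user_1 = {
    "emotion_features": [0.62, 210.0, 4.1],            # e.g. energy, F0 in Hz, speech rate
    "speaker_features": [[0.10, 0.32], [0.21, 0.40]],  # e.g. Mel-spectrum frames
    "emotion_label": "positive",
    "speaker_label": "A",
}
training_data = [sample_user_1]  # one such data set per sample user
```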
Step S102: call an emotion recognition model to be trained, the emotion recognition model including a feature generator, an emotion classification model, and a speaker classification model.
It should be noted that the emotion recognition model may include a generative adversarial network (GAN). A generative adversarial network mainly includes a feature generator and a feature discriminator: the feature generator is used to generate data such as text, images, or video from input data, while the feature discriminator acts as a classifier that judges whether the input data is real or generated.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application. As shown in FIG. 3, in the embodiments of the present application, the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model, where both the emotion classification model and the speaker classification model are feature discriminators.
Exemplarily, the feature generator may use an MLP (Multi-Layer Perceptron) network or a deep neural network to represent the generating function. The emotion classification model and the speaker classification model may include, but are not limited to, a convolutional neural network, a restricted Boltzmann machine, a recurrent neural network, and the like.
By calling the emotion recognition model to be trained, the feature vectors required for training can be generated by the feature generator, and the speaker classification model and the emotion classification model can then be trained to convergence on these feature vectors.
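A minimal PyTorch sketch of this three-part structure follows, assuming an MLP feature generator and two small classifier heads used as the discriminators; the layer sizes, class counts, and the choice of PyTorch are illustrative assumptions, not the application's implementation.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """MLP generator: maps feature information to feature vectors."""
    def __init__(self, in_dim: int = 128, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Classifier(nn.Module):
    """Discriminator head, usable as either the emotion or the speaker classifier."""
    def __init__(self, feat_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)      # fully connected layer
        self.out = nn.Linear(feat_dim, num_classes)  # class scores

    def forward(self, x: torch.Tensor):
        hidden = torch.relu(self.fc(x))
        return self.out(hidden), hidden              # logits and FC-layer feature vector

generator = FeatureGenerator()
emotion_classifier = Classifier(num_classes=2)       # e.g. positive / negative
speaker_classifier = Classifier(num_classes=10)      # e.g. one class per sample speaker
```

Returning the fully connected layer's output alongside the logits is a design convenience for the later step that averages these vectors into the predicted feature vector.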
Step S103: input the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group.
Referring to FIG. 4, FIG. 4 is a schematic diagram of generating features by a feature generator provided by an embodiment of the present application. As shown in FIG. 4, the emotion feature information and the speaker feature information are input into the feature generator, and the feature generator generates an emotion feature vector group from the emotion feature information and a speaker feature vector group from the speaker feature information. The emotion feature vector group includes at least one emotion feature vector, and the speaker feature vector group includes at least one speaker feature vector.
Exemplarily, the feature generator may generate the corresponding feature vectors from the feature information through a generating function. For example, the corresponding feature vectors may be generated from the feature information through a deep neural network; the specific feature generation process is not limited here.
By inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can subsequently be input into the speaker classification model for training.
Step S104: input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquire the predicted feature vector corresponding to the trained speaker classification model.
Referring to FIG. 5, FIG. 5 is a schematic flowchart of the sub-steps of training the speaker classification model provided by an embodiment of the present application, which may specifically include the following steps S1041 to S1044.
Step S1041: determine the training sample data for each round of training from one speaker feature vector in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector.
Exemplarily, one speaker feature vector in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector may be selected in turn and determined as the training sample data for each round of training.
Step S1042: input the current round of training sample data into the speaker classification model for speaker classification training to obtain the speaker classification prediction result corresponding to the current round of training sample data.
Exemplarily, the speaker classification prediction result may include a predicted speaker category and the prediction probability corresponding to the predicted speaker category.
Step S1043: determine the loss function value corresponding to the current round according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
Exemplarily, the loss function value corresponding to the current round may be determined, based on a preset loss function, from the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
Exemplarily, a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function may be used to calculate the loss function value.
Step S1044: if the loss function value is greater than a preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold; then end the training to obtain the trained speaker classification model.
Exemplarily, the preset loss value threshold may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, a convergence algorithm such as a gradient descent algorithm, the Newton method, the conjugate gradient method, or the Cauchy-Newton method may be used to adjust the parameters of the speaker classification model. After the parameters of the speaker classification model are adjusted, the next round of training sample data is input into the speaker classification model for speaker classification training and the corresponding loss function value is determined, until the obtained loss function value is less than or equal to the loss value threshold; the training then ends and the trained speaker classification model is obtained.
By updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm, the speaker classification model can converge quickly, thereby improving the training efficiency and accuracy of the speaker classification model.
By inputting the speaker feature vector group and the speaker category labels into the speaker classification model for iterative training until convergence, the speaker classification model learns the speaker features, and the learned speaker features can subsequently be back-propagated to the feature generator to generate emotion feature vectors from which speaker features are eliminated.
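A minimal sketch of steps S1041 to S1044 is shown below under the assumptions of the earlier PyTorch sketch (a classifier whose forward pass returns logits plus the FC-layer output); cross-entropy loss, plain gradient descent, and the numeric thresholds are illustrative choices among the options listed above.

```python
import torch
import torch.nn as nn

def train_speaker_classifier(model: nn.Module,
                             speaker_vectors: torch.Tensor,   # (N, feat_dim)
                             speaker_labels: torch.Tensor,    # (N,) int64 class ids
                             loss_threshold: float = 0.05,
                             lr: float = 1e-3,
                             max_rounds: int = 1000) -> nn.Module:
    criterion = nn.CrossEntropyLoss()                         # preset loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)    # gradient descent
    n = len(speaker_vectors)
    for round_idx in range(max_rounds):
        # One speaker feature vector and its label form this round's sample data.
        x = speaker_vectors[round_idx % n].unsqueeze(0)
        y = speaker_labels[round_idx % n].unsqueeze(0)
        logits, _ = model(x)
        loss = criterion(logits, y)
        if loss.item() <= loss_threshold:                     # converged: end training
            break
        optimizer.zero_grad()
        loss.backward()                                       # adjust model parameters
        optimizer.step()
    return model
```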
In some embodiments, acquiring the predicted feature vector corresponding to the trained speaker classification model may include: inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and acquiring the feature vector output by the fully connected layer of the speaker classification model; and determining the mean of all the acquired feature vectors as the predicted feature vector.
The speaker classification model includes at least a fully connected layer. Exemplarily, the speaker classification model may be a convolutional neural network model including a convolutional layer, a pooling layer, a fully connected layer, a normalization layer, and the like.
In the embodiments of the present application, the feature vector output by the fully connected layer of the speaker classification model can be acquired. Exemplarily, the training sample data of each round of training yields one output feature vector, so multiple feature vectors can be obtained.
In some implementations, the mean of all the acquired feature vectors may be determined as the predicted feature vector. It can be understood that the predicted feature vector represents the speaker features learned by the trained speaker classification model.
By inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction and acquiring the feature vectors output by the fully connected layer of the speaker classification model, the predicted feature vector in which the speaker classification model has learned the speaker features can be obtained.
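Continuing the same assumed PyTorch setup, the averaging of the fully-connected-layer outputs into a single predicted feature vector could look like this sketch:

```python
import torch

@torch.no_grad()
def predicted_feature_vector(trained_model, speaker_vectors: torch.Tensor) -> torch.Tensor:
    """Feed each round's training sample into the trained speaker classifier,
    collect the fully connected layer's output, and return the mean vector."""
    fc_outputs = []
    for x in speaker_vectors:                      # one sample per training round
        _, fc_feature = trained_model(x.unsqueeze(0))
        fc_outputs.append(fc_feature.squeeze(0))
    return torch.stack(fc_outputs).mean(dim=0)     # mean of all feature vectors
```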
Step S105: back-propagate the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated.
Referring to FIG. 6, FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application. As shown in FIG. 6, the predicted feature vector is back-propagated to the feature generator for feature generation to obtain the speaker-eliminated emotion feature vector group, which is then sent to the emotion classification model for training.
Referring to FIG. 7, FIG. 7 is a schematic flowchart of the sub-steps of step S105. Specifically, step S105 may include the following steps S1051 and S1052.
Step S1051: adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain an adjusted speaker feature vector group, where every speaker feature vector in the adjusted speaker feature vector group is the same.
Exemplarily, a speaker feature vector may be represented by a first distribution function, and the speaker feature vector group includes at least one first distribution function. It can be understood that, since the speaker feature vectors contain the speaker feature information of multiple sample users, the speaker feature vector group corresponds to multiple different first distribution functions.
Exemplarily, the first distribution function may be a normal distribution function, which may be expressed as:
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
where μ denotes the mean and σ² denotes the variance.
In some embodiments, adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group may include: determining a second distribution function corresponding to the predicted feature vector and obtaining the mean and variance of the second distribution function; and updating the mean and variance in each first distribution function according to this mean and variance to obtain updated first distribution functions.
Exemplarily, the second distribution function corresponding to the predicted feature vector may likewise be a normal distribution function, expressed as:
F(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
where μ and σ² here denote the mean and variance of the predicted feature vector.
Exemplarily, the mean μ and variance σ² in each first distribution function f(x) may be updated according to the mean μ and variance σ² of the second distribution function F(x), so that each updated first distribution function is f′(x).
It can be understood that every updated first distribution function f′(x) has the same mean and the same variance; that is, every speaker feature vector in the adjusted speaker feature vector group is the same.
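A small numerical sketch of this adjustment is given below, assuming each speaker feature vector can be summarised by the mean and variance of its normal distribution; representing the distributions as simple (mean, variance) pairs is an assumption made only for illustration.

```python
import numpy as np

def adjust_speaker_group(speaker_group, predicted_vector):
    """Overwrite the mean and variance of every first distribution f(x) with the
    mean and variance of the second distribution F(x), so that every adjusted
    speaker feature vector becomes the same."""
    mu = float(np.mean(predicted_vector))    # mean of the second distribution
    var = float(np.var(predicted_vector))    # variance of the second distribution
    return [{"mean": mu, "var": var} for _ in speaker_group]

def normal_pdf(x, mean, var):
    """Density of an updated first distribution f'(x)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```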
Step S1052: based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are eliminated.
Exemplarily, after the adjusted speaker feature vector group is obtained, the emotion feature vector group from which speaker features are eliminated may be generated by the generating function in the feature generator based on the adjusted speaker feature vector group.
It can be understood that the generating function outputs the emotion feature vector group and the speaker feature vector group. Since every speaker feature vector in the adjusted speaker feature vector group is the same, the speaker feature vectors no longer affect the emotion feature vectors; that is, the obtained emotion feature vector group consists of emotion feature vectors from which speaker features are eliminated.
By generating, based on the adjusted speaker feature vector group, the emotion feature vector group from which speaker features are eliminated, the influence of different speaker features on the emotion classification model can be eliminated when the emotion classification model is trained on the speaker-eliminated emotion feature vector group.
Step S106: input the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model.
It should be noted that, in the training process of emotion recognition models in the prior art, the emotion feature information and the speaker feature information are generally input into a feature generator, which generates an emotion feature vector group and a speaker feature vector group; the emotion feature vector group and the speaker feature vector group are then input into a feature discriminator for emotion classification training to obtain a trained feature discriminator. Prior-art emotion recognition models therefore cannot eliminate the influence of different speakers on emotion recognition.
Exemplarily, the speaker-eliminated emotion feature vector group and the annotated emotion category labels are input into the emotion classification model for iterative training until the emotion classification model converges.
The training process may include: determining the training sample data for each round of training from the speaker-eliminated emotion feature vector group and the emotion category labels; inputting the current round of training sample data into the emotion classification model for emotion classification training to obtain the emotion classification prediction result corresponding to the current round of training sample data; determining the loss function value according to the emotion category label corresponding to the current round of training sample data and the emotion classification prediction result; and, if the loss function value is greater than a preset loss value threshold, adjusting the parameters of the emotion classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, at which point the training ends and the trained emotion classification model is obtained.
Exemplarily, the preset loss value threshold may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function may be used to calculate the loss function value, and a convergence algorithm such as a gradient descent algorithm, the Newton method, the conjugate gradient method, or the Cauchy-Newton method may be used to adjust the parameters of the emotion classification model.
It should be noted that, since the emotion recognition model includes the emotion classification model and the speaker classification model, when the emotion classification model converges, the emotion recognition model also converges, and the trained emotion recognition model is obtained. The trained emotion recognition model is not affected by speaker features.
In some embodiments, to further ensure the privacy and security of the trained emotion recognition model, the trained emotion recognition model may also be stored in a node of a blockchain. When the trained emotion recognition model needs to be used, it can be obtained from the blockchain node.
By inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained, thereby improving the accuracy of emotion recognition.
Step S107: acquire a speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that, in the embodiments of the present application, the speech data to be recognized may be a speech signal collected in advance and stored in a database, or may be generated from a speech signal collected in real time.
Exemplarily, in a human-computer interaction scenario, the speech signal input by a user at a robot terminal may be collected by a speech collection device, noise reduction may then be performed on the speech signal, and the noise-reduced speech signal is determined as the speech signal to be recognized.
The speech collection device may include electronic devices for collecting speech, such as a recorder, a voice recorder pen, or a microphone.
The noise reduction of the speech signal may be implemented according to a spectral subtraction algorithm, a Wiener filtering algorithm, a minimum mean square error algorithm, or a wavelet transform algorithm.
By performing noise reduction on the speech signal, the accuracy of subsequently recognizing the emotion category corresponding to the speech signal can be improved.
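As one example of the options listed above, a bare-bones spectral-subtraction sketch is shown below; treating the first few frames as noise-only and the chosen STFT settings are assumptions for illustration, not the application's noise-reduction implementation.

```python
import numpy as np
import librosa

def spectral_subtraction(noisy: np.ndarray, noise_frames: int = 10) -> np.ndarray:
    stft = librosa.stft(noisy, n_fft=512, hop_length=128)
    magnitude, phase = np.abs(stft), np.angle(stft)
    # Estimate the noise spectrum from frames assumed to contain no speech.
    noise_profile = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(magnitude - noise_profile, 0.0)
    return librosa.istft(cleaned * np.exp(1j * phase), hop_length=128)
```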
In some embodiments, before the speech signal is input into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the method may further include: extracting the useful speech signal from the speech signal, and performing feature extraction on the useful speech signal to obtain the emotion feature information and speaker feature information corresponding to the speech signal.
Exemplarily, the useful speech signal in the speech signal may be extracted based on the voice activity detection model. For the specific process of extracting the useful speech signal, reference may be made to the detailed description of the foregoing embodiments, which is not repeated here.
By extracting the useful speech signal from the speech signal, the accuracy of subsequently recognizing the emotion category can be improved.
In some embodiments, performing feature extraction on the useful speech signal to obtain the emotion feature information and speaker feature information corresponding to the speech signal may include: performing pre-emphasis, framing, and windowing on the useful speech signal to obtain window data corresponding to the useful speech signal; calculating characteristic parameters of the window data, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, spectrum, and formant frequency, and determining the characteristic parameters as the emotion feature information; and calculating the Mel spectrum data of the window data and determining the Mel spectrum data as the speaker feature information.
For the specific process of feature extraction, reference may be made to the detailed description of the foregoing embodiments, which is not repeated here.
By performing pre-emphasis, framing, and windowing on the useful speech signal, the high-frequency components can be boosted and leakage in the frequency domain can be reduced, thereby improving subsequent feature extraction.
In some embodiments, inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal may include: inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that the emotion recognition model is a pre-trained model, which may be stored in a blockchain or in a local database.
Referring to FIG. 8, FIG. 8 is a schematic interaction diagram of calling an emotion recognition model for emotion recognition, provided by an embodiment of the present application. As shown in FIG. 8, the trained emotion recognition model can be called from the blockchain, and the emotion feature information and the speaker feature information are input into the emotion recognition model for emotion recognition to obtain the emotion recognition result corresponding to the speech signal.
Exemplarily, the emotion recognition result may include a predicted emotion category and the prediction probability corresponding to the predicted emotion category, where the predicted emotion category may be positive or negative. For example, the emotion recognition result may be "positive, 90%".
By inputting the emotion feature information and the speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition can be eliminated and the accuracy of emotion recognition is improved.
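Tying the earlier sketches together, an assumed end-to-end inference call could look as follows; the category names match the positive/negative example above, while the wiring of generator and classifier is illustrative rather than the application's implementation.

```python
import torch

EMOTION_NAMES = ["negative", "positive"]

@torch.no_grad()
def recognise_emotion(generator, trained_emotion_classifier,
                      emotion_feature_info: torch.Tensor):
    emotion_vector = generator(emotion_feature_info.unsqueeze(0))  # feature generation
    logits, _ = trained_emotion_classifier(emotion_vector)
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    best = int(torch.argmax(probs))
    return EMOTION_NAMES[best], float(probs[best])                 # e.g. ("positive", 0.90)
```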
In the emotion recognition method provided by the above embodiments, extracting the useful speech signal from the sample speech signal improves the subsequent recognition accuracy for both speaker categories and emotion categories. Applying pre-emphasis, framing and windowing to the useful speech signal boosts the high-frequency components and reduces leakage in the frequency domain, which improves the subsequent feature extraction. By calling the emotion recognition model to be trained, the feature generator can produce the feature vectors required for training, so that the speaker classification model and the emotion classification model can be trained to convergence on these feature vectors. Inputting the emotion feature information and the speaker feature information into the feature generator yields the corresponding emotion feature vector group and speaker feature vector group, and the speaker feature vector group can then be fed into the speaker classification model for training. Updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm makes the speaker classification model converge quickly, which improves its training efficiency and accuracy. Iteratively training the speaker classification model on the speaker feature vector group and the speaker category labels until convergence lets the model learn the speaker features, which can later be back-propagated to the feature generator to generate emotion feature vectors from which the speaker features are removed. Feeding the training sample data of each round into the trained speaker classification model for speaker classification prediction, and taking the feature vectors output by its fully connected layer, yields the predicted feature vector that encodes the learned speaker features. Generating the emotion feature vector group with speaker features removed from the adjusted speaker feature vector group means that, when the emotion classification model is trained on this group, the influence of different speakers' features on the emotion classification model is eliminated. Iteratively training the emotion classification model on the speaker-free emotion feature vector group and the annotated emotion category labels produces an emotion recognition model that is not affected by speaker features. Finally, inputting the emotion feature information and the speaker feature information into the pre-trained emotion recognition model for prediction eliminates the influence of different speakers' features on emotion recognition and improves its accuracy.
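As an illustration of the pre-emphasis, framing and windowing steps mentioned above, the following Python sketch shows one common way such preprocessing is implemented. The sample rate, frame length, hop length and pre-emphasis coefficient are assumptions for demonstration and are not values specified by this application.

```python
# Illustrative sketch (not taken from the patent text): pre-emphasis, framing
# and Hamming windowing of a speech signal, assuming 16 kHz audio, 25 ms frames
# and a 10 ms hop.
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.97, frame_ms=25, hop_ms=10):
    # Pre-emphasis boosts high-frequency components: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    # Split into overlapping frames and apply a Hamming window to each frame
    # to reduce spectral leakage in the frequency domain.
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

# Example: one second of random audio -> windowed frames
frames = preprocess(np.random.randn(16000))
print(frames.shape)  # (98, 400) with the assumed settings
```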
Please refer to FIG. 9, which is a schematic block diagram of an emotion recognition apparatus 1000 provided by an embodiment of the present application; the emotion recognition apparatus is used to perform the foregoing emotion recognition method. The emotion recognition apparatus may be configured in a server or a terminal.
As shown in FIG. 9, the emotion recognition apparatus 1000 includes: a training data acquisition module 1001, a model calling module 1002, a first feature generation module 1003, a first training module 1004, a second feature generation module 1005, a second training module 1006 and an emotion recognition module 1007.
The training data acquisition module 1001 is configured to acquire training data, where the training data includes emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels.
The model calling module 1002 is configured to call the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model.
The first feature generation module 1003 is configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group.
The first training module 1004 is configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model.
The second feature generation module 1005 is configured to back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed.
The second training module 1006 is configured to input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model.
The emotion recognition module 1007 is configured to acquire the speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and of each module described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 10.
Please refer to FIG. 10, which is a schematic structural block diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to FIG. 10, the computer device includes a processor and a memory connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any of the emotion recognition methods.
It should be understood that the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or it may be any conventional processor.
In one embodiment, the processor is configured to run the computer program stored in the memory to implement the following steps:
Acquire training data, where the training data includes emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels; call the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model; input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group; input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model; back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed; input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model; acquire the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
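The following Python sketch (using PyTorch) outlines a simplified two-stage training skeleton loosely following the steps above: a feature generator feeding a speaker classification model and an emotion classification model. The network sizes, optimizer, number of rounds and toy data are assumptions, and the speaker-feature elimination step itself is sketched separately in the embodiments below; this is not the definitive implementation of the claimed method.

```python
# Hedged sketch of a two-stage training flow: feature generator + speaker
# classifier first, then feature generator + emotion classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, hidden, n_speakers, n_emotions = 40, 64, 4, 2  # assumed sizes

feature_generator = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
speaker_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_speakers))
emotion_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_emotions))

# Toy data standing in for the extracted feature information and labels.
features = torch.randn(128, feat_dim)
speaker_labels = torch.randint(0, n_speakers, (128,))
emotion_labels = torch.randint(0, n_emotions, (128,))

# Stage 1: train the speaker classification branch.
opt_spk = torch.optim.Adam(list(feature_generator.parameters()) +
                           list(speaker_classifier.parameters()), lr=1e-3)
for _ in range(50):
    loss = F.cross_entropy(speaker_classifier(feature_generator(features)), speaker_labels)
    opt_spk.zero_grad(); loss.backward(); opt_spk.step()

# Stage 2: train the emotion classification branch on generated feature vectors.
opt_emo = torch.optim.Adam(list(feature_generator.parameters()) +
                           list(emotion_classifier.parameters()), lr=1e-3)
for _ in range(50):
    loss = F.cross_entropy(emotion_classifier(feature_generator(features)), emotion_labels)
    opt_emo.zero_grad(); loss.backward(); opt_emo.step()

print("final emotion loss:", loss.item())
```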
In one embodiment, the speaker feature vector group includes at least one speaker feature vector; when inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, the processor is configured to implement:
Take one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training; input the current round's training sample data into the speaker classification model for speaker classification training, to obtain the speaker classification prediction result corresponding to the current round's training sample data; determine the loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; if the loss function value is greater than a preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, at which point training ends and the trained speaker classification model is obtained.
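A minimal sketch of this round-by-round training loop with a preset loss value threshold might look as follows; the threshold, learning rate and model shape are illustrative assumptions rather than values given in this application.

```python
# Hedged sketch: train the speaker classifier round by round until the loss
# falls to or below a preset threshold.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, n_speakers = 64, 4  # assumed sizes

speaker_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_speakers))
optimizer = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-2)

# Each round uses speaker feature vectors and their speaker category labels.
speaker_vectors = torch.randn(32, hidden)
speaker_labels = torch.randint(0, n_speakers, (32,))
loss_threshold = 0.1  # assumed preset loss value threshold

for round_idx in range(1000):                     # cap the number of rounds
    logits = speaker_classifier(speaker_vectors)  # speaker classification prediction
    loss = F.cross_entropy(logits, speaker_labels)
    if loss.item() <= loss_threshold:             # converged: stop training
        break
    optimizer.zero_grad()
    loss.backward()                               # adjust model parameters
    optimizer.step()

print(f"stopped after round {round_idx} with loss {loss.item():.3f}")
```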
In one embodiment, the speaker classification model includes at least a fully connected layer; when obtaining the predicted feature vector corresponding to the trained speaker classification model, the processor is configured to implement:
Input the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtain the feature vectors output by the fully connected layer of the speaker classification model; determine the mean of all the obtained feature vectors as the predicted feature vector.
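The sketch below illustrates one way to collect the fully connected layer's outputs during speaker classification prediction and average them into a predicted feature vector; the forward-hook mechanism and the layer sizes are assumptions used for demonstration.

```python
# Hedged sketch: capture the fully-connected-layer outputs and average them
# into the predicted feature vector.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, n_speakers = 64, 4  # assumed sizes

fc = nn.Linear(hidden, hidden)        # fully connected layer of interest
head = nn.Linear(hidden, n_speakers)  # classification head
speaker_classifier = nn.Sequential(fc, nn.ReLU(), head)

captured = []
fc.register_forward_hook(lambda module, inp, out: captured.append(out.detach()))

training_samples = torch.randn(32, hidden)
with torch.no_grad():
    speaker_classifier(training_samples)  # speaker classification prediction

fc_outputs = torch.cat(captured, dim=0)   # feature vectors from the FC layer
predicted_feature_vector = fc_outputs.mean(dim=0)
print(predicted_feature_vector.shape)     # torch.Size([64])
```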
In one embodiment, the feature generator includes a generating function; when back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed, the processor is configured to implement:
Adjust the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain the adjusted speaker feature vector group, where every speaker feature vector in the adjusted speaker feature vector group is the same; based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are removed.
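The following sketch shows one possible reading of this adjustment step: every speaker feature vector in the group is replaced by the shared predicted feature vector before the generating function produces the emotion feature vectors, so speaker variation no longer enters the generated features. The concrete generating function used here (a linear map over concatenated features) is an assumption.

```python
# Hedged sketch: make all speaker feature vectors identical, then regenerate
# the emotion feature vectors from the adjusted group.
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, spk_dim = 40, 64  # assumed sizes

generator = nn.Linear(feat_dim + spk_dim, feat_dim)  # assumed generating function

emotion_features = torch.randn(32, feat_dim)
speaker_vectors = torch.randn(32, spk_dim)           # original speaker feature group
predicted_feature_vector = torch.randn(spk_dim)      # from the trained classifier

# Adjust the group: every speaker feature vector becomes the same vector.
adjusted_speakers = predicted_feature_vector.expand(32, spk_dim)

# Generate emotion feature vectors from which speaker variation is removed.
speaker_free_emotion = generator(torch.cat([emotion_features, adjusted_speakers], dim=1))
print(speaker_free_emotion.shape)                     # torch.Size([32, 40])
```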
In one embodiment, the speaker feature vector group includes at least one first distribution function; when adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, the processor is configured to implement:
Determine the second distribution function corresponding to the predicted feature vector, and obtain the mean and variance of the second distribution function; update the mean and variance in each first distribution function according to that mean and variance, to obtain the updated first distribution functions.
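A minimal sketch of this distribution update, assuming the first and second distribution functions are characterized only by their mean and variance (e.g. Gaussians), is shown below; the Gaussian assumption and the toy data are illustrative.

```python
# Hedged sketch: overwrite each first distribution's mean/variance with the
# mean/variance of the second distribution fitted to the predicted feature vector.
import numpy as np

rng = np.random.default_rng(0)

# First distribution functions: one (mean, variance) pair per speaker (assumed).
first_distributions = [{"mean": rng.normal(), "var": abs(rng.normal()) + 0.1}
                       for _ in range(4)]

# Second distribution function fitted to the predicted feature vector.
predicted_feature_vector = rng.normal(size=64)
second_mean = predicted_feature_vector.mean()
second_var = predicted_feature_vector.var()

# Update every first distribution with the mean and variance of the second.
updated = [{"mean": second_mean, "var": second_var} for _ in first_distributions]
print(updated[0])
```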
In one embodiment, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is further configured to implement:
Extract the useful speech signal from the speech signal, and perform feature extraction on the useful speech signal, to obtain the emotion feature information and speaker feature information corresponding to the speech signal.
In one embodiment, when inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is configured to implement:
Input the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
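The sketch below illustrates this inference path with a stand-in feature extractor and a toy model; the energy-based useful-signal gate, the specific features and the model interface are assumptions for demonstration only.

```python
# Hedged sketch: extract the useful signal and feature information, then query
# a (toy) emotion recognition model for a category and probability.
import numpy as np

def extract_useful_signal(signal, energy_threshold=0.01, frame_len=400):
    # Keep frames whose energy exceeds a threshold (a crude voice-activity gate).
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    kept = [f for f in frames if np.mean(f ** 2) > energy_threshold]
    return np.concatenate(kept) if kept else signal

def predict_emotion(model, signal):
    useful = extract_useful_signal(signal)
    emotion_feat = np.array([useful.mean(), useful.std()])        # assumed emotion features
    speaker_feat = np.array([np.abs(useful).max(), len(useful)])  # assumed speaker features
    probs = model(np.concatenate([emotion_feat, speaker_feat]))
    label = "positive" if probs[0] >= 0.5 else "negative"
    return label, float(max(probs))

def toy_model(feats):
    # Stand-in for the trained emotion recognition model: [p_positive, p_negative].
    return np.array([0.9, 0.1])

print(predict_emotion(toy_model, np.random.randn(16000)))  # e.g. ('positive', 0.9)
```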
In one embodiment, when acquiring the training data, the processor is configured to implement:
Acquire sample speech signals corresponding to a preset number of sample users, and extract the useful speech signals from the sample speech signals, where the sample speech signals are stored in a blockchain; perform feature extraction on the useful speech signals to obtain corresponding feature information, where the feature information includes emotion feature information and speaker feature information; annotate the feature information according to the identity information and emotion information of the sample users, to obtain the annotated speaker category labels and annotated emotion category labels; and determine the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
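The following sketch illustrates assembling such training data for a handful of sample users; the placeholder feature extractor, the random stand-in speech signals and the label fields are assumptions (in practice the sample speech signals would come from the blockchain storage described above).

```python
# Hedged sketch: build training records of feature information plus annotated
# emotion and speaker category labels.
import numpy as np

rng = np.random.default_rng(0)

def extract_features(signal):
    # Placeholder feature extraction returning (emotion_features, speaker_features).
    return np.array([signal.mean(), signal.std()]), np.array([np.abs(signal).max()])

sample_users = [{"id": i, "emotion": rng.choice(["positive", "negative"])}
                for i in range(3)]

training_data = []
for user in sample_users:
    signal = rng.normal(size=16000)          # stands in for the stored sample speech
    emotion_feat, speaker_feat = extract_features(signal)
    training_data.append({
        "emotion_features": emotion_feat,
        "speaker_features": speaker_feat,
        "emotion_label": user["emotion"],    # annotated emotion category label
        "speaker_label": user["id"],         # annotated speaker category label
    })

print(len(training_data), training_data[0]["emotion_label"])
```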
Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and a processor executes the program instructions to implement any of the emotion recognition methods provided by the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital card (SD card) or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art could easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An emotion recognition method, comprising:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  2. The emotion recognition method according to claim 1, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  3. The emotion recognition method according to claim 2, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  4. The emotion recognition method according to claim 1, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  5. The emotion recognition method according to claim 4, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
  6. The emotion recognition method according to claim 1, further comprising, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal:
    extracting a useful speech signal from the speech signal, and performing feature extraction on the useful speech signal, to obtain emotion feature information and speaker feature information corresponding to the speech signal;
    wherein inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal comprises:
    inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
  7. The emotion recognition method according to any one of claims 1-6, wherein acquiring the training data comprises:
    acquiring sample speech signals corresponding to a preset number of sample users, and extracting useful speech signals from the sample speech signals, wherein the sample speech signals are stored in a blockchain;
    performing feature extraction on the useful speech signals to obtain corresponding feature information, the feature information comprising emotion feature information and speaker feature information;
    annotating the feature information according to identity information and emotion information of the sample users, to obtain the annotated speaker category labels and the annotated emotion category labels; and
    determining the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
  8. An emotion recognition apparatus, comprising:
    a training data acquisition module, configured to acquire training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    a model calling module, configured to call an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    a first feature generation module, configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    a first training module, configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to obtain a predicted feature vector corresponding to the trained speaker classification model;
    a second feature generation module, configured to back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    a second training module, configured to input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    an emotion recognition module, configured to acquire a speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  9. A computer device, comprising a memory and a processor;
    the memory being configured to store a computer program;
    the processor being configured to execute the computer program and, when executing the computer program, to implement:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  10. The computer device according to claim 9, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  11. The computer device according to claim 10, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  12. The computer device according to claim 9, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  13. The computer device according to claim 12, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
  14. The computer device according to claim 9, wherein the processor is further configured, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, to implement:
    extracting a useful speech signal from the speech signal, and performing feature extraction on the useful speech signal, to obtain emotion feature information and speaker feature information corresponding to the speech signal;
    wherein inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal comprises:
    inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
  15. The computer device according to any one of claims 9-14, wherein acquiring the training data comprises:
    acquiring sample speech signals corresponding to a preset number of sample users, and extracting useful speech signals from the sample speech signals, wherein the sample speech signals are stored in a blockchain;
    performing feature extraction on the useful speech signals to obtain corresponding feature information, the feature information comprising emotion feature information and speaker feature information;
    annotating the feature information according to identity information and emotion information of the sample users, to obtain the annotated speaker category labels and the annotated emotion category labels; and
    determining the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  17. The computer-readable storage medium according to claim 16, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  18. The computer-readable storage medium according to claim 17, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  19. The computer-readable storage medium according to claim 16, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  20. The computer-readable storage medium according to claim 19, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
PCT/CN2021/084252 2021-02-26 2021-03-31 Emotion recognition method and apparatus, computer device, and storage medium WO2022178942A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218668.3A CN112949708B (en) 2021-02-26 2021-02-26 Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN202110218668.3 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178942A1 true WO2022178942A1 (en) 2022-09-01

Family

ID=76246480

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084252 WO2022178942A1 (en) 2021-02-26 2021-03-31 Emotion recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112949708B (en)
WO (1) WO2022178942A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780341B (en) * 2021-08-04 2024-02-06 华中科技大学 Multidimensional emotion recognition method and system
CN113889149B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN115482837B (en) * 2022-07-25 2023-04-28 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063427A (en) * 2014-06-06 2014-09-24 北京搜狗科技发展有限公司 Expression input method and device based on semantic understanding
CN110379445A (en) * 2019-06-20 2019-10-25 深圳壹账通智能科技有限公司 Method for processing business, device, equipment and storage medium based on mood analysis
GB2588747B (en) * 2019-06-28 2021-12-08 Huawei Tech Co Ltd Facial behaviour analysis
CN110556129B (en) * 2019-09-09 2022-04-19 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method
CN111523389A (en) * 2020-03-25 2020-08-11 中国平安人寿保险股份有限公司 Intelligent emotion recognition method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN112259105A (en) * 2020-10-10 2021-01-22 西南政法大学 Training method of voiceprint recognition model, storage medium and computer equipment
CN112382309A (en) * 2020-12-11 2021-02-19 平安科技(深圳)有限公司 Emotion recognition model training method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796290A (en) * 2023-08-23 2023-09-22 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium
CN116796290B (en) * 2023-08-23 2024-03-29 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium
CN117582227A (en) * 2024-01-18 2024-02-23 华南理工大学 fNIRS emotion recognition method and system based on probability distribution labels and brain region characteristics

Also Published As

Publication number Publication date
CN112949708B (en) 2023-10-24
CN112949708A (en) 2021-06-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927376; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21927376; Country of ref document: EP; Kind code of ref document: A1)