CN113571096A - Speech emotion classification model training method and device, computer equipment and medium - Google Patents


Info

Publication number
CN113571096A
CN113571096A (application number CN202110836890.XA)
Authority
CN
China
Prior art keywords
emotion
voice
recognized
speech
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110836890.XA
Other languages
Chinese (zh)
Other versions
CN113571096B (en)
Inventor
张超
魏韬
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110836890.XA
Publication of CN113571096A
Application granted
Publication of CN113571096B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion classification model training method and apparatus, a computer device, and a storage medium. The method comprises: determining a target emotion recognition result for each piece of voice data to be recognized, the target emotion recognition result comprising a target object and a target emotion label; inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result that corresponds to the voice data to be recognized and comprises a predicted object and a predicted emotion label; determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the predicted object and the predicted emotion label; and, when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as the speech emotion classification model. The invention improves the accuracy of speech emotion recognition.

Description

Speech emotion classification model training method and device, computer equipment and medium
Technical Field
The invention relates to the technical field of classification models, in particular to a speech emotion classification model training method, a speech emotion classification model training device, computer equipment and a medium.
Background
Emotion recognition plays a very important role in intelligent human-computer interaction systems, especially in automated customer service systems. For example, an automatic customer service system needs to recognize, in real time, the emotion expressed in the user's speech so that corresponding measures can be taken in response.
In the prior art, voice data is mainly converted into text by machine recognition, and text emotion recognition is then performed on that text. However, this approach uses only the emotion information reflected in the textual content of the voice data, while the emotion information carried by the speech signal itself is lost, so the emotion recognition accuracy is low; and if errors occur when the voice data is converted into text, the accuracy drops further.
Disclosure of Invention
The embodiment of the invention provides a speech emotion classification model training method, a speech emotion classification model training device, computer equipment and a medium, and aims to solve the problem of low emotion recognition accuracy.
A speech emotion classification model training method comprises the following steps:
acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
A speech emotion classification model training apparatus, comprising:
the voice training set acquisition module is used for acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
the target emotion recognition result determining module is used for determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module is used for inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
the prediction loss value determining module is used for determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
and the model updating training module is used for iteratively updating the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, and recording the converged preset classification model as a speech emotion classification model when the prediction loss value reaches the convergence condition.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above speech emotion classification model training method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the above speech emotion classification model training method.
The method comprises the steps of obtaining a preset voice training set; the preset voice training set comprises at least one voice data to be recognized; determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label; inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object; determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label; and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
In this way, when emotion recognition is carried out on the voice data to be recognized through the preset classification model, the same emotion in the voice data to be recognized is tracked by an emotion tracking recognition method, so that speech segments sharing the same emotion are fused and judged together, a more accurate predicted emotion label is generated, and the accuracy of emotion recognition is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a speech emotion classification model training method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a speech emotion classification model according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S30 of the method for training the speech emotion classification model according to the embodiment of the present invention;
FIG. 4 is a schematic block diagram of a training apparatus for speech emotion classification model according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a predicted emotion recognition result determination module in the speech emotion classification model training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The speech emotion classification model training method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the method is applied to a speech emotion classification model training system that comprises the client and server shown in fig. 1; the client and the server communicate over a network and are used to solve the problem of low emotion recognition accuracy. The client, also called the user side, refers to a program that corresponds to the server and provides local services for the user. The client may be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for training a speech emotion classification model is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s10: acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
It can be understood that the voice data to be recognized refers to voice data on which emotion recognition needs to be performed, and its data source varies with the application scenario. For example, in an intelligent customer service scenario, the voice data to be recognized may be the user's speech received by the system.
S20: determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
As can be understood, a target object is a participant in the voice data to be recognized; in an intelligent customer service scenario, the voice data to be recognized may include three target objects: a user, a robot, and a customer service agent. A target emotion tag refers to the emotion of the target object during the conversation and may include, but is not limited to, a happy emotion tag, a sad emotion tag, a calm emotion tag, or an angry emotion tag. Further, a target emotion tag represents the emotion of a target object within a particular voice segment, not the emotion of the target object across the whole voice data to be recognized; that is, each voice segment in the voice data to be recognized has its own target emotion tag for the corresponding target object. The target object and the target emotion label in this embodiment may be determined by manual labeling, or by the following steps.
In one embodiment, step S20 includes:
performing role recognition on the voice data to be recognized, and determining all target objects in the voice data to be recognized;
it can be understood that the role recognition refers to a method for recognizing different speaking objects in the speech data to be recognized, and the voiceprint features of each different speaking object are different, so that all target objects can be determined according to the voiceprint features in the speech data to be recognized.
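For concreteness, the following is a minimal sketch of how voiceprint-based role recognition could be carried out by clustering voiceprint embeddings. The embedding extractor extract_voiceprint and the distance threshold are hypothetical placeholders, and the use of scikit-learn agglomerative clustering is an assumption; the patent does not prescribe a specific voiceprint model or clustering method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def identify_target_objects(speech_windows, extract_voiceprint, distance_threshold=0.7):
    """Group short speech windows by speaker using voiceprint embeddings.

    speech_windows: list of waveform arrays cut from the voice data to be recognized.
    extract_voiceprint: callable mapping a waveform to an embedding vector
        (hypothetical placeholder for any voiceprint model).
    Returns one cluster id per window; each cluster id stands for one target object.
    """
    embeddings = np.stack([extract_voiceprint(w) for w in speech_windows])
    # Cosine-distance agglomerative clustering; the number of speakers is not
    # known in advance, so a distance threshold is used instead of n_clusters.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(embeddings)
```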
Dividing the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
it can be understood that after the voice data to be recognized is subjected to role recognition and all target objects in the voice data to be recognized are determined, the voice data to be recognized can be divided according to each target object, that is, the voice data to be recognized is divided into a plurality of voice data segments, and each voice data segment is dialog data of one target object; in the voice data to be recognized, one target object corresponds to at least one voice data segment, that is, one target object may correspond to one voice data segment or a plurality of voice data segments.
Performing voice emotion recognition on each voice data fragment to obtain a target emotion label corresponding to each voice data fragment;
as can be appreciated, speech emotion recognition is a method for determining the emotion of a conversation of a target object in a speech data segment.
In an embodiment, the performing speech emotion recognition on each of the speech data segments to obtain a target emotion tag corresponding to each of the speech data segments includes:
performing voice preprocessing on the voice data segment to obtain voice preprocessing characteristics of the voice data segment after preprocessing;
it is understood that the voice preprocessing refers to a process of eliminating other voices (such as noise and background voice) in the voice data segment except for the voice of the target object, and thus the accuracy in extracting the voice preprocessing features of the preprocessed voice data segment is high. The voice preprocessing characteristic is the voice characteristic of the target object in the voice data segment after preprocessing.
Carrying out endpoint detection and voice filtering processing on the voice data fragments to obtain voice data characteristics corresponding to each voice data fragment;
it is understood that endpoint detection is a method for detecting a start time point and an end time point of a dialogue vocal voice of a target object in a voice data section. The voice filtering process is a method for filtering noise, silence, and other sounds other than the target object's dialogue personal voice in the voice data segment. The voice data characteristics are the voice characteristics of the target object in the voice data segment after the endpoint detection and the voice filtering processing.
Performing feature fusion on the voice preprocessing features and the voice data features to obtain voice fusion features, and performing feature dimensionality reduction on the voice fusion features to obtain voice emotion features;
the voice pre-processing characteristics and the voice data characteristics obtained by the different voice processing modes are subjected to characteristic fusion, so that the voice characteristic information in the voice data to be recognized can be reflected more accurately, and the accuracy of emotion recognition is improved. After the feature dimensionality reduction is carried out on the voice fusion features, the emotion change features in the voice fusion features can be better displayed, and the accuracy of emotion recognition is further improved.
And determining a target emotion label corresponding to each voice data segment according to the voice emotion characteristics.
Specifically, after the voice fusion features are subjected to feature dimensionality reduction to obtain voice emotion features, the voice emotion features can be input into a trained emotion recognition model, and then the emotion recognition model outputs target emotion labels corresponding to the voice data fragments.
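A minimal sketch of this per-segment labelling step follows. The feature extractors preprocess_features and endpoint_features, the use of PCA for the dimensionality reduction, and the scikit-learn-style emotion_model are all illustrative assumptions; the patent does not fix particular algorithms for these steps.

```python
import numpy as np
from sklearn.decomposition import PCA

def label_segment(segment, preprocess_features, endpoint_features, pca, emotion_model):
    """Produce a target emotion label for one voice data segment.

    preprocess_features / endpoint_features: hypothetical callables returning the
        voice preprocessing features and the endpoint-detection/filtering features.
    pca: a fitted PCA instance standing in for the feature dimensionality
        reduction step (an assumption; no specific method is named).
    emotion_model: a trained emotion recognition classifier with a predict() method.
    """
    f_pre = preprocess_features(segment)                 # voice preprocessing features
    f_data = endpoint_features(segment)                  # voice data features
    fused = np.concatenate([f_pre, f_data])              # feature fusion
    emotion_feat = pca.transform(fused.reshape(1, -1))   # dimensionality reduction
    return emotion_model.predict(emotion_feat)[0]        # target emotion label
```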
And generating the target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
Specifically, after the target emotion tag corresponding to each of the voice data segments is determined according to the voice emotion feature, a target emotion recognition result is generated according to the target object and the target emotion tag corresponding to each of the voice data segments, that is, the target emotion recognition result includes the target object and the target emotion tag corresponding to each of the voice data segments.
S30: inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
it is to be understood that the preset classification model may be a model constructed based on a deep learning network and a classifier, and the preset classification model is used for performing emotion tracking recognition on the speech data to be recognized. The emotion tracking and recognition means that in the process of recognizing the voice data to be recognized, emotion in the voice data to be recognized is tracked and recognized so as to improve the accuracy of emotion recognition. The prediction object refers to all interlocutors in the voice data to be recognized which are recognized when emotion tracking recognition is performed through a preset classification model. The predicted emotion labels characterize the emotion of each predicted object when talking.
In an embodiment, as shown in fig. 3, in step S30, that is, performing emotion tracking recognition on the speech data to be recognized to obtain a predicted emotion recognition result corresponding to the speech data to be recognized, the method includes:
s301: performing role recognition on the voice data to be recognized, and determining a prediction object in the voice data to be recognized;
It can be understood that role recognition refers to a method for recognizing the different speaking objects in the speech data to be recognized; because each speaking object has distinct voiceprint features, the preset classification model can determine all prediction objects from the voiceprint features in the speech data to be recognized. Further, the prediction objects and the target objects in this embodiment are both dialogue participants in the speech data to be recognized. The target objects determined in step S20 are accurate, whereas the prediction objects determined here may be correct or wrong. Therefore, when the prediction object determined by the preset classification model for a given speech segment differs from the corresponding target object, the initial parameters of the preset classification model can be adjusted according to that difference, so that the model learns to accurately distinguish the voiceprint features of different objects.
S302: dividing the voice data to be recognized according to the prediction objects to obtain a voice sequence containing voice segments to be recognized corresponding to the prediction objects, wherein the voice segments to be recognized in the voice sequence are arranged according to the time sequence of the voice segments to be recognized in the voice data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
it can be understood that after the role recognition is performed on the voice data to be recognized and the prediction object in the voice data to be recognized is determined, the voice data to be recognized can be divided according to each prediction object, that is, the voice data to be recognized is divided into a plurality of voice segments to be recognized, and each voice segment to be recognized is dialog data of one prediction object; in the whole voice data to be recognized, the voice data to be recognized may be composed of multiple rounds of dialogue voice data of multiple prediction objects, so that in the voice data to be recognized, one prediction object corresponds to at least one voice segment to be recognized, that is, one prediction object may correspond to one voice segment to be recognized, and may also correspond to multiple voice segments to be recognized.
Further, after the voice data to be recognized is divided into a plurality of voice segments to be recognized by each prediction object, the voice segments to be recognized are arranged according to the time sequence in the voice data to be recognized, and a voice sequence is formed. It can be understood that each speech segment to be recognized is divided from the speech data to be recognized, so that each speech segment to be recognized has a temporal sequence in the speech data to be recognized, that is, a sequence of dialog generation, and further, each speech segment to be recognized can be arranged according to the temporal sequence in the speech data to be recognized to form a speech sequence.
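As a small illustration of how the speech sequence could be represented, the sketch below simply orders the segments by their start time; the SpeechSegment fields are hypothetical, chosen only to make the time ordering explicit.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SpeechSegment:
    speaker_id: int        # the prediction object this segment belongs to
    start_time: float      # where the segment starts in the voice data to be recognized
    end_time: float
    waveform: np.ndarray   # the audio of the speech segment to be recognized

def build_speech_sequence(segments: List[SpeechSegment]) -> List[SpeechSegment]:
    # Arrange the speech segments to be recognized in the order in which the
    # dialogue occurred; the emotion tracking step walks this sequence.
    return sorted(segments, key=lambda s: s.start_time)
```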
S303: performing initial emotion recognition on each voice segment to be recognized to obtain initial voice emotion corresponding to each voice segment to be recognized;
it is understood that the initial emotion recognition is a method for determining the dialogue emotion of the predicted object in each speech segment to be recognized. Specifically, voice preprocessing is carried out on a voice segment to be recognized, and a first feature to be recognized of the voice segment to be recognized after preprocessing is obtained; carrying out endpoint detection and voice filtering processing on voice segments to be recognized to obtain second features to be recognized corresponding to the voice segments to be recognized; and performing feature fusion on the first feature to be recognized and the second feature to be recognized to obtain a feature to be recognized and fused, performing feature dimensionality reduction on the feature to be recognized and fused to obtain an emotion feature to be recognized, and determining an initial voice emotion according to the emotion feature to be recognized.
S304: acquiring initial voice emotions corresponding to two voice fragments to be recognized which are adjacent in the voice sequence and correspond to the same prediction object, and determining emotion change characteristics according to the two acquired initial voice emotions;
Understandably, the emotion change feature represents the relationship between the initial speech emotions of two speech segments to be recognized that are adjacent in the speech sequence and correspond to the same prediction object. For example, the emotion change feature may indicate that the two initial speech emotions are unchanged (e.g., both are happy emotions), or that the two initial speech emotions differ (e.g., one is a happy emotion and the other is a sad emotion).
In one embodiment, step S304 includes:
comparing the emotion of the two obtained initial voice emotions to determine an emotion comparison result;
As can be understood, the emotion comparison determines whether the two acquired initial speech emotions are the same emotion, yielding an emotion comparison result. Further, if the initial speech emotions are more fine-grained (for example, the happy category covers emotions such as a laughing emotion and a joyful emotion), the comparison may instead judge whether the two initial speech emotions belong to the same category, and the emotion comparison result is determined accordingly.
When the emotion comparison result is the same emotion result, determining the emotion change characteristic as a tracking emotion characteristic; the same emotion result represents that the obtained two initial voice emotions are the same;
It can be understood that, after the two acquired initial speech emotions have been compared, if they are the same (or, when fine-grained emotions are used as described above, if they belong to the same category), the emotion comparison result is determined to be the same emotion result, and the emotion change feature is then determined to be the tracking emotion feature; that is, emotion tracking needs to be performed on the speech segments to be recognized corresponding to the two initial speech emotions.
When the emotion comparison result is different emotion results, determining that the emotion change characteristic is a single emotion characteristic; the different emotion result represents that the two acquired initial speech emotions are different.
Likewise, after the two acquired initial speech emotions have been compared, if they are different (or, when fine-grained emotions are used, if they belong to different categories), the emotion comparison result is determined to be a different emotion result, and the emotion change feature is then determined to be a single emotion feature; that is, no emotion tracking needs to be performed on the speech segments to be recognized corresponding to the two initial speech emotions.
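The comparison step can be sketched as below. The coarse emotion categories and the string constants are hypothetical placeholders used only to show how same-category emotions map to the tracking emotion feature and different ones to the single emotion feature.

```python
# Hypothetical coarse emotion categories; the description only requires that
# fine-grained emotions (e.g. "laughing", "joyful") can be mapped to a category.
EMOTION_CATEGORY = {
    "laughing": "happy", "joyful": "happy", "happy": "happy",
    "sad": "sad", "calm": "calm", "angry": "angry",
}

TRACKING = "tracking_emotion_feature"
SINGLE = "single_emotion_feature"

def emotion_change_feature(prev_emotion: str, next_emotion: str) -> str:
    """Compare the initial speech emotions of two adjacent segments of the same
    prediction object and return the emotion change feature."""
    same = (EMOTION_CATEGORY.get(prev_emotion, prev_emotion)
            == EMOTION_CATEGORY.get(next_emotion, next_emotion))
    return TRACKING if same else SINGLE
```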
S305: determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change characteristics;
specifically, after emotion change characteristics are determined according to the two acquired initial voice emotions, prediction emotion labels corresponding to the voice segments to be recognized are determined according to the voice segments to be recognized, the initial voice emotions and the emotion change characteristics.
In one embodiment, step S305 includes:
determining a predicted emotion label corresponding to a first voice segment according to an initial voice emotion corresponding to the first voice segment when the emotion change feature is the single emotion feature; the first voice segment is a voice segment to be recognized corresponding to the previous initial voice emotion in the two acquired initial voice emotions;
determining a predicted emotion label corresponding to the second voice segment according to the initial voice emotion corresponding to the second voice segment; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions.
It can be understood that after the emotion change feature is determined according to the two obtained initial speech emotions, if the emotion change feature is a single emotion feature, the predicted emotion tag corresponding to the first speech segment can be directly determined according to the initial speech emotion corresponding to the first speech segment, and the predicted emotion tag corresponding to the second speech segment can be directly determined according to the initial speech emotion corresponding to the second speech segment. And the sequence of the first voice fragment and the second voice fragment is the sequence of the dialog occurrence in the voice data to be recognized.
In an embodiment, step S305 further includes:
when the emotion change feature is the tracking emotion feature, performing voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment;
It can be understood that, when the emotion change feature is the tracking emotion feature, it indicates that the two acquired initial speech emotions are the same. Speech fusion therefore needs to be performed on the corresponding speech segments to be recognized, that is, the first speech segment and the second speech segment are fused into a speech fusion segment, so that speech segments sharing the same emotion can be tracked and the accuracy of recognizing that emotion is improved.
Performing voice emotion recognition on the voice fusion segment to obtain a predicted emotion label corresponding to the second voice segment;
as can be understood, after the first speech segment and the second speech segment are subjected to speech fusion to obtain a speech fusion segment, speech emotion recognition is performed on the speech fusion segment to determine a predicted emotion tag corresponding to the second speech segment. It will be appreciated that the predicted emotion tag may or may not be the same as the initial speech emotion.
And determining a predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
It can be understood that, since the time sequence of the second speech segment is after the first speech segment, only emotion tracking needs to be performed on the second speech segment according to the first speech segment, and for the first speech segment, the predicted emotion tag corresponding to the first speech segment is determined directly according to the initial speech emotion corresponding to the first speech segment.
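The following sketch ties the two cases together for one adjacent same-speaker pair. Modelling the speech fusion as simple waveform concatenation is an assumption, since the text does not specify how the fusion is performed; recognize_emotion stands for the same speech emotion recognition pipeline used for the initial emotions.

```python
import numpy as np

def predict_labels_for_pair(first_seg, second_seg, first_emotion, second_emotion,
                            recognize_emotion, change_feature):
    """Assign predicted emotion labels to two adjacent same-speaker segments.

    recognize_emotion: callable running speech emotion recognition on a waveform.
    change_feature: "tracking_emotion_feature" or "single_emotion_feature",
        produced by the comparison step above.
    """
    first_label = first_emotion  # the earlier segment keeps its initial speech emotion
    if change_feature == "tracking_emotion_feature":
        fused = np.concatenate([first_seg, second_seg])  # speech fusion segment
        second_label = recognize_emotion(fused)          # may differ from second_emotion
    else:
        second_label = second_emotion
    return first_label, second_label
```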
S306: and generating the predicted emotion recognition result according to the predicted object and the predicted emotion label corresponding to each voice segment to be recognized.
Specifically, after the predicted emotion tag corresponding to each to-be-recognized voice segment is determined according to the to-be-recognized voice segment, the initial voice emotion and the emotion change feature, a predicted emotion recognition result is generated according to the predicted object and the predicted emotion tag corresponding to each to-be-recognized voice segment, that is, the predicted emotion recognition result includes the predicted object and the predicted emotion tag associated with each to-be-recognized voice segment.
S40: determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
the predicted loss value is the loss generated in the emotion tracking recognition process of the speech data to be recognized by the preset classification model.
In one embodiment, step S40 includes:
when the emotion change characteristic is a tracking emotion characteristic, acquiring an initial voice emotion corresponding to the second voice segment and a predicted emotion label; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions;
determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
It can be understood that, if the emotion change feature is the tracking emotion feature, the initial speech emotions of two adjacent speech segments to be recognized that correspond to the same prediction object are the same. In that case the first speech segment and the second speech segment are fused, and speech emotion recognition is performed on the resulting speech fusion segment to generate the predicted emotion tag of the second speech segment, which may or may not match the initial speech emotion. Therefore, when the emotion change feature is the tracking emotion feature, an emotion recognition loss value exists between the initial speech emotion of the second speech segment and its predicted emotion tag.
And determining the prediction loss value according to the emotion recognition loss value, the target object, the target emotion label, the prediction object and the prediction emotion label.
It can be understood that after the target object, the target emotion label, the predicted object and the predicted emotion label are determined, an object loss value between each target object and each predicted object is determined, a label loss value between each target emotion label and each predicted emotion label is determined, and then the predicted loss value is determined through a preset loss model according to the emotion recognition loss value, the object loss value and the label loss value. The preset loss model may be a model constructed based on a cross entropy loss function, a model constructed based on a maximum likelihood function, or the like.
Further, the object loss value and the label loss value are determined as follows. The voice data segments associated with the target objects and target emotion labels are arranged in time order, and the predicted object and predicted emotion label of each speech segment to be recognized are compared with the target object and target emotion label of the voice data segment in the same position of that order. That is, the target object and target emotion label of the first voice data segment are compared with the predicted object and predicted emotion label of the first speech segment to be recognized, an object loss value between the target object and the predicted object is determined, and a label loss value between the target emotion label and the predicted emotion label is determined; the comparison then moves on to the second voice data segment and the second speech segment to be recognized, and so on, until all voice data segments have been compared with their speech segments to be recognized and the prediction loss value is determined.
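A minimal sketch of such a preset loss model is given below. The description allows a cross-entropy-based loss; the PyTorch formulation, the equal weighting of the three terms, and the tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def prediction_loss(obj_logits, obj_targets,
                    label_logits, label_targets,
                    tracked_logits, tracked_initial_emotions,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the object loss, label loss, and emotion recognition loss.

    obj_logits / label_logits: model outputs for each speech segment, aligned in
        time order with the target segments (obj_targets / label_targets).
    tracked_logits: outputs for the fused segments when the emotion change feature
        is the tracking emotion feature; tracked_initial_emotions are the initial
        speech emotions of the corresponding second segments.
    """
    object_loss = F.cross_entropy(obj_logits, obj_targets)
    label_loss = F.cross_entropy(label_logits, label_targets)
    emotion_recognition_loss = (
        F.cross_entropy(tracked_logits, tracked_initial_emotions)
        if tracked_logits is not None and len(tracked_logits) > 0
        else torch.tensor(0.0)
    )
    w1, w2, w3 = weights
    return w1 * object_loss + w2 * label_loss + w3 * emotion_recognition_loss
```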
S50: and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
It is understood that the convergence condition may be that the prediction loss value is smaller than a set threshold, in which case training stops once the loss falls below that threshold; alternatively, the convergence condition may be that the prediction loss value remains small and no longer decreases after 10,000 iterations, in which case training stops at that point. The converged preset classification model is then recorded as the speech emotion classification model.
Further, after the prediction loss value of the preset classification model is determined from the target object, the target emotion label, the prediction object and the prediction emotion label, if the prediction loss value does not reach the preset convergence condition, the initial parameters of the preset classification model are adjusted according to the prediction loss value and the voice data to be recognized is fed into the adjusted model again. Once the prediction loss value for that voice data reaches the preset convergence condition, another piece of voice data to be recognized is selected from the preset voice training set and steps S20 to S40 are executed to obtain its prediction loss value; if that loss value does not reach the preset convergence condition, the initial parameters are adjusted again according to it until it does.
In this way, after the preset classification model has been trained on all the voice data to be recognized in the preset voice training set, its outputs are drawn ever closer to the accurate results and its recognition accuracy keeps improving; when the prediction loss values of all the voice data to be recognized reach the preset convergence condition, the converged preset classification model is recorded as the speech emotion classification model.
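The overall training loop could look like the sketch below. It uses only the threshold form of the convergence condition and adds a max_epochs safety bound; both of these, along with the PyTorch-style optimizer calls, are assumptions, and compute_loss stands for running steps S20 to S40 on one piece of voice data.

```python
def train_speech_emotion_model(model, optimizer, training_set, compute_loss,
                               loss_threshold=1e-3, max_epochs=100):
    """Iteratively update the initial parameters until every piece of voice data
    to be recognized in the preset voice training set meets the convergence
    condition (here: prediction loss below a set threshold)."""
    for _ in range(max_epochs):
        all_converged = True
        for voice_data in training_set:
            loss = compute_loss(model, voice_data)   # steps S20-S40 for this sample
            if loss.item() >= loss_threshold:
                all_converged = False
                optimizer.zero_grad()
                loss.backward()                      # adjust the initial parameters
                optimizer.step()
        if all_converged:
            break
    return model  # the converged model is recorded as the speech emotion classification model
```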
In this embodiment, when emotion recognition is performed on the voice data to be recognized through the preset classification model, the same emotion in the voice data to be recognized is tracked by the emotion tracking recognition method, so that speech segments sharing the same emotion are fused and judged together, a more accurate predicted emotion label is generated, and the accuracy of emotion recognition is further improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a speech emotion classification model training device is provided, and the speech emotion classification model training device corresponds to the speech emotion classification model training method in the above embodiments one to one. As shown in fig. 4, the speech emotion classification model training apparatus includes a speech training set acquisition module 10, a target emotion recognition result determination module 20, a predicted emotion recognition result determination module 30, a prediction loss value determination module 40, and a model update training module 50. The functional modules are explained in detail as follows:
a voice training set obtaining module 10, configured to obtain a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
a target emotion recognition result determining module 20, configured to determine a target emotion recognition result for each piece of the to-be-recognized speech data; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module 30 is configured to input the voice data to be recognized into a preset classification model including initial parameters, and perform emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
a prediction loss value determination module 40, configured to determine a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object, and the prediction emotion label;
and the model update training module 50 is configured to iteratively update the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, until the prediction loss value reaches the convergence condition, and then record the converged preset classification model as the speech emotion classification model.
Preferably, the target emotion recognition result determination module 20 includes:
the target object determining unit is used for carrying out role recognition on the voice data to be recognized and determining all target objects in the voice data to be recognized;
a voice data segment dividing unit, configured to divide the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
the target emotion tag determining unit is used for carrying out voice emotion recognition on each voice data segment to obtain a target emotion tag corresponding to each voice data segment;
and the target emotion recognition result determining unit is used for generating a target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
Preferably, as shown in fig. 5, the predicted emotion recognition result determination module 30 includes:
a prediction object determining unit 301, configured to perform role recognition on the voice data to be recognized, and determine a prediction object in the voice data to be recognized;
a speech sequence determining unit 302, configured to divide the speech data to be recognized according to each of the prediction objects to obtain a speech sequence including speech segments to be recognized corresponding to each of the prediction objects, where each of the speech segments to be recognized in the speech sequence is arranged according to a time sequence of the speech segment to be recognized in the speech data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
an initial emotion recognition unit 303, configured to perform initial emotion recognition on each to-be-recognized speech segment to obtain an initial speech emotion corresponding to each to-be-recognized speech segment;
an emotion change feature determination unit 304, configured to acquire initial speech emotions corresponding to two to-be-recognized speech segments that are adjacent to each other in the speech sequence and correspond to the same prediction object, and determine an emotion change feature according to the acquired two initial speech emotions;
a predicted emotion tag determination unit 305, configured to determine, according to the to-be-recognized speech segments, the initial speech emotion, and the emotion change feature, a predicted emotion tag corresponding to each of the to-be-recognized speech segments;
a predicted emotion recognition result determining unit 306, configured to generate the predicted emotion recognition result according to the prediction object and the predicted emotion tag corresponding to each to-be-recognized speech segment.
Preferably, the emotion change feature determination unit 304 includes:
the emotion comparison subunit is used for performing emotion comparison on the two acquired initial voice emotions and determining an emotion comparison result;
the first emotion change subunit is used for determining the emotion change characteristic as a tracking emotion characteristic when the emotion comparison result is the same emotion result; the same emotion result represents that the obtained two initial voice emotions are the same;
the second emotion change subunit is used for determining the emotion change characteristics as the individual emotion characteristics when the emotion comparison results are different emotion results; the different emotion result represents that the two acquired initial speech emotions are different.
Preferably, the predicted emotion label determination unit 305 includes:
a first predicted emotion tag determination subunit, configured to determine, when the emotion change feature is the single emotion feature, a predicted emotion tag corresponding to the first speech segment according to an initial speech emotion corresponding to the first speech segment; the first voice segment is a voice segment to be recognized corresponding to the previous initial voice emotion in the two acquired initial voice emotions;
a second predicted emotion tag determination subunit, configured to determine, according to the initial speech emotion corresponding to the second speech segment, a predicted emotion tag corresponding to the second speech segment; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions.
Preferably, the predicted emotion label determination unit 305 further includes:
a voice fusion subunit, configured to perform voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment when the emotion change feature is the emotion tracking feature;
a third predicted emotion label determining subunit, configured to perform speech emotion recognition on the speech fusion segment to obtain a predicted emotion label corresponding to the second speech segment;
and the fourth predicted emotion label determining subunit is used for determining the predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
Preferably, the predicted loss value determining module 40 includes:
the data acquisition unit is used for acquiring an initial voice emotion corresponding to the second voice segment and a predicted emotion label when the emotion change characteristic is a tracking emotion characteristic; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions;
the emotion recognition loss value determining unit is used for determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
and the prediction loss value determining unit is used for determining the prediction loss value through a preset loss model according to the emotion recognition loss value, the target object, the target emotion label, the prediction object and the prediction emotion label.
For specific limitations of the speech emotion classification model training device, reference may be made to the above limitations of the speech emotion classification model training method, which is not described herein again. All or part of the modules in the speech emotion classification model training device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the data used in the training method of the speech emotion classification model in the above embodiment. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech emotion classification model training method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the speech emotion classification model training method in the above embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the speech emotion classification model training method in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A speech emotion classification model training method is characterized by comprising the following steps:
acquiring a preset voice training set; the preset voice training set comprises at least one piece of voice data to be recognized;
determining a target emotion recognition result of each piece of voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a predicted emotion label corresponding to the prediction object;
determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label;
and when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
2. The method for training the speech emotion classification model according to claim 1, wherein the determining the target emotion recognition result of each of the speech data to be recognized includes:
performing role recognition on the voice data to be recognized, and determining all target objects in the voice data to be recognized;
dividing the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
performing voice emotion recognition on each voice data fragment to obtain a target emotion label corresponding to each voice data fragment;
and generating the target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
3. The method for training the speech emotion classification model according to claim 1, wherein the performing emotion tracking recognition on the voice data to be recognized to obtain a predicted emotion recognition result corresponding to the voice data to be recognized includes:
performing role recognition on the voice data to be recognized, and determining a prediction object in the voice data to be recognized;
dividing the voice data to be recognized according to the prediction objects to obtain a voice sequence containing voice segments to be recognized corresponding to the prediction objects, wherein the voice segments to be recognized in the voice sequence are arranged according to the time sequence of the voice segments to be recognized in the voice data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
performing initial emotion recognition on each voice segment to be recognized to obtain initial voice emotion corresponding to each voice segment to be recognized;
acquiring the initial voice emotions corresponding to two voice segments to be recognized which are adjacent in the voice sequence and correspond to the same prediction object, and determining an emotion change feature according to the two acquired initial voice emotions;
determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature;
and generating the predicted emotion recognition result according to the predicted object and the predicted emotion label corresponding to each voice segment to be recognized.
4. The method for training the speech emotion classification model according to claim 3, wherein the determining an emotion change feature according to the two acquired initial voice emotions comprises:
comparing the two acquired initial voice emotions to determine an emotion comparison result;
when the emotion comparison result is a same emotion result, determining that the emotion change feature is a tracking emotion feature; the same emotion result represents that the two acquired initial voice emotions are the same;
when the emotion comparison result is a different emotion result, determining that the emotion change feature is a single emotion feature; the different emotion result represents that the two acquired initial voice emotions are different.
5. The method for training the speech emotion classification model as claimed in claim 4, wherein the determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature comprises:
when the emotion change feature is the single emotion feature, determining a predicted emotion label corresponding to a first voice segment according to the initial voice emotion corresponding to the first voice segment; the first voice segment is the voice segment to be recognized corresponding to the former of the two acquired initial voice emotions;
determining a predicted emotion label corresponding to a second voice segment according to the initial voice emotion corresponding to the second voice segment; the second voice segment is the voice segment to be recognized corresponding to the latter of the two acquired initial voice emotions.
6. The method for training the speech emotion classification model as claimed in claim 5, wherein the determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature comprises:
when the emotion change feature is the tracking emotion feature, performing voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment;
performing voice emotion recognition on the voice fusion segment to obtain a predicted emotion label corresponding to the second voice segment;
and determining a predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
7. The method for training the speech emotion classification model according to claim 4, wherein the determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label comprises:
when the emotion change feature is the tracking emotion feature, acquiring the initial voice emotion and the predicted emotion label corresponding to a second voice segment; the second voice segment is the voice segment to be recognized corresponding to the latter of the two acquired initial voice emotions;
determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
and determining the predicted loss value through a preset loss model according to the emotion recognition loss value, the target object, the target emotion label, the predicted object and the predicted emotion label.
8. A speech emotion classification model training device is characterized by comprising:
the voice training set acquisition module is used for acquiring a preset voice training set; the preset voice training set comprises at least one piece of voice data to be recognized;
the target emotion recognition result determining module is used for determining a target emotion recognition result of each piece of voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module is used for inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a predicted emotion label corresponding to the prediction object;
the prediction loss value determining module is used for determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label;
and the model updating training module is used for iteratively updating the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, and recording the converged preset classification model as a speech emotion classification model when the prediction loss value reaches the convergence condition.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech emotion classification model training method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech emotion classification model training method according to any one of claims 1 to 7.
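For readers who want a concrete picture of the training loop recited in claim 1 above, the following minimal sketch shows one plausible realization in Python with PyTorch. It is not the patent's implementation: the model architecture, the cross-entropy loss, the Adam optimizer, the convergence threshold epsilon and the names SpeechEmotionClassifier and train_until_convergence are illustrative assumptions.

# Minimal sketch of the claim-1 training loop (assumed PyTorch realization):
# a preset classification model with initial parameters is updated iteratively
# until the prediction loss value satisfies an assumed convergence condition.
import torch
import torch.nn as nn

class SpeechEmotionClassifier(nn.Module):
    """Hypothetical preset classification model: acoustic features -> emotion logits."""
    def __init__(self, feature_dim=40, num_emotions=6):
        super().__init__()
        self.encoder = nn.GRU(feature_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_emotions)

    def forward(self, features):
        _, hidden = self.encoder(features)      # hidden: (1, batch, 64)
        return self.head(hidden.squeeze(0))     # logits: (batch, num_emotions)

def train_until_convergence(model, batches, epsilon=1e-3, max_steps=1000):
    """Iteratively update the initial parameters until the change in the prediction
    loss value falls below epsilon (the assumed preset convergence condition)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_loss = float("inf")
    for step, (features, target_labels) in enumerate(batches):
        logits = model(features)                 # predicted emotion recognition result
        loss = criterion(logits, target_labels)  # prediction loss value
        if step >= max_steps or abs(prev_loss - loss.item()) < epsilon:
            break                                # convergence condition reached
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prev_loss = loss.item()
    return model  # the converged model is recorded as the speech emotion classification model

In practice, batches would yield per-utterance feature tensors together with the target emotion labels produced as in claim 2; here it is simply any iterable of (features, labels) pairs.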
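Claim 2 builds the target emotion recognition result from role recognition followed by per-segment speech emotion recognition. The sketch below (plain Python, no external libraries) shows one way the bookkeeping could look; recognize_roles and recognize_emotion are hypothetical placeholders standing in for a real diarization model and a real emotion classifier, and the hard-coded segments exist only so the example runs.

from dataclasses import dataclass

@dataclass
class Segment:
    target_object: str    # speaker found by role recognition
    start: float          # segment boundaries in seconds
    end: float
    emotion: str = ""     # target emotion label, filled in below

def recognize_roles(voice_data):
    """Placeholder role recognition: split the voice data into per-object segments."""
    return [Segment("agent", 0.0, 3.2), Segment("customer", 3.2, 7.5),
            Segment("customer", 7.5, 9.0)]

def recognize_emotion(voice_data, segment):
    """Placeholder speech emotion recognition on one voice data segment."""
    return "neutral"

def build_target_result(voice_data):
    """Target emotion recognition result: each target object mapped to its emotion labels."""
    segments = recognize_roles(voice_data)
    for segment in segments:
        segment.emotion = recognize_emotion(voice_data, segment)
    result = {}
    for segment in segments:
        result.setdefault(segment.target_object, []).append(segment.emotion)
    return result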
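Claims 3 and 4 order the segments of one utterance by time and, for two adjacent segments of the same prediction object, compare their initial voice emotions to obtain the emotion change feature. A small, dependency-free sketch of that comparison follows; the tuple layout and the "tracking"/"single" strings are assumptions chosen to mirror the claim wording.

from collections import namedtuple

# One speech segment to be recognized in the time-ordered voice sequence.
Piece = namedtuple("Piece", "speaker initial_emotion audio")

def emotion_change_feature(prev_emotion, curr_emotion):
    """Claim 4: same initial voice emotions -> tracking emotion feature,
    different initial voice emotions -> single emotion feature."""
    return "tracking" if prev_emotion == curr_emotion else "single"

def adjacent_same_object_pairs(sequence):
    """Pairs of adjacent segments that belong to the same prediction object,
    each paired with its emotion change feature."""
    pairs = []
    for prev, curr in zip(sequence, sequence[1:]):
        if prev.speaker == curr.speaker:
            pairs.append((prev, curr,
                          emotion_change_feature(prev.initial_emotion, curr.initial_emotion)))
    return pairs

# Example: the customer speaks twice in a row with the same initial emotion,
# so that pair carries the tracking emotion feature.
sequence = [Piece("agent", "neutral", b""),
            Piece("customer", "angry", b""),
            Piece("customer", "angry", b"")]
print(adjacent_same_object_pairs(sequence))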
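Claims 5 and 6 then assign predicted emotion labels depending on that feature: with the single emotion feature each segment keeps its initial voice emotion, while with the tracking emotion feature the two segments are fused and the fusion is re-classified to obtain the second segment's label. The sketch below reuses the same Piece tuple as above; fuse_voice and classify_emotion are hypothetical placeholders (naive concatenation and a constant label), not the patent's actual fusion or recognizer.

from collections import namedtuple

Piece = namedtuple("Piece", "speaker initial_emotion audio")

def fuse_voice(first_audio, second_audio):
    """Placeholder voice fusion: naive concatenation of the two waveforms."""
    return first_audio + second_audio

def classify_emotion(audio):
    """Placeholder speech emotion recognition on a (possibly fused) segment."""
    return "angry"

def predicted_labels(first, second, change_feature):
    """Claim 5: single emotion feature -> both segments keep their initial voice emotion.
    Claim 6: tracking emotion feature -> the second label comes from the fused segment,
    while the first keeps its initial voice emotion."""
    if change_feature == "single":
        return first.initial_emotion, second.initial_emotion
    fused = fuse_voice(first.audio, second.audio)
    return first.initial_emotion, classify_emotion(fused)

# Example call for a tracking pair of two customer segments.
first = Piece("customer", "angry", b"\x01")
second = Piece("customer", "angry", b"\x02")
print(predicted_labels(first, second, "tracking"))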
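Claim 7 folds an extra emotion recognition loss, computed for tracking pairs from the second segment's initial voice emotion and its predicted emotion label, into the overall prediction loss value. The toy functions below only illustrate that combination; the 0/1 disagreement losses and the weight alpha are assumptions standing in for whatever preset loss model the patent actually uses.

def emotion_recognition_loss(initial_emotion, predicted_label):
    """0/1 disagreement between the second segment's initial emotion and its predicted label."""
    return 0.0 if initial_emotion == predicted_label else 1.0

def label_loss(target_labels, predicted_labels):
    """Fraction of segments whose predicted emotion label differs from the target emotion label."""
    wrong = sum(t != p for t, p in zip(target_labels, predicted_labels))
    return wrong / max(len(target_labels), 1)

def prediction_loss(target_labels, predicted_labels, tracking_pairs, alpha=0.5):
    """Assumed preset loss model: label-level loss plus alpha times the averaged
    emotion recognition loss over all tracking pairs."""
    tracking = sum(emotion_recognition_loss(e, p) for e, p in tracking_pairs)
    tracking = tracking / max(len(tracking_pairs), 1)
    return label_loss(target_labels, predicted_labels) + alpha * tracking

# Example: two of three labels correct, plus one tracking pair whose labels disagree.
print(prediction_loss(["angry", "neutral", "happy"],
                      ["angry", "neutral", "sad"],
                      [("angry", "neutral")]))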
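Finally, the apparatus of claim 8 groups the same five steps into modules. One way to picture that decomposition is a thin trainer class whose methods mirror the modules; the class and method names below are illustrative only and not part of the patent.

class SpeechEmotionTrainer:
    """Hypothetical grouping of the five modules recited in claim 8."""

    def acquire_training_set(self):
        """Voice training set acquisition module."""
        ...

    def determine_target_results(self, voice_data):
        """Target emotion recognition result determining module."""
        ...

    def predict_results(self, voice_data, model):
        """Predicted emotion recognition result determining module."""
        ...

    def compute_prediction_loss(self, target_result, predicted_result):
        """Prediction loss value determining module."""
        ...

    def update_until_convergence(self, model):
        """Model updating training module: iterate until the convergence condition is met."""
        ...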
CN202110836890.XA 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium Active CN113571096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110836890.XA CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110836890.XA CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN113571096A true CN113571096A (en) 2021-10-29
CN113571096B CN113571096B (en) 2023-04-07

Family

ID=78166832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110836890.XA Active CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113571096B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
US20200250278A1 (en) * 2019-02-05 2020-08-06 International Business Machines Corporation Method for Fine-Grained Affective States Understanding and Prediction
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN113051910A (en) * 2021-03-19 2021-06-29 上海森宇文化传媒股份有限公司 Method and device for predicting emotion of character role

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
US20200250278A1 (en) * 2019-02-05 2020-08-06 International Business Machines Corporation Method for Fine-Grained Affective States Understanding and Prediction
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN113051910A (en) * 2021-03-19 2021-06-29 上海森宇文化传媒股份有限公司 Method and device for predicting emotion of character role

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114155882B (en) * 2021-11-30 2023-08-22 浙江大学 Method and device for judging emotion of road anger based on voice recognition
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Also Published As

Publication number Publication date
CN113571096B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN110162633B (en) Voice data intention determining method and device, computer equipment and storage medium
CN109729383B (en) Double-recording video quality detection method and device, computer equipment and storage medium
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
CN109472213B (en) Palm print recognition method and device, computer equipment and storage medium
US10885920B2 (en) Method and system for separating and authenticating speech of a speaker on an audio stream of speakers
CN113571096B (en) Speech emotion classification model training method and device, computer equipment and medium
CN111475616B (en) Multi-round dialogue method and device based on dialogue state prediction and computer equipment
CN110689881B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN106847305B (en) Method and device for processing recording data of customer service telephone
CN110930989B (en) Speech intention recognition method and device, computer equipment and storage medium
CN110890088B (en) Voice information feedback method and device, computer equipment and storage medium
KR20200004826A (en) Voice conversation based context acquisition method and device
CN108831481A (en) Symbol adding method, device, computer equipment and storage medium in speech recognition
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN109815489A (en) Collection information generating method, device, computer equipment and storage medium
CN110517673B (en) Speech recognition method, device, computer equipment and storage medium
CN111797632A (en) Information processing method and device and electronic equipment
CN111209380B (en) Control method and device for conversation robot, computer equipment and storage medium
CN109065026B (en) Recording control method and device
CN114493902A (en) Multi-mode information anomaly monitoring method and device, computer equipment and storage medium
CN112395857A (en) Voice text processing method, device, equipment and medium based on dialog system
CN111883109A (en) Voice information processing and verification model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant