CN113571096A - Speech emotion classification model training method and device, computer equipment and medium - Google Patents


Info

Publication number
CN113571096A
CN113571096A (application number CN202110836890.XA)
Authority
CN
China
Prior art keywords
emotion
voice
recognized
speech
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110836890.XA
Other languages
Chinese (zh)
Other versions
CN113571096B (en)
Inventor
张超
魏韬
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110836890.XA
Publication of CN113571096A
Application granted
Publication of CN113571096B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion classification model training method and apparatus, a computer device, and a storage medium. The method comprises: determining a target emotion recognition result for each piece of voice data to be recognized, the target emotion recognition result comprising a target object and a target emotion label; inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result that corresponds to the voice data to be recognized and comprises a predicted object and a predicted emotion label; determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the predicted object and the predicted emotion label; and, when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as the speech emotion classification model. The invention improves the accuracy of speech emotion recognition.

Description

Speech emotion classification model training method and device, computer equipment and medium
Technical Field
The invention relates to the technical field of classification models, in particular to a speech emotion classification model training method, a speech emotion classification model training device, computer equipment and a medium.
Background
Emotion recognition plays a very important role in intelligent human-computer interaction systems, especially in automated customer service systems. For example, an automatic customer service system needs to recognize, in real time, the emotion expressed in the user's speech so that corresponding measures can be taken in response.
In the prior art, voice data is mainly converted into text by machine recognition, and text emotion recognition is then performed on that text. However, this approach uses only the emotion information reflected in the textual content of the voice data, while the emotion information carried by the speech signal itself is lost, so the emotion recognition accuracy is low; and if errors occur when the voice data is converted into text, the accuracy drops further.
Disclosure of Invention
The embodiment of the invention provides a speech emotion classification model training method, a speech emotion classification model training device, computer equipment and a medium, and aims to solve the problem of low emotion recognition accuracy.
A speech emotion classification model training method comprises the following steps:
acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
A speech emotion classification model training apparatus, comprising:
the voice training set acquisition module is used for acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
the target emotion recognition result determining module is used for determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module is used for inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
the prediction loss value determining module is used for determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
and the model updating training module is used for iteratively updating the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, and recording the converged preset classification model as a speech emotion classification model when the prediction loss value reaches the convergence condition.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above speech emotion classification model training method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the above speech emotion classification model training method.
The method comprises the steps of obtaining a preset voice training set; the preset voice training set comprises at least one voice data to be recognized; determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label; inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object; determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label; and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
In this way, when emotion recognition is carried out on the voice data to be recognized through the preset classification model, the same emotion in the voice data to be recognized is tracked by an emotion tracking recognition method, so that speech segments sharing the same emotion are fused and judged together, a more accurate predicted emotion label is generated, and the accuracy of emotion recognition is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a speech emotion classification model training method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a speech emotion classification model according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S30 of the method for training the speech emotion classification model according to the embodiment of the present invention;
FIG. 4 is a schematic block diagram of a training apparatus for speech emotion classification model according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a predicted emotion recognition result determination module in the speech emotion classification model training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The speech emotion classification model training method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the method is applied to a speech emotion classification model training system that comprises the client and server shown in fig. 1; the client and the server communicate over a network and are used to solve the problem of low emotion recognition accuracy. The client, also called the user side, refers to a program that corresponds to the server and provides local services for the user. The client may be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for training a speech emotion classification model is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s10: acquiring a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
It can be understood that the voice data to be recognized refers to voice data on which emotion recognition needs to be performed, and its data source varies with the application scenario. For example, in an intelligent customer service scenario, the voice data to be recognized may be the user's speech received by the system.
S20: determining a target emotion recognition result of each voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
As can be understood, a target object is a participant in the voice data to be recognized; in an intelligent customer service scenario, the voice data to be recognized may include three target objects: a user, a robot, and a customer service agent. A target emotion tag refers to the emotion of the target object during the conversation and may include, but is not limited to, a happy emotion tag, a sad emotion tag, a calm emotion tag, or an angry emotion tag. Further, a target emotion tag represents the emotion of a target object within a particular voice segment, not the emotion of the target object across the whole voice data to be recognized; that is, each voice segment in the voice data to be recognized has its own target emotion tag for the corresponding target object. The target object and the target emotion label in this embodiment may be determined by manual labeling, or by the following steps.
In one embodiment, step S20 includes:
performing role recognition on the voice data to be recognized, and determining all target objects in the voice data to be recognized;
it can be understood that the role recognition refers to a method for recognizing different speaking objects in the speech data to be recognized, and the voiceprint features of each different speaking object are different, so that all target objects can be determined according to the voiceprint features in the speech data to be recognized.
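For concreteness, the following is a minimal sketch of how voiceprint-based role recognition could be carried out by clustering voiceprint embeddings. The embedding extractor extract_voiceprint and the distance threshold are hypothetical placeholders, and the use of scikit-learn agglomerative clustering is an assumption; the patent does not prescribe a specific voiceprint model or clustering method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def identify_target_objects(speech_windows, extract_voiceprint, distance_threshold=0.7):
    """Group short speech windows by speaker using voiceprint embeddings.

    speech_windows: list of waveform arrays cut from the voice data to be recognized.
    extract_voiceprint: callable mapping a waveform to an embedding vector
        (hypothetical placeholder for any voiceprint model).
    Returns one cluster id per window; each cluster id stands for one target object.
    """
    embeddings = np.stack([extract_voiceprint(w) for w in speech_windows])
    # Cosine-distance agglomerative clustering; the number of speakers is not
    # known in advance, so a distance threshold is used instead of n_clusters.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(embeddings)
```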
Dividing the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
it can be understood that after the voice data to be recognized is subjected to role recognition and all target objects in the voice data to be recognized are determined, the voice data to be recognized can be divided according to each target object, that is, the voice data to be recognized is divided into a plurality of voice data segments, and each voice data segment is dialog data of one target object; in the voice data to be recognized, one target object corresponds to at least one voice data segment, that is, one target object may correspond to one voice data segment or a plurality of voice data segments.
Performing voice emotion recognition on each voice data fragment to obtain a target emotion label corresponding to each voice data fragment;
as can be appreciated, speech emotion recognition is a method for determining the emotion of a conversation of a target object in a speech data segment.
In an embodiment, the performing speech emotion recognition on each of the speech data segments to obtain a target emotion tag corresponding to each of the speech data segments includes:
performing voice preprocessing on the voice data segment to obtain voice preprocessing characteristics of the voice data segment after preprocessing;
it is understood that the voice preprocessing refers to a process of eliminating other voices (such as noise and background voice) in the voice data segment except for the voice of the target object, and thus the accuracy in extracting the voice preprocessing features of the preprocessed voice data segment is high. The voice preprocessing characteristic is the voice characteristic of the target object in the voice data segment after preprocessing.
Carrying out endpoint detection and voice filtering processing on the voice data fragments to obtain voice data characteristics corresponding to each voice data fragment;
it is understood that endpoint detection is a method for detecting a start time point and an end time point of a dialogue vocal voice of a target object in a voice data section. The voice filtering process is a method for filtering noise, silence, and other sounds other than the target object's dialogue personal voice in the voice data segment. The voice data characteristics are the voice characteristics of the target object in the voice data segment after the endpoint detection and the voice filtering processing.
Performing feature fusion on the voice preprocessing features and the voice data features to obtain voice fusion features, and performing feature dimensionality reduction on the voice fusion features to obtain voice emotion features;
the voice pre-processing characteristics and the voice data characteristics obtained by the different voice processing modes are subjected to characteristic fusion, so that the voice characteristic information in the voice data to be recognized can be reflected more accurately, and the accuracy of emotion recognition is improved. After the feature dimensionality reduction is carried out on the voice fusion features, the emotion change features in the voice fusion features can be better displayed, and the accuracy of emotion recognition is further improved.
And determining a target emotion label corresponding to each voice data segment according to the voice emotion characteristics.
Specifically, after the voice fusion features are subjected to feature dimensionality reduction to obtain voice emotion features, the voice emotion features can be input into a trained emotion recognition model, and then the emotion recognition model outputs target emotion labels corresponding to the voice data fragments.
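A minimal sketch of this per-segment labelling step follows. The feature extractors preprocess_features and endpoint_features, the use of PCA for the dimensionality reduction, and the scikit-learn-style emotion_model are all illustrative assumptions; the patent does not fix particular algorithms for these steps.

```python
import numpy as np
from sklearn.decomposition import PCA

def label_segment(segment, preprocess_features, endpoint_features, pca, emotion_model):
    """Produce a target emotion label for one voice data segment.

    preprocess_features / endpoint_features: hypothetical callables returning the
        voice preprocessing features and the endpoint-detection/filtering features.
    pca: a fitted PCA instance standing in for the feature dimensionality
        reduction step (an assumption; no specific method is named).
    emotion_model: a trained emotion recognition classifier with a predict() method.
    """
    f_pre = preprocess_features(segment)                 # voice preprocessing features
    f_data = endpoint_features(segment)                  # voice data features
    fused = np.concatenate([f_pre, f_data])              # feature fusion
    emotion_feat = pca.transform(fused.reshape(1, -1))   # dimensionality reduction
    return emotion_model.predict(emotion_feat)[0]        # target emotion label
```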
And generating the target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
Specifically, after the target emotion tag corresponding to each of the voice data segments is determined according to the voice emotion feature, a target emotion recognition result is generated according to the target object and the target emotion tag corresponding to each of the voice data segments, that is, the target emotion recognition result includes the target object and the target emotion tag corresponding to each of the voice data segments.
S30: inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
it is to be understood that the preset classification model may be a model constructed based on a deep learning network and a classifier, and the preset classification model is used for performing emotion tracking recognition on the speech data to be recognized. The emotion tracking and recognition means that in the process of recognizing the voice data to be recognized, emotion in the voice data to be recognized is tracked and recognized so as to improve the accuracy of emotion recognition. The prediction object refers to all interlocutors in the voice data to be recognized which are recognized when emotion tracking recognition is performed through a preset classification model. The predicted emotion labels characterize the emotion of each predicted object when talking.
In an embodiment, as shown in fig. 3, in step S30, that is, performing emotion tracking recognition on the speech data to be recognized to obtain a predicted emotion recognition result corresponding to the speech data to be recognized, the method includes:
s301: performing role recognition on the voice data to be recognized, and determining a prediction object in the voice data to be recognized;
It can be understood that role recognition refers to a method for recognizing the different speaking objects in the speech data to be recognized; because each speaking object has distinct voiceprint features, the preset classification model can determine all prediction objects from the voiceprint features in the speech data to be recognized. Further, the prediction objects and the target objects in this embodiment are both dialogue participants in the speech data to be recognized. The target objects determined in step S20 are accurate, whereas the prediction objects determined here may be correct or wrong. Therefore, when the prediction object determined by the preset classification model for a given speech segment differs from the corresponding target object, the initial parameters of the preset classification model can be adjusted according to that difference, so that the model learns to accurately distinguish the voiceprint features of different objects.
S302: dividing the voice data to be recognized according to the prediction objects to obtain a voice sequence containing voice segments to be recognized corresponding to the prediction objects, wherein the voice segments to be recognized in the voice sequence are arranged according to the time sequence of the voice segments to be recognized in the voice data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
it can be understood that after the role recognition is performed on the voice data to be recognized and the prediction object in the voice data to be recognized is determined, the voice data to be recognized can be divided according to each prediction object, that is, the voice data to be recognized is divided into a plurality of voice segments to be recognized, and each voice segment to be recognized is dialog data of one prediction object; in the whole voice data to be recognized, the voice data to be recognized may be composed of multiple rounds of dialogue voice data of multiple prediction objects, so that in the voice data to be recognized, one prediction object corresponds to at least one voice segment to be recognized, that is, one prediction object may correspond to one voice segment to be recognized, and may also correspond to multiple voice segments to be recognized.
Further, after the voice data to be recognized is divided into a plurality of voice segments to be recognized by each prediction object, the voice segments to be recognized are arranged according to the time sequence in the voice data to be recognized, and a voice sequence is formed. It can be understood that each speech segment to be recognized is divided from the speech data to be recognized, so that each speech segment to be recognized has a temporal sequence in the speech data to be recognized, that is, a sequence of dialog generation, and further, each speech segment to be recognized can be arranged according to the temporal sequence in the speech data to be recognized to form a speech sequence.
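As a small illustration of how the speech sequence could be represented, the sketch below simply orders the segments by their start time; the SpeechSegment fields are hypothetical, chosen only to make the time ordering explicit.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SpeechSegment:
    speaker_id: int        # the prediction object this segment belongs to
    start_time: float      # where the segment starts in the voice data to be recognized
    end_time: float
    waveform: np.ndarray   # the audio of the speech segment to be recognized

def build_speech_sequence(segments: List[SpeechSegment]) -> List[SpeechSegment]:
    # Arrange the speech segments to be recognized in the order in which the
    # dialogue occurred; the emotion tracking step walks this sequence.
    return sorted(segments, key=lambda s: s.start_time)
```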
S303: performing initial emotion recognition on each voice segment to be recognized to obtain initial voice emotion corresponding to each voice segment to be recognized;
it is understood that the initial emotion recognition is a method for determining the dialogue emotion of the predicted object in each speech segment to be recognized. Specifically, voice preprocessing is carried out on a voice segment to be recognized, and a first feature to be recognized of the voice segment to be recognized after preprocessing is obtained; carrying out endpoint detection and voice filtering processing on voice segments to be recognized to obtain second features to be recognized corresponding to the voice segments to be recognized; and performing feature fusion on the first feature to be recognized and the second feature to be recognized to obtain a feature to be recognized and fused, performing feature dimensionality reduction on the feature to be recognized and fused to obtain an emotion feature to be recognized, and determining an initial voice emotion according to the emotion feature to be recognized.
S304: acquiring initial voice emotions corresponding to two voice fragments to be recognized which are adjacent in the voice sequence and correspond to the same prediction object, and determining emotion change characteristics according to the two acquired initial voice emotions;
Understandably, the emotion change feature represents the relationship between the initial speech emotions of two speech segments to be recognized that are adjacent in the speech sequence and correspond to the same prediction object. For example, the emotion change feature may indicate that the two initial speech emotions are unchanged (e.g., both are happy emotions), or that the two initial speech emotions differ (e.g., one is a happy emotion and the other is a sad emotion).
In one embodiment, step S304 includes:
comparing the emotion of the two obtained initial voice emotions to determine an emotion comparison result;
As can be understood, the emotion comparison determines whether the two acquired initial speech emotions are the same emotion, yielding an emotion comparison result. Further, if the initial speech emotions are more fine-grained (for example, the happy category covers emotions such as a laughing emotion and a joyful emotion), the comparison may instead judge whether the two initial speech emotions belong to the same category, and the emotion comparison result is determined accordingly.
When the emotion comparison result is the same emotion result, determining the emotion change characteristic as a tracking emotion characteristic; the same emotion result represents that the obtained two initial voice emotions are the same;
It can be understood that, after the two acquired initial speech emotions have been compared, if they are the same (or, when fine-grained emotions are used as described above, if they belong to the same category), the emotion comparison result is determined to be the same emotion result, and the emotion change feature is then determined to be the tracking emotion feature; that is, emotion tracking needs to be performed on the speech segments to be recognized corresponding to the two initial speech emotions.
When the emotion comparison result is different emotion results, determining that the emotion change characteristic is a single emotion characteristic; the different emotion result represents that the two acquired initial speech emotions are different.
Likewise, after the two acquired initial speech emotions have been compared, if they are different (or, when fine-grained emotions are used, if they belong to different categories), the emotion comparison result is determined to be a different emotion result, and the emotion change feature is then determined to be a single emotion feature; that is, no emotion tracking needs to be performed on the speech segments to be recognized corresponding to the two initial speech emotions.
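The comparison step can be sketched as below. The coarse emotion categories and the string constants are hypothetical placeholders used only to show how same-category emotions map to the tracking emotion feature and different ones to the single emotion feature.

```python
# Hypothetical coarse emotion categories; the description only requires that
# fine-grained emotions (e.g. "laughing", "joyful") can be mapped to a category.
EMOTION_CATEGORY = {
    "laughing": "happy", "joyful": "happy", "happy": "happy",
    "sad": "sad", "calm": "calm", "angry": "angry",
}

TRACKING = "tracking_emotion_feature"
SINGLE = "single_emotion_feature"

def emotion_change_feature(prev_emotion: str, next_emotion: str) -> str:
    """Compare the initial speech emotions of two adjacent segments of the same
    prediction object and return the emotion change feature."""
    same = (EMOTION_CATEGORY.get(prev_emotion, prev_emotion)
            == EMOTION_CATEGORY.get(next_emotion, next_emotion))
    return TRACKING if same else SINGLE
```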
S305: determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change characteristics;
specifically, after emotion change characteristics are determined according to the two acquired initial voice emotions, prediction emotion labels corresponding to the voice segments to be recognized are determined according to the voice segments to be recognized, the initial voice emotions and the emotion change characteristics.
In one embodiment, step S305 includes:
determining a predicted emotion label corresponding to a first voice segment according to an initial voice emotion corresponding to the first voice segment when the emotion change feature is the single emotion feature; the first voice segment is a voice segment to be recognized corresponding to the previous initial voice emotion in the two acquired initial voice emotions;
determining a predicted emotion label corresponding to the second voice segment according to the initial voice emotion corresponding to the second voice segment; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions.
It can be understood that after the emotion change feature is determined according to the two obtained initial speech emotions, if the emotion change feature is a single emotion feature, the predicted emotion tag corresponding to the first speech segment can be directly determined according to the initial speech emotion corresponding to the first speech segment, and the predicted emotion tag corresponding to the second speech segment can be directly determined according to the initial speech emotion corresponding to the second speech segment. And the sequence of the first voice fragment and the second voice fragment is the sequence of the dialog occurrence in the voice data to be recognized.
In an embodiment, step S305 further includes:
when the emotion change feature is the tracking emotion feature, performing voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment;
It can be understood that, when the emotion change feature is the tracking emotion feature, it indicates that the two acquired initial speech emotions are the same. Speech fusion therefore needs to be performed on the corresponding speech segments to be recognized, that is, the first speech segment and the second speech segment are fused into a speech fusion segment, so that speech segments sharing the same emotion can be tracked and the accuracy of recognizing that emotion is improved.
Performing voice emotion recognition on the voice fusion segment to obtain a predicted emotion label corresponding to the second voice segment;
as can be understood, after the first speech segment and the second speech segment are subjected to speech fusion to obtain a speech fusion segment, speech emotion recognition is performed on the speech fusion segment to determine a predicted emotion tag corresponding to the second speech segment. It will be appreciated that the predicted emotion tag may or may not be the same as the initial speech emotion.
And determining a predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
It can be understood that, since the time sequence of the second speech segment is after the first speech segment, only emotion tracking needs to be performed on the second speech segment according to the first speech segment, and for the first speech segment, the predicted emotion tag corresponding to the first speech segment is determined directly according to the initial speech emotion corresponding to the first speech segment.
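The following sketch ties the two cases together for one adjacent same-speaker pair. Modelling the speech fusion as simple waveform concatenation is an assumption, since the text does not specify how the fusion is performed; recognize_emotion stands for the same speech emotion recognition pipeline used for the initial emotions.

```python
import numpy as np

def predict_labels_for_pair(first_seg, second_seg, first_emotion, second_emotion,
                            recognize_emotion, change_feature):
    """Assign predicted emotion labels to two adjacent same-speaker segments.

    recognize_emotion: callable running speech emotion recognition on a waveform.
    change_feature: "tracking_emotion_feature" or "single_emotion_feature",
        produced by the comparison step above.
    """
    first_label = first_emotion  # the earlier segment keeps its initial speech emotion
    if change_feature == "tracking_emotion_feature":
        fused = np.concatenate([first_seg, second_seg])  # speech fusion segment
        second_label = recognize_emotion(fused)          # may differ from second_emotion
    else:
        second_label = second_emotion
    return first_label, second_label
```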
S306: and generating the predicted emotion recognition result according to the predicted object and the predicted emotion label corresponding to each voice segment to be recognized.
Specifically, after the predicted emotion tag corresponding to each to-be-recognized voice segment is determined according to the to-be-recognized voice segment, the initial voice emotion and the emotion change feature, a predicted emotion recognition result is generated according to the predicted object and the predicted emotion tag corresponding to each to-be-recognized voice segment, that is, the predicted emotion recognition result includes the predicted object and the predicted emotion tag associated with each to-be-recognized voice segment.
S40: determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the prediction emotion label;
the predicted loss value is the loss generated in the emotion tracking recognition process of the speech data to be recognized by the preset classification model.
In one embodiment, step S40 includes:
when the emotion change characteristic is a tracking emotion characteristic, acquiring an initial voice emotion corresponding to the second voice segment and a predicted emotion label; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions;
determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
It can be understood that, if the emotion change feature is the tracking emotion feature, the initial speech emotions of two adjacent speech segments to be recognized that correspond to the same prediction object are the same. In that case the first speech segment and the second speech segment are fused, and speech emotion recognition is performed on the resulting speech fusion segment to generate the predicted emotion tag of the second speech segment, which may or may not match the initial speech emotion. Therefore, when the emotion change feature is the tracking emotion feature, an emotion recognition loss value exists between the initial speech emotion of the second speech segment and its predicted emotion tag.
And determining the prediction loss value according to the emotion recognition loss value, the target object, the target emotion label, the prediction object and the prediction emotion label.
It can be understood that after the target object, the target emotion label, the predicted object and the predicted emotion label are determined, an object loss value between each target object and each predicted object is determined, a label loss value between each target emotion label and each predicted emotion label is determined, and then the predicted loss value is determined through a preset loss model according to the emotion recognition loss value, the object loss value and the label loss value. The preset loss model may be a model constructed based on a cross entropy loss function, a model constructed based on a maximum likelihood function, or the like.
Further, the object loss value and the label loss value are determined as follows. The voice data segments associated with the target objects and target emotion labels are arranged in time order, and the predicted object and predicted emotion label of each speech segment to be recognized are compared with the target object and target emotion label of the voice data segment in the same position of that order. That is, the target object and target emotion label of the first voice data segment are compared with the predicted object and predicted emotion label of the first speech segment to be recognized, an object loss value between the target object and the predicted object is determined, and a label loss value between the target emotion label and the predicted emotion label is determined; the comparison then moves on to the second voice data segment and the second speech segment to be recognized, and so on, until all voice data segments have been compared with their speech segments to be recognized and the prediction loss value is determined.
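A minimal sketch of such a preset loss model is given below. The description allows a cross-entropy-based loss; the PyTorch formulation, the equal weighting of the three terms, and the tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def prediction_loss(obj_logits, obj_targets,
                    label_logits, label_targets,
                    tracked_logits, tracked_initial_emotions,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the object loss, label loss, and emotion recognition loss.

    obj_logits / label_logits: model outputs for each speech segment, aligned in
        time order with the target segments (obj_targets / label_targets).
    tracked_logits: outputs for the fused segments when the emotion change feature
        is the tracking emotion feature; tracked_initial_emotions are the initial
        speech emotions of the corresponding second segments.
    """
    object_loss = F.cross_entropy(obj_logits, obj_targets)
    label_loss = F.cross_entropy(label_logits, label_targets)
    emotion_recognition_loss = (
        F.cross_entropy(tracked_logits, tracked_initial_emotions)
        if tracked_logits is not None and len(tracked_logits) > 0
        else torch.tensor(0.0)
    )
    w1, w2, w3 = weights
    return w1 * object_loss + w2 * label_loss + w3 * emotion_recognition_loss
```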
S50: and when the prediction loss value does not reach a preset convergence condition, iteratively updating initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
It is understood that the convergence condition may be that the prediction loss value is smaller than a set threshold, in which case training stops once the loss falls below that threshold; alternatively, the convergence condition may be that the prediction loss value remains small and no longer decreases after 10,000 iterations, in which case training stops at that point. The converged preset classification model is then recorded as the speech emotion classification model.
Further, after the prediction loss value of the preset classification model is determined from the target object, the target emotion label, the prediction object and the prediction emotion label, if the prediction loss value does not reach the preset convergence condition, the initial parameters of the preset classification model are adjusted according to the prediction loss value and the voice data to be recognized is fed into the adjusted model again. Once the prediction loss value for that voice data reaches the preset convergence condition, another piece of voice data to be recognized is selected from the preset voice training set and steps S20 to S40 are executed to obtain its prediction loss value; if that loss value does not reach the preset convergence condition, the initial parameters are adjusted again according to it until it does.
In this way, after the preset classification model has been trained on all the voice data to be recognized in the preset voice training set, its outputs are drawn ever closer to the accurate results and its recognition accuracy keeps improving; when the prediction loss values of all the voice data to be recognized reach the preset convergence condition, the converged preset classification model is recorded as the speech emotion classification model.
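The overall training loop could look like the sketch below. It uses only the threshold form of the convergence condition and adds a max_epochs safety bound; both of these, along with the PyTorch-style optimizer calls, are assumptions, and compute_loss stands for running steps S20 to S40 on one piece of voice data.

```python
def train_speech_emotion_model(model, optimizer, training_set, compute_loss,
                               loss_threshold=1e-3, max_epochs=100):
    """Iteratively update the initial parameters until every piece of voice data
    to be recognized in the preset voice training set meets the convergence
    condition (here: prediction loss below a set threshold)."""
    for _ in range(max_epochs):
        all_converged = True
        for voice_data in training_set:
            loss = compute_loss(model, voice_data)   # steps S20-S40 for this sample
            if loss.item() >= loss_threshold:
                all_converged = False
                optimizer.zero_grad()
                loss.backward()                      # adjust the initial parameters
                optimizer.step()
        if all_converged:
            break
    return model  # the converged model is recorded as the speech emotion classification model
```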
In this embodiment, when emotion recognition is performed on the voice data to be recognized through the preset classification model, the same emotion in the voice data to be recognized is tracked by the emotion tracking recognition method, so that speech segments sharing the same emotion are fused and judged together, a more accurate predicted emotion label is generated, and the accuracy of emotion recognition is further improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a speech emotion classification model training device is provided, and the speech emotion classification model training device corresponds to the speech emotion classification model training method in the above embodiments one to one. As shown in fig. 4, the speech emotion classification model training apparatus includes a speech training set acquisition module 10, a target emotion recognition result determination module 20, a predicted emotion recognition result determination module 30, a prediction loss value determination module 40, and a model update training module 50. The functional modules are explained in detail as follows:
a voice training set obtaining module 10, configured to obtain a preset voice training set; the preset voice training set comprises at least one voice data to be recognized;
a target emotion recognition result determining module 20, configured to determine a target emotion recognition result for each piece of the to-be-recognized speech data; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module 30 is configured to input the voice data to be recognized into a preset classification model including initial parameters, and perform emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the emotion recognition result comprises a prediction object and a prediction emotion label corresponding to the prediction object;
a prediction loss value determination module 40, configured to determine a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object, and the prediction emotion label;
and the model update training module 50 is configured to iteratively update the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, until the prediction loss value reaches the convergence condition, and then record the converged preset classification model as the speech emotion classification model.
Preferably, the target emotion recognition result determination module 20 includes:
the target object determining unit is used for carrying out role recognition on the voice data to be recognized and determining all target objects in the voice data to be recognized;
a voice data segment dividing unit, configured to divide the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
the target emotion tag determining unit is used for carrying out voice emotion recognition on each voice data segment to obtain a target emotion tag corresponding to each voice data segment;
and the target emotion recognition result determining unit is used for generating a target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
Preferably, as shown in fig. 5, the predicted emotion recognition result determination module 30 includes:
a prediction object determining unit 301, configured to perform role recognition on the voice data to be recognized, and determine a prediction object in the voice data to be recognized;
a speech sequence determining unit 302, configured to divide the speech data to be recognized according to each of the prediction objects to obtain a speech sequence including speech segments to be recognized corresponding to each of the prediction objects, where each of the speech segments to be recognized in the speech sequence is arranged according to a time sequence of the speech segment to be recognized in the speech data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
an initial emotion recognition unit 303, configured to perform initial emotion recognition on each to-be-recognized speech segment to obtain an initial speech emotion corresponding to each to-be-recognized speech segment;
an emotion change feature determination unit 304, configured to acquire initial speech emotions corresponding to two to-be-recognized speech segments that are adjacent to each other in the speech sequence and correspond to the same prediction object, and determine an emotion change feature according to the acquired two initial speech emotions;
a predicted emotion tag determination unit 305, configured to determine, according to the to-be-recognized speech segments, the initial speech emotion, and the emotion change feature, a predicted emotion tag corresponding to each of the to-be-recognized speech segments;
a predicted emotion recognition result determining unit 306, configured to generate the predicted emotion recognition result according to the prediction object and the predicted emotion tag corresponding to each to-be-recognized speech segment.
Preferably, the emotion change feature determination unit 304 includes:
the emotion comparison subunit is used for performing emotion comparison on the two acquired initial voice emotions and determining an emotion comparison result;
the first emotion change subunit is used for determining the emotion change characteristic as a tracking emotion characteristic when the emotion comparison result is the same emotion result; the same emotion result represents that the obtained two initial voice emotions are the same;
the second emotion change subunit is used for determining the emotion change characteristics as the individual emotion characteristics when the emotion comparison results are different emotion results; the different emotion result represents that the two acquired initial speech emotions are different.
Preferably, the predicted emotion label determination unit 305 includes:
a first predicted emotion tag determination subunit, configured to determine, when the emotion change feature is the single emotion feature, a predicted emotion tag corresponding to the first speech segment according to an initial speech emotion corresponding to the first speech segment; the first voice segment is a voice segment to be recognized corresponding to the previous initial voice emotion in the two acquired initial voice emotions;
a second predicted emotion tag determination subunit, configured to determine, according to the initial speech emotion corresponding to the second speech segment, a predicted emotion tag corresponding to the second speech segment; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions.
Preferably, the predicted emotion label determination unit 305 further includes:
a voice fusion subunit, configured to perform voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment when the emotion change feature is the emotion tracking feature;
a third predicted emotion label determining subunit, configured to perform speech emotion recognition on the speech fusion segment to obtain a predicted emotion label corresponding to the second speech segment;
and the fourth predicted emotion label determining subunit is used for determining the predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
Preferably, the predicted loss value determining module 40 includes:
the data acquisition unit is used for acquiring an initial voice emotion corresponding to the second voice segment and a predicted emotion label when the emotion change characteristic is a tracking emotion characteristic; the second voice segment is a voice segment to be recognized corresponding to the latter initial voice emotion in the two acquired initial voice emotions;
the emotion recognition loss value determining unit is used for determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
and the prediction loss value determining unit is used for determining the prediction loss value through a preset loss model according to the emotion recognition loss value, the target object, the target emotion label, the prediction object and the prediction emotion label.
For specific limitations of the speech emotion classification model training device, reference may be made to the above limitations of the speech emotion classification model training method, which is not described herein again. All or part of the modules in the speech emotion classification model training device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the data used in the training method of the speech emotion classification model in the above embodiment. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech emotion classification model training method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the speech emotion classification model training method in the above embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the speech emotion classification model training method in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A speech emotion classification model training method is characterized by comprising the following steps:
acquiring a preset voice training set; the preset voice training set comprises at least one piece of voice data to be recognized;
determining a target emotion recognition result of each piece of voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a predicted emotion label corresponding to the prediction object;
determining a prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label;
and when the prediction loss value does not reach a preset convergence condition, iteratively updating the initial parameters in the preset classification model until the prediction loss value reaches the convergence condition, and recording the converged preset classification model as a speech emotion classification model.
2. The method for training the speech emotion classification model according to claim 1, wherein the determining the target emotion recognition result of each of the speech data to be recognized includes:
performing role recognition on the voice data to be recognized, and determining all target objects in the voice data to be recognized;
dividing the voice data to be recognized into voice data segments corresponding to the target objects; one of the target objects corresponds to at least one voice data segment;
performing voice emotion recognition on each voice data fragment to obtain a target emotion label corresponding to each voice data fragment;
and generating the target emotion recognition result according to the target object corresponding to each voice data segment and the target emotion label.
3. The method for training the speech emotion classification model according to claim 1, wherein the performing emotion tracking recognition on the voice data to be recognized to obtain a predicted emotion recognition result corresponding to the voice data to be recognized includes:
performing role recognition on the voice data to be recognized, and determining a prediction object in the voice data to be recognized;
dividing the voice data to be recognized according to the prediction objects to obtain a voice sequence containing voice segments to be recognized corresponding to the prediction objects, wherein the voice segments to be recognized in the voice sequence are arranged according to the time sequence of the voice segments to be recognized in the voice data to be recognized; one prediction object corresponds to at least one speech segment to be recognized;
performing initial emotion recognition on each voice segment to be recognized to obtain initial voice emotion corresponding to each voice segment to be recognized;
acquiring the initial voice emotions corresponding to two voice segments to be recognized which are adjacent in the voice sequence and correspond to the same prediction object, and determining an emotion change feature according to the two acquired initial voice emotions;
determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature;
and generating the predicted emotion recognition result according to the predicted object and the predicted emotion label corresponding to each voice segment to be recognized.
4. The method for training the speech emotion classification model according to claim 3, wherein the determining an emotion change feature according to the two acquired initial voice emotions comprises:
comparing the two acquired initial voice emotions to determine an emotion comparison result;
when the emotion comparison result is a same emotion result, determining that the emotion change feature is a tracking emotion feature; the same emotion result represents that the two acquired initial voice emotions are the same;
when the emotion comparison result is a different emotion result, determining that the emotion change feature is a single emotion feature; the different emotion result represents that the two acquired initial voice emotions are different.
5. The method for training the speech emotion classification model as claimed in claim 4, wherein the determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature comprises:
when the emotion change feature is the single emotion feature, determining a predicted emotion label corresponding to a first voice segment according to the initial voice emotion corresponding to the first voice segment; the first voice segment is the voice segment to be recognized corresponding to the former of the two acquired initial voice emotions;
determining a predicted emotion label corresponding to a second voice segment according to the initial voice emotion corresponding to the second voice segment; the second voice segment is the voice segment to be recognized corresponding to the latter of the two acquired initial voice emotions.
6. The method for training the speech emotion classification model as claimed in claim 5, wherein the determining a predicted emotion label corresponding to each voice segment to be recognized according to the voice segment to be recognized, the initial voice emotion and the emotion change feature comprises:
when the emotion change feature is the tracking emotion feature, performing voice fusion on the first voice segment and the second voice segment to obtain a voice fusion segment;
performing voice emotion recognition on the voice fusion segment to obtain a predicted emotion label corresponding to the second voice segment;
and determining a predicted emotion label corresponding to the first voice segment according to the initial voice emotion corresponding to the first voice segment.
7. The method for training the speech emotion classification model according to claim 4, wherein the determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label comprises:
when the emotion change feature is the tracking emotion feature, acquiring the initial voice emotion and the predicted emotion label corresponding to a second voice segment; the second voice segment is the voice segment to be recognized corresponding to the latter of the two acquired initial voice emotions;
determining an emotion recognition loss value according to the initial voice emotion corresponding to the second voice segment and the predicted emotion label;
and determining the predicted loss value through a preset loss model according to the emotion recognition loss value, the target object, the target emotion label, the predicted object and the predicted emotion label.
8. A speech emotion classification model training device is characterized by comprising:
the voice training set acquisition module is used for acquiring a preset voice training set; the preset voice training set comprises at least one piece of voice data to be recognized;
the target emotion recognition result determining module is used for determining a target emotion recognition result of each piece of voice data to be recognized; each target emotion recognition result comprises at least one target object and a target emotion label corresponding to the target object; one target object corresponds to at least one target emotion label;
the predicted emotion recognition result determining module is used for inputting the voice data to be recognized into a preset classification model containing initial parameters, and performing emotion tracking recognition on the voice data to be recognized through the preset classification model to obtain a predicted emotion recognition result corresponding to the voice data to be recognized; the predicted emotion recognition result comprises a prediction object and a predicted emotion label corresponding to the prediction object;
the prediction loss value determining module is used for determining the prediction loss value of the preset classification model according to the target object, the target emotion label, the prediction object and the predicted emotion label;
and the model updating training module is used for iteratively updating the initial parameters in the preset classification model when the prediction loss value does not reach a preset convergence condition, and recording the converged preset classification model as a speech emotion classification model when the prediction loss value reaches the convergence condition.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech emotion classification model training method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech emotion classification model training method according to any one of claims 1 to 7.
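For readers who want a concrete picture of the training loop recited in claim 1 above, the following minimal sketch shows one plausible realization in Python with PyTorch. It is not the patent's implementation: the model architecture, the cross-entropy loss, the Adam optimizer, the convergence threshold epsilon and the names SpeechEmotionClassifier and train_until_convergence are illustrative assumptions.

# Minimal sketch of the claim-1 training loop (assumed PyTorch realization):
# a preset classification model with initial parameters is updated iteratively
# until the prediction loss value satisfies an assumed convergence condition.
import torch
import torch.nn as nn

class SpeechEmotionClassifier(nn.Module):
    """Hypothetical preset classification model: acoustic features -> emotion logits."""
    def __init__(self, feature_dim=40, num_emotions=6):
        super().__init__()
        self.encoder = nn.GRU(feature_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_emotions)

    def forward(self, features):
        _, hidden = self.encoder(features)      # hidden: (1, batch, 64)
        return self.head(hidden.squeeze(0))     # logits: (batch, num_emotions)

def train_until_convergence(model, batches, epsilon=1e-3, max_steps=1000):
    """Iteratively update the initial parameters until the change in the prediction
    loss value falls below epsilon (the assumed preset convergence condition)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_loss = float("inf")
    for step, (features, target_labels) in enumerate(batches):
        logits = model(features)                 # predicted emotion recognition result
        loss = criterion(logits, target_labels)  # prediction loss value
        if step >= max_steps or abs(prev_loss - loss.item()) < epsilon:
            break                                # convergence condition reached
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prev_loss = loss.item()
    return model  # the converged model is recorded as the speech emotion classification model

In practice, batches would yield per-utterance feature tensors together with the target emotion labels produced as in claim 2; here it is simply any iterable of (features, labels) pairs.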
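Claim 2 builds the target emotion recognition result from role recognition followed by per-segment speech emotion recognition. The sketch below (plain Python, no external libraries) shows one way the bookkeeping could look; recognize_roles and recognize_emotion are hypothetical placeholders standing in for a real diarization model and a real emotion classifier, and the hard-coded segments exist only so the example runs.

from dataclasses import dataclass

@dataclass
class Segment:
    target_object: str    # speaker found by role recognition
    start: float          # segment boundaries in seconds
    end: float
    emotion: str = ""     # target emotion label, filled in below

def recognize_roles(voice_data):
    """Placeholder role recognition: split the voice data into per-object segments."""
    return [Segment("agent", 0.0, 3.2), Segment("customer", 3.2, 7.5),
            Segment("customer", 7.5, 9.0)]

def recognize_emotion(voice_data, segment):
    """Placeholder speech emotion recognition on one voice data segment."""
    return "neutral"

def build_target_result(voice_data):
    """Target emotion recognition result: each target object mapped to its emotion labels."""
    segments = recognize_roles(voice_data)
    for segment in segments:
        segment.emotion = recognize_emotion(voice_data, segment)
    result = {}
    for segment in segments:
        result.setdefault(segment.target_object, []).append(segment.emotion)
    return result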
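Claims 3 and 4 order the segments of one utterance by time and, for two adjacent segments of the same prediction object, compare their initial voice emotions to obtain the emotion change feature. A small, dependency-free sketch of that comparison follows; the tuple layout and the "tracking"/"single" strings are assumptions chosen to mirror the claim wording.

from collections import namedtuple

# One speech segment to be recognized in the time-ordered voice sequence.
Piece = namedtuple("Piece", "speaker initial_emotion audio")

def emotion_change_feature(prev_emotion, curr_emotion):
    """Claim 4: same initial voice emotions -> tracking emotion feature,
    different initial voice emotions -> single emotion feature."""
    return "tracking" if prev_emotion == curr_emotion else "single"

def adjacent_same_object_pairs(sequence):
    """Pairs of adjacent segments that belong to the same prediction object,
    each paired with its emotion change feature."""
    pairs = []
    for prev, curr in zip(sequence, sequence[1:]):
        if prev.speaker == curr.speaker:
            pairs.append((prev, curr,
                          emotion_change_feature(prev.initial_emotion, curr.initial_emotion)))
    return pairs

# Example: the customer speaks twice in a row with the same initial emotion,
# so that pair carries the tracking emotion feature.
sequence = [Piece("agent", "neutral", b""),
            Piece("customer", "angry", b""),
            Piece("customer", "angry", b"")]
print(adjacent_same_object_pairs(sequence))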
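Claims 5 and 6 then assign predicted emotion labels depending on that feature: with the single emotion feature each segment keeps its initial voice emotion, while with the tracking emotion feature the two segments are fused and the fusion is re-classified to obtain the second segment's label. The sketch below reuses the same Piece tuple as above; fuse_voice and classify_emotion are hypothetical placeholders (naive concatenation and a constant label), not the patent's actual fusion or recognizer.

from collections import namedtuple

Piece = namedtuple("Piece", "speaker initial_emotion audio")

def fuse_voice(first_audio, second_audio):
    """Placeholder voice fusion: naive concatenation of the two waveforms."""
    return first_audio + second_audio

def classify_emotion(audio):
    """Placeholder speech emotion recognition on a (possibly fused) segment."""
    return "angry"

def predicted_labels(first, second, change_feature):
    """Claim 5: single emotion feature -> both segments keep their initial voice emotion.
    Claim 6: tracking emotion feature -> the second label comes from the fused segment,
    while the first keeps its initial voice emotion."""
    if change_feature == "single":
        return first.initial_emotion, second.initial_emotion
    fused = fuse_voice(first.audio, second.audio)
    return first.initial_emotion, classify_emotion(fused)

# Example call for a tracking pair of two customer segments.
first = Piece("customer", "angry", b"\x01")
second = Piece("customer", "angry", b"\x02")
print(predicted_labels(first, second, "tracking"))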
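Claim 7 folds an extra emotion recognition loss, computed for tracking pairs from the second segment's initial voice emotion and its predicted emotion label, into the overall prediction loss value. The toy functions below only illustrate that combination; the 0/1 disagreement losses and the weight alpha are assumptions standing in for whatever preset loss model the patent actually uses.

def emotion_recognition_loss(initial_emotion, predicted_label):
    """0/1 disagreement between the second segment's initial emotion and its predicted label."""
    return 0.0 if initial_emotion == predicted_label else 1.0

def label_loss(target_labels, predicted_labels):
    """Fraction of segments whose predicted emotion label differs from the target emotion label."""
    wrong = sum(t != p for t, p in zip(target_labels, predicted_labels))
    return wrong / max(len(target_labels), 1)

def prediction_loss(target_labels, predicted_labels, tracking_pairs, alpha=0.5):
    """Assumed preset loss model: label-level loss plus alpha times the averaged
    emotion recognition loss over all tracking pairs."""
    tracking = sum(emotion_recognition_loss(e, p) for e, p in tracking_pairs)
    tracking = tracking / max(len(tracking_pairs), 1)
    return label_loss(target_labels, predicted_labels) + alpha * tracking

# Example: two of three labels correct, plus one tracking pair whose labels disagree.
print(prediction_loss(["angry", "neutral", "happy"],
                      ["angry", "neutral", "sad"],
                      [("angry", "neutral")]))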
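Finally, the apparatus of claim 8 groups the same five steps into modules. One way to picture that decomposition is a thin trainer class whose methods mirror the modules; the class and method names below are illustrative only and not part of the patent.

class SpeechEmotionTrainer:
    """Hypothetical grouping of the five modules recited in claim 8."""

    def acquire_training_set(self):
        """Voice training set acquisition module."""
        ...

    def determine_target_results(self, voice_data):
        """Target emotion recognition result determining module."""
        ...

    def predict_results(self, voice_data, model):
        """Predicted emotion recognition result determining module."""
        ...

    def compute_prediction_loss(self, target_result, predicted_result):
        """Prediction loss value determining module."""
        ...

    def update_until_convergence(self, model):
        """Model updating training module: iterate until the convergence condition is met."""
        ...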
CN202110836890.XA 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium Active CN113571096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110836890.XA CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110836890.XA CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN113571096A true CN113571096A (en) 2021-10-29
CN113571096B CN113571096B (en) 2023-04-07

Family

ID=78166832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110836890.XA Active CN113571096B (en) 2021-07-23 2021-07-23 Speech emotion classification model training method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113571096B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
US20200250278A1 (en) * 2019-02-05 2020-08-06 International Business Machines Corporation Method for Fine-Grained Affective States Understanding and Prediction
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN113051910A (en) * 2021-03-19 2021-06-29 上海森宇文化传媒股份有限公司 Method and device for predicting emotion of character role

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
US20200250278A1 (en) * 2019-02-05 2020-08-06 International Business Machines Corporation Method for Fine-Grained Affective States Understanding and Prediction
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN113051910A (en) * 2021-03-19 2021-06-29 上海森宇文化传媒股份有限公司 Method and device for predicting emotion of character role

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155882A (en) * 2021-11-30 2022-03-08 浙江大学 Method and device for judging road rage emotion based on voice recognition
CN114155882B (en) * 2021-11-30 2023-08-22 浙江大学 Method and device for judging emotion of road anger based on voice recognition
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Also Published As

Publication number Publication date
CN113571096B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN110162633B (en) Voice data intention determining method and device, computer equipment and storage medium
CN109729383B (en) Double-recording video quality detection method and device, computer equipment and storage medium
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
CN109472213B (en) Palm print recognition method and device, computer equipment and storage medium
US10885920B2 (en) Method and system for separating and authenticating speech of a speaker on an audio stream of speakers
CN113571096B (en) Speech emotion classification model training method and device, computer equipment and medium
CN111475616B (en) Multi-round dialogue method and device based on dialogue state prediction and computer equipment
CN110689881B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN106847305B (en) Method and device for processing recording data of customer service telephone
CN110930989B (en) Speech intention recognition method and device, computer equipment and storage medium
CN110890088B (en) Voice information feedback method and device, computer equipment and storage medium
KR20200004826A (en) Voice conversation based context acquisition method and device
CN108831481A (en) Symbol adding method, device, computer equipment and storage medium in speech recognition
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN109815489A (en) Collection information generating method, device, computer equipment and storage medium
CN110517673B (en) Speech recognition method, device, computer equipment and storage medium
CN111797632A (en) Information processing method and device and electronic equipment
CN111209380B (en) Control method and device for conversation robot, computer equipment and storage medium
CN109065026B (en) Recording control method and device
CN114493902A (en) Multi-mode information anomaly monitoring method and device, computer equipment and storage medium
CN112395857A (en) Voice text processing method, device, equipment and medium based on dialog system
CN111883109A (en) Voice information processing and verification model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant