CN118245602A - Emotion recognition model training method, device, equipment and storage medium


Info

Publication number
CN118245602A
CN118245602A (application CN202410339691.1A)
Authority
CN
China
Prior art keywords
emotion
sample
emotion recognition
text
label
Prior art date
Legal status
Pending
Application number
CN202410339691.1A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410339691.1A priority Critical patent/CN118245602A/en
Publication of CN118245602A publication Critical patent/CN118245602A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a training method, apparatus, device and storage medium for an emotion recognition model, belonging to the technical field of emotion recognition. The method comprises: performing emotion recognition on sample texts in a training data set through an emotion recognition model to obtain a sample emotion recognition result for each sample text, where the training data set comprises the sample texts and the emotion truth labels corresponding to them; determining a loss adjustment weight for each sample text according to its sample characteristics and its sample emotion recognition result, where the sample characteristics comprise at least one of the sample type and the label sample count of the emotion truth label corresponding to the sample text; determining the emotion recognition loss of the sample text based on the loss adjustment weight, the sample emotion recognition result and the emotion truth label; and training the emotion recognition model based on the emotion recognition loss. The scheme provided by the embodiments of the application optimizes the training of the emotion recognition model and improves the accuracy of emotion recognition.

Description

Emotion recognition model training method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of emotion recognition, in particular to a training method, device and equipment of an emotion recognition model and a storage medium.
Background
Understanding a film or television script requires emotion analysis of the script as submitted by the screenwriter or as prepared for shooting, so that the emotional arcs of the characters, especially the leads, can be tracked to evaluate whether the script has sufficient emotional fluctuation and to identify the emotional high points for shooting. Because human emotion is complex, a single sentence can carry multiple emotions, such as calm, happy and angry at once, so some labels are easily missed in multi-label prediction.
In the related art, a balanced sampling scheme is adopted: the number of samples n of each label category is counted, and each sample of that label category is then drawn with probability 1/n, so that label categories with little data obtain larger sampling coverage.
Under extremely unbalanced data, however, some samples of the head labels (label categories with many samples) are never sampled and learned, which reduces generalization on the head labels; and samples carrying multiple label categories introduce sampling redundancy and insufficient learning benefit for those labels, reducing the accuracy of emotion recognition.
Disclosure of Invention
The embodiment of the application provides a training method, device and equipment for an emotion recognition model and a storage medium, which can optimize the training effect of the emotion recognition model and improve the accuracy of emotion recognition. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a training method for an emotion recognition model, where the method includes:
carrying out emotion recognition on sample texts in a training data set through an emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, wherein the training data set comprises the sample texts and emotion truth labels corresponding to the sample texts;
Determining a loss adjustment weight corresponding to the sample text according to sample characteristics of the sample text and the sample emotion recognition result, wherein the sample characteristics comprise at least one of a sample type and a label sample number of emotion truth labels corresponding to the sample text, the sample type comprises an easy sample and a difficult sample, and the label sample number refers to the number of sample texts with the emotion truth labels in the training data set;
Determining a loss of emotion recognition for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag;
training the emotion recognition model based on the emotion recognition loss.
In another aspect, an embodiment of the present application provides a training apparatus for an emotion recognition model, including:
The first emotion recognition module is used for carrying out emotion recognition on sample texts in a training data set through an emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, and the training data set comprises the sample texts and emotion truth labels corresponding to the sample texts;
The weight determining module is used for determining a loss adjustment weight corresponding to the sample text according to sample characteristics of the sample text and the sample emotion recognition result, wherein the sample characteristics comprise at least one of a sample type and a label sample number of emotion truth labels corresponding to the sample text, the sample type comprises easy samples and difficult samples, and the label sample number refers to the number of sample texts with the emotion truth labels in the training data set;
A loss determination module for determining a loss of emotion recognition for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag;
and the model training module is used for training the emotion recognition model based on the emotion recognition loss.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, where the at least one instruction is loaded and executed by the processor to implement a training method of an emotion recognition model as described in the above aspect.
In another aspect, embodiments of the present application provide a computer readable storage medium having at least one instruction stored therein, the at least one instruction being loaded and executed by a processor to implement a training method for an emotion recognition model as described in the above aspect.
In another aspect, embodiments of the present application provide a computer program product comprising at least one instruction stored in a computer-readable storage medium. A processor of a computer device reads the at least one instruction from the computer-readable storage medium, the processor executing the at least one instruction causing the computer device to perform the method of training the emotion recognition model of the above aspect.
According to the method, after emotion recognition is performed on the sample texts in the training data set through the emotion recognition model to obtain the sample emotion recognition result of each sample text, the emotion recognition loss is not determined directly from the sample emotion recognition result and the emotion truth label. Instead, the sample characteristics of each sample text in the training data set are fully considered: first, a loss adjustment weight is determined according to the sample characteristics and the sample emotion recognition result; then the emotion recognition loss is determined by combining the loss adjustment weight, the sample emotion recognition result and the emotion truth label, so that the emotion recognition model is trained based on this loss. With the scheme provided by the embodiments of the application, determining the emotion recognition loss through the loss adjustment weight allows the losses of easy samples and difficult samples to be rebalanced, and, when the numbers of sample texts across different emotion truth labels are unbalanced, allows the losses of the unbalanced label samples to be adjusted as well, so that training the emotion recognition model on the adjusted loss optimizes the training effect and improves the output accuracy of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a flowchart of a training method for an emotion recognition model provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a schematic diagram of the BERT model provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a Transformer encoder according to an exemplary embodiment of the present application;
FIG. 4 illustrates a schematic diagram of an input data format of an emotion recognition model provided by an exemplary embodiment of the present application;
FIG. 5 is a diagram showing a distribution of sample text amounts corresponding to different emotion tags according to an exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of staged training of emotion recognition models provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart illustrating a method of training an emotion recognition model provided in another exemplary embodiment of the present application;
FIG. 8 illustrates a schematic diagram of determining a second loss adjustment weight in a multi-label sample text provided by an exemplary embodiment of the present application;
FIG. 9 illustrates a schematic diagram of an emotion recognition model including two networks provided by an exemplary embodiment of the present application;
FIG. 10 illustrates a flow chart for emotion recognition using an emotion recognition model provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram showing emotion development curves corresponding to two key figures provided by an exemplary embodiment of the present application;
FIG. 12 is a block diagram showing a training apparatus for emotion recognition models provided in an exemplary embodiment of the present application;
Fig. 13 is a schematic diagram showing a structure of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly covers computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to the technology of artificial intelligence such as machine learning, and the like, and is specifically described through the following embodiment.
In some embodiments, the implementation environment in the embodiments of the present application may include a terminal and a server. The terminal and the server communicate data through a communication network, optionally, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal is an electronic device installed with an application having the function of training an emotion recognition model. This function may belong to a native application of the terminal or to a third-party application. The terminal may be a smartphone, tablet computer, laptop, desktop computer, smart TV, wearable device, vehicle-mounted terminal, or the like.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms. In the embodiment of the present application, the server may be a backend server of an application having the function of training the emotion recognition model.
In some embodiments, data interaction takes place between the server and the terminal. Taking the case where the server executes the training method of the emotion recognition model provided by the embodiments of the application as an example: the terminal acquires a large number of sample texts, determines the emotion truth labels corresponding to the sample texts, generates a training data set, and sends the training data set to the server. The server performs emotion recognition on the sample texts in the training data set through the emotion recognition model to obtain the sample emotion recognition result for each sample text, determines the loss adjustment weight of each sample text according to its sample characteristics and its sample emotion recognition result, then determines the emotion recognition loss of the sample text based on the loss adjustment weight, the sample emotion recognition result and the emotion truth label, and trains the emotion recognition model with this loss. Finally, after training is completed, the server returns the model parameters of the emotion recognition model to the terminal, and the terminal applies the trained emotion recognition model to perform emotion recognition.
Referring to fig. 1, a flowchart of a training method of emotion recognition model according to an exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal and/or a server) as an example, and the method includes the following steps:
step 101, carrying out emotion recognition on sample texts in a training data set through an emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, wherein the training data set comprises the sample texts and emotion truth labels corresponding to the sample texts.
Optionally, the training dataset includes sample text and emotion truth labels corresponding to the sample text. The sample text may be a description text or a dialogue text, and the embodiment of the present application is not limited specifically for the text type of the sample text.
Optionally, the emotion truth value tag is used for representing whether the sample text contains a certain emotion type, and if the sample text contains the emotion type, the emotion truth value tag can be marked as 1; if the sample text does not contain the emotion type, the emotion truth tag may be marked as 0.
Optionally, the emotion truth value tag may further characterize that the sample text includes emotion concentration of a certain emotion type, for example, the value range of the emotion truth value tag may be 0-1, and the higher the emotion concentration, the larger the value of the emotion truth value tag.
Alternatively, one sample may have a single emotion truth label, i.e. contain only a single emotion type, such as happy, sad or angry; a sample may also have multiple emotion truth labels, i.e. contain at least two emotion types, such as mixed joy and sorrow, or mixed grief and indignation.
Alternatively, the model structure of the emotion recognition model may be Chinese-BERT-WWM; the model input is a sample text, and the model output is an emotion recognition result. BERT (Bidirectional Encoder Representations from Transformers) is an open-source language model; with BERT, a text learning paradigm of pre-training on large-scale language data followed by fine-tuning on the target task can be formed.
Illustratively, as shown in fig. 2, the BERT model includes an input stage that maps input tokens to embeddings (Embedding) to generate an input sequence E, a model core composed of a stack of Transformer encoders (abbreviated as Trm in fig. 2), and an output layer T for the target classification task. Each Transformer encoder comprises two main parts: a self-attention module and a feed-forward neural network. The structure of the Transformer encoder is shown in fig. 3. Multi-head attention is the self-attention module; it takes three sequence inputs, the Query, the Key and the Value, each transformed by a linear layer (Linear), computes attention scores through a scaled dot-product attention module (Scaled Dot-Product Attention, SDPA), concatenates the heads through a concatenation module (Concat), and finally outputs the result through another linear layer. Feed-Forward is an intermediate module comprising a fully connected layer and an activation layer (e.g., tanh activation). Add & Norm denotes residual connection and layer normalization: the Add & Norm layer sums the input of a sublayer (e.g., the Feed-Forward layer or the multi-head attention layer) with the output of that sublayer and applies layer normalization.
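For illustration only (not the patent's implementation), one such encoder block can be sketched in PyTorch as follows; the dimensions follow BERT-base defaults, and GELU is used in the feed-forward part as in BERT, where the text above mentions tanh only as one example:

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention plus feed-forward network,
    each followed by a residual connection and layer normalization
    (the "Add & Norm" step described above)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # assumption: BERT uses GELU; the text names tanh as one option
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Query, Key and Value are all projections of the same input sequence.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)    # Add & Norm after self-attention
        x = self.norm2(x + self.ff(x))  # Add & Norm after feed-forward
        return x
```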
In some embodiments, to train the emotion recognition model, the computer device may first perform emotion recognition on the sample text in the training data set through the emotion recognition model, so as to obtain sample emotion recognition results corresponding to each sample text.
Optionally, the sample text in the training data set may be text that is acquired by different text collection manners and has no relevance, for example, the sample text may be comment content of a social platform, descriptive text in a book journal, or other text containing emotion.
Alternatively, the sample text in the training dataset may also be dialogue-like text with relevance between contexts, such as interview dialogue text, contextual dialogue text, scenario dialogue text, and the like, which is not limited by the embodiments of the application.
In one possible implementation, the computer device inputs each sample text into the emotion recognition model in turn. The Embedding layer of the emotion recognition model first maps each word in the sample text to its corresponding dictionary ID (Token ID) and uses the embedding of the dictionary ID as the embedding of the word. Then, through the encoding of the multi-layer Transformer encoder, the emotion prediction probability corresponding to each emotion label task, i.e. the sample emotion recognition result, is output.
Schematically, as shown in fig. 4, the input data format of the emotion recognition model includes CLS marks and the tokenization result of the sample text (comprising word embeddings (Token Embeddings) and position embeddings (Position Embeddings)). In the embodiment of the present application, emotion labels are divided into 8 emotion types, so the CLS part is represented by eight emotion label task marks, and each mark is used to output the emotion recognition result corresponding to its emotion label.
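As an illustration of this step, the pipeline from text to per-label probabilities might look as follows with PyTorch and the HuggingFace transformers library. This is a simplified sketch: the checkpoint name hfl/chinese-bert-wwm is an assumption (the patent only names Chinese-BERT-WWM), and a single CLS representation with an 8-way sigmoid head stands in for the eight per-emotion CLS marks described above:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

# Checkpoint name is an assumption; the patent only names "Chinese-BERT-WWM".
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
backbone = BertModel.from_pretrained("hfl/chinese-bert-wwm")

NUM_EMOTIONS = 8  # trust, resentment, happiness, fear, expectation, worry, doubt, love
head = nn.Linear(backbone.config.hidden_size, NUM_EMOTIONS)

def predict_emotions(text: str) -> torch.Tensor:
    # Map each token to its dictionary ID, prepend [CLS], pad to a fixed length.
    inputs = tokenizer(text, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=128)
    hidden = backbone(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    cls_vec = hidden[:, 0]                          # representation at the CLS position
    # One independent probability per emotion label task.
    return torch.sigmoid(head(cls_vec)).squeeze(0)  # shape (8,)
```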
In one possible implementation, when the training data set contains a large number of sample texts, the computer device may batch the training data set to improve the training efficiency of the emotion recognition model. For example, when the training data set contains N sample texts, every m sample texts may form one batch, giving N/m batches, and the computer device performs emotion recognition on each batch of sample texts in turn to obtain the sample emotion recognition results of each batch.
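A sketch of such batching with a PyTorch DataLoader, using toy tensors in place of the tokenized sample texts (all names and sizes here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the training set: N samples, each with 8 binary emotion labels.
N, m = 1024, 32
features = torch.randn(N, 128)
labels = torch.randint(0, 2, (N, 8)).float()
dataset = TensorDataset(features, labels)

# Every m sample texts form one batch, giving N/m batches per epoch.
loader = DataLoader(dataset, batch_size=m, shuffle=True)
```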
Step 102, determining a loss adjustment weight corresponding to the sample text according to sample characteristics of the sample text and sample emotion recognition results, wherein the sample characteristics comprise at least one of sample types and label sample numbers of emotion truth labels corresponding to the sample text, the sample types comprise easy samples and difficult samples, and the label sample numbers refer to sample text numbers with emotion truth labels in a training data set.
Optionally, considering that different sample texts pose different prediction difficulty to the emotion recognition model, that is, some sample texts are easy for the model to predict while others are hard, sample types can be divided into easy samples and difficult samples: an easy sample is a sample text that the emotion recognition model predicts easily, and a difficult sample is one that it predicts with difficulty.
Optionally, because the collection of sample texts is uncertain, the numbers of samples corresponding to different emotion labels may differ greatly, which affects the training of the emotion recognition model: some emotion labels correspond to a large number of sample texts while others correspond to only a few. Emotion labels can therefore be divided into head labels and tail labels, where a head label is an emotion label with many corresponding sample texts and a tail label is one with few.
Schematically, fig. 5 shows the distribution of the number of sample texts corresponding to the eight emotion labels: the three emotion labels "worry", "doubt" and "happiness" correspond to clearly more sample texts and are head labels; the five emotion labels "trust", "resentment", "fear", "expectation" and "love" correspond to clearly fewer sample texts and are tail labels.
In the related art, after the sample emotion recognition result of each sample text is obtained, the emotion recognition loss is determined directly from the sample emotion recognition result and the emotion truth label, and the emotion recognition model is trained with that loss. In the embodiment of the present application, in order to improve the training effect of the emotion recognition model, the emotion recognition loss is not determined immediately after the sample emotion recognition results are obtained; instead, the loss adjustment weight of each sample text is first determined according to its sample characteristics and its sample emotion recognition result.
The sample characteristics of the sample text comprise at least one of sample types and label sample numbers of emotion truth labels corresponding to the sample text, the sample types comprise easy samples and difficult samples, and the label sample numbers refer to the sample text numbers with the emotion truth labels in the training data set.
In one possible implementation, because easy samples are easier for the emotion recognition model to predict and difficult samples are harder, the accuracy of the sample emotion recognition results for easy samples is higher than for difficult samples. To strengthen the learning of difficult samples during training, the computer device determines the loss adjustment weight of a sample text according to its sample type, making the loss adjustment weight of difficult samples greater than that of easy samples, so that the emotion recognition model focuses its learning on the difficult samples.
In another possible implementation, because the numbers of sample texts corresponding to different emotion labels differ, the emotion recognition model may repeatedly learn the features of a head label with many sample texts, while learning the features of a tail label with few sample texts insufficiently; the learning across different emotion features therefore needs to be balanced during training. Moreover, for a given tail label there may be both single-label and multi-label sample texts. Because other head labels co-occur in a multi-label sample text, the loss contribution of a multi-label sample text to the tail label is larger than that of a single-label sample text, which increases the difficulty of learning the tail label, so the loss of multi-label sample texts containing tail labels also needs to be adjusted during training. That is, the computer device can determine the loss adjustment weight of a sample text according to the label sample count of its emotion truth labels, so as to reduce the contribution of head labels to the model loss, avoid repeated learning of head-label emotion features, and raise the weight of tail labels in multi-label sample texts.
And step 103, determining emotion recognition loss of the sample text based on the loss adjustment weight, the sample emotion recognition result and the emotion truth value label.
In some embodiments, after determining the penalty adjustment weights for each sample text, the computer device may determine the emotion recognition penalty for each sample text based on the penalty adjustment weights, the sample emotion recognition results, and the emotion truth labels.
In one possible implementation, the computer device may determine the emotion recognition loss following the cross-entropy loss principle, first from the emotion truth label of the sample text and the sample emotion recognition result, and then adjust this loss with the loss adjustment weight to obtain the adjusted emotion recognition loss.
Step 104, training an emotion recognition model based on the emotion recognition loss.
In some embodiments, after determining the emotion recognition loss of each sample text, the computer device may perform a back propagation algorithm to calculate a parameter gradient of each model parameter in the emotion recognition model based on the emotion recognition loss, and then re-determine new model parameters according to the model parameters, the parameter gradient and the learning rate, and perform parameter update on the emotion recognition model, thereby completing one training of the emotion recognition model.
In one possible implementation manner, after obtaining the emotion recognition loss of each sample text, the computer device may perform accumulation processing on the emotion recognition loss of each sample text, so as to obtain an emotion recognition total loss corresponding to the current training round, and thus train the emotion recognition model by using the emotion recognition total loss.
In another possible implementation, after obtaining the total loss of emotion recognition, the computer device may further obtain an average loss of emotion recognition for the current training round based on the number of sample texts in the training data set, so as to train the emotion recognition model using the average loss of emotion recognition.
In a possible implementation manner, in the case of batch processing of sample texts in the training data set, after determining the emotion recognition loss of each batch of sample texts, the computer device can train the emotion recognition model based on the emotion recognition loss of the current batch of sample texts, and then use the trained emotion recognition model to perform emotion recognition on the sample texts of the next batch, so that the model training efficiency is improved.
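The training step just described (forward pass, loss computation, backpropagation, parameter update, optionally per batch with the mean-reduction variant) can be sketched as follows; model, optimizer and loss_fn are placeholders for the components described above, not names from the patent:

```python
import torch

def train_one_batch(model, optimizer, texts, labels, loss_fn):
    """One training iteration: forward pass, per-sample emotion recognition
    losses, reduction to a batch loss, backpropagation, parameter update."""
    optimizer.zero_grad()
    probs = model(texts)                 # (batch, 8) predicted emotion probabilities
    per_sample = loss_fn(probs, labels)  # (batch,) emotion recognition losses
    loss = per_sample.mean()             # average emotion recognition loss of the batch
    loss.backward()                      # backpropagation: compute parameter gradients
    optimizer.step()                     # update parameters with gradients and learning rate
    return loss.item()
```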
In summary, in the embodiment of the present application, after emotion recognition is performed on the sample texts in the training data set through the emotion recognition model to obtain the sample emotion recognition result of each sample text, the emotion recognition loss is not determined directly from the sample emotion recognition result and the emotion truth label. Instead, the sample characteristics of each sample text in the training data set are fully considered: the loss adjustment weight is first determined according to the sample characteristics and the sample emotion recognition result, then the emotion recognition loss is determined by combining the loss adjustment weight, the sample emotion recognition result and the emotion truth label, and the emotion recognition model is trained based on this loss. With this scheme, determining the emotion recognition loss through the loss adjustment weight allows the losses of easy and difficult samples to be rebalanced, and, when the numbers of sample texts across different emotion truth labels are unbalanced, allows the losses of the unbalanced label samples to be adjusted as well, so that training on the adjusted loss optimizes the training effect of the emotion recognition model and improves its output accuracy.
In some embodiments, considering that easy and difficult samples differ in prediction difficulty, and that the emotion recognition model learns head and tail labels differently, in order to ensure fine-grained learning of the sample texts and improve training efficiency, the computer device may divide the training of the emotion recognition model into two stages: in the first training stage, the emotion recognition model is trained with sample texts carrying a single emotion truth label; in the second training stage, it is trained with sample texts carrying at least one emotion truth label. That is, the emotion recognition model is trained in two stages.
In the first training stage, the computer device trains the emotion recognition model based on a first emotion recognition loss, which is the emotion recognition loss corresponding to the sample texts in the first training data set; the sample texts in the first training data set carry a single emotion truth label.
In one illustrative example, the emotion labels may be divided into eight types, respectively indicating trust, resentment, happiness, fear, expectation, worry, doubt and love. The computer device obtains the sample texts corresponding to each emotion type, i.e., sample texts carrying only the "trust" emotion, only the "resentment" emotion, only the "happiness" emotion, only the "fear" emotion, only the "expectation" emotion, only the "worry" emotion, only the "doubt" emotion, and only the "love" emotion respectively, and forms the first training data set from the sample texts of each emotion type.
After training the emotion recognition model by the single-tag sample text, the emotion recognition model has basic emotion recognition capability. Further, given that the emotion contained in the text is rich, there may be a text containing multiple emotion types, so in order to optimize the emotion recognition model, the computer device also needs to train the emotion recognition model with the multi-tag sample text.
Thus, in the second training stage, starting from the emotion recognition model trained in the first training stage, the computer device trains the emotion recognition model based on a second emotion recognition loss, which is the emotion recognition loss corresponding to the sample texts in the second training data set; the sample texts in the second training data set carry at least one emotion truth label.
Optionally, the second training data set includes multi-label sample texts in addition to single-label sample texts. A multi-label sample text is a sample text containing at least two emotion types, such as a sample text with both "trust" and "resentment", one with both "happiness" and "trust", or one with both "worry" and "fear".
Schematically, as shown in fig. 6, a schematic diagram of staged training of emotion recognition models is provided according to an exemplary embodiment of the present application. First, in the first training stage, the computer device inputs the first training data set 601 into the emotion recognition model 605, so as to obtain sample emotion recognition results 606 corresponding to each sample text in the first training data set 601 output by the emotion recognition model 605, and further after determining the first emotion recognition loss 602, the computer device performs first stage training on the emotion recognition model 605 based on the first emotion recognition loss 602. On the basis of completing the first-stage training, the computer device inputs the second training data set 603 into the emotion recognition model 605, so as to obtain sample emotion recognition results 607 corresponding to each sample text in the second training data set 603 output by the emotion recognition model 605, and further, after determining the second emotion recognition loss 604, the computer device performs the second-stage training on the emotion recognition model 605 based on the second emotion recognition loss 604.
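The two-stage schedule of fig. 6 can be summarized in Python; train_one_batch is the per-batch routine sketched above (any equivalent update step works), and the loader and epoch names are illustrative, not from the patent:

```python
def train_two_stages(model, optimizer, loss_fn, stage1_loader, stage2_loader,
                     epochs1, epochs2):
    """Stage 1: pre-train on single-label sample texts (first training data set);
    Stage 2: continue training the same model on the multi-label second set."""
    for _ in range(epochs1):
        for texts, labels in stage1_loader:   # each sample: exactly one emotion label
            train_one_batch(model, optimizer, texts, labels, loss_fn)
    for _ in range(epochs2):
        for texts, labels in stage2_loader:   # samples may carry several emotion labels
            train_one_batch(model, optimizer, texts, labels, loss_fn)
    return model
```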
Referring to fig. 7, a flowchart of a training method of emotion recognition model according to another exemplary embodiment of the present application is shown, where the method is used for a computer device (including a terminal and/or a server) as an example, and the method includes the following steps:
in step 701, emotion recognition is performed on the sample texts in the first training data set through the emotion recognition model, so as to obtain sample emotion recognition results corresponding to each sample text, wherein the first training data set comprises the sample texts and single emotion truth labels corresponding to the sample texts.
In some embodiments, to improve model training efficiency, before performing emotion recognition on the sample texts, the computer device may initialize the parameters of the emotion recognition model. For example, when the base model structure is Chinese-BERT-WWM, the computer device may initialize the emotion recognition model with the pre-trained parameters of the Chinese-BERT-WWM model and set the learning rate of the emotion recognition model, and then, after parameter initialization and learning-parameter setup are completed, pre-train the emotion recognition model.
Optionally, the first training data set includes sample texts and the single emotion truth label corresponding to each sample text. In the embodiment of the present application, emotion types are divided into eight kinds: trust, resentment, happiness, fear, expectation, worry, doubt and love. The numbers of sample texts corresponding to the emotion types may be the same or different, which is not limited in the embodiment of the present application.
Optionally, when the sample text carries a certain emotion type, the corresponding emotion truth label may be 1; when it does not, the corresponding emotion truth label may be 0. Illustratively, for the sample text "Yes, I believe her!", the corresponding emotion truth label is trust = 1.
In one possible implementation, the computer device inputs the sample text and its corresponding emotion truth label into the emotion recognition model. The Embedding layer of the emotion recognition model first uses the dictionary vocab.txt to map each word in the sample text to its corresponding dictionary ID (Token ID) and adds the mark CLS at the beginning of the text sequence. Considering the length differences between sample texts, the model input can also be normalized, for example by padding with Token = 0 so that the number of tokens of every sample text reaches a fixed value. The embedding of the dictionary ID is then used as the word embedding, and the model output corresponding to the sample text is produced through the encoding of the multi-layer Transformer encoder in the emotion recognition model.
The model output includes the sample emotion recognition result and the text learning content. The sample emotion recognition result corresponds to the CLS marks; CLS is represented by eight emotion label tasks, namely CLS1, CLS2, CLS3, CLS4, CLS5, CLS6, CLS7 and CLS8, and the output probability corresponding to each mark represents the emotion recognition result of the model for the emotion indicated by that mark.
Step 702, determining a first loss adjustment weight corresponding to the sample text according to the sample type of the sample text and the sample emotion recognition result, where the first loss adjustment weight is used to increase the emotion recognition loss of difficult samples.
In some embodiments, considering that different sample texts pose different prediction difficulty to the emotion recognition model, after obtaining the sample emotion recognition result of the sample text, the computer device may further determine the first loss adjustment weight according to the sample type of the sample text and the sample emotion recognition result, so as to increase the emotion recognition loss of difficult samples during model training.
Optionally, the first loss adjustment weight may be negatively correlated with how easily the sample text is predicted: for easy samples, the contribution to the model training loss needs to be reduced; for difficult samples, it needs to be increased.
Alternatively, in the case where the emotion truth label is 1, the first loss adjustment weight may be expressed as $(1-p_{i,k})^{\beta}$; in the case where the emotion truth label is 0, it may be expressed as $p_{i,k}^{\beta}$, where $p_{i,k}$ is the sample emotion recognition result, i.e. the prediction probability of sample text i on emotion label k, and $\beta$ is a preset adjustment exponent.
For an easy sample, the output probability $p_{i,k}$ is usually large, close to 1, so $1-p_{i,k}$ is small; with an exponent $\beta$ greater than 1, the first loss adjustment weight of the easy sample is small. For example, when $\beta$ is 2 and $p_{i,k}$ is 0.9, the first loss adjustment weight equals 0.01; when $\beta$ is 2 and $p_{i,k}$ is 0.95, it equals 0.0025. The contribution of easy samples to the model training loss is thus reduced.
For a difficult sample, which the emotion recognition model struggles to predict, the output probability $p_{i,k}$ is usually small, so $1-p_{i,k}$ is large; with an exponent $\beta$ greater than 1, the first loss adjustment weight of the difficult sample is large. For example, when $\beta$ is 2 and $p_{i,k}$ is 0.5, the first loss adjustment weight equals 0.25; when $\beta$ is 2 and $p_{i,k}$ is 0.2, it equals 0.64. The first loss adjustment weight of difficult samples is thus exponentially larger than that of easy samples, increasing the contribution of difficult samples to the model training loss.
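A minimal sketch of this first loss adjustment weight, reproducing the worked examples above with beta = 2 (function and variable names are illustrative):

```python
import torch

def first_loss_adjustment_weight(p: torch.Tensor, y: torch.Tensor,
                                 beta: float = 2.0) -> torch.Tensor:
    """(1 - p)^beta where the truth label is 1, p^beta where it is 0.
    Easy samples (confident, correct predictions) get small weights;
    difficult samples keep large weights."""
    return torch.where(y == 1, (1 - p) ** beta, p ** beta)

# Worked examples from the text (beta = 2, positive labels):
# p = 0.9  -> 0.01 ; p = 0.95 -> 0.0025  (easy samples, contribution reduced)
# p = 0.5  -> 0.25 ; p = 0.2  -> 0.64    (difficult samples, contribution increased)
```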
In step 703, a second loss adjustment weight corresponding to the sample text is determined according to the number of label samples of the emotion truth value label and the first label class number in the first training data set, where the first label class number refers to the total number of label classes included in the first training data set, and the second loss adjustment weight is in a negative correlation with the number of label samples.
In some embodiments, considering the problem of unbalanced distribution of samples caused by the head tag and the tail tag, after obtaining the emotion recognition result of the sample, the computer device may further determine the second loss adjustment weight corresponding to the sample text according to the number of tag samples of the emotion truth value tag and the first tag class number in the first training data set.
The first label category count refers to the total number of label categories contained in the first training data set. The second loss adjustment weight is negatively correlated with the label sample count: the more sample texts an emotion label has, the smaller the second loss adjustment weight of sample texts carrying that emotion label.
Alternatively, the second loss adjustment weight may be expressed as $R_k=\frac{1}{C\cdot n_k}$, where C represents the total number of label categories contained in the first training data set (here C equals 8) and $n_k$ represents the number of label samples corresponding to emotion label k in the first training data set. For example, for an emotion label with only 200 sample texts, the corresponding second loss adjustment weight is 1/8 * 1/200; for an emotion label with 10,000 sample texts, it is 1/8 * 1/10000. Through the second loss adjustment weight, the contribution of sample texts with head labels to the model training loss can be reduced, repeated learning of head-label information by the emotion recognition model can be avoided, and the contribution of sample texts with tail labels can be increased.
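The second loss adjustment weight can be computed directly from the per-label sample counts; a sketch (the counts below are illustrative, not from the patent):

```python
import torch

def second_loss_adjustment_weight(label_counts: torch.Tensor) -> torch.Tensor:
    """R_k = 1 / (C * n_k): inversely proportional to the number of sample
    texts n_k carrying emotion label k, with C the total number of label classes."""
    C = label_counts.numel()  # C = 8 emotion label classes here
    return 1.0 / (C * label_counts.float())

# e.g. a tail label with 200 samples -> 1/8 * 1/200;
#      a head label with 10000 samples -> 1/8 * 1/10000.
counts = torch.tensor([200, 10000, 1500, 800, 600, 9000, 7000, 400])
R = second_loss_adjustment_weight(counts)
```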
Step 704, determining a first emotion recognition loss for the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth label.
In some embodiments, after determining the first and second loss adjustment weights, the computer device may determine, following the cross-entropy loss principle, the first emotion recognition loss of each sample text from its first loss adjustment weight, second loss adjustment weight, sample emotion recognition result and emotion truth label.
Alternatively, the first emotion recognition loss may be expressed as
$L_i=-\sum_{k=1}^{C}R_k\left[y_{i,k}\,(1-p_{i,k})^{\beta}\log p_{i,k}+(1-y_{i,k})\,p_{i,k}^{\beta}\log(1-p_{i,k})\right]$
where $y_{i,k}$ is the label of sample text i on emotion label k, $p_{i,k}$ is the sample emotion recognition result, i.e. the prediction probability of sample text i on emotion label k, $\beta$ is the adjustment exponent, and $R_k$ is the second loss adjustment weight.
Optionally, in the case of a binary emotion truth label, $y_{i,k}$ is 0 or 1, and the first emotion recognition loss reduces to $-R_k(1-p_{i,k})^{\beta}\log p_{i,k}$ on positive labels and $-R_k\,p_{i,k}^{\beta}\log(1-p_{i,k})$ on negative labels.
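Putting the two weights together, a sketch of this loss under the reconstruction above; the per-label sum and clamping epsilon are assumptions for numerical stability:

```python
import torch

def first_emotion_recognition_loss(p, y, R, beta=2.0, eps=1e-7):
    """Weighted focal-style binary cross entropy per sample:
    L_i = -sum_k R_k * [ y_ik * (1-p_ik)^beta * log(p_ik)
                       + (1-y_ik) * p_ik^beta * log(1-p_ik) ]
    p, y: (batch, C) prediction probabilities and 0/1 truth labels.
    R:    (C,) second loss adjustment weights."""
    p = p.clamp(eps, 1 - eps)                 # avoid log(0)
    pos = y * (1 - p) ** beta * torch.log(p)  # positive-label term
    neg = (1 - y) * p ** beta * torch.log(1 - p)  # negative-label term
    return -(R * (pos + neg)).sum(dim=-1)     # one loss value per sample text
```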
Step 705, training the emotion recognition model based on the first emotion recognition loss.
In some embodiments, after determining the first emotion recognition loss of the sample texts, the computer device may pre-train the emotion recognition model with this loss: based on the first emotion recognition loss, a backpropagation algorithm is executed to compute the parameter gradient of each model parameter in the emotion recognition model, new model parameters are then determined from the current parameters, the parameter gradients and the learning rate, and the parameters of the emotion recognition model are updated, completing one pre-training iteration.
In one possible implementation, when the first training data set contains a large number of sample texts, the computer device may batch the first training data set to improve the training efficiency of the emotion recognition model. For example, when the first training data set contains N sample texts, every m sample texts may form one batch, giving N/m batches; completing all N/m batches represents one round (epoch) of iteration.
Optionally, to improve model convergence accuracy and generalization, the computer device may also decay the learning rate during training. For example, the learning rate may be set to 0.005 in the model initialization stage and multiplied by 0.1 after every 10 training rounds.
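A sketch of this schedule with PyTorch's StepLR; the optimizer choice (SGD) and the stand-in model are assumptions, only the initial rate 0.005 and the 0.1x decay every 10 rounds come from the text:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 8)  # stand-in for the emotion recognition model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)  # initial learning rate 0.005
# Multiply the learning rate by 0.1 after every 10 training rounds.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... run one epoch of training here ...
    scheduler.step()
```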
In one possible implementation, during training of the emotion recognition model, the computer device may record the average emotion recognition loss over the first training data set in each training round, and end the first training stage when, after multiple rounds of training, the average emotion recognition loss no longer decreases significantly.
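One way to implement such a plateau-based stopping rule; the patience and threshold values are assumptions, not from the patent:

```python
def should_stop(avg_losses, patience=5, min_delta=1e-4):
    """Stop the first training stage once the average emotion recognition loss
    recorded per round has not improved by min_delta within `patience` rounds."""
    if len(avg_losses) <= patience:
        return False
    best_recent = min(avg_losses[-patience:])
    best_before = min(avg_losses[:-patience])
    return best_before - best_recent < min_delta
```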
Step 706, based on the emotion recognition model obtained by training in the first training stage, performing emotion recognition on the sample text in the second training data set through the emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, where the second training data set includes the sample text and at least one emotion truth value tag corresponding to the sample text.
In some embodiments, after the first training stage is completed, the emotion recognition model already has basic emotion recognition capability. To improve its recognition accuracy and its ability to recognize emotion in multi-label texts, the computer device then performs emotion recognition on the sample texts in the second training data set through the emotion recognition model, obtaining the sample emotion recognition result for each sample text.
The second training data set includes sample texts and at least one emotion truth label corresponding to each sample text. In the embodiment of the present application, emotion types are divided into eight kinds: trust, resentment, happiness, fear, expectation, worry, doubt and love.
In one possible implementation, the computer device inputs the sample text and its corresponding emotion truth labels into the emotion recognition model. The Embedding layer of the emotion recognition model first uses the dictionary vocab.txt to map each word in the sample text to its corresponding dictionary ID (Token ID) and adds the mark CLS at the beginning of the text sequence. Considering the length differences between sample texts, the model input can also be normalized, for example by padding with Token = 0 so that the number of tokens of every sample text reaches a fixed value. The embedding of the dictionary ID is then used as the word embedding, and the model output corresponding to the sample text is produced through the encoding of the multi-layer Transformer encoder in the emotion recognition model.
The model output includes the sample emotion recognition result and the text learning content. The sample emotion recognition result corresponds to the CLS marks; CLS is represented by eight emotion label tasks, namely CLS1, CLS2, CLS3, CLS4, CLS5, CLS6, CLS7 and CLS8, and the output probability corresponding to each mark represents the emotion recognition result of the model for the emotion indicated by that mark.
Step 707, determining a first loss adjustment weight corresponding to the sample text according to the sample type of the sample text and the sample emotion recognition result, where the first loss adjustment weight is used to increase emotion recognition loss of the difficult sample.
In some embodiments, considering that different sample texts pose different prediction difficulty to the emotion recognition model, after obtaining the sample emotion recognition result of the sample text, the computer device may further determine the first loss adjustment weight according to the sample type of the sample text and the sample emotion recognition result, so as to increase the emotion recognition loss of difficult samples during model training.
Optionally, the first loss adjustment weight may have a negative correlation with the predicted difficulty level of the sample text, i.e. for easy samples, the contribution to model training loss needs to be reduced; for difficult samples, it is desirable to increase their contribution to model training loss.
Alternatively, in the case where the emotion truth value tag is 1, the first loss adjustment weight may be expressed as $(1 - p_{ik})^{\beta}$; in the case where the emotion truth value tag is 0, the first loss adjustment weight may be expressed as $p_{ik}^{\beta}$, where $p_{ik}$ is the sample emotion recognition result, namely the prediction probability of the sample text $i$ on the emotion label $k$, and $\beta$ is a preset adjustment exponent.
For easy samples, the output probability $p_{ik}$ is usually large and close to 1, so $(1 - p_{ik})$ is small; when the exponent $\beta$ is greater than 1, the first loss adjustment weight corresponding to an easy sample is therefore small. For example, with $\beta = 2$ and $p_{ik} = 0.9$, the first loss adjustment weight equals 0.01; with $\beta = 2$ and $p_{ik} = 0.95$, the first loss adjustment weight equals 0.0025, so the contribution of easy samples to the model training loss is reduced.
For difficult samples, which the emotion recognition model finds hard to predict, the output probability $p_{ik}$ is usually small, so $(1 - p_{ik})$ is large; when the exponent $\beta$ is greater than 1, the first loss adjustment weight corresponding to a difficult sample is therefore large. For example, with $\beta = 2$ and $p_{ik} = 0.5$, the first loss adjustment weight equals 0.25; with $\beta = 2$ and $p_{ik} = 0.2$, the first loss adjustment weight equals 0.64. Compared with easy samples, the first loss adjustment weight of difficult samples grows exponentially, so the contribution of difficult samples to the model training loss increases.
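A minimal sketch of this focal-style weighting (the function name is ours; the numerical check reproduces the examples above):

```python
import torch

def first_loss_weight(p, y, beta=2.0):
    # Focal-style modulation: (1 - p)^beta when the truth tag is 1,
    # p^beta when the truth tag is 0, so easy samples are down-weighted.
    return torch.where(y == 1, (1 - p) ** beta, p ** beta)

p = torch.tensor([0.9, 0.95, 0.5, 0.2])
y = torch.ones(4, dtype=torch.long)
print(first_loss_weight(p, y))  # ~ tensor([0.0100, 0.0025, 0.2500, 0.6400])
```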
Step 708, determining a second loss adjustment weight corresponding to the sample text according to the sample tag class number, the tag sample number of the emotion truth tag, and the second tag class number in the second training data set.
In some embodiments, considering the unbalanced sample distribution caused by head tags and tail tags, as well as the co-occurrence migration problem existing in multi-tag sample texts, after obtaining the sample emotion recognition result the computer device may further determine the second loss adjustment weight corresponding to the sample text according to the number of sample tag categories the sample text has, the number of label samples of its emotion truth value tags, and the second tag category number in the second training data set.
In one possible implementation, in the case where the sample text has a sample tag category number of 1, that is, the sample text is a single-tag sample, the computer device may determine the second loss adjustment weight corresponding to the sample text based on the number of label samples of the emotion truth value tag and the second tag category number in the second training data set, in the same manner as the second loss adjustment weight is determined in the first training stage.
Alternatively, the second loss adjustment weight may be expressed as $r = \frac{C}{n_k}$, where $C$ represents the total number of label categories contained in the second training data set ($C$ equals 8 in the embodiment of the present application), and $n_k$ represents the number of label samples corresponding to the emotion label $k$ in the second training data set.
In another possible implementation, in the case where the sample text has a sample tag category number greater than 1, that is, the sample text is a multi-tag sample, the co-occurrence migration problem of multi-tag sample texts must be considered. The computer device first determines the tag weight of each emotion truth value tag according to its number of label samples, determines the emotion truth value tag with the fewest label samples (that is, the tail tag of the multi-tag sample text) as the target emotion truth value tag, and then determines the second loss adjustment weight corresponding to the sample text according to the second tag category number and the weight difference between the tag weight of the target emotion truth value tag and the tag weights of the other emotion truth value tags.
Optionally, the tag weight may be $m_k = 1/n_k$: the more label samples an emotion truth value tag has, the lower its tag weight, so the tag weight of the tail tag in a multi-tag sample text is higher than that of the head tag.
Alternatively, the second loss adjustment weight may be expressed as $r = C\bigl(m_a - \sum_{j \neq a} m_j\bigr)$, where $m_a$ is the tag weight of the target emotion truth value tag $a$ and the sum runs over the tag weights $m_j$ of the other emotion truth value tags of the sample text; that is, the tag weights of the other emotion truth value tags are subtracted from the tag weight of the target emotion truth value tag, which can eliminate the influence of co-occurrence migration of tail tags in multi-tag sample texts.
For a single-tag sample text and a multi-tag sample text that both carry the same tail tag, the learning contribution of the emotion recognition loss of the single-tag sample text to the tail tag is larger than that of the multi-tag sample text. Taking as an example a tail tag $a$ that has a single-tag sample text $i$ and a multi-tag sample text $x$ (which carries a head tag $k$ in addition to the tail tag $a$), the second loss adjustment weight corresponding to the single-tag sample text $i$ is $r_1 = \frac{C}{n_a}$, and the second loss adjustment weight corresponding to the multi-tag sample text $x$ is $r_2 = C\bigl(\frac{1}{n_a} - \frac{1}{n_k}\bigr)$, i.e., $r_1 > r_2$.
Illustratively, as shown in fig. 8, for 3 emotion truth value tags, N1, N2, and N3 respectively represent the numbers of label samples corresponding to each emotion truth value tag in the second training data set (assume here that $N1 \geq N2 \geq N3$). The tag weight corresponding to the first emotion truth value tag 801 is $1/N1$, the tag weight corresponding to the second emotion truth value tag 802 is $1/N2$, and the tag weight corresponding to the third emotion truth value tag 803 is $1/N3$.
For sample text located in region 804, i.e., having the first emotion truth value tag 801 and the third emotion truth value tag 803, the corresponding second loss adjustment weight is $C\bigl(\frac{1}{N3} - \frac{1}{N1}\bigr)$. For sample text located in region 805, i.e., having the second emotion truth value tag 802 and the third emotion truth value tag 803, the corresponding second loss adjustment weight is $C\bigl(\frac{1}{N3} - \frac{1}{N2}\bigr)$. For sample text located in region 806, i.e., having the first emotion truth value tag 801 and the second emotion truth value tag 802, the corresponding second loss adjustment weight is $C\bigl(\frac{1}{N2} - \frac{1}{N1}\bigr)$. For sample text located in region 807, i.e., having all three emotion truth value tags, the corresponding second loss adjustment weight is $C\bigl(\frac{1}{N3} - \frac{1}{N1} - \frac{1}{N2}\bigr)$.
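A sketch of this computation (the $C/n$ form of the weights follows the reconstruction above, which is our reading of the garbled formulas, so treat it as an assumption):

```python
def second_loss_weight(label_counts, sample_labels, C=8):
    # label_counts: {label: n_k} over the second training data set
    # sample_labels: emotion truth value labels carried by this sample text
    m = {k: 1.0 / label_counts[k] for k in sample_labels}   # tag weights m_k
    if len(sample_labels) == 1:                 # single-label sample
        return C * m[sample_labels[0]]
    tail = min(sample_labels, key=lambda k: label_counts[k])  # target tag
    return C * (m[tail] - sum(w for k, w in m.items() if k != tail))

# Fig. 8 illustration with three labels and N1 >= N2 >= N3:
counts = {1: 1000, 2: 500, 3: 100}
print(second_loss_weight(counts, [1, 3], C=3))  # region 804: C*(1/N3 - 1/N1)
```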
Step 709, determining a second emotion recognition loss of the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag.
In some embodiments, after determining the first loss adjustment weight and the second loss adjustment weight, the computer device may determine, for each sample text, the second emotion recognition loss corresponding to that sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag, following the cross-entropy loss principle.
Alternatively, the second emotion recognition loss may be expressed as

$L_i = -r \sum_{k=1}^{C} \bigl[ y_{ik} (1 - p_{ik})^{\beta} \log p_{ik} + (1 - y_{ik})\, p_{ik}^{\beta} \log(1 - p_{ik}) \bigr]$

where $y_{ik}$ is the annotation record of the sample text $i$ on the emotion label $k$, $p_{ik}$ is the sample emotion recognition result, namely the prediction probability of the sample text $i$ on the emotion label $k$, $\beta$ is the adjustment exponent, and $r$ is the second loss adjustment weight.
Alternatively, since the emotion truth value tag $y_{ik}$ takes the binary value 0 or 1, the second emotion recognition loss may also be expressed piecewise as

$L_{ik} = \begin{cases} -r\,(1 - p_{ik})^{\beta} \log p_{ik}, & y_{ik} = 1 \\ -r\,p_{ik}^{\beta} \log(1 - p_{ik}), & y_{ik} = 0 \end{cases}$
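Under the reconstruction above, a sketch of this loss for one sample (the epsilon guard and the function name are ours):

```python
import torch

def second_emotion_loss(p, y, r, beta=2.0, eps=1e-8):
    # p: (C,) predicted probabilities; y: (C,) 0/1 truth tags;
    # r: scalar second loss adjustment weight; beta: adjustment exponent.
    pos = y * (1 - p) ** beta * torch.log(p + eps)
    neg = (1 - y) * p ** beta * torch.log(1 - p + eps)
    return -r * (pos + neg).sum()
```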
Step 710, training the emotion recognition model based on the second emotion recognition loss.
In some embodiments, after determining the second emotion recognition loss of the sample text, the computer device may train the emotion recognition model with the second emotion recognition loss. Based on the second emotion recognition loss, a back-propagation algorithm is executed to compute the gradient of each model parameter in the emotion recognition model; new model parameters are then determined from the current model parameters, the parameter gradients, and the learning rate, and the emotion recognition model is updated, which completes one round of training. Further, when the average emotion recognition loss in the current training round is sufficiently small or no longer decreases, the second-stage training of the emotion recognition model can be ended, yielding the trained emotion recognition model.
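A sketch of one such second-stage training driver, reusing the second_emotion_loss sketch above (the optimizer choice, learning rate, and patience rule are assumptions):

```python
import torch

def train_second_stage(model, loader, epochs=10, lr=2e-5, patience=2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for _ in range(epochs):
        total, steps = 0.0, 0
        for ids, labels, weight in loader:
            probs = model(ids)                       # per-emotion probabilities
            loss = second_emotion_loss(probs, labels, weight)
            opt.zero_grad()
            loss.backward()                          # parameter gradients
            opt.step()                               # parameter update
            total, steps = total + loss.item(), steps + 1
        avg = total / steps                          # average loss this round
        if avg < best - 1e-4:
            best, stale = avg, 0
        else:
            stale += 1
            if stale >= patience:                    # loss no longer decreasing
                return
```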
In the above embodiment, the training process of the emotion recognition model is divided into two stages, and the emotion recognition model is trained by using the sample text with a single emotion truth value tag, so that the emotion recognition model has basic emotion recognition capability, and further after the first training stage is completed, the emotion recognition model is trained by using the sample text with at least one emotion truth value tag, so that the emotion recognition model has the capability of recognizing multiple emotions, and model training is performed in stages, so that the model training effect is optimized, and the model training efficiency is improved.
In addition, in the process of determining the loss adjustment weight, the first loss adjustment weight is determined according to the sample type and the sample emotion recognition result, so that the contribution of the emotion recognition loss of the difficult sample to the model training loss is increased, the contribution of the emotion recognition loss of the easy sample to the model training loss is reduced, the emotion recognition model can perform key learning on a part of the difficult samples, and the model training quality of the emotion recognition model is improved.
In addition, a second loss adjustment weight is determined according to the number of label categories and the number of label samples. This increases the contribution of sample texts with tail labels to the model training loss and reduces the contribution of sample texts with head labels, avoiding repeated learning of head-label information by the emotion recognition model. At the same time, in the second training stage, the second loss adjustment weight of a single-label sample text with a tail label is kept greater than that of a multi-label sample text with the same tail label, which reduces the co-occurrence migration problem existing in unbalanced samples and further improves the model training quality of the emotion recognition model.
In some embodiments, to improve the accuracy of the output of the emotion recognition model, the computer device may further set a text feature learning network and an emotion detail learning network in the emotion recognition model, respectively, where the text feature learning network is used to perform basic text semantic understanding on the sample text, and the emotion detail learning network is used to further learn fine emotion information from the text semantic understanding result.
In one possible implementation manner, the computer device inputs each sample text in the training data set into the text characteristic learning network in the emotion recognition model to obtain sample semantic understanding results corresponding to each sample text output by the text characteristic learning network, and further inputs the sample semantic understanding results into the emotion detail learning network in the emotion recognition model to obtain sample emotion recognition results corresponding to each sample text output by the emotion detail learning network.
Optionally, the network structure of the text characteristic learning network may adopt the base model structure of Chinese-BERT-WWM, and the emotion detail learning network may be realized by adding multiple Transformer encoder layers on top of that base structure.
Before training the emotion recognition model, the computer device may initialize the parameters of the newly added Transformer encoder layers with a 0-1 Gaussian normal distribution, or may initialize them with the pre-trained parameters of the structurally identical layers in the Chinese-BERT-WWM model, which is not limited in the embodiment of the present application.
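A structural sketch under these choices (the layer and head counts are assumptions; hfl/chinese-bert-wwm is a public checkpoint of the named base model, used here for illustration):

```python
import torch
from torch import nn
from transformers import AutoModel

class EmotionRecognizer(nn.Module):
    def __init__(self, num_emotions=8, detail_layers=2):
        super().__init__()
        # Text characteristic learning network: Chinese-BERT-WWM backbone.
        self.backbone = AutoModel.from_pretrained("hfl/chinese-bert-wwm")
        hidden = self.backbone.config.hidden_size
        # Emotion detail learning network: newly added Transformer encoder
        # layers on top of the base structure.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                           batch_first=True)
        self.detail = nn.TransformerEncoder(layer, num_layers=detail_layers)
        self.head = nn.Linear(hidden, num_emotions)
        # One initialization option from the text: 0-1 Gaussian distribution.
        for p in self.detail.parameters():
            if p.dim() > 1:
                nn.init.normal_(p, mean=0.0, std=1.0)

    def forward(self, ids, mask=None):
        h = self.backbone(input_ids=ids, attention_mask=mask).last_hidden_state
        h = self.detail(h)                         # finer emotion information
        return torch.sigmoid(self.head(h[:, 0]))  # per-emotion probabilities
```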
Schematically, as shown in fig. 9, the computer device first inputs each sample text in the training data set 901 into the text characteristic learning network 902 in the emotion recognition model to obtain sample semantic understanding results corresponding to each sample text output by the text characteristic learning network 902, and further inputs the sample semantic understanding results into the emotion detail learning network 903 in the emotion recognition model to obtain sample emotion recognition results 904 corresponding to each sample text output by the emotion detail learning network 903.
In the above embodiment, on the basis of applying the existing text characteristic learning network, the emotion detail learning network is further added to the emotion recognition model, so that the emotion recognition model can learn finer emotion information, the model performance is optimized, and the model training effect is improved.
In some embodiments, after training the emotion recognition model is completed, the computer device may use the emotion recognition model for emotion recognition of any text, and obtain a corresponding emotion recognition result.
In one possible implementation, the emotion recognition model may be applied to social network analysis. For example, the emotion recognition model performs emotion recognition on comment texts posted by users on a social network platform, so that the emotional tendencies of users can be analyzed from the emotion recognition labels of their comment texts and used for marketing and product positioning.
In another possible implementation, the emotion recognition model may also be applied to customer feedback analysis. For example, the emotion recognition model performs emotion recognition on customer feedback texts, so that customer psychology can be analyzed from the emotion recognition labels of the feedback texts and customer satisfaction with products and services can be understood and improved.
In another possible implementation, the emotion recognition model may also be applied to advertisement evaluation. For example, the emotion recognition model performs emotion analysis on advertisement copy, helping clients understand advertisement effectiveness and optimize advertising strategies according to the emotion recognition labels of the advertisement texts.
In one possible implementation, to determine whether the emotion design of the characters in a script is reasonable and to evaluate their emotional development, the computer device may apply the emotion recognition model to the dialogue texts in the script, so as to evaluate the script quality based on the emotion recognition results of the dialogue texts. The process may include the following steps:
In step 1001, emotion recognition is performed on the target dialogue text of each key character in the script by using the trained emotion recognition model, and the target emotion label corresponding to the target dialogue text of each key character is determined, where the script includes at least one script session and the target dialogue texts of at least two key characters.
Optionally, the script includes at least one script session and the target dialogue texts of at least two key characters. A script is a text that describes the plot of a film or television work and guides its shooting. One script contains several script sessions, which may be divided by scene or by duration; the embodiment of the present application does not limit this. Each script contains at least two key characters, namely characters that account for a large proportion of the dialogue, commonly called the "leading roles".
In some embodiments, to evaluate the quality of the scenario, the computer device uses the trained emotion recognition model to perform emotion recognition on the target dialogue text of each key person in the scenario, thereby determining target emotion tags corresponding to the target dialogue text of each key person respectively.
In one possible implementation, the output of the emotion recognition model is the emotion prediction probability of each emotion label for the text, i.e., the probability that the text contains that emotion, but the model's learning ability differs across emotion labels. If a fixed threshold of 0.5 were used directly (prediction probability below 0.5 meaning the emotion is absent, above 0.5 meaning it is present), the determination accuracy of the emotion labels could be reduced. Therefore, to improve the determination accuracy of emotion labels, before applying the emotion recognition model the computer device may first perform emotion recognition on each verification text in a verification data set with the emotion recognition model, and determine the probability decision threshold corresponding to each emotion label through an optimal-value search over the emotion recognition results of the verification texts.
Optionally, after outputting the emotion recognition result corresponding to each verification text through the emotion recognition model, the computer device may search thresholds between 0 and 1 in steps of 0.05, so as to determine the optimal probability decision threshold corresponding to each emotion label.
Further, after performing emotion recognition on the target dialogue text of each key character in the script with the emotion recognition model to obtain the target emotion recognition result corresponding to each target dialogue text, the computer device may determine whether a target dialogue text contains a given emotion according to the probability decision threshold of each emotion label. If the target emotion recognition result is greater than the probability decision threshold, the target dialogue text contains the emotion, i.e., it has that emotion label; if the target emotion recognition result is smaller than the probability decision threshold, the target dialogue text does not contain the emotion, i.e., it does not have that emotion label. In this way, the target emotion labels corresponding to the target dialogue text of each key character can be determined.
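A sketch of the per-label threshold search (using F1 as the selection criterion is our assumption; the text only specifies an optimal-value search in 0.05 steps):

```python
import numpy as np

def search_thresholds(probs, truths, step=0.05):
    # probs, truths: (num_texts, num_labels) arrays of predicted probabilities
    # and 0/1 truth tags over the verification data set.
    thresholds = []
    for k in range(probs.shape[1]):
        best_t, best_f1 = 0.5, -1.0
        for t in np.arange(step, 1.0, step):       # search 0..1 in 0.05 steps
            pred = probs[:, k] >= t
            tp = np.sum(pred & (truths[:, k] == 1))
            fp = np.sum(pred & (truths[:, k] == 0))
            fn = np.sum(~pred & (truths[:, k] == 1))
            f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
            if f1 > best_f1:
                best_t, best_f1 = float(t), f1
        thresholds.append(best_t)                  # one threshold per label
    return thresholds
```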
And step 1002, performing quality evaluation on the script based on the target emotion labels corresponding to the target dialogue texts of the key characters to obtain a script quality evaluation result.
In some embodiments, after obtaining the target emotion tags corresponding to the target dialogue texts of the key characters, the computer device may determine the emotion change trend of the key characters in the scenario according to the target emotion tags, and perform quality assessment on the scenario, so as to obtain a scenario quality assessment result.
In one possible implementation manner, in order to analyze the emotion change trend of the key characters more intuitively, the computer device may generate emotion development curves of the key characters in the script according to the target emotion labels corresponding to the target dialogue texts of the key characters in the same script and the emotion prediction probabilities corresponding to the target emotion labels, so as to perform quality assessment on the script based on the emotion development curves corresponding to the key characters, and obtain a script quality assessment result.
Optionally, after performing emotion recognition on the target dialogue text through the emotion recognition model, a model output result, namely, emotion prediction probability of each emotion label by the emotion recognition model, is obtained, wherein the emotion prediction probability comprises emotion prediction probability corresponding to the target emotion label.
Regarding the manner in which the emotion development curve is generated, the computer device may convert the emotion prediction probability corresponding to the target emotion tag into an emotion score and generate the emotion development curve from the emotion scores. In one possible implementation, in the case where the target dialogue text has a single target emotion tag, the computer device may determine the emotion score of the target dialogue text from the emotion prediction probability corresponding to that tag, for example by multiplying the emotion prediction probability by 100 and rounding. In the case where the target dialogue text has at least two target emotion tags, the computer device may determine the emotion score of the target dialogue text from the average of the emotion prediction probabilities corresponding to those tags; it may also select the emotion prediction probability corresponding to one of the at least two target emotion tags for the emotion score calculation.
Optionally, the computer device may determine the emotion significance level of each target emotion tag according to the emotion prediction probability corresponding to each target emotion tag, so as to determine the emotion score of the target dialogue text according to the emotion prediction probability corresponding to the target emotion tag with the highest emotion significance level.
Regarding the manner of determining the emotion significance degree of the target emotion tags, in one possible implementation the computer device may take the emotion prediction probability corresponding to each target emotion tag as the measure of its emotion significance degree, the emotion significance degree being in positive correlation with the emotion prediction probability; that is, the target emotion tag with the highest emotion prediction probability is determined as the target emotion tag with the highest emotion significance degree.
In another possible implementation manner, the computer device may further determine, according to the emotion prediction probability and the probability decision threshold value corresponding to each target emotion tag, an emotion bias score corresponding to each target emotion tag, and further determine, according to the emotion bias scores corresponding to each target emotion tag, an emotion significance degree of each target emotion tag, where the emotion significance degree and the emotion bias score are in a positive correlation.
Optionally, the emotion bias score = (emotion prediction probability − probability decision threshold) / probability decision threshold; the higher the emotion bias score, the more likely the target dialogue text contains the emotion and the more intense that emotion is.
Further, after determining the emotion scores of the target dialog texts of the key characters, the computer device may generate an emotion development curve of the key characters in the scenario according to the appearance sequence and the emotion scores of the target dialog texts of the key characters in the scenario, with the appearance time of the target dialog texts in the scenario as an abscissa and the emotion scores of the target dialog texts as an ordinate.
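A plotting sketch of this curve construction (the score rule follows the ×100 rounding example above; matplotlib and the data layout are assumptions):

```python
import matplotlib.pyplot as plt

def plot_emotion_curve(dialogues, name="key character"):
    # dialogues: (appearance position, emotion prediction probability of the
    # chosen target emotion tag) pairs, ordered by appearance in the script.
    xs = [pos for pos, _ in dialogues]
    ys = [round(p * 100) for _, p in dialogues]   # emotion score = 100p rounded
    plt.plot(xs, ys, marker="o", label=name)
    plt.xlabel("appearance order in the script")  # abscissa: appearance time
    plt.ylabel("emotion score")                   # ordinate: emotion score
    plt.legend()
    plt.show()
```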
Optionally, the computer device may generate the emotion development curve of a key character per script session; it may also directly generate the emotion development curve of the key character over the whole script, which is not limited in the embodiment of the present application.
Optionally, the computer device may further divide the emotions into positive emotions and negative emotions, where the positive emotions include trust, joy, anticipation, and love, and the negative emotions include resentment, fear, worry, and doubt, so as to separately count the positive and negative emotion scores of a key character and generate a positive emotion development curve and a negative emotion development curve.
Further, after the emotion development curves corresponding to the key characters are obtained, the computer equipment can evaluate the quality of the script according to the emotion development trend of the key characters in the script. Alternatively, the computer device may divide the emotion development quality analysis of the characters into single emotion quality analysis and multi-emotion quality analysis, and generate scenario quality evaluation results based on single emotion evaluation results of the respective key characters and multi-emotion evaluation results between the key characters.
In one possible implementation, the computer device determines the emotion fluctuation amplitude of each key character in the script from the emotion development curve corresponding to each key character, and performs single emotion quality analysis on each key character based on the emotion fluctuation amplitude to obtain the single emotion assessment result of each key character. A large emotion fluctuation amplitude indicates that the character has great emotional elasticity, with clearly cadenced emotional ups and downs, and belongs to excellent character emotion modeling; a small emotion fluctuation amplitude indicates relatively flat emotional change, and belongs to ordinary character emotion modeling.
In one possible implementation, to ensure differentiation among the emotion modeling of the characters in the script, to avoid identical emotion development tracks across characters, or to determine whether the emotion variation between characters with similar personalities differs too much, the computer device may further determine emotion comparison results among the key characters from their emotion development curves, where the emotion comparison results include at least one of emotion development similarity and emotion development difference, and then perform multi-person emotion quality analysis based on the emotion comparison results among the key characters to obtain the multi-person emotion assessment results among the key characters. If two characters have similar personalities but greatly different emotion changes, the character modeling may not be logically consistent; if the emotion development trends of a leading character and a supporting character are similar, the character modeling lacks differentiation and is not vivid.
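A sketch of these two analyses (the concrete metrics — score range for fluctuation amplitude and Pearson correlation for development similarity — are our stand-ins; the text does not fix them):

```python
import numpy as np

def fluctuation_amplitude(scores):
    # Emotion fluctuation amplitude of one curve; range is an assumed metric.
    s = np.asarray(scores, dtype=float)
    return float(s.max() - s.min())

def development_similarity(scores_a, scores_b):
    # Emotion development similarity between two characters; Pearson
    # correlation over equal-length prefixes is an assumed metric.
    n = min(len(scores_a), len(scores_b))
    a = np.asarray(scores_a[:n], dtype=float)
    b = np.asarray(scores_b[:n], dtype=float)
    return float(np.corrcoef(a, b)[0, 1])
```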
Schematically, fig. 11 shows the emotion development curves corresponding to two key characters provided in an exemplary embodiment. For the first emotion development curve 1101, the emotion fluctuation amplitude is large, with clearly cadenced emotional ups and downs, which belongs to excellent character emotion modeling; for the second emotion development curve 1102, the emotion fluctuation amplitude is small and the emotion modeling changes relatively little, which belongs to ordinary character emotion modeling and needs optimization and adjustment.
In the above embodiment, the emotion recognition model is applied to the scenario analysis field, so that the emotion recognition model outputs the target emotion labels corresponding to each target dialogue text, and further generates the emotion development curve of the key person based on the target emotion labels, so that the emotion trend of the key person in the scenario can be more intuitively analyzed, the person modeling quality in the scenario is evaluated, scenario understanding can be assisted, and scenario creation and modification efficiency can be improved.
Referring to fig. 12, a block diagram of a training apparatus for emotion recognition model according to an exemplary embodiment of the present application is shown, the apparatus includes:
The first emotion recognition module 1201 is configured to perform emotion recognition on sample texts in a training data set through an emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, where the training data set includes the sample texts and emotion truth labels corresponding to the sample texts;
A weight determining module 1202, configured to determine, according to a sample feature of the sample text and the sample emotion recognition result, a loss adjustment weight corresponding to the sample text, where the sample feature includes at least one of a sample type and a number of label samples of emotion truth labels corresponding to the sample text, the sample type includes an easy sample and a difficult sample, and the number of label samples refers to a number of sample texts in the training dataset that have the emotion truth labels;
a loss determination module 1203, configured to determine the emotion recognition loss of the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag;
a model training module 1204 for training the emotion recognition model based on the emotion recognition penalty.
Optionally, the emotion recognition model is obtained through two-stage training; the model training module 1204 includes:
The first model training unit is used for training the emotion recognition model based on first emotion recognition loss in a first training stage, wherein the first emotion recognition loss is the emotion recognition loss corresponding to a sample text in a first training data set, and the sample text in the first training data set is provided with a single emotion truth value label;
The second model training unit is used for training the emotion recognition model based on second emotion recognition loss on the basis of the emotion recognition model obtained through training in the first training stage in a second training stage, wherein the second emotion recognition loss is the emotion recognition loss corresponding to the sample text in the second training data set, and the sample text in the second training data set is provided with at least one emotion truth label.
Optionally, during the first training phase, the weight determining module 1202 includes:
A first weight determining unit, configured to determine a first loss adjustment weight corresponding to the sample text according to the sample type of the sample text and the sample emotion recognition result, where the first loss adjustment weight is used to increase emotion recognition loss of the difficult sample;
the second weight determining unit is used for determining a second loss adjusting weight corresponding to the sample text according to the label sample number of the emotion truth value label and the first label class number in the first training data set, wherein the first label class number refers to the total number of label classes contained in the first training data set, and the second loss adjusting weight and the label sample number are in a negative correlation;
The loss determination module 1203 includes:
A first loss determination unit configured to determine the first emotion recognition loss of the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag.
Optionally, during the second training phase, the weight determining module 1202 includes:
a third weight determining unit, configured to determine, according to the sample type of the sample text and the sample emotion recognition result, a first loss adjustment weight corresponding to the sample text, where the first loss adjustment weight is used to increase emotion recognition loss of the difficult sample;
a fourth weight determining unit, configured to determine a second loss adjustment weight corresponding to the sample text according to the number of sample label categories of the sample text, the number of label samples of the emotion truth value label, and the number of second label categories in the second training dataset;
The loss determination module 1203 includes:
and a second loss determination unit configured to determine a second emotion recognition loss of the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag.
Optionally, the fourth weight determining unit is configured to:
determining the second loss adjustment weight corresponding to the sample text based on the number of label samples of the emotion truth value label and the second number of label categories in the second training data set when the sample text has a sample label category number of 1;
Under the condition that the sample label category number of the sample text is larger than 1, determining the label weight of each emotion truth value label according to the label sample number of each emotion truth value label, and determining the emotion truth value label with the least label sample number as a target emotion truth value label; and determining the second loss adjustment weight corresponding to the sample text according to the second tag class number and the weight difference value between the tag weight of the target emotion truth value tag and the tag weights of other emotion truth value tags.
Optionally, the first emotion recognition module 1201 is configured to:
Inputting each sample text in the training data set into a text characteristic learning network in the emotion recognition model to obtain sample semantic understanding results corresponding to each sample text output by the text characteristic learning network;
Inputting the sample semantic understanding result into an emotion detail learning network in the emotion recognition model to obtain the sample emotion recognition result corresponding to each sample text output by the emotion detail learning network.
Optionally, the apparatus further includes:
The second emotion recognition module is used for performing emotion recognition on target dialogue texts of all key characters in the script by using the trained emotion recognition model, and determining target emotion labels corresponding to the target dialogue texts of all the key characters, wherein the script comprises at least one script session and target dialogue texts of at least two key characters;
and the quality evaluation module is used for performing quality evaluation on the script based on the target emotion labels corresponding to the target dialogue texts of the key characters to obtain a script quality evaluation result.
Optionally, the quality evaluation module includes:
The curve generation unit is used for generating an emotion development curve of each key person in the scenario based on a target emotion label corresponding to a target dialogue text of each key person in the same scenario and emotion prediction probability corresponding to the target emotion label, wherein the emotion prediction probability is a model output result obtained by performing emotion recognition on the target dialogue text through the emotion recognition model;
And the quality evaluation unit is used for performing quality evaluation on the script based on emotion development curves corresponding to the key characters to obtain a script quality evaluation result.
Optionally, the curve generating unit is configured to:
Determining an emotion score of the target dialogue text based on emotion prediction probabilities corresponding to the target emotion tags under the condition that the target dialogue text has a single target emotion tag;
Under the condition that the target dialogue text has at least two target emotion labels, determining the emotion significance degree of each target emotion label based on emotion prediction probabilities corresponding to each target emotion label; determining the emotion score of the target dialogue text according to the emotion prediction probability corresponding to the target emotion label with the highest emotion significance degree;
Generating the emotion development curve of the key person in the scenario based on the appearance sequence of each target dialogue text of the key person in the scenario and the emotion score.
Optionally, the curve generating unit is further configured to:
Generalizing emotion prediction probabilities corresponding to all target emotion tags to obtain emotion significance degrees of all target emotion tags, wherein the emotion significance degrees and the emotion prediction probabilities are in positive correlation; or,
Determining emotion bias scores corresponding to the target emotion tags based on emotion prediction probabilities corresponding to the target emotion tags and probability judgment thresholds; and determining the emotion significance degree of each target emotion label according to the emotion bias scores corresponding to the target emotion labels, wherein the emotion significance degree and the emotion bias scores are in positive correlation.
Optionally, the quality evaluation unit is configured to:
Determining the emotion fluctuation amplitude of each key person in the scenario based on emotion development curves corresponding to each key person;
Carrying out single emotion quality analysis on each key person based on the emotion fluctuation amplitude to obtain a single emotion assessment result of each key person;
Determining emotion comparison results among the key characters based on emotion development curves corresponding to the key characters, wherein the emotion comparison results comprise at least one of emotion development similarity and emotion development difference;
carrying out multi-person emotion quality analysis based on emotion comparison results among the key people to obtain multi-person emotion assessment results among the key people;
And generating the scenario quality evaluation result based on the single emotion evaluation result of each key person and the multi-emotion evaluation result among the key persons.
In summary, in the embodiment of the present application, after the emotion recognition results of the sample texts in the training data set are obtained by performing emotion recognition on the sample texts in the emotion recognition model, the emotion recognition loss is not directly determined according to the sample emotion recognition results and the emotion truth value labels, but the sample characteristics of each sample text in the training data set are fully considered, firstly, the loss adjustment weight is determined according to the sample characteristics of the sample texts and the sample emotion recognition results, and then the emotion recognition loss is determined by combining the loss adjustment weight, the sample emotion recognition results and the emotion truth value labels, and further the emotion recognition model is trained based on the emotion recognition loss. By adopting the scheme provided by the embodiment of the application, the loss of emotion recognition is determined by using the loss adjustment weight, so that the loss of emotion recognition of easy samples and difficult samples can be adjusted, and the loss of emotion recognition of unbalanced label samples can be adjusted under the condition that the number of sample texts of different emotion truth value labels is unbalanced, thereby training an emotion recognition model based on the adjusted loss of emotion recognition, optimizing the training effect of the emotion recognition model and improving the output accuracy of the emotion recognition model.
It should be noted that: the apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and detailed implementation processes of the method embodiments are described in the method embodiments, which are not repeated herein.
It should be noted that, before and during the acquisition of user-related data such as scripts, network comment texts, customer feedback texts, and advertisement copy, the present application may display a prompt interface or popup window, or output voice prompt information, to inform the user that the relevant data is currently being collected. The present application performs the relevant step of acquiring the user-related data only after obtaining the user's confirmation operation on the prompt interface or popup window; otherwise (i.e., without such confirmation), the step of acquiring the user-related data is not performed. In other words, all information (including but not limited to user equipment information, user personal information, and the user's corresponding operation data), data (including but not limited to data for analysis, stored data, and presented data), and signals involved in the present application are authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the scripts, network comment texts, customer feedback texts, and advertisement copy involved in the present application are all acquired with full authorization.
Referring to fig. 13, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer device 1300 includes a central processing unit (CPU) 1301, a system memory 1304 including a random access memory 1302 and a read-only memory 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 may also include a basic input/output (I/O) system 1306 to facilitate the transfer of information between devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315; the application programs 1314 may include programs capable of training the emotion recognition model.
In some embodiments, the basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media include random access memory (RAM), read-only memory (ROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the above-described methods, the central processing unit 1301 executing the one or more programs to implement the training methods of the emotion recognition model provided by the respective method embodiments described above.
According to various embodiments of the application, the computer device 1300 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1300 may be connected to the network 1311 through a network interface unit 1312 coupled to the system bus 1305, or other types of networks or remote computer systems (not shown) may be coupled using the network interface unit 1312.
The embodiment of the application also provides a computer readable storage medium, wherein at least one instruction is stored in the readable storage medium, and the at least one instruction is loaded and executed by a processor to realize the training method of the emotion recognition model.
Alternatively, the computer-readable storage medium may include: ROM, RAM, solid state drive (SSD), optical disk, etc. The RAM may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM).
Embodiments of the present application provide a computer program product comprising at least one instruction stored in a computer-readable storage medium. The processor of the computer device reads the at least one instruction from the computer-readable storage medium, and the processor executes the at least one instruction, so that the computer device performs the training method of the emotion recognition model described in the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but is intended to cover all modifications, equivalents, alternatives, and improvements falling within the spirit and principles of the application.

Claims (15)

1. A method of training an emotion recognition model, the method comprising:
carrying out emotion recognition on sample texts in a training data set through an emotion recognition model to obtain sample emotion recognition results corresponding to each sample text, wherein the training data set comprises the sample texts and emotion truth labels corresponding to the sample texts;
Determining a loss adjustment weight corresponding to the sample text according to sample characteristics of the sample text and the sample emotion recognition result, wherein the sample characteristics comprise at least one of a sample type and a label sample number of emotion truth labels corresponding to the sample text, the sample type comprises an easy sample and a difficult sample, and the label sample number refers to the number of sample texts with the emotion truth labels in the training data set;
Determining a loss of emotion recognition for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag;
training the emotion recognition model based on the emotion recognition loss.
2. The method of claim 1, wherein the emotion recognition model is obtained by two-stage training; the training the emotion recognition model based on the emotion recognition loss includes:
training the emotion recognition model based on a first emotion recognition loss in a first training stage, wherein the first emotion recognition loss is an emotion recognition loss corresponding to a sample text in a first training data set, and the sample text in the first training data set is provided with a single emotion truth value tag;
And training the emotion recognition model based on second emotion recognition loss on the basis of the emotion recognition model obtained by training in the first training stage in the second training stage, wherein the second emotion recognition loss is the emotion recognition loss corresponding to the sample text in the second training data set, and the sample text in the second training data set is provided with at least one emotion truth value label.
3. The method according to claim 2, wherein in the first training phase, the determining the loss adjustment weight corresponding to the sample text according to the sample feature of the sample text and the sample emotion recognition result includes:
determining a first loss adjustment weight corresponding to the sample text according to the sample type of the sample text and the sample emotion recognition result, wherein the first loss adjustment weight is used for improving emotion recognition loss of the difficult sample;
determining a second loss adjustment weight corresponding to the sample text according to the label sample number of the emotion truth value label and a first label class number in the first training data set, wherein the first label class number refers to the total number of label classes contained in the first training data set, and the second loss adjustment weight and the label sample number are in a negative correlation relationship;
The determining the emotion recognition loss for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag comprises:
determining the first emotion recognition loss of the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag.
4. The method according to claim 2, wherein in the second training phase, the determining the loss adjustment weight corresponding to the sample text according to the sample feature of the sample text and the sample emotion recognition result includes:
determining a first loss adjustment weight corresponding to the sample text according to the sample type of the sample text and the sample emotion recognition result, wherein the first loss adjustment weight is used for improving emotion recognition loss of the difficult sample;
Determining a second loss adjustment weight corresponding to the sample text according to the sample tag class number, the tag sample number of the emotion truth tag and a second tag class number in the second training data set;
The determining the emotion recognition loss for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag comprises:
determining a second emotion recognition loss of the sample text based on the first loss adjustment weight, the second loss adjustment weight, the sample emotion recognition result, and the emotion truth value tag.
5. The method of claim 4, wherein determining the second loss adjustment weight corresponding to the sample text based on the sample tag class number the sample text has, the tag sample number of the emotion truth tag, and the second tag class number in the second training dataset comprises:
determining the second loss adjustment weight corresponding to the sample text based on the number of label samples of the emotion truth value label and the second number of label categories in the second training data set when the sample text has a sample label category number of 1;
Under the condition that the sample label category number of the sample text is larger than 1, determining the label weight of each emotion truth value label according to the label sample number of each emotion truth value label, and determining the emotion truth value label with the least label sample number as a target emotion truth value label; and determining the second loss adjustment weight corresponding to the sample text according to the second tag class number and the weight difference value between the tag weight of the target emotion truth value tag and the tag weights of other emotion truth value tags.
6. The method according to claim 1, wherein performing emotion recognition on the sample text in the training data set through the emotion recognition model to obtain a sample emotion recognition result corresponding to each sample text comprises:
Inputting each sample text in the training data set into a text characteristic learning network in the emotion recognition model to obtain sample semantic understanding results corresponding to each sample text output by the text characteristic learning network;
Inputting the sample semantic understanding result into an emotion detail learning network in the emotion recognition model to obtain the sample emotion recognition result corresponding to each sample text output by the emotion detail learning network.
7. The method according to claim 1, wherein the method further comprises:
Carrying out emotion recognition on target dialogue texts of each key person in the script by utilizing the trained emotion recognition model, and determining target emotion labels corresponding to the target dialogue texts of each key person, wherein the script comprises at least one script session and target dialogue texts of at least two key persons;
and carrying out quality evaluation on the script based on target emotion labels corresponding to target dialogue texts of the key characters to obtain a script quality evaluation result.
8. The method of claim 7, wherein the performing quality evaluation on the script based on the target emotion labels corresponding to the target dialogue texts of the key characters to obtain the script quality evaluation result comprises:
generating an emotion development curve of each key character in the script based on the target emotion label corresponding to the target dialogue text of each key character in the same script and the emotion prediction probability corresponding to the target emotion label, wherein the emotion prediction probability is a model output obtained by performing emotion recognition on the target dialogue text through the emotion recognition model;
and performing quality evaluation on the script based on the emotion development curves corresponding to the key characters to obtain the script quality evaluation result.
9. The method of claim 8, wherein the generating an emotion development curve of each key character in the script based on the target emotion label corresponding to the target dialogue text of each key character in the same script and the emotion prediction probability corresponding to the target emotion label comprises:
determining an emotion score of the target dialogue text based on the emotion prediction probability corresponding to the target emotion label when the target dialogue text has a single target emotion label;
when the target dialogue text has at least two target emotion labels, determining the emotion saliency of each target emotion label based on the emotion prediction probability corresponding to each target emotion label, and determining the emotion score of the target dialogue text according to the emotion prediction probability corresponding to the target emotion label with the highest emotion saliency;
and generating the emotion development curve of the key character in the script based on the appearance order and the emotion scores of the target dialogue texts of the key character in the script.
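The sketch below shows one way to turn per-line predictions into such a curve. The signed polarity mapping used to make the score a scalar is an assumption; the claim only requires that the score be derived from the prediction probability of the most salient label:

```python
# Hedged sketch of claim 9: convert each dialogue line's predicted
# emotion probabilities into a scalar emotion score, then order the
# scores by appearance to form the character's emotion development
# curve. The polarity mapping and score definition are assumptions.
POLARITY = {"joy": 1.0, "surprise": 0.5, "neutral": 0.0,
            "sadness": -1.0, "anger": -1.0, "fear": -0.5}  # assumed mapping

def emotion_score(predicted):
    """predicted: dict mapping target emotion label -> prediction probability."""
    if len(predicted) == 1:
        label, prob = next(iter(predicted.items()))
    else:
        # Keep only the most salient emotion; highest probability stands
        # in here for the claim's "emotion saliency".
        label = max(predicted, key=predicted.get)
        prob = predicted[label]
    return POLARITY.get(label, 0.0) * prob

def emotion_development_curve(dialogue_predictions):
    """dialogue_predictions: per-line prediction dicts, in appearance order."""
    return [emotion_score(p) for p in dialogue_predictions]

curve = emotion_development_curve([
    {"joy": 0.9}, {"fear": 0.6, "anger": 0.7}, {"sadness": 0.8}])
# -> [0.9, -0.7, -0.8]: a swing from joy into negative emotions.
```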
10. The method of claim 9, wherein determining the emotion saliency of each target emotion label based on the emotion prediction probability corresponding to each target emotion label comprises:
normalizing the emotion prediction probabilities corresponding to the target emotion labels to obtain the emotion saliency of each target emotion label, wherein the emotion saliency is positively correlated with the emotion prediction probability; or,
determining an emotion bias score corresponding to each target emotion label based on the emotion prediction probability corresponding to the target emotion label and a probability judgment threshold, and determining the emotion saliency of each target emotion label according to the emotion bias score, wherein the emotion saliency is positively correlated with the emotion bias score.
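The two saliency options read naturally as the following pair of functions. Both the softmax normalization and the threshold-offset bias score are plausible readings rather than formulas given in the claim:

```python
# Sketch of the two emotion saliency options in claim 10.
import math

def saliency_by_normalization(probs):
    # Option 1: normalize probabilities (softmax here, as an assumption)
    # so that saliency rises monotonically with probability.
    exp = {label: math.exp(p) for label, p in probs.items()}
    total = sum(exp.values())
    return {label: v / total for label, v in exp.items()}

def saliency_by_bias_score(probs, threshold=0.5):
    # Option 2: score each label by how far its probability exceeds the
    # probability judgment threshold; saliency rises with the bias score.
    return {label: p - threshold for label, p in probs.items()}

probs = {"anger": 0.72, "fear": 0.55}
print(saliency_by_normalization(probs))  # anger ~ 0.54, fear ~ 0.46
print(saliency_by_bias_score(probs))     # anger ~ 0.22, fear ~ 0.05
```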
11. The method of claim 8, wherein the performing quality evaluation on the script based on the emotion development curves corresponding to the key characters to obtain the script quality evaluation result comprises:
determining the emotion fluctuation amplitude of each key character in the script based on the emotion development curve corresponding to the key character;
performing single-character emotion quality analysis on each key character based on the emotion fluctuation amplitude to obtain a single-character emotion evaluation result for each key character;
determining emotion comparison results among the key characters based on the emotion development curves corresponding to the key characters, wherein the emotion comparison results comprise at least one of emotion development similarity and emotion development difference;
performing multi-character emotion quality analysis based on the emotion comparison results among the key characters to obtain a multi-character emotion evaluation result among the key characters;
and generating the script quality evaluation result based on the single-character emotion evaluation result of each key character and the multi-character emotion evaluation result among the key characters.
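As a concrete illustration of the curve analytics above: a simple amplitude statistic and one similarity/difference pair between two characters' curves. The concrete statistics (range, Pearson correlation, mean absolute gap) are assumptions; the claim leaves them open:

```python
# Illustrative curve analytics for claim 11. Requires Python 3.10+
# for statistics.correlation.
import statistics

def fluctuation_amplitude(curve):
    # Assumed amplitude measure: the emotion score range over the script.
    return max(curve) - min(curve)

def emotion_comparison(curve_a, curve_b):
    n = min(len(curve_a), len(curve_b))
    a, b = curve_a[:n], curve_b[:n]
    # Emotion development similarity: Pearson correlation of the curves.
    similarity = statistics.correlation(a, b)
    # Emotion development difference: mean absolute gap between the curves.
    difference = sum(abs(x - y) for x, y in zip(a, b)) / n
    return {"similarity": similarity, "difference": difference}

hero = [0.9, 0.4, -0.7, -0.8, 0.6]
rival = [-0.8, -0.3, 0.6, 0.7, -0.5]
print(fluctuation_amplitude(hero))      # 1.7
print(emotion_comparison(hero, rival))  # similarity ~ -1: mirrored arcs
```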
12. A training device for an emotion recognition model, the device comprising:
a first emotion recognition module, configured to perform emotion recognition on sample texts in a training data set through an emotion recognition model to obtain a sample emotion recognition result corresponding to each sample text, wherein the training data set comprises the sample texts and emotion truth value labels corresponding to the sample texts;
a weight determining module, configured to determine a loss adjustment weight corresponding to a sample text according to sample characteristics of the sample text and the sample emotion recognition result, wherein the sample characteristics comprise at least one of a sample type and a label sample number of the emotion truth value label corresponding to the sample text, the sample type comprises an easy sample and a difficult sample, and the label sample number refers to the number of sample texts in the training data set that have the emotion truth value label;
a loss determining module, configured to determine an emotion recognition loss for the sample text based on the loss adjustment weight, the sample emotion recognition result, and the emotion truth value label;
and a model training module, configured to train the emotion recognition model based on the emotion recognition loss.
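To tie the four modules together, here is a compact, hypothetical training step reusing `EmotionRecognitionModel` and `second_emotion_recognition_loss` from the sketches above. The weight heuristics (hard-sample upweighting from prediction error, inverse frequency of the rarest truth label) are assumptions, not the patent's formulas:

```python
# Hypothetical training step mirroring the four modules of claim 12.
import torch

def train_step(model, optimizer, batch, label_counts, total_samples):
    # First emotion recognition module: forward pass over the batch.
    logits = model(batch["input_ids"], batch["attention_mask"])
    probs = torch.sigmoid(logits)
    labels = batch["labels"].float()
    # Weight determining module: easy/hard weight from prediction error,
    # frequency weight from the rarest truth label in each sample.
    hardness = (probs - labels).abs().mean(dim=1)   # high for hard samples
    w_first = 1.0 + hardness.detach()
    counts = label_counts.expand_as(labels).clone() # (num_emotions,) float tensor
    counts[~labels.bool()] = float("inf")           # ignore absent labels
    rarest = counts.min(dim=1).values               # rarest truth label per sample
    w_second = (total_samples / rarest).clamp(max=10.0)
    # Loss determining module + model training module.
    loss = second_emotion_recognition_loss(logits, labels, w_first, w_second)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```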
13. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is executed by the processor to implement the training method of the emotion recognition model according to any one of claims 1 to 11.
14. A computer-readable storage medium, storing at least one instruction, wherein the at least one instruction is executed by a processor to implement the training method of the emotion recognition model according to any one of claims 1 to 11.
15. A computer program product, comprising at least one instruction stored in a computer-readable storage medium, wherein a processor of a computer device reads the at least one instruction from the computer-readable storage medium and executes it, causing the computer device to implement the training method of the emotion recognition model according to any one of claims 1 to 11.
CN202410339691.1A 2024-03-22 2024-03-22 Emotion recognition model training method, device, equipment and storage medium Pending CN118245602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410339691.1A CN118245602A (en) 2024-03-22 2024-03-22 Emotion recognition model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118245602A 2024-06-25

Family

ID=91552554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410339691.1A Pending CN118245602A (en) 2024-03-22 2024-03-22 Emotion recognition model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118245602A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination