CN108305643B - Method and device for determining emotion information - Google Patents

Method and device for determining emotion information

Info

Publication number
CN108305643B
Authority
CN
China
Prior art keywords
text
information
audio
emotion
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710527121.5A
Other languages
Chinese (zh)
Other versions
CN108305643A (en)
Inventor
刘海波 (Liu Haibo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710527121.5A
Priority to PCT/CN2018/093085 (WO2019001458A1)
Publication of CN108305643A
Application granted
Publication of CN108305643B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining emotion information. The method comprises the following steps: acquiring target audio, wherein the target audio comprises a plurality of audio segments; identifying a plurality of pieces of first text information from the plurality of audio segments, any one piece of first text information being identified from a corresponding one of the audio segments, the audio segments having speech features and the first text information having text features; and determining target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information. The invention solves the technical problem in the related art that the emotion information of a speaker cannot be accurately identified.

Description

Method and device for determining emotion information
Technical Field
The invention relates to the field of Internet, in particular to a method and a device for determining emotion information.
Background
With the growth of multimedia content, the market now demands content summarization technology that allows content to be viewed and listened to in a short time. In addition, content is becoming increasingly diverse, for example movies, dramas, home videos, news, documentaries, music content, live scenes and novels, and the viewing requirements of audiences are accordingly becoming more diverse.
With this diversification of viewing requirements, techniques are needed that can immediately identify a viewer's requirements and present suitable views and scenes. For example, in content summarization technology, the content is summarized based on the text information it contains, and the emotion carried by that text information, such as laughter, anger or sadness, is determined by analyzing the text.
In such analysis, an audio-based emotion detection method can be applied to the speaker's audio. Using audio for emotion detection works well when the speaker shows obvious emotional expression, but when the speaker's emotional expression is weak, for example when a very happy event is described in a flat tone, the audio carries almost no features expressing that happiness. In this situation speech-based emotion detection loses its effect: there is no way to make an accurate judgment from the speech features alone, and an incorrect judgment may even be obtained.
For the technical problem in the related art that the emotion information of a speaker cannot be accurately identified, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining emotion information, which are used for at least solving the technical problem that the emotion information of a speaker cannot be accurately identified in the related technology.
According to one aspect of the embodiments of the present invention, a method for determining emotion information is provided, the method including: acquiring target audio, wherein the target audio comprises a plurality of audio segments; identifying a plurality of pieces of first text information from the plurality of audio segments, any one piece of first text information being identified from a corresponding one of the audio segments, the audio segments having speech features and the first text information having text features; and determining target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information.
According to another aspect of the embodiments of the present invention, an apparatus for determining emotion information is also provided, including: a first acquisition unit configured to acquire target audio, wherein the target audio comprises a plurality of audio segments; a recognition unit configured to recognize a plurality of pieces of first text information from the plurality of audio segments, any one piece of first text information being recognized from a corresponding one of the audio segments, the audio segments having speech features and the first text information having text features; and a first determining unit configured to determine target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information.
In the embodiments of the invention, when the target audio is acquired, a piece of first text information is recognized from each audio segment of the target audio, and the target emotion information of that audio segment is then determined based on the text features of the first text information and the speech features of the audio segment. When the text information shows obvious emotion, the emotion information can be determined from the text features; when the audio segment shows obvious emotion, the emotion information can be determined from the speech features; and each audio segment yields its own emotion recognition result. This solves the technical problem in the related art that the emotion information of a speaker cannot be accurately recognized, and achieves the technical effect of improving the accuracy of the recognized emotion information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment for a method of determining emotion information according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of determining emotion information according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative model training method according to an embodiment of the present invention;
FIG. 4 is a flow diagram of an alternative model training method according to an embodiment of the present invention;
FIG. 5 is a flow chart of an alternative method of determining emotion information according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative apparatus for determining emotion information according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative apparatus for determining emotion information according to an embodiment of the present invention; and
FIG. 8 is a block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
example 1
According to an embodiment of the present invention, an embodiment of a method for determining emotion information is provided.
Optionally, in this embodiment, the method for determining emotion information may be applied to a hardware environment formed by the server 102 and the terminal 104 shown in fig. 1. As shown in fig. 1, the server 102 is connected to the terminal 104 via a network, which includes, but is not limited to, a wide area network, a metropolitan area network or a local area network; the terminal 104 includes, but is not limited to, a PC, a mobile phone, a tablet computer, and the like. The method for determining emotion information according to the embodiment of the present invention may be executed by the server 102, by the terminal 104, or jointly by the server 102 and the terminal 104. When executed by the terminal 104, it may be executed by a client installed on the terminal.
When the method for determining emotion information according to the embodiment of the present invention is executed by a server or a terminal alone, the program code corresponding to the method of the present application may be executed directly on the server or the terminal.
When the server and the terminal jointly execute the method for determining emotion information, the terminal initiates the request to recognize the target audio and sends the target audio to be recognized to the server; the server executes the program code corresponding to the method and feeds the recognition result back to the terminal.
The following describes an embodiment of the present application in detail, taking as an example program code corresponding to the method of the present application that is executed on a server or a terminal. Fig. 2 is a flowchart of an optional method for determining emotion information according to an embodiment of the present invention; as shown in fig. 2, the method may include the following steps:
Step S202, target audio is obtained, the target audio comprises a plurality of audio segments, and the target audio is used for expressing text information.
The terminal can actively acquire the target audio, or receive the target audio sent by other equipment, or acquire the target audio under the trigger of a target instruction. The target command corresponds to a command for recognizing the target audio triggered by the user or the terminal. The target audio is obtained to identify the emotion information of each audio segment in the target audio, which is the emotion information displayed when the text information is expressed by the target audio (including but not limited to the emotion information displayed by words or characters in the text, tones and timbres in the audio).
The text information refers to a sentence or a combination of sentences; a text includes, but is not limited to, a sentence, a paragraph or a discourse (chapter).
The emotion information is information describing the emotion of the speaker; for example, when talking about a certain event the speaker may express an emotion on the happiness scale (happy, calm, sad), and when receiving an apology the speaker may express an emotion related to forgiveness (forgiving or not), and so on.
When the target audio is a sentence, an audio segment is a phrase or a word in the sentence; when the target audio is a longer speech passage, an audio segment is a sentence, or a phrase or word within a sentence.
Step S204, a plurality of pieces of first text information are recognized from the plurality of audio segments, any one piece of first text information being recognized from a corresponding one of the audio segments, the audio segments having speech features and the first text information having text features.
The recognition of the first text information from the audio segment means that the first text information expressed by the audio segment is recognized by means of speech recognition (here, the recognized first text information may be slightly different from the text information actually expressed).
For speech recognition, the speech features include the following: perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBANK), pitch (e.g., how high or low the voice is), speech energy (ENERGY), i-vector (an important feature that reflects acoustic differences between speakers), and the like. The features used in this application may be one or more of the above, preferably a fusion of several.
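As an illustration of how such a fused acoustic feature vector might be assembled, the following sketch uses the librosa library; the library choice, frame sizes and feature set (MFCC, FBANK, energy, pitch; PLP and i-vector omitted) are assumptions for illustration, not part of the patent.

```python
# Illustrative sketch of acoustic feature extraction; librosa and the
# frame parameters below are assumptions, not part of the patent text.
import numpy as np
import librosa

def acoustic_features(wav_path, sr=16000, frame_ms=25, hop_ms=10):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)

    # MFCC: 13 cepstral coefficients per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    # Log Mel filter-bank energies (FBANK-style features)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40))
    # Frame energy (RMS) and fundamental frequency (pitch)
    energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr, frame_length=n_fft * 4, hop_length=hop)

    # Fuse the per-frame features into one matrix (frames x dims)
    n = min(mfcc.shape[1], fbank.shape[1], energy.shape[1], len(f0))
    return np.vstack([mfcc[:, :n], fbank[:, :n], energy[:, :n], f0[None, :n]]).T
```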
For text recognition, the first text information can be recognized from the audio segment through a speech recognition engine. The text features of the text information include features such as the emotion type, emotional tendency and emotional intensity of each phrase or word in the text, and may also include features of the association relationships between phrases.
Step S206, determining target emotion information of the plurality of audio segments based on the voice characteristics of the plurality of audio segments and the text characteristics of the plurality of first text information.
In the process of determining the target emotion information of the target audio, the text features of the first text information and the speech features of the target audio are considered together. In the related art, only an audio-based emotion detection method is applied to the speaker's audio; this works well when the speaker shows obvious emotional expression, but when the speaker's emotional expression is weak, for example when a very happy event is described in a flat tone, the audio carries almost no features expressing that happiness. In such a case, the text information in the speaker's audio can still be detected by a text-based emotion detection method, so that an accurate judgment can be made from the text features to make up for the deficiency of detecting emotion from the audio alone, thereby improving the accuracy of the judgment result.
Moreover, for a section of audio with emotion changes, because each audio section obtains corresponding target emotion information, the obtained result can be more accurate.
Through steps S202 to S206, when the target audio is acquired, a piece of first text information is recognized from each audio segment of the target audio, and the target emotion information of that audio segment is then determined based on the text features of the first text information and the speech features of the audio segment. When the text information shows obvious emotion, the emotion information can be determined from the text features; when the audio segment shows obvious emotion, the emotion information can be determined from the speech features; and each audio segment yields its own emotion recognition result. This solves the technical problem in the related art that the emotion information of a speaker cannot be accurately recognized, and achieves the technical effect of improving the accuracy of the recognized emotion information.
Detecting the speaker's audio with an audio-based emotion detection method works well when the speaker's speech carries obvious emotional expression, while a text-based emotion detection method works well when the text information in the speaker's audio carries obvious emotional expression.
The applicant has observed that if a text with fairly obvious emotion is spoken in a flat tone (for example, a happy text expressed flatly), the text-based emotion detection method recognizes the emotion markedly better, whereas if a fairly flat text is spoken with obvious emotion (for example, a flat text expressed in a happy voice), the audio-based emotion detection method recognizes it markedly better. A text with obvious emotion may be expressed either in a flat tone or in a tone with obvious emotion, and a text with little emotion may likewise be expressed either way; however, a text with an obvious positive emotion is generally not expressed in a tone of the opposite emotion, for example a text with a happy emotional color is rarely expressed in a sad tone.
The method can therefore weight and sum the text output result and the audio output result to obtain a final result; the final result is not a single sum over the whole audio, but is computed segment by segment.
Based on the above knowledge, the target speech can be determined to be speech with emotional color as long as either the speech or the text has an obvious emotional color (i.e., emotion information of the first emotion level). Determining the target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information includes determining the target emotion information of each audio segment as follows: acquiring a first recognition result determined according to the text features of the first text information, wherein the first recognition result represents the emotion information recognized from the text features; acquiring a second recognition result determined according to the speech features of the audio segment corresponding to the first text information, wherein the second recognition result represents the emotion information recognized from the speech features; and when the emotion information represented by at least one of the first recognition result and the second recognition result is emotion information of the first emotion level, determining the target emotion information of the audio segment to be that emotion information of the first emotion level.
The first emotion level is a level with relatively obvious emotion information, as opposed to information tending to the neutral middle (no obvious emotion). For example, for the group of emotions happy, calm and sad, emotion information of the first emotion level refers to happy or sad rather than calm; other types of emotion information are treated similarly and are not described again here.
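A minimal sketch of this level-based decision rule is given below; the label names and the choice of "calm" as the neutral (non-first-level) emotion are illustrative assumptions.

```python
# A minimal sketch of the level-based decision rule described above; the
# label names and the "calm" neutral level are illustrative assumptions.
NEUTRAL = "calm"  # the middle emotion with no obvious emotional tendency

def fuse_by_level(text_emotion: str, audio_emotion: str) -> str:
    """If either recognizer reports a clearly non-neutral emotion
    (first emotion level), that emotion is taken as the target emotion
    of the audio segment; otherwise the segment is treated as neutral."""
    if text_emotion != NEUTRAL:
        return text_emotion
    if audio_emotion != NEUTRAL:
        return audio_emotion
    return NEUTRAL
```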
In the above-mentioned technical solution of the present application, the identification includes, but is not limited to, performing the feature identification and the emotion information identification by using a general algorithm or a machine learning related algorithm, and for the purpose of improving the accuracy, the feature identification and the emotion information identification may be performed by using a machine learning related algorithm.
(1) Training process based on text recognition
Before steps S202 to S206 of the present application are executed, the algorithm models may be trained: before the target audio is obtained, a second convolutional neural network model (the original convolutional neural network model) is trained using second text information (training text) and first emotion information to determine the values of the parameters in the second convolutional neural network model, and the second convolutional neural network model with the determined parameter values is taken as the first convolutional neural network model, wherein the first emotion information is the emotion information of the second text information. As shown in fig. 3:
Step S301, performing word segmentation on the second text information.
Word segmentation is applied to the training sentence. For example, for a sentence meaning "I got paid today and I am extraordinarily happy", the word segmentation result is: today, got, paid, I, extraordinarily, happy. The emotion label (actual emotion information) of this training sentence is happy.
Step S302, performing word vectorization on the segmented words through Word2Vec.
Word vectors, as the name implies, represent a word in the form of a vector. Because the machine learning task needs to quantize the input into a numerical representation and then compute the final desired result by fully utilizing the computing power of the computer, it is necessary to quantize the words.
An n x k matrix is formed according to the number of words obtained by segmenting the training sentence, where n is the number of words in the training sentence and k is the dimension of the word vectors; the matrix type can be fixed or dynamic, chosen according to the specific situation.
At present Word2Vec has several mature and stable algorithms, and either CBOW or Skip-gram can be selected for the implementation. Both the CBOW model and the Skip-gram model can be built on a Huffman tree, in which the intermediate vectors stored at the non-leaf nodes are initialized to zero vectors and the word vectors of the words at the leaf nodes are initialized randomly.
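A small sketch of this word-vectorization step using gensim's Word2Vec is shown below (parameter names follow gensim 4.x); the corpus, vector dimension and other hyper-parameters are assumptions for illustration.

```python
# Sketch of word vectorization with gensim's Word2Vec; the corpus,
# vector size and other hyper-parameters are illustrative assumptions.
from gensim.models import Word2Vec

segmented_sentences = [
    ["today", "got", "paid", "I", "extraordinarily", "happy"],  # word-segmented training sentences
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; hs=1 enables the
# hierarchical-softmax (Huffman tree) training mentioned above.
w2v = Word2Vec(sentences=segmented_sentences, vector_size=100, window=5,
               min_count=1, sg=0, hs=1)

# An n x k matrix for one sentence: n words, each a k-dimensional vector
sentence_matrix = [w2v.wv[w] for w in segmented_sentences[0]]
```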
Step S303, feature extraction is carried out on the convolution layer of the second convolution neural network model.
The n x k matrix generated in the previous step passes through the convolution layer to obtain a plurality of single-column matrices; this layer acts like a feature extraction layer. With n words, each represented by a k-dimensional vector, the sentence can be expressed as
x_{1:i+j} = x_1 ⊕ x_2 ⊕ ... ⊕ x_{i+j}, i.e., the combination (concatenation) of the word vectors x_1, x_2, ..., x_{i+j}, where the symbol ⊕ denotes the concatenation operation. A convolution operation corresponds to a filter: a window of l words is used to generate a new feature, denoted c_i, and the convolution operation is
c_i = f(w · x_{i:i+l-1} + b), where w is the filter weight, b is a bias term and f is a non-linear activation function.
Over the different word windows {x_{1:l}, x_{2:l+1}, ..., x_{n-l+1:n}} the filter generates a new feature sequence c = [c_1, c_2, ..., c_{n-l+1}]; by using several filters with different window lengths, a plurality of single-column matrices can be generated.
And step S304, performing pooling processing on the pool layer of the second convolutional neural network model.
From the single-column matrices generated in the previous step, new features can be selected according to the actual situation (for example, the maximum value of each); after passing through this layer, features of fixed dimension are formed, which solves the problem of varying sentence length.
Step S305, the neural network (NN) layers of the second convolutional neural network model perform processing to obtain a classification result (i.e., the second text feature).
With m filters used in the previous step, if each filter selects its maximum value as a new feature through the pooling operation, an m-dimensional feature vector z is formed (its m-th component is the feature with the largest value in the feature sequence c of the m-th filter, and m is greater than 1). Each NN layer computes y_i = w · z + b (where w represents the weights and b the bias), and the final output (i.e., the second text feature) is obtained after passing through several NN layers.
Step S306, adjusting and optimizing the parameters through back-propagation (BP) in the second convolutional neural network model.
An appropriate loss function is applied to the output generated in the previous step and the true output (usually a maximum-entropy or minimum mean-square-error function is used as the loss function), the parameters of the CNN model are updated using stochastic gradient descent, and the model is optimized through multiple iterations.
The stochastic gradient descent update is W_{i+1} = W_i - η ΔW_i, where η is the learning rate, W_i is the weight before the iteration (i.e., a parameter of the model) and W_{i+1} is the weight after the iteration.
For the maximum-entropy loss function, the partial derivatives of the loss with respect to the weight w and the bias b are computed, and w and b are updated one by one using stochastic gradient descent.
The BP algorithm updates the w and b of the preceding layers layer by layer, starting from the last layer, and the CNN model (the first convolutional neural network model) is obtained once the training process is completed.
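The following sketch shows a TextCNN of the kind described in steps S303 to S306, with convolution over the n x k word-vector matrix, max-pooling per filter, NN (fully connected) layers and SGD training; the layer sizes, window lengths and learning rate are assumptions, not values from the patent.

```python
# A compact TextCNN sketch matching the convolution / pooling / NN-layer
# structure described in steps S303-S306; sizes and settings are assumptions.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, k=100, num_classes=3, windows=(2, 3, 4), m=64):
        super().__init__()
        # m filters per window length l, applied over the n x k sentence matrix
        self.convs = nn.ModuleList(
            nn.Conv1d(in_channels=k, out_channels=m, kernel_size=l) for l in windows)
        self.fc = nn.Linear(m * len(windows), num_classes)  # y = w.z + b

    def forward(self, x):            # x: (batch, n, k) word-vector matrix
        x = x.transpose(1, 2)        # -> (batch, k, n) for Conv1d
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]  # max-pool per filter
        return self.fc(torch.cat(feats, dim=1))

model = TextCNN()
criterion = nn.CrossEntropyLoss()                         # loss between output and the true label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # W <- W - eta * grad(W)

def train_step(sentence_matrix, label):
    optimizer.zero_grad()
    loss = criterion(model(sentence_matrix), label)
    loss.backward()                  # back-propagation through the NN, pooling and conv layers
    optimizer.step()                 # stochastic gradient descent update
    return loss.item()
```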
It should be noted that in the training process, the association relationship between the emotion information and the text features is mined, so that the obtained first convolutional neural network model can identify the emotion information according to the association relationship.
(2) Speech based DNN training process
Before executing the above steps S202 to S206 of the present application, the training of the algorithm model may further include: before the target audio is obtained, training the second deep neural network model by using the training audio (or the training voice) and the second emotion information to determine the value of the parameter in the second deep neural network model, and setting the second deep neural network model after the value of the parameter is determined as the first deep neural network model, wherein the second emotion information is the emotion information of the training audio. The following is detailed in conjunction with fig. 4:
Step S401, framing the training audio.
A speech signal is a quasi-stationary signal and is usually divided into frames during processing, each frame being about 20 ms to 30 ms long; within such an interval the speech signal can be regarded as stationary, and since only stationary information can be processed, framing is required first.
Step S402, extracting features from the speech frames after the training audio has been framed, and sending the speech features, the emotion annotations and the text features to the DNN model.
Feature extraction is performed on the training speech. Various features can be extracted, such as PLP, MFCC, FBANK, pitch, energy and i-vector; one or more of them may be extracted, and the features preferably used in this application are a fusion of several of them.
Step S403, training the DNN model.
The features extracted in the first step are expanded with their preceding and following frames and then fed into the DNN. The transfer between the intermediate layers of the DNN is the same as for an NN layer in the CNN, and the weight update method is also the same: according to the error between the output produced from the training features and the actual labels, the partial derivatives of the loss function with respect to w and b are computed, w and b are updated using back-propagation (BP) and stochastic gradient descent, and the DNN model is optimized through multiple rounds of iteration, as in the CNN case. The DNN model (the first deep neural network model) is obtained after the training process is completed.
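A corresponding sketch of the frame-level DNN with front/back frame expansion is given below; the context width, hidden sizes and number of classes are assumptions for illustration.

```python
# Sketch of the frame-level DNN with front/back frame expansion; the
# context width, layer sizes and class count are assumptions.
import torch
import torch.nn as nn

def expand_context(frames: torch.Tensor, context: int = 5) -> torch.Tensor:
    """Stack each frame with its `context` preceding and following frames."""
    padded = torch.cat([frames[:1].repeat(context, 1), frames,
                        frames[-1:].repeat(context, 1)], dim=0)
    return torch.stack([padded[i:i + 2 * context + 1].flatten()
                        for i in range(frames.shape[0])])

class EmotionDNN(nn.Module):
    def __init__(self, frame_dim, context=5, hidden=256, num_classes=3):
        super().__init__()
        in_dim = frame_dim * (2 * context + 1)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, x):
        return self.net(x)

# Training mirrors the CNN case: a loss between the frame outputs and the
# emotion labels, back-propagation, and stochastic gradient descent updates.
```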
It should be noted that in the training process, the association relationship between the emotion information and the speech features is mined, so that the obtained first deep neural network model can recognize the emotion information according to the association relationship.
In the technical solution provided in step S202, target audio is obtained; for example, a piece of audio input by a user through an audio input device (e.g., a microphone) is obtained on a terminal.
In the technical solution provided in step S204, before the plurality of pieces of first text information are recognized from the plurality of audio segments, silence detection is performed on the target audio to detect the silent segments in it, and the plurality of audio segments included in the target audio are identified according to the silent segments, with a silent segment lying between any two adjacent audio segments.
The audio is divided into different segments according to where silence occurs in it; silence detection can be implemented by energy-based, zero-crossing-rate-based or model-based methods, among others.
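The following is a minimal energy-based silence-detection sketch of the kind mentioned above (zero-crossing-rate and model-based variants are equally possible); the frame length and thresholds are assumptions.

```python
# Minimal energy-based silence detection; thresholds are assumptions.
import numpy as np

def split_on_silence(samples, sr, frame_ms=25, energy_thresh=1e-4, min_silence_frames=20):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = np.array([np.mean(samples[i*frame_len:(i+1)*frame_len] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh

    segments, start, silence_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:   # a long enough silent stretch closes the segment
                segments.append((start * frame_len, (i - silence_run + 1) * frame_len))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments  # sample ranges of the audio segments between silent segments
```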
After the plurality of audio segments are determined, a plurality of pieces of first text information are recognized from them, any one piece of first text information being recognized from a corresponding one of the audio segments, the audio segments having speech features (i.e., acoustic features) and the first text information having text features.
The extraction and selection of acoustic features is an important link in speech recognition. Acoustic feature extraction greatly compresses the information and also acts as a signal deconvolution process, so that the pattern classifier can separate the information better. Because speech signals are time-varying, feature extraction must be performed on small segments of the signal, i.e., short-time analysis. Each analyzed segment, which is regarded as stationary, is called a frame, and the shift from frame to frame is typically 1/2 or 1/3 of the frame length. In general, when extracting speech features from the target audio, the signal can be pre-emphasized to boost the high frequencies and windowed to avoid edge effects of the short-time speech segments. The above process of obtaining the first text information may be implemented by a speech recognition engine.
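A short sketch of the pre-emphasis, framing and windowing steps described here follows; the 30 ms frame with a 10 ms shift (one third of the frame length) and the Hamming window are typical choices assumed for illustration.

```python
# Sketch of pre-emphasis, framing and windowing ahead of feature extraction;
# the frame length, shift and window type are assumed typical values.
import numpy as np

def preemphasis(x, alpha=0.97):
    # Boost high frequencies: y[t] = x[t] - alpha * x[t-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, sr, frame_ms=30, shift_ms=10):
    frame_len, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    window = np.hamming(frame_len)   # soften the short-time segment edges
    return np.stack([x[i*shift:i*shift + frame_len] * window for i in range(n_frames)])
```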
In the technical solution provided in step S206, target emotion information of the plurality of audio segments is determined based on the speech features of the plurality of audio segments and the text features of the plurality of first text information. The technical scheme provided by step S206 includes at least the following two implementation manners:
(1) Mode one
Determining the target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information includes determining the target emotion information of each audio segment as follows: acquiring a first recognition result determined according to the text features of the first text information, wherein the first recognition result represents the emotion information recognized from the text features; acquiring a second recognition result determined according to the speech features of the audio segment corresponding to the first text information, wherein the second recognition result represents the emotion information recognized from the speech features; and when the emotion information represented by at least one of the first recognition result and the second recognition result is emotion information of the first emotion level, determining the target emotion information of the audio segment to be that emotion information of the first emotion level. For example, for the group of emotions happy, calm and sad, if either of the first recognition result and the second recognition result is happy or sad, the final result (the target emotion information) is happy or sad, and the influence of calm, which has no obvious emotional tendency, is ignored.
The first recognition result and the second recognition result may be directly recognized emotion information, or may be other information (e.g., emotion score, emotion type, etc.) indicating recognized emotion information.
Optionally, the text feature is identified by a first convolutional neural network model, and when a first identification result determined according to the text feature of the first text information is obtained, the first identification result determined according to the text feature identified from the first text information is directly obtained from the first convolutional neural network model.
The obtaining of the first recognition result determined by the first convolutional neural network model according to the text feature recognized from the first text information includes: performing feature extraction on the first text information on a plurality of feature dimensions through a feature extraction layer of the first convolutional neural network model to obtain a plurality of text features, wherein one text feature (namely one or more features with the largest feature value) is obtained on each feature dimension; and performing feature recognition on a first text feature in the plurality of text features through a classification layer of the first convolutional neural network model to obtain a first recognition result, wherein the text features comprise the first text feature and a second text feature, and the feature value of the first text feature is greater than the feature value of any one second text feature.
The recognition of the voice features is realized through the first deep neural network model, and when a second recognition result determined according to the voice features of the audio segment corresponding to the first text information is obtained, the second recognition result determined according to the voice features recognized from the audio segment is directly obtained from the first deep neural network model.
(2) Mode two
Determining the target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information can be realized by the following steps: acquiring a first recognition result determined according to the text features, wherein the first recognition result includes a first emotion parameter indicating the emotion information recognized from the text features; acquiring a second recognition result determined according to the speech features, wherein the second recognition result includes a second emotion parameter indicating the emotion information recognized from the speech features; setting a third emotion parameter final_score indicating the target emotion information as final_score = a * score1 + (1 - a) * score2, where score1 is the first emotion parameter with weight a and score2 is the second emotion parameter with weight (1 - a); and determining the emotion information of the second emotion level as the target emotion information, wherein the second emotion level is the emotion level corresponding to the emotion-parameter interval in which the third emotion parameter lies, and each emotion level corresponds to one emotion-parameter interval.
Note that when the first recognition result determined from the text features is acquired, and when the second recognition result determined from the speech features is acquired, the calculation may be performed with reference to the models used in mode one above.
Optionally, after the target emotion information of the plurality of audio segments is determined based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information, the audio segments are played one by one and the target emotion information of each audio segment is displayed; feedback information from the user is then received, the feedback information including indication information indicating whether the recognized target emotion information is correct, and, in the case that it is incorrect, the actual emotion information identified by the user from the played audio segment.
If the recognized target emotion information is incorrect, the recognition accuracy of the convolutional neural network model and the deep neural network model needs to be improved; in particular, the recognition rate is worse for the audio information that was recognized incorrectly. A negative-feedback mechanism is therefore used to improve the recognition rate: specifically, the incorrectly recognized audio information can be used to retrain the convolutional neural network model and the deep neural network model in the manner described above and to reassign the parameters of the two models, thereby improving the recognition accuracy.
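A rough sketch of this negative-feedback loop is shown below; the data structure, the feedback object's attributes and the retraining hooks are assumptions, standing in for the CNN/DNN training procedures described above.

```python
# Illustrative sketch of the negative-feedback mechanism: mis-recognized
# segments and the user-supplied labels are collected and fed back into
# the CNN / DNN training routines. All names here are assumptions.
misrecognized = []   # (audio_segment, text, correct_emotion) triples

def on_user_feedback(audio_segment, text, predicted, feedback):
    # feedback.correct / feedback.actual_emotion are hypothetical fields
    if not feedback.correct:
        misrecognized.append((audio_segment, text, feedback.actual_emotion))

def periodic_retrain(text_model, audio_model, retrain_text, retrain_audio):
    # retrain_text / retrain_audio stand in for the training steps of fig. 3 and fig. 4
    if misrecognized:
        retrain_text(text_model, [(t, e) for _, t, e in misrecognized])
        retrain_audio(audio_model, [(a, e) for a, _, e in misrecognized])
        misrecognized.clear()
```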
As an alternative embodiment, an embodiment of the present application is detailed below with reference to fig. 5:
Step S501, performing silence detection, and dividing the target audio into a plurality of audio segments.
Step S502, the audio segment is framed.
During processing, the signal is divided into frames about 20 ms to 30 ms long; within such an interval the speech signal can be regarded as stationary, which facilitates signal processing.
In step S503, speech features (i.e., acoustic features) in the audio segment are extracted.
The extracted speech features include, but are not limited to, perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBANK), pitch, speech energy and i-vector.
Step S504, the speech features of the audio segment are recognized by the DNN model.
The DNN model performs recognition based on the speech features extracted above (PLP, MFCC, FBANK, pitch, speech energy, i-vector).
In step S505, a second recognition result score2 is obtained.
In step S506, speech recognition is performed on the audio segment by the speech recognition engine.
In the training phase of the speech recognition engine, each word in the vocabulary can be spoken in turn and its feature vectors can be stored as templates in the template library.
In the recognition stage, the acoustic feature vector of the input speech is compared for similarity with each template in the template library in turn, and the word with the highest similarity is output as the recognition result.
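The template-matching idea can be sketched as below; this toy version compares cosine similarity of single feature vectors, whereas a real speech recognition engine uses far richer acoustic models, so everything here is an illustrative assumption.

```python
# Toy sketch of the template-matching idea: the template library maps
# each vocabulary word to a stored feature vector, and recognition picks
# the most similar template.
import numpy as np

template_library = {}   # word -> feature vector stored during training

def enroll(word, feature_vector):
    template_library[word] = np.asarray(feature_vector, dtype=float)

def recognize(feature_vector):
    v = np.asarray(feature_vector, dtype=float)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    # output the word whose template is most similar to the input features
    return max(template_library, key=lambda w: cosine(template_library[w], v))
```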
In step S507, a text recognition result (i.e., the first text information) is obtained.
Step S508, performing word segmentation on the first text information. For example, for a sentence meaning "The holiday starts tomorrow, I am so happy", the word segmentation result is: tomorrow, about to, holiday, I, so, happy, oh.
Step S509, the multiple words obtained by the word segmentation are used as input of a CNN model, and the CNN model performs convolution, classification, and recognition processing on the multiple words.
Step S510, obtaining a first recognition result score1 output by the CNN model.
Step S511, fusing the recognition results to obtain the final result.
The input target audio undergoes feature extraction along two paths. In one path, the features are used for speech recognition: a speech recognition result is obtained through the speech recognition engine, the result is word-segmented and sent to the text emotion detection engine, and a text emotion score score1 is obtained. In the other path, the extracted features are sent to audio emotion detection to obtain an audio score score2. The final score final_score is then obtained with a weighting factor:
final_score=a*score1+(1-a)*score2。
Here a is a weight obtained by tuning on a development set, and the final score lies between 0 and 1.
For example, the score interval corresponding to sadness is [0, 0.3), the interval corresponding to calm is [0.3, 0.7), and the interval corresponding to happiness is [0.7, 1], so the actual emotion can be determined to be happy, sad or calm according to the final score.
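Putting the fusion and the interval mapping together, a minimal sketch might look as follows; the default weight and the label strings are assumptions.

```python
# Sketch of the score fusion and interval mapping described above; the
# weight a would come from tuning on a development set.
def fuse_scores(score1: float, score2: float, a: float = 0.5) -> float:
    return a * score1 + (1 - a) * score2   # final_score in [0, 1]

def score_to_emotion(final_score: float) -> str:
    if final_score < 0.3:
        return "sad"      # [0, 0.3)
    if final_score < 0.7:
        return "calm"     # [0.3, 0.7)
    return "happy"        # [0.7, 1]
```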
In the embodiments of the present application, a method based on the fusion of text and speech can be used to compensate for the shortcomings of each method used on its own, and a weighting factor can be introduced during the fusion to adjust the weights of the two methods so as to suit different occasions. The application can be divided into two modules, a training module and a recognition module; the training module can be trained independently, with different texts and audio selected for different situations. Three emotion categories are used in this application, such as happy, neutral and unhappy, and the degree of happiness or unhappiness can be represented by a score between 0 and 1, values closer to zero indicating a more negative emotion and values closer to 1 a more positive one. A concrete application is the emotion judgment of audio segments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the invention, an apparatus for determining emotion information is also provided for implementing the above method for determining emotion information. Fig. 6 is a schematic diagram of an alternative apparatus for determining emotion information according to an embodiment of the present invention; as shown in fig. 6, the apparatus may include: a first acquisition unit 61, a recognition unit 62 and a first determination unit 63.
The first acquisition unit 61 acquires target audio, wherein the target audio includes a plurality of audio segments.
The terminal can actively acquire the target audio, or receive the target audio sent by other equipment, or acquire the target audio under the trigger of a target instruction. The target command corresponds to a command for recognizing the target audio triggered by the user or the terminal. The target audio is obtained to identify the emotion information of each audio segment in the target audio, which is the emotion information displayed when the text information is expressed by the target audio (including but not limited to the emotion information displayed by words or characters in the text, tones and timbres in the audio).
The text information refers to a sentence or a combination of sentences; a text includes, but is not limited to, a sentence, a paragraph or a discourse (chapter).
The emotion information is information describing the emotion of the speaker; for example, when talking about a certain event the speaker may express an emotion on the happiness scale (happy, calm, sad), and when receiving an apology the speaker may express an emotion related to forgiveness (forgiving or not), and so on.
When the target audio is a sentence, an audio segment is a phrase or a word in the sentence; when the target audio is a longer speech passage, an audio segment is a sentence, or a phrase or word within a sentence.
The recognition unit 62 is configured to recognize a plurality of first text information from a plurality of audio segments, any one of the first text information being recognized from a corresponding one of the audio segments, the audio segment having speech characteristics, the first text information having text characteristics.
The recognition of the first text information from the audio segment means that the first text information expressed by the audio segment is recognized by means of speech recognition (here, the recognized first text information may be slightly different from the text information actually expressed).
For speech recognition, the speech features include the following: perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBANK), pitch (e.g., how high or low the voice is), speech energy (ENERGY), i-vector (an important feature that reflects acoustic differences between speakers), and the like. The features used in this application may be one or more of the above, preferably a fusion of several.
For text recognition, the first text information can be recognized from the audio segment through a speech recognition engine. The text features of the text information include features such as the emotion type, emotional tendency and emotional intensity of each phrase or word in the text, and may also include features of the association relationships between phrases.
And a first determining unit 63, configured to determine target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features that the plurality of first text information have.
In the process of determining the target emotion information of the target audio, the text features of the first text information and the speech features of the target audio are considered together. In the related art, only an audio-based emotion detection method is applied to the speaker's audio; this works well when the speaker shows obvious emotional expression, but when the speaker's emotional expression is weak, for example when a very happy event is described in a flat tone, the audio carries almost no features expressing that happiness. In such a case, the text information in the speaker's audio can still be detected by a text-based emotion detection method, so that an accurate judgment can be made from the text features to make up for the deficiency of detecting emotion from the audio alone, thereby improving the accuracy of the judgment result.
Moreover, for a section of audio with emotion changes, because each audio section obtains corresponding target emotion information, the obtained result can be more accurate.
It should be noted that the first obtaining unit 61 in this embodiment may be configured to execute step S202 in embodiment 1 of this application, the identifying unit 62 in this embodiment may be configured to execute step S204 in embodiment 1 of this application, and the first determining unit 63 in this embodiment may be configured to execute step S206 in embodiment 1 of this application.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of embodiment 1 described above. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.
Through the above modules, when the target audio is acquired, a piece of first text information is recognized from each audio segment of the target audio, and the target emotion information of that audio segment is then determined based on the text features of the first text information and the speech features of the audio segment. When the text information shows obvious emotion, the emotion information can be determined from the text features; when the audio segment shows obvious emotion, the emotion information can be determined from the speech features; and each audio segment yields its own emotion recognition result. This solves the technical problem in the related art that the emotion information of a speaker cannot be accurately recognized, and achieves the technical effect of improving the accuracy of the recognized emotion information.
Detecting the speaker's audio with an audio-based emotion detection method works well when the speaker's speech carries obvious emotional expression, while a text-based emotion detection method works well when the text information in the speaker's audio carries obvious emotional expression.
The applicant has observed that if a text with fairly obvious emotion is spoken in a flat tone (for example, a happy text expressed flatly), the text-based emotion detection method recognizes the emotion markedly better, whereas if a fairly flat text is spoken with obvious emotion (for example, a flat text expressed in a happy voice), the audio-based emotion detection method recognizes it markedly better. A text with obvious emotion may be expressed either in a flat tone or in a tone with obvious emotion, and a text with little emotion may likewise be expressed either way; however, a text with an obvious positive emotion is generally not expressed in a tone of the opposite emotion, for example a text with a happy emotional color is rarely expressed in a sad tone.
The method can therefore weight and sum the text output result and the audio output result to obtain a final result; the final result is not a single sum over the whole audio, but is computed segment by segment.
Based on the above knowledge, the target speech can be determined to be speech with emotional color as long as either the speech or the text has an obvious emotional color (i.e., emotion information of the first emotion level).
Optionally, as shown in fig. 7, the apparatus of the present application may further include: a second acquisition unit 64, configured to acquire, after the target emotion information of the plurality of audio segments is determined based on the speech features of the plurality of audio segments and the text features of the plurality of pieces of first text information, the emotion level to which each piece of target emotion information belongs; and a second determining unit 65, configured to determine that the emotion information of the target audio is emotion information of the first emotion level when the plurality of pieces of target emotion information include emotion information of the first emotion level.
The first determining unit of the present application determines the target emotion information of each audio segment as follows: acquiring a first recognition result determined according to the text features of the first text information, wherein the first recognition result represents the emotion information recognized from the text features; acquiring a second recognition result determined according to the speech features of the audio segment corresponding to the first text information, wherein the second recognition result represents the emotion information recognized from the speech features; and when the emotion information represented by at least one of the first recognition result and the second recognition result is emotion information of the first emotion level, determining the target emotion information of the audio segment to be that emotion information of the first emotion level.
When acquiring the first recognition result determined according to the text features of the first text information, the first determining unit acquires, from the first convolutional neural network model, the first recognition result determined according to the text features recognized from the first text information.
In the process of obtaining the first recognition result determined by the first convolutional neural network model according to the text features recognized from the first text information, feature extraction is performed on the first text information over a plurality of feature dimensions through the feature extraction layer of the first convolutional neural network model to obtain a plurality of text features, with one text feature extracted per feature dimension; feature recognition is then performed on a first text feature among the plurality of text features through the classification layer of the first convolutional neural network model to obtain the first recognition result, where the plurality of text features include the first text feature and second text features, and the feature value of the first text feature is greater than the feature value of any second text feature.
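The following toy sketch follows the above description literally: a feature extraction layer produces one text feature per feature dimension, and the classification layer then operates on the feature with the largest value. The PyTorch framing, layer sizes, three emotion classes and overall structure are assumptions for illustration, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class TextEmotionCNN(nn.Module):
    """Toy text model: one text feature per feature dimension, then a
    classification layer applied to the largest ('first') text feature."""

    def __init__(self, vocab_size=10000, embed_dim=64, num_dims=32, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Feature extraction layer: one convolution filter per feature dimension.
        self.conv = nn.Conv1d(embed_dim, num_dims, kernel_size=3, padding=1)
        # Classification layer applied to the selected (largest) text feature.
        self.classify = nn.Linear(1, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)            # (batch, embed_dim, seq_len)
        feats = torch.relu(self.conv(x)).max(dim=2).values    # one feature per dimension
        first_feature = feats.max(dim=1, keepdim=True).values # largest feature value
        return self.classify(first_feature)                   # emotion logits

logits = TextEmotionCNN()(torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 3])
```

In practice a classifier would more commonly consume all pooled features; the single-feature classification here only mirrors the wording of the description above.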
The first determining unit acquires, from the first deep neural network model, the second recognition result determined according to the speech features of the audio segment corresponding to the first text information.
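A comparably small sketch of such a deep neural network over per-segment speech features might be (the 40-dimensional acoustic feature vector, layer sizes and three classes are assumptions):

```python
import torch
import torch.nn as nn

# Illustrative only: a small fully connected network mapping per-segment
# speech features (e.g., 40 assumed acoustic statistics such as MFCC means)
# to emotion logits.
audio_emotion_dnn = nn.Sequential(
    nn.Linear(40, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 3),          # e.g., happy / normal / unhappy
)

speech_features = torch.randn(4, 40)             # 4 audio segments, 40 features each
print(audio_emotion_dnn(speech_features).shape)  # torch.Size([4, 3])
```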
Optionally, the apparatus of the present application may further include: a detection unit, configured to perform silence detection on the target audio before the plurality of first text information are recognized from the plurality of audio segments, so as to detect silence segments in the target audio; and a third determining unit, configured to identify, according to the silence segments, the plurality of audio segments included in the target audio, where a silence segment exists between any two adjacent audio segments.
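A simple energy-based silence detection that splits the target audio into audio segments separated by silence segments could be sketched as follows (the frame length and energy threshold are assumptions; a production system would likely use a proper voice activity detector):

```python
import numpy as np

def split_on_silence(samples, sample_rate, frame_ms=20, energy_thresh=1e-4):
    """Return (start_sample, end_sample) pairs for non-silent audio segments.

    Frames whose mean energy falls below energy_thresh are treated as silence;
    runs of voiced frames between silence segments become audio segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    voiced = [np.mean(samples[i*frame_len:(i+1)*frame_len] ** 2) > energy_thresh
              for i in range(n_frames)]
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len
        elif not v and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

audio = np.concatenate([np.zeros(8000), 0.1 * np.random.randn(16000), np.zeros(8000)])
print(split_on_silence(audio, sample_rate=16000))
```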
In the embodiment of the present application, a method based on the fusion of text and speech can be adopted to make up for the shortcomings of either method used alone, and a weight factor can be added during the fusion to adjust the relative weights of the two methods so as to suit different scenarios. The application can be divided into two modules, a training module and a recognition module; the training module can be trained separately, with different texts and audio selected for different situations. Three emotion categories are used in this application, for example happy, normal and unhappy, and the degree of happiness or unhappiness can be represented by a score: the emotion score lies between 0 and 1, where a value closer to 0 indicates a more negative emotion and a value closer to 1 indicates a more positive emotion. A specific application is the emotion judgment of an audio segment.
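For illustration, the mapping from a fused score in [0, 1] to the three emotion categories could be sketched as follows (the thresholds and label names are assumptions):

```python
def score_to_emotion(score, low=0.4, high=0.6):
    """Map a fused emotion score in [0, 1] to one of three categories.

    Illustrative thresholds only: closer to 0 means more negative,
    closer to 1 means more positive."""
    if score < low:
        return "unhappy"
    if score > high:
        return "happy"
    return "normal"

for s in (0.1, 0.5, 0.9):
    print(s, "->", score_to_emotion(s))
```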
It should be noted here that the above modules correspond to the same examples and application scenarios as their corresponding steps, but are not limited to the disclosure of embodiment 1 above. It should also be noted that, as a part of the apparatus, the above modules may run in a hardware environment such as that shown in fig. 1, and may be implemented in software or in hardware, where the hardware environment includes a network environment.
Example 3
According to the embodiment of the present invention, a server or a terminal (i.e., an electronic device) for implementing the above method for determining emotion information is also provided.
Fig. 8 is a block diagram of a terminal according to an embodiment of the present invention. As shown in fig. 8, the terminal may include: one or more processors 801 (only one is shown in fig. 8), a memory 803, and a transmission apparatus 805 (such as the transmission apparatus in the above embodiment); as shown in fig. 8, the terminal may further include an input/output device 807.
the memory 803 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for determining emotion information in the embodiments of the present invention, and the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 803, so as to implement the method for determining emotion information. The memory 803 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 803 may further include memory located remotely from the processor 801, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The above transmission device 805 is used to receive or send data via a network, and may also be used for data transmission between the processor and the memory. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 805 includes a network adapter (NIC, Network Interface Controller), which can be connected to other network devices and a router via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 805 is a radio frequency (RF) module, which is used to communicate with the internet wirelessly.
Specifically, the memory 803 is used to store an application program.
The processor 801 may call, via the transmission device 805, the application program stored in the memory 803 to perform the following steps: acquiring target audio, wherein the target audio includes a plurality of audio segments; recognizing a plurality of first text information from the plurality of audio segments, where any one piece of first text information is recognized from a corresponding audio segment, the audio segment has speech features, and the first text information has text features; and determining target emotion information of the plurality of audio segments based on the speech features of the plurality of audio segments and the text features of the plurality of first text information.
The processor 801 is further configured to perform the following steps: acquiring the emotion level to which each piece of target emotion information among the plurality of target emotion information belongs; and, when the plurality of target emotion information include emotion information of the first emotion level, determining the emotion information of the target audio to be the emotion information of the first emotion level.
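A minimal sketch of this audio-level aggregation step (illustrative only; the level labels are assumptions):

```python
def determine_audio_emotion(segment_results, first_level="obvious"):
    """Aggregate per-segment target emotion information into the emotion
    information of the whole target audio: if any segment reaches the first
    emotion level, the target audio is assigned that level."""
    levels = [level for level, _ in segment_results]
    return first_level if first_level in levels else "flat"

# Example: per-segment (emotion level, fused score) pairs for three segments.
segments = [("flat", 0.5), ("obvious", 0.9), ("flat", 0.45)]
print(determine_audio_emotion(segments))  # -> "obvious"
```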
By adopting the embodiment of the present invention, when the target audio is obtained, a piece of first text information is recognized from each audio segment of the target audio, and the target emotion information of that audio segment is then determined based on the text features of the first text information and the speech features of the audio segment. Emotion information can be determined from the text features when the text information carries an obvious emotional expression, and from the speech features when the audio segment carries an obvious emotional expression, with each audio segment yielding its own emotion recognition result. This solves the technical problem in the related art that the emotion information of a speaker cannot be recognized accurately, and achieves the technical effect of improving the accuracy of recognizing the speaker's emotion information.
Optionally, the specific examples in this embodiment may refer to the examples described in embodiment 1 and embodiment 2, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 8 is only illustrative, and the terminal may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and so on. Fig. 8 merely illustrates one possible structure of the electronic device; for example, the terminal may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 8, or have a configuration different from that shown in fig. 8.
those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 4
The embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the method for determining emotion information.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s11, obtaining a target audio, wherein the target audio comprises a plurality of audio segments;
S12, recognizing a plurality of first text information from the plurality of audio segments, wherein any one piece of first text information is recognized from a corresponding audio segment, the audio segment has speech features, and the first text information has text features;
and S13, determining target emotion information of the plurality of audio segments based on the voice characteristics of the plurality of audio segments and the text characteristics of the plurality of first text information.
optionally, the storage medium is further arranged to store program code for performing the steps of:
S21, acquiring the emotion level to which each piece of target emotion information among the plurality of target emotion information belongs;
And S22, when the plurality of target emotion information include emotion information of a first emotion level, determining the emotion information of the target audio as the emotion information of the first emotion level.
Optionally, the specific examples in this embodiment may refer to the examples described in embodiment 1 and embodiment 2, and this embodiment is not described herein again.
optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for determining emotion information, comprising:
Obtaining target audio, wherein the target audio comprises a plurality of audio segments;
Identifying a plurality of first text messages from a plurality of the audio segments, wherein any one of the first text messages is identified from a corresponding one of the audio segments, the audio segment has speech characteristics, and the first text message has text characteristics;
Determining target emotion information of a plurality of audio segments based on voice characteristics of the audio segments and text characteristics of the first text information;
acquiring the emotion grade of each target emotion information in the target emotion information;
When the target emotion information comprises emotion information of a first emotion level, determining the emotion information of the target audio as the emotion information of the first emotion level, wherein the first emotion level is a level with obvious emotion information.
2. The method of claim 1, wherein determining the target emotion information for a plurality of the audio segments based on speech characteristics of the plurality of audio segments and text characteristics possessed by a plurality of the first text information comprises determining the target emotion information for each of the audio segments as follows:
Acquiring a first recognition result determined according to the text feature of the first text message, wherein the first recognition result is used for representing emotion information recognized according to the text feature;
Acquiring a second recognition result determined according to the voice feature of the audio segment corresponding to the first text information, wherein the second recognition result is used for representing emotion information recognized according to the voice feature;
and when the emotional information represented by at least one of the first recognition result and the second recognition result is the emotional information of the first emotional level, determining the target emotional information of the audio segment as the emotional information of the first emotional level.
3. The method of claim 2,
The acquiring of the first recognition result determined according to the text feature of the first text information includes: acquiring a first recognition result determined by a first convolution neural network model according to the text features recognized from the first text information;
Acquiring a second recognition result determined according to the voice feature of the audio segment corresponding to the first text information comprises: and acquiring the second recognition result determined by the first deep neural network model according to the voice feature recognized from the audio segment.
4. The method of claim 3, wherein obtaining the first recognition result determined by the first convolutional neural network model according to the text feature recognized from the first text information comprises:
Performing feature extraction on the first text information on a plurality of feature dimensions through a feature extraction layer of the first convolution neural network model to obtain a plurality of text features, wherein one text feature is obtained on each feature dimension;
And performing feature recognition on a first text feature in the plurality of text features through a classification layer of the first convolution neural network model to obtain the first recognition result, wherein the text features comprise the first text feature and a second text feature, and a feature value of the first text feature is greater than a feature value of any one of the second text features.
5. The method of claim 1, wherein prior to identifying a plurality of first textual information from a plurality of the audio segments, the method further comprises:
carrying out silence detection on the target audio, and detecting a silence segment in the target audio;
and identifying a plurality of audio segments included in the target audio according to the mute segments, wherein one mute segment is arranged between any two adjacent audio segments.
6. An apparatus for determining emotion information, comprising:
A first acquisition unit configured to acquire target audio, wherein the target audio includes a plurality of audio segments;
a recognition unit for recognizing a plurality of first text information from a plurality of the audio segments, wherein any one of the first text information is recognized from a corresponding one of the audio segments, the audio segment has a speech feature, and the first text information has a text feature;
A first determining unit, configured to determine target emotion information of a plurality of audio segments based on speech features of the audio segments and text features of the first text information;
A second obtaining unit, configured to obtain an emotion level to which each of the target emotion information in the plurality of target emotion information belongs, after determining target emotion information of the plurality of audio segments based on speech features of the plurality of audio segments and text features that the plurality of first text information has;
And a second determining unit, configured to determine, when the plurality of target emotion information includes emotion information of a first emotion level, that the emotion information of the target audio is emotion information of the first emotion level, where the first emotion level is a level with significant emotion information.
7. The apparatus of claim 6, wherein the first determining unit determines the target emotion information of each of the audio segments as follows:
Acquiring a first recognition result determined according to the text feature of the first text message, wherein the first recognition result is used for representing emotion information recognized according to the text feature;
Acquiring a second recognition result determined according to the voice feature of the audio segment corresponding to the first text information, wherein the second recognition result is used for representing emotion information recognized according to the voice feature;
And when the emotional information represented by at least one of the first recognition result and the second recognition result is the emotional information of the first emotional level, determining the target emotional information of the audio segment as the emotional information of the first emotional level.
8. The apparatus of claim 6, further comprising:
a detecting unit, configured to perform silence detection on the target audio before identifying a plurality of first text messages from the plurality of audio segments, and detect a silence segment in the target audio;
And the third determining unit is used for identifying a plurality of audio segments included in the target audio according to the mute segments, wherein one mute segment is arranged between any two adjacent audio segments.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any of the preceding claims 1 to 5.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any of the preceding claims 1 to 5 by means of the computer program.
CN201710527121.5A 2017-06-30 2017-06-30 Method and device for determining emotion information Active CN108305643B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710527121.5A CN108305643B (en) 2017-06-30 2017-06-30 Method and device for determining emotion information
PCT/CN2018/093085 WO2019001458A1 (en) 2017-06-30 2018-06-27 Method and device for determining emotion information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710527121.5A CN108305643B (en) 2017-06-30 2017-06-30 Method and device for determining emotion information

Publications (2)

Publication Number Publication Date
CN108305643A CN108305643A (en) 2018-07-20
CN108305643B true CN108305643B (en) 2019-12-06

Family

ID=62872608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710527121.5A Active CN108305643B (en) 2017-06-30 2017-06-30 Method and device for determining emotion information

Country Status (1)

Country Link
CN (1) CN108305643B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597493B (en) * 2018-12-11 2022-05-17 科大讯飞股份有限公司 Expression recommendation method and device
WO2020253509A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN110472007A (en) * 2019-07-04 2019-11-19 深圳追一科技有限公司 Information-pushing method, device, equipment and storage medium
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110910901B (en) * 2019-10-08 2023-03-28 平安科技(深圳)有限公司 Emotion recognition method and device, electronic equipment and readable storage medium
CN110890088B (en) * 2019-10-12 2022-07-15 中国平安财产保险股份有限公司 Voice information feedback method and device, computer equipment and storage medium
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium
CN111091810A (en) * 2019-12-19 2020-05-01 佛山科学技术学院 VR game character expression control method based on voice information and storage medium
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device
CN111081280B (en) * 2019-12-30 2022-10-04 思必驰科技股份有限公司 Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN113327620A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Voiceprint recognition method and device
CN111400511A (en) * 2020-03-12 2020-07-10 北京奇艺世纪科技有限公司 Multimedia resource interception method and device
CN112733546A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Expression symbol generation method and device, electronic equipment and storage medium
CN114446323B (en) * 2022-01-25 2023-03-10 电子科技大学 Dynamic multi-dimensional music emotion analysis method and system
CN114928755B (en) * 2022-05-10 2023-10-20 咪咕文化科技有限公司 Video production method, electronic equipment and computer readable storage medium
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456314A (en) * 2013-09-03 2013-12-18 广州创维平面显示科技有限公司 Emotion recognition method and device
CN104102627A (en) * 2014-07-11 2014-10-15 合肥工业大学 Multi-mode non-contact emotion analyzing and recording system
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN104598644A (en) * 2015-02-12 2015-05-06 腾讯科技(深圳)有限公司 User fond label mining method and device
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN105760852A (en) * 2016-03-14 2016-07-13 江苏大学 Driver emotion real time identification method fusing facial expressions and voices
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method

Also Published As

Publication number Publication date
CN108305643A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108305643B (en) Method and device for determining emotion information
CN108305641B (en) Method and device for determining emotion information
CN108305642B (en) The determination method and apparatus of emotion information
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US9818409B2 (en) Context-dependent modeling of phonemes
US11450311B2 (en) System and methods for accent and dialect modification
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
CN108428446A (en) Audio recognition method and device
JP2020531898A (en) Voice emotion detection methods, devices, computer equipment, and storage media
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US20220076674A1 (en) Cross-device voiceprint recognition
CN109976702A (en) A kind of audio recognition method, device and terminal
CN112233680B (en) Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN113314119B (en) Voice recognition intelligent household control method and device
WO2019001458A1 (en) Method and device for determining emotion information
CN105989839A (en) Speech recognition method and speech recognition device
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN110708619B (en) Word vector training method and device for intelligent equipment
CN110781329A (en) Image searching method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant