CN108305643A

CN108305643A - The determination method and apparatus of emotion information

Info

Publication number: CN108305643A
Application number: CN201710527121.5A
Authority: CN
Inventors: 刘海波
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-06-30
Filing date: 2017-06-30
Publication date: 2018-07-20
Anticipated expiration: 2037-06-30
Also published as: CN108305643B

Abstract

The invention discloses a kind of determination method and apparatus of emotion information.Wherein, this method includes：Target audio is obtained, target audio includes multiple audio sections；Multiple first text messages are identified from multiple audio sections, any one first text message is identified from a corresponding audio section, and there is audio section phonetic feature, the first text message to have text feature；The text feature that phonetic feature and multiple first text messages based on multiple audio sections have determines the target emotion information of multiple audio sections.The present invention solves the technical issues of emotion information that can not accurately identify speaker in the related technology.

Description

The determination method and apparatus of emotion information

Technical field

The present invention relates to internet arenas, in particular to a kind of determination method and apparatus of emotion information.

Background technology

Now, along with the increase of multimedia content, from the market demand can carry out the audiovisual in the short time content it is general Want technology.In addition, the type of content is presented diversified trend, for example, film, serial, home video, news, documentary film, Music content, life real-time scene, the network novel, text news etc. correspond to this, and the audiovisual requirement for trying hearer is also more and more Sample.

Along with the diversification that this audiovisual requires, need for being retrieved immediately to the audiovisual requirement for trying hearer, prompting to want The adaptation of viewing, the technology of scene.Such as content summary technology, i.e., based on comprising text information and summary content, in content In summarization techniques, by analyzing text information, so that it is determined that the emotion that text information carries, such as laughs at, is angry, is sad Deng.

In above-mentioned analysis method, the emotion detection method based on audio can be used, the audio of speaker is detected, Emotion detection is carried out using audio, has the function of the case where obvious emotional expression with relatively good to speaker, when The emotional expression of speaker is not strong, such as a thing being very glad, is expressed with the very flat tone, at this time in audio Hardly band is useful for the glad feature of expression, and in this case, voice-based emotion detection is just ineffective, does not do Method is accurately adjudicated according to phonetic feature, in some instances it may even be possible to obtain the court verdict of mistake.

The technical issues of for the emotion information that can not accurately identify speaker in the related technology, not yet proposes effective at present Solution.

Invention content

An embodiment of the present invention provides a kind of determination method and apparatus of emotion information, at least solve in the related technology without Method accurately identifies the technical issues of emotion information of speaker.

One side according to the ... of the embodiment of the present invention provides a kind of determination method of emotion information, the determination method packet It includes：Target audio is obtained, target audio includes multiple audio sections；Multiple first text messages are identified from multiple audio sections, Any one first text message identifies that audio section has phonetic feature, the first text from a corresponding audio section This information has text feature；The text feature that phonetic feature and multiple first text messages based on multiple audio sections have is true The target emotion information of fixed multiple audio sections.

Another aspect according to the ... of the embodiment of the present invention additionally provides a kind of determining device of emotion information, the determining device Including：First acquisition unit obtains target audio, wherein target audio includes multiple audio sections；Recognition unit is used for from more Multiple first text messages are identified in a audio section, any one first text message is known from a corresponding audio section Do not go out, there is audio section phonetic feature, the first text message to have text feature；First determination unit, for based on multiple The text feature that the phonetic feature of audio section and multiple first text messages have determines the target emotion information of multiple audio sections.

In embodiments of the present invention, when obtaining target audio, one is identified from each audio section of target audio First text message, is then based on the text feature that the first text message has and the phonetic feature that audio section has determines audio The target emotion information of section, can be by the text feature of text message come really when text message is showed with apparent emotion Determine emotion information, can determine that emotion is believed by the phonetic feature of audio section when audio section is showed with apparent emotion Breath, and each audio section there are corresponding one is emotion recognition as a result, can solve in the related technology can not be accurate The technical issues of identifying the emotion information of speaker, and then reach the technology for the accuracy for improving the emotion information of identification speaker Effect.

Description of the drawings

Attached drawing described herein is used to provide further understanding of the present invention, and is constituted part of this application, this hair Bright illustrative embodiments and their description are not constituted improper limitations of the present invention for explaining the present invention.In the accompanying drawings：

Fig. 1 is the schematic diagram of the hardware environment of the determination method of emotion information according to the ... of the embodiment of the present invention；

Fig. 2 is a kind of flow chart of the determination method of optional emotion information according to the ... of the embodiment of the present invention；

Fig. 3 is a kind of flow chart of optional model training method according to the ... of the embodiment of the present invention；

Fig. 4 is a kind of flow chart of optional model training method according to the ... of the embodiment of the present invention；

Fig. 5 is a kind of flow chart of the determination method of optional emotion information according to the ... of the embodiment of the present invention；

Fig. 6 is a kind of schematic diagram of the determining device of optional emotion information according to the ... of the embodiment of the present invention；

Fig. 7 is a kind of schematic diagram of the determining device of optional emotion information according to the ... of the embodiment of the present invention；And

Fig. 8 is a kind of structure diagram of terminal according to the ... of the embodiment of the present invention.

Specific implementation mode

In order to enable those skilled in the art to better understand the solution of the present invention, below in conjunction in the embodiment of the present invention Attached drawing, technical scheme in the embodiment of the invention is clearly and completely described, it is clear that described embodiment is only The embodiment of a part of the invention, instead of all the embodiments.Based on the embodiments of the present invention, ordinary skill people The every other embodiment that member is obtained without making creative work should all belong to the model that the present invention protects It encloses.

It should be noted that term " first " in description and claims of this specification and above-mentioned attached drawing, " Two " etc. be for distinguishing similar object, without being used to describe specific sequence or precedence.It should be appreciated that using in this way Data can be interchanged in the appropriate case, so as to the embodiment of the present invention described herein can in addition to illustrating herein or Sequence other than those of description is implemented.In addition, term " comprising " and " having " and their any deformation, it is intended that cover It includes to be not necessarily limited to for example, containing the process of series of steps or unit, method, system, product or equipment to cover non-exclusive Those of clearly list step or unit, but may include not listing clearly or for these processes, method, product Or the other steps or unit that equipment is intrinsic.

Embodiment 1

According to embodiments of the present invention, a kind of embodiment of the method for the determination method of emotion information is provided.

Optionally, in the present embodiment, the determination method of above-mentioned emotion information can be applied to as shown in Figure 1 by servicing In the hardware environment that device 102 and terminal 104 are constituted.As shown in Figure 1, server 102 is connected by network and terminal 104 It connects, above-mentioned network includes but not limited to：Wide area network, Metropolitan Area Network (MAN) or LAN, terminal 104 are not limited to PC, mobile phone, tablet electricity Brain etc..The determination method of the emotion information of the embodiment of the present invention can be executed by server 102, can also by terminal 104 It executes, can also be and executed jointly by server 102 and terminal 104.Wherein, terminal 104 executes the emotion of the embodiment of the present invention The determination method of information can also be to be executed by client mounted thereto.

When the determination method of the emotion information of the embodiment of the present invention when being individually performed, is directly being taken by server or terminal Program code corresponding with the present processes is executed on business device or terminal.

When the determination method of the emotion information of the embodiment of the present invention when executing, is sent out by server and terminal by terminal jointly The demand for playing identification target audio, at this point, target voice to be identified is sent to server by terminal, then is held by server Row program code corresponding with the present processes, and the result of identification is fed back into terminal.

This Shen is described in detail for executing program code corresponding with the present processes on server or terminal below Embodiment please, Fig. 2 are a kind of flow chart of the determination method of optional emotion information according to the ... of the embodiment of the present invention, such as Fig. 2 Shown, this method may comprise steps of：

Step S202 obtains target audio, and target audio includes multiple audio sections, and target audio is for stating text envelope Breath.

It can be terminal active obtaining target audio, either receive the target audio of miscellaneous equipment transmission or in target Target audio is obtained under the triggering of instruction.Target instruction target word is equivalent to the finger of the target audio for identification of user or terminal triggering It enables.It is to identify that the emotion information of each audio section in target audio, the emotion information are to pass through target to obtain target audio Show when audio presentation text message (including but not limited to by the wording in text or the tone in word, audio, sound What color etc. was showed) out emotion information.

Above-mentioned text message refers to the combination of a sentence or multiple sentences, and a text includes but is not limited to one Sentence (Sentence), a paragraph (Paragraph) or a chapter (Discourse).

Emotion information is the information for describing speaker's emotion, such as when chatting to something, is given expression to related to happiness Emotion (glad, flat, sad), when such as receiving others and apologizing, give expression to that (pardon, do not set can with relevant emotion is forgiven It is no, do not forgive) etc..

When target audio is a sentence, phrase or word in the audio section i.e. sentence；It is one in target audio When a paragraph, a sentence in the audio section i.e. sentence or phrase, word.

Step S204 identifies multiple first text messages from multiple audio sections, any one first text message is It is identified from a corresponding audio section, there is audio section phonetic feature, the first text message to have text feature.

The first text message is identified from audio section, refers to being identified expressed by audio section by way of speech recognition The first text message (the first text message identified herein may there are technicalities with the text message actually stated).

For speech recognition, phonetic feature includes the feature of following several respects：Perceptual weighting linear prediction PLP (Perceptual Linear Predictive), Mel frequency cepstral coefficients MFCC (Mel-Frequency Cepstral Coefficients), FBANK (Filter-bank features), tone PITCH (such as high bass), speech energy ENERGY, I- VECTOR a kind of important feature of vocal acoustics' difference (reflection speak) etc..The feature used in the application can be among the above It is one or more, it is preferred to use multiple.

For text identification, the first above-mentioned text envelope can be identified from audio section by speech recognition engine Breath, the text feature of text message includes the spies such as affective style, Sentiment orientation, emotional intensity of each phrase or vocabulary in text Sign, can also be the incidence relation feature etc. between phrase.

Step S206, the text feature that phonetic feature and multiple first text messages based on multiple audio sections have determine The target emotion information of multiple audio sections.

In the target emotion information for determining target audio, considered text feature that the first text message has and The phonetic feature that target audio has, relative in the related technology, only with the emotion detection method based on audio to speaker Audio be detected for, both can use audio carry out emotion detection, to speaker have obvious emotional expression The case where, has the function of relatively good, but when the emotional expression of speaker is not strong, for example a thing being very glad, use is very flat When the light tone is expressed, hardly band is useful for the glad feature of expression and can also use in this case in audio Text based emotion detection method is detected the text message in the audio of speaker, so as to according to text feature It is accurately adjudicated, to make up the deficiency for carrying out emotion detection only by audio, reaches the accuracy for improving court verdict Effect.

Moreover, for there are the section audios that mood changes, due to being that each audio section obtains a corresponding target Emotion information, the result enabled to are more accurate.

S202 to step S206 through the above steps, when obtaining target audio, from each audio section of target audio It identifies first text message, is then based on the text feature that the first text message has and the voice that audio section has is special Sign determines the target emotion information of audio section, can pass through the text of text message when there is text message apparent emotion to show Eigen determines emotion information, can be by the phonetic feature of audio section come really when audio section is showed with apparent emotion Determine emotion information, and each audio section there are corresponding one is emotion recognition as a result, can solve in the related technology The technical issues of emotion information of speaker can not be accurately identified, and then reach the accurate of the emotion information for improving identification speaker The technique effect of degree.

Only with the emotion detection method based on audio for the audio of speaker is detected, to speaker have than The case where obvious emotional expression have the function of it is relatively good, using text based emotion detection method to the sound of speaker Text message in frequency has the function of the case where obvious emotional expression with relatively good, however, when (i.e. what Scene or which type of voice) be detected using the emotion detection method based on audio, when utilize text based feelings It is unknown that sense detection method, which is detected, it is impossible to is predicted in advance using any method come the inspection to current audio to be detected It is more preferable to survey effect.

If applicant is it is considered that using flat tone statement, (such as emotion is for the obvious text of certain emotions If glad text is stated using the flat tone), the recognition effect using text based emotion detection method is obviously preferable, If being stated (as more flat text is glad using the tone with apparent emotion than flat aobvious text for certain emotions The tone is stated), the recognition effect of the emotion detection method based on audio is obviously preferable, and the obvious text of above-mentioned emotion can be with Stated using the obvious tone of the flat tone or emotion, the more flat text of emotion can also use with notable emotion The tone or the statement of the flat tone, be not in the tone table that the obvious text of certain positive emotions uses reversed emotion It states, the text such as with happiness emotional color is stated using the sad tone.

The application can be made up and be lacked using what single features were identified based on the method blended herein with voice The fusion of point, the two is that text and audio training blend, and the method for fusion can be text output result and audio output knot Fruit centre sums up to obtain final result, and not instead of whole section of adduction, the adduction of segmentation, because saying using a weight Impossible one whole section of emotion for talking about people remains unchanged, but can be risen and fallen, and may be with regard to several keywords in one section of word Emotion than stronger, can recognize that in this way whole section words in different phase speaker emotional characteristics.

On the basis of above-mentioned cognition, as long as voice or the apparent emotional color of word band (the i.e. feelings of the first emotion grade Feel information), then it can determine that target voice is the voice with emotional color.Phonetic feature based on multiple audio sections and multiple The text feature that first text message has determines that the target emotion information of multiple audio sections includes determining as follows often The target emotion information of a audio section：Obtain the first recognition result determined according to the text feature of the first text message, wherein The emotion information that first recognition result is used to indicate to be identified according to text feature；It obtains according to corresponding with the first text message The second recognition result that the phonetic feature of audio section determines, wherein the second recognition result is identified for indicating according to phonetic feature The emotion information gone out；It is the first emotion in the emotion information that at least one of the first recognition result and the second recognition result indicate When the emotion information of grade, the target emotion information of institute's audio section is determined as to the emotion information of the first emotion grade.

The first above-mentioned emotion grade is the grade with obvious emotion information, rather than tend to it is intermediate it is flat (without Apparent emotion) information, such as glad, flat, sad this group of emotion information, the emotion information of the first emotion grade Refer to glad or sadness, rather than it is flat, it is similar for other kinds of emotion information, it repeats no more.

In the above-mentioned technical solution being identified of the application, common algorithm or machine are including but not limited to used Engineering may be used for the accuracy of raising in the identification that device learns relevant algorithm progress feature recognition and emotion information Relevant algorithm is practised to carry out the identification of feature recognition and emotion information.

(1) the training flow based on text identification

Before the above-mentioned steps S202 to step S206 for executing the application, first algorithm model can be trained：It is obtaining Before taking target audio, the second text message (training text) and first emotion information pair the second convolution neural network model are used (original convolution neural network model) is trained, to determine the value of parameter in the second convolution neural network model, and will be true The second convolution neural network model after the value of parameter is determined and has been set as the first convolution neural network model, wherein first Emotion information is the emotion information of the second text message.As shown in Figure 3：

Step S301 segments the second text.

Training sentence is segmented, for example is to the result of example sentence " today pays out wages, I am very happy " participle：It is modern It, pay out wages, I, very, happily.The affective tag (practical emotion information) of the sentence of this training is glad.

Step S302 carries out term vector by Word2vector to the word after participle.

Term vector is as its name suggests to indicate a word with a vectorial form.Since machine learning task is needed input It is quantized into numerical value expression, then by making full use of the computing capability of computer, is calculated finally wanting as a result, so needing By term vector.

According to the number segmented in training sentence, the matrix of a n*k is formed, wherein n is the number of trained sentence word, k Type for the dimension of vector v ector, this matrix can be fixed, can also be it is dynamic, according to specific circumstances into Row selection.

Word2vector has the algorithm that comparison is more and stablizes, the application that CBOW and Skip-gram can be selected real at present It is existing, can be based on Huffman trees for CBOW algorithm models and Skip-gram algorithm models, n omicronn-leaf in Huffman trees The initialization value of the intermediate vector of node storage is null vector, and the term vector of the corresponding word of leaf node is random initializtion 's.

The convolutional layer of step S303, the second convolution neural network model carry out feature extraction.

The n*k matrixes that back generates obtain the matrix that several columns are 1, this layer is similar to one by convolutional layer Feature extraction layer, generates n word, the vector matrixes of k dimensions, this sentence can be expressed as：

x_i:i+jIt is word x₁,x₂,...,x_i+jCombination, symbolIndicate Boolean calculation exclusive or Logical operation, a convolution operation are equivalent to a filter, and a new feature, new spy are generated using the word of a length of l of window Sign can use c_iIt indicates, then convolution operation is：

c_i=f (wx_i:i+l-1+ b), this filter can be to { x_1:l,x_2:l,...,x_n-l+1:nDifference word combination generation One new characteristic sequence c=[c₁,c₂,...,c_n-l+1], using multiple filters correspond to different window length can generate it is multiple It is classified as 1 matrix.

The pond layer of step S304, the second convolution neural network model carry out pond processing.

Back generate several be classified as 1 matrix, can be chosen according to actual conditions maximum or maximum several As new feature sentence length can be solved the problems, such as by forming the feature of fixed dimension after this layer.

Step S305, the neural net layer of the second convolution neural network model handle to obtain classification results (namely the second text Eigen).

M filter is used by back, is used as newly if each filter chooses maximum value by pond operation Feature, then just forming the new feature of m dimensions(In the characteristic sequence c for indicating m-th of filter The maximum feature of characteristic value, 1) value of m is more than, pass through a NN layers of y_i=wz+b (w indicates that weight, b indicate deviation), By NN layers multiple, final output (i.e. the second text feature) is obtained.

Step S306 adjusts parameter by the Back-Propagation (BP layers) of the second convolution neural network model Whole optimization.

The output that back generates (is typically maximum entropy and lowest mean square by suitable loss function with true output Error function is as loss function), using stochastic gradient descent method, the parameter of CNN models is updated, through excessively taking turns iteration Model is set to be optimal.

Stochastic gradient descent W_i+1=W_i-η△W_i, wherein η is learning rate, W_iFor the weight (ginseng i.e. in model before iteration Number), W_i+1For the weight after iteration.

Maximum entropy loss function：Local derviation is asked to weight w and deviation b to loss function, is used Stochastic gradient descent method is updated w and b by wheel.

BP algorithm is from last layer successively as the w and b to front different layers are updated, and is obtained after the completion of training process CNN models (the first convolution neural network model).

It should be noted that above-mentioned training process is really to excavate being associated between emotion information and text feature System, so that the first obtained convolution neural network model can identify emotion information according to incidence relation.

(2) voice-based DNN training process

Before the above-mentioned steps S202 to step S206 for executing the application, first algorithm model can be trained and also wrapped It includes：Before obtaining target audio, training audio (or training voice) and second emotion information pair the second depth nerve net are used Network model is trained, with determine the second deep neural network model in parameter value, and by determine parameter value it The second deep neural network model afterwards is set as the first deep neural network model, wherein the second emotion information is training sound The emotion information of frequency.It is described in detail with reference to Fig. 4：

Step S401 carries out framing to training audio.

Voice signal is because be quasi-steady state signal, in processing often signal framing, per frame length about 20ms-30ms, Voice signal is seen as steady-state signal in this section, only the information of stable state could carry out signal processing, so first to divide Frame.

Step S402 carries out feature extraction, phonetic feature, emotion annotation, text to the speech frame after training audio framing Feature is sent in DNN models.

Feature extraction carried out to training voice, the feature of extraction can there are many kinds of, such as PLP, MFCC, FBANK, PITCH, ENERGY, I-VECTOR etc. can extract one or more in this various features, the spy that the application preferentially uses Sign is manifold fusion.

Step S403 is trained DNN models.

According to the feature that the first step is extracted, the extension of front and back frame is carried out, is then delivered in DNN, the biography between DNN middle layers It is the same to pass with the NN layers in CNN, the output and reality that the newer method of weight generates as CNN according to training characteristics Error between mark seeks local derviation using loss function to w and b, using under Back-Propagation (BP layers) and stochastic gradient Drop method is updated w and b, and method makes DNN models be optimal such as CNN, through excessively taking turns iteration.Training process is completed After obtain DNN models (the first deep neural network model).

It should be noted that above-mentioned training process is really to excavate being associated between emotion information and phonetic feature System, so that the first obtained deep neural network model can identify emotion information according to incidence relation.

In the technical solution that step S202 is provided, target audio is obtained, it is defeated by audio that user is obtained such as in terminal Enter a section audio of equipment (such as microphone) input.

In the technical solution that step S204 is provided, identified from multiple audio sections multiple first text messages it Before, mute detection is carried out to target audio, detects mute section in target audio；Target audio packet is identified according to mute section The multiple audio sections included, there are one mute section at interval between the adjacent audio section of any two.

Audio is divided by different sections according to situation mute in audio to training audio, can be used based on energy, zero passage The methods of rate, model realize mute detection, and the application's is the mute detection based on model.

After multiple audio sections are determined, you can identify multiple first text messages from multiple audio sections, arbitrarily One the first text message identifies from a corresponding audio section, and audio section has phonetic feature, and (namely acoustics is special Sign), the first text message has text feature.

The extraction of the extraction of acoustic feature and the important link that selection is speech recognition, acoustic feature is both a letter Cease the process significantly compressed and a signal uncoiling process, it is therefore an objective to mode division device be enable preferably to divide.Due to language The time-varying characteristics of sound signal, feature extraction must carry out on a bit of voice signal, namely carry out short-time analysis.This section of quilt It is considered that stable analystal section is referred to as frame, the offset between frame and frame usually takes the 1/2 or 1/3 of frame length.Usually extraction mesh During phonetic feature in mark with phonetic symbols frequency, preemphasis can be carried out to signal to promote high frequency, to signal adding window to avoid in short-term The influence at voice segments edge.The above-mentioned process for obtaining the first text message can be realized by speech recognition engine.

In the technical solution that step S206 is provided, the phonetic feature based on multiple audio sections and multiple first text messages The text feature having determines the target emotion information of multiple audio sections.The technical solution that step S206 is provided includes at least following Two kinds of realization methods：

(1) mode one

The text feature that phonetic feature and multiple first text messages based on multiple audio sections have determines multiple audios The target emotion information of section includes the target emotion information for determining each audio section as follows：It obtains according to the first text The first recognition result that the text feature of information determines, wherein the first recognition result is identified for indicating according to text feature Emotion information；The second recognition result determined according to the phonetic feature of audio section corresponding with the first text message is obtained, In, the second recognition result is for indicating the emotion information identified according to phonetic feature；It is identified in the first recognition result and second At least one of as a result when the emotion information indicated is the emotion information of the first emotion grade, by the target emotion of institute's audio section Information is determined as the emotion information of the first emotion grade.Such as glad, flat, sad this group of emotion information, As long as there are one to be glad or sad in a recognition result and the second recognition result, then final result (target emotion information) is It is glad or sad, and ignore the influence of the emotion information of flat the first estate without apparent Sentiment orientation.

Above-mentioned the first recognition result and the second recognition result can directly be the emotion information identified, can also be use In the other information (such as emotion score, affective style) for the emotion information that instruction identifies.

Optionally, the first convolution neural network model that is identified by of text feature is realized, is obtained according to the first text envelope When the first recognition result that the text feature of breath determines, basis directly is obtained from the first text from the first convolution neural network model The first recognition result that the text feature identified in information determines.

Above-mentioned acquisition the first convolution neural network model is true according to the text feature identified from the first text message The first fixed recognition result includes：By the feature extraction layer of the first convolution neural network model in multiple characteristic dimensions One text message carries out feature extraction, obtains multiple text features, wherein extraction obtains a text in each characteristic dimension Feature (namely selected characteristic is worth one or several maximum features)；Classification layer by the first convolution neural network model is right The first text feature in multiple text features carries out feature recognition, obtains the first recognition result, wherein text feature includes the One text feature and the second text feature, the characteristic value of the first text feature are more than the feature of any one the second text feature Value.

Phonetic feature is identified by the realization of the first deep neural network model, is obtaining basis and the first text message pair When the second recognition result that the phonetic feature of the audio section answered determines, directly from the first deep neural network model obtain according to from The second recognition result that the phonetic feature that audio section identifies determines.

(2) mode two

The text feature that phonetic feature and multiple first text messages based on multiple audio sections have determines multiple audios The target emotion information of section can be realized in the following way：The first recognition result that acquisition is determined according to text feature, first Recognition result includes the first emotion parameter for being used to indicate the emotion information identified according to text feature；It obtains according to voice spy Levy the second determining recognition result, the second recognition result includes be used to indicate the emotion information identified according to phonetic feature the Two emotion parameters；It sets the third emotion parameter final_score for being used to indicate target emotion information to：First emotion parameter Score1* is that the second emotion parameters of weight a+ Score2* of the first emotion parameter setting is the weight of the second emotion parameter setting (1-a)；It will be determined as target emotion information positioned at the emotion information of the second emotion grade, the second emotion grade is and third emotion The corresponding emotion grade in emotion parameter section where parameter, there are one emotion parameter sections for each emotion grade correspondence.

It should be noted that in the first recognition result that acquisition is determined according to text feature, and obtain according to voice spy When levying the second determining recognition result, reference can be made to the model used in above-mentioned mode one is calculated.

Optionally, based on multiple audio sections phonetic feature and the text feature that has of multiple first text messages determine After the target emotion information of multiple audio sections, played audio segment and the target emotion information of the audio section is shown one by one；It receives The feedback information of user, feedback information include being used to indicate whether the target emotion information identified correctly indicates information, Further include the practical emotion information that user identifies according to the audio section of broadcasting in feedback information in the case that incorrect.

If the target emotion information identified is incorrect, illustrate convolutional neural networks model and deep neural network model Recognition accuracy it is to be improved, especially for this one kind identification mistake audio-frequency information, discrimination is worse, at this point, sharp Discrimination is improved with negative feedback mechanism, can specifically utilize the audio of this one kind identification mistake in a manner mentioned above to convolution god Re -training is carried out through network model and deep neural network model, assignment is carried out to the parameter in two models again, is improved Its recognition accuracy.

As a kind of optional embodiment, embodiments herein is described in detail with reference to Fig. 5：

Step S501 carries out mute detection, and target audio is divided into multiple audio sections.

Step S502 carries out framing to audio section.

It, can voice signal in this section in processing the speech frame that signal framing is length about 20ms-30ms It sees steady-state signal as, and then is convenient for signal processing.

Step S503 extracts the phonetic feature (namely acoustic feature) in audio section.

The phonetic feature identified includes but is not limited to perceptual weighting linear prediction PLP, Mel frequency cepstral coefficient It is multiple in MFCC, FBANK, tone PITCH, speech energy ENERGY, I-VECTOR.

Processing is identified to the phonetic feature of audio section by DNN models in step S504.

DNN models are according to above-mentioned phonetic feature (perceptual weighting linear prediction PLP, Mel frequency cepstral coefficient identified It is multiple in MFCC, FBANK, tone PITCH, speech energy ENERGY, I-VECTOR) processing is identified.

Step S505 obtains the second recognition result score2.

Step S506 carries out speech recognition by speech recognition engine to audio section.

In the training stage of speech recognition engine, each word in vocabulary can be given an account of successively, and by its feature Vector is stored in template library as template.

In the stage for carrying out speech recognition by speech recognition engine, will input the acoustic feature vector of voice successively with mould Each template in plate library carries out similarity-rough set, is exported similarity soprano as recognition result.

Step S507 obtains Text region result (i.e. the first text message).

Step S508 segments the first text message, such as to " tomorrow will have a holiday or vacation, I am good happy " participle As a result it is：Tomorrow, will, have a holiday or vacation, I, it is good, happy,.

Step S509, multiple words that above-mentioned participle is obtained are as the input of CNN models, and CNN models are to multiple words Language carries out convolution, classification, identifying processing.

Step S510 obtains the first recognition result score1 of CNN models output.

Step S511 carries out fusion treatment to recognition result and obtains final result.

The target audio of input, by feature extraction, feature extraction is divided into two kinds of one kind for speech recognition, by voice It identifies engine, obtains voice recognition result, voice recognition result is sent to text emotion detecting and alarm, obtains text by participle Emotion score score1；Another is used to detect score based on audio emotion, and the detection of audio emotion is sent to by feature extraction, Audio score score2 is obtained, then obtains final score final_score by a weight factor：

Final_score=a*score1+ (1-a) * score2.

A is the weighted value trained by development set, and final score is the score between 0-1.

For example, sad corresponding score section be [0,0.3), flat corresponding score section be [0.3,0.7), it is glad right The score [0.7,1] answered, you can determine that actual emotion is glad, sad or flat according to finally obtained score value.

In embodiments herein, using based on the method blended herein with voice, individual difference can be made up The shortcomings that method, can increase weight of the weight factor for adjusting two methods during the two blends, with It is applicable in different occasions.The application can be divided into two modules, training module and identification module, and training module can be instructed individually To practice, different texts and audio are chosen according to different situations, three kinds of emotional characteristics in the application are glad, normal and unhappy, Degree glad and out of sorts can indicate that the score of emotion is more passive closer to zero mood between 0-1 with score, Closer to 1 mood it is more positive, for application can be audio section emotion differentiate.

It should be noted that for each method embodiment above-mentioned, for simple description, therefore it is all expressed as a series of Combination of actions, but those skilled in the art should understand that, the present invention is not limited by the described action sequence because According to the present invention, certain steps can be performed in other orders or simultaneously.Secondly, those skilled in the art should also know It knows, embodiment described in this description belongs to preferred embodiment, and involved action and module are not necessarily of the invention It is necessary.

Through the above description of the embodiments, those skilled in the art can be understood that according to above-mentioned implementation The method of example can add the mode of required general hardware platform to realize by software, naturally it is also possible to by hardware, but it is very much In the case of the former be more preferably embodiment.Based on this understanding, technical scheme of the present invention is substantially in other words to existing The part that technology contributes can be expressed in the form of software products, which is stored in a storage In medium (such as ROM/RAM, magnetic disc, CD), including some instructions are used so that a station terminal equipment (can be mobile phone, calculate Machine, server or network equipment etc.) execute method described in each embodiment of the present invention.

Embodiment 2

According to embodiments of the present invention, a kind of emotion information for implementing the determination method of above-mentioned emotion information is additionally provided Determining device.Fig. 6 is a kind of schematic diagram of the determining device of optional emotion information according to the ... of the embodiment of the present invention, such as Fig. 6 Shown, which may include：First acquisition unit 61, recognition unit 62 and the first determination unit 63.

First acquisition unit 61 obtains target audio, wherein target audio includes multiple audio sections.

Recognition unit 62, for identifying multiple first text messages from multiple audio sections, any one first text Information identifies that there is audio section phonetic feature, the first text message to have text special from a corresponding audio section Sign.

First determination unit 63, for phonetic feature and the text that has of multiple first text messages based on multiple audio sections Eigen determines the target emotion information of multiple audio sections.

It should be noted that the first acquisition unit 61 in the embodiment can be used for executing in the embodiment of the present application 1 Step S202, the recognition unit 62 in the embodiment can be used for executing the step S204 in the embodiment of the present application 1, the embodiment In the first determination unit 63 can be used for execute the embodiment of the present application 1 in step S206.

Herein it should be noted that above-mentioned module is identical as example and application scenarios that corresponding step is realized, but not It is limited to 1 disclosure of that of above-described embodiment.It should be noted that above-mentioned module as a part for device may operate in as In hardware environment shown in FIG. 1, it can also pass through hardware realization by software realization.

By above-mentioned module, when obtaining target audio, one first is identified from each audio section of target audio Text message, is then based on the text feature that the first text message has and the phonetic feature that audio section has determines audio section Target emotion information can determine feelings when text message is showed with apparent emotion by the text feature of text message Feel information, emotion information can be determined by the phonetic feature of audio section when audio section is showed with apparent emotion, and And each audio section is emotion recognition as a result, can solve can not accurately identify in the related technology there are corresponding one The technical issues of emotion information of words person, and then reach the technique effect for the accuracy for improving the emotion information of identification speaker.

On the basis of above-mentioned cognition, as long as voice or the apparent emotional color of word band (the i.e. feelings of the first emotion grade Feel information), then it can determine that target voice is the voice with emotional color.

Optionally, as shown in fig. 7, the device of the application may also include：Second acquisition unit 64, for based on multiple sounds The text feature that the phonetic feature of frequency range and multiple first text messages have determine multiple audio sections target emotion information it Afterwards, the emotion grade belonging to each target emotion information in multiple target emotion informations is obtained；Second determination unit 65 is used for When multiple target emotion informations include the emotion information of the first emotion grade, determine that the emotion information of target audio is the first feelings Feel the emotion information of grade.

The first determination unit of the application determines the target emotion information of each audio section as follows：Obtain basis The first recognition result that the text feature of first text message determines, wherein the first recognition result is for indicating according to text spy Levy the emotion information identified；Obtain the second identification determined according to the phonetic feature of audio section corresponding with the first text message As a result, wherein the emotion information that the second recognition result is used to indicate to be identified according to phonetic feature；In the first recognition result and When the emotion information that at least one of two recognition results indicate is the emotion information of the first emotion grade, by the mesh of institute's audio section Mark emotion information is determined as the emotion information of the first emotion grade.

It is from the when first determination unit obtains the first recognition result determined according to the text feature of the first text message One convolution neural network model obtains the first recognition result determined according to the text feature identified from the first text message.

Obtaining what the first convolution neural network model was determined according to the text feature identified from the first text message During first recognition result, by the feature extraction layer of the first convolution neural network model in multiple characteristic dimensions One text message carries out feature extraction, obtains multiple text features, wherein extraction obtains a text in each characteristic dimension Feature；Feature knowledge is carried out to the first text feature in multiple text features by the classification layer of the first convolution neural network model Not, the first recognition result is obtained, wherein text feature includes the first text feature and the second text feature, the first text feature Characteristic value be more than any one the second text feature characteristic value.

First determination unit is obtained to be known according to second that the phonetic feature of audio section corresponding with the first text message determines It is to obtain to be known according to second that the phonetic feature identified from audio section determines from the first deep neural network model when other result Other result.

Optionally, the device of the application can also include：Detection unit, for identified from multiple audio sections it is multiple Before first text message, mute detection is carried out to target audio, detects mute section in target audio；Third determines single Member, multiple audio sections for identifying that target audio includes according to mute section, wherein between the adjacent audio section of any two There are one mute section at interval.

Herein it should be noted that above-mentioned module is identical as example and application scenarios that corresponding step is realized, but not It is limited to 1 disclosure of that of above-described embodiment.It should be noted that above-mentioned module as a part for device may operate in as In hardware environment shown in FIG. 1, it can also pass through hardware realization by software realization, wherein hardware environment includes network Environment.

Embodiment 3

According to embodiments of the present invention, additionally provide it is a kind of for implement above-mentioned emotion information determination method server or Terminal (namely electronic device).

Fig. 8 is a kind of structure diagram of terminal according to the ... of the embodiment of the present invention, as shown in figure 8, the terminal may include：One A or multiple (one is only shown in Fig. 8) processor 801, memory 803 and transmitting device 805 are (in such as above-described embodiment Sending device), as shown in figure 8, the terminal can also include input-output equipment 807.

Wherein, memory 803 can be used for storing software program and module, such as the emotion information in the embodiment of the present invention Determine that the corresponding program instruction/module of method and apparatus, processor 801 are stored in the software journey in memory 803 by operation Sequence and module realize the determination method of above-mentioned emotion information to perform various functions application and data processing.It deposits Reservoir 803 may include high speed random access memory, can also include nonvolatile memory, as one or more magnetic storage fills It sets, flash memory or other non-volatile solid state memories.In some instances, memory 803 can further comprise relative to place The remotely located memory of device 801 is managed, these remote memories can pass through network connection to terminal.The example packet of above-mentioned network Include but be not limited to internet, intranet, LAN, mobile radio communication and combinations thereof.

Above-mentioned transmitting device 805 is used to receive via network or transmission data, can be also used for processor with Data transmission between memory.Above-mentioned network specific example may include cable network and wireless network.In an example, Transmitting device 805 includes a network adapter (Network Interface Controller, NIC), can pass through cable It is connected with other network equipments with router so as to be communicated with internet or LAN.In an example, transmission dress It is radio frequency (Radio Frequency, RF) module to set 805, is used to wirelessly be communicated with internet.

Wherein, specifically, memory 803 is for storing application program.

Processor 801 can call the application program that memory 803 stores by transmitting device 805, to execute following steps Suddenly：Target audio is obtained, target audio includes multiple audio sections；Multiple first text messages are identified from multiple audio sections, Any one first text message identifies that audio section has phonetic feature, the first text from a corresponding audio section This information has text feature；The text feature that phonetic feature and multiple first text messages based on multiple audio sections have is true The target emotion information of fixed multiple audio sections.

Processor 801 is additionally operable to execute following step：Obtain each target emotion information institute in multiple target emotion informations The emotion grade of category；When multiple target emotion informations include the emotion information of the first emotion grade, target audio is determined Emotion information is the emotion information of the first emotion grade.

Using the embodiment of the present invention, when obtaining target audio, one is identified from each audio section of target audio First text message, is then based on the text feature that the first text message has and the phonetic feature that audio section has determines audio The target emotion information of section, can be by the text feature of text message come really when text message is showed with apparent emotion Determine emotion information, can determine that emotion is believed by the phonetic feature of audio section when audio section is showed with apparent emotion Breath, and each audio section there are corresponding one is emotion recognition as a result, can solve in the related technology can not be accurate The technical issues of identifying the emotion information of speaker, and then reach the technology for the accuracy for improving the emotion information of identification speaker Effect.

Optionally, the specific example in the present embodiment can refer to showing described in above-described embodiment 1 and embodiment 2 Example, details are not described herein for the present embodiment.

It will appreciated by the skilled person that structure shown in Fig. 8 is only to illustrate, terminal can be smart mobile phone (such as Android phone, iOS mobile phones), tablet computer, palm PC and mobile internet device (Mobile Internet Devices, MID), the terminal devices such as PAD.Fig. 8 it does not cause to limit to the structure of above-mentioned electronic device.For example, terminal is also May include than shown in Fig. 8 more either less components (such as network interface, display device) or with shown in Fig. 8 Different configurations.

One of ordinary skill in the art will appreciate that all or part of step in the various methods of above-described embodiment is can To be completed come command terminal device-dependent hardware by program, which can be stored in a computer readable storage medium In, storage medium may include：Flash disk, read-only memory (Read-Only Memory, ROM), random access device (Random Access Memory, RAM), disk or CD etc..

Embodiment 4

The embodiments of the present invention also provide a kind of storage mediums.Optionally, in the present embodiment, above-mentioned storage medium can For the program code of the determination method of execution emotion information.

Optionally, in the present embodiment, above-mentioned storage medium can be located at multiple in network shown in above-described embodiment On at least one of network equipment network equipment.

Optionally, in the present embodiment, storage medium is arranged to store the program code for executing following steps：

S11 obtains target audio, and target audio includes multiple audio sections；

S12 identifies multiple first text messages from multiple audio sections, any one first text message is from correspondence Audio section in identify, there is audio section phonetic feature, the first text message to have text feature；

S13, the text feature that phonetic feature and multiple first text messages based on multiple audio sections have determine multiple The target emotion information of audio section.

Optionally, storage medium is also configured to store the program code for executing following steps：

S21 obtains the emotion grade belonging to each target emotion information in multiple target emotion informations；

S22 determines the feelings of target audio when multiple target emotion informations include the emotion information of the first emotion grade Feel the emotion information that information is the first emotion grade.

Optionally, in the present embodiment, above-mentioned storage medium can include but is not limited to：USB flash disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disc or The various media that can store program code such as CD.

The embodiments of the present invention are for illustration only, can not represent the quality of embodiment.

If the integrated unit in above-described embodiment is realized in the form of SFU software functional unit and as independent product Sale in use, can be stored in the storage medium that above computer can be read.Based on this understanding, skill of the invention Substantially all or part of the part that contributes to existing technology or the technical solution can be with soft in other words for art scheme The form of part product embodies, which is stored in a storage medium, including some instructions are used so that one Platform or multiple stage computers equipment (can be personal computer, server or network equipment etc.) execute each embodiment institute of the present invention State all or part of step of method.

In the above embodiment of the present invention, all emphasizes particularly on different fields to the description of each embodiment, do not have in some embodiment The part of detailed description may refer to the associated description of other embodiment.

In several embodiments provided herein, it should be understood that disclosed client, it can be by others side Formula is realized.Wherein, the apparatus embodiments described above are merely exemplary, for example, the unit division, only one Kind of division of logic function, formula that in actual implementation, there may be another division manner, such as multiple units or component can combine or It is desirably integrated into another system, or some features can be ignored or not executed.Another point, it is shown or discussed it is mutual it Between coupling, direct-coupling or communication connection can be INDIRECT COUPLING or communication link by some interfaces, unit or module It connects, can be electrical or other forms.

The unit illustrated as separating component may or may not be physically separated, aobvious as unit The component shown may or may not be physical unit, you can be located at a place, or may be distributed over multiple In network element.Some or all of unit therein can be selected according to the actual needs to realize the mesh of this embodiment scheme 's.

In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, it can also It is that each unit physically exists alone, it can also be during two or more units be integrated in one unit.Above-mentioned integrated list The form that hardware had both may be used in member is realized, can also be realized in the form of SFU software functional unit.

The above is only a preferred embodiment of the present invention, it is noted that for the ordinary skill people of the art For member, various improvements and modifications may be made without departing from the principle of the present invention, these improvements and modifications are also answered It is considered as protection scope of the present invention.

Claims

1. a kind of determination method of emotion information, which is characterized in that including：

Obtain target audio, wherein the target audio includes multiple audio sections；

Multiple first text messages are identified from multiple audio sections, wherein any one of first text message is It is identified from a corresponding audio section, there is the audio section phonetic feature, first text message to have Text feature；

The text feature that phonetic feature and multiple first text messages based on multiple audio sections have determines multiple The target emotion information of the audio section.

2. according to the method described in claim 1, it is characterized in that, in phonetic feature based on multiple audio sections and multiple After the text feature that first text message has determines the target emotion information of multiple audio sections, the method is also Including：

Obtain the emotion grade belonging to each target emotion information in multiple target emotion informations；

When multiple target emotion informations include the emotion information of the first emotion grade, the feelings of the target audio are determined Feel the emotion information that information is the first emotion grade.

3. according to the method described in claim 1, it is characterized in that, the phonetic feature based on multiple audio sections and multiple institutes It states the text feature that the first text message has and determines that the target emotion information of multiple audio sections includes as follows Determine the target emotion information of each audio section：

Obtain the first recognition result determined according to the text feature of first text message, wherein the first identification knot The emotion information that fruit is used to indicate to be identified according to the text feature；

The second recognition result determined according to the phonetic feature of the audio section corresponding with first text message is obtained, In, second recognition result is for indicating the emotion information identified according to the phonetic feature；

It is the first emotion in the emotion information that at least one of first recognition result and second recognition result indicate When the emotion information of grade, the target emotion information of institute's audio section is determined as to the emotion information of the first emotion grade.

4. according to the method described in claim 3, it is characterized in that,

Obtaining the first recognition result determined according to the text feature of first text message includes：Obtain the first convolutional Neural First recognition result that network model is determined according to the text feature identified from first text message；

Obtain the second recognition result packet determined according to the phonetic feature of the audio section corresponding with first text message It includes：Obtain the first deep neural network model is determined according to the phonetic feature identified from the audio section described second Recognition result.

5. according to the method described in claim 4, it is characterized in that, obtaining the first convolution neural network model according to from described the First recognition result that the text feature that is identified in one text message determines includes：

By the feature extraction layer of the first convolution neural network model to first text envelope in multiple characteristic dimensions Breath carries out feature extraction, obtains multiple text features, wherein extract and obtained described in one in each characteristic dimension Text feature；

By the classification layer of the first convolution neural network model to the first text feature in multiple text features into Row feature recognition obtains first recognition result, wherein the text feature includes first text feature and the second text Eigen, the characteristic value of first text feature are more than the characteristic value of any one of second text feature.

6. according to the method described in claim 1, it is characterized in that, identifying multiple first texts from multiple audio sections Before this information, the method further includes：

Mute detection is carried out to the target audio, detects mute section in the target audio；

The multiple audio sections for identifying that the target audio includes according to described mute section, wherein any two is adjacent There are one described mute section at interval between the audio section.

7. a kind of determining device of emotion information, which is characterized in that including：

First acquisition unit, for obtaining target audio, wherein the target audio includes multiple audio sections；

Recognition unit, for identifying multiple first text messages from multiple audio sections, wherein any one of One text message identifies from a corresponding audio section, and the audio section has a phonetic feature, described first Text message has text feature；

First determination unit, for based on multiple audio sections phonetic feature and multiple first text messages have Text feature determines the target emotion information of multiple audio sections.

8. device according to claim 7, which is characterized in that described device further includes：

Second acquisition unit, for based on multiple audio sections phonetic feature and multiple first text messages have Text feature determine the target emotion informations of multiple audio sections after, obtain each in multiple target emotion informations Emotion grade belonging to the target emotion information；

Second determination unit is used for when multiple target emotion informations include the emotion information of the first emotion grade, really The emotion information of the fixed target audio is the emotion information of the first emotion grade.

9. device according to claim 7, which is characterized in that first determination unit determines each as follows The target emotion information of the audio section：

10. device according to claim 7, which is characterized in that described device further includes：

Detection unit, for before identifying multiple first text messages in multiple audio sections, to the target sound Frequency carries out mute detection, detects mute section in the target audio；

Third determination unit, multiple audio sections for identifying that the target audio includes according to described mute section, In, there are one described mute section at interval between the adjacent audio section of any two.

11. a kind of storage medium, which is characterized in that the storage medium includes the program of storage, wherein when described program is run Execute the method described in 1 to 6 any one of the claims.

12. a kind of electronic device, including memory, processor and it is stored on the memory and can transports on the processor Capable computer program, which is characterized in that the processor executes the claims 1 to 6 by the computer program Method described in one.