CN108305642B - Method and apparatus for determining emotion information - Google Patents

Method and apparatus for determining emotion information

Info

Publication number
CN108305642B
Authority
CN
China
Prior art keywords
emotion
text
information
feature
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710527116.4A
Other languages
Chinese (zh)
Other versions
CN108305642A (en)
Inventor
刘海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710527116.4A
Priority to PCT/CN2018/093085 (WO2019001458A1)
Publication of CN108305642A
Application granted
Publication of CN108305642B
Legal status: Active
Anticipated expiration


Classifications

    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L25/12 Speech or voice analysis techniques characterised by the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and an apparatus for determining emotion information. The method includes: obtaining a target audio; recognizing first text information from the target audio, where the target audio has a speech feature and the first text information has a text feature; and determining target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio. The invention solves the technical problem in the related art that the emotion information of a speaker cannot be accurately recognized.

Description

Method and apparatus for determining emotion information
Technical field
The present invention relates to the field of the Internet, and in particular to a method and an apparatus for determining emotion information.
Background Art
With the growth of multimedia content, there is market demand for technology that can summarize content so that it can be viewed or listened to in a short time. In addition, content types are becoming increasingly diverse, for example films, TV series, home videos, news, documentaries, music content, live scenes, web novels, and text news; correspondingly, the viewing and listening requirements of audiences are also becoming more diverse.
Along with this diversification of viewing requirements, technologies are needed that can immediately retrieve and present content matching an audience member's requirements and viewing scenario, such as content summarization. In content summarization, the text information contained in the content is analyzed to determine the emotion it carries, such as laughter, anger, or sadness.
In the above analysis, an audio-based emotion detection method may be used to detect the speaker's audio. Performing emotion detection on audio works relatively well when the speaker expresses emotion clearly. However, when the speaker's emotional expression is weak, for example when something very happy is stated in a very flat tone, the audio carries almost no features that express happiness. In such a case, speech-based emotion detection performs poorly: it cannot make an accurate decision from the speech feature alone and may even produce a wrong decision.
For the technical problem in the related art that a speaker's emotion information cannot be accurately recognized, no effective solution has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a method and an apparatus for determining emotion information, so as to at least solve the technical problem in the related art that a speaker's emotion information cannot be accurately recognized.
According to one aspect of the embodiments of the present invention, a method for determining emotion information is provided. The method includes: obtaining a target audio; recognizing first text information from the target audio, where the target audio has a speech feature and the first text information has a text feature; and determining target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio.
According to another aspect of the embodiments of the present invention, an apparatus for determining emotion information is further provided. The apparatus includes: an obtaining unit, configured to obtain a target audio; a recognition unit, configured to recognize first text information from the target audio, where the target audio has a speech feature and the first text information has a text feature; and a determination unit, configured to determine target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio.
In the embodiments of the present invention, when the target audio is obtained, the first text information is recognized from the target audio, and the target emotion information of the target audio is then determined based on the text feature of the first text information and the speech feature of the target audio. In other words, when the emotion is clearly expressed in the text, the emotion information can be determined from the text feature of the text information; when the emotion is clearly expressed in the target audio, the emotion information can be determined from the speech feature of the target audio. This solves the technical problem in the related art that a speaker's emotion information cannot be accurately recognized, and thereby achieves the technical effect of improving the accuracy of recognizing the speaker's emotion information.
Brief description of the drawings
The drawings described herein are provided for further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a schematic diagram of the hardware environment of the method for determining emotion information according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional method for determining emotion information according to an embodiment of the present invention;
Fig. 3 is a flowchart of optionally training a convolutional neural network model according to an embodiment of the present invention;
Fig. 4 is a flowchart of optionally training a deep neural network model according to an embodiment of the present invention;
Fig. 5 is a flowchart of an optional method for determining emotion information according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an optional apparatus for determining emotion information according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an optional apparatus for determining emotion information according to an embodiment of the present invention; and
Fig. 8 is a structural block diagram of a terminal according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and so on in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of the method for determining emotion information is provided.
Optionally, in this embodiment, the above method for determining emotion information can be applied to the hardware environment shown in Fig. 1, which is formed by a server 102 and a terminal 104. As shown in Fig. 1, the server 102 is connected to the terminal 104 through a network, which includes but is not limited to a wide area network, a metropolitan area network, or a local area network, and the terminal 104 is not limited to a PC, a mobile phone, a tablet computer, or the like. The method for determining emotion information in this embodiment of the present invention may be executed by the server 102, by the terminal 104, or jointly by the server 102 and the terminal 104. When the terminal 104 executes the method, it may also be executed by a client installed on the terminal.
When the method for determining emotion information in this embodiment is executed by the server or the terminal alone, the program code corresponding to the method of this application is executed directly on the server or the terminal.
When the method for determining emotion information in this embodiment is executed jointly by the server and the terminal, the terminal initiates the request to recognize the target audio and sends the target speech to be recognized to the server; the server then executes the program code corresponding to the method of this application and feeds back the recognition result to the terminal.
The embodiment of this application is described in detail below by taking the case in which the program code corresponding to the method of this application is executed on the server or the terminal as an example. Fig. 2 is a flowchart of an optional method for determining emotion information according to an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps:
Step S202: obtain a target audio.
The terminal may actively obtain the target audio, receive a target audio sent by another device, or obtain the target audio under the trigger of a target instruction. The target instruction is equivalent to an instruction, triggered by the user or the terminal, for recognizing the target audio. The target audio is obtained in order to recognize its emotion information, that is, the emotion information expressed when the text information is stated through the target audio (including but not limited to what is shown by the wording or tone in the text, and by the tone and timbre in the audio).
The above text information refers to one sentence or a combination of multiple sentences; a text includes but is not limited to a sentence, a paragraph, or a discourse.
Emotion information is information describing the speaker's emotion. For example, when talking about something, the speaker expresses an emotion related to happiness (happy, flat, or sad); when receiving an apology, the speaker expresses an emotion related to forgiveness (forgive, noncommittal, or not forgive).
Step S204: recognize first text information from the target audio, where the target audio has a speech feature and the first text information has a text feature.
Recognizing the first text information from the target audio means recognizing, by means of speech recognition, the first text information expressed by the target audio (the recognized first text information may differ slightly from the text actually stated).
For speech recognition, the speech feature includes features of the following aspects: perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBANK), pitch (PITCH, e.g. treble and bass), speech energy (ENERGY), i-vector (an important feature reflecting acoustic differences between speakers), and so on. One or more of the above features may be used in this application; preferably, multiple features are used.
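As a non-authoritative illustration of this multi-feature extraction, the sketch below computes several of the listed features with librosa; the library, the parameter values, and the simple frame-level stacking are assumptions of the sketch, and PLP and i-vector extraction are omitted.

```python
# Minimal sketch of fused acoustic feature extraction (librosa is assumed).
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # MFCC
    fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)   # FBANK-like
    energy = librosa.feature.rms(y=y)                               # frame energy (ENERGY)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)                   # pitch (PITCH)
    # Trim to a common frame count and stack into one fused matrix (frames x dims).
    n = min(mfcc.shape[1], fbank.shape[1], energy.shape[1], len(f0))
    fused = np.vstack([mfcc[:, :n], np.log(fbank[:, :n] + 1e-8),
                       energy[:, :n], f0[:n][None, :]]).T
    return fused
```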
For text recognition, the above first text information can be recognized from the target audio by a speech recognition engine. The text feature of the text information includes features such as the emotion type, emotion polarity, and emotion intensity of each phrase or word in the text, and may also include association features between phrases.
Step S206: determine the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio.
When determining the target emotion information of the target audio, the text feature of the first text information and the speech feature of the target audio are considered together. Compared with the related art, in which only an audio-based emotion detection method is used to detect the speaker's audio, audio-based emotion detection works relatively well when the speaker expresses emotion clearly; but when the emotional expression is weak, for example when something very happy is stated in a very flat tone, the audio carries almost no features expressing happiness. In that case, a text-based emotion detection method can also be used to detect the text information in the speaker's audio, so that an accurate decision can be made from the text feature, thereby compensating for the shortcomings of emotion detection based on audio alone and improving the accuracy of the decision.
Through steps S202 to S206, when the target audio is obtained, the first text information is recognized from the target audio, and the target emotion information of the target audio is then determined based on the text feature of the first text information and the speech feature of the target audio. In other words, when the emotion is clearly expressed in the text, the emotion information can be determined from the text feature; when the emotion is clearly expressed in the target audio, the emotion information can be determined from the speech feature. This solves the technical problem in the related art that a speaker's emotion information cannot be accurately recognized, and thereby improves the accuracy of recognizing the speaker's emotion information.
Detecting the speaker's audio with only an audio-based emotion detection method works relatively well when the speaker expresses emotion clearly, and detecting the text information in the speaker's audio with a text-based emotion detection method also works relatively well when the text expresses emotion clearly. However, it is unknown in advance when (i.e. in which scenario or for which kind of speech) the audio-based method or the text-based method should be used; it is impossible to predict which method will give a better detection result for the audio currently to be detected.
The applicant considers that if a text with an obvious emotion is stated in a flat tone (for example, a happy text stated in a flat tone), the text-based emotion detection method clearly gives a better recognition result; if a text is stated in a tone with an obvious emotion (for example, an emotionally flat text stated in a happy tone), the audio-based emotion detection method clearly gives a better result. A text with an obvious emotion may be stated in either a flat tone or an emotionally obvious tone, and an emotionally flat text may also be stated in either; however, a text with an obvious positive emotion will not be stated in a tone of the opposite emotion, for example a text with a happy colouring will not be stated in a sad tone.
Therefore, on the basis of the above observation, as long as either the speech or the text carries an obvious emotional colouring (i.e. emotion information of the first emotion grade), the target speech can be determined to be speech with that emotional colouring. When determining the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio: a first recognition result determined according to the text feature is obtained, where the first recognition result indicates the emotion information recognized from the text feature; a second recognition result determined according to the speech feature is obtained, where the second recognition result indicates the emotion information recognized from the speech feature; and when the emotion information indicated by at least one of the first recognition result and the second recognition result is emotion information of the first emotion grade, the target emotion information of the target audio is determined to be the emotion information of the first emotion grade.
The above first emotion grade is a grade with obvious emotion information, rather than information that tends toward the neutral middle (no obvious emotion). For example, for the group of emotions happy, flat, and sad, the emotion information of the first emotion grade refers to happy or sad rather than flat; other kinds of emotion information are similar and are not repeated here.
In the recognition described in this application, feature recognition and emotion recognition may be performed using, but not limited to, common algorithms or machine-learning algorithms. To improve accuracy, machine-learning algorithms may be used for feature recognition and emotion recognition. This is described in detail below:
(1) CNN training process based on text recognition
Before steps S202 to S206 of this application are executed, the algorithm model may first be trained: before the target audio is obtained, a second convolutional neural network model (the original convolutional neural network model) is trained with second text information (training text) and first emotion information to determine the values of the parameters in the second convolutional neural network model, and the second convolutional neural network model with the determined parameter values is set as the first convolutional neural network model, where the first emotion information is the emotion information of the second text information. As shown in Fig. 3:
Step S301: segment the training text.
The training sentence is segmented. For example, the training sentence "I got paid today, I am very happy" is segmented into: today / got paid / I / very / happy. The emotion label (actual emotion information) of this training sentence is happy.
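A small segmentation sketch for the example above follows; jieba is an assumed tokenizer (the patent only says the sentence is segmented), and the Chinese sentence is a reconstruction of the translated example.

```python
# Word segmentation of the training sentence (jieba is an assumption).
import jieba

sentence = "今天发工资，我很高兴"        # reconstructed: "I got paid today, I am very happy"
tokens = list(jieba.cut(sentence))
print(tokens)                           # e.g. ['今天', '发工资', '，', '我', '很', '高兴']
label = "happy"                         # emotion label attached to this training sentence
```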
Step S302: train the CNN model (i.e. the second convolutional neural network model).
Step S3021: word2vector (word vectors).
A word vector, as the name suggests, represents a word in the form of a vector. Since machine learning tasks require the input to be quantized into a numerical representation so that the computing power of computers can be fully used to obtain the final result, word vectors are needed.
According to the number of words in the training sentence, an n*k matrix is formed, where n is the number of words in the training sentence and k is the dimension of the vector. The type of this matrix can be fixed or dynamic and is chosen according to the specific situation.
There are currently several mature and stable word2vector algorithms; this application may choose CBOW or Skip-gram. Both the CBOW model and the Skip-gram model can be based on a Huffman tree, in which the intermediate vectors stored at the non-leaf nodes are initialized to zero vectors, while the word vectors corresponding to the leaf nodes are randomly initialized.
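A minimal word-vector training sketch is shown below as one possible realisation of the CBOW / Skip-gram step; gensim (version 4 or later) and its hierarchical-softmax option are assumptions, not details given by the patent.

```python
# Word2vector sketch with gensim: CBOW (sg=0) or Skip-gram (sg=1),
# hs=1 enables the Huffman-tree (hierarchical softmax) variant.
from gensim.models import Word2Vec

corpus = [["今天", "发工资", "我", "很", "高兴"],
          ["明天", "要", "放假", "我", "好", "开心"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=3,
                 min_count=1, sg=0, hs=1)
vec = model.wv["高兴"]        # k-dimensional word vector for one token
print(vec.shape)              # (100,)
```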
Step S3022: feature extraction by the convolutional layer.
The n*k matrix generated in step S3021 passes through the convolutional layer to obtain several single-column matrices. This layer is similar to a feature extraction layer and performs feature extraction.
Step S3023: pooling by the pooling layer.
For the single-column matrices generated in step S3022, the one or several largest feature values can be selected as new features according to the actual situation. After this layer, features of a fixed dimension are formed, which solves the problem of varying sentence length.
Step S3024: processing by the NN layers.
The new features generated in step S3023 pass through one or more neural network layers according to the actual situation; the last layer is a softmax layer. Through the NN layers, an attribute label or score is obtained.
Step S3025: back-propagation (BP).
After step S3024 produces an attribute label or score, the parameters are updated by propagating back the error between the actual emotion label of the training sentence and the recognized attribute. After several rounds of iteration, the model reaches its optimum, the training process is complete, and the CNN model (the first convolutional neural network model) is obtained.
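The following is a minimal sketch of the text CNN described in steps S3021 to S3025; PyTorch, the layer sizes, and the three-way label set (happy / flat / sad) are assumptions rather than details fixed by the patent.

```python
# Text-emotion CNN sketch: embedding -> convolution -> max pooling -> NN layer,
# with cross-entropy loss providing the softmax + back-propagation of step S3025.
import torch
import torch.nn as nn

class TextEmotionCNN(nn.Module):
    def __init__(self, vocab_size: int, k: int = 100, num_emotions: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)            # word2vector lookup (n x k matrix)
        self.conv = nn.Conv2d(1, 64, kernel_size=(3, k))    # produces single-column feature maps
        self.fc = nn.Linear(64, num_emotions)               # NN layer; softmax lives in the loss

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).unsqueeze(1)      # (batch, 1, n, k)
        x = torch.relu(self.conv(x)).squeeze(3)     # (batch, 64, n-2)
        x = torch.max(x, dim=2).values              # max pooling over the sentence (step S3023)
        return self.fc(x)                           # attribute label / score

model = TextEmotionCNN(vocab_size=5000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
logits = model(torch.randint(0, 5000, (8, 20)))     # batch of 8 sentences, 20 tokens each
loss = loss_fn(logits, torch.randint(0, 3, (8,)))   # error against the actual emotion labels
loss.backward()
optimizer.step()                                    # step S3025: parameter update
```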
(2) Speech-based DNN training process
Before steps S202 to S206 of this application are executed, training the algorithm model further includes: before the target audio is obtained, training a second deep neural network model with training audio (or training speech) and second emotion information to determine the values of the parameters in the second deep neural network model, and setting the second deep neural network model with the determined parameter values as the first deep neural network model, where the second emotion information is the emotion information of the training audio. This is described in detail below with reference to Fig. 4:
Step S401: perform feature extraction on the training audio.
Feature extraction is performed on the training speech. Many kinds of features can be extracted, such as PLP, MFCC, FBANK, PITCH, ENERGY, and i-vector, and one or more of these features can be extracted. The features preferably used in this application are a fusion of multiple features.
Step S402: train the DNN (the second deep neural network model) with the extracted features.
A DNN model containing one or more neural network layers is selected according to the actual situation; the last layer of the DNN model is a softmax layer (a regression model). The fused features obtained in the previous step are expanded with preceding and following frames, passed through the DNN layers of the deep neural network, and then output through the softmax layer.
The DNN model may also include a back-propagation (BP) layer. The BP layer takes the label or score output by the softmax layer and its difference from the emotion label, processes it with the BP algorithm, and updates the parameters of the DNN. After several rounds of iteration, the model reaches its optimum, and the first deep neural network model is obtained; this step is not needed during recognition.
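A minimal sketch of the acoustic DNN of steps S401 and S402 follows, again assuming PyTorch; the feature dimension, layer widths, and optimizer are illustrative only, and the input is assumed to be a pre-computed fused acoustic feature vector.

```python
# Acoustic-emotion DNN sketch: fused features -> hidden layers -> softmax (via the loss),
# with a few BP iterations updating the parameters as described for Fig. 4.
import torch
import torch.nn as nn

class AcousticEmotionDNN(nn.Module):
    def __init__(self, feat_dim: int = 120, num_emotions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_emotions),   # softmax is applied inside CrossEntropyLoss
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        return self.net(fused_features)

model = AcousticEmotionDNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(3):                                    # a few BP rounds
    logits = model(torch.randn(16, 120))              # batch of fused acoustic feature vectors
    loss = loss_fn(logits, torch.randint(0, 3, (16,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```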
(3) Joint training based on speech and text
In (1) and (2) above, the two models are trained separately, and the internal association between speech and text is not exploited during recognition; text and speech are recognized separately. To exploit the internal association between speech and text, the model can be jointly trained with speech and text:
Before the target audio is obtained, the second deep neural network model is trained with the training audio and the second text information to determine the values of the parameters in the second deep neural network model, and the second deep neural network model with the determined parameter values is set as the first deep neural network model.
The above training audio has a second speech feature, and the second text information has a second text feature. Training the second deep neural network model with the training audio and the second text information to determine the values of its parameters, and setting the second deep neural network model with the determined parameter values as the first deep neural network model, specifically includes:
Step 1: use the second speech feature and the second text feature as inputs of the second deep neural network model to train it, where training the second deep neural network model includes assigning values to the parameters in the second deep neural network model, and the training audio carries the first emotion information;
Step 2: when the second emotion information output by the second deep neural network model matches the first emotion information, set the second deep neural network model with the assigned parameters as the first deep neural network model, where the first deep neural network model is used to recognize emotion information according to an association relationship, and the association relationship describes the association among the emotion information, the speech feature, and the first text feature;
Step 3: when the second emotion information output by the second deep neural network model does not match the first emotion information, adjust the values assigned to the parameters in the second deep neural network model so that the second emotion information output after the adjustment matches the first emotion information;
Step 4: when recognition is performed with the trained model (i.e. when determining the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio), use the first speech feature and the first text feature as inputs of the first deep neural network model, and obtain the target emotion information of the target audio determined by the first deep neural network model according to the first speech feature and the first text feature. A sketch of this joint scheme follows.
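The sketch below illustrates steps 1 to 3 under stated assumptions: PyTorch is used, the speech feature and the text feature are simply concatenated before entering one DNN, and a small training loss is taken as a proxy for "the output matches the label". The patent only states that both features are inputs of the same model, so the concatenation strategy is an assumption.

```python
# Joint speech+text emotion DNN sketch (steps 1-3 of the joint training).
import torch
import torch.nn as nn

class JointEmotionDNN(nn.Module):
    def __init__(self, speech_dim: int = 120, text_dim: int = 100, num_emotions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, speech_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([speech_feat, text_feat], dim=-1))

model = JointEmotionDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
speech_feat, text_feat = torch.randn(8, 120), torch.randn(8, 100)   # second speech/text features
labels = torch.randint(0, 3, (8,))                                  # the first emotion information
for _ in range(500):                         # steps 2-3: iterate until the output matches the label
    loss = loss_fn(model(speech_feat, text_feat), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < 0.05:                   # proxy for "second emotion information matches"
        break
```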
In the technical solution provided in step S202, the target audio is obtained, for example a piece of audio input by the user at the terminal through an audio input device (such as a microphone).
In the technical solution provided in step S204, the first text information is recognized from the target audio, where the target audio has a speech feature and the first text information has a text feature.
The extraction and selection of acoustic features is an important part of speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, the purpose being to enable the pattern classifier to divide the data better. Because of the time-varying characteristics of the speech signal, feature extraction must be performed on a short segment of the speech signal, i.e. short-time analysis. Such a segment, considered stationary, is called a frame, and the offset between frames is usually 1/2 or 1/3 of the frame length. When extracting the speech feature of the target audio, the signal is usually pre-emphasized to boost the high frequencies and windowed to avoid the influence of the edges of the short-time speech segments. The above process of obtaining the first text information can be realized by a speech recognition engine.
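A short-time analysis sketch of the paragraph above is given below, assuming NumPy; the frame length, hop size, pre-emphasis coefficient, and Hamming window are illustrative choices, not values fixed by the patent.

```python
# Pre-emphasis, framing (hop = half the frame length), and windowing.
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 400, hop: int = 200,
                 pre_emphasis: float = 0.97) -> np.ndarray:
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])  # boost high freqs
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)                                              # soften frame edges
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames            # each row is one quasi-stationary analysis frame

frames = frame_signal(np.random.randn(16000))   # 1 s of 16 kHz audio -> 79 frames of 25 ms
```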
In the technical solution provided in step S206, the target emotion information of the target audio is determined based on the text feature of the first text information and the speech feature of the target audio. The technical solution provided in step S206 includes at least the following two implementations:
(1) Mode one
When determining the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio: a first recognition result determined according to the text feature is obtained, where the first recognition result indicates the emotion information recognized from the text feature; a second recognition result determined according to the speech feature is obtained, where the second recognition result indicates the emotion information recognized from the speech feature; and when the emotion information indicated by at least one of the first recognition result and the second recognition result is emotion information of the first emotion grade, the target emotion information of the target audio is determined to be the emotion information of the first emotion grade. For example, for the group of emotions happy, flat, and sad, as long as one of the first recognition result and the second recognition result is happy or sad, the final result (target emotion information) is happy or sad, ignoring the influence of the flat emotion information that has no obvious emotional tendency.
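A minimal sketch of this "mode one" decision rule follows. The label set {happy, flat, sad} comes from the example above; the tie-break when both results are non-neutral (preferring the text result) is an assumption, since the patent does not specify one.

```python
# Mode one: if either result carries a first-grade (clearly non-neutral) emotion, use it.
FIRST_GRADE = {"happy", "sad"}            # emotions with an obvious colouring

def decide_mode_one(text_result: str, audio_result: str) -> str:
    if text_result in FIRST_GRADE:
        return text_result
    if audio_result in FIRST_GRADE:
        return audio_result
    return "flat"

print(decide_mode_one("happy", "flat"))   # -> "happy"
print(decide_mode_one("flat", "sad"))     # -> "sad"
```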
The above first recognition result and second recognition result may directly be the recognized emotion information, or may be other information (such as an emotion score or emotion type) indicating the recognized emotion information.
Optionally, recognition of the text feature is realized by the first convolutional neural network model. When obtaining the first recognition result determined according to the text feature, the first recognition result determined according to the text feature recognized from the first text information is obtained directly from the first convolutional neural network model.
Obtaining from the first convolutional neural network model the first recognition result determined according to the text feature recognized from the first text information includes: performing feature extraction on the first text information in multiple feature dimensions through the feature extraction layer of the first convolutional neural network model to obtain multiple text features, with one text feature extracted in each feature dimension; and performing feature recognition on the first text feature among the multiple text features through the classification layer of the first convolutional neural network model to obtain the first recognition result (i.e. selecting the one or several features with the largest feature values), where the text features include a first text feature and second text features, and the feature value of the first text feature is greater than the feature value of any second text feature.
Recognition of the speech feature is realized by the first deep neural network model. When obtaining the second recognition result determined according to the speech feature, the second recognition result determined according to the speech feature recognized from the target audio is obtained directly from the first deep neural network model.
(2) Mode two
Determining the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio includes: obtaining a first recognition result determined according to the text feature, where the first recognition result includes a first emotion parameter indicating the emotion information recognized from the text feature; obtaining a second recognition result determined according to the speech feature, where the second recognition result includes a second emotion parameter indicating the emotion information recognized from the speech feature; setting a third emotion parameter final_score indicating the target emotion information as: the first emotion parameter Score1 multiplied by the weight a set for the first emotion parameter, plus the second emotion parameter Score2 multiplied by the weight (1-a) set for the second emotion parameter; and determining the emotion information of the second emotion grade as the target emotion information, where the second emotion grade is the emotion grade corresponding to the emotion parameter interval in which the third emotion parameter falls, and each emotion grade corresponds to one emotion parameter interval.
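A minimal sketch of this weighted fusion and interval mapping is shown below, using the interval boundaries given in the embodiment later in this description ([0, 0.3) sad, [0.3, 0.7) flat, [0.7, 1] happy); the weight value used in the example is illustrative, since the patent obtains a from development-set training.

```python
# Mode two: final_score = a * score1 + (1 - a) * score2, then map to an emotion interval.
def fuse_scores(score1: float, score2: float, a: float = 0.5) -> float:
    return a * score1 + (1 - a) * score2          # third emotion parameter

def score_to_emotion(final_score: float) -> str:
    if final_score < 0.3:
        return "sad"
    if final_score < 0.7:
        return "flat"
    return "happy"

final_score = fuse_scores(score1=0.9, score2=0.6, a=0.4)   # -> 0.72
print(score_to_emotion(final_score))                       # -> "happy"
```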
It should be noted that when obtaining the first recognition result determined according to the text feature and the second recognition result determined according to the speech feature, the models used in mode one above may be referred to for the calculation.
Optionally, after the target emotion information of the target audio is determined based on the text feature of the first text information and the speech feature of the target audio, the target audio is played and its target emotion information is displayed; feedback information from the user is received, where the feedback information includes indication information on whether the recognized target emotion information is correct, and when it is incorrect, the feedback information further includes the actual emotion information recognized by the user from the played target audio.
If the recognized target emotion information is incorrect, the recognition accuracy of the convolutional neural network model and the deep neural network model still needs to be improved, especially for this kind of mis-recognized audio information, for which the recognition rate is worse. In this case, a negative feedback mechanism is used to improve the recognition rate: the mis-recognized audio information is used to retrain the convolutional neural network model and the deep neural network model in the manner described above, so as to adjust the parameter values of the two models and improve their recognition accuracy.
Optionally, when determining the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio, the target audio can be divided into several audio segments, and multiple pieces of first text information are recognized from the multiple audio segments, where each piece of first text information is recognized from a corresponding audio segment, the audio segment has a speech feature, and the first text information has a text feature; the target emotion information of the multiple audio segments is then determined based on the speech features of the multiple audio segments and the text features of the multiple pieces of first text information.
Determining the target emotion information of the multiple audio segments based on the speech features of the multiple audio segments and the text features of the multiple pieces of first text information includes determining the target emotion information of each audio segment as follows: obtaining a first recognition result determined according to the text feature of the first text information (i.e. the first recognition result determined by the obtained convolutional neural network model according to the text feature recognized from the first text information), where the first recognition result indicates the emotion information recognized from the text feature; obtaining a second recognition result determined according to the speech feature of the audio segment corresponding to the first text information (i.e. the second recognition result determined by the obtained deep neural network model according to the speech feature recognized from the audio segment), where the second recognition result indicates the emotion information recognized from the speech feature; and when the emotion information indicated by at least one of the first recognition result and the second recognition result is emotion information of the first emotion grade, determining the target emotion information of the audio segment as the emotion information of the first emotion grade.
The first recognition result determined by the convolutional neural network model according to the text feature recognized from the first text information can be obtained as follows: performing feature extraction on the first text information in multiple feature dimensions through the feature extraction layer of the convolutional neural network model to obtain multiple text features, where one text feature is extracted in each feature dimension; and performing feature recognition on the first text feature among the multiple text features through the classification layer of the convolutional neural network model to obtain the first recognition result, where the text features include a first text feature and second text features, and the feature value of the first text feature is greater than the feature value of any second text feature.
Obtaining the second recognition result determined by the deep neural network model according to the speech feature recognized from the audio segment is similar to the way of obtaining the first recognition result above and is not described again here.
In this solution, the method of fusing text and speech can compensate for the shortcomings of recognition based on a single feature. The fusion combines text training with audio training, and the fusion method can be to sum the text output result and the audio output result with a weight to obtain the final result; moreover, the summation is not over the whole passage but is performed segment by segment, because a speaker's emotion cannot remain unchanged over a whole passage but will rise and fall, and in a passage the emotion may be stronger around a few keywords. In this way, the emotional characteristics of the speaker at different stages of the whole passage can be recognized.
On the basis of the above observation, as long as the speech or the text carries an obvious emotional colouring (i.e. emotion information of the first emotion grade), the target speech can be determined to be speech with that emotional colouring. After the target emotion information of the multiple audio segments is determined based on the speech features of the multiple audio segments and the text features of the multiple pieces of first text information, the emotion grade of each of the multiple pieces of target emotion information can be obtained; when the multiple pieces of target emotion information include emotion information of the first emotion grade, the emotion information of the target audio is determined to be the emotion information of the first emotion grade.
The above first emotion grade is a grade with obvious emotion information, rather than information tending toward the neutral middle (no obvious emotion). For example, for the group of emotions happy, flat, and sad, the emotion information of the first emotion grade refers to happy or sad rather than flat; other kinds of emotion information are similar and are not repeated here.
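A minimal sketch of this segment-level aggregation is given below: each audio segment gets its own emotion, and the whole audio is labelled with a first-grade emotion as soon as any segment carries one. Picking the most frequent non-neutral label when several appear is an assumption, since the patent does not fix a tie-break.

```python
# Segment-level aggregation of per-segment emotion labels.
from collections import Counter

FIRST_GRADE = {"happy", "sad"}

def aggregate_segments(segment_emotions: list) -> str:
    strong = [e for e in segment_emotions if e in FIRST_GRADE]
    if strong:
        return Counter(strong).most_common(1)[0][0]   # most frequent first-grade emotion
    return "flat"

print(aggregate_segments(["flat", "flat", "happy", "flat"]))   # -> "happy"
```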
As an optional embodiment, an embodiment of this application is described in detail below with reference to Fig. 5:
Step S501: extract the speech feature (i.e. acoustic feature) of the target audio.
Step S502: perform speech recognition through the speech recognition engine.
In the training stage of the speech recognition engine, each word in the vocabulary is stated in turn, and its feature vector is stored in a template library as a template.
In the stage of speech recognition by the speech recognition engine, the acoustic feature vector of the input speech is compared for similarity with each template in the template library in turn, and the template with the highest similarity is output as the recognition result.
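A toy sketch of this template-matching step follows; real engines use frame-level alignment (e.g. DTW or HMM decoding), so treating each template as a single feature vector compared by cosine similarity is a simplifying assumption.

```python
# Template matching: return the vocabulary word whose stored template is most
# similar to the input feature vector.
import numpy as np

def recognize(input_vec: np.ndarray, templates: dict) -> str:
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(templates, key=lambda word: cosine(input_vec, templates[word]))

templates = {"高兴": np.random.randn(64), "放假": np.random.randn(64)}  # toy template library
print(recognize(np.random.randn(64), templates))
```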
Step S503: obtain the text recognition result (i.e. the first text information).
Step S504: segment the first text information. For example, "Tomorrow is a holiday, I am so happy" is segmented into: tomorrow / will / have a holiday / I / so / happy.
Step S505: use the multiple words obtained by the above segmentation as the input of the CNN model; the CNN model performs convolution, classification, and recognition on the multiple words.
Step S506: obtain the first recognition result score1 output by the CNN model.
Step S507: process the speech feature of the target audio with the DNN model.
The DNN model performs recognition according to several of the speech features recognized above (perceptual linear prediction PLP, Mel-frequency cepstral coefficients MFCC, FBANK, pitch PITCH, speech energy ENERGY, and i-vector).
Step S508: obtain the second recognition result score2.
The convolutional layer of the DNN model performs convolution and classification on these fused features (multiple features) to obtain the final recognition result score2.
Step S509: fuse the recognition results to obtain the final result.
The input target audio goes through feature extraction, which is of two kinds. One kind is used for speech recognition: it passes through the speech recognition engine to obtain the speech recognition result, which is segmented and sent to the text emotion detection engine to obtain the text emotion score score1. The other kind is used for audio-based emotion detection: the extracted features are sent to the audio emotion detector to obtain the audio score score2. The final score final_score is then obtained through a weight factor:
final_score = a * score1 + (1 - a) * score2
Here a is a weight value obtained by training on a development set, and the final score is a value between 0 and 1.
For example, the score interval corresponding to sad is [0, 0.3), the interval corresponding to flat is [0.3, 0.7), and the interval corresponding to happy is [0.7, 1]. According to the finally obtained score value, it can be determined whether the actual emotion is happy, sad, or flat.
In the embodiments of this application, the method of fusing text and speech can compensate for the shortcomings of the individual methods. In the fusion of the two, a weight factor can be added to adjust the weights of the two methods so as to suit different occasions. This application can be divided into two modules, a training module and a recognition module. The training module can be trained separately, choosing different texts and audio according to different situations. The three emotion categories in this application are happy, normal, and unhappy; the degree of happiness or unhappiness can be expressed by a score between 0 and 1, where the closer the emotion score is to 0, the more negative the mood, and the closer to 1, the more positive the mood. For application, the decision can be made for a whole sentence.
It should be noted that, for brevity, the foregoing method embodiments are described as a series of action combinations. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the present invention. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or the part contributing to the related art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, an apparatus for determining emotion information, used to implement the above method for determining emotion information, is further provided. Fig. 6 is a schematic diagram of an optional apparatus for determining emotion information according to an embodiment of the present invention. As shown in Fig. 6, the apparatus may include: an obtaining unit 61, a recognition unit 62, and a determination unit 63.
The obtaining unit 61 is configured to obtain a target audio.
The terminal may actively obtain the target audio, receive a target audio sent by another device, or obtain the target audio under the trigger of a target instruction. The target instruction is equivalent to an instruction, triggered by the user or the terminal, for recognizing the target audio. The target audio is obtained in order to recognize its emotion information, that is, the emotion information expressed when the text information is stated through the target audio (including but not limited to what is shown by the wording or tone in the text, and by the tone and timbre in the audio).
The above text information refers to one sentence or a combination of multiple sentences; a text includes but is not limited to a sentence, a paragraph, or a discourse.
Emotion information is information describing the speaker's emotion. For example, when talking about something, the speaker expresses an emotion related to happiness (happy, flat, or sad); when receiving an apology, the speaker expresses an emotion related to forgiveness (forgive, noncommittal, or not forgive).
The recognition unit 62 is configured to recognize first text information from the target audio, where the target audio has a speech feature and the first text information has a text feature.
Recognizing the first text information from the target audio means recognizing, by means of speech recognition, the first text information expressed by the target audio (the recognized first text information may differ slightly from the text actually stated).
For speech recognition, the speech feature includes features of the following aspects: perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBANK), pitch (PITCH, e.g. treble and bass), speech energy (ENERGY), i-vector (an important feature reflecting acoustic differences between speakers), and so on. One or more of the above features may be used in this application; preferably, multiple features are used.
For text recognition, the above first text information can be recognized from the target audio by a speech recognition engine. The text feature of the text information includes features such as the emotion type, emotion polarity, and emotion intensity of each phrase or word in the text, and may also include association features between phrases.
The determination unit 63 is configured to determine the target emotion information of the target audio based on the text feature of the first text information and the speech feature of the target audio.
When determining the target emotion information of the target audio, the text feature of the first text information and the speech feature of the target audio are considered together. Compared with the related art, in which only an audio-based emotion detection method is used to detect the speaker's audio, audio-based emotion detection works relatively well when the speaker expresses emotion clearly; but when the emotional expression is weak, for example when something very happy is stated in a very flat tone, the audio carries almost no features expressing happiness. In that case, a text-based emotion detection method can also be used to detect the text information in the speaker's audio, so that an accurate decision can be made from the text feature, thereby compensating for the shortcomings of emotion detection based on audio alone and improving the accuracy of the decision.
It should be noted that the obtaining unit 61 in this embodiment can be used to perform step S202 in Embodiment 1 of this application, the recognition unit 62 can be used to perform step S204, and the determination unit 63 can be used to perform step S206.
It should be noted here that the examples and application scenarios realized by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1 above. It should also be noted that the above modules, as part of the apparatus, may run in the hardware environment shown in Fig. 1 and may be implemented by software or by hardware.
Through the above modules, when the target audio is obtained, the first text information is recognized from the target audio, and the target emotion information of the target audio is then determined based on the text feature of the first text information and the speech feature of the target audio. In other words, when the emotion is clearly expressed in the text, the emotion information can be determined from the text feature; when the emotion is clearly expressed in the audio, the emotion information can be determined from the speech feature. This solves the technical problem in the related art that a speaker's emotion information cannot be accurately recognized, and thereby improves the accuracy of recognizing the speaker's emotion information.
Using only the audio-based emotion detection method on the speaker's audio works well when the speaker's emotional expression is obvious; using the text-based emotion detection method on the text information in the speaker's audio likewise works well when the text expresses the emotion obviously. However, it is unknown in advance when (i.e., for which scene or which kind of speech) the audio-based method should be used and when the text-based method should be used; it cannot be predicted which method will give the better detection result for the current audio to be detected.
The applicant considers that if a text with an obvious emotion is stated in a flat tone (for example, a happy text stated in a flat tone), the text-based emotion detection method clearly performs better, whereas if a text is stated in a tone with obvious emotion (for example, a relatively flat text stated in a happy tone), the audio-based emotion detection method clearly performs better. A text with obvious emotion may be stated either in a flat tone or in an emotionally obvious tone, and a relatively flat text may likewise be stated in an emotionally obvious tone or in a flat tone; however, a text with an obvious positive emotion will not be stated in a tone of the opposite emotion, for example a text with a happy emotional colour will not be stated in a sad tone.
Therefore, on the basis of the above understanding, as long as either the speech or the text carries an obvious emotional colour (i.e., emotion information of the first emotion grade), the target speech can be determined to be speech with that emotional colour. As shown in FIG. 7, the determination unit may implement the above technical solution through the following modules: a first obtaining module 631, configured to obtain the first recognition result determined according to the text feature, where the first recognition result indicates the emotion information recognized according to the text feature; a second obtaining module 632, configured to obtain the second recognition result determined according to the phonetic feature, where the second recognition result indicates the emotion information recognized according to the phonetic feature; and a first determining module 633, configured to determine the target emotion information of the target audio as the emotion information of the first emotion grade when the emotion information indicated by at least one of the first recognition result and the second recognition result is of the first emotion grade.
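The sketch below illustrates this decision rule under stated assumptions: if either the text-based result or the audio-based result reports a "first grade" (clearly coloured) emotion, that emotion becomes the target emotion. The label names are illustrative assumptions, not labels fixed by this application.

```python
# A minimal sketch of the first-determining-module rule described above.
FIRST_GRADE = {"happy", "sad"}      # emotions with an obvious colour
NEUTRAL = "neutral"                 # the flat, middle grade

def decide_target_emotion(text_result: str, audio_result: str) -> str:
    """Prefer whichever modality reports a clearly coloured emotion."""
    if text_result in FIRST_GRADE:
        return text_result
    if audio_result in FIRST_GRADE:
        return audio_result
    return NEUTRAL

# a happy text read in a flat tone is still judged "happy"
print(decide_target_emotion("happy", "neutral"))
```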
The above first emotion grade is a grade with obvious emotion information, rather than information that tends toward the neutral middle (without obvious emotion). For example, for the group of emotions happy, neutral, sad, the emotion information of the first emotion grade refers to happy or sad rather than neutral; other kinds of emotion information are treated similarly and are not described again.
In the recognition-related technical solutions of this application, feature recognition and the recognition of emotion information may be performed using, but not limited to, common algorithms or machine-learning algorithms; to improve accuracy, machine-learning algorithms may be used for both feature recognition and the recognition of emotion information.
Optionally, before the acquiring unit obtains the target audio, a first training unit trains the second convolutional neural network model with the second text information and the first emotion information to determine the values of the parameters in the second convolutional neural network model, and sets the second convolutional neural network model with the determined parameter values as the first convolutional neural network model, where the first emotion information is the emotion information of the second text information.
Optionally, before the acquiring unit obtains the target audio, a second training unit trains the second deep neural network model with the training audio and the second emotion information to determine the values of the parameters in the second deep neural network model, and sets the second deep neural network model with the determined parameter values as the first deep neural network model, where the second emotion information is the emotion information of the training audio.
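As an illustration of this training step, the sketch below fits a small feed-forward network (standing in for the "second deep neural network model") on labelled training audio and then freezes it for use as the first model. It is not the actual training procedure of this application; the feature size, labels, and data are illustrative placeholders.

```python
# A minimal sketch: determine the parameter values of a small audio emotion
# model on labelled training audio, then use the trained model at run time.
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES = 15, 3               # e.g. MFCC+energy+pitch; happy/normal/unhappy

second_model = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, N_CLASSES),
)
optimizer = torch.optim.Adam(second_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, N_FEATURES)            # placeholder training-audio features
y = torch.randint(0, N_CLASSES, (256,))     # placeholder emotion labels

for epoch in range(20):                     # determine the parameter values
    optimizer.zero_grad()
    loss = loss_fn(second_model(x), y)
    loss.backward()
    optimizer.step()

second_model.eval()                         # the trained model serves as the first model
first_deep_neural_network_model = second_model
```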
After the recognition models have been trained, the above first obtaining module obtains the first recognition result determined by the first convolutional neural network model according to the text feature recognized from the first text information. The feature extraction layer of the first convolutional neural network model performs feature extraction on the first text information in multiple characteristic dimensions to obtain multiple text features, one text feature being extracted in each characteristic dimension. The classification layer of the first convolutional neural network model then performs feature recognition on the first text feature among the multiple text features to obtain the first recognition result, where the text features include the first text feature and second text features, and the characteristic value of the first text feature is greater than the characteristic value of any second text feature.
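The sketch below shows the kind of text model this describes, not the patent's exact network: convolution filters of several widths act as the feature extraction layer, global max pooling keeps, per dimension, the feature with the largest value (the "first text feature"), and a linear classification layer produces the recognition result. All sizes are illustrative.

```python
# A minimal sketch of a text convolutional network with max-value selection.
import torch
import torch.nn as nn

class TextEmotionCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, n_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # feature extraction layer: one 1-D convolution per characteristic dimension
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 64, kernel_size=k) for k in (2, 3, 4)])
        self.fc = nn.Linear(64 * 3, n_classes)           # classification layer

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed, seq)
        # keep the largest-valued feature in each dimension ("first text feature")
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # emotion scores

scores = TextEmotionCNN()(torch.randint(0, 5000, (1, 20)))
print(scores.shape)                                      # (1, 3)
```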
What the above second obtaining module obtains as the second recognition result determined according to the phonetic feature is the second recognition result determined by the first deep neural network model according to the phonetic feature recognized from the target audio.
Optionally, the determination unit of this application may also include: a third obtaining module, configured to obtain the first recognition result determined according to the text feature, where the first recognition result includes a first emotion parameter indicating the emotion information recognized according to the text feature; a fourth obtaining module, configured to obtain the second recognition result determined according to the phonetic feature, where the second recognition result includes a second emotion parameter indicating the emotion information recognized according to the phonetic feature; a setting module, configured to set a third emotion parameter indicating the target emotion information as: the first emotion parameter * the weight set for the first emotion parameter + the second emotion parameter * the weight set for the second emotion parameter; and a second determining module, configured to determine the emotion information at the second emotion grade as the target emotion information, where the second emotion grade is the emotion grade corresponding to the emotion parameter interval in which the third emotion parameter falls, each emotion grade corresponding to one emotion parameter interval.
The input target audio goes through feature extraction, which splits into two branches. One branch is used for speech recognition: the features are sent to the speech recognition engine to obtain a speech recognition result, which is segmented into words and sent to the text emotion detection engine to obtain a text emotion score score1. The other branch is used for audio-based emotion detection: the extracted features are sent to the audio emotion detector to obtain an audio score score2. The final score final_score is then obtained through a weight factor:
final_score = a * score1 + (1 - a) * score2
Here a is a weight value obtained by training on a development set, and the final score is a value between 0 and 1.
In the embodiments of this application, the method that fuses text and speech can make up for the shortcomings of either method used alone; a weight factor can be introduced during the fusion to adjust the weights of the two methods so that the approach suits different occasions. The application can be divided into two modules, a training module and a recognition module. The training module can be trained separately, choosing different texts and audio for different situations. The three emotion classes in this application are happy, normal, and unhappy; the degree of happiness or unhappiness can be represented by a score between 0 and 1, where a score closer to 0 indicates a more negative mood and a score closer to 1 a more positive mood. In application, the decision can be made for a whole sentence.
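The sketch below shows one way the weight factor a might be obtained from a development set, as mentioned above: candidate values are tried and the one whose fused score classifies the development utterances best is kept. The toy data, the 0.5 threshold, and the accuracy criterion are illustrative assumptions; the actual tuning procedure is not specified here.

```python
# A minimal sketch: pick the weight factor a by grid search on a development set.
import numpy as np

def accuracy(a, score1, score2, labels, threshold=0.5):
    final = a * score1 + (1 - a) * score2            # final_score formula
    return np.mean((final >= threshold) == labels)   # labels: 1 = positive mood

def tune_weight(score1, score2, labels):
    candidates = np.linspace(0.0, 1.0, 21)
    return max(candidates, key=lambda a: accuracy(a, score1, score2, labels))

s1 = np.array([0.9, 0.2, 0.7, 0.4])   # text emotion scores on the dev set
s2 = np.array([0.6, 0.3, 0.8, 0.1])   # audio emotion scores on the dev set
y = np.array([1, 0, 1, 0])            # development-set labels
print(tune_weight(s1, s2, y))
```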
It should be noted here that the examples and application scenarios implemented by the above modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1. The above modules, as part of the apparatus, may run in the hardware environment shown in FIG. 1 and may be implemented in software or in hardware, where the hardware environment includes a network environment.
Embodiment 3
According to an embodiment of the present invention, a server or terminal (namely, an electronic apparatus) for implementing the above method for determining emotion information is also provided.
Fig. 8 is a structural block diagram of a terminal according to an embodiment of the present invention. As shown in FIG. 8, the terminal may include one or more processors 801 (only one is shown in FIG. 8), a memory 803, and a transmitting device 805 (such as the sending device in the above embodiments); as shown in FIG. 8, the terminal may also include an input/output device 807.
The memory 803 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for determining emotion information in the embodiments of the present invention. By running the software programs and modules stored in the memory 803, the processor 801 executes various functional applications and data processing, thereby implementing the above method for determining emotion information. The memory 803 may include a high-speed random access memory and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 803 may further include memories remotely located relative to the processor 801, and these remote memories may be connected to the terminal through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The above transmitting device 805 is used to receive or send data via a network and may also be used for data transmission between the processor and the memory. Specific examples of the above network may include wired and wireless networks. In one example, the transmitting device 805 includes a network interface controller (Network Interface Controller, NIC), which can be connected to other network devices and a router via a network cable so as to communicate with the Internet or a local area network. In another example, the transmitting device 805 is a radio frequency (Radio Frequency, RF) module used to communicate with the Internet wirelessly.
Specifically, the memory 803 is used to store the application program.
Through the transmitting device 805, the processor 801 can call the application program stored in the memory 803 to perform the following steps: obtaining the target audio; recognizing the first text information from the target audio, where the target audio has a phonetic feature and the first text information has a text feature; and determining the target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio.
The processor 801 is also used to perform the following steps: obtaining the first recognition result determined according to the text feature, where the first recognition result indicates the emotion information recognized according to the text feature; obtaining the second recognition result determined according to the phonetic feature, where the second recognition result indicates the emotion information recognized according to the phonetic feature; and, when the emotion information indicated by at least one of the first recognition result and the second recognition result is the emotion information of the first emotion grade, determining the target emotion information of the target audio as the emotion information of the first emotion grade.
With this embodiment of the present invention, when the target audio is obtained, the first text information is recognized from it, and the target emotion information of the target audio is then determined based on the text feature of the first text information and the phonetic feature of the target audio. In other words, when the emotion is clearly expressed in the text information, it can be determined from the text feature of the text information; when the emotion is clearly expressed in the target audio, it can be determined from the phonetic feature of the target audio. This solves the technical problem in the related art that the speaker's emotion information cannot be recognized accurately, and thus improves the accuracy of recognizing the speaker's emotion information.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in Embodiment 1 and Embodiment 2 above, and details are not described here again.
Those skilled in the art can understand that the structure shown in FIG. 8 is only illustrative. The terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID), a PAD, or another terminal device. FIG. 8 does not limit the structure of the above electronic apparatus. For example, the terminal may include more or fewer components than shown in FIG. 8 (such as a network interface or a display device), or have a configuration different from that shown in FIG. 8.
Those of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing hardware related to the terminal device through a program. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and the like.
Embodiment 4
An embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the above storage medium may be used to store program code for executing the method for determining emotion information.
Optionally, in this embodiment, the above storage medium may be located on at least one of multiple network devices in the network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
S11: obtain the target audio;
S12: recognize the first text information from the target audio, where the target audio has a phonetic feature and the first text information has a text feature;
S13: determine the target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio.
Optionally, the storage medium is also configured to store program code for performing the following steps:
S21: obtain the first recognition result determined according to the text feature, where the first recognition result indicates the emotion information recognized according to the text feature;
S22: obtain the second recognition result determined according to the phonetic feature, where the second recognition result indicates the emotion information recognized according to the phonetic feature;
S23: when the emotion information indicated by at least one of the first recognition result and the second recognition result is the emotion information of the first emotion grade, determine the target emotion information of the target audio as the emotion information of the first emotion grade.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in Embodiment 1 and Embodiment 2 above, and details are not described here again.
Optionally, in this embodiment, the above storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division of logical functions; there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as within the protection scope of the present invention.

Claims (10)

1. A method for determining emotion information, characterized by comprising:
obtaining target audio;
recognizing first text information from the target audio, wherein the target audio has a phonetic feature and the first text information has a text feature;
determining target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio, wherein determining the target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio comprises: obtaining a first recognition result determined according to the text feature, wherein the first recognition result indicates the emotion information recognized according to the text feature; obtaining a second recognition result determined according to the phonetic feature, wherein the second recognition result indicates the emotion information recognized according to the phonetic feature; and, when the emotion information indicated by at least one of the first recognition result and the second recognition result is emotion information of a first emotion grade, determining the target emotion information of the target audio as the emotion information of the first emotion grade, the first emotion grade being a grade with obvious emotion information rather than a grade without obvious emotion.
2. The method according to claim 1, characterized in that:
obtaining the first recognition result determined according to the text feature comprises: obtaining the first recognition result determined by a first convolutional neural network model according to the text feature recognized from the first text information;
obtaining the second recognition result determined according to the phonetic feature comprises: obtaining the second recognition result determined by a first deep neural network model according to the phonetic feature recognized from the target audio.
3. The method according to claim 2, characterized in that obtaining the first recognition result determined by the first convolutional neural network model according to the text feature recognized from the first text information comprises:
performing, by a feature extraction layer of the first convolutional neural network model, feature extraction on the first text information in multiple characteristic dimensions to obtain multiple text features, wherein one text feature is extracted in each characteristic dimension;
performing, by a classification layer of the first convolutional neural network model, feature recognition on a first text feature among the multiple text features to obtain the first recognition result, wherein the text features comprise the first text feature and second text features, and a characteristic value of the first text feature is greater than the characteristic value of any one of the second text features.
4. The method according to claim 2, characterized in that, before obtaining the target audio, the method further comprises:
training a second convolutional neural network model with second text information and first emotion information to determine the values of the parameters in the second convolutional neural network model, and setting the second convolutional neural network model with the determined parameter values as the first convolutional neural network model, wherein the first emotion information is the emotion information of the second text information.
5. The method according to claim 2, characterized in that, before obtaining the target audio, the method further comprises:
training a second deep neural network model with training audio and second emotion information to determine the values of the parameters in the second deep neural network model, and setting the second deep neural network model with the determined parameter values as the first deep neural network model, wherein the second emotion information is the emotion information of the training audio.
6. The method according to claim 1, characterized in that determining the target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio comprises:
obtaining a first recognition result determined according to the text feature, wherein the first recognition result comprises a first emotion parameter indicating the emotion information recognized according to the text feature;
obtaining a second recognition result determined according to the phonetic feature, wherein the second recognition result comprises a second emotion parameter indicating the emotion information recognized according to the phonetic feature;
setting a third emotion parameter indicating the target emotion information as: the first emotion parameter * the weight set for the first emotion parameter + the second emotion parameter * the weight set for the second emotion parameter;
determining the emotion information at a second emotion grade as the target emotion information, wherein the second emotion grade is the emotion grade corresponding to the emotion parameter interval in which the third emotion parameter falls, and each emotion grade corresponds to one emotion parameter interval.
7. An apparatus for determining emotion information, characterized by comprising:
an acquiring unit, configured to obtain target audio;
a recognition unit, configured to recognize first text information from the target audio, wherein the target audio has a phonetic feature and the first text information has a text feature;
a determination unit, configured to determine target emotion information of the target audio based on the text feature of the first text information and the phonetic feature of the target audio, wherein the determination unit comprises: a first obtaining module, configured to obtain a first recognition result determined according to the text feature, wherein the first recognition result indicates the emotion information recognized according to the text feature; a second obtaining module, configured to obtain a second recognition result determined according to the phonetic feature, wherein the second recognition result indicates the emotion information recognized according to the phonetic feature; and a first determining module, configured to determine, when the emotion information indicated by at least one of the first recognition result and the second recognition result is emotion information of a first emotion grade, the target emotion information of the target audio as the emotion information of the first emotion grade, the first emotion grade being a grade with obvious emotion information rather than a grade without obvious emotion.
8. The apparatus according to claim 7, characterized in that the determination unit comprises:
a third obtaining module, configured to obtain a first recognition result determined according to the text feature, wherein the first recognition result comprises a first emotion parameter indicating the emotion information recognized according to the text feature;
a fourth obtaining module, configured to obtain a second recognition result determined according to the phonetic feature, wherein the second recognition result comprises a second emotion parameter indicating the emotion information recognized according to the phonetic feature;
a setting module, configured to set a third emotion parameter indicating the target emotion information as: the first emotion parameter * the weight set for the first emotion parameter + the second emotion parameter * the weight set for the second emotion parameter;
a second determining module, configured to determine the emotion information at a second emotion grade as the target emotion information, wherein the second emotion grade is the emotion grade corresponding to the emotion parameter interval in which the third emotion parameter falls, and each emotion grade corresponds to one emotion parameter interval.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, the method according to any one of claims 1 to 6 is performed.
10. An electronic apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the method according to any one of claims 1 to 6 through the computer program.
CN201710527116.4A 2017-06-30 2017-06-30 The determination method and apparatus of emotion information Active CN108305642B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710527116.4A CN108305642B (en) 2017-06-30 2017-06-30 The determination method and apparatus of emotion information
PCT/CN2018/093085 WO2019001458A1 (en) 2017-06-30 2018-06-27 Method and device for determining emotion information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710527116.4A CN108305642B (en) 2017-06-30 2017-06-30 The determination method and apparatus of emotion information

Publications (2)

Publication Number Publication Date
CN108305642A CN108305642A (en) 2018-07-20
CN108305642B true CN108305642B (en) 2019-07-19

Family

ID=62872598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710527116.4A Active CN108305642B (en) 2017-06-30 2017-06-30 The determination method and apparatus of emotion information

Country Status (1)

Country Link
CN (1) CN108305642B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109192192A (en) * 2018-08-10 2019-01-11 北京猎户星空科技有限公司 A kind of Language Identification, device, translator, medium and equipment
CN109741732B (en) 2018-08-30 2022-06-21 京东方科技集团股份有限公司 Named entity recognition method, named entity recognition device, equipment and medium
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN109740156B (en) * 2018-12-28 2023-08-04 北京金山安全软件有限公司 Feedback information processing method and device, electronic equipment and storage medium
CN109767765A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Talk about art matching process and device, storage medium, computer equipment
CN111681645B (en) * 2019-02-25 2023-03-31 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN110070893A (en) * 2019-03-25 2019-07-30 成都品果科技有限公司 A kind of system, method and apparatus carrying out sentiment analysis using vagitus
CN110097894B (en) * 2019-05-21 2021-06-11 焦点科技股份有限公司 End-to-end speech emotion recognition method and system
CN110188361A (en) * 2019-06-10 2019-08-30 北京智合大方科技有限公司 Speech intention recognition methods and device in conjunction with text, voice and emotional characteristics
CN110473571A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Emotion identification method and device based on short video speech
CN110688499A (en) * 2019-08-13 2020-01-14 深圳壹账通智能科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110599999A (en) * 2019-09-17 2019-12-20 寇晓宇 Data interaction method and device and robot
CN110910901B (en) * 2019-10-08 2023-03-28 平安科技(深圳)有限公司 Emotion recognition method and device, electronic equipment and readable storage medium
CN110825503B (en) * 2019-10-12 2024-03-19 平安科技(深圳)有限公司 Theme switching method and device, storage medium and server
CN110827799B (en) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium
CN111091810A (en) * 2019-12-19 2020-05-01 佛山科学技术学院 VR game character expression control method based on voice information and storage medium
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device
CN111324207A (en) * 2020-02-28 2020-06-23 京东方科技集团股份有限公司 Drawing display method and device and electronic equipment
CN111510563A (en) * 2020-04-16 2020-08-07 中国银行股份有限公司 Intelligent outbound method and device, storage medium and electronic equipment
CN112002348B (en) * 2020-09-07 2021-12-28 复旦大学 Method and system for recognizing speech anger emotion of patient
CN113192498A (en) * 2021-05-26 2021-07-30 北京捷通华声科技股份有限公司 Audio data processing method and device, processor and nonvolatile storage medium
CN113645364B (en) * 2021-06-21 2023-08-22 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method for power dispatching
CN113241060B (en) * 2021-07-09 2021-12-17 明品云(北京)数据科技有限公司 Security early warning method and system
CN117409818A (en) * 2022-07-08 2024-01-16 顺丰科技有限公司 Speech emotion recognition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456314A (en) * 2013-09-03 2013-12-18 广州创维平面显示科技有限公司 Emotion recognition method and device
CN104102627A (en) * 2014-07-11 2014-10-15 合肥工业大学 Multi-mode non-contact emotion analyzing and recording system
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN105760852A (en) * 2016-03-14 2016-07-13 江苏大学 Driver emotion real time identification method fusing facial expressions and voices
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method

Also Published As

Publication number Publication date
CN108305642A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108305642B (en) The determination method and apparatus of emotion information
CN108305643A (en) The determination method and apparatus of emotion information
CN108305641B (en) Method and device for determining emotion information
US10706873B2 (en) Real-time speaker state analytics platform
US20060080098A1 (en) Apparatus and method for speech processing using paralinguistic information in vector form
Yeh et al. Segment-based emotion recognition from continuous Mandarin Chinese speech
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN109155132A (en) Speaker verification method and system
Triantafyllopoulos et al. Deep speaker conditioning for speech emotion recognition
WO2020237769A1 (en) Accompaniment purity evaluation method and related device
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN108986798B (en) Processing method, device and the equipment of voice data
CN113239147A (en) Intelligent conversation method, system and medium based on graph neural network
CN110111778A (en) A kind of method of speech processing, device, storage medium and electronic equipment
WO2019001458A1 (en) Method and device for determining emotion information
CN108829739A (en) A kind of information-pushing method and device
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
Lanjewar et al. Speech emotion recognition: a review
CN105895079A (en) Voice data processing method and device
Johar Paralinguistic profiling using speech recognition
CN113327631B (en) Emotion recognition model training method, emotion recognition method and emotion recognition device
WO2020162239A1 (en) Paralinguistic information estimation model learning device, paralinguistic information estimation device, and program
Afshan et al. Attention-based conditioning methods using variable frame rate for style-robust speaker verification
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant