CN108039181A - Method and device for analyzing the emotion information of a sound signal - Google Patents

Method and device for analyzing the emotion information of a sound signal

Info

Publication number
CN108039181A
Authority
CN
China
Prior art keywords
expressed
information
emotion
emotion information
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711065483.3A
Other languages
Chinese (zh)
Other versions
CN108039181B (en)
Inventor
王富田
李健
张连毅
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Beijing Sinovoice Technology Co Ltd
Original Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority to CN201711065483.3A
Publication of CN108039181A
Application granted
Publication of CN108039181B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state

Abstract

An embodiment of the present invention provides a method and device for analyzing the emotion information of a sound signal. When the emotion information expressed by a sound signal uttered by a user is analyzed, the method extracts the text information and speech parameter information from the sound signal; performs text emotion analysis on the text information to obtain the emotion information expressed by the text information, and performs speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameters; and obtains the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information. Embodiments of the present invention can improve the accuracy of determining the emotion information expressed by a sound signal.

Description

Method and device for analyzing the emotion information of a sound signal
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for analyzing the emotion information of a sound signal.
Background technology
People express various emotions when speaking, such as happiness, anger, shock, sadness, and neutrality.
With the rapid development of technology, intelligent voice interaction terminals are widely used, and more and more enterprises use them to provide services to users. To improve the quality of those services, an intelligent voice interaction terminal generally needs to analyze the emotion that the sound signal uttered by a user is meant to express.
In the prior art, an intelligent voice interaction terminal can analyze the emotion information expressed by the sound signal a user utters, for example by determining the user's emotion from the loudness, intonation, and speaking rate of the user's speech. For example, a user who is very angry may say "What you did makes people furious" loudly, quickly, and with a high intonation to express anger, and the intelligent voice interaction terminal infers from the loudness, speaking rate, and intonation of the utterance that the user is very angry.
However, the inventors found that if the user is very angry but says "What you did makes people furious" in a relatively calm tone, the loudness, intonation, and speaking rate of the utterance do not reach the standard for anger. The intelligent voice interaction terminal will therefore not classify the emotion information expressed by the sentence as anger and is likely to classify it as neutral instead. Such misjudgments make the determination of the emotion information expressed by the user's sound signal less accurate.
Summary of the invention
The technical problem to be solved by embodiments of the present invention is the low accuracy of determining the emotion information expressed by the sound signal a user utters.
To improve the accuracy of determining the emotion information expressed by the sound signal a user utters, embodiments of the present invention provide a method and device for analyzing the emotion of a sound signal.
In a first aspect, an embodiment of the present invention provides a method for analyzing the emotion of a sound signal, the method including:
extracting the text information and speech parameter information from a sound signal;
performing text emotion analysis on the text information to obtain the emotion information expressed by the text information;
performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information;
obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
Wherein, performing text emotion analysis on the text information to obtain the emotion information expressed by the text information includes:
performing text emotion analysis on the text information using an LSTM algorithm to obtain the probability value of each emotion expressed by the text information.
Wherein, performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information includes:
performing speech emotion analysis on the speech parameters using a CNN algorithm to obtain the probability value of each emotion expressed by the speech parameters.
Wherein, obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information includes:
for each emotion, calculating the combined probability value of that emotion for the sound signal from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information;
determining the emotion with the highest combined probability value as the emotion information expressed by the sound signal.
Wherein, the calculation from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information includes:
calculating a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient;
calculating a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient;
calculating a third product between the first product and a preset matrix vector for the emotion;
calculating a fourth product between the second product and the preset matrix vector for the emotion;
obtaining the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product.
In a second aspect, an embodiment of the present invention provides a device for analyzing the emotion information of a sound signal, the device including:
an extraction module for extracting the text information and speech parameter information from a sound signal;
a first analysis module for performing text emotion analysis on the text information to obtain the emotion information expressed by the text information;
a second analysis module for performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information;
an acquisition module for obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
Wherein, the first analysis module is specifically configured to perform text emotion analysis on the text information using an LSTM algorithm to obtain the probability value of each emotion expressed by the text information.
Wherein, the second analysis module is specifically configured to perform speech emotion analysis on the speech parameters using a CNN algorithm to obtain the probability value of each emotion expressed by the speech parameters.
Wherein, the acquisition module includes:
a computing unit for, for each emotion, calculating the combined probability value of that emotion for the sound signal from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information;
a determination unit for determining the emotion with the highest combined probability value as the emotion information expressed by the sound signal.
Wherein, the computing unit includes:
a first computation subunit for calculating a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient;
a second computation subunit for calculating a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient;
a third computation subunit for calculating a third product between the first product and a preset matrix vector for the emotion;
a fourth computation subunit for calculating a fourth product between the second product and the preset matrix vector for the emotion;
an acquisition subunit for obtaining the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product.
Compared with the prior art, embodiments of the present invention have the following advantages:
In embodiments of the present invention, when analyzing the emotion information expressed by a sound signal uttered by a user, the text information and speech parameter information are extracted from the sound signal; text emotion analysis is performed on the text information to obtain the emotion information expressed by the text information, and speech emotion analysis is performed on the speech parameter information to obtain the emotion information expressed by the speech parameters; and the emotion information expressed by the sound signal is obtained from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
When determining the emotion information expressed by a sound signal, the prior art relies only on the loudness, intonation, and speaking rate in the sound signal, whereas embodiments of the present invention determine it from both the text information and the speech parameter information in the sound signal.
Compared with the prior art, embodiments of the present invention combine the text information with the speech parameter information and therefore analyze the emotion information expressed by the sound signal more comprehensively. This avoids the misjudgments that occur in the prior art and thus improves the accuracy of determining the emotion information expressed by a sound signal.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an embodiment of a method for analyzing the emotion information of a sound signal according to the present invention;
Fig. 2 is a structural diagram of an embodiment of a device for analyzing the emotion information of a sound signal according to the present invention.
Detailed description
To make the above objectives, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flowchart of the steps of an embodiment of a method for analyzing the emotion information of a sound signal according to the present invention is shown. The method may specifically include the following steps:
Step S101: extract the text information and speech parameter information from the sound signal.
In embodiments of the present invention, a DNN (Deep Neural Network) algorithm may be used to extract the text information and speech parameter information from the sound signal; alternatively, an LSTM (Long Short-Term Memory) algorithm and a CTC (Connectionist Temporal Classification) model may be used to extract them.
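As an illustration of the LSTM-plus-CTC option, the following is a minimal PyTorch sketch of an acoustic model trained with a CTC objective. The feature dimensions, vocabulary size, and batch shapes are all illustrative assumptions, not part of the patent.

```python
# A minimal sketch of LSTM + CTC transcription for step S101. All dimensions,
# the vocabulary, and the feature pipeline are illustrative assumptions.
import torch
import torch.nn as nn

class CTCTranscriber(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_tokens=5000):
        super().__init__()
        # Bidirectional LSTM over acoustic feature frames
        self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Project to token logits; index 0 is reserved for the CTC blank
        self.proj = nn.Linear(2 * n_hidden, n_tokens + 1)

    def forward(self, feats):                  # feats: (batch, time, n_feats)
        out, _ = self.lstm(feats)
        return self.proj(out).log_softmax(-1)  # (batch, time, n_tokens + 1)

model = CTCTranscriber()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(8, 200, 40)               # a dummy batch of feature frames
log_probs = model(feats).transpose(0, 1)      # CTCLoss wants (time, batch, C)
targets = torch.randint(1, 5001, (8, 20))     # dummy token transcripts
loss = ctc_loss(log_probs, targets,
                torch.full((8,), 200), torch.full((8,), 20))
```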
The text information is the content expressed by the sound signal. For example, if a user says "You make me so angry", those words can be the text information of the sound signal.
The speech parameter information includes the speaking rate, signal-to-noise ratio, loudness, tone, average pitch, pitch range, and pitch variation of the sound signal.
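The patent does not specify how these parameters are computed. The numpy sketch below shows one plausible way to derive a few of them (loudness, average pitch, pitch range, pitch variation) from raw samples; the frame sizes, the 16 kHz sampling rate, and the simple autocorrelation pitch tracker are assumptions.

```python
# A rough sketch of extracting a few of the speech parameters listed above.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):           # 25 ms / 10 ms at 16 kHz
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def pitch_autocorr(frame, sr=16000, fmin=60.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag if ac[lag] > 0.3 * ac[0] else 0.0   # 0.0 = unvoiced frame

def speech_parameters(x, sr=16000):
    frames = frame_signal(x)
    rms = np.sqrt((frames ** 2).mean(axis=1))           # per-frame loudness
    f0 = np.array([pitch_autocorr(f, sr) for f in frames])
    voiced = f0[f0 > 0]
    return {
        "loudness": float(rms.mean()),
        "average_pitch": float(voiced.mean()) if voiced.size else 0.0,
        "pitch_range": float(np.ptp(voiced)) if voiced.size else 0.0,
        "pitch_variation": float(voiced.std()) if voiced.size else 0.0,
    }
```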
In embodiments of the present invention, after a sound signal is emitted through the user's mouth and nose, its strength at some frequencies is attenuated; for example, the signal strength at high frequencies drops below that at low frequencies. This distorts the sound signal and in turn reduces the accuracy of the emotion information determined from it. Therefore, to improve that accuracy, the signal strength of the sound signal needs to be measured at each frequency, and where the strength is found to be low the signal strength at those frequencies should be boosted.
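One standard way to perform this kind of compensation is a first-order pre-emphasis filter, sketched below. The patent does not name a specific technique, and the coefficient 0.97 is a conventional choice assumed here for illustration.

```python
# A minimal pre-emphasis sketch, assuming the conventional coefficient 0.97.
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: flattens the spectral tilt so the
    # attenuated high-frequency band is no longer under-represented
    return np.append(x[0], x[1:] - alpha * x[:-1])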
In another embodiment of the present invention, the sound signal needs to be split over time into multiple short segments, and short-time signal strength analysis, short-time zero-crossing analysis, short-time average signal strength analysis, correlation analysis, and average signal strength difference analysis are performed on each segment, in order to identify the unvoiced and voiced parts of the sound signal so that its speech parameter information can be extracted afterwards.
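A sketch of two of these short-time analyses follows, assuming 25 ms frames with a 10 ms hop and toy thresholds: voiced speech tends to show high short-time energy and a low zero-crossing rate, while unvoiced speech shows the opposite.

```python
# Per-frame short-time energy and zero-crossing rate, with a toy
# voiced/unvoiced/silence rule; the thresholds are assumptions.
import numpy as np

def short_time_features(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])
    energy = (frames ** 2).mean(axis=1)                     # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return energy, zcr

def voiced_unvoiced(energy, zcr, z_thresh=0.25):
    e_thresh = 0.5 * energy.mean()
    return np.where((energy > e_thresh) & (zcr < z_thresh), "voiced",
           np.where(energy > 0.1 * e_thresh, "unvoiced", "silence"))
```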
Moreover, the environment in which the user speaks usually contains noise, and whereas the noise is typically present all the time, the sound signal is not. It is therefore necessary to detect whether a sound signal is present, and for this, methods such as the double-threshold decision method can be used to detect the start point and end point of the sound signal and so delimit it. This avoids processing the noise mixed around the sound signal, which reduces the amount of data processed and the processing time, and it also prevents that noise from influencing the emotion analysis of the sound signal, improving the accuracy of the analysis result.
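A compact sketch of a double-threshold endpoint detector over the per-frame energies from the previous snippet: a high threshold locates the region that is certainly speech, and a lower threshold extends it outward to the start and end points. The threshold ratios and the noise-floor estimate are assumptions.

```python
# Double-threshold endpoint detection over per-frame energy; the ratios
# and the median noise-floor estimate are illustrative assumptions.
import numpy as np

def detect_endpoints(energy, high_ratio=4.0, low_ratio=1.5):
    noise_floor = np.median(energy)            # assume noise dominates the clip
    high, low = high_ratio * noise_floor, low_ratio * noise_floor
    above = np.flatnonzero(energy > high)
    if above.size == 0:
        return None                            # no sound signal detected
    start, end = above[0], above[-1]
    while start > 0 and energy[start - 1] > low:            # extend left
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > low:  # extend right
        end += 1
    return start, end                          # frame indices of the signal
```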
Step S102: perform text emotion analysis on the text information to obtain the emotion information expressed by the text information.
In embodiments of the present invention, an LSTM algorithm can be used to perform emotion analysis on the text information, yielding the probability value of each emotion expressed by the text information; these probability values serve as the emotion information expressed by the text information.
Of course, when performing text emotion analysis on the text information, embodiments of the present invention may also use other text emotion analysis methods; the text emotion analysis method used is not limited.
In embodiments of the present invention, technical staff can preset multiple emotions locally in advance, for example happy, angry, shocked, sad, worried, and neutral. After the text information is analyzed, the probability values that the text information expresses anger, happiness, shock, sadness, worry, and neutrality, respectively, can be obtained.
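A minimal sketch of an LSTM text-emotion classifier of the kind step S102 describes, producing one probability value per preset emotion; the vocabulary size, layer dimensions, and the six labels taken from the example above are assumptions.

```python
# A toy LSTM text-emotion analyzer: embed tokens, run an LSTM, and emit a
# probability for each preset emotion. All sizes are assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "angry", "shocked", "sad", "worried", "neutral"]

class TextEmotionLSTM(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(EMOTIONS))

    def forward(self, token_ids):                   # (batch, seq_len)
        out, _ = self.lstm(self.emb(token_ids))
        return self.head(out[:, -1]).softmax(-1)    # (batch, 6) probabilities

model = TextEmotionLSTM()
tokens = torch.randint(0, 30000, (1, 8))            # dummy tokenized transcript
p_text = dict(zip(EMOTIONS, model(tokens)[0].tolist()))
```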
Step S103: perform speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameters.
In embodiments of the present invention, a CNN (Convolutional Neural Network) algorithm is used to perform speech emotion analysis on the speech parameters, yielding the probability value of each emotion expressed by the speech parameters; these probability values serve as the emotion information expressed by the speech parameter information.
For example, the probability values that the speech parameters express anger, happiness, shock, sadness, worry, and neutrality, respectively, are obtained.
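A matching sketch for step S103: a small 1-D CNN over frame-level speech-parameter tracks that outputs the same six probability values. The number of input parameter tracks, channel counts, and kernel sizes are assumptions.

```python
# A toy 1-D CNN speech-emotion analyzer over frame-level parameter tracks
# (e.g. energy, pitch, ZCR); the architecture details are assumptions.
import torch
import torch.nn as nn

class SpeechEmotionCNN(nn.Module):
    def __init__(self, n_params=8, n_emotions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_params, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # pool over time
        )
        self.head = nn.Linear(64, n_emotions)

    def forward(self, params):                      # (batch, n_params, time)
        return self.head(self.net(params).squeeze(-1)).softmax(-1)

model = SpeechEmotionCNN()
p_speech = model(torch.randn(1, 8, 200))[0]         # six probability values
```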
Of course, when performing speech emotion analysis on the speech parameters, embodiments of the present invention may also use other speech emotion analysis methods; the speech emotion analysis method used is not limited.
Step S104: obtain the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
In embodiments of the present invention, for any one of the preset emotions, the combined probability value of that emotion for the sound signal can be calculated from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information. Performing the same operation for each of the other preset emotions yields the combined probability value of every emotion for the sound signal. The emotion with the highest combined probability value is then determined as the emotion information expressed by the sound signal.
The specific calculation of the combined probability value of an emotion for the sound signal, from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information, can be realized by the following procedure:
calculate a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient; calculate a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient; calculate a third product between the first product and a preset matrix vector for the emotion; calculate a fourth product between the second product and the preset matrix vector for the emotion; and obtain the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product, for example by feeding the third and fourth products into a tanh function.
In embodiments of the present invention, the preset speech emotion coefficient may be the same as or different from the preset text emotion coefficient.
Technical staff can analyze a large number of sound signals expressing user emotion in advance and measure the weight with which the text information and the speech parameters each convey emotion. If the weight carried by the text information exceeds the weight carried by the speech parameter information, the preset text emotion coefficient can be set greater than the preset speech emotion coefficient; if it is smaller, the preset text emotion coefficient can be set smaller than the preset speech emotion coefficient; and if the two weights are equal, the coefficients can be set equal. The chosen preset text emotion coefficient and preset speech emotion coefficient are then stored locally, so that in step S104 they can be read directly and used to calculate the first, second, third, and fourth products and obtain the combined probability value of the emotion expressed by the sound signal, for example by feeding the third and fourth products into a tanh function. A sketch of this fusion follows.
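Here is a sketch of the fusion under one reading of the formula above, where each emotion's preset "matrix vector" is taken as a scalar for simplicity; the coefficient values and the example probabilities are assumptions. Note how an angry transcript outweighs a calm voice, which is exactly the misjudgment the invention is designed to avoid.

```python
# One reading of the step S104 fusion: scale the two per-emotion probabilities
# by the preset coefficients, multiply each by the emotion's preset matrix
# vector (a scalar here), and feed the sum through tanh. Values are assumed.
import numpy as np

EMOTIONS = ["happy", "angry", "shocked", "sad", "worried", "neutral"]
TEXT_COEF, SPEECH_COEF = 0.6, 0.4          # preset emotion coefficients
MATRIX_VEC = {e: 1.0 for e in EMOTIONS}    # preset per-emotion matrix vectors

def combined_probabilities(p_text, p_speech):
    combined = {}
    for e in EMOTIONS:
        first = p_text[e] * TEXT_COEF                 # first product
        second = p_speech[e] * SPEECH_COEF            # second product
        third = first * MATRIX_VEC[e]                 # third product
        fourth = second * MATRIX_VEC[e]               # fourth product
        combined[e] = np.tanh(third + fourth)         # combined probability
    return combined

p_text = {"happy": .05, "angry": .70, "shocked": .05,
          "sad": .10, "worried": .05, "neutral": .05}
p_speech = {"happy": .10, "angry": .20, "shocked": .05,
            "sad": .10, "worried": .05, "neutral": .50}
scores = combined_probabilities(p_text, p_speech)
emotion = max(scores, key=scores.get)      # "angry": text overrides calm voice
```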
In embodiments of the present invention, when analyzing the emotion information expressed by a sound signal uttered by a user, the text information and speech parameter information are extracted from the sound signal; text emotion analysis is performed on the text information to obtain the emotion information expressed by the text information, and speech emotion analysis is performed on the speech parameter information to obtain the emotion information expressed by the speech parameters; and the emotion information expressed by the sound signal is obtained from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
When determining the emotion information expressed by a sound signal, the prior art relies only on the loudness, intonation, and speaking rate in the sound signal, whereas embodiments of the present invention determine it from both the text information and the speech parameter information in the sound signal.
Compared with the prior art, embodiments of the present invention combine the text information with the speech parameter information and therefore analyze the emotion information expressed by the sound signal more comprehensively. This avoids the misjudgments that occur in the prior art and thus improves the accuracy of determining the emotion information expressed by a sound signal.
It should be noted that the method embodiments are described as a series of action combinations for brevity of description, but those skilled in the art should know that embodiments of the present invention are not limited by the described order of actions, because according to embodiments of the present invention some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by embodiments of the present invention.
Referring to Fig. 2, a structural diagram of an embodiment of a device for analyzing the emotion information of a sound signal according to the present invention is shown. The device may specifically include the following modules:
an extraction module 11 for extracting the text information and speech parameter information from a sound signal;
a first analysis module 12 for performing text emotion analysis on the text information to obtain the emotion information expressed by the text information;
a second analysis module 13 for performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information;
an acquisition module 14 for obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
Wherein, the first analysis module 12 is specifically configured to perform text emotion analysis on the text information using a Long Short-Term Memory (LSTM) algorithm to obtain the probability value of each emotion expressed by the text information.
Wherein, the second analysis module 13 is specifically configured to perform speech emotion analysis on the speech parameters using a Convolutional Neural Network (CNN) algorithm to obtain the probability value of each emotion expressed by the speech parameters.
Wherein, the acquisition module 14 includes:
a computing unit for, for each emotion, calculating the combined probability value of that emotion for the sound signal from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information;
a determination unit for determining the emotion with the highest combined probability value as the emotion information expressed by the sound signal.
Wherein, the computing unit includes:
a first computation subunit for calculating a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient;
a second computation subunit for calculating a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient;
a third computation subunit for calculating a third product between the first product and a preset matrix vector for the emotion;
a fourth computation subunit for calculating a fourth product between the second product and the preset matrix vector for the emotion;
an acquisition subunit for obtaining the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product.
In embodiments of the present invention, when analyzing the emotion information expressed by a sound signal uttered by a user, the text information and speech parameter information are extracted from the sound signal; text emotion analysis is performed on the text information to obtain the emotion information expressed by the text information, and speech emotion analysis is performed on the speech parameter information to obtain the emotion information expressed by the speech parameters; and the emotion information expressed by the sound signal is obtained from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
When determining the emotion information expressed by a sound signal, the prior art relies only on the loudness, intonation, and speaking rate in the sound signal, whereas embodiments of the present invention determine it from both the text information and the speech parameter information in the sound signal.
Compared with the prior art, embodiments of the present invention combine the text information with the speech parameter information and therefore analyze the emotion information expressed by the sound signal more comprehensively. This avoids the misjudgments that occur in the prior art and thus improves the accuracy of determining the emotion information expressed by a sound signal.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant details, refer to the description of the method embodiments.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar between embodiments, the embodiments can be referred to one another.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing terminal device to work in a specific way, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps is performed on the computer or other programmable terminal device to produce computer-implemented processing; the instructions executed on the computer or other programmable terminal device thus provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should also be noted that in this document relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", and any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
The method and device for analyzing the emotion information of a sound signal provided by the present invention have been introduced in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A method for analyzing the emotion information of a sound signal, characterized in that the method includes:
    extracting the text information and speech parameter information from a sound signal;
    performing text emotion analysis on the text information to obtain the emotion information expressed by the text information;
    performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information;
    obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
  2. The method according to claim 1, characterized in that performing text emotion analysis on the text information to obtain the emotion information expressed by the text information includes:
    performing text emotion analysis on the text information using a Long Short-Term Memory (LSTM) algorithm to obtain the probability value of each emotion expressed by the text information.
  3. The method according to claim 2, characterized in that performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information includes:
    performing speech emotion analysis on the speech parameters using a Convolutional Neural Network (CNN) algorithm to obtain the probability value of each emotion expressed by the speech parameters.
  4. The method according to claim 3, characterized in that obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information includes:
    for each emotion, calculating the combined probability value of that emotion for the sound signal from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information;
    determining the emotion with the highest combined probability value as the emotion information expressed by the sound signal.
  5. The method according to claim 4, characterized in that the calculation from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information includes:
    calculating a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient;
    calculating a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient;
    calculating a third product between the first product and a preset matrix vector for the emotion;
    calculating a fourth product between the second product and the preset matrix vector for the emotion;
    obtaining the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product.
  6. A device for analyzing the emotion information of a sound signal, characterized in that the device includes:
    an extraction module for extracting the text information and speech parameter information from a sound signal;
    a first analysis module for performing text emotion analysis on the text information to obtain the emotion information expressed by the text information;
    a second analysis module for performing speech emotion analysis on the speech parameter information to obtain the emotion information expressed by the speech parameter information;
    an acquisition module for obtaining the emotion information expressed by the sound signal from the emotion information expressed by the text information and the emotion information expressed by the speech parameter information.
  7. The device according to claim 6, characterized in that the first analysis module is specifically configured to perform text emotion analysis on the text information using a Long Short-Term Memory (LSTM) algorithm to obtain the probability value of each emotion expressed by the text information.
  8. The device according to claim 7, characterized in that the second analysis module is specifically configured to perform speech emotion analysis on the speech parameters using a Convolutional Neural Network (CNN) algorithm to obtain the probability value of each emotion expressed by the speech parameters.
  9. The device according to claim 8, characterized in that the acquisition module includes:
    a computing unit for, for each emotion, calculating the combined probability value of that emotion for the sound signal from the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the speech parameter information;
    a determination unit for determining the emotion with the highest combined probability value as the emotion information expressed by the sound signal.
  10. The device according to claim 9, characterized in that the computing unit includes:
    a first computation subunit for calculating a first product between the probability value of the emotion expressed by the text information and a preset text emotion coefficient;
    a second computation subunit for calculating a second product between the probability value of the emotion expressed by the speech parameter information and a preset speech emotion coefficient;
    a third computation subunit for calculating a third product between the first product and a preset matrix vector for the emotion;
    a fourth computation subunit for calculating a fourth product between the second product and the preset matrix vector for the emotion;
    an acquisition subunit for obtaining the combined probability value of the emotion expressed by the sound signal from the third product and the fourth product.
CN201711065483.3A 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal Active CN108039181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711065483.3A CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711065483.3A CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Publications (2)

Publication Number Publication Date
CN108039181A true CN108039181A (en) 2018-05-15
CN108039181B CN108039181B (en) 2021-02-12

Family

ID=62092727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711065483.3A Active CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Country Status (1)

Country Link
CN (1) CN108039181B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145695A1 (en) * 2008-12-08 2010-06-10 Electronics And Telecommunications Research Institute Apparatus for context awareness and method using the same
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
US20130268273A1 (en) * 2012-04-10 2013-10-10 Oscal Tzyh-Chiang Chen Method of recognizing gender or age of a speaker according to speech emotion or arousal
CN102819744A (en) * 2012-06-29 2012-12-12 北京理工大学 Emotion recognition method with information of two channels fused
CN103456314A (en) * 2013-09-03 2013-12-18 广州创维平面显示科技有限公司 Emotion recognition method and device
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN105334743A (en) * 2015-11-18 2016-02-17 深圳创维-Rgb电子有限公司 Intelligent home control method and system based on emotion recognition
US20170278067A1 (en) * 2016-03-25 2017-09-28 International Business Machines Corporation Monitoring activity to detect potential user actions
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN106297783A (en) * 2016-08-05 2017-01-04 易晓阳 A kind of interactive voice identification intelligent terminal
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
CN107038154A (en) * 2016-11-25 2017-08-11 阿里巴巴集团控股有限公司 A kind of text emotion recognition methods and device
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106598948A (en) * 2016-12-19 2017-04-26 杭州语忆科技有限公司 Emotion recognition method based on long-term and short-term memory neural network and by combination with autocoder
CN106782615A (en) * 2016-12-20 2017-05-31 科大讯飞股份有限公司 Speech data emotion detection method and apparatus and system
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107247702A (en) * 2017-05-05 2017-10-13 桂林电子科技大学 A kind of text emotion analysis and processing method and system
CN107272607A (en) * 2017-05-11 2017-10-20 上海斐讯数据通信技术有限公司 A kind of intelligent home control system and method
CN107291696A (en) * 2017-06-28 2017-10-24 达而观信息科技(上海)有限公司 A kind of comment word sentiment analysis method and system based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
徐莹莹, "Research on sentence-level text sentiment classification based on deep neural network models", China Master's Theses Full-text Database, Information Science and Technology *
曹宇慧, "Research on text sentiment analysis based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
蔡慧苹, "Sentiment analysis based on word embedding and CNN", Application Research of Computers *
谢坷珍, "Research on bimodal emotion recognition fusing facial expressions and speech", China Master's Theses Full-text Database, Information Science and Technology *
陈雷, "Research and implementation of a sentiment analysis system for stock comments", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109192225A (en) * 2018-09-28 2019-01-11 清华大学 Method and device for speech emotion recognition and annotation
CN109243492A (en) * 2018-10-28 2019-01-18 国家计算机网络与信息安全管理中心 Speech emotion recognition system and recognition method
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment
CN110675859A (en) * 2019-09-05 2020-01-10 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
WO2021047180A1 (en) * 2019-09-11 2021-03-18 深圳壹账通智能科技有限公司 Emotion recognition-based smart chat method, device, and computer apparatus
WO2021139108A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Intelligent emotion recognition method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN108039181B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN108039181A (en) The emotion information analysis method and device of a kind of voice signal
CN108630193B (en) Voice recognition method and device
JP6755304B2 (en) Information processing device
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
KR102413692B1 (en) Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device
CN109545192B (en) Method and apparatus for generating a model
CN109545193B (en) Method and apparatus for generating a model
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110570853A (en) Intention recognition method and device based on voice data
CN110992942B (en) Voice recognition method and device for voice recognition
CN107919137A (en) The long-range measures and procedures for the examination and approval, device, equipment and readable storage medium storing program for executing
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN114127849A (en) Speech emotion recognition method and device
CN109994126A (en) Audio message segmentation method, device, storage medium and electronic equipment
CN113096647B (en) Voice model training method and device and electronic equipment
CN109215647A (en) Voice awakening method, electronic equipment and non-transient computer readable storage medium
CN108877779B (en) Method and device for detecting voice tail point
CN111667834B (en) Hearing-aid equipment and hearing-aid method
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN107910021A (en) A kind of symbol insertion method and device
US11551707B2 (en) Speech processing method, information device, and computer program product
CN111414748A (en) Traffic data processing method and device
CN110444194A (en) A kind of speech detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant