CN108039181B - Method and device for analyzing emotion information of sound signal - Google Patents

Method and device for analyzing emotion information of sound signal

Info

Publication number
CN108039181B
CN108039181B · CN201711065483.3A
Authority
CN
China
Prior art keywords: information, emotion, expressed, text, sound signal
Prior art date
Legal status: Active
Application number
CN201711065483.3A
Other languages
Chinese (zh)
Other versions
CN108039181A (en)
Inventor
王富田
李健
张连毅
武卫东
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN201711065483.3A
Publication of CN108039181A
Application granted
Publication of CN108039181B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The embodiment of the invention provides a method and a device for analyzing emotion information of a sound signal, wherein the method comprises the steps of extracting text information and voice parameter information in the sound signal when analyzing emotion information expressed by the sound signal sent by a user; performing text emotion analysis on the text information to obtain emotion information expressed by the text information, and performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter; and acquiring the expressed emotion information of the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information. The embodiment of the invention can improve the accuracy of determining the emotion information expressed by the sound signal.

Description

Method and device for analyzing emotion information of sound signal
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for analyzing emotion information of a sound signal.
Background
When a person speaks, a variety of emotional information may be expressed, such as happy, angry, frightened, sad, neutral, and so on.
With the rapid development of technology, intelligent voice interaction terminals have come into wide use, and more and more enterprises use them to provide services to users. To improve service quality, an intelligent voice interaction terminal often needs to analyze the emotion expressed by the sound signal uttered by the user.
In the prior art, the intelligent voice interaction terminal may analyze the emotion information expressed by a sound signal uttered by a user according to the sound signal itself, for example, by inferring the emotion from the loudness, intonation, speech rate, and the like of the user's voice. For example, when the user is angry and says "what you are doing really makes people angry" loudly, quickly, and in a raised intonation to express anger, the intelligent voice interaction terminal concludes from the loudness, speech rate, and intonation of that sentence that the user is angry.
However, the inventors found that if the user is angry but says "what you are doing really makes people angry" in a calm voice, the loudness, intonation, and speech rate of the utterance do not meet the criteria for anger, so the intelligent voice interaction terminal will not classify the emotion expressed by the sentence as angry and will likely classify it as neutral instead. Such misjudgments result in a low accuracy in determining the emotion information expressed by the sound signal uttered by the user.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is the low accuracy of determining the emotion information expressed by a sound signal uttered by a user.
In order to improve the accuracy of determining emotion information expressed by a sound signal emitted by a user, the embodiment of the invention provides an emotion analysis method and device for the sound signal.
In a first aspect, an embodiment of the present invention provides an emotion analysis method for a sound signal, where the method includes:
extracting text information and voice parameter information in the sound signal;
performing text emotion analysis on the text information to obtain emotion information expressed by the text information;
performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter information;
and acquiring the emotion information expressed by the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
The obtaining of emotion information expressed by the text information by performing text emotion analysis on the text information includes:
and carrying out text emotion analysis on the text information by using an LSTM algorithm to obtain probability values of all emotion information expressed by the text information.
Wherein, the obtaining the emotion information expressed by the voice parameter information by performing voice emotion analysis on the voice parameter information includes:
and performing voice emotion analysis on the voice parameters by using a CNN algorithm to obtain probability values of all emotion information expressed by the voice parameters.
The acquiring the emotion information expressed by the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information includes:
for each emotional information, calculating a comprehensive probability value of the emotional information expressed by the sound signal according to the probability value of the emotional information expressed by the text information and the probability value of the emotional information expressed by the voice parameter information;
and determining the emotion information with the highest comprehensive probability value as the expressed emotion information of the sound signal.
Wherein, the calculating a comprehensive probability value of the emotion information expressed by the sound signal according to the probability value of the emotion information expressed by the text information and the probability value of the emotion information expressed by the voice parameter information includes:
calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient;
calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient;
calculating a third product between the first product and a preset matrix vector of the emotion information;
calculating a fourth product between the second product and a preset matrix vector of the emotion information;
and acquiring the comprehensive probability value of the emotion expressed by the sound signal according to the third product and the fourth product.
In a second aspect, an embodiment of the present invention provides an apparatus for analyzing emotion information of a sound signal, where the apparatus includes:
the extraction module is used for extracting text information and voice parameter information in the sound signal;
the first analysis module is used for carrying out text emotion analysis on the text information to obtain emotion information expressed by the text information;
the second analysis module is used for carrying out voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter information;
and the obtaining module is used for obtaining the emotion information expressed by the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
Wherein the first analysis module is specifically configured to: and carrying out text emotion analysis on the text information by using an LSTM algorithm to obtain probability values of all emotion information expressed by the text information.
Wherein the second analysis module is specifically configured to: and performing voice emotion analysis on the voice parameters by using a CNN algorithm to obtain probability values of all emotion information expressed by the voice parameters.
Wherein the acquisition module comprises:
a calculating unit, configured to calculate, for each piece of emotion information, a comprehensive probability value of the emotion information expressed by the sound signal according to a probability value of the emotion information expressed by the text information and a probability value of the emotion information expressed by the speech parameter information;
and the determining unit is used for determining the emotion information with the highest comprehensive probability value as the expressed emotion information of the sound signal.
Wherein the calculation unit includes:
the first calculating subunit is used for calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient;
the second calculating subunit is used for calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient;
the third calculation subunit is used for calculating a third product between the first product and a preset matrix vector of the emotion information;
the fourth calculating subunit is used for calculating a fourth product between the second product and the preset matrix vector of the emotion information;
and the obtaining subunit is used for obtaining the comprehensive probability value of the emotion information expressed by the sound signal according to the third product and the fourth product.
Compared with the prior art, the embodiment of the invention has the following advantages:
in the embodiment of the invention, when the emotion information expressed by a sound signal sent by a user is analyzed, text information and voice parameter information in the sound signal are extracted; performing text emotion analysis on the text information to obtain emotion information expressed by the text information, and performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter; and acquiring the expressed emotion information of the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
In determining the expressed emotion information of the sound signal, the prior art determines the expressed emotion information of the sound signal only according to the size, intonation and speech speed in the sound signal, and the embodiment of the invention determines the expressed emotion information of the sound signal according to the text information and the speech parameter information in the sound signal.
Compared with the prior art, the method and the device have the advantages that the emotion information expressed by the sound signal is analyzed more comprehensively by combining the text information besides the voice parameter information, so that the misjudgment condition in the prior art can be avoided, and the accuracy of determining the emotion information expressed by the sound signal can be improved.
Drawings
FIG. 1 is a flowchart illustrating the steps of an embodiment of a method for emotion information analysis of a sound signal according to the present invention;
fig. 2 is a block diagram showing an embodiment of an emotion information analyzing apparatus for audio signals according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a method for analyzing emotion information of a sound signal according to the present invention is shown, which may specifically include the following steps:
in step S101, extracting text information and speech parameter information in the sound signal;
in the embodiment of the present invention, text information and speech parameter information in the sound signal may be extracted using a Deep Neural Network (DNN) algorithm, or using a Long Short-Term Memory (LSTM) algorithm together with a Connectionist Temporal Classification (CTC) model.
The text information includes the content expressed by the sound signal. For example, if the user says "what you are doing really makes people happy", that sentence may serve as the text information of the sound signal.
The voice parameter information includes the speech rate, signal-to-noise ratio, loudness, pitch, average pitch, pitch range, pitch variation, and the like of the sound signal.
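As an illustration of what such speech parameter information can look like in practice, the following sketch computes a few of the listed parameters (pitch statistics and a loudness proxy) with the librosa library. The use of librosa and this particular parameter set are assumptions made for illustration only; the patent does not prescribe any specific tool.

```python
import numpy as np
import librosa

def extract_speech_parameters(wav_path: str) -> dict:
    """Compute a few of the speech parameters listed above (pitch statistics, loudness proxy)."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)

    # Frame-level fundamental frequency (pitch) estimated with the pYIN tracker.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]  # keep voiced frames only

    rms = librosa.feature.rms(y=y)[0]  # frame-level RMS energy as a loudness proxy

    return {
        "pitch_mean": float(f0.mean()) if f0.size else 0.0,
        "pitch_range": float(f0.max() - f0.min()) if f0.size else 0.0,
        "pitch_std": float(f0.std()) if f0.size else 0.0,
        "loudness_mean": float(rms.mean()),
    }
```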
In the embodiment of the present invention, after the sound signal leaves the user's mouth and nose, the signal strength at some frequencies is attenuated; for example, the strength at high frequencies may drop below that at low frequencies, which distorts the sound signal and can reduce the accuracy of the emotion information determined from it. Therefore, to improve the accuracy of determining the emotion information expressed by the sound signal, the signal strength of the sound signal at each frequency may be detected, and when the strength at certain frequencies is found to be low, the signal strength at those frequencies may be enhanced.
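One common way to boost attenuated high-frequency content is a first-order pre-emphasis filter. The sketch below is only an illustrative reading of the enhancement step described above, not the patented implementation; the coefficient 0.97 is an assumed, conventional value.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Boost high-frequency energy with a first-order pre-emphasis filter.

    Implements y[n] = x[n] - coeff * x[n-1]; the coefficient 0.97 is a
    conventional choice, not a value specified by the patent.
    """
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```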
In another embodiment of the present invention, the sound signal is split into a plurality of short sound segments along the time axis, and short-time signal strength analysis, short-time zero-crossing analysis, average signal strength analysis, short-time correlation analysis, and average signal strength difference analysis are performed on these segments respectively to distinguish unvoiced and voiced portions of the sound signal, so that the speech parameter information can be extracted subsequently.
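The short-time analyses mentioned above operate on fixed-length frames of the signal. A minimal sketch of framing together with short-time energy and the zero-crossing rate, two of the quantities involved, is given below; the 25 ms frame length and 10 ms hop are assumed values, not figures taken from the patent.

```python
import numpy as np

def frame_signal(x: np.ndarray, sr: int, frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a mono signal into overlapping short frames (one frame per row)."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len:i * hop_len + frame_len] for i in range(n_frames)])

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """Short-time energy of each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    """Fraction of sign changes per frame; low for voiced speech, higher for unvoiced speech."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)
```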
In addition, the environment in which the user speaks contains noise. The noise is present all the time, whereas the sound signal is not, so it is necessary to detect whether a sound signal is present. When doing so, the start point and end point of the sound signal can be detected with methods such as the double-threshold discrimination method. Delimiting the sound signal in this way prevents excessive noise from being mixed into the signal to be processed, reduces the amount of data and the processing time, and avoids the influence of noise on the emotion analysis result, thereby improving the accuracy of the emotion analysis of the sound signal.
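A double-threshold endpoint detector of the kind referred to above typically combines short-time energy with the zero-crossing rate: a high energy threshold locates the certain speech region, and lower thresholds extend it outward to the actual start and end points. The sketch below is a simplified illustration under those assumptions and reuses the framing helpers above; the threshold factors are illustrative values only.

```python
import numpy as np

def detect_endpoints(energy: np.ndarray, zcr: np.ndarray, high_factor: float = 0.5, low_factor: float = 0.1):
    """Return (start_frame, end_frame) of speech using a simplified double-threshold rule."""
    high_thr = high_factor * energy.max()
    low_thr = low_factor * energy.max()

    above_high = np.where(energy > high_thr)[0]
    if above_high.size == 0:
        return None  # no frame exceeds the high threshold: treat as silence/noise only

    start, end = int(above_high[0]), int(above_high[-1])

    # Extend backwards/forwards while energy stays above the low threshold.
    while start > 0 and energy[start - 1] > low_thr:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > low_thr:
        end += 1

    # Optionally extend further through high-ZCR frames to keep unvoiced consonants.
    while start > 0 and zcr[start - 1] > zcr.mean():
        start -= 1
    while end < len(zcr) - 1 and zcr[end + 1] > zcr.mean():
        end += 1

    return start, end
```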
In step S102, performing text emotion analysis on the text information to obtain emotion information expressed by the text information;
in the embodiment of the invention, text emotion analysis may be performed on the text information using an LSTM algorithm to obtain the probability value of each emotion information expressed by the text information, which serves as the emotion information expressed by the text information.
Of course, when performing text emotion analysis on the text information, other text emotion analysis methods may also be adopted in the embodiments of the present invention, and the embodiments of the present invention do not limit the text emotion analysis method used when performing text emotion analysis on the text information.
In the embodiment of the present invention, the technician may set a variety of emotions locally in advance, such as happy, angry, frightened, sad, urgent, neutral, and so on. Thus, after analyzing the text information, the probability value of anger expressed by the text information, the probability value of joy expressed by the text information, the probability value of startle expressed by the text information, the probability value of sadness expressed by the text information, the probability value of urgency expressed by the text information and the probability value of neutrality expressed by the text information can be obtained.
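For illustration, a minimal LSTM text classifier that outputs one probability per preset emotion could be built as sketched below. Keras is an assumed framework choice, and the vocabulary size, embedding size, and six emotion classes are illustrative values; the patent only requires that an LSTM algorithm produce a probability value for each emotion.

```python
import tensorflow as tf

NUM_EMOTIONS = 6  # happy, angry, frightened, sad, urgent, neutral (illustrative set)

def build_text_emotion_model(vocab_size: int = 20000, embed_dim: int = 128) -> tf.keras.Model:
    """LSTM over token ids producing softmax probabilities over the emotion classes."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# After training, model.predict(token_ids) yields one probability per emotion for each utterance.
```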
In step S103, performing speech emotion analysis on the speech parameter information to obtain emotion information expressed by the speech parameter;
in the embodiment of the present invention, a CNN (Convolutional Neural Network) algorithm is used to perform speech emotion analysis on the speech parameter, so as to obtain probability values of each emotion information expressed by the speech parameter, and the probability values are used as emotion information expressed by the speech parameter information.
For example, a probability value of anger expressed by the voice parameter, a probability value of joy expressed by the voice parameter, a probability value of startle expressed by the voice parameter, a probability value of sadness expressed by the voice parameter, a probability value of urgency expressed by the voice parameter, and a probability value of neutrality expressed by the voice parameter are obtained.
Of course, when performing speech emotion analysis on the speech parameter information, other speech emotion analysis methods may also be adopted in the embodiments of the present invention, and the embodiments of the present invention do not limit the speech emotion analysis method used.
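Similarly, a small convolutional network that maps a matrix of frame-level speech parameters to emotion probabilities could look like the sketch below. The input shape and layer sizes are assumptions made for illustration; the patent only requires that a CNN algorithm produce a probability value for each emotion.

```python
import tensorflow as tf

NUM_EMOTIONS = 6  # same illustrative emotion set as the text model

def build_speech_emotion_model(num_frames: int = 300, num_features: int = 40) -> tf.keras.Model:
    """1-D CNN over a (frames x features) matrix of speech parameters producing emotion probabilities."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_frames, num_features)),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```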
In step S104, emotion information expressed by the sound signal is acquired according to emotion information expressed by the text information and emotion information expressed by the speech parameter information.
In the embodiment of the present invention, for any one of a plurality of preset emotions, a comprehensive probability value of the emotion information expressed by the sound signal may be calculated according to a probability value of the emotion information expressed by the text information and a probability value of the emotion information expressed by the speech parameter information; the above operation is performed for each of the other emotions in the preset emotions, so that the comprehensive probability value of each emotion expressed by the sound signal can be obtained respectively, and then the emotion information with the highest comprehensive probability value is determined as the expressed emotion information of the sound signal.
The specific step of calculating the comprehensive probability value of the emotion expressed by the sound signal according to the probability value of the emotion expressed by the text information and the probability value of the emotion expressed by the voice parameter information may be implemented by the following processes, including:
calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient; calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient; calculating a third product between the first product and a preset matrix vector of the emotion information; calculating a fourth product between the second product and a preset matrix vector of the emotion information; and acquiring the comprehensive probability value of the emotional information expressed by the sound signal according to the third product and the fourth product. For example, the third product and the fourth product are input into the tanh function to obtain the integrated probability value of the emotion information expressed by the sound signal.
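Read literally, this fusion step weights the two probability values with modality coefficients, multiplies each weighted value by a preset vector for the emotion, and passes the combination through the tanh function. The sketch below is one plausible reading of that description; the coefficient values, the form of the per-emotion matrix vector, and the summation used to reduce the result to a scalar are all assumptions.

```python
import numpy as np

def fuse_emotion_scores(
    p_text: np.ndarray,          # per-emotion probabilities from the text model
    p_speech: np.ndarray,        # per-emotion probabilities from the speech model
    emotion_vectors: np.ndarray, # preset per-emotion vectors, shape (num_emotions, dim); assumed form
    text_coeff: float = 0.6,     # preset text emotion coefficient (illustrative value)
    speech_coeff: float = 0.4,   # preset voice emotion coefficient (illustrative value)
) -> int:
    """Return the index of the emotion with the highest comprehensive probability value."""
    scores = []
    for i in range(len(p_text)):
        first = p_text[i] * text_coeff               # first product
        second = p_speech[i] * speech_coeff          # second product
        third = first * emotion_vectors[i]           # third product (vector)
        fourth = second * emotion_vectors[i]         # fourth product (vector)
        # Combine the two products and squash with tanh; summing to a scalar is an assumption.
        scores.append(np.tanh(np.sum(third + fourth)))
    return int(np.argmax(scores))
```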
In the embodiment of the present invention, the preset speech emotion coefficient and the preset text emotion coefficient may be the same or different.
Technicians can analyze a large number of sound signals expressing user emotion in advance and estimate the respective weights with which text information and voice parameter information express emotion. If the weight of the text information is larger than that of the voice parameter information, the preset text emotion coefficient may be set larger than the preset voice emotion coefficient; if it is smaller, the preset text emotion coefficient may be set smaller than the preset voice emotion coefficient; and if the two weights are equal, the two coefficients may be set equal. The preset text emotion coefficient and the preset voice emotion coefficient are then stored locally, so that they can be obtained directly from local storage in step S104. Thereafter, a first product between the probability value of the emotion information expressed by the text information and the preset text emotion coefficient is calculated; a second product between the probability value of the emotion information expressed by the voice parameter information and the preset voice emotion coefficient is calculated; a third product between the first product and the preset matrix vector of the emotion information is calculated; a fourth product between the second product and the preset matrix vector of the emotion information is calculated; and the comprehensive probability value of the emotion information expressed by the sound signal is acquired according to the third product and the fourth product, for example by inputting the third product and the fourth product into the tanh function.
In the embodiment of the invention, when the emotion information expressed by a sound signal sent by a user is analyzed, text information and voice parameter information in the sound signal are extracted; performing text emotion analysis on the text information to obtain emotion information expressed by the text information, and performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter; and acquiring the expressed emotion information of the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
In determining the expressed emotion information of the sound signal, the prior art determines the expressed emotion information of the sound signal only according to the size, intonation and speech speed in the sound signal, and the embodiment of the invention determines the expressed emotion information of the sound signal according to the text information and the speech parameter information in the sound signal.
Compared with the prior art, the method and the device have the advantages that the emotion information expressed by the sound signal is analyzed more comprehensively by combining the text information besides the voice parameter information, so that the misjudgment condition in the prior art can be avoided, and the accuracy of determining the emotion information expressed by the sound signal can be improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 2, a block diagram of an embodiment of an emotion information analyzing apparatus for a sound signal according to the present invention is shown, and may specifically include the following modules:
the extraction module 11 is used for extracting text information and voice parameter information in the sound signal;
the first analysis module 12 is configured to perform text emotion analysis on the text information to obtain emotion information expressed by the text information;
the second analysis module 13 is configured to perform speech emotion analysis on the speech parameter information to obtain emotion information expressed by the speech parameter information;
and the obtaining module 14 is configured to obtain emotion information expressed by the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
Wherein the first analysis module 12 is specifically configured to: and carrying out text emotion analysis on the text information by using a long-short term memory network (LSTM) algorithm to obtain probability values of all emotion information expressed by the text information.
Wherein the second analysis module 13 is specifically configured to: and performing voice emotion analysis on the voice parameters by using a Convolutional Neural Network (CNN) algorithm to obtain probability values of all emotion information expressed by the voice parameters.
Wherein the obtaining module 14 includes:
a calculating unit, configured to calculate, for each piece of emotion information, a comprehensive probability value of the emotion information expressed by the sound signal according to a probability value of the emotion information expressed by the text information and a probability value of the emotion information expressed by the speech parameter information;
and the determining unit is used for determining the emotion information with the highest comprehensive probability value as the expressed emotion information of the sound signal.
Wherein the calculation unit includes:
the first calculating subunit is used for calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient;
the second calculating subunit is used for calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient;
the third calculation subunit is used for calculating a third product between the first product and a preset matrix vector of the emotion information;
the fourth calculating subunit is used for calculating a fourth product between the second product and the preset matrix vector of the emotion information;
and the obtaining subunit is used for obtaining the comprehensive probability value of the emotion information expressed by the sound signal according to the third product and the fourth product.
In the embodiment of the invention, when the emotion information expressed by a sound signal sent by a user is analyzed, text information and voice parameter information in the sound signal are extracted; performing text emotion analysis on the text information to obtain emotion information expressed by the text information, and performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter; and acquiring the expressed emotion information of the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information.
In determining the expressed emotion information of the sound signal, the prior art determines the expressed emotion information of the sound signal only according to the size, intonation and speech speed in the sound signal, and the embodiment of the invention determines the expressed emotion information of the sound signal according to the text information and the speech parameter information in the sound signal.
Compared with the prior art, the method and the device have the advantages that the emotion information expressed by the sound signals is analyzed more comprehensively by combining the text information besides the voice parameter information, so that misjudgment in the prior art can be avoided, and the accuracy of determining the emotion information expressed by the sound signals can be improved.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and the device for analyzing emotion information of a sound signal provided by the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (6)

1. A method for emotion information analysis of a sound signal, the method comprising:
extracting text information and voice parameter information in a sound signal, wherein the method comprises the following steps: detecting signal intensities of the sound signal at respective frequencies, and when it is detected that the signal intensity at a high frequency is lower than the signal intensity at a low frequency, enhancing the signal intensity at the high frequency;
performing text emotion analysis on the text information to obtain emotion information expressed by the text information;
performing voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter information;
obtaining the emotion information expressed by the sound signal according to the emotion information expressed by the text information and the emotion information expressed by the voice parameter information, wherein the method comprises the following steps:
calculating the comprehensive probability value of the emotion information expressed by the sound signal according to the probability value of the emotion information expressed by the text information and the probability value of the emotion information expressed by the voice parameter information;
determining the emotion information with the highest comprehensive probability value as the expressed emotion information of the sound signal;
wherein, the calculating the comprehensive probability value of the emotion information expressed by the sound signal according to the probability value of the emotion information expressed by the text information and the probability value of the emotion information expressed by the voice parameter information comprises:
calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient;
calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient;
calculating a third product between the first product and a preset matrix vector of the emotion information;
calculating a fourth product between the second product and a preset matrix vector of the emotion information;
and acquiring the comprehensive probability value of the emotional information expressed by the sound signal according to the third product and the fourth product.
2. The method of claim 1, wherein the obtaining emotion information expressed by the text information through text emotion analysis of the text information comprises:
and carrying out text emotion analysis on the text information by using a long-short term memory network (LSTM) algorithm to obtain probability values of all emotion information expressed by the text information.
3. The method of claim 2, wherein performing speech emotion analysis on the speech parameter information to obtain emotion information expressed by the speech parameter information comprises:
and performing voice emotion analysis on the voice parameters by using a Convolutional Neural Network (CNN) algorithm to obtain probability values of all emotion information expressed by the voice parameters.
4. An apparatus for analyzing emotion information of a sound signal, the apparatus comprising:
the extraction module is used for extracting text information and voice parameter information in the sound signal, wherein the extraction module comprises: detecting signal intensities of the sound signal at respective frequencies, and when it is detected that the signal intensity at a high frequency is lower than the signal intensity at a low frequency, enhancing the signal intensity at the high frequency;
the first analysis module is used for carrying out text emotion analysis on the text information to obtain emotion information expressed by the text information;
the second analysis module is used for carrying out voice emotion analysis on the voice parameter information to obtain emotion information expressed by the voice parameter information;
an obtaining module, configured to obtain emotion information expressed by the sound signal according to emotion information expressed by the text information and emotion information expressed by the speech parameter information, where the obtaining module includes:
calculating the comprehensive probability value of the emotion information expressed by the sound signal according to the probability value of the emotion information expressed by the text information and the probability value of the emotion information expressed by the voice parameter information;
determining the emotion information with the highest comprehensive probability value as the expressed emotion information of the sound signal;
wherein, the calculating the comprehensive probability value of the emotion information expressed by the sound signal according to the probability value of the emotion information expressed by the text information and the probability value of the emotion information expressed by the voice parameter information comprises:
calculating a first product between the probability value of the emotion information expressed by the text information and a preset text emotion coefficient;
calculating a second product between the probability value of the emotion information expressed by the voice parameter information and a preset voice emotion coefficient;
calculating a third product between the first product and a preset matrix vector of the emotion information;
calculating a fourth product between the second product and a preset matrix vector of the emotion information;
and acquiring the comprehensive probability value of the emotional information expressed by the sound signal according to the third product and the fourth product.
5. The apparatus of claim 4, wherein the first analysis module is specifically configured to: and carrying out text emotion analysis on the text information by using a long-short term memory network (LSTM) algorithm to obtain probability values of all emotion information expressed by the text information.
6. The apparatus of claim 5, wherein the second analysis module is specifically configured to: and performing voice emotion analysis on the voice parameters by using a Convolutional Neural Network (CNN) algorithm to obtain probability values of all emotion information expressed by the voice parameters.
CN201711065483.3A 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal Active CN108039181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711065483.3A CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711065483.3A CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Publications (2)

Publication Number Publication Date
CN108039181A CN108039181A (en) 2018-05-15
CN108039181B (en) 2021-02-12

Family

ID=62092727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711065483.3A Active CN108039181B (en) 2017-11-02 2017-11-02 Method and device for analyzing emotion information of sound signal

Country Status (1)

Country Link
CN (1) CN108039181B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109192225B (en) * 2018-09-28 2021-07-09 清华大学 Method and device for recognizing and marking speech emotion
CN109243492A (en) * 2018-10-28 2019-01-18 国家计算机网络与信息安全管理中心 A kind of speech emotion recognition system and recognition methods
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium


Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN105334743B (en) * 2015-11-18 2018-10-26 深圳创维-Rgb电子有限公司 A kind of intelligent home furnishing control method and its system based on emotion recognition
US20170278067A1 (en) * 2016-03-25 2017-09-28 International Business Machines Corporation Monitoring activity to detect potential user actions
CN106297783A (en) * 2016-08-05 2017-01-04 易晓阳 A kind of interactive voice identification intelligent terminal
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
CN107038154A (en) * 2016-11-25 2017-08-11 阿里巴巴集团控股有限公司 A kind of text emotion recognition methods and device
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN106598948B (en) * 2016-12-19 2019-05-03 杭州语忆科技有限公司 Emotion identification method based on shot and long term Memory Neural Networks combination autocoder
CN106782615B (en) * 2016-12-20 2020-06-12 科大讯飞股份有限公司 Voice data emotion detection method, device and system
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107247702A (en) * 2017-05-05 2017-10-13 桂林电子科技大学 A kind of text emotion analysis and processing method and system
CN107291696A (en) * 2017-06-28 2017-10-24 达而观信息科技(上海)有限公司 A kind of comment word sentiment analysis method and system based on deep learning

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145695A1 (en) * 2008-12-08 2010-06-10 Electronics And Telecommunications Research Institute Apparatus for context awareness and method using the same
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
US20130268273A1 (en) * 2012-04-10 2013-10-10 Oscal Tzyh-Chiang Chen Method of recognizing gender or age of a speaker according to speech emotion or arousal
CN102819744A (en) * 2012-06-29 2012-12-12 北京理工大学 Emotion recognition method with information of two channels fused
CN103456314A (en) * 2013-09-03 2013-12-18 广州创维平面显示科技有限公司 Emotion recognition method and device
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN107272607A (en) * 2017-05-11 2017-10-20 上海斐讯数据通信技术有限公司 A kind of intelligent home control system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Bimodal Emotion Recognition Fusing Facial Expression and Speech; Xie Kezhen; China Master's Theses Full-text Database, Information Science and Technology Series; 2016-07-15 (No. 07); I138-873 *

Also Published As

Publication number Publication date
CN108039181A (en) 2018-05-15

Similar Documents

Publication Publication Date Title
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN109473123B (en) Voice activity detection method and device
US9875739B2 (en) Speaker separation in diarization
CN108630193B (en) Voice recognition method and device
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
US20180218731A1 (en) Voice interaction apparatus and voice interaction method
US11630999B2 (en) Method and system for analyzing customer calls by implementing a machine learning model to identify emotions
US20200168209A1 (en) System and method for determining the compliance of agent scripts
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN110570853A (en) Intention recognition method and device based on voice data
CN108899033B (en) Method and device for determining speaker characteristics
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN110364178B (en) Voice processing method and device, storage medium and electronic equipment
CN111916061A (en) Voice endpoint detection method and device, readable storage medium and electronic equipment
CN111415653B (en) Method and device for recognizing speech
US20210118464A1 (en) Method and apparatus for emotion recognition from speech
CN108877779B (en) Method and device for detecting voice tail point
CN110728996A (en) Real-time voice quality inspection method, device, equipment and computer storage medium
CN112509561A (en) Emotion recognition method, device, equipment and computer readable storage medium
CN107680584B (en) Method and device for segmenting audio
CN108962226B (en) Method and apparatus for detecting end point of voice
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
JP2018005122A (en) Detection device, detection method, and detection program
CN113099043A (en) Customer service control method, apparatus and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant