CN110085211B - Voice recognition interaction method and device, computer equipment and storage medium


Info

Publication number
CN110085211B
Authority
CN
China
Prior art keywords
recognition result
emotion
emotion recognition
text
classification
Prior art date
Legal status
Active
Application number
CN201810079431.XA
Other languages
Chinese (zh)
Other versions
CN110085211A (en)
Inventor
王慧
余世经
朱频频
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201810079431.XA
Publication of CN110085211A
Application granted
Publication of CN110085211B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0638Interactive procedures


Abstract

Embodiments of the invention provide a voice recognition interaction method and apparatus, a computer device and a storage medium, which address the problems that prior-art intelligent interaction cannot analyze the deeper intention behind a user voice message and cannot provide a more humanized interaction experience. The voice recognition interaction method includes: obtaining an emotion recognition result from a user voice message, where the emotion recognition result includes at least an audio emotion recognition result, or includes at least an audio emotion recognition result and a text emotion recognition result; performing intention analysis on the text content of the user voice message to obtain corresponding basic intention information; and determining a corresponding interaction instruction according to the emotion recognition result and the basic intention information.

Description

Voice recognition interaction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of intelligent interaction, in particular to a voice recognition interaction method and device, computer equipment and a storage medium.
Background
With the continuous development of artificial intelligence technology and users' growing expectations for interaction experience, intelligent interaction is gradually beginning to replace some traditional human-computer interaction modes and has become a research hotspot. However, existing intelligent interaction methods can only roughly analyze the semantic content of a user voice message; they cannot recognize the user's current emotional state from the voice, and therefore can neither infer the deeper emotional need the user actually wants to express nor provide a more humanized interaction experience based on that state. For example, a user pressed for time and a user in a calm mood at the start of a trip expect different replies when asking for flight time information by voice, yet under existing semantics-based intelligent interaction both users receive the same reply, for example only the corresponding flight time information.
Disclosure of Invention
In view of this, embodiments of the present invention provide a voice recognition interaction method, apparatus, computer device, and storage medium, which solve the problems that an intelligent interaction manner in the prior art cannot analyze a deep-level intention of a user voice message and cannot provide more humanized interaction experience.
An embodiment of the present invention provides a speech recognition interaction method, including:
acquiring emotion recognition results according to voice messages of a user, wherein the emotion recognition results at least comprise audio emotion recognition results, or the emotion recognition results at least comprise audio emotion recognition results and text emotion recognition results;
performing intention analysis according to the text content of the user voice message to obtain corresponding basic intention information; and
determining a corresponding interaction instruction according to the emotion recognition result and the basic intention information.
An embodiment of the present invention provides a speech recognition interaction apparatus, including:
the emotion recognition module is configured to obtain emotion recognition results according to voice messages of the user, wherein the emotion recognition results at least comprise audio emotion recognition results, or the emotion recognition results at least comprise audio emotion recognition results and text emotion recognition results;
the basic intention recognition module is configured to perform intention analysis according to the text content of the user voice message to obtain corresponding basic intention information; and
the interaction instruction determining module is configured to determine a corresponding interaction instruction according to the emotion recognition result and the basic intention information.
An embodiment of the present invention provides a computer device including: a memory, a processor and a computer program stored on the memory for execution by the processor, the processor implementing the steps of the method as described above when executing the computer program.
An embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as set forth above.
According to the voice recognition interaction method and apparatus, the computer device and the computer-readable storage medium of the embodiments of the invention, on the basis of understanding the basic intention information of the user voice message, the emotion recognition result obtained from the user voice message is combined with that basic intention information, and an interaction instruction carrying emotion is then given according to both. This solves the problems that prior-art intelligent interaction cannot analyze the deeper intention of a user voice message and cannot provide a more humanized interaction experience.
Drawings
Fig. 1 is a flowchart illustrating a speech recognition interaction method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to another embodiment of the present invention.
Fig. 5 is a schematic flow chart illustrating a process of obtaining a text emotion recognition result according to text content of a user voice message in the voice recognition interaction method according to an embodiment of the present invention.
Fig. 6 is a schematic flow chart illustrating a process of obtaining a text emotion recognition result according to text content of a user voice message in the voice recognition interaction method according to an embodiment of the present invention.
Fig. 7 is a schematic flow chart illustrating a process of determining a text emotion recognition result in the speech recognition interaction method according to an embodiment of the present invention.
Fig. 8 is a schematic flow chart illustrating a process of determining a text emotion recognition result in a speech recognition interaction method according to another embodiment of the present invention.
Fig. 9 is a schematic flow chart illustrating obtaining basic intention information according to a user voice message in a voice recognition interaction method according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Fig. 12 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart illustrating a speech recognition interaction method according to an embodiment of the present invention. As shown in fig. 1, the speech recognition interaction method includes the following steps:
step 101: and acquiring an emotion recognition result according to the voice message of the user.
The emotion recognition result at least comprises an audio emotion recognition result, or the emotion recognition result at least comprises an audio emotion recognition result and a text emotion recognition result.
A user voice message refers to voice information, input by the user or collected during the interaction, that relates to the user's interaction intention and needs. For example, in the customer service interaction scenario of a call center system, a user voice message may be a voice message uttered by the user, who at this point may be either the customer or the customer service side; in an intelligent robot interaction scenario, a user voice message may be a voice message entered by the user through the robot's input module.
For example, since the audio data of the user voice message in different emotional states may include different audio features, an audio emotion recognition result may be obtained according to the audio data of the user voice message, and an emotion recognition result may be determined according to the audio emotion recognition result.
The emotion recognition result obtained from the user voice message is combined with the basic intention information in the subsequent process to infer the user's emotional intention, or an interaction instruction carrying emotion is given directly according to the basic intention information and the emotion recognition result.
Step 102: and analyzing the intention according to the text content of the voice message of the user to obtain corresponding basic intention information.
The basic intention information corresponds to the intention intuitively reflected by the user voice message, but does not reflect the real emotional demand of the user in the current state, so that the deep intention and emotional demand actually desired to be expressed by the user voice message are comprehensively determined by combining the emotion recognition result. For example, for a user with an urgent emotional state during driving and a user with a gentle emotional state just beginning to make a trip, when the contents of the voice messages sent by the two users are the same to inquire flight information, the obtained basic intention information is the same and is the same to inquire the flight information, but the emotional requirements of the two users are obviously different.
It should be understood that the basic intention information can be obtained by performing intention analysis according to the text content of the user voice message, and the basic intention information corresponds to the intention reflected by the text content of the user voice message at semantic level and does not have any emotional color.
In an embodiment of the present invention, in order to further improve the accuracy of the obtained basic intention information, intention analysis may be performed on the current user voice message in combination with past user voice messages and/or subsequent user voice messages, so as to obtain the corresponding basic intention information. For example, the intention of the current user voice message may lack some keywords or slots, but such content may be available from past and/or subsequent user voice messages. For instance, if the content of the current user voice message is "What specialties are there?", the subject (slot) is missing; by incorporating the past user voice message "How is the weather in Changzhou?", "Changzhou" can be extracted as the subject, so that the basic intention information finally obtained for the current user voice message can be "What specialties are there in Changzhou?".
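The following is a minimal illustrative sketch of this kind of slot completion from dialogue history; the keyword list, function names and the way the recovered subject is appended are assumptions made for the example, not the patented implementation.

```python
# Hypothetical sketch: fill a missing subject slot of the current query
# from past user voice messages. The subject list and rules are illustrative.
from typing import List, Optional

KNOWN_SUBJECTS = ["Changzhou", "Shanghai", "Beijing"]  # assumed keyword list

def find_subject(text: str) -> Optional[str]:
    for s in KNOWN_SUBJECTS:
        if s in text:
            return s
    return None

def complete_intent(current_text: str, history: List[str]) -> str:
    """Return the current query with the subject slot filled from history."""
    if find_subject(current_text):
        return current_text                       # slot already present
    for past in reversed(history):                # most recent message first
        subject = find_subject(past)
        if subject:
            return f"{current_text.rstrip('?')} in {subject}?"
    return current_text

history = ["How is the weather in Changzhou?"]
print(complete_intent("What specialties are there?", history))
# -> What specialties are there in Changzhou?
```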
Step 103: and determining a corresponding interaction instruction according to the emotion recognition result and the basic intention information.
The correspondence between the emotion recognition result and the basic intention information and the interactive instruction may be established by a learning process. In one embodiment of the invention, the content and form of the interactive instructions may include one or more of the following emotion presentation modalities: a text output emotion presentation mode, a music play emotion presentation mode, a voice emotion presentation mode, an image emotion presentation mode, and a mechanical action emotion presentation mode. However, it should be understood that the specific emotion presentation modality of the interactive instruction can also be adjusted according to the requirements of the interactive scene, and the specific content and form of the interactive instruction are not limited by the present invention.
In an embodiment of the present invention, the corresponding emotional intention information may be determined according to the emotional recognition result and the basic intention information, and then the corresponding interactive instruction may be determined according to the emotional intention information, or the corresponding interactive instruction may be determined according to the emotional intention information and the basic intention information. The emotional intention information at this time may have specific contents or may exist only as an identification of the mapping relationship. The correspondence between emotional intention information and the interactive instruction, and the correspondence between emotional intention information and basic intention information and the interactive instruction may also be established in advance through a pre-learning process.
Specifically, emotional intention information refers to intention information with emotional coloring, which reflects the emotional need behind the user voice message while also reflecting the basic intention; the correspondence of emotional intention information to the emotion recognition result and the basic intention information can be established in advance through a pre-learning process. In an embodiment of the present invention, the emotional intention information may include emotional need information corresponding to the emotion recognition result, or may include both that emotional need information and the association between the emotion recognition result and the basic intention information, where the association is set in advance. For example, when the emotion recognition result is "worried" and the basic intention information is "report a lost credit card", the determined emotional intention information may include the association between the two: "the credit card is being reported lost, the user is worried, and the card may be lost or stolen", and the determined emotional need information may be "comfort".
The association between the emotional state in the emotion recognition result and the basic intention information may be preset (for example, defined by rules or determined by logic), or it may be produced by a trained model (for example, a trained end-to-end model that directly outputs emotional intention such as the emotional state and scene information). Such a model may be a fixed deep network model rather than a set of hand-written rules, or it may be updated continuously through online learning (for example, by setting an objective function and a reward function with a reinforcement learning model, so that the model keeps evolving as the number of human-machine interactions increases).
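As an illustration of such a preset mapping, the sketch below shows a toy rule table from (emotion recognition result, basic intention information) to an emotional need and an interaction instruction with emotion presentation modalities; the rule entries, field names and reply texts are assumptions invented for the example, not part of the patent.

```python
# Hypothetical rule table: (emotion, basic intention) -> emotional need +
# interaction instruction. All entries and field names are illustrative only.
RULES = {
    ("worried", "report a lost credit card"): {
        "emotional_need": "comfort",
        "association": "card may be lost or stolen; the user is worried",
        "instruction": {
            "modalities": ["text output", "voice"],
            "reply": "Please don't worry, your card will be frozen right away.",
        },
    },
    ("calm", "query flight information"): {
        "emotional_need": None,
        "instruction": {"modalities": ["text output"],
                        "reply": "Here is the flight information."},
    },
}

DEFAULT = {"emotional_need": None,
           "instruction": {"modalities": ["text output"], "reply": ""}}

def decide(emotion: str, basic_intention: str) -> dict:
    # Fall back to a neutral instruction when no rule matches.
    return RULES.get((emotion, basic_intention), DEFAULT)

print(decide("worried", "report a lost credit card")["instruction"]["reply"])
```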
It should be understood that in some application scenarios, feedback content for the emotional intention information needs to be presented. For example, in some customer service interaction scenarios, emotional intention information analyzed according to the voice content of the customer needs to be presented to customer service staff to play a reminding role, and at this time, corresponding emotional intention information is inevitably determined, and feedback content of the emotional intention information is presented. However, in other application scenarios, the corresponding interactive instruction needs to be directly given without presenting feedback content of the emotional intention information, and the corresponding interactive instruction can also be directly determined according to the emotional recognition result and the basic intention information without generating the emotional intention information.
In an embodiment of the present invention, in order to further improve the accuracy of the obtained emotional intention information, the corresponding emotional intention information may also be determined according to the emotion recognition result and the basic intention information of the current user voice message and by combining the emotion recognition result and the basic intention information of the past user voice message and/or the subsequent user voice message. At this time, it is necessary to record the emotion recognition result and the basic intention information of the current user voice message in real time so as to be used as a reference when determining the emotional intention information from other user voice messages.
In an embodiment of the present invention, in order to further improve the accuracy of the obtained corresponding interaction instruction, the corresponding interaction instruction may also be determined according to the emotional intention information and the basic intention information of the current user voice message, and in combination with the emotional intention information and the basic intention information of the past user voice message and/or the subsequent user voice message. At this time, it is necessary to record the emotion recognition result and the basic intention information of the current user voice message in real time so as to be used as a reference when determining the interactive instruction according to other user voice messages.
Therefore, the voice recognition interaction method provided by the embodiment of the invention combines the emotion recognition result obtained based on the voice message of the user on the basis of understanding the basic intention information of the user, further speculates the emotion intention of the user, or directly gives an interaction instruction with emotion according to the basic intention information and the emotion recognition result, thereby solving the problems that the intelligent interaction mode in the prior art cannot analyze the deep-level intention and emotion requirements of the voice message of the user and cannot provide more humanized interaction experience.
In an embodiment of the present invention, the emotion recognition result may be determined comprehensively according to the audio emotion recognition result and the text emotion recognition result. Specifically, an audio emotion recognition result is obtained from the audio data of the user voice message, a text emotion recognition result is obtained from the text content of the user voice message, and the emotion recognition result is then determined comprehensively from the two. However, as described above, the final emotion recognition result may also be determined based on the audio emotion recognition result alone, and the present invention is not limited in this respect.
It should be appreciated that the audio emotion recognition result and the text emotion recognition result may be characterized in a variety of ways. In one embodiment of the invention, the emotion recognition result may be characterized in a discrete emotion classification manner, and the audio emotion recognition result and the text emotion recognition result may respectively include one or more of a plurality of emotion classifications. For example, in a customer service interaction scenario, the plurality of emotion classifications may include: the method comprises the steps of satisfaction classification, calmness classification and fidgetiness classification so as to correspond to emotional states which may occur to users in a customer service interaction scene; alternatively, the plurality of emotion classifications may include: satisfaction classification, calmness classification, fidgetiness classification, and anger classification to correspond to emotional states that may occur to customer service personnel in a customer service interaction scenario. However, it should be understood that the type and number of these emotion classifications can be adjusted according to the actual application scene requirements, and the invention is not limited to the type and number of emotion classifications. In a further embodiment, each mood classification may also include a plurality of mood intensity levels. In particular, the mood classification and the mood intensity level can be considered as two dimensional parameters, which can be independent of each other (e.g., each mood classification has corresponding N mood intensity levels, such as mild, moderate, and severe), or can have a preset correspondence (e.g., a "fussy" mood classification includes three mood intensity levels, mild, moderate, and severe; a "happy" mood classification includes only two mood intensity levels, moderate, and severe). Therefore, the emotion intensity level at this time can be regarded as an attribute parameter of the emotion classification, and when an emotion classification is determined through the emotion recognition process, the emotion intensity level of the emotion classification is also determined.
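As a small illustration of treating the emotion classification and the emotion intensity level as two parameters with a preset correspondence, the sketch below uses an assumed table of allowed levels per classification; the class and level names are illustrative only.

```python
# Illustrative structure: an emotion classification plus an intensity level,
# with an assumed preset table restricting which levels each class allows.
ALLOWED_LEVELS = {
    "fidgety": ["mild", "moderate", "severe"],
    "happy": ["moderate", "severe"],
    "calm": ["mild"],
}

def make_emotion(classification: str, intensity: str) -> dict:
    if intensity not in ALLOWED_LEVELS.get(classification, []):
        raise ValueError(f"{classification!r} does not allow level {intensity!r}")
    return {"classification": classification, "intensity": intensity}

print(make_emotion("fidgety", "moderate"))
```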
In another embodiment of the invention, the emotion recognition result may instead be characterized with a non-discrete, dimensional emotion model. In that case, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multi-dimensional emotion space, where each dimension corresponds to an emotion factor defined in psychology. For example, the PAD (Pleasure-Arousal-Dominance) three-dimensional emotion model may be employed. This model holds that emotion has three dimensions, pleasure, arousal (activation) and dominance, and every emotion can be characterized by the emotion factors corresponding to these three dimensions: P denotes pleasure, representing the positive or negative character of the individual's emotional state; A denotes arousal, representing the individual's level of neurophysiological activation; D denotes dominance, representing the individual's state of control over the situation and other people.
It should be understood that the audio emotion recognition result and the text emotion recognition result may be characterized by other characterization methods, and the specific characterization method is not limited by the present invention.
Fig. 2 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to an embodiment of the present invention. In this embodiment, the emotion recognition result needs to be determined comprehensively from the audio emotion recognition result and the text emotion recognition result, each of which includes one or more of a plurality of emotion classifications. The method for determining the emotion recognition result may include the following steps:
step 201: if the audio emotion recognition result and the text emotion recognition result include the same emotion classification, the same emotion classification is taken as the emotion recognition result.
For example, when the audio emotion recognition result includes a satisfaction classification and a calmness classification, and the text emotion recognition result includes only the satisfaction classification, the final emotion recognition result may be the satisfaction classification.
Step 202: and if the audio emotion recognition result and the text emotion recognition result do not comprise the same emotion classification, taking the audio emotion recognition result and the text emotion recognition result as emotion recognition results together.
For example, when the audio emotion recognition result includes a satisfaction classification and the text emotion recognition result includes only a calm classification, the final emotion recognition result may be the satisfaction classification and the calm classification. In an embodiment of the present invention, when the final emotion recognition result includes a plurality of emotion classifications, the emotion recognition result and the basic intention information of the past user voice message and/or the subsequent user voice message are combined in the subsequent process to determine the corresponding emotional intention information.
It should be understood that although step 202 specifies that the audio emotion recognition result and the text emotion recognition result are used together as the emotion recognition result when they share no emotion classification, other embodiments of the present invention may adopt a more conservative interaction strategy, such as directly generating an error notification message or not outputting an emotion recognition result at all, so as to avoid misleading the interaction process. The present invention does not strictly limit the processing manner for the case where the audio and text emotion recognition results include no common emotion classification.
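A minimal sketch of this combination rule, assuming the two results are given as sets of emotion classification labels:

```python
# Sketch of the rule in steps 201-202: if the audio and text emotion
# recognition results share a classification, use it; otherwise keep both
# results together as the joint emotion recognition result.
def combine(audio_emotions: set, text_emotions: set) -> set:
    common = audio_emotions & text_emotions
    return common if common else audio_emotions | text_emotions

print(combine({"satisfied", "calm"}, {"satisfied"}))   # -> {'satisfied'}
print(combine({"satisfied"}, {"calm"}))                # -> {'satisfied', 'calm'}
```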
Fig. 3 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to an embodiment of the present invention. In this embodiment, the emotion recognition result likewise needs to be determined comprehensively from the audio emotion recognition result and the text emotion recognition result, each of which includes one or more of a plurality of emotion classifications. The method for determining the emotion recognition result may include the following steps:
step 301: and calculating the confidence degree of the emotion classification in the audio emotion recognition result and the confidence degree of the emotion classification in the text emotion recognition result.
Statistically, confidence is also referred to as reliability, confidence level, or confidence coefficient. Due to the randomness of the samples, the conclusions drawn are always uncertain when using the sampling to estimate the overall parameters. Therefore, interval estimation in mathematical statistics can be used to estimate how large the probability that the error between an estimated value and the overall parameter is within a certain allowable range, and this corresponding probability is called confidence. For example, it is assumed that the preset emotion classification is related to one variable characterizing the emotion classification, i.e., the emotion classification may correspond to different values according to the magnitude of the variable value. When the confidence of the speech emotion recognition result is to be obtained, a plurality of measured values of the variable are obtained through a plurality of times of audio emotion recognition/text emotion recognition processes, and then the average value of the plurality of measured values is used as an estimated value. And estimating the probability of the error range between the estimation value and the true value of the variable within a certain range by an interval estimation method, wherein the higher the probability value is, the more accurate the estimation value is, namely, the higher the confidence of the current emotion classification is. It should be understood that the above-mentioned variables characterizing the emotion classification can be determined according to a specific algorithm of emotion recognition, and the present invention is not limited thereto.
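The sketch below illustrates one way such a confidence could be computed, assuming the emotion-class score is measured several times and is approximately normally distributed; the tolerance value and the normal approximation are assumptions for illustration, not a prescribed method.

```python
# Hedged sketch: the emotion-class score is measured several times, the mean
# is used as the estimate, and the probability that the estimation error stays
# within a tolerance (normal approximation assumed) is taken as the confidence.
import math
import statistics

def confidence(samples, tolerance=0.05):
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)   # standard error of the mean
    z = tolerance / sem
    # P(|estimate - true value| <= tolerance) under the normal approximation.
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * phi - 1.0, mean

conf, score = confidence([0.72, 0.75, 0.71, 0.74, 0.73])
print(f"estimated score {score:.3f}, confidence {conf:.2f}")
```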
Step 302: and judging whether the emotion classification with the highest confidence level in the audio emotion recognition result is the same as the emotion classification with the highest confidence level in the text emotion recognition result. If yes, go to step 303, otherwise go to step 304.
Step 303: and taking the emotion classification with the highest confidence level in the audio emotion recognition result or the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result.
At this time, it is shown that the emotion classification with the highest confidence level in the audio emotion recognition result and the text emotion recognition result is the same, and therefore, the same emotion classification with the highest confidence level can be directly used as the final emotion recognition result. For example, when the audio emotion recognition result includes a satisfaction classification (confidence a1) and a calmness classification (confidence a2), and the text emotion recognition result includes only the satisfaction classification (confidence b1), and a1 > a2, the satisfaction classification is taken as the final emotion recognition result.
Step 304: and comparing the confidence coefficient of the emotion classification with the highest confidence degree in the audio emotion recognition result with the confidence coefficient of the emotion classification with the highest confidence degree in the text emotion recognition result.
In an embodiment of the invention, considering practical application scenarios, and given the specific emotion recognition algorithm and the limitations on the type and content of user voice messages, one of the audio emotion recognition result and the text emotion recognition result can be selected for output as the emotion recognition result mainly considered, with the other output as the emotion recognition result considered in an auxiliary manner; the final emotion recognition result is then determined comprehensively using factors such as confidence and emotion intensity level. Which of the two is selected as the emotion recognition result mainly considered may depend on the actual scenario: for example, when the audio quality of the user voice message is poor (for example, the sampling rate is low or the noise is strong, so that audio features are difficult to extract) but the conditions for text conversion are satisfied, the text emotion recognition result may be selected as the emotion recognition result mainly considered and the audio emotion recognition result output as the emotion recognition result considered in an auxiliary manner. However, the present invention does not limit which of the two is selected as the emotion recognition result mainly considered.
In one embodiment of the invention, the audio emotion recognition result is output as the emotion recognition result mainly considered, and the text emotion recognition result is output as the emotion recognition result considered in an auxiliary manner. At this time, if the confidence of the emotion classification with the highest confidence level in the audio emotion recognition result is greater than the confidence of the emotion classification with the highest confidence level in the text emotion recognition result, executing step 305; if the confidence coefficient of the emotion classification with the highest confidence level in the audio emotion recognition result is smaller than the confidence coefficient of the emotion classification with the highest confidence level in the text emotion recognition result, executing step 306; if the confidence level of the most confident emotion classification in the audio emotion recognition result is equal to the confidence level of the most confident emotion classification in the text emotion recognition result, step 309 is performed.
Step 305: and taking the emotion classification with the highest confidence level in the audio emotion recognition result as an emotion recognition result.
Since the audio emotion recognition result is selected to be output as the emotion recognition result mainly considered, the emotion classification in the audio emotion recognition result should be considered preferentially; and the confidence coefficient of the emotion classification with the highest confidence degree in the audio emotion recognition result is greater than that of the emotion classification with the highest confidence degree in the text emotion recognition result, so that the emotion classification with the highest confidence degree in the audio emotion recognition result which is mainly considered can be selected as the emotion recognition result. For example, when the audio emotion recognition result includes a satisfaction classification (confidence a1) and a calmness classification (confidence a2), and the text emotion recognition result includes only a calmness classification (confidence b1), a1 > a2 and a1 > b1, the satisfaction classification is regarded as the final emotion recognition result.
Step 306: and judging whether the emotion classification with the highest confidence level in the text emotion recognition result is included in the audio emotion recognition result. If yes, go to step 307; if the result of the determination is negative, step 309 is executed.
When the confidence of the emotion classification with the highest confidence in the audio emotion recognition result is smaller than the confidence of the emotion classification with the highest confidence in the text emotion recognition result, it is indicated that the emotion classification with the highest confidence in the text emotion recognition result is possibly more credible, but since the audio emotion recognition result is selected as the emotion recognition result mainly considered for output, it is required to judge whether the emotion classification with the highest confidence in the text emotion recognition result is included in the audio emotion recognition result. If the emotion classification with the highest confidence in the text emotion recognition result is really included in the audio emotion recognition result, whether the emotion classification with the highest confidence in the text emotion recognition result needs to be considered as an auxiliary consideration can be measured by means of the emotion intensity level. For example, when the audio emotion recognition result includes a satisfied classification (confidence a1) and a calm classification (confidence a2), and the text emotion recognition result includes only a calm classification (confidence b1), a1 > a2 and a1 < b1, it is necessary to determine whether the calm classification with the highest confidence in the text emotion recognition result is included in the audio emotion recognition result.
Step 307: and further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the text emotion recognition results in the audio emotion recognition results is greater than a first intensity threshold value. If the result of the further determination is yes, go to step 308; otherwise, step 309 is performed.
Step 308: and taking the emotion classification with the highest confidence level in the text emotion recognition result as an emotion recognition result.
Execution proceeds to step 308, which illustrates that the emotion classification with the highest confidence in the textual emotion recognition results is indeed included in the audio emotion recognition results, and that the emotion intensity level of the emotion classification with the highest confidence in the textual emotion recognition results is sufficiently high. In this case, the emotion classification with the highest confidence in the text emotion recognition result is not only highly reliable but also has a significant tendency to emotion, and therefore, the emotion classification with the highest confidence in the text emotion recognition result can be used as the emotion recognition result. For example, when the audio emotion recognition result includes a satisfied classification (confidence a1) and a calm classification (confidence a2), and the text emotion recognition result includes only a calm classification (confidence b1), a1 > a2, a1 < b1, and the emotion intensity level of the calm classification in the text emotion recognition result is greater than the first intensity threshold, the calm classification is taken as the final emotion recognition result.
Step 309: and taking the emotion classification with the highest confidence level in the audio emotion recognition result as an emotion recognition result, or taking the emotion classification with the highest confidence level in the audio emotion recognition result and the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result.
When the confidence of the most confident emotion classification in the audio emotion recognition result equals that of the most confident emotion classification in the text emotion recognition result, or the most confident emotion classification in the text emotion recognition result is not included in the audio emotion recognition result, or it is included but its emotion intensity level is not high enough, this means that for the time being a unified emotion classification cannot be output as the final emotion recognition result from the audio and text emotion recognition results. In this case, in an embodiment of the present invention, since the audio emotion recognition result has been selected for output as the emotion recognition result mainly considered, the emotion classification with the highest confidence in the audio emotion recognition result may be used directly as the emotion recognition result. In another embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result may be used together as the emotion recognition result, and the corresponding emotional intention information is determined in the subsequent process in combination with the emotion recognition results and basic intention information of past and/or subsequent user voice messages.
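A compact sketch of the decision flow in steps 301-309, with the audio result treated as the one mainly considered; the data layout, confidence values and the intensity threshold are assumed for illustration.

```python
# Sketch of the confidence-priority fusion in steps 301-309 (audio primary).
# Each result maps emotion classification -> (confidence, intensity level);
# the first intensity threshold is an assumed value.
def fuse(audio: dict, text: dict, first_intensity_threshold: int = 2):
    a_cls, (a_conf, _) = max(audio.items(), key=lambda kv: kv[1][0])
    t_cls, (t_conf, _) = max(text.items(), key=lambda kv: kv[1][0])

    if a_cls == t_cls:                        # steps 302/303
        return {a_cls}
    if a_conf > t_conf:                       # step 305
        return {a_cls}
    if a_conf < t_conf and t_cls in audio:    # steps 306/307
        if audio[t_cls][1] > first_intensity_threshold:
            return {t_cls}                    # step 308
    # step 309: fall back to the primary (audio) result; both could be returned.
    return {a_cls}

audio = {"satisfied": (0.8, 3), "calm": (0.6, 1)}
text = {"calm": (0.9, 1)}
# "calm" is in the audio result but its intensity is too low -> {'satisfied'}
print(fuse(audio, text))
```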
Fig. 4 is a schematic flow chart illustrating a process of determining an emotion recognition result in a speech recognition interaction method according to another embodiment of the present invention. Unlike the embodiment shown in fig. 3, the text emotion recognition result is selected to be output as the emotion recognition result of the main consideration in the embodiment shown in fig. 4, and the audio emotion recognition result is output as the emotion recognition result of the subsidiary consideration. It should be understood that, at this time, the flow of determining the emotion recognition result may be similar to the flow logic shown in fig. 3, only the emotion recognition result output which is mainly considered is changed to the text emotion recognition result, and the following steps may be specifically included, but the repeated logic description is not repeated:
step 401: and calculating the confidence degree of the emotion classification in the audio emotion recognition result and the confidence degree of the emotion classification in the text emotion recognition result.
Step 402: and judging whether the emotion classification with the highest confidence level in the audio emotion recognition result is the same as the emotion classification with the highest confidence level in the text emotion recognition result. If yes, go to step 403, otherwise go to step 404.
Step 403: and taking the emotion classification with the highest confidence level in the audio emotion recognition result or the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result.
Step 404: and comparing the confidence coefficient of the emotion classification with the highest confidence degree in the text emotion recognition result with the confidence coefficient of the emotion classification with the highest confidence degree in the audio emotion recognition result.
If the confidence coefficient of the emotion classification with the highest confidence level in the text emotion recognition result is greater than the confidence coefficient of the emotion classification with the highest confidence level in the audio emotion recognition result, executing step 405; if the confidence of the emotion classification with the highest confidence level in the text emotion recognition result is less than the confidence of the emotion classification with the highest confidence level in the audio emotion recognition result, executing step 406; if the confidence of the most confident emotion classification in the text emotion recognition result is equal to the confidence of the most confident emotion classification in the audio emotion recognition result, step 409 is performed.
Step 405: and taking the emotion classification with the highest confidence level in the text emotion recognition result as an emotion recognition result.
Step 406: and judging whether the text emotion recognition result comprises the emotion classification with the highest confidence level in the audio emotion recognition result. If the judgment result is yes, step 407 is executed; if the determination result is negative, step 409 is executed.
Step 407: and further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the audio emotion recognition results in the text emotion recognition results is greater than a first intensity threshold value. If the result of the further determination is yes, go to step 408; otherwise, step 409 is performed.
Step 408: and taking the emotion classification with the highest confidence level in the audio emotion recognition result as an emotion recognition result.
Step 409: and taking the emotion classification with the highest confidence level in the text emotion recognition result as an emotion recognition result, or taking the emotion classification with the highest confidence level in the text emotion recognition result and the emotion classification with the highest confidence level in the audio emotion recognition result as the emotion recognition result.
It should be understood that, although the embodiments of fig. 3 and 4 give examples of determining emotion recognition results, the process of comprehensively determining emotion recognition results according to audio emotion recognition results and text emotion recognition results may be implemented in other ways according to the specific forms of audio emotion recognition results and text emotion recognition results, and is not limited to the embodiments shown in fig. 3 and 4, and the present invention is not limited thereto.
In an embodiment of the invention, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in the multi-dimensional emotion space. In this case, the coordinate values of the two points can be combined by weighted averaging, and the resulting coordinate point is used as the emotion recognition result. For example, when the PAD three-dimensional emotion model is used, if the audio emotion recognition result is characterized as (p1, a1, d1) and the text emotion recognition result as (p2, a2, d2), the final emotion recognition result can be characterized as ((p1+p2)/2, (a1+1.3*a2)/2, (d1+0.8*d2)/2), where 1.3 and 0.8 are weight coefficients. Adopting a non-discrete dimensional emotion model allows the final emotion recognition result to be calculated quantitatively. It should be understood, however, that the combination of the two results is not limited to the weighted averaging described above, and the present invention does not limit the specific manner of determining the emotion recognition result when the audio and text emotion recognition results each correspond to a coordinate point in the multi-dimensional emotion space.
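A minimal sketch of this weighted-average fusion, using the weight coefficients from the example above; the sample PAD coordinates are placeholders.

```python
# Weighted-average fusion of PAD coordinates: the weights (1.0, 1.3, 0.8)
# follow the example in the text and apply to the text-result coordinates.
def fuse_pad(audio_pad, text_pad, weights=(1.0, 1.3, 0.8)):
    return tuple((a + w * t) / 2.0 for a, t, w in zip(audio_pad, text_pad, weights))

print(fuse_pad((0.4, 0.2, -0.1), (0.6, 0.1, 0.0)))
```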
In an embodiment of the present invention, the process of obtaining the audio emotion recognition result according to the audio data of the user voice message includes:
step 501: and extracting the audio characteristic vector of the user voice message in the audio stream to be recognized, wherein the user voice message corresponds to a section of speech in the audio stream to be recognized.
The audio feature vector comprises the values of at least one audio feature in at least one vector direction. All audio features are thus characterized in a multi-dimensional vector space: the direction and magnitude of the audio feature vector can be regarded as the sum of the values of multiple audio features along different vector directions, and the value of one audio feature in one vector direction can be regarded as one component of the audio feature vector. User voice messages carrying different emotions necessarily exhibit different audio features, and the emotion of a user voice message is recognized by exploiting the correspondence between emotions and audio features. In particular, the audio features may include one or more of the following: energy features, pronunciation frame number features, pitch frequency features, formant features, harmonic-to-noise ratio features, and Mel-frequency cepstral coefficient features. In an embodiment of the present invention, the following vector directions may be set in the vector space: proportion, mean, maximum, median and standard deviation.
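The sketch below illustrates how frame-level feature tracks might be summarized along such vector directions (mean, maximum, median, standard deviation) and concatenated into one audio feature vector; the feature names, frame counts and the omission of the proportion-type components are assumptions for illustration.

```python
# Illustrative construction of an audio feature vector: each frame-level
# feature track is collapsed into fixed statistics and the results are
# concatenated. The tracks here are random placeholders.
import numpy as np

def summarise(frame_values: np.ndarray) -> np.ndarray:
    """Collapse one frame-level feature track into mean/max/median/std."""
    return np.array([
        frame_values.mean(),
        frame_values.max(),
        np.median(frame_values),
        frame_values.std(),
    ])

def audio_feature_vector(feature_tracks: dict) -> np.ndarray:
    # Concatenate the statistics of every feature (energy, pitch, MFCCs, ...).
    return np.concatenate([summarise(v) for v in feature_tracks.values()])

tracks = {
    "energy": np.random.rand(200),        # one value per frame (placeholder)
    "pitch": np.random.rand(200) * 300,   # Hz (placeholder)
}
print(audio_feature_vector(tracks).shape)   # (8,)
```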
The energy characteristic refers to the power spectrum characteristic of the user voice message and can be obtained by summing the power spectrums. The calculation formula may be:
E(k) = Σ_{j=1..N} P(k, j)
wherein E represents the value of the energy characteristic, k represents the number of the frame, j represents the number of the frequency point, N is the frame length, and P represents the value of the power spectrum. In an embodiment of the invention, the energy characteristic may include a short-time energy first order difference, and/or an energy magnitude below a predetermined frequency. The formula for calculating the first order difference of the short-time energy may be:
VE(k)=(-2*E(k-2)-E(k-1)+E(k+1)+2*E(k+2))/3;
the energy below the preset frequency can be measured by a proportional value, for example, the formula for calculating the proportional value of the energy of the frequency band below 500Hz to the total energy can be:
p1 = [ Σ_{k=k1..k2} Σ_{j=1..j500} P(k, j) ] / [ Σ_{k=k1..k2} Σ_{j=1..N} P(k, j) ]
where j500 is the number of the frequency bin corresponding to 500 Hz, k1 is the number of the voice start frame of the user voice message to be recognized, and k2 is the number of the voice end frame of the user voice message to be recognized.
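A sketch of these energy features under assumed frame and spectrum layouts (power spectrum as a frames-by-bins array, 16 kHz sampling): per-frame energy, its short-time first-order difference as defined above, and the below-500 Hz energy ratio.

```python
# Illustrative energy features; the spectrum layout and sampling rate are assumed.
import numpy as np

def frame_energy(power_spec: np.ndarray) -> np.ndarray:
    """power_spec: (num_frames, num_bins) -> per-frame energy E(k)."""
    return power_spec.sum(axis=1)

def energy_first_order_diff(E: np.ndarray) -> np.ndarray:
    VE = np.zeros_like(E)
    for k in range(2, len(E) - 2):
        VE[k] = (-2 * E[k - 2] - E[k - 1] + E[k + 1] + 2 * E[k + 2]) / 3.0
    return VE

def low_band_ratio(power_spec: np.ndarray, fs: int = 16000, cutoff: float = 500.0) -> float:
    num_bins = power_spec.shape[1]
    j500 = int(num_bins * cutoff / (fs / 2.0))   # bin index for 500 Hz (assumed layout)
    return power_spec[:, :j500].sum() / power_spec.sum()

spec = np.abs(np.random.randn(100, 257)) ** 2    # placeholder power spectrum
E = frame_energy(spec)
print(energy_first_order_diff(E)[:5], low_band_ratio(spec))
```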
The pronunciation frame number feature refers to the number of voiced frames in the user voice message, which can also be measured as a proportion. For example, if the numbers of voiced frames and unvoiced frames in the user voice message are n1 and n2 respectively, the ratio of voiced to unvoiced frames is p2 = n1/n2, and the ratio of voiced frames to the total number of frames is p3 = n1/(n1+n2).
The pitch frequency feature may be extracted using an algorithm based on the autocorrelation function of the linear prediction (LPC) error signal. The pitch frequency feature may comprise the pitch frequency and/or its first-order difference. The algorithm flow for the pitch frequency may be as follows: first, the linear prediction coefficients of a speech frame x(k) are calculated and the linear prediction estimate of the signal is computed as
x̂(k) = Σ_{i=1..p} a_i * x(k-i)
where a_i are the linear prediction coefficients and p is the prediction order.
Next, the autocorrelation function c1 of the error signal is calculated:
c1(h) = Σ_k e(k) * e(k+h), where e(k) = x(k) − x̂(k) is the linear prediction error signal and h is the offset (lag).
then, in the offset range of the corresponding pitch frequency of 80-500Hz, the maximum value of the autocorrelation function is searched, and the corresponding offset delta h is recorded. The pitch frequency F0 is calculated as: f0 ═ Fs/Δ h, where Fs is the sampling frequency.
Formant features may be extracted using an algorithm based on the roots of the linear prediction polynomial, and may include the first, second and third formants as well as their first-order differences. The harmonic-to-noise ratio (HNR) feature can be extracted using an algorithm based on independent component analysis (ICA). The Mel-frequency cepstral coefficient (MFCC) features may include the 1st- to 12th-order Mel-frequency cepstral coefficients, which can be obtained by the usual MFCC calculation procedure and are not described again here.
It should be understood that which audio feature vectors are extracted may be determined according to the requirements of the actual scene, and the present invention does not limit the type, number, or vector directions of the audio features corresponding to the extracted audio feature vectors. However, in an embodiment of the present invention, in order to obtain an optimal emotion recognition effect, the above six audio features may be extracted simultaneously: energy features, frame number of utterance features, pitch frequency features, formant features, harmonic-to-noise ratio features, and mel-frequency cepstral coefficient features. For example, when these six audio features are extracted simultaneously, the extracted audio feature vector may include the 173 components shown in Table 1 below, and using this audio feature vector with Gaussian mixture models (GMM) as the emotion feature models, the accuracy of speech emotion recognition on the CAS Chinese emotion corpus can reach 74% to 80%.
Table 1: the 173 components of the extracted audio feature vector (table not reproduced in this text).
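To relate the frame-level features above to the fixed-length audio feature vector, the sketch below collapses each frame-level track with the statistics named earlier (mean, maximum, median, standard deviation) and appends scalar proportion features such as p1, p2, and p3; it does not reproduce the exact 173-component layout of Table 1.

```python
# Turning frame-level feature tracks into a fixed-length audio feature vector
# using the vector directions named above; the exact Table 1 layout is not
# reproduced here.
import numpy as np


def functionals(track):
    """Statistics of one frame-level feature track (e.g. energy, F0, a formant)."""
    track = np.asarray(track, dtype=float)
    return np.array([track.mean(), track.max(), np.median(track), track.std()])


def audio_feature_vector(frame_level_features, scalar_features=()):
    """frame_level_features: dict name -> per-frame values; scalar_features: e.g. the ratios p1, p2, p3."""
    parts = [functionals(v) for _, v in sorted(frame_level_features.items())]
    parts.append(np.asarray(scalar_features, dtype=float))
    return np.concatenate(parts)
```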
In an embodiment of the present invention, the audio stream to be recognized may be a customer service interaction audio stream, and a user voice message corresponds to one user voice input segment or one customer service voice input segment in the audio stream to be recognized. Because the customer service interaction process usually follows a question-and-answer pattern, one user voice input segment may correspond to one question or answer by the user within one interaction, and one customer service voice input segment may correspond to one question or answer by a customer service agent within one interaction. Since the user or the agent generally expresses an emotion completely within one question or answer, taking a user voice input segment or a customer service voice input segment as the unit of emotion recognition ensures both the integrity of emotion recognition and its real-time performance during customer service interaction.
Step 502: and matching the audio characteristic vector of the user voice message with a plurality of emotion characteristic models, wherein the emotion characteristic models respectively correspond to one of a plurality of emotion classifications.
The emotion feature models can be established in advance by pre-learning the audio feature vectors of a plurality of preset user voice messages that carry emotion classification labels corresponding to a plurality of emotion classifications, so that the correspondence between emotion feature models and emotion classifications is established, each emotion feature model corresponding to one emotion classification. The pre-learning process may include: first, clustering the respective audio feature vectors of the plurality of preset user voice messages carrying emotion classification labels to obtain clustering results for the preset emotion classifications (S61); then, training the audio feature vectors of the preset user voice messages in each cluster into an emotion feature model according to the clustering result (S62). Based on these emotion feature models, the emotion feature model matching the current user voice message can be obtained through a matching process based on its audio feature vector, and the corresponding emotion classification is then obtained.
In one embodiment of the present invention, the emotion feature models may be Gaussian mixture models (GMMs) (the number of mixture components may be 5). In that case, the audio feature vectors of the voice samples of the same emotion classification may first be clustered with the K-means algorithm (for example with 50 iterations), and the initial values of the Gaussian mixture model parameters are computed from the clustering result. A Gaussian mixture model corresponding to each emotion classification is then trained with the E-M algorithm (for example with 200 iterations). When these Gaussian mixture models are used for the emotion classification matching process, the likelihood probabilities between the audio feature vector of the current user voice message and the plurality of emotion feature models can be calculated, and the matched emotion feature model is then determined by comparing these likelihood probabilities, for example taking the emotion feature model whose likelihood probability is greater than a preset threshold and is the largest as the matched model.
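The pre-learning (S61/S62) and matching steps can be sketched as follows, with scikit-learn's GaussianMixture standing in for the 5-component, k-means-initialised, EM-trained model per emotion classification; the log-likelihood threshold is an illustrative assumption.

```python
# One Gaussian mixture model per emotion classification, then matching by
# log-likelihood against a threshold; the threshold value is an assumption.
import numpy as np
from sklearn.mixture import GaussianMixture


def train_emotion_models(samples):
    """samples: dict emotion_label -> array of audio feature vectors carrying that label."""
    models = {}
    for label, vectors in samples.items():
        gmm = GaussianMixture(n_components=5, covariance_type="diag",
                              init_params="kmeans", max_iter=200)
        models[label] = gmm.fit(np.asarray(vectors))
    return models


def match_emotion(models, feature_vector, threshold=-50.0):
    """Return the emotion whose model gives the highest log-likelihood above the threshold."""
    scores = {label: gmm.score(feature_vector.reshape(1, -1))
              for label, gmm in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```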
It should be understood that although the above description states that the emotion feature model may be a Gaussian mixture model, the emotion feature model may also be implemented in other forms, such as a support vector machine (SVM) model, a K-nearest-neighbor (KNN) classification model, a hidden Markov model (HMM), or an artificial neural network (ANN) model. The present invention does not strictly limit the specific implementation form of the emotion feature model. Likewise, it should be understood that the implementation of the matching process can be adjusted to the implementation form of the emotion feature model, and the present invention does not limit the specific implementation form of the matching process either.
In an embodiment of the present invention, the plurality of emotion classifications may include a satisfied classification, a calm classification, and a fidgety classification, corresponding to the emotional states that customer service personnel may exhibit in a customer service interaction scenario. In another embodiment, the plurality of emotion classifications may include a satisfied classification, a calm classification, a fidgety classification, and an angry classification, corresponding to the emotional states that a user may exhibit in a customer service interaction scenario. That is, when the audio stream to be recognized is a customer service interaction audio stream, if the current user voice message corresponds to a customer service voice input segment, the plurality of emotion classifications may include: the satisfied classification, the calm classification, and the fidgety classification; if the current user voice message corresponds to a user voice input segment, the plurality of emotion classifications may include: the satisfied classification, the calm classification, the fidgety classification, and the angry classification. Such role-specific emotion classifications for users and customer service personnel allow the method to be applied to a call center system in a simpler way, reducing the amount of computation while still meeting the emotion recognition requirements of the call center system. It should be understood, however, that the type and number of emotion classifications can be adjusted to the needs of the actual application scenario; the present invention does not limit them.
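A minimal sketch of this role-dependent choice of emotion classifications; the keys and label names are illustrative only.

```python
# Emotion classification sets selected by the type of voice input segment,
# as described above; names are illustrative.
EMOTION_CLASSES = {
    "customer_service_segment": ["satisfied", "calm", "fidgety"],
    "user_segment": ["satisfied", "calm", "fidgety", "angry"],
}


def emotion_label_set(segment_type):
    return EMOTION_CLASSES[segment_type]
```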
Step 503: and taking the emotion classification corresponding to the matched emotion feature model as the emotion classification of the user voice message.
As mentioned above, since there is a correspondence between emotion feature models and emotion classifications, after the matched emotion feature model is determined by the matching process of step 502, the emotion classification corresponding to that model is the recognized emotion classification. For example, when the emotion feature models are Gaussian mixture models, the matching process can be implemented by measuring the likelihood probabilities between the audio feature vector of the current user voice message and the plurality of emotion feature models, and the emotion classification corresponding to the emotion feature model whose likelihood probability is greater than the preset threshold and is the largest is then taken as the emotion classification of the user voice message.
Therefore, the voice emotion recognition method provided by the embodiment of the invention realizes real-time emotion recognition of the user voice message by extracting the audio feature vector of the user voice message in the audio stream to be recognized and matching the extracted audio feature vector by using the pre-established emotion feature model. Therefore, under the application scene of the call center system, the emotional states of the customer service and the customer can be monitored in real time in the customer service interactive conversation, and the service quality of an enterprise adopting the call center system and the customer service experience of the customer can be obviously improved.
It should also be understood that, based on the emotion classification recognized by the speech emotion recognition method provided by the embodiment of the present invention, more flexible secondary applications can be implemented according to specific scene requirements. In an embodiment of the invention, the emotion classification of the currently recognized user voice message can be displayed in real time, and the specific display mode can be adjusted to the actual scene. For example, different emotion classifications may be represented by different signal light colors: a blue light for "happy", a green light for "calm", a yellow light for "fidgety", and a red light for "angry". In this way, the change of the signal light color reminds quality inspection personnel in real time of the emotional state of the call currently handled by the customer service agent. In another embodiment, the emotion classifications of user voice messages recognized within a preset time period can be aggregated: for example, the audio number of the call recording, the timestamps of the start and end of each user voice message, and the emotion recognition results are recorded to form an emotion recognition database, and the number of occurrences and the probability of each emotion within a time period are counted and presented as a graph or table, which the enterprise can use as a reference for judging the service quality of customer service personnel during that period. In another embodiment, emotional response information corresponding to the recognized emotion classification of the user voice message may be sent in real time, which is applicable to an unattended machine customer service scenario. For example, when the user is recognized in real time to be in the "angry" state during the current call, soothing phrases corresponding to that state are replied automatically, so as to calm the user's mood and keep the communication going. The correspondence between emotion classifications and emotional response information can be established in advance through a pre-learning process.
In an embodiment of the present invention, before extracting the audio feature vector of the user voice message in the audio stream to be recognized, the user voice message needs to be extracted from the audio stream to be recognized first, so as to perform emotion recognition subsequently by using the user voice message as a unit, and the extraction process may be performed in real time.
Fig. 5 is a schematic flow chart illustrating a process of obtaining a text emotion recognition result according to text content of a user voice message in the voice recognition interaction method according to an embodiment of the present invention. As shown in fig. 5, the process of obtaining the text emotion recognition result according to the text content of the user voice message may include the following steps:
step 1001: and recognizing emotion vocabularies in the text content of the voice message of the user, and determining a first text emotion recognition result according to the recognized emotion vocabularies.
The correspondence between emotion vocabulary and the first text emotion recognition result can be established through a pre-learning process: each emotion vocabulary item has a corresponding emotion classification and emotion intensity level, and the emotion classification of the whole text content of the user voice message, together with the emotion intensity level of that classification, can be obtained from the recognized vocabulary according to a preset statistical algorithm and this correspondence. For example, if the text content of the user voice message contains emotion vocabulary such as "thank you" (satisfied classification, moderate emotion intensity level), "you are great" (satisfied classification, high emotion intensity level), or "that's wonderful" (satisfied classification, high emotion intensity level), the first text emotion recognition result is likely to be the satisfied classification.
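A small sketch of this lexicon-based first text emotion recognition result; the lexicon entries and the aggregation rule (majority emotion classification, highest intensity among its hits) are illustrative assumptions rather than the statistical algorithm prescribed by the text.

```python
# Lexicon-based first text emotion recognition result; entries and the
# aggregation rule are illustrative assumptions.
from collections import Counter

EMOTION_LEXICON = {
    "thank you":        ("satisfied", "moderate"),
    "you are great":    ("satisfied", "high"),
    "that's wonderful": ("satisfied", "high"),
    "so annoying":      ("fidgety",   "high"),
}


def first_text_emotion(text):
    hits = [v for phrase, v in EMOTION_LEXICON.items() if phrase in text.lower()]
    if not hits:
        return None
    classes = Counter(cls for cls, _ in hits)
    best_class = classes.most_common(1)[0][0]
    levels = [lvl for cls, lvl in hits if cls == best_class]
    best_level = "high" if "high" in levels else "moderate"
    return best_class, best_level
```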
Step 1002: and inputting the text content of the user voice message into a text emotion recognition deep learning model, wherein the text emotion recognition deep learning model is established on the basis of training the text content comprising an emotion classification label and an emotion intensity level label, and the output result of the text emotion recognition deep learning model is used as a second text emotion recognition result.
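A hedged PyTorch sketch of such a text emotion recognition deep learning model with two output heads, one for the emotion classification label and one for the emotion intensity level label; the bag-of-embeddings architecture and all sizes are assumptions for illustration, not the model prescribed by the text.

```python
# A simplified text emotion model trained on emotion classification labels and
# emotion intensity level labels; architecture and sizes are assumptions.
import torch
import torch.nn as nn


class TextEmotionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=4, num_levels=3):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.class_head = nn.Linear(embed_dim, num_classes)   # emotion classification
        self.level_head = nn.Linear(embed_dim, num_levels)    # emotion intensity level

    def forward(self, token_ids, offsets):
        pooled = self.embedding(token_ids, offsets)
        return self.class_head(pooled), self.level_head(pooled)


def training_step(model, optimizer, token_ids, offsets, class_labels, level_labels):
    class_logits, level_logits = model(token_ids, offsets)
    loss = nn.functional.cross_entropy(class_logits, class_labels) \
         + nn.functional.cross_entropy(level_logits, level_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```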
Step 1003: and determining a text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
It should be appreciated that the first textual emotion recognition result and the second textual emotion recognition result may be characterized in a variety of ways. In one embodiment of the invention, the emotion recognition results may be characterized in a discrete emotion classification manner, where the first textual emotion recognition result and the second textual emotion recognition result may each include one or more of a plurality of emotion classifications, each of which may include a plurality of emotion intensity levels. In another embodiment of the invention, the emotion recognition result can also be represented by a non-discrete dimensional emotion model, the first text emotion recognition result and the second text emotion recognition result respectively correspond to a coordinate point in a multi-dimensional emotion space, and each dimension in the multi-dimensional emotion space corresponds to a psychologically defined emotion factor. The characterization of discrete emotion classification and the characterization of non-discrete dimensional emotion models are described above and will not be described in detail here. However, it should be understood that the first textual emotion recognition result and the second textual emotion recognition result may be characterized by other characterization methods, and the specific characterization methods are not limited by the present invention. It should be understood that, in an embodiment of the present invention, the final text emotion recognition result may also be determined according to only one of the first text emotion recognition result and the second text emotion recognition result, which is not limited by the present invention.
Fig. 6 is a schematic flow chart illustrating a process of obtaining a text emotion recognition result according to the text content of a user voice message in the voice recognition interaction method according to an embodiment of the present invention. In this embodiment, the text emotion recognition result needs to be determined comprehensively from the first text emotion recognition result and the second text emotion recognition result, and the first and second text emotion recognition results each include one or more of a plurality of emotion classifications. The method for determining the text emotion recognition result may include the following steps:
step 1101: if the first text emotion recognition result and the second text emotion recognition result include the same emotion classification, the same emotion classification is taken as a text emotion recognition result.
For example, when the first textual emotion recognition result includes a satisfaction classification and a calmness classification, and the second textual emotion recognition result includes only a satisfaction classification, then the final textual emotion recognition result may be a satisfaction classification.
Step 1102: and if the first text emotion recognition result and the second text emotion recognition result do not comprise the same emotion classification, using the first text emotion recognition result and the second text emotion recognition result together as a text emotion recognition result.
For example, when the first textual emotion recognition result includes a happy category and the second textual emotion recognition result includes only a calm category, then the final textual emotion recognition result may be a happy category and a calm category. In an embodiment of the present invention, when the final textual emotion recognition result includes a plurality of emotion classifications, the textual emotion recognition result and the basic intention information of the past user voice message and/or the subsequent user voice message are combined in the subsequent process to determine the corresponding emotional intention information.
It should be understood that, although it is defined in step 1102 that the first text emotion recognition result and the second text emotion recognition result are used together as the text emotion recognition result when the first text emotion recognition result and the second text emotion recognition result do not include the same emotion classification, in other embodiments of the present invention, a more conservative interaction policy may be adopted, for example, error information is directly generated or the text emotion recognition result is not output, so as to avoid misleading the interaction process.
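Steps 1101 and 1102 can be written directly as a small function over sets of emotion classifications; the stricter fallback mentioned above (reporting an error instead) is noted in the comment.

```python
def combine_text_emotion(first_result, second_result):
    """Steps 1101/1102: first_result and second_result are sets of emotion classifications."""
    shared = set(first_result) & set(second_result)
    if shared:
        return shared                                    # step 1101: keep the shared classification(s)
    # step 1102: keep both results together; a more conservative policy could
    # instead report an error or return nothing, as discussed above.
    return set(first_result) | set(second_result)
```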
Fig. 7 is a schematic flow chart illustrating a process of determining a text emotion recognition result in the speech recognition interaction method according to an embodiment of the present invention. In this embodiment, the text emotion recognition result likewise needs to be determined comprehensively from the first text emotion recognition result and the second text emotion recognition result, and the first and second text emotion recognition results each include one or more of a plurality of emotion classifications. The method for determining the text emotion recognition result may include the following steps:
step 1201: and calculating the confidence degree of the emotion classification in the first text emotion recognition result and the confidence degree of the emotion classification in the second text emotion recognition result.
Statistically, confidence is also referred to as reliability, confidence level, or confidence coefficient. Due to the randomness of the samples, the conclusions drawn are always uncertain when using the sampling to estimate the overall parameters. Therefore, interval estimation in mathematical statistics can be used to estimate how large the probability that the error between an estimated value and the overall parameter is within a certain allowable range, and this corresponding probability is called confidence. For example, it is assumed that the preset emotion classification is related to one variable characterizing the emotion classification, i.e., the emotion classification may correspond to different values according to the magnitude of the variable value. When the confidence degree of the speech text emotion recognition result is to be acquired, a plurality of measured values of the variable are obtained through a plurality of times of first text emotion recognition/second text emotion recognition processes, and then the average value of the plurality of measured values is used as an estimated value. And estimating the probability of the error range between the estimation value and the true value of the variable within a certain range by an interval estimation method, wherein the higher the probability value is, the more accurate the estimation value is, namely, the higher the confidence of the current emotion classification is. It should be understood that the above-mentioned variables characterizing the emotion classification can be determined according to a specific algorithm of emotion recognition, and the present invention is not limited thereto.
Step 1202: and judging whether the emotion classification with the highest confidence level in the first text emotion recognition result is the same as the emotion classification with the highest confidence level in the second text emotion recognition result. If yes, go to step 1203, otherwise go to step 1204.
Step 1203: and taking the emotion classification with the highest confidence level in the first text emotion recognition result or the emotion classification with the highest confidence level in the second text emotion recognition result as the text emotion recognition result.
This indicates that the emotion classification with the highest confidence level in the first text emotion recognition result and the second text emotion recognition result is the same, and therefore the same emotion classification with the highest confidence level can be directly used as the final text emotion recognition result. For example, when the first text emotion recognition result includes a satisfaction classification (confidence a1) and a calmness classification (confidence a2), and the second text emotion recognition result includes only a satisfaction classification (confidence b1), and a1 > a2, the satisfaction classification is regarded as the final text emotion recognition result.
Step 1204: the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is compared with the confidence of the emotion classification with the highest confidence in the second text emotion recognition result.
In an embodiment of the invention, in a practical application scenario, depending on the specific emotion recognition algorithm and the limitations of the type and content of the user voice messages, one of the first and second text emotion recognition results can be selected as the primarily considered output and the other as the auxiliary output, and the final text emotion recognition result is then determined comprehensively using factors such as confidence and emotion intensity level. It should be understood that which of the first and second text emotion recognition results is selected as the primarily considered output may be decided according to the actual scene; the present invention does not limit this choice.
In an embodiment of the invention, the first text emotion recognition result is output as a text emotion recognition result mainly considered, and the second text emotion recognition result is output as a text emotion recognition result considered in an auxiliary manner. At this time, if the confidence of the emotion classification with the highest confidence level in the first text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence level in the second text emotion recognition result, executing step 1205; if the confidence of the emotion classification with the highest confidence level in the first text emotion recognition result is smaller than the confidence of the emotion classification with the highest confidence level in the second text emotion recognition result, executing step 1206; if the confidence level of the most confident emotion classification in the first text emotion recognition result is equal to the confidence level of the most confident emotion classification in the second text emotion recognition result, step 1209 is performed.
Step 1205: and classifying the emotion with the highest confidence level in the first text emotion recognition result as a text emotion recognition result.
Because the first text emotion recognition result is selected to be output as the text emotion recognition result mainly considered, the emotion classification in the first text emotion recognition result should be considered preferentially; and the confidence coefficient of the emotion classification with the highest confidence degree in the first text emotion recognition result is greater than that of the emotion classification with the highest confidence degree in the second text emotion recognition result, so that the emotion classification with the highest confidence degree in the first text emotion recognition result which is mainly considered can be selected as the text emotion recognition result. For example, when the first text emotion recognition result includes a satisfaction classification (confidence a1) and a calmness classification (confidence a2), and the second text emotion recognition result includes only a calmness classification (confidence b1), a1 > a2 and a1 > b1, the satisfaction classification is regarded as the final text emotion recognition result.
Step 1206: and judging whether the emotion classification with the highest confidence level in the second text emotion recognition result is included in the first text emotion recognition result. If yes, go to step 1207; if the determination is no, step 1209 is performed.
When the confidence coefficient of the emotion classification with the highest confidence coefficient in the first text emotion recognition result is smaller than the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result, it is indicated that the emotion classification with the highest confidence coefficient in the second text emotion recognition result is possibly more credible, but since the first text emotion recognition result is selected as the text emotion recognition result which is mainly considered to be output, it is required to judge whether the emotion classification with the highest confidence coefficient in the second text emotion recognition result is included in the first text emotion recognition result. If the most confident emotion classification in the second text emotion recognition result is indeed included in the first text emotion recognition result, the emotion intensity level can be used to measure whether the most confident emotion classification in the second text emotion recognition result needs to be considered as an auxiliary consideration. For example, when the first text emotion recognition result includes a satisfied classification (confidence a1) and a calm classification (confidence a2), and the second text emotion recognition result includes only a calm classification (confidence b1), a1 > a2 and a1 < b1, it is necessary to determine whether the calm classification with the highest confidence in the second text emotion recognition result is included in the first text emotion recognition result.
Step 1207: and further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the second text emotion recognition results in the first text emotion recognition results is greater than a first intensity threshold value. If the result of the further determination is yes, go to step 1208; otherwise, step 1209 is performed.
Step 1208: and classifying the emotion with the highest confidence level in the second text emotion recognition result as a text emotion recognition result.
Execution to step 1208 illustrates that the most confident emotion classification in the second textual emotion recognition result is indeed included in the first textual emotion recognition result, and that the emotional intensity level of the most confident emotion classification in the second textual emotion recognition result is sufficiently high. In this case, it means that the emotion classification with the highest confidence in the second text emotion recognition result is not only highly reliable but also has a significant tendency of emotion, and therefore, the emotion classification with the highest confidence in the second text emotion recognition result can be used as the text emotion recognition result. For example, when the first text emotion recognition result includes a satisfaction classification (confidence a1) and a calmness classification (confidence a2), and the second text emotion recognition result includes only a calmness classification (confidence b1), a1 > a2, a1 < b1, and the emotion intensity level of the calmness classification in the second text emotion recognition result is greater than the first intensity threshold, the calmness classification is taken as the final text emotion recognition result.
Step 1209: and taking the emotion classification with the highest confidence level in the first text emotion recognition result as a text emotion recognition result, or taking the emotion classification with the highest confidence level in the first text emotion recognition result and the emotion classification with the highest confidence level in the second text emotion recognition result as the text emotion recognition result together.
When the confidence of the most confident emotion classification in the first text emotion recognition result equals that of the most confident emotion classification in the second text emotion recognition result, or the most confident emotion classification in the second result is not included in the first result, or it is included but its emotion intensity level is not high enough, a single unified emotion classification cannot be output as the final text emotion recognition result from the first and second results. In this case, in an embodiment of the present invention, since the first text emotion recognition result has been selected as the primarily considered output, the emotion classification with the highest confidence in the first result may be used directly as the text emotion recognition result. In another embodiment of the present invention, the first and second text emotion recognition results may be used together as the text emotion recognition result, and the corresponding emotional intention information is determined in the subsequent process by combining the text emotion recognition result with the basic intention information of past and/or subsequent user voice messages.
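A sketch of the decision flow of steps 1201-1209, with the first text emotion recognition result treated as the primarily considered output. Each result is assumed to be a dict mapping an emotion classification to a (confidence, intensity level) pair, with intensity levels assumed numeric so they can be compared against the first intensity threshold.

```python
# Decision logic of Fig. 7 (steps 1201-1209); the numeric intensity threshold
# is an illustrative assumption.
def decide_text_emotion(first, second, intensity_threshold=2):
    top1 = max(first, key=lambda c: first[c][0])     # most confident class in result 1
    top2 = max(second, key=lambda c: second[c][0])   # most confident class in result 2
    conf1, conf2 = first[top1][0], second[top2][0]

    if top1 == top2:                                  # step 1202 -> 1203
        return {top1}
    if conf1 > conf2:                                 # step 1204 -> 1205
        return {top1}
    if conf1 < conf2 and top2 in first:               # step 1206
        if first[top2][1] > intensity_threshold:      # step 1207 -> 1208
            return {top2}
    return {top1, top2}                               # step 1209 (or {top1} alone)
```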
Fig. 8 is a schematic flow chart illustrating a process of determining a text emotion recognition result in a speech recognition interaction method according to another embodiment of the present invention. Unlike the embodiment shown in fig. 7, the second text emotion recognition result is selected in the embodiment shown in fig. 8 to be output as the text emotion recognition result of the main consideration, and the first text emotion recognition result is output as the text emotion recognition result of the supplementary consideration. It should be understood that, at this time, the process of determining the text emotion recognition result may be similar to the logic of the process shown in fig. 7, only the output of the text emotion recognition result which is mainly considered is changed to the second text emotion recognition result, which may specifically include the following steps, but the repeated logic description is not repeated:
step 1301: and calculating the confidence degree of the emotion classification in the first text emotion recognition result and the confidence degree of the emotion classification in the second text emotion recognition result.
Step 1302: and judging whether the emotion classification with the highest confidence level in the first text emotion recognition result is the same as the emotion classification with the highest confidence level in the second text emotion recognition result. If yes, go to step 1303, otherwise go to step 1304.
Step 1303: and taking the emotion classification with the highest confidence level in the first text emotion recognition result or the emotion classification with the highest confidence level in the second text emotion recognition result as the text emotion recognition result.
Step 1304: and comparing the confidence coefficient of the emotion classification with the highest confidence degree in the second text emotion recognition result with the confidence coefficient of the emotion classification with the highest confidence degree in the first text emotion recognition result.
If the confidence of the emotion classification with the highest confidence level in the second text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence level in the first text emotion recognition result, executing step 1305; if the confidence of the emotion classification with the highest confidence level in the second text emotion recognition result is less than the confidence of the emotion classification with the highest confidence level in the first text emotion recognition result, executing step 1306; if the confidence level of the most confident emotion classification in the second text emotion recognition result is equal to the confidence level of the most confident emotion classification in the first text emotion recognition result, step 1309 is performed.
Step 1305: and classifying the emotion with the highest confidence level in the second text emotion recognition result as a text emotion recognition result.
Step 1306: and judging whether the emotion classification with the highest confidence level in the first text emotion recognition result is included in the second text emotion recognition result. If the judgment result is yes, step 1307 is executed; if the result of the determination is negative, step 1309 is performed.
Step 1307: and further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the first text emotion recognition result in the second text emotion recognition results is greater than a first intensity threshold value. If the result of the further determination is yes, go to step 1308; otherwise, step 1309 is performed.
Step 1308: and classifying the emotion with the highest confidence level in the first text emotion recognition result as a text emotion recognition result.
Step 1309: and taking the emotion classification with the highest confidence level in the second text emotion recognition result as a text emotion recognition result, or taking the emotion classification with the highest confidence level in the second text emotion recognition result and the emotion classification with the highest confidence level in the first text emotion recognition result as the text emotion recognition result.
It should be understood that, although the embodiments of fig. 7 and 8 give examples of determining the text emotion recognition result, the process of comprehensively determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result may be implemented in other ways according to different specific forms of the first text emotion recognition result and the second text emotion recognition result, and is not limited to the embodiments shown in fig. 7 and 8, and the present invention is not limited thereto.
In an embodiment of the present invention, the first text emotion recognition result and the second text emotion recognition result respectively correspond to a coordinate point in the multidimensional space, and at this time, weighted average processing may be performed on coordinate values of the coordinate points in the multidimensional space of the first text emotion recognition result and the second text emotion recognition result, and the coordinate point obtained after the weighted average processing is used as the text emotion recognition result. For example, when the PAD three-dimensional emotion model is used, the first text emotion recognition result is characterized as (p1, a1, d1), the second text emotion recognition result is characterized as (p2, a2, d2), and the final text emotion recognition result is characterized as ((p1+ p2)/2, (a1+1.3 a2)/2, (d1+0.8 d2)/2), wherein 1.3 and 0.8 are weight coefficients. The non-discrete dimension emotion model is adopted, so that the final text emotion recognition result can be calculated in a quantitative mode. However, it should be understood that the combination of the first and second text emotion recognition results is not limited to the weighted average process described above, and the present invention is not limited to the specific manner of determining the text emotion recognition result when the first text emotion recognition result and the second text emotion recognition result respectively correspond to one coordinate point in the multidimensional space.
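The weighted-average fusion of two coordinate points can be sketched in a few lines; the per-dimension weights (1.0, 1.3, 0.8) follow the PAD example above and are otherwise arbitrary.

```python
# Weighted average of two coordinate points in a dimensional emotion space.
import numpy as np


def fuse_pad(first_pad, second_pad, weights=(1.0, 1.3, 0.8)):
    """first_pad, second_pad: (P, A, D) coordinates; returns the fused coordinate point."""
    first_pad, second_pad = np.asarray(first_pad), np.asarray(second_pad)
    return (first_pad + np.asarray(weights) * second_pad) / 2.0
```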
Fig. 9 is a schematic flow chart illustrating obtaining basic intention information according to a user voice message in a voice recognition interaction method according to an embodiment of the present invention. As shown in fig. 9, the process of acquiring the basic intention information may include the following steps:
step 1401: matching the text content of the user voice message with a plurality of preset semantic templates in a semantic knowledge base to determine a matched semantic template; the corresponding relation between the semantic template and the basic intention information is pre-established in a semantic knowledge base, and the same intention information corresponds to one or more semantic templates.
It should be understood that semantic matching through semantic templates (for example, semantic templates comprising standard questions and extended questions) is only one implementation; the voice text information input by the user can also be matched or classified directly by a network that extracts character, word, and sentence vector features (possibly with an attention mechanism added).
Step 1402: and acquiring basic intention information corresponding to the matched semantic template.
In an embodiment of the invention, the text content of the user voice message can correspond to a "standard question" in the semantic knowledge base, where a "standard question" is a piece of text used to represent a certain knowledge point, its main purpose being clear expression and easy maintenance. The "question" here should not be interpreted narrowly as an interrogative sentence, but broadly as an "input" that has a corresponding "output". In the ideal case the user inputs a standard question to the intelligent interaction machine, and the machine's intelligent semantic recognition system can immediately understand the user's meaning.
However, users often do not use standard questions but rather variant forms of them, namely extended questions. Therefore, for intelligent semantic recognition, the knowledge base also needs extended questions of each standard question, which differ slightly from the standard question in expression but express the same meaning. Accordingly, in a further embodiment of the present invention, the semantic template is a set of one or more semantic expressions representing a certain semantic content, generated by a developer according to predetermined rules in combination with that semantic content: one semantic template can describe sentences that express the corresponding semantic content in multiple different ways, so as to cope with the many possible variants of the text content of the user voice message. Matching the text content of the user voice message against preset semantic templates thus avoids the limitation of recognizing user voice messages only with a standard question that describes a single form of expression.
Ontology class attributes may be further abstracted, for example by means of abstract semantics. The abstract semantics of a category describe the different expressions of a class of abstract semantics through a set of abstract semantic expressions; these expressions are expanded over their constituent elements so that more abstract semantics can be expressed.
It should be understood that the above are merely examples of semantic component words, semantic rule words, and semantic symbols; the specific content and part of speech of the semantic component words, the specific content and part of speech of the semantic rule words, and the definition and collocation of the semantic symbols may all be preset by developers according to the specific interaction service scenario to which the speech recognition interaction method is applied, and the present invention does not limit this.
In an embodiment of the present invention, the process of determining the matched semantic template from the text content of the user voice message may be implemented through a similarity calculation. Specifically, a plurality of text similarities between the text content of the user voice message and a plurality of preset semantic templates are calculated, and the semantic template with the highest text similarity is taken as the matched semantic template. The similarity may be calculated by one or more of the following methods: edit distance, n-gram, Jaro-Winkler, and Soundex. In a further embodiment, when the semantic component words and semantic rule words in the text content of the user voice message are recognized, the semantic component words and semantic rule words contained in the user voice message and in the semantic templates can also be converted into simplified text strings to improve the efficiency of the semantic similarity calculation.
In an embodiment of the present invention, as described above, the semantic template may be composed of semantic component words and semantic rule words, and the semantic component words and the semantic rule words are related to parts of speech of the words in the semantic template and grammatical relations between the words, so the similarity calculation process may specifically be: the method comprises the steps of firstly identifying words, parts of speech and grammatical relations of the words in a user voice message text, then identifying semantic component words and semantic rule words according to the parts of speech and the grammatical relations of the words, and then introducing the identified semantic component words and semantic rule words into a vector space model to calculate a plurality of similarities between text contents of the user voice message and a plurality of preset semantic templates. In an embodiment of the present invention, words, parts of speech of the words, and grammatical relations between the words in the text content of the user voice message may be identified by one or more of the following word segmentation methods: hidden markov model method, forward maximum matching method, reverse maximum matching method and named entity recognition method.
In an embodiment of the present invention, as described above, the semantic template may be a set of multiple semantic expressions representing a certain semantic content, and at this time, a sentence with multiple different expression modes of the corresponding semantic content may be described by using one semantic template, so as to correspond to multiple extension questions of the same standard question. Therefore, when calculating the semantic similarity between the text content of the user voice message and the preset semantic template, the similarity between the text content of the user voice message and at least one extension question respectively expanded by the plurality of preset semantic templates needs to be calculated, and then the semantic template corresponding to the extension question with the highest similarity is used as the matched semantic template. These expanded questions may be obtained from semantic component words and/or semantic rule words and/or semantic symbols included in the semantic template.
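A sketch of this matching step: the text of the user voice message is compared with every extension question expanded from every semantic template, and the template owning the best-scoring extension question is returned. Plain edit distance stands in here for the similarity measures listed above (edit distance, n-gram, Jaro-Winkler, Soundex).

```python
# Template matching by best similarity over expanded questions; edit distance
# is used as a stand-in similarity measure.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)


def match_template(text, templates):
    """templates: dict template_id -> list of expanded questions."""
    best = max(((tid, similarity(text, q)) for tid, qs in templates.items() for q in qs),
               key=lambda item: item[1])
    return best[0]   # template id; the basic intention information is then looked up from it
```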
Of course, the method for obtaining the basic intention information is not limited to this; the voice text information input by the user may also be matched or classified directly to basic intention information by a network that extracts character, word, and sentence vector features (for example, with an attention mechanism added).
Therefore, the voice recognition interaction method provided by the embodiment of the present invention realizes an intelligent interaction mode that provides different response services according to different emotional states of the user, which greatly improves the intelligent interaction experience. For example, when the method is applied to a physical customer service robot in the banking field, a user says to the robot: "What should I do if my credit card is lost?". The robot receives the user voice message through a microphone, obtains the audio emotion recognition result "worried" by analyzing the audio data of the user voice message, and takes it as the final emotion recognition result; it converts the user voice message into text and obtains the basic intention information of the customer as "report the credit card as lost" (this step may also involve past or subsequent user voice messages and a semantic knowledge base of the banking field); it then links the emotion recognition result "worried" with the basic intention information "report the credit card as lost" to obtain the emotional intention information "report the credit card as lost; the user is worried, the card may have been lost or stolen" (this step may likewise involve past or subsequent user voice messages and the banking semantic knowledge base); and it determines the corresponding interactive instruction: output the credit card loss-report procedure on the screen while presenting the emotion classification "comfort" with a high emotion intensity level through voice broadcast, for example a broadcast with a soothing pitch and medium speech rate that matches the emotion instruction: "The steps for reporting the loss of a credit card are shown on the screen. Please do not worry: if the card has been lost or stolen, it will be frozen immediately after the loss report, so your property and credit will not be affected……".
In an embodiment of the present invention, some application scenarios (e.g. bank customer service) may also avoid the voice broadcast operation in consideration of the privacy of the interactive content, and instead implement the interactive instruction in a plain text or animation manner. The modality selection of such interactive instructions may be adjusted according to the application scenario.
It should be understood that the presentation manner of the emotion classification and the emotion intensity level in the interactive instruction can be implemented by adjusting the speed and tone of voice broadcast, and the present invention is not limited thereto.
For another example, when the speech recognition interaction method provided by the embodiment of the present invention is applied to a virtual intelligent personal assistant on an intelligent terminal device, the user says to the device: "What is the fastest route from home to the airport?". The virtual assistant receives the user voice message through the microphone of the device and obtains the audio emotion recognition result "excited" by analyzing the audio data of the user voice message; meanwhile, it converts the user voice message into text, analyzes the text content and obtains the text emotion recognition result "anxious", and after logical judgment the two emotion classifications "excited" and "anxious" are both kept as the emotion recognition result. The basic intention information of the customer, obtained by combining past or subsequent user voice messages with a semantic knowledge base of this field, is "obtain the fastest route navigation from the user's home to the airport". Linking "anxious" with this basic intention information yields the emotional intention information "obtain the fastest route navigation from home to the airport; the user is anxious and may be worried about missing the flight"; linking "excited" with the basic intention information yields "obtain the fastest route navigation from home to the airport; the user is excited and may be about to go on a trip". Two kinds of emotional intention information are thus generated. By combining the past user voice message in which the user mentioned "My flight is at 11 o'clock, what time should I leave?", the emotion recognition result of the user is judged to be "anxious", and the emotional intention information is "obtain the fastest route navigation from home to the airport; the user is anxious and may be worried about missing the flight". The corresponding interactive instruction is then determined: output the navigation information on the screen while presenting the emotion classifications "comfort" and "warning", each with a high emotion intensity level, through voice broadcast, for example a broadcast with a steady pitch and medium speech rate that matches the emotion instruction: "The fastest route from your home address to the airport has been planned; please follow the navigation shown on the screen. Under normal driving conditions you are expected to reach the airport within 1 hour, so please do not worry. You are also reminded to plan your time, pay attention to driving safety, and not to speed."
For another example, when the voice recognition interaction method provided by the embodiment of the present invention is applied to an intelligent wearable device, the user says to the device during exercise: "How is my heartbeat now?". The device receives the user voice message through a microphone, obtains the audio emotion recognition result as the PAD three-dimensional emotion model vector (p1, a1, d1) by analyzing the audio data of the user voice message, obtains the text emotion recognition result as the PAD vector (p2, a2, d2) by analyzing the text content of the user voice message, and combines the two to obtain the final emotion recognition result (p3, a3, d3), which represents a combination of "worried" and "nervous". Meanwhile, by combining a semantic knowledge base of the medical and health field, the device obtains the basic intention information of the customer as "obtain the user's heartbeat data". Next, the emotion recognition result (p3, a3, d3) is linked with the basic intention "obtain the user's heartbeat data" to obtain the emotional intention information "obtain the user's heartbeat data; the user appears fearful and may currently be experiencing discomfort such as a racing heartbeat". The interactive instruction is determined according to the correspondence between emotional intention information and interactive instructions: present the emotion (p6, a6, d6), i.e. a combination of "comfort" and "encouragement", each with a high emotion intensity level, start a 10-minute real-time heartbeat monitoring program, and broadcast with a soft pitch and slow speech rate: "Your current heartbeat is 150 beats per minute; please do not worry, this is still within the normal range. If you feel discomfort such as a racing heartbeat, relax and take deep breaths to adjust. Your past health data show that your heart works well, and keeping up regular exercise can strengthen cardiopulmonary function." The device then continues to monitor the emotional state of the user. If after 5 minutes the user says "I feel a bit uncomfortable", and the emotion recognition result obtained through the emotion recognition process is the three-dimensional emotion model vector (p7, a7, d7), representing "pain", the interactive instruction is updated again: output the heartbeat data on the screen while presenting the emotion (p8, a8, d8), i.e. "warning", with a high emotion intensity level, output an alarm sound, and broadcast with a steady pitch and slow speech rate: "Your current heartbeat is 170 beats per minute, which exceeds the normal range; please stop exercising and adjust your breathing. If you need help, press the screen."
Fig. 10 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention. As shown in fig. 10, the speech recognition interaction device 10 includes: emotion recognition module 11, basic intention recognition module 12, and interaction instruction determination module 13.
The emotion recognition module 11 is configured to obtain emotion recognition results according to the user voice message, where the emotion recognition results at least include audio emotion recognition results, or the emotion recognition results at least include audio emotion recognition results and text emotion recognition results. The basic intention recognition module 12 is configured to perform intention analysis according to the text content of the user voice message to obtain corresponding basic intention information. And the interaction instruction determining module 13 is configured to determine a corresponding interaction instruction according to the emotion recognition result and the basic intention information.
The voice recognition interaction device 10 provided by the embodiment of the invention combines the emotion recognition result obtained based on the voice message of the user on the basis of understanding the basic intention information of the user, and further gives the interaction instruction with emotion according to the basic intention information and the emotion recognition result, thereby solving the problems that the intelligent interaction mode in the prior art cannot analyze the deep intention of the voice message of the user and cannot provide more humanized interaction experience.
In an embodiment of the present invention, as shown in fig. 11, the interaction instruction determining module 13 includes: an emotional intention recognition unit 131 and an interaction instruction determination unit 132. The emotional intention recognition unit 131 is configured to determine corresponding emotional intention information from the emotion recognition result and the basic intention information. The interaction instruction determination unit 132 is configured to determine a corresponding interaction instruction from the emotional intention information, or from the emotional intention information together with the basic intention information.
In an embodiment of the invention, the interaction instruction includes presenting feedback content of the emotional intention information. For example, in some customer service interaction scenarios, the emotional intention information analyzed from the customer's voice content needs to be presented to customer service staff as a reminder; in this case, the corresponding emotional intention information must be determined and its feedback content presented.
In one embodiment of the invention, the interactive instructions comprise one or more of the following emotion presentation modalities: a text output emotion presentation mode, a music play emotion presentation mode, a voice emotion presentation mode, an image emotion presentation mode, and a mechanical action emotion presentation mode.
In one embodiment of the invention, the emotional intention information includes emotional demand information corresponding to the emotion recognition result; or, the emotional intention information includes emotional demand information corresponding to the emotion recognition result together with the association relationship between the emotion recognition result and the basic intention information.
In an embodiment of the present invention, the relationship between the emotion recognition result and the basic intention information is preset.
In an embodiment of the present invention, the user information at least includes a user voice message, and the emotion recognition module 11 is further configured to: acquire the emotion recognition result according to the user voice message.
In an embodiment of the present invention, as shown in fig. 11, the emotion recognition module 11 may include: an audio emotion recognition unit 111 configured to acquire an audio emotion recognition result from audio data of the user voice message; and an emotion recognition result determination unit 112 configured to determine an emotion recognition result from the audio emotion recognition result.
Or, the emotion recognition module 11 includes: an audio emotion recognition unit 111 configured to acquire an audio emotion recognition result from audio data of the user voice message; a text emotion recognition unit 113 configured to acquire a text emotion recognition result according to text content of the user voice message; and an emotion recognition result determination unit 112 configured to determine an emotion recognition result from the audio emotion recognition result and the text emotion recognition result.
In one embodiment of the invention, the audio emotion recognition result includes one or more of a plurality of emotion classifications, or the audio emotion recognition result corresponds to a coordinate point in a multi-dimensional emotion space. Alternatively, the audio emotion recognition result and the text emotion recognition result each include one or more of a plurality of emotion classifications, or each correspond to a coordinate point in the multi-dimensional emotion space. Each dimension in the multi-dimensional emotion space corresponds to a psychologically defined emotional factor. Each emotion classification may further include a plurality of emotion intensity levels, or may include no emotion intensity level, which is not limited in the present invention.
In an embodiment of the invention, the audio emotion recognition result and the text emotion recognition result respectively include one or more of a plurality of emotion classifications. Wherein the emotion recognition result determination unit 112 is further configured to: if the audio emotion recognition result and the text emotion recognition result include the same emotion classification, the same emotion classification is taken as the emotion recognition result.
In an embodiment of the present invention, the emotion recognition result determination unit 112 is further configured to: if the audio emotion recognition result and the text emotion recognition result do not include the same emotion classification, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
In an embodiment of the invention, the audio emotion recognition result and the text emotion recognition result each include one or more of a plurality of emotion classifications. The emotion recognition result determination unit 112 includes: a first confidence calculation subunit 1121 and an emotion recognition subunit 1122.
The first confidence calculation subunit 1121 is configured to calculate the confidence of each emotion classification in the audio emotion recognition result and the confidence of each emotion classification in the text emotion recognition result. The emotion recognition subunit 1122 is configured to, when the emotion classification with the highest confidence in the audio emotion recognition result is the same as the emotion classification with the highest confidence in the text emotion recognition result, take that emotion classification as the emotion recognition result.
In an embodiment of the present invention, the emotion recognition subunit 1122 is further configured to, when the emotion classification with the highest confidence level in the audio emotion recognition result is different from the emotion classification with the highest confidence level in the text emotion recognition result, determine the emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between a confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result and a confidence level of the emotion classification with the highest confidence level in the text emotion recognition result includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the audio emotion recognition result is greater than that of the emotion classification with the highest confidence coefficient in the text emotion recognition result, taking the emotion classification with the highest confidence coefficient in the audio emotion recognition result as an emotion recognition result; and when the confidence coefficient of the emotion classification with the highest confidence coefficient in the audio emotion recognition result is equal to the confidence coefficient of the emotion classification with the highest confidence coefficient in the text emotion recognition result, taking the emotion classification with the highest confidence coefficient in the audio emotion recognition result as the emotion recognition result, or taking the emotion classification with the highest confidence coefficient in the audio emotion recognition result and the emotion classification with the highest confidence coefficient in the text emotion recognition result as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result further includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the audio emotion recognition result is smaller than that of the emotion classification with the highest confidence coefficient in the text emotion recognition result, judging whether the emotion classification with the highest confidence coefficient in the text emotion recognition result is included in the audio emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the text emotion recognition result in the audio emotion recognition result is larger than a first intensity threshold value or not, and if the emotion intensity level of the emotion classification with the highest confidence level in the text emotion recognition result is larger than the first intensity threshold value, taking the emotion classification with the highest confidence level in the text emotion recognition result as an emotion recognition result; and if the judgment result or the further judgment result is negative, the emotion classification with the highest confidence level in the audio emotion recognition result is used as the emotion recognition result, or the emotion classification with the highest confidence level in the audio emotion recognition result and the emotion classification with the highest confidence level in the text emotion recognition result are jointly used as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between a confidence level of the emotion classification with the highest confidence level in the text emotion recognition result and a confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the text emotion recognition result is greater than that of the emotion classification with the highest confidence coefficient in the audio emotion recognition result, taking the emotion classification with the highest confidence coefficient in the text emotion recognition result as an emotion recognition result; and when the confidence coefficient of the emotion classification with the highest confidence coefficient in the text emotion recognition result is equal to the confidence coefficient of the emotion classification with the highest confidence coefficient in the audio emotion recognition result, taking the emotion classification with the highest confidence coefficient in the text emotion recognition result as the emotion recognition result, or taking the emotion classification with the highest confidence coefficient in the text emotion recognition result and the emotion classification with the highest confidence coefficient in the audio emotion recognition result as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result further includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the text emotion recognition result is smaller than that of the emotion classification with the highest confidence coefficient in the audio emotion recognition result, judging whether the emotion classification with the highest confidence coefficient in the audio emotion recognition result is included in the text emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the audio emotion recognition result in the text emotion recognition result is larger than a second intensity threshold value or not, and if the emotion intensity level is larger than the second intensity threshold value, taking the emotion classification with the highest confidence level in the audio emotion recognition result as the emotion recognition result; and if the judgment result or the further judgment result is negative, the emotion classification with the highest confidence level in the text emotion recognition result is used as the emotion recognition result, or the emotion classification with the highest confidence level in the text emotion recognition result and the emotion classification with the highest confidence level in the audio emotion recognition result are used as the emotion recognition result together.
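The confidence-based fusion described above can be summarized in a short sketch. Below is a minimal Python illustration of the audio-priority branch, under the assumption that each recognition result maps an emotion classification to a (confidence, intensity level) pair; the threshold value is a placeholder, not a value prescribed by the embodiment. The text-priority branch is obtained by swapping the roles of the two results and using the second intensity threshold.

```python
def fuse_emotion_results(audio, text, first_intensity_threshold=2):
    """audio / text: {classification: (confidence, intensity_level)}.
    Returns a list with one or two emotion classifications."""
    a_label, (a_conf, _) = max(audio.items(), key=lambda kv: kv[1][0])
    t_label, (t_conf, _) = max(text.items(), key=lambda kv: kv[1][0])

    if a_label == t_label:                 # same top classification on both sides
        return [a_label]
    if a_conf > t_conf:                    # audio side is more confident
        return [a_label]
    if a_conf == t_conf:                   # tie: keep the audio label, or both
        return [a_label, t_label]
    # audio side is less confident: accept the text label only if the audio
    # result also contains it with sufficient emotional intensity
    if t_label in audio and audio[t_label][1] > first_intensity_threshold:
        return [t_label]
    return [a_label, t_label]              # otherwise fall back to the audio label, or both
```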
In an embodiment of the invention, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in the multi-dimensional emotion space. The emotion recognition result determination unit 112 is further configured to: perform weighted averaging of the coordinate values of the two coordinate points in the multi-dimensional emotion space, and take the coordinate point obtained after the weighted averaging as the emotion recognition result.
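When both results are coordinate points in a multi-dimensional emotion space (for example a PAD vector), the fusion reduces to a weighted average, as in this minimal sketch; the weight value is an assumption, not specified by the embodiment.

```python
def fuse_coordinates(audio_point, text_point, audio_weight=0.6):
    """Weighted average of two emotion-space coordinate points, e.g. PAD vectors."""
    w = audio_weight
    return tuple(w * a + (1 - w) * t for a, t in zip(audio_point, text_point))

# Example: fuse_coordinates((0.2, 0.8, -0.1), (0.4, 0.6, 0.0)) -> (0.28, 0.72, -0.06)
```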
Fig. 12 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention. As shown in fig. 12, the audio emotion recognition unit 111 in the speech recognition interaction device 10 includes: an audio feature extraction subunit 1111, a matching subunit 1112, and an audio emotion judgment subunit 1113.
The audio feature extraction subunit 1111 is configured to extract an audio feature vector of the user voice message, where the user voice message corresponds to a segment of speech in the audio stream to be recognized. The matching subunit 1112 is configured to match the audio feature vector of the user voice message with a plurality of emotion feature models, where the plurality of emotion feature models respectively correspond to one of the plurality of emotion classifications. The audio emotion judgment subunit 1113 is configured to take the emotion classification corresponding to the emotion feature model that the audio feature vector matches as the emotion classification of the user voice message.
In an embodiment of the present invention, as shown in fig. 12, the audio emotion recognition unit 111 further includes: an emotion model establishing subunit 1114 configured to establish the plurality of emotion feature models in advance by learning from the audio feature vectors of a plurality of preset speech segments that carry emotion classification labels corresponding to the plurality of emotion classifications.
In an embodiment of the present invention, the emotion model establishing subunit 1114 includes: a clustering subunit and a training subunit. The clustering subunit is configured to cluster the audio feature vector sets of a plurality of preset voice segments carrying emotion classification labels corresponding to the plurality of emotion classifications, so as to obtain a clustering result for each preset emotion classification. The training subunit is configured to train, according to the clustering result, the set of audio feature vectors of the preset voice segments in each cluster into an emotion feature model.
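As a rough illustration of the clustering-and-training idea, the sketch below groups labelled preset segments and fits one Gaussian mixture model per group. The patent does not prescribe a model family, so the use of scikit-learn's GaussianMixture and the grouping-by-label shortcut are assumptions of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_emotion_feature_models(labeled_segments, n_components=4):
    """labeled_segments: iterable of (audio_feature_vector, emotion_label)."""
    grouped = {}
    for vector, label in labeled_segments:
        grouped.setdefault(label, []).append(vector)   # one "cluster" per emotion label
    models = {}
    for label, vectors in grouped.items():
        model = GaussianMixture(n_components=n_components, covariance_type="diag")
        model.fit(np.asarray(vectors))
        models[label] = model                          # emotion feature model for this label
    return models


def match_emotion(models, feature_vector):
    """Matching subunit: pick the emotion classification whose model scores highest."""
    scores = {label: m.score(np.asarray(feature_vector).reshape(1, -1))
              for label, m in models.items()}
    return max(scores, key=scores.get)
```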
In an embodiment of the present invention, as shown in fig. 13, the audio emotion recognition unit 111 further includes: a sentence end point detection subunit 1115, and an extraction subunit 1116. The sentence end point detection subunit 1115 is configured to determine a speech start frame and a speech end frame in the audio stream to be recognized. The extraction subunit 1116 is configured to extract the portion of the audio stream between the speech start frame and the speech end frame as a user speech message.
In an embodiment of the present invention, the sentence end point detecting subunit 1115 includes: a first judgment subunit, a speech start frame judgment subunit and a speech end frame judgment subunit.
The first judgment subunit is configured to judge whether a voice frame in the audio stream to be recognized is a pronunciation frame or a non-pronunciation frame. The speech start frame judgment subunit is configured to, after the voice end frame of the previous voice segment, or when no voice segment has yet been recognized, take the first voice frame of a first preset number of consecutive voice frames as the voice start frame of the current voice segment when all of those frames are judged to be pronunciation frames. The speech end frame judgment subunit is configured to, after the voice start frame of the current voice segment, take the first voice frame of a second preset number of consecutive voice frames as the voice end frame of the current voice segment when all of those frames are judged to be non-pronunciation frames.
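The sentence end point detection logic of subunit 1115 amounts to counting consecutive pronunciation and non-pronunciation frames. The following sketch assumes a caller-supplied is_pronunciation(frame) predicate (the first judgment subunit) and placeholder values for the two preset frame counts.

```python
def detect_sentence_endpoints(frames, is_pronunciation, start_count=5, end_count=15):
    """Return (start_index, end_index) of the current voice segment, or None if absent."""
    start = None
    run = 0
    for i, frame in enumerate(frames):
        if start is None:
            # looking for start_count consecutive pronunciation frames
            run = run + 1 if is_pronunciation(frame) else 0
            if run == start_count:
                start = i - start_count + 1      # first frame of the run = voice start frame
                run = 0
        else:
            # looking for end_count consecutive non-pronunciation frames
            run = run + 1 if not is_pronunciation(frame) else 0
            if run == end_count:
                end = i - end_count + 1          # first frame of the silent run = voice end frame
                return start, end
    return (start, None) if start is not None else None
```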
In an embodiment of the present invention, the audio feature vector includes one or more of the following audio features: an energy feature, a pronunciation frame number feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a mel-frequency cepstral coefficient feature.
In one embodiment of the invention, the energy characteristics include: short-time energy first-order difference and/or energy below a preset frequency; and/or the pitch frequency characteristics include: pitch frequency and/or pitch frequency first order difference; and/or, the formant features include one or more of: a first formant, a second formant, a third formant, a first formant first-order difference, a second formant first-order difference, and a third formant first-order difference; and/or the mel-frequency cepstral coefficient characteristics comprise 1-12 order mel-frequency cepstral coefficients and/or 1-12 order mel-frequency cepstral coefficient first order differences.
In one embodiment of the invention, the audio features are characterized by one or more of the following computational representations: ratio, mean, maximum, median, and standard deviation.
In one embodiment of the invention, the energy characteristics include: the average value, the maximum value, the median value and the standard deviation of the short-time energy first-order difference, and/or the ratio of the energy below a preset frequency to the total energy; and/or the pronunciation frame number characteristics comprise: the ratio of the number of pronunciation frames to the number of non-pronunciation frames, and/or the ratio of the number of pronunciation frames to the total number of frames; the pitch frequency characteristics include: the mean, the maximum, the median and the standard deviation of the pitch frequency, and/or the mean, the maximum, the median and the standard deviation of the first order difference of the pitch frequency; and/or, the formant features include one or more of: the mean, maximum, median and standard deviation of the first formants, the mean, maximum, median and standard deviation of the second formants, the mean, maximum, median and standard deviation of the third formants, the mean, maximum, median and standard deviation of the first formant first-order differences, the mean, maximum, median and standard deviation of the second formant first-order differences, and the mean, maximum, median and standard deviation of the third formant first-order differences; and/or the mel-frequency cepstrum coefficient characteristics comprise the mean value, the maximum value, the median value and the standard deviation of the mel-frequency cepstrum coefficients of 1-12 orders and/or the mean value, the maximum value, the median value and the standard deviation of the first order difference of the mel-frequency cepstrum coefficients of 1-12 orders.
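To show what such a statistics-based audio feature vector can look like in practice, here is a small sketch covering only a subset of the listed features (short-time energy first-order difference, pitch frequency, and the pronunciation-frame ratios); formant, harmonic-to-noise ratio and mel-frequency cepstral coefficient statistics would be added in the same way. The feature names are illustrative assumptions.

```python
import numpy as np


def stats(track):
    """Mean, maximum, median and standard deviation of a frame-level feature track."""
    v = np.asarray(track, dtype=float)
    return {"mean": v.mean(), "max": v.max(), "median": float(np.median(v)), "std": v.std()}


def build_audio_feature_vector(short_time_energy, pitch, pronounced_mask):
    features = {}
    for name, value in stats(np.diff(short_time_energy)).items():
        features[f"energy_diff_{name}"] = value              # short-time energy first-order difference
    for name, value in stats(pitch).items():
        features[f"pitch_{name}"] = value                    # pitch frequency statistics
    pronounced = int(np.sum(pronounced_mask))
    total = len(pronounced_mask)
    features["pronounced_to_unpronounced_ratio"] = pronounced / max(total - pronounced, 1)
    features["pronounced_to_total_ratio"] = pronounced / total
    return features
```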
Fig. 14 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention. As shown in fig. 14, the text emotion recognition unit 113 in the speech recognition interaction device 10 includes: a first text emotion recognition subunit 1131, a second text emotion recognition subunit 1132, and a text emotion determination subunit 1133.
The first text emotion recognition subunit 1131 is configured to recognize emotion vocabulary in the text content of the user voice message and determine a first text emotion recognition result according to the recognized emotion vocabulary. The second text emotion recognition subunit 1132 is configured to input the text content of the user voice message into a text emotion recognition deep learning model, where the model is built by training on text content that includes emotion classification labels and emotion intensity level labels, and to take the output of the model as a second text emotion recognition result. The text emotion determination subunit 1133 is configured to determine the text emotion recognition result from the first text emotion recognition result and the second text emotion recognition result.
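A compact sketch of the two text paths and their combination (subunits 1131 to 1133) is given below. The emotion lexicon and the deep_model object with a predict(text) method returning classification confidences are assumptions of the sketch, not components defined by the patent.

```python
def lexicon_text_emotion(text, emotion_lexicon):
    """First path (subunit 1131): score emotion classifications from vocabulary hits."""
    hits = {}
    for word, label in emotion_lexicon.items():
        if word in text:
            hits[label] = hits.get(label, 0) + 1
    total = sum(hits.values())
    return {label: count / total for label, count in hits.items()} if total else {}


def recognize_text_emotion(text, emotion_lexicon, deep_model):
    first = lexicon_text_emotion(text, emotion_lexicon)   # first text emotion recognition result
    second = deep_model.predict(text)                     # second result, e.g. {"worry": 0.8, ...}
    if first and second and max(first, key=first.get) == max(second, key=second.get):
        return [max(first, key=first.get)]                # same top classification: use it directly
    return [first, second]                                # otherwise keep both for the rules below
```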
In one embodiment of the invention, the first text emotion recognition result includes one or more of a plurality of emotion classifications, or corresponds to a coordinate point in the multi-dimensional emotion space. Alternatively, the first text emotion recognition result and the second text emotion recognition result each include one or more of a plurality of emotion classifications, or each correspond to a coordinate point in the multi-dimensional emotion space. Each dimension in the multi-dimensional emotion space corresponds to a psychologically defined emotional factor. Each emotion classification may further include a plurality of emotion intensity levels, or may include no emotion intensity level, which is not limited in the present invention.
In an embodiment of the invention, the first textual emotion recognition result and the second textual emotion recognition result each include one or more of a plurality of emotion classifications. Wherein the text emotion determination subunit 1133 is further configured to: if the first text emotion recognition result and the second text emotion recognition result include the same emotion classification, the same emotion classification is taken as a text emotion recognition result.
In an embodiment of the present invention, the text emotion determination subunit 1133 is further configured to: if the first text emotion recognition result and the second text emotion recognition result do not include the same emotion classification, take the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
In an embodiment of the invention, the first text emotion recognition result and the second text emotion recognition result each include one or more of a plurality of emotion classifications. The text emotion determination subunit 1133 includes: a second confidence calculation subunit 11331 and a text emotion judgment subunit 11332.
The second confidence calculation subunit 11331 is configured to calculate the confidence of each emotion classification in the first text emotion recognition result and the confidence of each emotion classification in the second text emotion recognition result. The text emotion judgment subunit 11332 is configured to, when the emotion classification with the highest confidence in the first text emotion recognition result is the same as the emotion classification with the highest confidence in the second text emotion recognition result, take that emotion classification as the text emotion recognition result.
In an embodiment of the present invention, the text emotion judgment subunit 11332 is further configured to: when the emotion classification with the highest confidence in the first text emotion recognition result is different from the emotion classification with the highest confidence in the second text emotion recognition result, determine the text emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between a confidence level of the emotion classification with the highest confidence level in the first text emotion recognition result and a confidence level of the emotion classification with the highest confidence level in the second text emotion recognition result includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the first text emotion recognition result is larger than that of the emotion classification with the highest confidence coefficient in the second text emotion recognition result, the emotion classification with the highest confidence coefficient in the first text emotion recognition result is used as a text emotion recognition result; and when the confidence coefficient of the emotion classification with the highest confidence coefficient in the first text emotion recognition result is equal to the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result, taking the emotion classification with the highest confidence coefficient in the first text emotion recognition result as a text emotion recognition result, or taking the emotion classification with the highest confidence coefficient in the first text emotion recognition result and the emotion classification with the highest confidence coefficient in the second text emotion recognition result as the text emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the first text emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the second text emotion recognition result further includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the first text emotion recognition result is smaller than the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result, judging whether the emotion classification with the highest confidence coefficient in the second text emotion recognition result is included in the first text emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the second text emotion recognition result in the first text emotion recognition result is larger than a first intensity threshold value or not, and if the emotion intensity level of the emotion classification with the highest confidence level in the second text emotion recognition result is larger than the first intensity threshold value, taking the emotion classification with the highest confidence level in the second text emotion recognition result as a text emotion recognition result; and if the judgment result or the further judgment result is negative, the emotion classification with the highest confidence level in the first text emotion recognition result is used as the text emotion recognition result, or the emotion classification with the highest confidence level in the first text emotion recognition result and the emotion classification with the highest confidence level in the second text emotion recognition result are used as the text emotion recognition result together.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between a confidence level of the emotion classification with the highest confidence level in the first text emotion recognition result and a confidence level of the emotion classification with the highest confidence level in the second text emotion recognition result includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result is larger than that of the emotion classification with the highest confidence coefficient in the first text emotion recognition result, the emotion classification with the highest confidence coefficient in the second text emotion recognition result is used as a text emotion recognition result; and when the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result is equal to the confidence coefficient of the emotion classification with the highest confidence coefficient in the first text emotion recognition result, taking the emotion classification with the highest confidence coefficient in the second text emotion recognition result as a text emotion recognition result, or taking the emotion classification with the highest confidence coefficient in the second text emotion recognition result and the emotion classification with the highest confidence coefficient in the first text emotion recognition result as the text emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the first text emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the second text emotion recognition result further includes: when the confidence coefficient of the emotion classification with the highest confidence coefficient in the second text emotion recognition result is smaller than that of the emotion classification with the highest confidence coefficient in the first text emotion recognition result, judging whether the emotion classification with the highest confidence coefficient in the first text emotion recognition result is included in the second text emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the first text emotion recognition result in the second text emotion recognition result is larger than a second intensity threshold, and if the emotion intensity level of the emotion classification with the highest confidence level in the first text emotion recognition result is larger than the second intensity threshold, taking the emotion classification with the highest confidence level in the first text emotion recognition result as the text emotion recognition result; and if the judgment result or the further judgment result is negative, the emotion classification with the highest confidence level in the second text emotion recognition result is used as the text emotion recognition result, or the emotion classification with the highest confidence level in the second text emotion recognition result and the emotion classification with the highest confidence level in the first text emotion recognition result are used as the text emotion recognition result together.
In an embodiment of the present invention, the first text emotion recognition result and the second text emotion recognition result each correspond to a coordinate point in the multi-dimensional emotion space. The text emotion determination subunit 1133 is further configured to: perform weighted averaging of the coordinate values of the two coordinate points in the multi-dimensional emotion space, and take the coordinate point obtained after the weighted averaging as the text emotion recognition result.
Fig. 15 is a schematic structural diagram of a speech recognition interaction device according to an embodiment of the present invention. As shown in fig. 15, the basic intention recognition module 12 in the speech recognition interaction device 10 includes: a semantic template matching unit 121 and a basic intention acquisition unit 122.
The semantic template matching unit 121 is configured to match the text content of the user voice message with a plurality of preset semantic templates in a semantic knowledge base to determine a matched semantic template. The basic intention acquisition unit 122 is configured to acquire basic intention information corresponding to the matched semantic template. The corresponding relation between the semantic template and the basic intention information is pre-established in a semantic knowledge base, and the same intention information corresponds to one or more semantic templates.
In an embodiment of the present invention, the semantic template matching unit 121 includes: a similarity calculation subunit 1211 and a semantic template determination subunit 1212.
The similarity calculation subunit 1211 is configured to calculate the similarity between the text content of the user voice message and each of the plurality of preset semantic templates. The semantic template determination subunit 1212 is configured to take the semantic template with the highest similarity as the matched semantic template.
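The template-matching path of the basic intention recognition module 12 can be pictured with the following sketch; the similarity function (for example cosine similarity over sentence embeddings) and the dictionary names are assumptions for illustration only.

```python
def match_semantic_template(text, semantic_templates, similarity):
    """semantic_templates: {template_id: template_text};
    similarity(a, b) -> float, higher means more similar."""
    scores = {tid: similarity(text, template) for tid, template in semantic_templates.items()}
    best = max(scores, key=scores.get)          # template with the highest similarity
    return best, scores[best]


def recognize_basic_intention(text, semantic_templates, template_to_intention, similarity):
    template_id, _ = match_semantic_template(text, semantic_templates, similarity)
    return template_to_intention[template_id]   # pre-established correspondence in the knowledge base
```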
In an embodiment of the invention, the correspondence between the emotion recognition result and the basic intention information on the one hand and the emotional intention information on the other hand is pre-established; or the correspondence between the emotional intention information and the interaction instruction is pre-established; or the correspondence between the emotional intention information and the basic intention information on the one hand and the interaction instruction on the other hand is pre-established.
In an embodiment of the present invention, in order to further improve the accuracy of obtaining the basic intention information, the basic intention recognition module 12 is further configured to: perform intention analysis according to the current user voice message together with past user voice messages and/or subsequent user voice messages to obtain the corresponding basic intention information.
In an embodiment of the present invention, in order to further improve the accuracy of obtaining the emotional intention information, the voice recognition interaction device 10 further includes: a first recording module configured to record the emotion recognition result and the basic intention information of the user voice message. The emotional intention recognition unit 131 is further configured to: determine the corresponding emotional intention information according to the emotion recognition result and the basic intention information of the current user voice message, in combination with the emotion recognition result and the basic intention information of past user voice messages and/or subsequent user voice messages.
In an embodiment of the present invention, in order to further improve the accuracy of obtaining the interaction instruction, the speech recognition interaction device 10 further includes: a second recording module configured to record the emotional intention information and the basic intention information of the user voice message. The interaction instruction determination unit 132 is further configured to: determine the corresponding interaction instruction according to the emotional intention information and the basic intention information of the current user voice message, in combination with the emotional intention information and the basic intention information of past user voice messages and/or subsequent user voice messages.
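A minimal sketch of such a recording module and its use of dialogue history is shown below; the bounded history length and the rules callback are illustrative assumptions, since the embodiment does not specify how many past messages are considered or how they weigh on the decision.

```python
from collections import deque


class RecordingModule:
    """Keeps the emotional intention and basic intention of recent user voice messages."""

    def __init__(self, max_turns=5):
        self.history = deque(maxlen=max_turns)

    def record(self, emotional_intention, basic_intention):
        self.history.append((emotional_intention, basic_intention))


def determine_interaction_instruction(current, recorder, rules):
    """rules(current, history) -> interaction instruction, combining the current
    message with past records as described above."""
    emotional_intention, basic_intention = current
    instruction = rules(current, list(recorder.history))
    recorder.record(emotional_intention, basic_intention)   # keep for subsequent turns
    return instruction
```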
It should be understood that each of the modules or units described in the voice recognition interaction device 10 provided in the above embodiments corresponds to one of the method steps described above. Thus, the operations, features and effects described in the foregoing method steps are also applicable to the speech recognition interaction device 10 and the corresponding modules and units included therein, and repeated contents are not described herein again.
An embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice recognition interaction method according to any of the foregoing embodiments when executing the computer program.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the voice recognition interaction method according to any of the foregoing embodiments. The computer storage medium may be any tangible medium, such as a floppy disk, a CD-ROM, a DVD, a hard drive, or even a network medium.
It should be understood that although one implementation form of the embodiments of the present invention described above may be a computer program product, the method or apparatus of the embodiments of the present invention may be implemented in software, hardware, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. It will be appreciated by those of ordinary skill in the art that the methods and apparatus described above may be implemented using computer executable instructions and/or embodied in processor control code, such code provided, for example, on a carrier medium such as a disk, CD or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The methods and apparatus of the present invention may be implemented in hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, or in software for execution by various types of processors, or in a combination of hardware circuitry and software, such as firmware.
It should be understood that although several modules or units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to exemplary embodiments of the invention, the features and functions of two or more modules/units described above may be implemented in one module/unit, whereas the features and functions of one module/unit described above may be further divided into implementations by a plurality of modules/units. Furthermore, some of the modules/units described above may be omitted in some application scenarios.
It should be understood that the terms "first", "second", and "third" used in the description of the embodiments of the present invention are only used for clearly illustrating the technical solutions, and are not used for limiting the protection scope of the present invention.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent replacements and the like made within the spirit and principles of the present invention shall fall within the scope of the present invention.

Claims (31)

1. A method for speech recognition emotion interaction, comprising:
acquiring emotion recognition results according to user voice messages, wherein the emotion recognition results at least comprise audio emotion recognition results, or the emotion recognition results at least comprise audio emotion recognition results and text emotion recognition results;
performing intention analysis according to the text content of the user voice message to obtain corresponding basic intention information; and
determining a corresponding interaction instruction according to the emotion recognition result and the basic intention information;
the obtaining of the emotion recognition result according to the user voice message includes: acquiring an audio emotion recognition result according to the audio data of the user voice message; determining the emotion recognition result according to the audio emotion recognition result; or acquiring an audio emotion recognition result according to the audio data of the user voice message, and acquiring a text emotion recognition result according to the text content of the user voice message; determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result;
the audio emotion recognition result and the text emotion recognition result respectively comprise one or more of a plurality of emotion classifications;
wherein the determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises: calculating confidence degrees of emotion classifications in the audio emotion recognition results and confidence degrees of emotion classifications in the text emotion recognition results; when the emotion classification with the highest confidence level in the audio emotion recognition result is the same as the emotion classification with the highest confidence level in the text emotion recognition result, taking the emotion classification with the highest confidence level in the audio emotion recognition result or the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result;
when the emotion classification with the highest confidence in the audio emotion recognition result is not the same as the emotion classification with the highest confidence in the text emotion recognition result, the determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result further comprises: determining an emotion recognition result according to the magnitude relation between the confidence coefficient of the emotion classification with the highest confidence level in the audio emotion recognition result and the confidence coefficient of the emotion classification with the highest confidence level in the text emotion recognition result;
determining an emotion recognition result according to a magnitude relation between the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result includes: when the confidence degree of the emotion classification with the highest confidence degree in the audio emotion recognition result is greater than the confidence degree of the emotion classification with the highest confidence degree in the text emotion recognition result, taking the emotion classification with the highest confidence degree in the audio emotion recognition result as the emotion recognition result; and when the confidence degree of the emotion classification with the highest confidence degree in the audio emotion recognition result is equal to the confidence degree of the emotion classification with the highest confidence degree in the text emotion recognition result, taking the emotion classification with the highest confidence degree in the audio emotion recognition result as the emotion recognition result, or taking the emotion classification with the highest confidence degree in the audio emotion recognition result and the emotion classification with the highest confidence degree in the text emotion recognition result as the emotion recognition result;
determining an emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result, further comprising: when the confidence coefficient of the emotion classification with the highest confidence level in the audio emotion recognition result is smaller than the confidence coefficient of the emotion classification with the highest confidence level in the text emotion recognition result, judging whether the emotion classification with the highest confidence level in the text emotion recognition result is included in the audio emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the text emotion recognition result in the audio emotion recognition result is larger than a first intensity threshold value or not, and if the emotion intensity level of the emotion classification with the highest confidence level in the text emotion recognition result is larger than the first intensity threshold value, taking the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result; if the judgment result or the further judgment result is negative, the emotion classification with the highest confidence level in the audio emotion recognition result is used as the emotion recognition result, or the emotion classification with the highest confidence level in the audio emotion recognition result and the emotion classification with the highest confidence level in the text emotion recognition result are used as the emotion recognition result together;
or, the determining an emotion recognition result according to the magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result includes: when the confidence degree of the emotion classification with the highest confidence degree in the text emotion recognition result is greater than the confidence degree of the emotion classification with the highest confidence degree in the audio emotion recognition result, taking the emotion classification with the highest confidence degree in the text emotion recognition result as the emotion recognition result; and when the confidence degree of the emotion classification with the highest confidence degree in the text emotion recognition result is equal to the confidence degree of the emotion classification with the highest confidence degree in the audio emotion recognition result, taking the emotion classification with the highest confidence degree in the text emotion recognition result as the emotion recognition result, or taking the emotion classification with the highest confidence degree in the text emotion recognition result and the emotion classification with the highest confidence degree in the audio emotion recognition result as the emotion recognition result; determining an emotion recognition result according to a magnitude relationship between the confidence level of the emotion classification with the highest confidence level in the text emotion recognition result and the confidence level of the emotion classification with the highest confidence level in the audio emotion recognition result, further comprising: when the confidence coefficient of the emotion classification with the highest confidence level in the text emotion recognition result is smaller than the confidence coefficient of the emotion classification with the highest confidence level in the audio emotion recognition result, judging whether the emotion classification with the highest confidence level in the audio emotion recognition result is included in the text emotion recognition result; if the judgment result is yes, further judging whether the emotion intensity level of the emotion classification with the highest confidence level in the audio emotion recognition result in the text emotion recognition result is larger than a second intensity threshold value or not, and if the emotion intensity level of the emotion classification with the highest confidence level in the audio emotion recognition result is larger than the second intensity threshold value, taking the emotion classification with the highest confidence level in the audio emotion recognition result as the emotion recognition result; and if the judgment result or the further judgment result is negative, taking the emotion classification with the highest confidence level in the text emotion recognition result as the emotion recognition result, or taking the emotion classification with the highest confidence level in the text emotion recognition result and the emotion classification with the highest confidence level in the audio emotion recognition result as the emotion recognition result.
2. The speech recognition emotion interaction method of claim 1, wherein the audio emotion recognition result includes one or more of a plurality of emotion classifications; or the audio emotion recognition result corresponds to a coordinate point in the multi-dimensional emotion space;
or the audio emotion recognition result and the text emotion recognition result respectively comprise one or more of a plurality of emotion classifications; or the audio emotion recognition result and the text emotion recognition result respectively correspond to a coordinate point in the multi-dimensional emotion space;
wherein each dimension in the multi-dimensional emotional space corresponds to a psychologically defined affective factor.
3. The speech recognition emotion interaction method of claim 2, wherein the audio emotion recognition result and the text emotion recognition result respectively include one or more of the plurality of emotion classifications;
wherein the determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises:
and if the audio emotion recognition result and the text emotion recognition result comprise the same emotion classification, taking the same emotion classification as the emotion recognition result.
4. The speech recognition emotion interaction method of claim 3, wherein the determining the emotion recognition result from the audio emotion recognition result and the text emotion recognition result further comprises:
and if the audio emotion recognition result and the text emotion recognition result do not comprise the same emotion classification, taking the audio emotion recognition result and the text emotion recognition result as the emotion recognition result together.
5. The speech recognition emotion interaction method of claim 1, wherein the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space;
wherein said determining the emotion recognition result from the audio emotion recognition result and the text emotion recognition result comprises:
and carrying out weighted average processing on coordinate values of coordinate points of the audio emotion recognition result and the text emotion recognition result in the multi-dimensional emotion space, and taking the coordinate points obtained after the weighted average processing as the emotion recognition results.
6. The method of claim 1, wherein the obtaining a text emotion recognition result according to the text content of the user voice message comprises:
recognizing emotion vocabularies in text contents of the user voice messages, and determining a first text emotion recognition result according to the recognized emotion vocabularies;
inputting the text content of the user voice message into a text emotion recognition deep learning model, wherein the text emotion recognition deep learning model is established on the basis of training the text content comprising emotion classification labels and emotion intensity level labels, and the output result of the text emotion recognition deep learning model is used as a second text emotion recognition result; and
and determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
7. The speech recognition emotion interaction method of claim 6, wherein the first text emotion recognition result includes one or more of a plurality of emotion classifications; or the first text emotion recognition result corresponds to a coordinate point in a multi-dimensional emotion space;
or the first text emotion recognition result and the second text emotion recognition result respectively comprise one or more of a plurality of emotion classifications; or the first text emotion recognition result and the second text emotion recognition result respectively correspond to a coordinate point in a multi-dimensional emotion space;
wherein each dimension in the multi-dimensional emotional space corresponds to a psychologically defined affective factor.
8. The speech recognition emotion interaction method of claim 6, wherein the first text emotion recognition result and the second text emotion recognition result respectively include one or more of a plurality of emotion classifications;
wherein the determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
and if the first text emotion recognition result and the second text emotion recognition result comprise the same emotion classification, taking the same emotion classification as the text emotion recognition result.
9. The speech recognition emotion interaction method of claim 8, wherein the determining the textual emotion recognition result from the first textual emotion recognition result and the second textual emotion recognition result further comprises:
and if the first text emotion recognition result and the second text emotion recognition result do not comprise the same emotion classification, taking the first text emotion recognition result and the second text emotion recognition result as the text emotion recognition results together.
10. The speech recognition emotion interaction method of claim 8, wherein the first text emotion recognition result and the second text emotion recognition result respectively include one or more of the plurality of emotion classifications;
wherein the determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
calculating the confidence of each emotion classification in the first text emotion recognition result and the second text emotion recognition result;
when the emotion classification with the highest confidence in the first text emotion recognition result is the same as the emotion classification with the highest confidence in the second text emotion recognition result, taking that emotion classification as the text emotion recognition result.
11. The speech recognition emotion interaction method of claim 10, wherein, when the emotion classification with the highest confidence in the first text emotion recognition result is not the same as the emotion classification with the highest confidence in the second text emotion recognition result, the determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result further comprises:
determining an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result.
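Claims 10 and 11 only require that a confidence be calculated for each emotion classification and that the two top classifications be compared; deriving the confidences by normalizing raw per-class scores, as below, is a hypothetical choice:

```python
def confidences(scores):
    """Normalize raw per-class scores into confidences that sum to 1."""
    total = sum(scores.values()) or 1.0
    return {cls: s / total for cls, s in scores.items()}

first_conf = confidences({"anger": 3, "anxiety": 1})            # e.g. lexicon hit counts
second_conf = confidences({"anger": 0.8, "satisfaction": 0.2})  # e.g. model probabilities
top_first = max(first_conf, key=first_conf.get)
top_second = max(second_conf, key=second_conf.get)
print(top_first, top_second)   # compared as in claims 11 to 13
```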
12. The speech recognition emotion interaction method of claim 11, wherein the determining an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result comprises:
when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, taking the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result; and
when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, taking the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result, or taking the emotion classification with the highest confidence in the first text emotion recognition result and the emotion classification with the highest confidence in the second text emotion recognition result together as the text emotion recognition result;
wherein the determining an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result further comprises:
when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, judging whether the emotion classification with the highest confidence in the second text emotion recognition result is included in the first text emotion recognition result; and
if so, further judging whether the emotion intensity level, in the first text emotion recognition result, of the emotion classification with the highest confidence in the second text emotion recognition result is greater than a first intensity threshold, and if it is, taking the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result; if either judgment is negative, taking the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result, or taking the emotion classification with the highest confidence in the first text emotion recognition result and the emotion classification with the highest confidence in the second text emotion recognition result together as the text emotion recognition result.
13. The speech recognition emotion interaction method of claim 11, wherein the determining an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result comprises:
when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, taking the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result; and
when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, taking the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result, or taking the emotion classification with the highest confidence in the second text emotion recognition result and the emotion classification with the highest confidence in the first text emotion recognition result together as the text emotion recognition result;
when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, judging whether the emotion classification with the highest confidence in the first text emotion recognition result is included in the second text emotion recognition result; and
if so, further judging whether the emotion intensity level, in the second text emotion recognition result, of the emotion classification with the highest confidence in the first text emotion recognition result is greater than a second intensity threshold, and if it is, taking the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result; if either judgment is negative, taking the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result, or taking the emotion classification with the highest confidence in the second text emotion recognition result and the emotion classification with the highest confidence in the first text emotion recognition result together as the text emotion recognition result.
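Claims 12 and 13 are mirror images of one decision procedure, differing only in which result is treated as primary; a minimal sketch parameterized accordingly, where the intensity levels and the threshold value are illustrative assumptions:

```python
def resolve(primary, secondary, primary_intensity, threshold):
    """primary, secondary: dict mapping emotion classification -> confidence.
    primary_intensity: dict mapping emotion classification -> intensity level in the primary result."""
    top_p = max(primary, key=primary.get)
    top_s = max(secondary, key=secondary.get)
    if primary[top_p] > secondary[top_s]:
        return {top_p}
    if primary[top_p] == secondary[top_s]:
        return {top_p}   # the claim equally allows returning {top_p, top_s}
    # Primary's top confidence is lower: fall back to the secondary's top classification
    # only if it also appears in the primary result with a high enough intensity level.
    if top_s in primary and primary_intensity.get(top_s, 0) > threshold:
        return {top_s}
    return {top_p}       # the claim equally allows returning {top_p, top_s}

# Claim 12 treats the first text result as primary with the first intensity threshold;
# claim 13 swaps the roles and uses the second intensity threshold.
first = {"anger": 0.5, "anxiety": 0.3}
second = {"anxiety": 0.7}
print(resolve(first, second, primary_intensity={"anger": 4, "anxiety": 3}, threshold=2))
# -> {'anxiety'}: the second result's top class also appears in the first result with intensity 3 > 2
```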
14. The speech recognition emotion interaction method of claim 7, wherein the first text emotion recognition result and the second text emotion recognition result respectively correspond to one coordinate point in the multidimensional emotion space;
wherein the determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
carrying out weighted averaging on the coordinate values of the coordinate points of the first text emotion recognition result and the second text emotion recognition result in the multi-dimensional emotion space, and taking the coordinate point obtained by the weighted averaging as the text emotion recognition result.
15. A speech recognition interaction device, comprising:
the emotion recognition module is configured to obtain an emotion recognition result according to a user voice message, wherein the emotion recognition result comprises at least an audio emotion recognition result, or comprises at least an audio emotion recognition result and a text emotion recognition result;
the basic intention recognition module is configured to perform intention analysis according to the text content of the user voice message to obtain corresponding basic intention information; and
the interaction instruction determining module is configured to determine a corresponding interaction instruction according to the emotion recognition result and the basic intention information;
wherein the emotion recognition module comprises: an audio emotion recognition unit configured to obtain an audio emotion recognition result according to audio data of the user voice message; and an emotion recognition result determination unit configured to determine the emotion recognition result from the audio emotion recognition result; or, an audio emotion recognition unit configured to obtain an audio emotion recognition result according to the audio data of the user voice message; a text emotion recognition unit configured to obtain a text emotion recognition result according to the text content of the user voice message; and an emotion recognition result determination unit configured to determine the emotion recognition result from the audio emotion recognition result and the text emotion recognition result;
the audio emotion recognition result and the text emotion recognition result respectively comprise one or more of a plurality of emotion classifications;
wherein the emotion recognition result determination unit includes: a first confidence calculation subunit configured to calculate the confidence of each emotion classification in the audio emotion recognition result and the confidence of each emotion classification in the text emotion recognition result; and an emotion recognition subunit configured to, when the emotion classification with the highest confidence in the audio emotion recognition result is the same as the emotion classification with the highest confidence in the text emotion recognition result, take that emotion classification as the emotion recognition result;
the emotion recognition subunit is further configured to, when the emotion classification with the highest confidence in the audio emotion recognition result is different from the emotion classification with the highest confidence in the text emotion recognition result, determine an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the audio emotion recognition result and the confidence of the emotion classification with the highest confidence in the text emotion recognition result;
the emotion recognition subunit is further configured to: when the confidence of the emotion classification with the highest confidence in the audio emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the text emotion recognition result, take the emotion classification with the highest confidence in the audio emotion recognition result as the emotion recognition result; when the confidence of the emotion classification with the highest confidence in the audio emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the text emotion recognition result, take the emotion classification with the highest confidence in the audio emotion recognition result as the emotion recognition result, or take the emotion classification with the highest confidence in the audio emotion recognition result and the emotion classification with the highest confidence in the text emotion recognition result together as the emotion recognition result; when the confidence of the emotion classification with the highest confidence in the audio emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the text emotion recognition result, judge whether the emotion classification with the highest confidence in the text emotion recognition result is included in the audio emotion recognition result; if so, further judge whether the emotion intensity level, in the audio emotion recognition result, of the emotion classification with the highest confidence in the text emotion recognition result is greater than a first intensity threshold, and if it is, take the emotion classification with the highest confidence in the text emotion recognition result as the emotion recognition result; if either judgment is negative, take the emotion classification with the highest confidence in the audio emotion recognition result as the emotion recognition result, or take the emotion classification with the highest confidence in the audio emotion recognition result and the emotion classification with the highest confidence in the text emotion recognition result together as the emotion recognition result;
or, the emotion recognition subunit is further configured to: when the confidence of the emotion classification with the highest confidence in the text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the audio emotion recognition result, take the emotion classification with the highest confidence in the text emotion recognition result as the emotion recognition result; when the confidence of the emotion classification with the highest confidence in the text emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the audio emotion recognition result, take the emotion classification with the highest confidence in the text emotion recognition result as the emotion recognition result, or take the emotion classification with the highest confidence in the text emotion recognition result and the emotion classification with the highest confidence in the audio emotion recognition result together as the emotion recognition result; when the confidence of the emotion classification with the highest confidence in the text emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the audio emotion recognition result, judge whether the emotion classification with the highest confidence in the audio emotion recognition result is included in the text emotion recognition result; if so, further judge whether the emotion intensity level, in the text emotion recognition result, of the emotion classification with the highest confidence in the audio emotion recognition result is greater than a second intensity threshold, and if it is, take the emotion classification with the highest confidence in the audio emotion recognition result as the emotion recognition result; if either judgment is negative, take the emotion classification with the highest confidence in the text emotion recognition result as the emotion recognition result, or take the emotion classification with the highest confidence in the text emotion recognition result and the emotion classification with the highest confidence in the audio emotion recognition result together as the emotion recognition result.
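A minimal structural sketch of the modules and units recited in claim 15, with the recognition, fusion, and intention logic stubbed out; all class and method names, and the stub return values, are illustrative assumptions rather than the claimed implementation:

```python
class EmotionRecognitionModule:
    def audio_emotion(self, audio_data):              # audio emotion recognition unit
        return {"anger": 0.6}                         # stub: would run acoustic analysis
    def text_emotion(self, text):                     # text emotion recognition unit
        return {"anger": 0.7}                         # stub: would run lexicon/model analysis
    def determine(self, audio_result, text_result):   # emotion recognition result determination unit
        shared = set(audio_result) & set(text_result)
        return shared or set(audio_result) | set(text_result)

class BasicIntentionModule:
    def analyze(self, text):
        return "query_bill"                           # stub intention analysis

class InteractionInstructionModule:
    def decide(self, emotion, intention):
        return f"{intention}:{sorted(emotion)}"       # stub mapping to an interaction instruction

def handle_voice_message(audio_data, text):
    emo = EmotionRecognitionModule()
    emotion = emo.determine(emo.audio_emotion(audio_data), emo.text_emotion(text))
    intention = BasicIntentionModule().analyze(text)
    return InteractionInstructionModule().decide(emotion, intention)

print(handle_voice_message(b"...", "why is my bill so high"))
```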
16. The speech recognition interaction device of claim 15, wherein the audio emotion recognition result comprises one or more of a plurality of emotion classifications; or the audio emotion recognition result corresponds to a coordinate point in the multi-dimensional emotion space;
or the audio emotion recognition result and the text emotion recognition result respectively comprise one or more of a plurality of emotion classifications; or the audio emotion recognition result and the text emotion recognition result respectively correspond to a coordinate point in the multi-dimensional emotion space;
wherein each dimension in the multi-dimensional emotion space corresponds to a psychologically defined affective factor.
17. The speech recognition interaction device of claim 15, wherein the audio emotion recognition result and the text emotion recognition result each comprise one or more of the plurality of emotion classifications;
wherein the emotion recognition result determination unit is further configured to:
if the audio emotion recognition result and the text emotion recognition result comprise the same emotion classification, take the same emotion classification as the emotion recognition result.
18. The speech recognition interaction device of claim 16, wherein the emotion recognition result determination unit is further configured to:
if the audio emotion recognition result and the text emotion recognition result do not comprise the same emotion classification, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
19. The speech recognition interaction device of claim 15, wherein the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space;
wherein the emotion recognition result determination unit is further configured to:
carry out weighted averaging on the coordinate values of the coordinate points of the audio emotion recognition result and the text emotion recognition result in the multi-dimensional emotion space, and take the coordinate point obtained by the weighted averaging as the emotion recognition result.
20. The speech recognition interaction device of claim 15, wherein the text emotion recognition unit comprises:
a vocabulary emotion recognition subunit configured to recognize emotion vocabulary in the text content of the user voice message and determine a first text emotion recognition result according to the recognized emotion vocabulary;
a deep learning emotion recognition subunit configured to input the text content of the user voice message into a text emotion recognition deep learning model, wherein the text emotion recognition deep learning model is trained on text content carrying emotion classification labels and emotion intensity level labels, and the output of the text emotion recognition deep learning model is used as a second text emotion recognition result; and
a text emotion determination subunit configured to determine the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
21. The speech recognition interaction device of claim 20, wherein the first textual emotion recognition result comprises one or more of a plurality of emotion classifications; or the first text emotion recognition result corresponds to a coordinate point in a multi-dimensional emotion space;
or the first text emotion recognition result and the second text emotion recognition result respectively comprise one or more of a plurality of emotion classifications; or the first text emotion recognition result and the second text emotion recognition result respectively correspond to a coordinate point in a multi-dimensional emotion space;
wherein each dimension in the multi-dimensional emotion space corresponds to a psychologically defined affective factor.
22. The speech recognition interaction device of claim 20, wherein the first textual emotion recognition result and the second textual emotion recognition result each comprise one or more of a plurality of emotion classifications;
wherein the text emotion determination subunit is further configured to:
if the first text emotion recognition result and the second text emotion recognition result comprise the same emotion classification, take the same emotion classification as the text emotion recognition result.
23. The speech recognition interaction device of claim 22, wherein the text emotion determination subunit is further configured to:
if the first text emotion recognition result and the second text emotion recognition result do not comprise the same emotion classification, take the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
24. The speech recognition interaction device of claim 22, wherein the first textual emotion recognition result and the second textual emotion recognition result each comprise one or more of the plurality of emotion classifications;
wherein the text emotion determination subunit comprises:
a second confidence calculation subunit configured to calculate the confidence of each emotion classification in the first text emotion recognition result and the confidence of each emotion classification in the second text emotion recognition result; and
a text emotion judgment subunit configured to, when the emotion classification with the highest confidence in the first text emotion recognition result is the same as the emotion classification with the highest confidence in the second text emotion recognition result, take that emotion classification as the text emotion recognition result.
25. The speech recognition interaction device of claim 24, wherein the text emotion determination subunit is further configured to:
when the emotion classification with the highest confidence in the first text emotion recognition result is different from the emotion classification with the highest confidence in the second text emotion recognition result, determine an emotion recognition result according to the magnitude relationship between the confidence of the emotion classification with the highest confidence in the first text emotion recognition result and the confidence of the emotion classification with the highest confidence in the second text emotion recognition result.
26. The speech recognition interaction device of claim 25, wherein the text emotion determination subunit is further configured to: when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result; and
when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result, or take the emotion classification with the highest confidence in the first text emotion recognition result and the emotion classification with the highest confidence in the second text emotion recognition result together as the text emotion recognition result.
27. The speech recognition interaction device of claim 25, wherein the text emotion judgment subunit is further configured to, when the confidence of the emotion classification with the highest confidence in the first text emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the second text emotion recognition result, judge whether the emotion classification with the highest confidence in the second text emotion recognition result is included in the first text emotion recognition result; and
if so, further judge whether the emotion intensity level, in the first text emotion recognition result, of the emotion classification with the highest confidence in the second text emotion recognition result is greater than a first intensity threshold, and if it is, take the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result; if either judgment is negative, take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result, or take the emotion classification with the highest confidence in the first text emotion recognition result and the emotion classification with the highest confidence in the second text emotion recognition result together as the text emotion recognition result.
28. The speech recognition interaction device of claim 21, wherein the text emotion judgment subunit is further configured to, when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is greater than the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, take the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result; and
when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is equal to the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, take the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result, or take the emotion classification with the highest confidence in the second text emotion recognition result and the emotion classification with the highest confidence in the first text emotion recognition result together as the text emotion recognition result;
when the confidence of the emotion classification with the highest confidence in the second text emotion recognition result is less than the confidence of the emotion classification with the highest confidence in the first text emotion recognition result, judge whether the emotion classification with the highest confidence in the first text emotion recognition result is included in the second text emotion recognition result; and
if so, further judge whether the emotion intensity level, in the second text emotion recognition result, of the emotion classification with the highest confidence in the first text emotion recognition result is greater than a second intensity threshold, and if it is, take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result; if either judgment is negative, take the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result, or take the emotion classification with the highest confidence in the second text emotion recognition result and the emotion classification with the highest confidence in the first text emotion recognition result together as the text emotion recognition result.
29. The speech recognition interaction device of claim 21, wherein the first textual emotion recognition result and the second textual emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein the text emotion determination subunit is further configured to:
carry out weighted averaging on the coordinate values of the coordinate points of the first text emotion recognition result and the second text emotion recognition result in the multi-dimensional emotion space, and take the coordinate point obtained by the weighted averaging as the text emotion recognition result.
30. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 14.
31. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 14.
CN201810079431.XA 2018-01-26 2018-01-26 Voice recognition interaction method and device, computer equipment and storage medium Active CN110085211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810079431.XA CN110085211B (en) 2018-01-26 2018-01-26 Voice recognition interaction method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110085211A (en) 2019-08-02
CN110085211B (en) 2021-06-29

Family

ID=67412751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810079431.XA Active CN110085211B (en) 2018-01-26 2018-01-26 Voice recognition interaction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110085211B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910903B (en) * 2019-12-04 2023-03-21 深圳前海微众银行股份有限公司 Speech emotion recognition method, device, equipment and computer readable storage medium
CN111106995B (en) * 2019-12-26 2022-06-24 腾讯科技(深圳)有限公司 Message display method, device, terminal and computer readable storage medium
CN111833907B (en) * 2020-01-08 2023-07-18 北京嘀嘀无限科技发展有限公司 Man-machine interaction method, terminal and computer readable storage medium
CN111273990A (en) * 2020-01-21 2020-06-12 腾讯科技(深圳)有限公司 Information interaction method and device, computer equipment and storage medium
CN111401198B (en) * 2020-03-10 2024-04-23 广东九联科技股份有限公司 Audience emotion recognition method, device and system
CN112151034B (en) * 2020-10-14 2022-09-16 珠海格力电器股份有限公司 Voice control method and device of equipment, electronic equipment and storage medium
CN112951233A (en) * 2021-03-30 2021-06-11 平安科技(深圳)有限公司 Voice question and answer method and device, electronic equipment and readable storage medium
CN113743126B (en) * 2021-11-08 2022-06-14 北京博瑞彤芸科技股份有限公司 Intelligent interaction method and device based on user emotion
CN114125492B (en) * 2022-01-24 2022-07-15 阿里巴巴(中国)有限公司 Live content generation method and device
US20230395078A1 (en) * 2022-06-06 2023-12-07 Cerence Operating Company Emotion-aware voice assistant

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102142253A (en) * 2010-01-29 2011-08-03 富士通株式会社 Voice emotion identification equipment and method
CN103543979A (en) * 2012-07-17 2014-01-29 联想(北京)有限公司 Voice outputting method, voice interaction method and electronic device
US20140229175A1 (en) * 2013-02-13 2014-08-14 Bayerische Motoren Werke Aktiengesellschaft Voice-Interfaced In-Vehicle Assistance
WO2016169594A1 (en) * 2015-04-22 2016-10-27 Longsand Limited Web technology responsive to mixtures of emotions
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN106910512A (en) * 2015-12-18 2017-06-30 株式会社理光 The analysis method of voice document, apparatus and system
CN106503646A (en) * 2016-10-19 2017-03-15 竹间智能科技(上海)有限公司 Multi-modal emotion identification system and method
CN106570496A (en) * 2016-11-22 2017-04-19 上海智臻智能网络科技股份有限公司 Emotion recognition method and device and intelligent interaction method and device
CN106776557A (en) * 2016-12-13 2017-05-31 竹间智能科技(上海)有限公司 Affective state memory recognition methods and the device of emotional robot
CN107562816A (en) * 2017-08-16 2018-01-09 深圳狗尾草智能科技有限公司 User view automatic identifying method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. Deng et al., "Confidence Measures for Speech Emotion Recognition: A Start," Speech Communication; 10. ITG Symposium, 2012. *
韩文静, "Research on Key Technologies of Speech Emotion Recognition" (语音情感识别关键技术研究), China Doctoral Dissertations Full-text Database (Information Science and Technology), 2015. *

Also Published As

Publication number Publication date
CN110085211A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN108197115B (en) Intelligent interaction method and device, computer equipment and computer readable storage medium
CN110085211B (en) Voice recognition interaction method and device, computer equipment and storage medium
US10977452B2 (en) Multi-lingual virtual personal assistant
US11636851B2 (en) Multi-assistant natural language input processing
Batliner et al. The automatic recognition of emotions in speech
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
Tahon et al. Towards a small set of robust acoustic features for emotion recognition: challenges
El Ayadi et al. Survey on speech emotion recognition: Features, classification schemes, and databases
Schuller et al. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge
US20210090575A1 (en) Multi-assistant natural language input processing
Wu et al. Automatic speech emotion recognition using modulation spectral features
CN110085221A (en) Speech emotional exchange method, computer equipment and computer readable storage medium
CN110085262A (en) Voice mood exchange method, computer equipment and computer readable storage medium
Vlasenko et al. Modeling phonetic pattern variability in favor of the creation of robust emotion classifiers for real-life applications
Basharirad et al. Speech emotion recognition methods: A literature review
Chandrasekar et al. Automatic speech emotion recognition: A survey
CN110085220A (en) Intelligent interaction device
US11579841B1 (en) Task resumption in a natural understanding system
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
US11176943B2 (en) Voice recognition device, voice recognition method, and computer program product
US20240095987A1 (en) Content generation
US11887583B1 (en) Updating models with trained model update objects
SÖNMEZ et al. In-depth investigation of speech emotion recognition studies from past to present: The importance of emotion recognition from speech signal for AI
Raghu et al. A Perspective Study on Speech Emotion Recognition: Databases, Features and Classification Models.
El Seknedy et al. Arabic english speech emotion recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Speech recognition interaction methods, devices, computer equipment, and storage media

Granted publication date: 20210629

Pledgee: Agricultural Bank of China Limited Shanghai pilot Free Trade Zone New Area Branch

Pledgor: SHANGHAI XIAOI ROBOT TECHNOLOGY Co.,Ltd.

Registration number: Y2024310000244
