CN113808621A - Method, apparatus, device, and medium for labeling a voice dialog in human-computer interaction


Info

Publication number
CN113808621A
Authority
CN
China
Prior art keywords
voice
user
machine
satisfaction
reply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111069995.3A
Other languages
Chinese (zh)
Inventor
余凯
李星宇
王怀章
王超银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Original Assignee
Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Horizon Shanghai Artificial Intelligence Technology Co Ltd filed Critical Horizon Shanghai Artificial Intelligence Technology Co Ltd
Priority to CN202111069995.3A priority Critical patent/CN113808621A/en
Publication of CN113808621A publication Critical patent/CN113808621A/en
Priority to PCT/CN2022/112490 priority patent/WO2023035870A1/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present disclosure disclose a method, apparatus, device, and medium for labeling a voice dialog in human-computer interaction. A machine voice reply made by the human-computer interaction system for the user's previous voice is determined; the user's emotional characteristics in the current voice made in response to that machine voice reply are determined, and a first satisfaction of the user with the machine voice reply is determined based on those emotional characteristics. If the current voice is the ending voice of a multi-round dialog, at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog is determined, and the multi-round dialog is labeled based on the first satisfaction and the at least one second satisfaction, so that automatic labeling of human-machine dialogs can be achieved.

Description

Method, apparatus, device, and medium for labeling a voice dialog in human-computer interaction
Technical Field
The present disclosure relates to natural language processing technologies, and in particular, to a method and apparatus, a device, and a medium for labeling a voice dialog in human-computer interaction.
Background
Human-computer interaction refers to the process in which a person and a computer exchange information in order to complete a given task, in a certain interaction mode and using a certain dialog language. Conventional human-computer interaction is implemented mainly through input and output devices such as a keyboard, a mouse, and a display; with the development of technologies such as speech recognition and Natural Language Processing (NLP), humans and machines can now interact in a manner close to natural language.
With the gradual spread of the smart-living concept and the continuous advance of human-computer interaction technology, higher requirements are also placed on NLP technology. For example, when a user utters a dialog such as a voice dialog in the expectation that the machine will give a corresponding reply or perform a related task, the dialog content is converted into text through signal processing, speech recognition, and so on, and used as the input of the NLP system; the NLP system understands the meaning of the user's dialog and, on that basis, gives a corresponding reply or performs the related task.
Therefore, how accurately the NLP system understands the meaning of the user's dialog directly affects the efficiency and accuracy with which the NLP system replies to the user's dialog or executes the related task, and thus affects the effect of the human-computer interaction.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a method and apparatus for labeling a voice dialog in human-computer interaction, an electronic device, and a medium.
According to an aspect of the embodiments of the present disclosure, there is provided a method for labeling a voice dialog in a human-computer interaction, including:
determining a machine voice reply made by the human-computer interaction system for the user's previous voice;
determining emotional characteristics of the user in the current voice made in response to the machine voice reply;
determining a first satisfaction of the user with the machine voice reply based on the emotional characteristics;
if the current voice is the ending voice of a multi-round dialog, determining at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the current round of dialog to which the current voice belongs, where one machine voice reply corresponds to one voice of the user;
labeling the multi-round dialog based on the first satisfaction and the at least one second satisfaction.
According to an aspect of the embodiments of the present disclosure, there is provided an apparatus for labeling a voice dialog in a human-computer interaction, including:
a first determination module, configured to determine a machine voice reply made by the human-computer interaction system for the user's previous voice;
a second determination module, configured to determine emotional characteristics of the user in the current voice made in response to the machine voice reply;
a third determination module, configured to determine a first satisfaction of the user with the machine voice reply based on the emotional characteristics;
a fourth determination module, configured to determine, if the current voice is the ending voice of a multi-round dialog, at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the current round of dialog to which the current voice belongs, where one machine voice reply corresponds to one voice of the user;
a labeling module, configured to label the multi-round dialog based on the first satisfaction and the at least one second satisfaction.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, where the storage medium stores a computer program for executing the method for labeling a voice dialog in human-computer interaction according to any of the above embodiments of the present disclosure.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for labeling a voice dialog in human-computer interaction according to any of the above embodiments of the present disclosure.
Based on the method and apparatus for labeling a voice dialog in human-computer interaction, the electronic device, and the medium provided by the embodiments of the present disclosure, the machine voice reply made by the human-computer interaction system for the user's previous voice is determined, and the user's emotional characteristics in the current voice made in response to that machine voice reply are determined. A first satisfaction of the user with the machine voice reply is then determined based on the emotional characteristics. If the current voice is the ending voice of a multi-round dialog, at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the current round of dialog is determined, and the multi-round dialog is labeled based on the first satisfaction and the at least one second satisfaction. The embodiments of the present disclosure thus determine the user's satisfaction with a machine voice reply from the user's emotional characteristics in the current voice made in response to that reply, and determine the semantic-understanding accuracy of the human-computer interaction system from the user's satisfaction with each machine voice reply in the multi-round dialog between the user and the system. Automatic labeling of the multi-round dialog is thereby realized, which improves the accuracy and efficiency of corpus labeling for the human-computer interaction system, improves the semantic-understanding accuracy of the system, improves the efficiency and accuracy with which the system replies to the user's dialog or executes tasks, and thereby improves the effect of the human-computer interaction.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a scene diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to still another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to still another exemplary embodiment of the present disclosure.
Fig. 6 is a flow chart diagram of an exemplary application embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of an apparatus for labeling a voice conversation in human-computer interaction according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of an apparatus for labeling a voice conversation in human-computer interaction according to another exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure describes only an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
Human-computer interaction based on NLP has not taken into account whether a person's real feedback on the machine's language is accurate, wrong, or mediocre. As a result, a large amount of manpower must be invested to manually label the learning corpus used to train the NLP system, which incurs high labor cost and long lead time; furthermore, the learning corpus cannot be collected and labeled in real time in a specific application, so real-time updating of the learning corpus cannot be achieved.
In view of this, embodiments of the present disclosure provide a method and apparatus for labeling a voice dialog in human-computer interaction, an electronic device, and a medium. The user's satisfaction with a machine voice reply is determined from the user's emotional characteristics in the current voice made in response to that reply, and the semantic-understanding accuracy of the human-computer interaction system is determined from the user's satisfaction with each machine voice reply in the multi-round dialog between the user and the system, so that automatic labeling of the multi-round dialog is realized and the accuracy and efficiency of corpus labeling for the human-computer interaction system are improved.
Exemplary System
Embodiments of the present disclosure can be applied to various scenarios involving voice interaction, such as an in-vehicle head unit, a user terminal, an application (APP), and the like.
Fig. 1 is a diagram of a scenario to which the present disclosure is applicable. As shown in Fig. 1, the system of the embodiment of the present disclosure includes: an audio acquisition module 101, a front-end signal processing module 102, a speech recognition module 103, a video sensor 104, a human-computer interaction system 105, an Emotion Perception System (EPS) 106, a memory 107, and a speaker 108. The EPS 106 may include a voice parameter acquisition module 1061, an expression recognition module 1062, an emotion determination module 1063, and a satisfaction determination module 1064.
When the embodiment of the present disclosure is applied to a voice interaction scenario, the audio signal of a voice initiated by the user in the current scenario is acquired by the audio acquisition module (e.g., a microphone array) 101, processed by the front-end signal processing module 102, and then recognized by the speech recognition module 103 to obtain text information, which is input to the human-computer interaction system 105. The human-computer interaction system 105 understands the meaning of the user's dialog, outputs a corresponding reply on that basis, and converts the reply into speech to obtain a machine voice reply, which is played through the speaker 108.
Then the audio acquisition module 101 acquires the current voice made by the user in response to the machine voice reply output by the human-computer interaction system 105; this voice goes through the processing of the front-end signal processing module 102, the speech recognition module 103, and the human-computer interaction system 105, and is also input to the EPS 106. In addition, when the audio acquisition module 101 acquires this current voice, the video sensor (e.g., a camera) 104 captures a face image of the user while the current voice is being made and inputs it to the EPS 106. The voice parameter acquisition module 1061 in the EPS 106 acquires the voice parameters of the current voice collected by the audio acquisition module 101, and the expression recognition module 1062 recognizes the facial expression in the face image. The emotion determination module 1063 then determines the user's emotional characteristics in the current voice based on the voice parameters and the facial expression, and the satisfaction determination module 1064 determines the user's satisfaction with the machine voice reply based on the emotional characteristics; the previous round of dialog (including the user's previous voice and the machine voice reply made by the human-computer interaction system 105 for that voice) and the corresponding satisfaction are stored in the memory 107. The above process is repeated, determining the user's satisfaction with each machine voice reply, until the current voice interaction scenario ends; the multiple rounds of dialog in the scenario are then labeled based on the satisfaction of the user with each machine voice reply obtained by the satisfaction determination module 1064, and the multi-round dialog and the corresponding satisfaction are stored in the memory 107. One round of dialog refers to one voice of the user and the one machine voice reply made by the human-computer interaction system for that voice. A minimal sketch of this per-round loop is given below.
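For orientation only, the per-round flow just described can be written out as a short loop. The following Python sketch is a hypothetical rendering of that bookkeeping: the component objects (mic, camera, hci, eps, memory) and their methods are placeholders invented for illustration and are not the interfaces of the disclosed modules.

```python
def run_voice_interaction_scene(mic, camera, hci, eps, memory):
    """One voice interaction scene: repeat until the user stops speaking
    (ending voice), storing each round of dialog with its satisfaction.
    All component interfaces here are assumptions, not the disclosed APIs."""
    previous_round = None                       # (user voice text, machine reply text)
    round_satisfactions = []
    while True:
        audio = mic.capture()                   # audio acquisition module 101
        if audio is None:                       # no voice within the preset time
            break                               # the previous voice was the ending voice
        text = hci.recognize(audio)             # front-end processing 102 + recognition 103
        if previous_round is not None:
            face = camera.capture()             # video sensor 104
            satisfaction = eps.estimate(audio, face)    # EPS 106: emotion -> satisfaction
            memory.store(previous_round, satisfaction)  # memory 107
            round_satisfactions.append(satisfaction)
        reply = hci.reply(text)                 # human-computer interaction system 105
        hci.speak(reply)                        # speaker 108
        previous_round = (text, reply)
    memory.store_scene_label(round_satisfactions)       # label the multi-round dialog
```

Note that, as in the flow above, the satisfaction for a machine voice reply is only computed once the user's next voice arrives, so the very last machine reply of a scene receives no satisfaction of its own.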
Exemplary method
Fig. 2 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device such as an in-vehicle head unit or a user terminal. As shown in fig. 2, the method for labeling a voice dialog in human-computer interaction in this embodiment includes the following steps:
Step 201, determining the machine voice reply made by the human-computer interaction system for the user's previous voice.
Here, one machine voice reply corresponds to one voice of the user; that is, each machine reply is the human-computer interaction system's reply to one voice output by the user.
In a specific application, one voice of the user (e.g., "Please go to ABC mall") and the one machine voice reply made by the human-computer interaction system for that voice (e.g., "Which ABC mall?") may be referred to as one round of dialog. Multiple rounds of dialog may be triggered when the user utters a dialog such as a voice dialog in the expectation that the machine will give a corresponding reply or perform a related task.
Optionally, in some embodiments, the user's voice may be collected by an audio acquisition device (e.g., a microphone or a microphone array); after front-end signal processing, speech recognition is performed to obtain text information, which is input into the human-computer interaction system; the human-computer interaction system understands the meaning of the user's previous voice and outputs a machine voice reply on that basis, so that in step 201 the machine voice reply output by the human-computer interaction system can be obtained.
Step 202, determining the emotional characteristics of the user in the current voice made in response to the machine voice reply.
The emotional characteristics in the embodiments of the present disclosure are features used to represent the user's emotion.
Step 203, determining a first satisfaction of the user with the machine voice reply based on the emotional characteristics.
The first satisfaction indicates how satisfied the user is with the machine voice reply; it can also be regarded as the user's satisfaction with the previous round of dialog (including the user's previous voice and the machine voice reply made by the human-computer interaction system for that voice).
Optionally, in some embodiments, the satisfaction in the embodiments of the present disclosure may be expressed as a specific score, and a higher score may be set to indicate higher satisfaction of the user with the machine voice reply.
Alternatively, in other implementations, the satisfaction in the embodiments of the present disclosure may be expressed as a satisfaction level. In a specific application, the user's satisfaction may be divided into several (e.g., 5) levels according to actual requirements, where the levels transition gradually from satisfied to unsatisfied or from unsatisfied to satisfied; for example, when the satisfaction is divided into 5 levels, these may range from very unsatisfied, through unsatisfied, neutral, and satisfied, to very satisfied. The embodiments of the present disclosure do not limit the specific number of satisfaction levels or their correspondence to the user's satisfaction. A minimal mapping from a numeric score to such levels is sketched below.
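As an illustration of the level-based representation, the snippet below maps a numeric satisfaction score to one of five discrete levels. The [0, 1] score range and the exact level names are assumptions made for the example; the disclosure fixes neither.

```python
SATISFACTION_LEVELS = [
    "very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied",
]

def score_to_level(score: float) -> str:
    """Map a satisfaction score assumed to lie in [0, 1] to a discrete level."""
    score = min(max(score, 0.0), 1.0)                       # clamp to the assumed range
    index = min(int(score * len(SATISFACTION_LEVELS)), len(SATISFACTION_LEVELS) - 1)
    return SATISFACTION_LEVELS[index]

print(score_to_level(0.25))  # unsatisfied
print(score_to_level(0.93))  # very satisfied
```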
Step 204, if the current voice is the ending voice of the multi-round dialog between the user and the human-computer interaction system, determining at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the round of dialog to which the current voice belongs.
In the embodiments of the present disclosure, one satisfaction is generated for each round of dialog. Each round of dialog before the current round may be called a historical round of dialog, and its satisfaction may be called a second satisfaction; depending on how many historical rounds of dialog precede the current round, there is at least one second satisfaction.
Step 205, labeling the multi-round dialog based on the first satisfaction and the at least one second satisfaction.
Based on this embodiment, the machine voice reply made by the human-computer interaction system for the user's previous voice is determined, the user's emotional characteristics in the current voice made in response to that reply are determined, and a first satisfaction of the user with the machine voice reply is then determined based on those emotional characteristics. If the current voice is the ending voice of the multi-round dialog, at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the current round of dialog is determined, and the multi-round dialog is labeled based on the first satisfaction and the at least one second satisfaction. The embodiments of the present disclosure thus determine the user's satisfaction with a machine voice reply from the user's emotional characteristics in the current voice made in response to that reply, and determine the semantic-understanding accuracy of the human-computer interaction system from the user's satisfaction with each machine voice reply in the multi-round dialog between the user and the system. Automatic labeling of the multi-round dialog is thereby realized, improving the accuracy and efficiency of corpus labeling for the human-computer interaction system, the semantic-understanding accuracy of the system, the efficiency and accuracy with which the system replies to the user's dialog or executes tasks, and hence the effect of the human-computer interaction.
Fig. 3 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to another exemplary embodiment of the present disclosure. As shown in fig. 3, based on the embodiment shown in fig. 2, step 202 may include the following steps:
Step 2021, determining the voice parameters of the user in the current voice made in response to the machine voice reply.
Optionally, in some embodiments, the voice parameters may include, but are not limited to, any one or more of the following: pitch, volume (also called loudness), and the like; the embodiments of the present disclosure do not limit the specific voice parameters.
Pitch indicates how high or low a sound is; it is determined by the frequency of the vibration that produces the sound, and the faster the vibration, the higher the pitch. Volume indicates how loud a sound is; it is determined by the amplitude of the vibration that produces the sound, and the larger the amplitude, the greater the loudness. Colloquially, pitch describes how sharp or deep a sound is, while volume describes how loud it is: for example, a child's whisper has a high pitch but a low volume, whereas an adult's scolding voice has a low pitch but a large volume. A simple way to estimate both from an audio frame is sketched below.
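As a concrete illustration, the snippet below estimates the two parameters from a single audio frame with NumPy: volume as root-mean-square amplitude, and pitch as the strongest autocorrelation peak within a typical speech range. These are common approximations chosen for the example, not measures mandated by the disclosure.

```python
import numpy as np

def frame_volume(frame: np.ndarray) -> float:
    """Volume (loudness): root-mean-square amplitude of the frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def frame_pitch(frame: np.ndarray, sample_rate: int) -> float:
    """Pitch: fundamental frequency taken from the strongest autocorrelation
    peak, searched within a typical speech range of roughly 60-400 Hz."""
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 60
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag

rate = 16000
t = np.arange(2048) / rate
frame = 0.3 * np.sin(2 * np.pi * 220 * t)             # synthetic 220 Hz test tone
print(frame_pitch(frame, rate), frame_volume(frame))  # roughly 220 Hz and 0.21
```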
Step 2022, determining the facial expression of the user in the current voice made by the user in response to the machine voice.
Optionally, in some embodiments, the facial expression may include, but is not limited to, any one or more of the following: satisfaction, dissatisfaction, happiness, neutrality, anger, fidget, etc., and the embodiments of the present disclosure do not limit the specific types of facial expressions.
Step 2023, determining, based on the voice parameters and the facial expression, the emotional characteristics of the user in the current voice made in response to the machine voice reply.
Optionally, in some of these embodiments, speech parameters and facial expressions may be used as emotional features; or, feature extraction may be performed on the voice parameters and the facial expressions respectively, and the extracted features are fused to obtain emotional features.
Based on this embodiment of the disclosure, the emotional characteristics of the user in the current voice made in response to the machine voice reply are determined from the user's voice parameters and facial expression, so that the user's emotion can be determined objectively and truly, and the user's satisfaction with the machine voice reply can then be determined.
Optionally, in some embodiments, starting from the starting time point of the user's current voice made in response to the machine voice reply, a voice parameter component corresponding to each syllable of the current voice is determined in units of syllables; that is, each syllable corresponds to one voice parameter component (which may also be referred to as a unit voice parameter). Then, based on the voice parameter components obtained during the duration of the current voice, the voice parameters of the user's current voice made in response to the machine voice reply are determined. For example, the voice parameter components obtained during the duration of the current voice may be accumulated or averaged to obtain those voice parameters. The embodiments of the present disclosure do not limit the specific manner in which the voice parameters are determined from the voice parameter components over the duration of the current voice.
Based on this embodiment, the voice parameter components of the syllables within the duration of the voice are determined in units of syllables, and the voice parameters of the whole voice are determined from those components, which makes the determination of the voice parameters more objective and the voice parameters of the whole voice more accurate, so that the user's emotional characteristics when making the voice can be determined accurately. A minimal sketch of such per-syllable aggregation follows.
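The sketch below combines hypothetical per-syllable components into whole-utterance parameters. The per-syllable values and the choice of averaging (rather than accumulation) are illustrative assumptions.

```python
from statistics import mean
from typing import Dict, List

def utterance_parameters(syllable_components: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine the voice parameter component of each syllable (one entry per
    syllable, e.g. pitch and volume) into the parameters of the whole voice.
    Averaging is used here; the disclosure also allows accumulation."""
    keys = syllable_components[0].keys()
    return {k: mean(c[k] for c in syllable_components) for k in keys}

# Hypothetical per-syllable measurements for one user voice
components = [
    {"pitch_hz": 210.0, "volume_rms": 0.21},
    {"pitch_hz": 245.0, "volume_rms": 0.34},
    {"pitch_hz": 260.0, "volume_rms": 0.38},
]
print(utterance_parameters(components))
# {'pitch_hz': 238.33..., 'volume_rms': 0.31}
```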
Fig. 4 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to still another exemplary embodiment of the present disclosure. As shown in fig. 4, on the basis of the embodiment shown in fig. 3, step 2022 may include the following steps:
step 20221, obtain the face image when the user makes this voice for the machine voice reply.
Step 20222, inputting the facial image into a first neural network obtained by pre-training, and outputting the facial expression corresponding to the facial image through the first neural network.
In some embodiments, when the user replies to the machine voice to make the voice, the user may collect a face image of the user when making the voice through a visual sensor (a camera), input the face image into a first neural network trained in advance, and output a facial expression corresponding to the face image through the first neural network. For example, when an audio acquisition device (e.g., a microphone or a microphone array) acquires that a user makes a current voice for a machine voice reply, a camera is triggered to acquire a face image of the user at the current time and input the face image into a first neural network, and then the first neural network identifies a facial expression in the face image and outputs the facial expression.
In the embodiment of the disclosure, a first neural network can be obtained in advance based on training of a face image sample with face expression labeling information, and after the training of the first neural network is completed, a face expression corresponding to an input face image can be identified.
Based on the embodiment, the facial expression corresponding to the facial image can be rapidly and accurately identified through the neural network, and the identification efficiency and accuracy of the facial expression are improved, so that the emotion characteristics of the user when the user makes the voice are accurately determined.
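The following PyTorch sketch shows the kind of classifier meant by the "first neural network". The architecture, input size, and expression label set are assumptions for illustration; in practice the network would be trained on face image samples carrying expression labels, as described above.

```python
import torch
import torch.nn as nn

EXPRESSIONS = ["satisfied", "dissatisfied", "happy", "neutral", "angry", "irritated"]

class ExpressionNet(nn.Module):
    """Illustrative stand-in for the pre-trained first neural network:
    a small CNN mapping a face image to an expression class."""
    def __init__(self, num_classes: int = len(EXPRESSIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, 3, H, W) face images
        return self.classifier(self.features(x).flatten(1))

model = ExpressionNet().eval()               # weights would come from pre-training
face = torch.randn(1, 3, 112, 112)           # placeholder for a captured face image
with torch.no_grad():
    label = EXPRESSIONS[model(face).argmax(dim=1).item()]
print(label)
```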
Optionally, in some implementations, in step 203 of any of the above embodiments, the emotional characteristics may be input into a second neural network obtained by pre-training, and the first satisfaction output through the second neural network.
In the embodiments of the present disclosure, the second neural network can be trained in advance on emotional-feature samples carrying satisfaction labeling information; once training is complete, it can identify the satisfaction (i.e., the first satisfaction) corresponding to each input emotional feature.
Based on this embodiment, the satisfaction corresponding to the emotional characteristics can be determined quickly and accurately through the neural network, so that the user's satisfaction with the machine voice reply can be determined quickly, accurately, and objectively. An illustrative stand-in for such a network is sketched below.
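Correspondingly, the "second neural network" can be pictured as a small regressor from a fused emotional-feature vector to a satisfaction score. The feature layout, dimensions, and example values below are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class SatisfactionNet(nn.Module):
    """Illustrative stand-in for the pre-trained second neural network:
    an MLP mapping emotional features to a satisfaction score in (0, 1)."""
    def __init__(self, feature_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, emotional_features: torch.Tensor) -> torch.Tensor:
        return self.net(emotional_features)

model = SatisfactionNet().eval()             # weights would come from pre-training
# Hypothetical fused feature: normalized pitch, volume, one-hot expression (6 classes)
features = torch.tensor([[0.55, 0.31, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]])
with torch.no_grad():
    first_satisfaction = model(features).item()
print(first_satisfaction)
```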
Optionally, in another implementation, in step 203 of any of the above embodiments, a first emotion score corresponding to the voice parameters and a second emotion score corresponding to the facial expression are determined, and the first emotion score and the second emotion score are then weighted and summed in a preset manner to obtain the first satisfaction.
For example, the first emotion score corresponding to the voice parameters can be determined through a third neural network obtained by pre-training, and the second emotion score corresponding to the facial expression through a fourth neural network obtained by pre-training; the first satisfaction is then obtained by the weighted sum a·P + b·Q = S. The values of a and b can be preset and updated according to actual requirements; P and Q denote the first emotion score and the second emotion score respectively, each greater than 0; and S denotes the first satisfaction.
Based on this embodiment, a first emotion score corresponding to the voice parameters and a second emotion score corresponding to the facial expression can be determined separately, their weights can be set reasonably according to requirements, and the first satisfaction is obtained by weighted summation of the two scores, so that the resulting satisfaction better matches actual requirements. A minimal sketch of this weighting follows.
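A minimal sketch of the weighted combination follows; the particular weights and scores are made up for the example and would in practice be preset and updated as described.

```python
def weighted_satisfaction(p: float, q: float, a: float = 0.4, b: float = 0.6) -> float:
    """First satisfaction S = a*P + b*Q, where P is the emotion score from the
    voice parameters and Q is the emotion score from the facial expression.
    The weights a and b here are illustrative assumptions."""
    return a * p + b * q

p = 0.7   # first emotion score (from the voice parameters)
q = 0.9   # second emotion score (from the facial expression)
print(weighted_satisfaction(p, q))   # approximately 0.82
```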
Optionally, in some implementations, in step 205 of any of the above embodiments, a comprehensive satisfaction of the multi-round dialog may be determined based on the first satisfaction and the at least one second satisfaction, and the multi-round dialog may then be labeled with this comprehensive satisfaction.
Based on this embodiment, the satisfaction of each round of dialog in the current business scenario can be considered comprehensively to determine the user's overall satisfaction with the machine voice replies in the whole current business scenario, so that the semantic-understanding accuracy of the human-computer interaction system in the current business scenario is determined as a whole, automatic labeling of the multi-round dialog is realized, and the accuracy and efficiency of corpus labeling for the human-computer interaction system are improved. One possible aggregation is sketched below.
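One possible aggregation takes the comprehensive satisfaction as the mean of the first satisfaction and the second satisfactions. The disclosure leaves the exact aggregation rule open, so the mean is only an illustrative choice; the example rounds are paraphrased from the navigation example of Fig. 6, and the satisfaction values are made up.

```python
from statistics import mean
from typing import Dict, List

def label_multi_round_dialog(rounds: List[Dict[str, str]],
                             first_satisfaction: float,
                             second_satisfactions: List[float]) -> Dict:
    """Label a multi-round dialog with a comprehensive satisfaction computed
    from the first satisfaction (current round) and the second satisfactions
    (historical rounds). Mean aggregation is an illustrative assumption."""
    overall = mean(second_satisfactions + [first_satisfaction])
    return {"rounds": rounds, "comprehensive_satisfaction": overall}

rounds = [
    {"user": "ABC mall", "machine": "Which ABC mall?"},
    {"user": "The ABC mall at X", "machine": "Is it the first ABC mall at X?"},
    {"user": "Stupid!", "machine": "Is it the D mall at X?"},
    {"user": "Right", "machine": "OK"},
]
print(label_multi_round_dialog(rounds, first_satisfaction=0.9,
                               second_satisfactions=[0.8, 0.2, 0.3]))
```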
Optionally, in some implementations, after step 205 in any of the above embodiments, the current round of dialog may additionally be labeled with the first satisfaction, so that the satisfaction of each round of dialog is labeled; this makes it easier, when the human-machine dialog of the current business scenario ends, to label each round of dialog in the scenario based on its satisfaction.
Fig. 5 is a flowchart illustrating a method for labeling a voice dialog in human-computer interaction according to still another exemplary embodiment of the present disclosure. As shown in fig. 5, based on the embodiment shown in fig. 2, step 201 may include the following steps:
Step 2011, performing speech recognition on the previous voice to obtain a first text recognition result.
Step 2012, performing semantic analysis on the first text recognition result based on the historical rounds of dialog before the current round of dialog, to obtain a first semantic analysis result.
Step 2013, obtaining the reply content according to the first semantic analysis result.
Step 2014, converting the reply content into speech to obtain the machine voice reply.
Based on this embodiment, speech recognition is performed on the user's previous voice to obtain a first text recognition result; semantic analysis is performed on the first text recognition result based on the historical rounds of dialog before the current round to obtain a first semantic analysis result; the reply content is then obtained according to the first semantic analysis result and converted into speech to obtain the machine voice reply. In other words, the machine voice reply is obtained by semantically analyzing the text recognition result of the user's previous voice in combination with the historical rounds of dialog. A sketch of this data flow, with hypothetical component interfaces, is given below.
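The sketch below chains the four steps together. The component callables (asr, nlu, dialog, tts) are hypothetical stand-ins used only to show the data flow; they are not APIs of the disclosed system.

```python
def machine_voice_reply(previous_voice_audio, history_rounds, asr, nlu, dialog, tts):
    """Steps 2011-2014: recognize the previous voice, analyze its semantics in
    the context of the historical rounds of dialog, pick the reply content,
    and convert it to speech. All callables here are placeholders."""
    text = asr(previous_voice_audio)                  # first text recognition result
    semantics = nlu(text, context=history_rounds)     # first semantic analysis result
    reply_text = dialog(semantics)                    # reply content
    return tts(reply_text)                            # machine voice reply (audio)

# Toy usage with stand-in callables, just to show the flow of data:
reply = machine_voice_reply(
    b"<audio bytes>", history_rounds=[("ABC mall", "Which ABC mall?")],
    asr=lambda audio: "The ABC mall at X",
    nlu=lambda text, context: {"intent": "navigate", "destination": text},
    dialog=lambda semantics: f"Is it {semantics['destination']}?",
    tts=lambda text: text.encode("utf-8"),
)
print(reply)
```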
Fig. 6 is a flow chart diagram of an exemplary application embodiment of the present disclosure. As shown in fig. 6, the application embodiment takes an application scenario in the navigation APP as an example, and explains an application of the embodiment of the present disclosure. The application embodiment comprises the following steps:
step 301, the user initiates a first voice "ABC mall" to request navigation to the navigation destination ABC mall.
Step 302, the microphone array collects the audio signal of the first voice ABC mall, and the audio signal is sequentially subjected to front-end signal processing and voice recognition to obtain first text information and input the first text information into the human-computer interaction system.
Step 303, the human-computer interaction system understands the meaning of the user's dialog and outputs a corresponding first machine voice reply: "Which ABC mall?"
The first voice "ABC mall" and the first machine voice reply "Which ABC mall?" constitute one round of dialog, which may be referred to as the first round of dialog.
Step 304, the user makes a second voice, "The ABC mall at X", in response to the first machine voice reply.
During the process of the user uttering the second voice "ABC mall at X", steps 305 and 306 are performed simultaneously.
Step 305, the microphone array collects the audio signal of the second voice of the ABC mall at the X place and inputs the audio signal of the second voice into the EPS; meanwhile, the audio signal is sequentially subjected to front-end signal processing and voice recognition to obtain second text information, and the second text information is input into the man-machine interaction system.
Thereafter, step 310 is performed.
And step 306, the camera collects the face image of the user and inputs the face image into the EPS.
Step 307, the EPS determines the voice parameters, including pitch and volume, of the second voice collected by the microphone array, and recognizes the facial expression corresponding to the face image using the first neural network obtained by pre-training.
And 308, determining the emotional characteristics of the user when the user sends out the second voice by the EPS based on the voice parameters and the facial expression.
In step 309, the EPS determines a first satisfaction of the user with respect to the first machine voice reply based on the emotional characteristic, where the first satisfaction corresponds to a satisfaction of the first round of conversation.
Step 310, the human-computer interaction system understands the meaning of the user's dialog and outputs a corresponding second machine voice reply: "Is it the first ABC mall at X?"
The second voice "The ABC mall at X" and the second machine voice reply "Is it the first ABC mall at X?" constitute one round of dialog, which may be referred to as the second round of dialog.
Step 311, the user utters a third voice, "Stupid!", in response to the second machine voice reply.
While the user is uttering the third voice "Stupid!", steps 312 and 313 are performed simultaneously.
Step 312, the microphone array collects the audio signal of the third voice "Stupid!" and inputs it into the EPS; meanwhile, the audio signal is sequentially subjected to front-end signal processing and speech recognition to obtain third text information, which is input into the human-computer interaction system.
Thereafter, step 317 is performed.
And 313, acquiring a face image of the user by the camera, and inputting the face image into the EPS.
Step 314, the EPS determines the voice parameters, including pitch and volume, of the third voice collected by the microphone array, and recognizes the facial expression corresponding to the face image using the first neural network obtained by pre-training.
And 315, determining the emotional characteristics of the user when the user sends the third voice by the EPS based on the voice parameters and the facial expression.
In step 316, the EPS determines a first satisfaction of the user with respect to the second machine voice reply based on the emotional characteristic, where the first satisfaction corresponds to a satisfaction of the second round of conversation.
At this point, the second round of dialog is the current round of dialog, the first round of dialog has become a historical round of dialog before the current round, and its satisfaction has become a second satisfaction.
Step 317, the human-computer interaction system understands the meaning of the user's dialog and outputs a corresponding third machine voice reply: "Is it the D mall at X?"
The third voice "Stupid!" and the third machine voice reply "Is it the D mall at X?" constitute one round of dialog, which may be referred to as the third round of dialog.
Step 318, the user utters a fourth voice, "Right", in response to the third machine voice reply.
Then, for the fourth voice, the operations of steps 305 to 309 or steps 312 to 316 are performed to obtain the user's first satisfaction with the third machine voice reply.
Step 319, the human-computer interaction system understands the meaning of the user's dialog and outputs a fourth machine voice reply, "OK"; on this basis, it obtains the user's current position as the starting position and performs the navigation task with the D mall at X as the navigation destination.
Step 320, if the microphone array does not collect any voice from the user within a preset time, the EPS receives no further audio signal or face image within that time; it then confirms that the fourth voice is the ending voice of the multi-round dialog between the user and the human-computer interaction system, and determines the three second satisfactions corresponding to the first to third rounds of dialog.
Step 321, the four rounds of dialog are labeled based on the first satisfaction corresponding to the fourth voice and the three second satisfactions corresponding to the first to third rounds of dialog.
Any of the methods for tagging voice dialogs in human-computer interaction provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any method for labeling a voice dialog in human-computer interaction provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute any method for labeling a voice dialog in human-computer interaction mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. And will not be described in detail below.
Exemplary devices
Fig. 7 is a schematic structural diagram of an apparatus for labeling a voice dialog in human-computer interaction according to an exemplary embodiment of the present disclosure. The apparatus for labeling a voice dialog in human-computer interaction may be provided in an electronic device such as an in-vehicle head unit or a user terminal, and performs the method for labeling a voice dialog in human-computer interaction according to any of the above embodiments of the present disclosure. As shown in FIG. 7, the apparatus for labeling a voice dialog in human-computer interaction of this embodiment includes: a first determining module 401, a second determining module 402, a third determining module 403, a fourth determining module 404, and a labeling module 405. Wherein:
a first determining module 401, configured to determine a machine voice reply made by the human-computer interaction system for a previous voice of the user.
A second determining module 402, configured to determine an emotional characteristic of the user in the current voice made for the machine voice reply.
A third determining module 403, configured to determine a first satisfaction degree of the user with respect to the machine voice reply based on the emotional characteristic.
A fourth determining module 404, configured to determine, if the current voice is an end voice in multiple rounds of dialogs, at least one second satisfaction of the user with respect to a machine voice reply output by the human-computer interaction system in a history round of dialogs before the current round of dialogs to which the current voice belongs, where one machine voice replies a voice of a corresponding user.
And the labeling module 405 is used for labeling the multiple turns of the dialog based on the first satisfaction degree and the at least one second satisfaction degree.
Based on this embodiment, the machine voice reply made by the human-computer interaction system for the user's previous voice is determined, the user's emotional characteristics in the current voice made in response to that reply are determined, and a first satisfaction of the user with the machine voice reply is then determined based on those emotional characteristics. If the current voice is the ending voice of the multi-round dialog, at least one second satisfaction of the user with the machine voice replies output by the human-computer interaction system in the historical rounds of dialog before the current round of dialog is determined, and the multi-round dialog is labeled based on the first satisfaction and the at least one second satisfaction. The embodiments of the present disclosure thus determine the user's satisfaction with a machine voice reply from the user's emotional characteristics in the current voice made in response to that reply, and determine the semantic-understanding accuracy of the human-computer interaction system from the user's satisfaction with each machine voice reply in the multi-round dialog between the user and the system. Automatic labeling of the multi-round dialog is thereby realized, improving the accuracy and efficiency of corpus labeling for the human-computer interaction system, the semantic-understanding accuracy of the system, the efficiency and accuracy with which the system replies to the user's dialog or executes tasks, and hence the effect of the human-computer interaction.
Fig. 8 is a schematic structural diagram of an apparatus for labeling a voice conversation in human-computer interaction according to another exemplary embodiment of the present disclosure. As shown in fig. 8, on the basis of the embodiment shown in fig. 7, in this embodiment, the second determining module 402 may include: a first determining unit 4021, configured to determine a voice parameter of the user in response to the machine voice at this time; a second determining unit 4022, configured to determine a facial expression of the user in response to the machine voice when the user makes the current voice; a third determining unit 4023, configured to determine, based on the voice parameter and the facial expression, an emotional characteristic of the user in the current voice made for the machine voice reply.
Optionally, in some embodiments, the first determining unit 4021 is specifically configured to: starting from the detected starting time point of the user's current voice made in response to the machine voice reply, determine, in units of syllables, the voice parameter component corresponding to each syllable of the current voice; and determine the voice parameters of the user's current voice made in response to the machine voice reply based on the voice parameter components obtained during the duration of the current voice.
Optionally, referring back to fig. 8, in a further exemplary embodiment, the second determining module 402 may further include: the first obtaining unit 4024 is configured to obtain a face image when the user makes the current voice for the machine voice reply. Correspondingly, in this embodiment, the second determining unit 4022 is specifically configured to: and inputting the face image into a first neural network obtained by pre-training, and outputting the face expression corresponding to the face image through the first neural network.
Optionally, in some embodiments, the third determining module 403 is specifically configured to: and inputting the emotional features into a second neural network obtained by pre-training, and outputting the first satisfaction degree through the second neural network.
Optionally, referring back to fig. 8, in some embodiments, the third determining module 403 may include: a third determining unit 4031, configured to determine a first emotion score corresponding to the voice parameter; a fourth determining unit 4032, configured to determine a second emotion score corresponding to the facial expression; and the weighting processing unit 4033 is configured to perform weighted summation on the first emotion score and the second emotion score according to a preset manner, so as to obtain a first satisfaction.
Optionally, referring back to fig. 8, in some embodiments, the labeling module 405 may include: a fifth determining unit 4051, configured to determine a comprehensive satisfaction degree of the multiple rounds of conversations based on the first satisfaction degree and the at least one second satisfaction degree; and the labeling unit 4052 is used for labeling the comprehensive satisfaction degrees of the multiple rounds of conversations.
Optionally, referring back to fig. 8, in some embodiments, the labeling module 405 may further be configured to: and marking the first satisfaction degree for the current round of conversation.
Optionally, referring back to fig. 8, in some embodiments, the first determining module 401 may include: a speech recognition unit 4011, configured to perform speech recognition on the previous voice to obtain a first text recognition result; a semantic analysis unit 4012, configured to perform semantic analysis on the first text recognition result based on the historical rounds of dialog to obtain a first semantic analysis result; a second obtaining unit 4013, configured to obtain the reply content according to the first semantic analysis result; and a conversion unit 4014, configured to convert the reply content into speech to obtain the machine voice reply.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 9. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
FIG. 9 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 9, the electronic device includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the method of labeling voice dialogs in human-computer interaction and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the first device 100 or the second device 200, the input device 13 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input device 13 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 9, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of labeling voice dialogs in human-computer interaction according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
Program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of tagging a voice conversation in human-computer interaction according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present disclosure are merely examples and not limitations, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description only and is not intended to be limiting, since the disclosure is not limited to the specific details described above.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of annotating a voice dialog in a human-computer interaction, comprising:
determining a machine voice reply made by the human-computer interaction system aiming at the previous voice of the user;
determining emotional characteristics of the user in the current voice made by the machine voice reply;
determining a first satisfaction level of the user with respect to the machine voice reply based on the emotional feature;
if the current voice is the ending voice in the multiple rounds of conversations, determining at least one second satisfaction degree of the user in the historical round of conversations before the current round of conversations to which the current voice belongs, aiming at the machine voice reply output by the man-machine interaction system, wherein one machine voice reply corresponds to one voice of the user;
annotating the multiple turns of conversation based on the first satisfaction and the at least one second satisfaction.
2. The method of claim 1, wherein the determining the emotional characteristics of the user in the current voice made for the machine voice reply comprises:
determining voice parameters of the user in the current voice made by aiming at the machine voice reply;
determining the facial expression of the user in the current voice made by the machine voice reply;
and determining the emotional characteristics of the user in the current voice made by the machine voice reply based on the voice parameters and the facial expression.
3. The method of claim 2, wherein the determining the voice parameters of the user in the current voice made for the machine voice reply comprises:
determining a voice parameter component corresponding to each syllable of the current voice of the user by taking the syllable as a unit from the beginning of the detected starting time point of the current voice made by the user aiming at the machine voice reply;
and determining the voice parameters of the user aiming at the current voice made by the machine voice reply based on the voice parameter components obtained during the duration of the current voice.
4. The method of claim 2, wherein the determining the facial expression of the user at the time of the current voice made for the machine voice reply comprises:
acquiring a face image of the user captured when the user makes the current voice for the machine voice reply;
and inputting the face image into a first neural network obtained by pre-training, and outputting the facial expression corresponding to the face image through the first neural network.
5. The method of claim 2, wherein the determining a first satisfaction level of the user with respect to the machine voice reply based on the emotional feature comprises:
determining a first emotion score corresponding to the voice parameter;
determining a second emotion score corresponding to the facial expression;
and according to a preset mode, carrying out weighted summation on the first emotion score and the second emotion score to obtain the first satisfaction.
6. The method of any of claims 2-5, wherein said annotating the plurality of conversations based on the first satisfaction and the at least one second satisfaction comprises:
determining a comprehensive satisfaction of the multiple rounds of dialog based on the first satisfaction and the at least one second satisfaction;
and marking the comprehensive satisfaction degrees for the multiple rounds of conversations.
7. The method of any of claims 1-6, further comprising, after determining a first satisfaction level of the user with respect to the machine voice reply based on the emotional characteristic:
and marking the first satisfaction degree for the current round of dialogue.
8. An apparatus for annotating a voice conversation in a human-computer interaction, comprising:
the first determination module is used for determining a machine voice reply made by the man-machine interaction system aiming at the previous voice of the user;
the second determination module is used for determining the emotional characteristics of the user in the current voice made aiming at the machine voice reply;
a third determination module to determine a first satisfaction of the user with respect to the machine voice reply based on the emotional characteristic;
a fourth determining module, configured to determine, if the current voice is an end voice in multiple rounds of dialogs, at least one second satisfaction degree of the user with respect to a machine voice reply output by the human-computer interaction system in a history round of dialogs before the current round of dialogs to which the current voice belongs, where one machine voice reply corresponds to one voice of the user;
and the marking module is used for marking the multiple turns of conversations based on the first satisfaction and the at least one second satisfaction.
9. A computer-readable storage medium, in which a computer program is stored, the computer program being adapted to perform the method of annotating a voice dialog in human-computer interaction according to any one of claims 1-7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to implement the method of annotating a voice dialog in human-computer interaction according to any one of claims 1 to 7.
CN202111069995.3A 2021-09-13 2021-09-13 Method and device for marking voice conversation in man-machine interaction, equipment and medium Pending CN113808621A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111069995.3A CN113808621A (en) 2021-09-13 2021-09-13 Method and device for marking voice conversation in man-machine interaction, equipment and medium
PCT/CN2022/112490 WO2023035870A1 (en) 2021-09-13 2022-08-15 Method and apparatus for labeling speech dialogue during human-computer interaction, and device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111069995.3A CN113808621A (en) 2021-09-13 2021-09-13 Method and device for marking voice conversation in man-machine interaction, equipment and medium

Publications (1)

Publication Number Publication Date
CN113808621A true CN113808621A (en) 2021-12-17

Family

ID=78941020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111069995.3A Pending CN113808621A (en) 2021-09-13 2021-09-13 Method and device for marking voice conversation in man-machine interaction, equipment and medium

Country Status (2)

Country Link
CN (1) CN113808621A (en)
WO (1) WO2023035870A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023035870A1 (en) * 2021-09-13 2023-03-16 地平线(上海)人工智能技术有限公司 Method and apparatus for labeling speech dialogue during human-computer interaction, and device and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501592B (en) * 2023-06-19 2023-09-19 阿里巴巴(中国)有限公司 Man-machine interaction data processing method and server
CN116775850A (en) * 2023-08-24 2023-09-19 北京珊瑚礁科技有限公司 Chat model training method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2375589A1 (en) * 2002-03-08 2003-09-08 Diaphonics, Inc. Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system
JP2011210133A (en) * 2010-03-30 2011-10-20 Seiko Epson Corp Satisfaction degree calculation method, satisfaction degree calculation device and program
CN105654250A (en) * 2016-02-01 2016-06-08 百度在线网络技术(北京)有限公司 Method and device for automatically assessing satisfaction degree
CN108388926A (en) * 2018-03-15 2018-08-10 百度在线网络技术(北京)有限公司 The determination method and apparatus of interactive voice satisfaction
CN109036405A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10896428B1 (en) * 2017-12-14 2021-01-19 Amazon Technologies, Inc. Dynamic speech to text analysis and contact processing using agent and customer sentiments
CN108255307A (en) * 2018-02-08 2018-07-06 竹间智能科技(上海)有限公司 Man-machine interaction method, system based on multi-modal mood and face's Attribute Recognition
CN109308466A (en) * 2018-09-18 2019-02-05 宁波众鑫网络科技股份有限公司 The method that a kind of pair of interactive language carries out Emotion identification
CN111883127A (en) * 2020-07-29 2020-11-03 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech
CN112562641B (en) * 2020-12-02 2023-09-29 北京百度网讯科技有限公司 Voice interaction satisfaction evaluation method, device, equipment and storage medium
CN113434647B (en) * 2021-06-18 2024-01-12 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium
CN113808621A (en) * 2021-09-13 2021-12-17 地平线(上海)人工智能技术有限公司 Method and device for marking voice conversation in man-machine interaction, equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2375589A1 (en) * 2002-03-08 2003-09-08 Diaphonics, Inc. Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system
JP2011210133A (en) * 2010-03-30 2011-10-20 Seiko Epson Corp Satisfaction degree calculation method, satisfaction degree calculation device and program
CN105654250A (en) * 2016-02-01 2016-06-08 百度在线网络技术(北京)有限公司 Method and device for automatically assessing satisfaction degree
CN108388926A (en) * 2018-03-15 2018-08-10 百度在线网络技术(北京)有限公司 The determination method and apparatus of interactive voice satisfaction
CN109036405A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023035870A1 (en) 2023-03-16

Similar Documents

Publication Publication Date Title
CN107657017B (en) Method and apparatus for providing voice service
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
CN113808621A (en) Method and device for marking voice conversation in man-machine interaction, equipment and medium
TWI425500B (en) Indexing digitized speech with words represented in the digitized speech
CN109686383B (en) Voice analysis method, device and storage medium
JP2021533397A (en) Speaker dialification using speaker embedding and a trained generative model
US11164584B2 (en) System and method for uninterrupted application awakening and speech recognition
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN110570853A (en) Intention recognition method and device based on voice data
CN114038457B (en) Method, electronic device, storage medium, and program for voice wakeup
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
JP2019020684A (en) Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
US11615787B2 (en) Dialogue system and method of controlling the same
CN113362828A (en) Method and apparatus for recognizing speech
CN114420169B (en) Emotion recognition method and device and robot
KR20150065523A (en) Method and apparatus for providing counseling dialogue using counseling information
CN112071310A (en) Speech recognition method and apparatus, electronic device, and storage medium
CN111400463B (en) Dialogue response method, device, equipment and medium
CN111949778A (en) Intelligent voice conversation method and device based on user emotion and electronic equipment
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN108962226B (en) Method and apparatus for detecting end point of voice
CN108538292B (en) Voice recognition method, device, equipment and readable storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN114201595A (en) Sentence recommendation method and device in conversation, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination