CN112767916A - Voice interaction method, device, equipment, medium and product of intelligent voice equipment - Google Patents


Info

Publication number
CN112767916A
CN112767916A
Authority
CN
China
Prior art keywords
target
voice
response
response result
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110164588.4A
Other languages
Chinese (zh)
Other versions
CN112767916B (en)
Inventor
熊志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110164588.4A priority Critical patent/CN112767916B/en
Publication of CN112767916A publication Critical patent/CN112767916A/en
Application granted granted Critical
Publication of CN112767916B publication Critical patent/CN112767916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Abstract

The present disclosure provides a voice interaction method, apparatus, device, medium and product for an intelligent voice device, relating to the field of computer technology and in particular to the fields of artificial intelligence and voice interaction. The specific implementation scheme is as follows: acquiring a target voice directed at a currently triggered target voice application, and predicting the response satisfaction of the target voice application for the target voice; if the response satisfaction is determined not to meet a preset threshold condition, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice; and determining a target response result among the response results, and performing user feedback according to the target response result. The disclosed scheme solves the problems that, for a target voice triggering a target voice application, the response result of the intelligent voice device is monotonous and its degree of intelligence is low; it provides a new voice interaction mode, improves voice interaction efficiency, and raises the degree of intelligence of the intelligent voice device.

Description

Voice interaction method, device, equipment, medium and product of intelligent voice equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular, but not exclusively, to a voice interaction method, apparatus, device, medium and product for an intelligent voice device.
Background
With the continuous development of intelligent voice devices such as smart speakers, mobile-phone voice assistants and vehicle-mounted voice systems, great convenience has been brought to people's lives. Through an intelligent voice device, people can check the weather, listen to music or the radio, look up information, shop, and so on.
How to improve the voice interaction capability of intelligent voice devices is therefore a key issue of concern in the industry.
Disclosure of Invention
The present disclosure provides a voice interaction method, apparatus, device, medium and product for an intelligent voice device.
According to an aspect of the present disclosure, a voice interaction method of an intelligent voice device is provided, including:
acquiring target voice pointing to a currently triggered target voice application program, and predicting the response satisfaction degree of the target voice application program to the target voice;
if the response satisfaction is determined not to meet the preset threshold condition, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice;
and determining a target response result in each response result, and performing user feedback according to the target response result.
According to another aspect of the present disclosure, there is provided a voice interaction apparatus of an intelligent voice device, including:
the response satisfaction predicting module is used for acquiring target voice pointing to a currently triggered target voice application program and predicting the response satisfaction of the target voice application program to the target voice;
a response result generation module, configured to generate at least one response result corresponding to the target voice according to a voice feature and/or a scene feature of the target voice if it is determined that the response satisfaction does not meet a preset threshold condition;
and the target response result determining module is used for determining a target response result in each response result and performing user feedback according to the target response result.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a voice interaction method of a smart voice device according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a voice interaction method of an intelligent voice device according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a voice interaction method of an intelligent voice device according to any one of the embodiments of the present disclosure.
According to the technical scheme disclosed by the invention, a new voice interaction method of the intelligent voice equipment is provided.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a voice interaction method of an intelligent voice device according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another method of voice interaction for a smart voice device in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a voice interaction method of a smart voice device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a voice interaction method of a smart voice device according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a voice interaction apparatus of an intelligent voice device according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a voice interaction method of a smart voice device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Before describing the embodiments of the present disclosure in detail, it should be noted that an intelligent voice device may be loaded with system applications such as weather or music, and may also be loaded with third-party voice applications. In general, system applications can meet users' needs and carry out efficient voice interaction with them. For a third-party voice application, however, because the third-party developer has collected few samples, the application's response results to the user's voice are poor. How to improve the performance of third-party voice applications, and thereby raise the degree of intelligence of the intelligent voice device, is therefore a key issue of concern in the industry.
Fig. 1 is a schematic diagram of a voice interaction method of an intelligent voice device according to an embodiment of the present disclosure. This embodiment is applicable to the case of optimizing the voice interaction manner of an intelligent voice device. The method may be executed by a voice interaction apparatus of the intelligent voice device, which may be implemented in software and/or hardware and integrated in an electronic device; the electronic device in this embodiment may be a smart speaker, a computer, a smart phone, a smart watch, a tablet computer, or the like. Specifically, referring to fig. 1, the method includes the following steps:
s110, obtaining target voice pointing to the currently triggered target voice application program, and predicting the response satisfaction degree of the target voice application program to the target voice.
In this embodiment, the intelligent voice device may be a smart speaker, a voice dialog system embedded in a smart electronic device, or a vehicle-mounted voice system, which is not limited in this embodiment.
The target voice application program can be a third-party application program installed in the intelligent voice device; for example, the target voice application may be a teaching application, a query application, or a consultation application installed in the smart voice device, which is not limited in this embodiment.
It is understood that the target voice pointing to the currently triggered target voice application specifically refers to any voice information received by the intelligent voice device while the target voice application in the intelligent voice device is in the triggered state.
For example, the target voice may be voice data such as "start cooking", "what is the weather today" or "go to the shopping mall"; this embodiment is not limited in this respect.
It should be noted that the intelligent voice device can be loaded with system voice applications (e.g., a weather application or a music playing application) and third-party voice applications. In existing implementations, when no third-party voice application is in the triggered state, each time the intelligent voice device receives a voice input from the user it can obtain the response result and the corresponding response score of each system voice application for that input, and feed back to the user the response result with the highest score. However, once the user invokes (triggers) one of the third-party voice applications with a set voice command, any voice input subsequently received by the intelligent voice device is responded to only by that third-party voice application.
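The existing dispatch behaviour described above can be sketched roughly as follows. This is an illustrative assumption, not the patent's implementation: all function names are hypothetical, and each "application" is modelled as a callable for brevity.

```python
# Rough sketch of the dispatch behaviour: when no third-party app is
# triggered, every system app scores the utterance and the highest-scoring
# response wins; once a third-party app is triggered, it alone responds.

def dispatch(utterance, system_apps, triggered_third_party=None):
    """Return the response the device would feed back to the user."""
    if triggered_third_party is not None:
        # A triggered third-party app exclusively handles the voice input.
        return triggered_third_party(utterance)
    # Otherwise each system app proposes (response, score); pick the best.
    candidates = [app(utterance) for app in system_apps]
    best_response, _ = max(candidates, key=lambda pair: pair[1])
    return best_response

# Toy stand-ins for system voice applications.
weather_app = lambda u: (f"Weather answer to: {u}", 0.9 if "weather" in u else 0.1)
music_app = lambda u: (f"Music answer to: {u}", 0.9 if "music" in u else 0.1)
third_party = lambda u: f"third-party response to {u}"
```

With these stand-ins, `dispatch("play some music", [weather_app, music_app])` returns the music app's response, while passing `triggered_third_party=third_party` bypasses the system apps entirely.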
In a specific example, when the intelligent voice device receives "open application A" from the user, it may accordingly trigger the internally installed third-party application A, which then responds to the user's subsequent voice inputs.
In an optional implementation manner of the embodiment, when the target voice directed to the target voice application is received, the response satisfaction degree of the target voice application to the target voice may be predicted before the target voice is sent to the target voice application.
The response satisfaction of the target voice application for the target voice may be a binary label (satisfied or unsatisfied), or any value between 0 and 1, or any value between 0 and 100; this embodiment is not limited in this respect. Illustratively, the prediction result of the response satisfaction of the target voice application for the target voice may be "satisfied", 0.98, or 98.
In an optional implementation manner of this embodiment, after the target speech pointing to the currently triggered target speech application is acquired, the acquired target speech may be input into the response satisfaction prediction model, so as to output a response satisfaction prediction result of the target speech application on the target speech.
In another optional implementation manner of this embodiment, after the target speech pointing to the currently triggered target speech application program is acquired, the acquired feature vector of the target speech may be compared with feature vectors of voices collected in advance, so as to determine a result of predicting the response satisfaction of the target speech application program to the target speech. For example, if the feature vector of the target speech is the same as the feature vector of one reference speech collected in advance, the response satisfaction of the target application program to the reference speech may be used as the response satisfaction prediction result of the target speech application program to the target speech.
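The feature-vector comparison just described can be sketched as follows. All names, the cosine similarity measure and the neutral fallback score are illustrative assumptions; the disclosure only requires that a matching reference voice's known satisfaction be reused.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def predict_satisfaction(target_vec, reference_bank, fallback=0.5):
    """reference_bank: list of (feature_vector, known_satisfaction) pairs.

    If the target's feature vector (near-)exactly matches a pre-collected
    reference voice, reuse that voice's satisfaction as the prediction;
    otherwise fall back to a neutral score (an assumption of this sketch).
    """
    best_sim, best_sat = 0.0, None
    for ref_vec, sat in reference_bank:
        sim = cosine(target_vec, ref_vec)
        if sim > best_sim:
            best_sim, best_sat = sim, sat
    return best_sat if best_sim >= 0.999 else fallback
```

In practice the feature vectors would come from a speech or text encoder; the sketch leaves that encoder out.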
And S120, if the response satisfaction is determined not to meet the preset threshold condition, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice.
The threshold condition may be the categorical condition "satisfied", or a numeric threshold (for example, 0.6, 0.8 or 60); this embodiment is not limited in this respect. For example, if the threshold condition is "satisfied" and the predicted response satisfaction of the target voice application for the target voice is "unsatisfied", it may be determined that the predicted response satisfaction does not meet the preset threshold condition; likewise, if the threshold condition is 0.8 and the predicted response satisfaction is 0.7, it may be determined that the predicted response satisfaction does not meet the preset threshold condition.
In an optional implementation of this embodiment, after the response satisfaction of the target voice application for the target voice is predicted, the response satisfaction may be compared against the preset threshold condition; if it is determined that the response satisfaction does not meet the preset threshold condition, at least one response result corresponding to the target voice may be generated according to the acquired voice feature of the target voice, its scene feature, or both.
The voice feature of the target voice may be the semantic understanding result corresponding to the target voice; the scene feature of the target voice may be a chat scene or an instruction scene. For example, if the user is chatting with the intelligent voice device, the scene feature of the target voice is the chat scene; if the user issues an instruction to the intelligent voice device, the scene feature of the target voice is the instruction scene. The instruction may be, for example, a query instruction or an instruction to turn on another Internet-of-Things device (such as a lamp or a television); this embodiment is not limited in this respect.
In an example of this embodiment, if the acquired target voice is "play music" while the currently triggered target voice application is a game application, and the game application's response satisfaction for "play music" is predicted to be "unsatisfied", i.e. the response satisfaction does not meet the preset threshold, a clarification response such as "Can you tell me what you mean?" may be generated; this response result is generated according to the semantic understanding result and the scene feature of "play music".
It should be noted that the response result generated in this embodiment is not necessarily unique; multiple response results may be generated at the same time. For example, the response results generated in the above example may also be: "I just did not hear that, please say it again", or "I could not understand; you could try another question", and so on.
And S130, determining a target response result in the response results, and performing user feedback according to the target response result.
In an optional implementation of this embodiment, after multiple response results corresponding to the target voice are generated, the response results may be ranked so as to select the target response result, which is then fed back to the user.
In an example of this embodiment, the multiple generated response results may be prioritized or scored; the response result with the highest priority or highest score is determined as the target response result, which is then fed back to the user while the device waits to receive the user's next voice data.
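The ranking step above amounts to a simple argmax over scored candidates. A minimal sketch, with hypothetical names and example scores:

```python
def pick_target_response(candidates):
    """candidates: list of (response_text, score) pairs.

    Returns the response with the highest score, to be fed back
    to the user as the target response result.
    """
    if not candidates:
        raise ValueError("no response results generated")
    return max(candidates, key=lambda pair: pair[1])[0]

# Example candidate response results with illustrative scores.
candidates = [
    ("I did not hear that, please say it again.", 0.62),
    ("I could not understand; you could try another question.", 0.81),
]
```

If priorities rather than scores are used, the same function applies with the priority as the second element of each pair.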
In another optional implementation manner of this embodiment, when the target response result does not satisfy the condition, the target response result may also be directly fed back to the target voice application program, so as to respond to the target voice through the target voice application program.
According to the scheme of this embodiment, by acquiring the target voice pointing to the currently triggered target voice application and predicting the response satisfaction of the target voice application for the target voice, the response satisfaction can be predicted without sending the target voice to the target voice application. If the response satisfaction is determined not to meet the preset threshold condition, at least one response result corresponding to the target voice is generated according to the voice feature and/or the scene feature of the target voice, which enriches the response results and improves the user's interaction experience with the intelligent voice device. A target response result is then determined among the response results and user feedback is performed accordingly. This solves the problems that, for a target voice triggering a target voice application, the intelligent voice device's response result is monotonous and its degree of intelligence is low; it provides a new voice interaction mode, improves voice interaction efficiency, and raises the degree of intelligence of the intelligent voice device.
Fig. 2 is a schematic diagram of another speech interaction method of an intelligent speech device according to an embodiment of the present disclosure, where this embodiment is a further refinement of the above technical solution, and the technical solution in this embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 2, the voice interaction method of the intelligent voice device includes the following steps:
s210, target voice pointing to the currently triggered target voice application program is obtained.
S220, inputting a text recognition result of the target voice and an application identifier of a target voice application program into a pre-trained online response satisfaction prediction model; and obtaining the response satisfaction output by the online response satisfaction prediction model.
Optionally, after the target speech pointing to the currently triggered target speech application program is obtained, the target speech may be input to the natural language processing module, so as to obtain a text recognition result of the target speech.
Further, the text recognition result of the target voice and the application identifier of the target voice application can be input into the pre-trained online response satisfaction prediction model, so as to obtain the response satisfaction output by the online response satisfaction prediction model.
The application identifier of each target voice application is unique; that is, each third-party application has unique identification information within the intelligent voice device. For example, if the intelligent voice device contains three third-party applications, their application identifiers may be 001, 002 and 003, respectively.
In an optional implementation of this embodiment, before the text recognition result of the target voice and the application identifier of the target voice application are input into the pre-trained online response satisfaction prediction model, a plurality of sample data to be labeled may first be acquired, where each sample datum to be labeled includes: a user input text, context information of the user input text, and an application identifier. Each sample datum to be labeled is input into a pre-trained offline response satisfaction prediction model; the response satisfaction output by the offline response satisfaction prediction model is acquired; a plurality of training samples are constructed from the user input texts, the application identifiers and the response satisfactions; and a preset machine learning model is trained with these training samples to obtain the online response satisfaction prediction model.
Optionally, before inputting the text recognition result of the target speech and the application identifier of the target speech application program into the pre-trained online response satisfaction prediction model, the online response satisfaction prediction model can be obtained through training; optionally, training the online response satisfaction prediction model may include: the method comprises the steps of obtaining a plurality of user input texts to be labeled, context information corresponding to each input text and application identifications of all target voice application programs in the intelligent voice equipment.
Further, the acquired user input texts to be labeled, the context information corresponding to each input text, and the application identifiers of all target voice applications in the intelligent voice device can be input into the pre-trained offline response satisfaction prediction model, so as to obtain the response satisfaction output by the offline response satisfaction prediction model.
Further, according to the user input text, the application identification and the response satisfaction output by the off-line response satisfaction prediction model, a plurality of training samples are constructed; illustratively, a piece of user input text, application identification, and response satisfaction may be combined into a training sample.
Furthermore, each constructed training sample can be used to train a preset machine learning model, thereby obtaining the online response satisfaction prediction model. The preset machine learning model may be a Transformer model or another natural language processing model, which is not limited in this embodiment.
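The sample-construction step described above can be sketched as follows. The dict keys, the stand-in offline model and the triple layout are illustrative assumptions; the disclosure only specifies that each training sample combines a user input text, an application identifier and the offline model's response satisfaction.

```python
def build_training_samples(raw_samples, offline_model):
    """raw_samples: list of dicts with 'text', 'context' and 'app_id'.

    The offline response satisfaction prediction model labels each sample;
    each (text, app_id, satisfaction) triple becomes one training sample
    for the online model.
    """
    samples = []
    for s in raw_samples:
        satisfaction = offline_model(s["text"], s["context"], s["app_id"])
        samples.append((s["text"], s["app_id"], satisfaction))
    return samples

# Stand-in offline model: in-domain queries score higher (toy heuristic).
offline_model = lambda text, ctx, app: 0.9 if "music" in text else 0.3

raw = [
    {"text": "play music", "context": [], "app_id": "001"},
    {"text": "tell a joke", "context": [], "app_id": "002"},
]
```

The resulting triples would then be fed to whatever training loop the preset machine learning model (e.g. a Transformer classifier) uses; that loop is omitted here.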
The advantage of this arrangement is that an online response satisfaction prediction model can be trained from a plurality of user input texts to be labeled, the context information corresponding to each input text, and the application identifiers of all target voice applications in the intelligent voice device; the response satisfaction of a target voice application for a target voice can thus be predicted accurately, and the execution efficiency of the algorithm is improved.
And S230, if the response satisfaction is determined not to meet the preset threshold condition, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice.
S240, if the response satisfaction is determined to meet the preset threshold condition, providing the text recognition result of the target voice to the target voice application program; and carrying out user feedback on the response result provided by the target voice application program.
In another optional implementation of this embodiment, after the response satisfaction of the target voice application for the target voice is predicted, the response satisfaction may be compared against the preset threshold condition; if it is determined that the response satisfaction meets the preset threshold condition, the text recognition result of the target voice may be provided directly to the target voice application, and the response result provided by the target voice application is fed back to the user.
In an example of this embodiment, if the acquired target voice is "play music" and the currently triggered target voice application is a music playing application, and the music playing application's response satisfaction for "play music" is predicted to be "satisfied", i.e. the response satisfaction meets the preset threshold, the text recognition result of "play music" may be provided directly to the music playing application, and the response result provided by the music playing application is fed back to the user.
The advantage of this arrangement is that when the response satisfaction is determined to meet the preset threshold condition, the target voice can be provided directly to the target voice application, which supplies the response result; this further improves the efficiency of voice interaction between the user and the intelligent voice device.
And S250, determining a target response result in the response results, and feeding back the user according to the target response result.
According to the scheme of this embodiment, the text recognition result of the target voice and the application identifier of the target voice application are input into the pre-trained online response satisfaction prediction model, and the response satisfaction output by that model is obtained. The response satisfaction of the target voice application for the target voice can thus be predicted rapidly, further improving voice interaction efficiency and providing a basis for raising the degree of intelligence of the intelligent voice device.
Fig. 3 is a schematic diagram of a voice interaction method of another intelligent voice device according to an embodiment of the present disclosure, where this embodiment is a further refinement of the above technical solution, and the technical solution in this embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 3, the voice interaction method of the intelligent voice device includes the following steps:
s310, obtaining the target voice pointing to the currently triggered target voice application program, and predicting the response satisfaction degree of the target voice application program to the target voice.
And S320, if the response satisfaction is determined not to meet the preset threshold condition, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice.
Optionally, generating at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice may include the following operations, which may be implemented independently or in combination, and may be performed serially or in parallel in a predetermined order; that is, neither the order of implementation nor the combination is limited. The specific operations are as follows:
s321, obtaining a voice confidence coefficient matched with the target voice; and if the voice confidence coefficient does not meet the preset threshold condition, generating a response result for requesting the user to input again.
The voice confidence matching the target voice may represent the intelligibility of the target voice and may be any value between 0 and 1, which is not limited in this embodiment. It should be noted that the greater the voice confidence, the higher the clarity of the target voice and the greater the probability that the target voice application can understand it. In this embodiment, the preset threshold may be a value such as 0.6, 0.7 or 0.8, which is not limited in this embodiment.
It should be noted that, in this embodiment, the clarity (voice confidence) of the target voice may be determined by a natural language processing model: after the target voice is acquired, it may be input into the natural language processing model, which interprets the target voice and thereby determines its confidence. The natural language processing model is a mature speech processing module that is not described in detail here and is not limited in this embodiment.
In an optional implementation manner of this embodiment, generating at least one response result corresponding to the target speech according to the speech feature and/or the scene feature of the target speech may include: obtaining a voice confidence matching the target voice to determine whether the target voice is clear; and when the voice confidence does not meet the preset threshold condition, that is, when the target voice is determined to be unclear, generating a response result requesting the user to input again. The response result requesting the user to re-input may be, for example, "I didn't hear that clearly; could you say it again?"
The advantage of this arrangement is that when the voice confidence of the target voice does not meet the preset condition, a response result requesting the user to input again can be generated directly, preventing an erroneous response caused by unclear target voice and improving the accuracy and efficiency of voice interaction.
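Step S321 can be sketched as follows. This is a minimal illustrative sketch, not the disclosed implementation; the function name, the 0.7 default threshold, and the wording are assumptions.

```python
def handle_low_confidence(speech_confidence, threshold=0.7):
    """If the voice confidence (intelligibility, a value in [0, 1]) does not
    meet the preset threshold, return a response asking the user to repeat
    the input; otherwise return None so other response paths can apply."""
    if speech_confidence < threshold:
        return "I didn't hear that clearly; could you say it again?"
    return None

# A confidence of 0.4 falls below the illustrative 0.7 threshold
print(handle_low_confidence(0.4))
```

The threshold would in practice be the preset value (e.g., 0.6, 0.7, or 0.8) mentioned above.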
S322, detect whether a target change candidate result matching the target voice exists; if so, generate a response result requesting the user to confirm the target change candidate result.
Wherein the target change candidate result is a voice text related to the target voice. For example, if the target voice is "I want to make a dish", the matching target change candidate result may be "start making a dish" or "start cooking", which is not limited in this embodiment.
In an optional implementation manner of this embodiment, after the response satisfaction of the target speech application for the target speech is obtained through prediction, if the response satisfaction does not satisfy the preset condition, it may be further detected whether there is a target change candidate result matching the target speech.
It should be noted that, in this embodiment, a change candidate result matching the target speech may be generated by a change module; for example, the target speech may be input into the change module to obtain change candidate results matching the target speech. The change module is a speech processing module, which is not described here in detail and is not limited in this embodiment.
In an optional implementation manner of this embodiment, if a target change candidate result matching the target voice is detected, a response result requesting the user to confirm the target change candidate result may be generated. For example, if the target voice is "I want to make a dish" and the matching target change candidate result is detected as "start making a dish", the response result "Did you mean to say: start making a dish?" may be generated.
The advantage of this arrangement is that a change candidate result can be detected, and the response result requesting the user to confirm the change candidate provides a basis for accurately determining the user's intention, improving the accuracy and efficiency of voice interaction.
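Step S322 can be sketched as a lookup against candidates produced by the change module. The candidate table below is a hypothetical stand-in for that module's output; names and phrasing are illustrative assumptions.

```python
def confirm_candidate_response(target_speech, candidate_table):
    """Look up a change candidate for the target speech; if one exists,
    return a response asking the user to confirm it (step S322)."""
    candidate = candidate_table.get(target_speech)
    if candidate is None:
        return None
    return 'Did you mean to say "{}"?'.format(candidate)

# Hypothetical candidate table standing in for the change module
table = {"I want to make a dish": "start making a dish"}
print(confirm_candidate_response("I want to make a dish", table))
```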
S323, obtain the response result score of each associated voice application for the target voice; if there is a target associated voice application whose response result score meets the preset threshold condition, determine the current interaction state according to the historical interaction record matching the target voice; if the current state is not a strong interaction state, generate a response result invoking the target associated voice application; and if the current state is a strong interaction state, generate a response result requesting the user to confirm invoking the target associated voice application.
The associated voice application may be a system application on the intelligent voice device, such as a pre-loaded weather query application, music playing program, or translation program, which is not limited in this embodiment. It should be noted that, in this embodiment, the associated voice application is an application that comes loaded on the intelligent voice device from the factory, whereas the target voice application may be a third-party application installed by the user. Typically, the speech understanding capability of an associated voice application may be higher than that of the target voice application.
In an optional implementation manner of this embodiment, after the response satisfaction of the target voice application program to the target voice is obtained through prediction, if the response satisfaction does not satisfy the preset condition, a response result score of each associated voice application program in the intelligent voice device to the target voice may be obtained, and if it is determined that there is a target associated voice application program whose response result score satisfies the preset threshold condition, a current interaction state may be determined according to a historical interaction record matched with the target voice; the current interaction state may be a strong interaction state or a non-strong interaction state, which is not limited in this embodiment.
Wherein the strong interaction state may be determined as follows: if the number of interactions between the user and the target voice application within a set time (e.g., 30 seconds or 1 minute) is multiple (e.g., 2, 3, or 5), it may be determined that the current state is a strong interaction state.
Further, if it is determined that the current state is a strong interaction state, a response result requesting the user to confirm invoking the target associated voice application may be generated; if it is determined that the current state is not a strong interaction state, a response result invoking the target associated voice application may be generated.
It should be noted that if the user has performed multiple rounds of voice interaction with the target voice application, directly jumping out of the dialog to the target associated application will give the user a poor dialog experience. In this embodiment, when it is determined that the user has performed multiple rounds of voice interaction with the target voice application, a response result requesting the user to confirm invoking the target associated voice application may be generated, and whether to transfer the dialog to the target associated voice application may be determined according to the user's selection. This enhances the intelligence of the intelligent voice device and improves user satisfaction with the intelligent device.
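The routing logic of step S323 can be sketched as below. This is an illustrative sketch under stated assumptions: the score dictionary, the threshold, and the two-turn definition of "strong interaction" are all placeholders, and the returned tuples merely label the two response kinds.

```python
def route_to_associated_app(scores, score_threshold, recent_turns,
                            strong_turns=2):
    """Step S323 sketch: pick the best-scoring associated application; if it
    beats the threshold, either invoke it directly (strong interrupt) or,
    when the user is in a strong interaction state, ask for confirmation
    first (query interrupt)."""
    if not scores:
        return None
    best_app = max(scores, key=scores.get)
    if scores[best_app] < score_threshold:
        return None  # no associated application can answer well enough
    if recent_turns >= strong_turns:  # strong interaction state
        return ("ask_confirm", best_app)
    return ("invoke", best_app)

# Three recent turns with the current app -> ask before switching
print(route_to_associated_app({"music": 0.9, "weather": 0.3}, 0.8, 3))
```

The confirmation branch acts as the fault-tolerant mechanism described later for the "query interrupt" response.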
S324, obtain a scene type matching the target voice; and generate a response result matching the scene type that requests the user to rephrase the question.
The scene type may include: the instruction class or the chat class, which is not limited in this embodiment.
In an optional implementation manner of this embodiment, after the response satisfaction of the target speech application for the target speech is obtained through prediction, if the response satisfaction does not satisfy the preset condition, a scene type matching the target speech may be further obtained, and a response result matching the scene type that requests the user to rephrase the question may be generated.
For example, if the scene type of the target voice is the instruction class, the response "The current target voice application does not support this instruction yet; please try another instruction" may be generated; if the scene type of the target voice is the chat class, the response "The current target voice application does not support chat; please ask me something else" may be generated.
The advantage of this arrangement is that a fallback (bottom-of-the-pocket) response result can always be generated, further enhancing the intelligence of the intelligent voice device and improving user satisfaction with the intelligent device.
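Step S324 can be sketched as a table of scene-specific wordings with a generic default. The wordings and scene-type keys are illustrative assumptions.

```python
FALLBACK_WORDINGS = {
    "instruction": "The current application does not support this "
                   "instruction yet; please try another instruction.",
    "chat": "The current application does not support chat; "
            "please ask me something else.",
}

def fallback_response(scene_type):
    """Step S324 sketch: return a fallback response matched to the scene
    type, asking the user to rephrase; unknown types get a generic wording."""
    return FALLBACK_WORDINGS.get(
        scene_type, "I didn't understand; you could try rephrasing.")

print(fallback_response("chat"))
```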
S330, determine a target response result among the response results, and provide feedback to the user according to the target response result.
According to the scheme of this embodiment, when it is determined that the response satisfaction does not meet the preset threshold, multiple response results corresponding to the target voice can be generated according to the voice feature and/or the scene feature of the target voice, so that the target voice can be responded to at different levels. This enhances the intelligence of the intelligent voice device and improves user satisfaction with the intelligent device.
Fig. 4 is a schematic diagram of still another voice interaction method of an intelligent voice device according to an embodiment of the present disclosure, where this embodiment is a further refinement of the above technical solution, and the technical solution in this embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 4, the voice interaction method of the intelligent voice device includes the following steps:
S410, acquire the target voice pointing to the currently triggered target voice application, and predict the response satisfaction of the target voice application for the target voice.
S420, if it is determined that the response satisfaction does not meet the preset threshold condition, generate at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice.
S430, if multiple response results are generated, sort the response results according to a preset priority order; and obtain the target response result according to the sorting result.
In an optional implementation manner of this embodiment, after generating multiple response results corresponding to the target speech, the response results may be further sorted according to a preset priority order, so as to obtain the target response result. The preset priority order is not fixed and can be set according to different scenes.
In an example of this embodiment, the preset priority order may be: the response result requesting the user to re-input, the response result requesting the user to confirm the target change candidate result, the response result invoking the target associated voice application, and the response result requesting the user to rephrase the question.
In another example of this embodiment, the preset priority order may be: the response result requesting the user to re-input, the response result requesting the user to confirm the target change candidate result, the response result requesting the user to confirm invoking the target associated voice application, and the response result requesting the user to rephrase the question.
The advantage of this arrangement is that the response result with the highest priority, that is, the one expected to yield the highest user satisfaction, can be determined, providing a basis for enhancing the intelligence of the intelligent voice device and improving user satisfaction with the intelligent device.
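The sorting of step S430 can be sketched as picking the generated result whose kind appears earliest in the priority list. The label names below are illustrative assumptions standing in for the four response kinds.

```python
# Priority keys follow the first example order in the text; names are
# illustrative labels for the four response kinds.
PRIORITY = ["re_input", "confirm_candidate", "invoke_associated",
            "rephrase_question"]

def pick_target_response(results):
    """Return the generated result whose kind has the highest priority
    (lowest index in PRIORITY)."""
    return min(results, key=PRIORITY.index)

generated = ["rephrase_question", "confirm_candidate"]
print(pick_target_response(generated))
```

Since the preset order is not fixed, PRIORITY would simply be swapped per scene.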
S440, determine whether the number of times the target response result has been continuously fed back reaches a set threshold; if so, provide the text recognition result of the target voice to the target voice application, and feed back to the user the response result provided by the target voice application.
The set threshold may be 3, 4, or 5 times, and the like, which is not limited in this embodiment.
In an optional implementation manner of this embodiment, if it is determined that the number of times of feeding back the target response result has reached the set threshold (e.g., 3 times), that is, the target response result has been fed back 3 times continuously, the text recognition result of the target speech may be provided to the target speech application program to respond to the target speech through the target speech application program, and the response result of the target speech application program is fed back to the user.
For example, if the target response result "Sorry, please say it again" has been continuously fed back 5 times for the target speech, then, to prevent the poor dialog experience caused by entering an endless loop, the text recognition result of the target speech may at this point be provided to the target speech application, which responds to the target speech.
According to the scheme of this embodiment, when it is determined that the number of times the target response result has been continuously fed back reaches the set threshold, the text recognition result of the target voice can be provided to the target voice application, and the response result provided by the target voice application can be fed back to the user. This avoids the poor dialog experience caused by the computer program product entering an endless loop, and further improves the intelligence of the intelligent voice device.
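The repeat counter of step S440 can be sketched as a small guard object. The class name and the 3-repeat default are illustrative assumptions.

```python
class FeedbackGuard:
    """Step S440 sketch: if the same target response has been fed back a set
    number of consecutive times, stop looping and hand the text recognition
    result over to the target voice application."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.last_response = None
        self.count = 0

    def should_delegate(self, response):
        # Count consecutive repeats of the same response; reset otherwise.
        if response == self.last_response:
            self.count += 1
        else:
            self.last_response = response
            self.count = 1
        return self.count >= self.max_repeats

guard = FeedbackGuard(max_repeats=3)
results = [guard.should_delegate("please say it again") for _ in range(3)]
print(results)  # the third consecutive repeat triggers delegation
```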
In order to help those skilled in the art better understand the voice interaction method related to the present disclosure, the following describes the present disclosure with a specific example, which mainly includes the following three stages:
first stage, response satisfaction assessment.
For online satisfaction prediction, the present disclosure models it as a regression problem: given the target speech, the response satisfaction is predicted, where a lower satisfaction indicates a greater likelihood that the user will not be satisfied.
In the present embodiment, a detailed description is given at the level of features, model, and samples.
1. Features.
In an optional implementation manner of this embodiment, two features are selected: the application identifier of the target application and the text recognition result of the voice data.
2. Model.

In an alternative implementation of this embodiment, a Transformer model may be employed.
3. Samples.
In this embodiment, a plurality of sample data to be labeled are obtained, each including a user input text, the context information of the user input text, and an application identifier. Each sample datum to be labeled is input into a pre-trained offline response satisfaction prediction model; the response satisfaction output by the offline response satisfaction prediction model is acquired; and a plurality of training samples are constructed from the user input text, the application identifier, and the response satisfaction.
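The sample-construction step can be sketched as follows. The offline model is represented by a stub callable; the field names and the fixed 0.5 score are illustrative assumptions, not the disclosed implementation.

```python
def build_training_samples(raw_samples, offline_model):
    """Label each raw sample (user text, context, app identifier) with the
    offline satisfaction model, producing (text, app_id, satisfaction)
    triples used to train the online model."""
    triples = []
    for sample in raw_samples:
        satisfaction = offline_model(
            sample["text"], sample["context"], sample["app_id"])
        triples.append((sample["text"], sample["app_id"], satisfaction))
    return triples

# Stub offline model returning a fixed score, for illustration only
stub_model = lambda text, context, app_id: 0.5
raw = [{"text": "play music", "context": [], "app_id": "learning_app"}]
print(build_training_samples(raw, stub_model))
```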
And in the second stage, generating a response result.
In an optional implementation manner of this embodiment, a plurality of different response results may be generated, and the generation condition, the applicable scenario, and the presentation effect of each response result are different.
Illustratively, for the response result "speak again":
the generation conditions are as follows: the speech confidence is lower than a threshold (here, the speech confidence measures the intelligibility of the target speech);
Applicable scenario: natural language processing problems, including misrecognition or mis-segmentation;
Presentation effect: the intelligent voice device plays the dialog "I didn't hear that clearly just now; could you say it again?".
Illustratively, for the response result "check":
The generation condition is as follows: the target voice conversion module has a transfer candidate. (The target voice conversion module is an upstream module that rewrites and corrects the target voice; if it has a high-confidence candidate, it converts the target voice directly; if it has only a low-confidence candidate, it does not convert the target voice but passes the candidate downstream.)
Applicable scenario: a "generalization" scenario where the target voice application has limited understanding capability; for example, a learning-type voice application can understand "start making a dish" but cannot understand "I want to make a dish";
Presentation effect: the user says "I want to make a dish", and the intelligent voice device plays "Did you mean to say 'start making a dish'? You can tell me again." If the user then says "yes", the system directly issues "start making a dish" to the learning-type voice application, so that the user gets a satisfactory result.
Illustratively, for the response result "strong interrupt":
The generation condition is as follows: an associated voice application yields a better result;
Applicable scenario: jumping out of the current dialog when the target voice application has limited capability and lacks the resources or functions to respond;
Presentation effect: the user says "play music" in a learning-type voice application, and the system directly invokes the associated voice application "music" and plays music for the user.
Illustratively, for the response result "query interrupt":
The generation conditions are as follows: there is a better result in an associated voice application, and the user is currently in a strong interaction state (e.g., the user has completed multiple rounds of interaction within a short time);
Applicable scenario: the same as the strong interrupt, but the generation adds the "strong interaction" restriction. In the strong interaction state, a false trigger (i.e., the current target voice application could actually satisfy the current target voice but the model prediction was wrong) would greatly harm the user experience, so the query interrupt is adopted as a fault-tolerant mechanism to mitigate that harm;
Presentation effect: the user says "play music" in a learning-type voice application, and the intelligent voice device plays "Let me guess, do you mean you want to listen to music?" If the user then says "yes", the system invokes the associated voice application "music" and plays music for the user.
Illustratively, for the response result "convert speech":
The generation condition is as follows: this is the fallback (bottom-of-the-pocket) response, which is generated as long as the response satisfaction signal is present;
Applicable scenario: target voices that no other skill can satisfy, such as the instruction class or the chat class;
Presentation effect: different types of target voice are answered with different wordings; in this embodiment, the type (scene feature) of the target voice can be distinguished through the intention parsed by the upstream module. For an instruction-class target voice, the wording is "The current skill does not support this instruction yet; please try another instruction"; for a chat-class target voice, the wording is "The current skill does not support chat; please ask me a different question"; for other target voices, the wording is "I didn't understand; you could try rephrasing the question", and the like.
And a third stage of feedback.
In this embodiment, for the generated multiple response results, the response results may be sorted according to a set priority order, so as to obtain a target response result. The preset priority order is not fixed and can be set according to different scenes.
According to the scheme of this embodiment, given the technical level of third-party developers and the limited data available to them in the prior art, the understanding and fulfillment capability of most target voice applications is far lower than that of the associated voice applications. The scheme is therefore not limited to having the target voice application respond to the user's voice; instead, an optimal mode of interacting with the user is selected according to the different states of the user's voice, which reduces the number of voice interactions, improves voice interaction efficiency, and improves user satisfaction with the intelligent voice device.
Fig. 5 is a schematic structural diagram of a voice interaction apparatus of an intelligent voice device according to an embodiment of the present disclosure, where the apparatus may perform a voice interaction method of the intelligent voice device related to any embodiment of the present disclosure; referring to fig. 5, the voice interaction apparatus 500 of the smart voice device includes: a response satisfaction prediction module 510, a response result generation module 520, and a target response result determination module 530.
A response satisfaction predicting module 510, configured to obtain a target voice pointing to a currently triggered target voice application, and predict a response satisfaction of the target voice application to the target voice;
a response result generating module 520, configured to generate at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice if it is determined that the response satisfaction does not meet a preset threshold condition;
a target response result determining module 530, configured to determine a target response result in each response result, and perform user feedback according to the target response result.
According to the scheme of this embodiment, the response satisfaction predicting module acquires the target voice pointing to the currently triggered target voice application and predicts the response satisfaction of the target voice application for the target voice; the response result generating module generates at least one response result corresponding to the target voice according to the voice feature and/or the scene feature of the target voice; and the target response result determining module determines a target response result among the response results and provides feedback to the user according to the target response result. This solves the problem that, for a target voice triggering the target voice application, the response result of the intelligent voice device is single and its degree of intelligence is low; it provides a new voice interaction mode, improves voice interaction efficiency, and improves the intelligence of the intelligent voice device.
In an optional implementation manner of this embodiment, the response satisfaction predicting module 510 is specifically configured to
Inputting a text recognition result of the target voice and an application identifier of the target voice application program into a pre-trained online response satisfaction prediction model;
and acquiring the response satisfaction output by the online response satisfaction prediction model.
In an optional implementation manner of this embodiment, the voice interaction apparatus 500 of the intelligent voice device further includes:
the online response satisfaction prediction model determining module is used for acquiring a plurality of sample data to be labeled, and the sample data to be labeled comprises: the method comprises the steps that a user inputs a text, context information of the user input text and an application identifier;
respectively inputting the sample data to be marked into a pre-trained offline response satisfaction prediction model;
acquiring the response satisfaction output by the offline response satisfaction prediction model;
constructing a plurality of training samples according to the user input text, the application identification and the response satisfaction;
and training a preset machine learning model by using each training sample to obtain the online response satisfaction prediction model.
In an optional implementation manner of this embodiment, the voice interaction apparatus 500 of the intelligent voice device further includes:
a text recognition result providing module, configured to provide a text recognition result of the target speech to a target speech application program if it is determined that the response satisfaction satisfies a preset threshold condition;
and carrying out user feedback on the response result provided by the target voice application program.
In an optional implementation manner of this embodiment, the response result generating module 520 includes: a first response result generation submodule for
Acquiring a voice confidence coefficient matched with the target voice;
and if the voice confidence coefficient does not meet the preset threshold condition, generating a response result for requesting the user to input again.
In an optional implementation manner of this embodiment, the response result generating module 520 includes: a second response result generation submodule for
Detecting whether a target change candidate result matched with the target voice exists or not;
and if so, generating a response result for requesting the user to confirm the target change candidate result.
In an optional implementation manner of this embodiment, the response result generating module 520 includes: a third response result generation submodule for
Obtaining the response result score of each associated voice application program to the target voice;
if the target associated voice application program with the response result score meeting the preset threshold condition exists, determining the current interaction state according to the historical interaction record matched with the target voice;
if the target associated voice application program is determined not to be in the strong interaction state at present, generating a response result for calling the target associated voice application program;
and if the current state is in the strong interaction state, generating a response result for requesting the user to confirm to invoke the target associated voice application program.
In an optional implementation manner of this embodiment, the response result generating module 520 includes: a fourth response result generation submodule for
Obtaining a scene type matched with the target voice, wherein the scene type comprises: an instruction class or a chat class;
and generating a response result matching the scene type that requests the user to rephrase the question.
In an optional implementation manner of this embodiment, the target response result determining module 530 includes: a target response result determination submodule for
If the number of the generated response results is multiple, sequencing the response results according to a preset priority order;
and obtaining a target response result according to the sequencing result.
In an optional implementation manner of this embodiment, the target response result determining module 530 includes: a feedback sub-module for
Determining whether the number of times of continuously feeding back the target response result reaches a set threshold value;
and if so, providing the text recognition result of the target voice to a target voice application program, and performing user feedback on a response result provided by the target voice application program.
The voice interaction device of the intelligent voice equipment can execute the voice interaction method of the intelligent voice equipment provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method. Technical details that are not described in detail in this embodiment can be referred to a voice interaction method of an intelligent voice device provided in any embodiment of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components of the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608 such as a magnetic disk, an optical disk, or the like; and a communication unit 609 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the voice interaction method of the intelligent voice device. For example, in some embodiments, the voice interaction method of the intelligent voice device may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the voice interaction method of the intelligent voice device described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the voice interaction method of the intelligent voice device.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that the various flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed here.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
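The overall method of claim 1 — predict a response satisfaction, fall back to generated candidate responses when a preset threshold condition is not satisfied, and select a target response by priority (claim 9) — can be sketched in Python. Everything below (function names, numeric thresholds, and the heuristic stand-in "model") is an illustrative assumption, not the patented implementation:

```python
from dataclasses import dataclass

SATISFACTION_THRESHOLD = 0.5  # hypothetical stand-in for the "preset threshold condition"

@dataclass
class Candidate:
    text: str
    priority: int  # lower value = higher priority (assumed ordering, cf. claim 9)

def predict_satisfaction(query_text: str, app_id: str) -> float:
    """Stand-in for the pre-trained online satisfaction model (claim 2).
    A trivial length heuristic keeps the sketch runnable."""
    return 0.9 if len(query_text.split()) > 2 else 0.3

def generate_candidates(query_text: str, voice_confidence: float) -> list[Candidate]:
    """Generate fallback responses from voice/scene features (claims 5-8)."""
    candidates = []
    if voice_confidence < 0.6:  # low ASR confidence -> ask for re-input (claim 5)
        candidates.append(Candidate("Sorry, could you say that again?", 1))
    # Default fallback: ask the user to rephrase the question (claim 8)
    candidates.append(Candidate("Could you try asking in a different way?", 3))
    return candidates

def respond(query_text: str, app_id: str, voice_confidence: float) -> str:
    if predict_satisfaction(query_text, app_id) >= SATISFACTION_THRESHOLD:
        # Satisfaction high enough: hand the text to the target application (claim 4)
        return f"[{app_id}] handles: {query_text}"
    ranked = sorted(generate_candidates(query_text, voice_confidence),
                    key=lambda c: c.priority)  # claim 9: preset priority order
    return ranked[0].text  # the target response result fed back to the user
```

A real deployment would replace `predict_satisfaction` with the pre-trained online prediction model of claim 2 and derive `voice_confidence` from the speech recognizer.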

Claims (23)

1. A voice interaction method of an intelligent voice device comprises the following steps:
acquiring a target voice directed to a currently triggered target voice application program, and predicting a response satisfaction of the target voice application program with respect to the target voice;
if it is determined that the response satisfaction does not satisfy a preset threshold condition, generating at least one response result corresponding to the target voice according to a voice feature and/or a scene feature of the target voice;
and determining a target response result among the response results, and performing user feedback according to the target response result.
2. The method of claim 1, wherein the predicting the satisfaction of the target speech application's response to the target speech comprises:
inputting a text recognition result of the target voice and an application identifier of the target voice application program into a pre-trained online response satisfaction prediction model;
and acquiring the response satisfaction output by the online response satisfaction prediction model.
3. The method of claim 2, wherein, before the inputting of the text recognition result of the target voice and the application identifier of the target voice application program into the pre-trained online response satisfaction prediction model, the method further comprises:
obtaining a plurality of pieces of sample data to be labeled, wherein each piece of sample data to be labeled comprises: a user input text, context information of the user input text, and an application identifier;
respectively inputting the sample data to be labeled into a pre-trained offline response satisfaction prediction model;
acquiring the response satisfaction output by the offline response satisfaction prediction model;
constructing a plurality of training samples according to the user input text, the application identification and the response satisfaction;
and training a preset machine learning model by using each training sample to obtain the online response satisfaction prediction model.
4. The method of claim 1, wherein after predicting the satisfaction of the target speech application's response to the target speech, further comprising:
if the response satisfaction is determined to meet the preset threshold condition, providing the text recognition result of the target voice to a target voice application program;
and carrying out user feedback on the response result provided by the target voice application program.
5. The method according to claim 1, wherein the generating at least one response result corresponding to the target speech according to the speech feature and/or scene feature of the target speech comprises:
acquiring a voice confidence coefficient matched with the target voice;
and if the voice confidence coefficient does not meet the preset threshold condition, generating a response result for requesting the user to input again.
6. The method according to claim 1, wherein the generating at least one response result corresponding to the target speech according to the speech feature and/or scene feature of the target speech comprises:
detecting whether a target change candidate result matching the target voice exists;
and if so, generating a response result requesting the user to confirm the target change candidate result.
7. The method according to claim 1, wherein the generating at least one response result corresponding to the target speech according to the speech feature and/or scene feature of the target speech comprises:
obtaining a response result score of each associated voice application program for the target voice;
if a target associated voice application program whose response result score satisfies a preset threshold condition exists, determining a current interaction state according to a historical interaction record matching the target voice;
if it is determined that the target associated voice application program is not currently in a strong interaction state, generating a response result for invoking the target associated voice application program;
and if it is currently in the strong interaction state, generating a response result requesting the user to confirm invoking the target associated voice application program.
8. The method according to claim 1, wherein the generating at least one response result corresponding to the target speech according to the speech feature and/or scene feature of the target speech comprises:
obtaining a scene type matching the target voice, wherein the scene type comprises: an instruction class or a chat class;
and generating a response result that matches the scene type and requests the user to rephrase the question.
9. The method of claim 1, wherein said determining a target response result among each of said response results comprises:
if the number of the generated response results is multiple, sequencing the response results according to a preset priority order;
and obtaining a target response result according to the sequencing result.
10. The method of claim 1, wherein the performing user feedback according to the target response result comprises:
determining whether the number of times the target response result has been continuously fed back reaches a set threshold;
and if so, providing the text recognition result of the target voice to a target voice application program, and performing user feedback on a response result provided by the target voice application program.
11. A voice interaction device of an intelligent voice device comprises:
the response satisfaction predicting module is used for acquiring target voice pointing to a currently triggered target voice application program and predicting the response satisfaction of the target voice application program to the target voice;
a response result generation module, configured to generate at least one response result corresponding to the target voice according to a voice feature and/or a scene feature of the target voice if it is determined that the response satisfaction does not meet a preset threshold condition;
and the target response result determining module is used for determining a target response result in each response result and performing user feedback according to the target response result.
12. The apparatus of claim 11, wherein the response satisfaction prediction module is specifically configured to:
input a text recognition result of the target voice and an application identifier of the target voice application program into a pre-trained online response satisfaction prediction model; and
acquire the response satisfaction output by the online response satisfaction prediction model.
13. The apparatus of claim 12, wherein the voice interaction means of the smart voice device further comprises:
the online response satisfaction prediction model determining module is used for acquiring a plurality of sample data to be labeled, and the sample data to be labeled comprises: the method comprises the steps that a user inputs a text, context information of the user input text and an application identifier;
respectively inputting the sample data to be marked into a pre-trained offline response satisfaction prediction model;
acquiring the response satisfaction output by the offline response satisfaction prediction model;
constructing a plurality of training samples according to the user input text, the application identification and the response satisfaction;
and training a preset machine learning model by using each training sample to obtain the online response satisfaction prediction model.
14. The apparatus of claim 11, wherein the voice interaction means of the smart voice device further comprises:
a text recognition result providing module, configured to provide a text recognition result of the target speech to a target speech application program if it is determined that the response satisfaction satisfies a preset threshold condition;
and carrying out user feedback on the response result provided by the target voice application program.
15. The apparatus of claim 11, wherein the response result generation module comprises a first response result generation submodule configured to:
acquire a voice confidence coefficient matching the target voice; and
if the voice confidence coefficient does not satisfy a preset threshold condition, generate a response result requesting the user to input again.
16. The apparatus of claim 11, wherein the response result generation module comprises a second response result generation submodule configured to:
detect whether a target change candidate result matching the target voice exists; and
if so, generate a response result requesting the user to confirm the target change candidate result.
17. The apparatus of claim 11, wherein the response result generation module comprises a third response result generation submodule configured to:
obtain a response result score of each associated voice application program for the target voice;
if a target associated voice application program whose response result score satisfies a preset threshold condition exists, determine a current interaction state according to a historical interaction record matching the target voice;
if it is determined that the target associated voice application program is not currently in a strong interaction state, generate a response result for invoking the target associated voice application program; and
if it is currently in the strong interaction state, generate a response result requesting the user to confirm invoking the target associated voice application program.
18. The apparatus of claim 11, wherein the response result generation module comprises a fourth response result generation submodule configured to:
obtain a scene type matching the target voice, wherein the scene type comprises: an instruction class or a chat class; and
generate a response result that matches the scene type and requests the user to rephrase the question.
19. The apparatus of claim 11, wherein the target response result determination module comprises a target response result determination submodule configured to:
if a plurality of response results are generated, sort the response results according to a preset priority order; and
obtain the target response result according to the sorting result.
20. The apparatus of claim 11, wherein the target response result determination module comprises a feedback submodule configured to:
determine whether the number of times the target response result has been continuously fed back reaches a set threshold; and
if so, provide the text recognition result of the target voice to the target voice application program, and perform user feedback on a response result provided by the target voice application program.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of voice interaction of a smart voice device of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the voice interaction method of the smart voice device according to any one of claims 1 to 10.
23. A computer program product comprising a computer program which, when executed by a processor, implements a method of voice interaction for an intelligent voice device according to any of claims 1-10.
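Claim 3 describes a distillation-style training setup: a heavier offline satisfaction model pseudo-labels raw interaction samples, and those labels are used to supervise a lighter online model. A self-contained Python sketch follows; the scoring heuristic, the per-application averaging "model", and all names are hypothetical illustrations of the scheme, not the claimed implementation:

```python
def offline_predict(text: str, context: list[str], app_id: str) -> float:
    """Stand-in for the pre-trained offline satisfaction model (hypothetical scoring)."""
    return min(1.0, 0.2 * len(text.split()) + 0.1 * len(context))

def build_training_samples(raw_samples):
    """Pseudo-label each (text, context, app_id) tuple with the offline model's score.
    Only the user input text and the application identifier are kept as features,
    matching the training-sample construction described in claim 3."""
    samples = []
    for text, context, app_id in raw_samples:
        score = offline_predict(text, context, app_id)
        samples.append(((text, app_id), score))
    return samples

def train_online_model(samples):
    """Toy 'online model': mean pseudo-label per application identifier.
    A real system would fit a machine learning model here instead."""
    sums, counts = {}, {}
    for (_text, app_id), score in samples:
        sums[app_id] = sums.get(app_id, 0.0) + score
        counts[app_id] = counts.get(app_id, 0) + 1
    return {app: sums[app] / counts[app] for app in sums}
```

The appeal of this arrangement is that the expensive offline model (which may use full context) never runs in the serving path; only the cheap online model does.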
CN202110164588.4A 2021-02-05 2021-02-05 Voice interaction method, device, equipment, medium and product of intelligent voice equipment Active CN112767916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110164588.4A CN112767916B (en) 2021-02-05 2021-02-05 Voice interaction method, device, equipment, medium and product of intelligent voice equipment


Publications (2)

Publication Number Publication Date
CN112767916A true CN112767916A (en) 2021-05-07
CN112767916B CN112767916B (en) 2024-03-01

Family

ID=75705182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110164588.4A Active CN112767916B (en) 2021-02-05 2021-02-05 Voice interaction method, device, equipment, medium and product of intelligent voice equipment

Country Status (1)

Country Link
CN (1) CN112767916B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655938A (en) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Interaction method, device, equipment and medium for intelligent cockpit
CN115083412A (en) * 2022-08-11 2022-09-20 科大讯飞股份有限公司 Voice interaction method and related device, electronic equipment and storage medium
WO2022261976A1 (en) * 2021-06-18 2022-12-22 深圳传音控股股份有限公司 Processing method, terminal device, and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808669A (en) * 2016-02-29 2016-07-27 宇龙计算机通信科技(深圳)有限公司 Application screening and judgment method and apparatus
CN106303476A (en) * 2016-08-03 2017-01-04 纳恩博(北京)科技有限公司 The control method of robot and device
JP2018045202A (en) * 2016-09-16 2018-03-22 トヨタ自動車株式会社 Voice interaction system and voice interaction method
EP3382696A1 (en) * 2017-03-28 2018-10-03 Samsung Electronics Co., Ltd. Method for operating speech recognition service, electronic device and system supporting the same
CN109345313A (en) * 2018-10-12 2019-02-15 四川长虹电器股份有限公司 Customer satisfaction survey system and method based on big data
CN110162633A (en) * 2019-05-21 2019-08-23 深圳市珍爱云信息技术有限公司 Voice data is intended to determine method, apparatus, computer equipment and storage medium
CN110503954A (en) * 2019-08-29 2019-11-26 百度在线网络技术(北京)有限公司 Voice technical ability starts method, apparatus, equipment and storage medium
CN110544473A (en) * 2018-05-28 2019-12-06 百度在线网络技术(北京)有限公司 Voice interaction method and device
CN110827808A (en) * 2019-12-06 2020-02-21 北京明略软件系统有限公司 Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
US20200075006A1 (en) * 2018-08-29 2020-03-05 Alibaba Group Holding Limited Method, system, and device for interfacing with a terminal with a plurality of response modes
CN111209377A (en) * 2020-04-23 2020-05-29 腾讯科技(深圳)有限公司 Text processing method, device, equipment and medium based on deep learning
CN111292744A (en) * 2020-01-22 2020-06-16 南京雷鲨信息科技有限公司 Voice instruction recognition method, system and computer readable storage medium
CN111369984A (en) * 2018-12-26 2020-07-03 Tcl集团股份有限公司 Voice interaction method, storage medium and terminal equipment
CN111833847A (en) * 2019-04-15 2020-10-27 北京百度网讯科技有限公司 Speech processing model training method and device
CN112151026A (en) * 2020-08-20 2020-12-29 未来穿戴技术有限公司 Voice control method, device, server and computer readable storage medium
CN112163093A (en) * 2020-10-13 2021-01-01 杭州电子科技大学 Electric power resident APP multi-question type questionnaire score classification method based on characteristic values



Also Published As

Publication number Publication date
CN112767916B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
US11600291B1 (en) Device selection from audio data
US11875820B1 (en) Context driven device arbitration
US11676575B2 (en) On-device learning in a hybrid speech processing system
US10685652B1 (en) Determining device groups
JP6317111B2 (en) Hybrid client / server speech recognition
US11355098B1 (en) Centralized feedback service for performance of virtual assistant
US11132509B1 (en) Utilization of natural language understanding (NLU) models
JP2019139211A (en) Voice wake-up method and device
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN114550705B (en) Dialogue recommendation method, training device, training equipment and training medium for models
CN110956955B (en) Voice interaction method and device
JP2021076818A (en) Method, apparatus, device and computer readable storage media for voice interaction
CN113674742A (en) Man-machine interaction method, device, equipment and storage medium
CN112133307A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112863496B (en) Voice endpoint detection method and device
CN115083412B (en) Voice interaction method and related device, electronic equipment and storage medium
JP6306447B2 (en) Terminal, program, and system for reproducing response sentence using a plurality of different dialogue control units simultaneously
CN114299955B (en) Voice interaction method and device, electronic equipment and storage medium
CN114171016B (en) Voice interaction method and device, electronic equipment and storage medium
CN113360590B (en) Method and device for updating interest point information, electronic equipment and storage medium
US11783805B1 (en) Voice user interface notification ordering
CN114121022A (en) Voice wake-up method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant