CN115394300A - Voice interaction method, voice interaction device, vehicle and readable storage medium

Info

Publication number: CN115394300A
Authority: CN (China)
Prior art keywords: result, dialogue, dialog, target, conversation
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202211332377.8A
Other languages: Chinese (zh)
Other versions: CN115394300B (en)
Inventors: 唐祥光, 胡梓垣, 孙仿逊, 左佑, 鲍鹏丽, 王合心
Current Assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202211332377.8A
Publication of CN115394300A
Application granted
Publication of CN115394300B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60R VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R 16/00 Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R 16/02 Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R 16/037 Electric or fluid circuits specially adapted for vehicles for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R 16/0373 Voice control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Abstract

The invention discloses a voice interaction method, a voice interaction device, a vehicle and a readable storage medium, belonging to the technical field of vehicle-mounted voice interaction. The voice interaction method comprises the following steps: acquiring at least one first dialogue result determined by a local end; grading the first dialogue results to determine the fusion level corresponding to each first dialogue result; when no second dialogue result sent by the cloud has been received and the fusion level corresponding to a first dialogue result is determined to be the highest level, determining the first dialogue result corresponding to the highest level as the target dialogue result; when a second dialogue result sent by the cloud is received, determining the second dialogue result as the target dialogue result; and executing voice interaction according to the target dialogue result. The voice interaction method improves the response speed and sensitivity of the voice interaction system while ensuring recognition accuracy, enabling fast dialogue and a quicker experience without sacrificing correctness.

Description

Voice interaction method, voice interaction device, vehicle and readable storage medium
Technical Field
The invention belongs to the technical field of vehicle-mounted voice interaction, and particularly relates to a voice interaction method, a voice interaction device, a vehicle and a readable storage medium.
Background
With the wide application of vehicle-mounted voice systems, users pay increasing attention to their recognition accuracy and response speed. In the related art, recognition accuracy is often improved by placing the natural language parsing service in the cloud, but this adds latency, slows the response, and degrades the user experience.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a voice interaction method, a voice interaction device, a vehicle and a readable storage medium, which can improve the response speed and sensitivity of a voice interaction system while ensuring the recognition accuracy.
In a first aspect, the present invention provides a voice interaction method, including:
obtaining at least one first dialogue result determined by a local end, where the local end comprises at least one branch, each branch determines its corresponding first dialogue result based on a user voice request from the vehicle cabin, and the branches have different computation delays;
grading the first dialogue results and determining the fusion level corresponding to each first dialogue result;
when no second dialogue result sent by the cloud has been received and the fusion level corresponding to a first dialogue result is determined to be the highest level, determining the first dialogue result corresponding to the highest level as the target dialogue result;
when a second dialogue result sent by the cloud is received, determining the second dialogue result as the target dialogue result;
executing voice interaction according to the target dialogue result;
wherein the second dialogue result is determined by the cloud based on the user voice request.
According to the voice interaction method, the first dialogue results sent by the multiple branches of the local end are graded to obtain their fusion levels, and the target dialogue result finally used to execute the voice interaction is determined based on the fusion levels and on whether the second dialogue result sent by the cloud has been received. This improves the response speed and sensitivity of the voice interaction system while guaranteeing recognition accuracy, enabling extremely fast dialogue and a quicker experience without sacrificing correctness, thereby improving the user experience.
According to the voice interaction method of the present invention, when the user voice request includes a plurality of consecutive sub-voice requests, the step of grading the first dialogue results to determine the fusion level corresponding to each first dialogue result includes:
determining the fusion level corresponding to the second target sub-dialogue result as the second level when the first target sub-dialogue result was identified by the cloud;
determining the fusion level corresponding to the second target sub-dialogue result as the highest level when the first target sub-dialogue result was identified by the local end;
wherein the first target sub-dialogue result is the target dialogue result corresponding to a first target sub-voice request, the second target sub-dialogue result is the first dialogue result corresponding to a second target sub-voice request, and the first target sub-voice request is the sub-voice request that immediately precedes the second target sub-voice request among the plurality of consecutive sub-voice requests.
According to the voice interaction method, the step of grading the first dialogue results to determine the fusion level corresponding to each first dialogue result comprises:
grading each first dialogue result based on at least one of the text recognition result and the first dialogue result, and determining the fusion level corresponding to the first dialogue result;
the text recognition result is determined by performing text recognition on the user voice request, and the first dialogue result is determined by performing semantic understanding on the text recognition result.
According to the voice interaction method, the step of grading the first dialogue result based on at least one of the text recognition result and the first dialogue result to determine its corresponding fusion level comprises:
acquiring the text recognition confidence and the text recognition clarity from the text recognition result, and acquiring the domain in the first dialogue result, a first confidence corresponding to the domain, the intention, and a second confidence corresponding to the intention;
determining the fusion level corresponding to the first dialogue result based on at least two of the text recognition confidence, the text recognition clarity, the domain, the first confidence, the intention, and the second confidence.
According to the voice interaction method of the present invention, after obtaining the at least one first dialogue result determined by the local end and before grading the first dialogue results to determine their fusion levels, the method further includes:
determining, based on the first dialogue result, a frequency level corresponding to the first dialogue result and a credibility corresponding to the first dialogue result;
and determining that the fusion level corresponding to the first dialogue result is the highest level when the credibility is greater than a target threshold and the frequency level is high frequency.
According to the voice interaction method, determining the frequency level and the credibility corresponding to the first dialogue result based on the first dialogue result comprises:
acquiring the text recognition clarity from the text recognition result, and acquiring the first confidence corresponding to the domain, the second confidence corresponding to the intention, and the user voice request from the first dialogue result, where the text recognition result is determined by performing text recognition on the user voice request;
matching the user voice request with a prefix tree to determine the frequency level;
and determining the credibility based on the text recognition clarity, the first confidence, and the second confidence.
According to the voice interaction method of the present invention, after the voice interaction is performed according to the target dialog result, the method further includes:
updating initial context information corresponding to the user voice request based on the target dialogue result, where the initial context information is determined by the local end or the cloud based on the user voice request.
According to the voice interaction method of the present invention, when the user voice request includes a plurality of consecutive sub-voice requests, the method further comprises:
receiving an interrupt signal sent by a target device, where the interrupt signal includes the interrupted task ID;
and in response to the interrupt signal, clearing the task stack corresponding to the task ID and the context information corresponding to the task ID.
In a second aspect, the present invention provides a voice interaction apparatus, including:
the first obtaining module is used for obtaining at least one first dialogue result determined by the local end, where the local end comprises at least one branch, each branch determines its corresponding first dialogue result based on a user voice request from the vehicle cabin, and the branches have different computation delays;
the first processing module is used for grading the first dialogue results and determining the fusion level corresponding to each first dialogue result;
the second processing module is used for determining the first dialogue result corresponding to the highest level as the target dialogue result when no second dialogue result sent by the cloud has been received and the fusion level corresponding to the first dialogue result is determined to be the highest level;
the third processing module is used for determining the second dialogue result as the target dialogue result when the second dialogue result sent by the cloud is received;
the fourth processing module is used for executing voice interaction according to the target dialogue result;
wherein the second dialogue result is determined by the cloud based on the user voice request.
According to the voice interaction device, the first dialogue results sent by the multiple branches of the local end are graded to obtain their fusion levels, and the target dialogue result finally used to execute the voice interaction is determined based on the fusion levels and on whether the second dialogue result sent by the cloud has been received, which improves the response speed and sensitivity of the voice interaction system while ensuring recognition accuracy, thereby improving the user experience.
In a third aspect, the present invention provides a vehicle comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voice interaction method according to the first aspect when executing the computer program.
In a fourth aspect, the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of voice interaction as described in the first aspect above.
In a fifth aspect, the present invention provides a chip including a processor and a communication interface coupled to the processor, the processor being configured to execute a program or instructions to implement the voice interaction method according to the first aspect.
In a sixth aspect, the invention provides a computer program product comprising a computer program which, when executed by a processor, implements the method of voice interaction as described in the first aspect above.
One or more technical solutions of the present invention have at least one of the following technical effects:
The first dialogue results sent by the multiple branches of the local end are graded to obtain their fusion levels, and the target dialogue result finally used to execute the voice interaction is determined based on the fusion levels and on whether the second dialogue result sent by the cloud has been received. This improves the response speed and sensitivity of the voice interaction system while guaranteeing recognition accuracy, enabling extremely fast dialogue and a quicker experience without sacrificing correctness, thereby improving the user experience.
Furthermore, in a multi-round dialogue scenario, the fusion level of the next round is adjusted based on the result of the previous round, achieving dynamic grading of the fusion level, so that end-cloud services of different architectures and systems are better compatible and the implementation difficulty is reduced.
Furthermore, before the grading, high-frequency grading is performed on the first dialogue result to judge whether it is a high-frequency voice command of the user; when the first dialogue result is determined to be a high-frequency voice command in the vehicle-mounted environment, it is determined as the target dialogue result, further improving the response rate on the basis of guaranteed accuracy.
Furthermore, the fusion level corresponding to the first dialogue result is determined by grading the first dialogue result based on at least one of the text recognition result and the first dialogue result, which improves the accuracy and precision of the final fusion level and thus the accuracy of the voice interaction system.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a voice interaction method provided by the present invention;
FIG. 2 is a second schematic flow chart of the voice interaction method provided by the present invention;
FIG. 3 is a third flowchart illustrating a voice interaction method provided by the present invention;
FIG. 4 is a fourth flowchart illustrating a voice interaction method provided by the present invention;
FIG. 5 is a fifth flowchart illustrating a voice interaction method provided by the present invention;
FIG. 6 is a schematic structural diagram of a voice interaction apparatus provided in the present invention;
FIG. 7 is a schematic structural diagram of a vehicle provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly below with reference to the drawings in the embodiments of the present invention; obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein fall within the scope of the present invention.
The terms "first", "second" and the like in the description and in the claims are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It should be appreciated that data so used may be interchanged under appropriate circumstances, so that embodiments of the invention may be practiced other than as illustrated or described herein; objects identified as "first", "second", etc. are generally a class of objects and do not limit the number of objects, e.g., a first object may be one or more. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
In the related art, there are three approaches to voice interaction:
First, a large number of the neural network systems of the vehicle-mounted voice system are deployed in the cloud to perform voice recognition, the recognition result is then executed by the local end, and the cloud and the local end communicate over the network. However, this approach is time-consuming and slow to respond, so the user experience is poor and the user's desire for voice interaction is greatly reduced.
Second, cloud services and end services are deployed simultaneously, and the cloud service is used for voice recognition only when the local end cannot support it. However, this approach greatly degrades the response speed whenever the local end cannot support the recognition, and cannot meet the requirement of quick response in complex voice recognition situations, resulting in a poor user experience.
Third, edge computing is used for voice recognition. However, this approach requires connecting more external devices, so the design cost is high; moreover, edge computing is unstable and its computing power is limited, so most functions of the voice recognition system cannot be realized, which affects the user experience.
The voice interaction method, the voice interaction device, the vehicle and the readable storage medium provided by the present invention are described in detail with reference to the accompanying drawings through specific implementation manners and application scenarios thereof.
The voice interaction method can be applied to the terminal, and can be specifically executed by hardware or software in the terminal.
The terminal may be a vehicle head unit (car machine), and may be a device including a microphone, a touch panel, or another physical user interface.
In various implementations below, a terminal including a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and joystick.
According to the voice interaction method provided by the invention, the execution subject may be a vehicle head unit, or a functional module or functional entity within it that can implement the voice interaction method (such as an end-cloud fusion management system). In the vehicle-mounted environment, network conditions are complex; for example, while the vehicle is driving, the network state changes dynamically as the location switches, so the complexity and computation delay of voice interaction are far higher than in a home environment.
As shown in fig. 1, the voice interaction method includes: step 110, step 120, step 130, step 140 and step 150.
Step 110, obtaining at least one first dialogue result determined by the local end; the local end comprises at least one branch, each branch determines its corresponding first dialogue result based on a user voice request from the vehicle cabin, and the branches have different computation delays;
in this step, the user voice request is the voice information uttered by a user in the vehicle cabin during voice interaction.
It is understood that voice interaction scenarios include single-round dialogue scenarios and multi-round dialogue scenarios.
In a single-round dialogue scenario, the user voice request is a single request; in a multi-round dialogue scenario, the user voice request may include multiple consecutive sub-voice requests, where each later sub-voice request is a follow-up instruction issued based on the execution result of the preceding sub-voice request.
The local end comprises at least one branch, and each branch independently identifies the voice request of the user.
Different branches are provided with different computing power and accordingly have different computation delays; for example, the computing power of the branches may increase step by step.
The first dialogue result is the result obtained by a branch of the local end performing voice recognition on the user voice request.
In actual execution, each branch independently recognizes the same user voice request to generate its corresponding first dialogue result, for example along the lines of the sketch below.
The determination of the first dialogue result is described later and is not detailed here.
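As an illustration only, the parallel branches might be organized as in the following minimal sketch; the class, function, and method names (FirstDialogueResult, nlu_model.understand, etc.) are assumptions for the sketch and are not terms from the patent.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class FirstDialogueResult:
    branch_id: int
    domain: str
    domain_confidence: float   # the "first confidence"
    intent: str
    intent_confidence: float   # the "second confidence"

def run_branch(branch_id, nlu_model, asr_text):
    # each branch independently performs semantic understanding on the
    # same text-recognized user voice request; larger models take longer
    domain, d_conf, intent, i_conf = nlu_model.understand(asr_text)  # hypothetical API
    return FirstDialogueResult(branch_id, domain, d_conf, intent, i_conf)

def stream_first_results(nlu_models, asr_text):
    # results are consumed as they arrive, so faster (smaller) branches
    # can be graded before slower branches or the cloud respond
    with ThreadPoolExecutor(max_workers=len(nlu_models)) as pool:
        futures = [pool.submit(run_branch, i, m, asr_text)
                   for i, m in enumerate(nlu_models)]
        for future in as_completed(futures):
            yield future.result()
```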
Step 120, grading the first dialogue results and determining the fusion level corresponding to each first dialogue result;
in this step, the fusion level is used to characterize the priority of the first dialog result.
For example, the fusion level may include a plurality of levels such as a/B/C/D/E, and the priority corresponding to each level decreases in order.
It will be appreciated that the accuracy of the first dialog result obtained by different branch processing may vary, with different accuracies corresponding to different priorities.
In a subsequent execution, one of the obtained first dialog results may be determined for execution based on the fusion level.
In actual execution, a pre-trained neural network model may be used to grade the first dialogue results and obtain the fusion level corresponding to each first dialogue result, where the neural network model may be any suitable model.
In some embodiments, step 120 may include:
grading the first dialogue result based on at least one of the text recognition result and the first dialogue result, and determining the fusion level corresponding to the first dialogue result;
the text recognition result is determined by performing text recognition on the user voice request, and the first dialogue result is determined by performing semantic understanding on the text recognition result.
In this embodiment, the text recognition result is obtained by performing text recognition (ASR) on the user voice request.
After the text recognition result is obtained, performing semantic understanding (NLU) on the obtained text recognition result to obtain a first dialogue result.
In the actual execution process, the text recognition result obtained after ASR processing can be stored and called when needed.
After the first dialogue result is obtained, the text recognition result corresponding to the first dialogue result may be graded, or the first dialogue result may be graded, or both the first dialogue result and its corresponding text recognition result may be graded, so that the fusion level corresponding to the first dialogue result can be determined.
According to the voice interaction method provided by the invention, grading the first dialogue result based on at least one of the text recognition result and the first dialogue result improves the accuracy and precision of the final fusion level and thus the accuracy of the voice interaction system.
In some embodiments, grading the first dialogue result based on at least one of the text recognition result and the first dialogue result to determine its fusion level may include:
acquiring the text recognition confidence and the text recognition clarity from the text recognition result, and acquiring the domain in the first dialogue result, the first confidence corresponding to the domain, the intention, and the second confidence corresponding to the intention;
and determining the fusion level corresponding to the first dialogue result based on at least two of the text recognition confidence, the text recognition clarity, the domain, the first confidence, the intention, and the second confidence.
In this embodiment, the text recognition result may include a text recognition confidence (the ASR nbest confidence) and a text recognition clarity (the ASR clarity).
The first dialogue result may include: the domain, a first confidence corresponding to the domain (i.e., the domain confidence), the intention, and a second confidence corresponding to the intention (i.e., the intention confidence).
At least two of the ASR nbest confidence, the ASR clarity, the domain, the first confidence, the intention, and the second confidence are processed to determine the fusion level corresponding to the first dialogue result.
In actual implementation, a pre-trained grading model may be employed to perform this step.
As shown in fig. 2, after the first dialogue result and the text recognition result are obtained, the ASR nbest confidence, the ASR clarity, the domain, the first confidence, the intention, and the second confidence are input into the grading model. The grading model computes, for example, the variance and mean of the ASR nbest confidences, applies other processing to the ASR clarity, the domain, the first confidence, the intention, and the second confidence, and then passes the results through concatenation (cat) and linear layers to output the fusion level corresponding to the first dialogue result, for example as sketched below.
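For illustration, a minimal PyTorch sketch of such a grading model follows; the layer sizes, feature encoding, and the name GradingModel are assumptions for the sketch, not details from the patent.

```python
import torch
import torch.nn as nn

class GradingModel(nn.Module):
    NUM_LEVELS = 5  # fusion levels A/B/C/D/E

    def __init__(self, num_domains, num_intents, emb_dim=16):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, emb_dim)
        self.intent_emb = nn.Embedding(num_intents, emb_dim)
        # 5 scalar features: nbest mean, nbest variance, ASR clarity,
        # domain (first) confidence, intention (second) confidence
        self.head = nn.Sequential(
            nn.Linear(2 * emb_dim + 5, 32),
            nn.ReLU(),
            nn.Linear(32, self.NUM_LEVELS),
        )

    def forward(self, nbest_conf, clarity, domain_id, domain_conf,
                intent_id, intent_conf):
        # mean and variance over the ASR nbest confidence list
        stats = torch.stack([nbest_conf.mean(dim=-1),
                             nbest_conf.var(dim=-1, unbiased=False)], dim=-1)
        scalars = torch.stack([clarity, domain_conf, intent_conf], dim=-1)
        # "cat" the encoded features, then the "linear" head scores levels
        x = torch.cat([stats, scalars,
                       self.domain_emb(domain_id),
                       self.intent_emb(intent_id)], dim=-1)
        return self.head(x)  # logits over fusion levels A..E
```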
Table 1 illustrates the fusion levels obtained by grading the first dialogue results output by each branch, where the first dialogue results were obtained by each branch recognizing the user voice request "open a window".
TABLE 1
(Table 1 is provided as an image in the original publication.)
The fusion level corresponding to the first dialogue result output by branch 2 is the highest level, namely level A.
It should be noted that the grading model may be trained based on sample first dialogue results and their corresponding sample fusion levels; or based on sample text recognition results and their corresponding sample fusion levels; or based on sample first dialogue results and sample text recognition results together with their corresponding sample fusion levels.
The training process of the grading model is explained below.
1. Data collection and annotation
First, listen to the sample user voice request and label its corresponding domain, intention, etc., to obtain the sample domain and the sample intention;
label whether the sample domain and the sample intention are in the racing list;
judge the fusion level (such as A/B/C/D/E) according to the clarity and completeness of the sample user voice, and label it to obtain the sample fusion level;
training samples for the grading model can then be obtained from the collected data.
2. Training
The sample user voice request is passed through ASR to obtain the ASR nbest confidence, and through the AJ system (another speech model) to obtain the ASR clarity;
the ASR nbest confidence, the ASR clarity, the sample domain and its confidence, and the sample intention and its confidence are each encoded to train the multi-class model.
3. Corpus examples.
4. The annotation criteria can be determined through a demotion process based on the clarity (which may include levels such as 0, 0.3, 0.6, and 1), the sample domain confidence (which may include levels such as 0, 0.5, and 1), the sample intention confidence (which may include levels such as 0, 0.5, and 1), and whether the sample user voice request is background sound or arbitrary chat content.
In the invention, using a pre-trained grading model to determine the fusion level of the first dialogue result improves the accuracy and speed of the computation and provides stronger learning capability.
According to the voice interaction method provided by the invention, determining the fusion level of the first dialogue result based on at least two of the text recognition confidence, the text recognition clarity, the domain, the first confidence, the intention, and the second confidence improves the accuracy and precision of the final fusion level and thus the accuracy of the voice interaction system.
Step 130, determining the first dialogue result corresponding to the highest level as a target dialogue result under the condition that the second dialogue result sent by the cloud is not received and the fusion level corresponding to the first dialogue result is determined to be the highest level;
in this step, the target dialogue result is the dialogue result ultimately used for execution.
The second dialogue result is determined by the cloud based on the same user voice request processed by the local end.
It should be noted that, in the present invention, the fusion level corresponding to the second dialogue result sent by the cloud is by default determined as the highest level, that is, level A.
It can be understood that, when the head unit works normally, after the user wakes up the voice interaction system, the user voice request can be processed through at least one branch of the local end and through the processing route of the cloud, and the computation delay of the cloud may be higher than that of some local branches. Before the second dialogue result sent by the cloud is received, the first dialogue results sent by one or more branches of the local end may already have been received.
Before the second dialogue result is received, the fusion levels corresponding to the received first dialogue results from the branches are compared; if a first dialogue result corresponding to the highest level (i.e., level A) exists, it is determined as the target dialogue result. That is, the level-A first dialogue result recognized by the local end is the dialogue result executed in the current round.
When the second dialogue result has not been received and no first dialogue result of the highest level has been obtained, the first dialogue results sent by the branches continue to be received and graded, and the following processing is performed based on the fusion levels.
In some examples, within a target period, when the second dialogue result has not been received and no first dialogue result of the highest level has been obtained, a first dialogue result whose fusion level is the second level (i.e., level B) is determined as the target dialogue result.
The target period may be user-defined, for example set to the period from the 3rd to the 5th second after the current voice interaction begins; it may of course be set to other periods based on actual requirements, such as from the 2nd to the 4th second, which the present invention does not limit.
In other examples, within the target period, when the second dialogue result has not been received and neither a first dialogue result of the highest level nor one of the second level has been obtained, the current round of voice interaction ends.
For example, during the 3s-5s period, if the second dialogue result has not been received and a first dialogue result with fusion level A is obtained, that first dialogue result is determined as the target dialogue result;
if during 3s-5s the second dialogue result has not been received and no level-A first dialogue result has been obtained, a first dialogue result with fusion level B obtained during 3s-5s may be determined as the target dialogue result;
and if during 3s-5s the second dialogue result has not been received and neither a level-A nor a level-B first dialogue result has been obtained, the current round of voice interaction ends. This arbitration is sketched below.
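A hedged sketch of this arbitration follows; the 3s-5s window and level names come from the example above, while the message fields (source, fusion_level, result) and queue-based structure are assumptions for illustration.

```python
import queue
import time

def arbitrate(results: "queue.Queue", window_start=3.0, window_end=5.0):
    """Returns the target dialogue result, or None to end the round."""
    t0 = time.monotonic()
    level_b_fallback = None
    while time.monotonic() - t0 < window_end:
        try:
            msg = results.get(timeout=0.05)
        except queue.Empty:
            continue
        if msg.source == "cloud":        # second dialogue result: defaults
            return msg.result            # to the highest level, adopt it
        if msg.fusion_level == "A":      # local result graded highest
            return msg.result
        if msg.fusion_level == "B" and time.monotonic() - t0 >= window_start:
            level_b_fallback = msg.result    # adoptable level-B fallback
    return level_b_fallback              # None ends this round
```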
Step 140, determining the second dialogue result as a target dialogue result under the condition of receiving the second dialogue result sent by the cloud;
in this step, when no first dialogue result of the highest level has been obtained and the second dialogue result sent by the cloud is received, the level of the second dialogue result is determined as the highest level and the second dialogue result is determined as the target dialogue result; that is, the second dialogue result obtained by the cloud is the dialogue result executed in this round.
And 150, executing voice interaction according to the target conversation result.
In this step, the target dialog result is one of the first dialog result and the second dialog result.
And the fusion grade corresponding to the target conversation result is the highest grade.
Executing the voice interaction can take several forms:
First, executing the control instruction corresponding to the voice interaction.
For example, if the user voice request is "open the sunroof", executing the voice interaction may include opening the sunroof.
Second, broadcasting a voice reply.
For example, if the user voice request is "how long until we reach the destination", executing the voice interaction may include broadcasting "30 minutes from the destination".
Third, executing the control instruction corresponding to the voice interaction and broadcasting a voice reply.
For example, if the user voice request is "open the sunroof", executing the voice interaction may include opening the sunroof and broadcasting "sunroof opened". A sketch of this dispatch follows.
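A tiny sketch of the three execution forms; the field and object names (command, reply, vehicle, tts) are assumptions for illustration only.

```python
def execute(target_result, vehicle, tts):
    # form 1: execute the control instruction (e.g. open the sunroof)
    if target_result.command:
        vehicle.run(target_result.command)
    # form 2: broadcast a voice reply (e.g. "30 minutes from the destination")
    if target_result.reply:
        tts.speak(target_result.reply)
    # form 3 is simply both fields being present together
```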
In the invention, the stepped configuration of a multi-channel, semi-stateless end-cloud fusion characteristic can solve the problem that response speed and accuracy cannot be balanced when cloud services and end services cooperate; and the end-cloud fusion management technique can solve the problem that end services and cloud services are completely split.
According to the voice interaction method provided by the invention, the first dialogue results sent by the multiple branches of the local end are graded to obtain their fusion levels, and the target dialogue result finally used to execute the voice interaction is determined based on the fusion levels and on whether the second dialogue result sent by the cloud has been received. This improves the response speed and sensitivity of the voice interaction system while guaranteeing recognition accuracy, enabling extremely fast dialogue and a quicker experience without sacrificing correctness, thereby improving the user experience.
In some embodiments, after step 110 and before step 120, the method may further comprise:
determining, based on the first dialogue result, a frequency level corresponding to the first dialogue result and a credibility corresponding to the first dialogue result;
and determining that the fusion level corresponding to the first dialogue result is the highest level when the credibility is greater than the target threshold and the frequency level is high frequency.
In this embodiment, the frequency level is used to characterize whether the first dialogue result is a high-frequency voice command routinely used by users of the vehicle cabin.
It is understood that in the vehicle-mounted environment, there are high-frequency voice instructions unique to the environment, such as "open window" and "navigate to XX spot".
The target threshold may be user-defined, for example set to 0.99, 0.98, or 0.8; the invention is not limited in this respect, and in actual application an optimal value may be determined based on actual requirements.
Before the grading, high-frequency grading may be performed: the first dialogue result is processed to determine its frequency level and credibility, and the credibility is compared with the target threshold.
It should be noted that, in the present invention, after the fusion level of the first dialogue result is determined to be the highest level based on its frequency level and credibility, the process may jump to step 130 without performing step 120.
For example, when the credibility is higher than 0.99 and the frequency level corresponding to the first dialogue result is high frequency, the fusion level corresponding to the first dialogue result is determined as level A; after the fusion level corresponding to the first dialogue result is determined to be level A, if the second dialogue result sent by the cloud has not been received, the first dialogue result corresponding to level A is determined as the target dialogue result.
In actual implementation, a trained high-frequency grading model may be used to perform the high-frequency grading on the first dialogue result, so as to determine its fusion level.
It will be appreciated that the delay of the high-frequency grading algorithm should be lower than that of the general grading algorithm described above.
As shown in fig. 3, the first dialogue result is input to the high-frequency grading model, the fusion level output by the high-frequency grading model is obtained, and the fusion level is checked.
When the fusion level output by the high-frequency grading model is the highest level, the first dialogue result corresponding to the highest level is directly determined as the target dialogue result;
when the fusion level output by the high-frequency grading model is not the highest level, the first dialogue result is input to the grading model for further grading; the specific grading process has been described in the above embodiments and is not repeated here. A sketch of this cascade follows.
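A minimal sketch of the cascade in fig. 3, assuming the two models are callables that return a level; the function names are illustrative only.

```python
def grade(first_result, high_freq_model, grading_model):
    # the low-latency high-frequency check runs first
    level = high_freq_model(first_result)
    if level == "A":
        # high-frequency command: adopt directly as the target result,
        # skipping the general grading step
        return level
    # otherwise fall through to the full grading model described earlier
    return grading_model(first_result)
```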
According to the voice interaction method provided by the invention, before the grading, high-frequency grading is performed on the first dialogue result to judge whether it is a high-frequency voice command of the user; when the first dialogue result is determined to be a high-frequency voice command in the vehicle-mounted environment, it is determined as the target dialogue result, which further improves the response rate on the basis of guaranteed accuracy.
In some embodiments, determining the frequency level and the credibility corresponding to the first dialogue result based on the first dialogue result may include:
acquiring the text recognition clarity from the text recognition result, and acquiring the first confidence corresponding to the domain, the second confidence corresponding to the intention, and the user voice request from the first dialogue result, where the text recognition result is determined by performing text recognition on the user voice request;
matching the user voice request with a prefix tree to determine the frequency level;
and determining the credibility based on the text recognition clarity, the first confidence, and the second confidence.
In this embodiment, the text recognition clarity is the ASR clarity.
The first confidence corresponding to the domain and the second confidence corresponding to the intention are the domain and intention confidences produced by the NLU.
The user voice request is the Query.
For example, in actual implementation, the NLU domain and intention confidences, the ASR clarity, and the Query may be input into the high-frequency grading model; a prefix tree is used to match whether the Query belongs to the high-frequency set, so as to determine the frequency level; and the credibility is calculated by the following formula.
The credibility is computed from the text recognition clarity, the first confidence, and the second confidence, where a, b, and c are adjustment factors (floating-point scalars). (The formula itself is given only as an image in the original publication.)
In some examples, when the credibility is greater than 0.99 and the frequency level belongs to high frequency, the fusion level is determined as level A; otherwise a default level D is assigned. A sketch of this check follows.
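A minimal sketch of the high-frequency check, assuming a plain character trie over known high-frequency queries; since the credibility formula appears only as an image, `credibility` is passed in as a placeholder callable, and the example phrases are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class PrefixTree:
    def __init__(self, phrases):
        self.root = TrieNode()
        for phrase in phrases:
            node = self.root
            for ch in phrase:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def contains(self, query):
        # exact match here; a production system might also match
        # templates or prefixes ("navigate to ...")
        node = self.root
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

HIGH_FREQ = PrefixTree(["open the window"])  # example phrases only

def high_frequency_grade(query, clarity, domain_conf, intent_conf,
                         credibility, threshold=0.99):
    # credibility stands in for the (image-only) formula combining
    # clarity and the two confidences with adjustment factors a, b, c
    is_high_freq = HIGH_FREQ.contains(query)
    cred = credibility(clarity, domain_conf, intent_conf)
    return "A" if cred > threshold and is_high_freq else "D"
```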
According to the voice interaction method provided by the invention, matching the user voice request with a prefix tree to determine the frequency level, and determining the credibility based on the text recognition clarity, the first confidence, and the second confidence, yields high accuracy and precision.
In some embodiments, after step 150, the method may further comprise: updating the initial context information corresponding to the user voice request based on the target dialogue result, where the initial context information is determined by the local end or the cloud based on the user voice request.
In this embodiment, the initial context information includes information such as the initially identified domain, intention, and state.
As shown in fig. 4, the first dialogue result obtained after NLU processing is graded, and after a certain path's result is adopted by the vehicle-mounted large-screen executor, an adoption signal (i.e., the target dialogue result) and the adoption details are sent to the end-cloud fusion management system deployed on the vehicle-mounted large screen, which performs state and information fusion.
The result returned after adoption by the large-screen executor includes information such as the domain, the intention, the state (continue/end), the adoption source (end/cloud), and the channel ID.
The context is updated by reading, from the temporary storage area, the context closest to the adopted result: if the local end's recognition result was adopted, it is read according to the channel ID; if the cloud's recognition result was adopted, any end result is read. The returned information is used to verify the initial context information.
Finally, the task stack is updated based on the adoption state. If the adoption state = end, the voice assistant exits and the entire dialogue management system is emptied;
if the adoption state = continue, it is judged whether the current user voice request is a new task; if it is a new task, the old task exits and the new task is entered; if it is an old task, information such as the domain and the intention is updated. A sketch of this flow follows.
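A sketch of the post-adoption update flow under the description above; all object and method names (staging.get_by_channel, task_stack.is_new_task, etc.) are assumptions for illustration, not APIs from the patent.

```python
def on_adoption(adopted, staging, task_stack, dialogue_manager):
    # adopted carries domain, intention, state (continue/end),
    # adoption source (end/cloud), and channel ID
    if adopted.source == "end":
        ctx = staging.get_by_channel(adopted.channel_id)
    else:                              # cloud result adopted
        ctx = staging.get_any_end_result()
    ctx.verify_with(adopted)           # check the initial context info

    if adopted.state == "end":
        dialogue_manager.clear()       # exit the voice assistant
        task_stack.clear_all()         # empty the whole dialogue system
    elif adopted.state == "continue":
        if task_stack.is_new_task(adopted):
            task_stack.pop()           # exit the old task
            task_stack.push(adopted)   # enter the new task
        else:
            task_stack.top().update(domain=adopted.domain,
                                    intent=adopted.intent)
```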
According to the voice interaction method provided by the invention, updating the initial context information corresponding to the user voice request based on the target dialogue result keeps the end and cloud information consistent through timely updates, on top of the fast response already achieved.
The following describes an implementation of the present invention in a multi-round dialogue scenario.
In some embodiments, when the user voice request includes a plurality of consecutive sub-voice requests, step 120 may include:
determining the fusion level corresponding to the second target sub-dialogue result as the second level when the first target sub-dialogue result was identified by the cloud;
determining the fusion level corresponding to the second target sub-dialogue result as the highest level when the first target sub-dialogue result was identified by the local end;
wherein the first target sub-dialogue result is the target dialogue result corresponding to the first target sub-voice request, the second target sub-dialogue result is the first dialogue result corresponding to the second target sub-voice request, and the first target sub-voice request is the sub-voice request that immediately precedes the second target sub-voice request among the plurality of consecutive sub-voice requests.
In this embodiment, the first target sub-voice request is the sub-voice request executed in the previous round of a multi-round dialogue scenario. It can be understood that the first target sub-dialogue result may be a second dialogue result output by the cloud, or a first dialogue result output by a branch of the local end.
The second target sub-voice request is the sub-voice request to be executed in the current round of the multi-round dialogue scenario. The grading of the second target sub-dialogue result is determined based on which side's result was adopted in the previous round.
It should be noted that, in a multi-round dialogue scenario, after the preliminary grading result is obtained, the grading needs to be adjusted according to the multi-round task situation before being sent to the vehicle-mounted large-screen executor, which ensures that end-cloud services of different architectures and systems are well compatible and reduces the implementation difficulty.
For example, in a multi-round dialogue scenario, if the previous round adopted a first dialogue result output by the local end, the fusion level of the current round is forcibly determined as level A;
if the previous round adopted a second dialogue result output by the cloud, the fusion level of the current round is forcibly determined as level B.
The finally determined grading result is sent to the vehicle-mounted large-screen executor for execution, for example as sketched below.
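A minimal sketch of this multi-round adjustment rule; the function and argument names are assumptions for illustration.

```python
def adjust_fusion_level(prev_round_source, preliminary_level):
    # force the current round's level based on which side was adopted
    # in the previous round, before sending to the on-board executor
    if prev_round_source == "end":     # local result adopted last round
        return "A"                     # highest level
    if prev_round_source == "cloud":   # cloud result adopted last round
        return "B"                     # second level
    return preliminary_level           # otherwise keep the grading
```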
In the invention, the dynamic grading technique can solve the problem of the simple cooperation mode in the prior art, where service is degraded to the second-level service only when the first-level service cannot be adopted at all.
According to the voice interaction method provided by the invention, adjusting the fusion level of the next round based on the result of the previous round in a multi-round dialogue scenario achieves dynamic grading of the fusion level, so that end-cloud services of different architectures and systems are better compatible and the implementation difficulty is reduced.
In some embodiments, after the grading result is sent to the vehicle-mounted large-screen executor and before the end-cloud fusion, the result can be held in a temporary storage queue.
The temporary storage does not update the context and can accept multiple requests with the same msgId from different channels, as shown in fig. 5.
If msgId1 starts processing but msgId2 is adopted before msgId1 finishes and is adopted, msgId1 is discarded; since the contexts of msgId1 and msgId2 are the same, msgId1 is not used to update the context (i.e., the adoption signal is the condition for a context update). A sketch of this rule follows.
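A sketch of the staging-queue rule above; the class shape, the assumption that msgIds are ordered and comparable, and the context API are all illustrative, not taken from the patent.

```python
class StagingQueue:
    """Holds graded results after they are sent to the executor and
    before end-cloud fusion; staging alone never updates the context."""
    def __init__(self):
        self.pending = {}                      # msg_id -> staged results

    def stage(self, msg_id, result):
        # the same msgId may arrive from several different channels
        self.pending.setdefault(msg_id, []).append(result)

    def adopt(self, msg_id, context):
        adopted = self.pending.pop(msg_id, None)
        # earlier msgIds sharing this context are discarded unadopted:
        # only an adoption signal triggers a context update
        for stale in [m for m in self.pending if m < msg_id]:
            del self.pending[stale]
        if adopted:
            context.update_from(adopted[0])    # hypothetical context API
        return adopted
```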
In some embodiments, when the user voice request includes a plurality of consecutive sub-voice requests, the method may further include:
receiving an interrupt signal sent by a target device, where the interrupt signal includes the interrupted task ID;
and in response to the interrupt signal, clearing the task stack corresponding to the task ID and the context information corresponding to the task ID.
In this embodiment, Table 2 illustrates the grading result and the processing result in the case of an interruption.
TABLE 2
(Table 2 is provided as an image in the original publication.)
It can be understood that in a multi-round dialogue situation an abnormal interruption may occur, that is, some operation makes the current task unable to continue; an interruption signal and the interruption details then need to be broadcast by the interruption source, so that each system or module maintaining multi-round state can update its state.
The target device can be any vehicle-mounted device, such as a vehicle-mounted large screen or a UI manager of the vehicle-mounted large screen.
For example, when a task is interrupted by a UI change, the UI manager of the vehicle-mounted large screen sends an interrupt signal to the end-cloud fusion management system to notify it of the interrupted task ID; the end-cloud fusion management system receives the task ID, judges whether the corresponding task in the task stack needs to be destroyed, and, when it determines that the task stack needs to be emptied, clears the task stack corresponding to the task ID and the context information corresponding to the task ID, for example as sketched below.
When a task is interrupted by a voice intention switch, the end-cloud fusion management system sends an interrupt signal to the vehicle-mounted large-screen UI manager to notify it of the interrupted task ID; the vehicle-mounted large-screen UI manager judges whether the cards displayed by the UI and the other resources they maintain need to be destroyed.
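A minimal sketch of the interrupt handling on the end-cloud fusion side; the method names (contains, should_destroy, clear) are assumptions for illustration.

```python
def on_interrupt(signal, task_stack, contexts):
    # signal.task_id identifies the interrupted multi-round task
    task_id = signal.task_id
    if task_stack.contains(task_id) and task_stack.should_destroy(task_id):
        task_stack.clear(task_id)     # empty the task stack for this task
        contexts.pop(task_id, None)   # drop its context information
```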
According to the voice interaction method provided by the invention, in a multi-round dialogue scenario, when an abnormal interruption occurs, an interrupt signal including the interrupted task ID is sent so that the task stack corresponding to the task ID and the context information corresponding to the task ID are cleared, which effectively maintains the working state of each system and ensures the normal operation of the voice interaction method.
In the voice interaction method provided by the invention, the execution subject may be a voice interaction device. The voice interaction device provided by the invention is described below, taking as an example the voice interaction device executing the voice interaction method.
The invention also provides a voice interaction device.
As shown in fig. 6, the voice interaction apparatus includes: a first obtaining module 610, a first processing module 620, a second processing module 630, a third processing module 640, and a fourth processing module 650.
A first obtaining module 610, configured to obtain at least one first dialogue result determined by the local end; the local end comprises at least one branch, each branch determines its corresponding first dialogue result based on a user voice request from the vehicle cabin, and the branches have different computation delays;
a first processing module 620, configured to grade the first dialogue results and determine the fusion level corresponding to each first dialogue result;
a second processing module 630, configured to determine the first dialogue result corresponding to the highest level as the target dialogue result when no second dialogue result sent by the cloud has been received and the fusion level corresponding to the first dialogue result is determined to be the highest level;
a third processing module 640, configured to determine the second dialogue result as the target dialogue result when the second dialogue result sent by the cloud is received;
a fourth processing module 650, configured to execute voice interaction according to the target dialogue result;
wherein the second dialogue result is determined by the cloud based on the user voice request.
According to the voice interaction apparatus provided by the invention, hierarchical processing of the first dialog results produced by the multiple local branches yields their fusion levels, and the target dialog result ultimately used to perform the voice interaction is determined based on those fusion levels and on whether a second dialog result sent by the cloud has been received. This guarantees recognition accuracy while improving the response speed and sensitivity of the voice interaction system, delivering an accurate yet near-instant dialog and improving the user experience.
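For illustration only, the arbitration performed by the second and third processing modules might be sketched as follows; the function choose_target_result and the constant HIGHEST_LEVEL are assumptions for this sketch, not names from the patent:

```python
from typing import Optional

HIGHEST_LEVEL = 3  # assumed numeric value for the highest fusion level

def choose_target_result(first_results: list,
                         cloud_result: Optional[dict]) -> Optional[dict]:
    """Pick the target dialog result from local branch results and an
    optional cloud result, per the arbitration described above."""
    if cloud_result is not None:
        # A received cloud (second) dialog result is always adopted.
        return cloud_result
    for result in first_results:
        # With no cloud result yet, a local (first) dialog result graded
        # at the highest fusion level is adopted without further waiting.
        if result.get("fusion_level") == HIGHEST_LEVEL:
            return result
    return None  # otherwise keep waiting for the cloud result
```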
In some examples, in a case that the user voice request includes a plurality of consecutive sub voice requests, the first processing module 620 may be further configured to:
determining that the fusion level corresponding to the second target sub-dialog result is the second level when the first target sub-dialog result is a result recognized by the cloud;
determining that the fusion level corresponding to the second target sub-dialog result is the highest level when the first target sub-dialog result is a result recognized by the local end;
wherein the first target sub-dialog result is the target dialog result corresponding to a first target sub-voice request, the second target sub-dialog result is the first dialog result corresponding to a second target sub-voice request, and the first target sub-voice request is the sub-voice request that precedes and is adjacent to the second target sub-voice request among the plurality of consecutive sub-voice requests.
In some examples, the first processing module 620 may be further configured to:
performing hierarchical processing on the first dialog result based on at least one of the text recognition result and the first dialog result, and determining the fusion level corresponding to the first dialog result;
wherein the text recognition result is determined by performing text recognition on the user voice request, and the first dialog result is determined by performing semantic understanding on the text recognition result.
In some examples, the first processing module 620 may be further configured to:
acquiring the text recognition confidence and the text recognition clarity from the text recognition result, and acquiring, from the first dialog result, the domain, a first confidence corresponding to the domain, the intention, and a second confidence corresponding to the intention;
and determining the fusion level corresponding to the first dialog result based on at least two of the text recognition confidence, the text recognition clarity, the domain, the first confidence, the intention, and the second confidence.
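As one possible way to combine these signals (the thresholds and the min-based scoring below are assumptions; the patent only requires that at least two of the signals be used):

```python
def grade_fusion_level(asr_confidence: float, asr_clarity: float,
                       domain_confidence: float, intent_confidence: float) -> int:
    """Map several recognition signals onto a discrete fusion level."""
    # Conservative aggregation: the weakest signal bounds the grade.
    score = min(asr_confidence, asr_clarity, domain_confidence, intent_confidence)
    if score >= 0.9:
        return 3  # highest level: the local result can be acted on at once
    if score >= 0.7:
        return 2  # second level: prefer to wait for the cloud result
    return 1      # low level: defer to the cloud result
```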
In some examples, the apparatus may further include:
a fifth processing module, configured to determine, based on the first dialog result, the frequency level corresponding to the first dialog result and the credibility corresponding to the first dialog result, after the at least one path of first dialog results determined by the local end is obtained and before the first dialog results are hierarchically processed to determine the fusion level corresponding to each of them;
and a sixth processing module, configured to determine that the fusion level corresponding to the first dialog result is the highest level when the credibility is greater than a target threshold and the frequency level is the highest frequency.
In some examples, the fifth processing module may be further configured to:
acquiring the text recognition clarity from the text recognition result, and acquiring, from the first dialog result, the first confidence corresponding to the domain, the second confidence corresponding to the intention, and the user voice request; the text recognition result is determined by performing text recognition on the user voice request;
matching the user voice request against a prefix tree to determine the frequency level;
and determining the credibility based on the text recognition clarity, the first confidence, and the second confidence.
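The prefix-tree matching step could be sketched as follows; the trie layout and the numeric frequency levels are illustrative assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.frequency_level = 0  # non-zero only where a known request ends

class FrequencyTrie:
    """Hypothetical prefix tree of frequently issued voice requests."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase: str, level: int) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.frequency_level = level

    def match(self, request: str) -> int:
        """Walk the trie character by character; an exact match returns
        the stored frequency level, anything else the lowest level."""
        node = self.root
        for ch in request:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.frequency_level

# Usage sketch:
# trie = FrequencyTrie()
# trie.insert("open the window", 3)  # 3: assumed highest frequency level
# trie.match("open the window")      # -> 3
```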
In some examples, the apparatus may further include a seventh processing module, configured to update, after the voice interaction is performed according to the target dialog result, the initial context information corresponding to the user voice request based on the target dialog result, the initial context information being determined by the local end or the cloud based on the user voice request.
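A minimal sketch of that update step, assuming the context is a per-request dictionary of slots (the field names below are invented for illustration):

```python
def update_context(contexts: dict, request_id: str, target_result: dict) -> None:
    """Merge the adopted target dialog result back into the initial
    context recorded for this user voice request."""
    ctx = contexts.setdefault(request_id, {})
    ctx.update(target_result.get("slots", {}))       # carry slots forward
    ctx["last_intent"] = target_result.get("intent")  # remember the intent
```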
In some examples, the apparatus may further include:
a seventh processing module, configured to receive, when the user voice request includes a plurality of consecutive sub-voice requests, an interrupt signal sent by the target device, the interrupt signal including the interrupted task ID;
and an eighth processing module, configured to clear, in response to the interrupt signal, the task stack corresponding to the task ID and the context information corresponding to the task ID.
The voice interaction apparatus in the present invention may be an electronic device, or a component in an electronic device such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal; for example, it may be a vehicle or an in-vehicle head unit, which the invention does not specifically limit.
The voice interaction apparatus in the invention can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not limited by the present invention.
The voice interaction device provided by the present invention can implement each process implemented by the method examples of fig. 1 to fig. 5, and is not described here again to avoid repetition.
In some examples, as shown in fig. 7, the present invention further provides a vehicle 700, including a processor 701, a memory 702, and a computer program stored on the memory 702 and executable on the processor 701. When executed by the processor 701, the computer program implements each process of the above voice interaction method example and achieves the same technical effects; to avoid repetition, details are not repeated here.
The present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements each process of the above voice interaction method example and achieves the same technical effects; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above example. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described voice interaction method.
The invention further provides a chip, including a processor and a communication interface coupled with the processor, the processor being configured to run a program or instructions to implement each process of the above voice interaction method example and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be understood that the chip referred to in the present invention may also be called a system-on-chip, a chip system, or a system-on-a-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present invention is not limited to performing functions in the order illustrated or discussed; functions may also be performed substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the foregoing embodiments, those skilled in the art will clearly understand that the above exemplary method can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, although in many cases the former is the preferable implementation. Based on such an understanding, the technical solution of the present invention may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), including instructions that enable a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the methods described in the examples of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
In the description of the specification, reference to the terms "one example," "some examples," "illustrative examples," "example," "specific example," or the like means that a particular feature, structure, material, or characteristic described in connection with that example is included in at least one example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more examples.
While examples of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to these examples without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (11)

1. A method of voice interaction, comprising:
obtaining at least one path of first conversation results determined by a local terminal; the local end comprises at least one branch, each branch determines a first conversation result corresponding to the branch based on a user voice request of a vehicle cabin, and the calculation time delay of each branch is different;
carrying out grading processing on the first dialogue results, and determining fusion grades corresponding to the first dialogue results;
under the condition that a second conversation result sent by a cloud is not received and the fusion grade corresponding to the first conversation result is determined to be the highest grade, determining the first conversation result corresponding to the highest grade as a target conversation result;
under the condition that a second conversation result sent by a cloud is received, determining the second conversation result as a target conversation result;
executing voice interaction according to the target conversation result;
wherein the second dialogue result is determined by the cloud based on the user voice request.
2. The method according to claim 1, wherein in a case that the user voice request includes a plurality of consecutive sub-voice requests, the step of performing a hierarchical processing on the first dialog results to determine a fusion level corresponding to each of the first dialog results includes:
determining that the fusion grade corresponding to the second target sub-dialog result is the second grade under the condition that the first target sub-dialog result is the result identified by the cloud;
determining that the fusion grade corresponding to the second target sub-dialogue result is the highest grade under the condition that the first target sub-dialogue result is the result of the local end recognition;
the first target sub-dialog result is a target dialog result corresponding to a first target sub-voice request, the second target sub-dialog result is a first dialog result corresponding to a second target sub-voice request, and the first target sub-voice request is a sub-voice request which is located before the second target sub-voice request and is adjacent to the second target sub-voice request in the plurality of continuous sub-voice requests.
3. The method of claim 1, wherein the step of performing hierarchical processing on the first dialog results to determine a fusion level corresponding to each of the first dialog results comprises:
performing grading processing on the first dialogue result based on at least one of a text recognition result and the first dialogue result, and determining a fusion grade corresponding to the first dialogue result;
the text recognition result is determined by performing text recognition on the user voice request, and the first dialogue result is determined by performing semantic understanding on the text recognition result.
4. The method according to claim 3, wherein the determining a fusion level corresponding to the first dialog result by performing a ranking process on the first dialog result based on at least one of a text recognition result and the first dialog result comprises:
acquiring a text recognition confidence coefficient and a text recognition definition in the text recognition result, and acquiring a field in the first dialogue result, a first confidence coefficient corresponding to the field, an intention and a second confidence coefficient corresponding to the intention;
determining a fusion level corresponding to the first dialogue result based on at least two of the text recognition confidence, the text recognition definition, the domain, the first confidence, the intention, and the second confidence.
5. The method according to any one of claims 1 to 4, wherein after the obtaining of the at least one path of first dialog results determined by the local end, and before the performing of the hierarchical processing on the first dialog results and determining the fusion level corresponding to each first dialog result, the method further comprises:
determining a frequency level corresponding to the first dialogue result and a credibility corresponding to the first dialogue result based on the first dialogue result;
and under the condition that the reliability is greater than a target threshold value and the frequency level is the highest frequency, determining that the fusion level corresponding to the first conversation result is the highest level.
6. The method of claim 5, wherein the determining the frequency level and the confidence level of the first dialog result based on the first dialog result comprises:
acquiring text recognition definition in a text recognition result, and acquiring a first confidence coefficient corresponding to a field, a second confidence coefficient corresponding to an intention and the user voice request in the first dialogue result; the text recognition result is determined by performing text recognition on the user voice request;
matching the user voice request by adopting a prefix tree, and determining the frequency grade;
determining the confidence level based on the text recognition sharpness, the first confidence level, and the second confidence level.
7. The method of any one of claims 1-4, wherein after the performing a voice interaction with the target dialog result, the method further comprises:
updating initial context information corresponding to the user voice request based on the target conversation result, wherein the initial context information is determined by the local end or the cloud end based on the user voice request.
8. A voice interaction method according to any one of claims 1 to 4, characterised in that in the case where the user voice request comprises a plurality of consecutive sub voice requests, the method further comprises:
receiving an interrupt signal sent by target equipment, wherein the interrupt signal comprises an interrupted task ID;
and responding to the interrupt signal, and clearing the task stack corresponding to the task ID and the context information corresponding to the task ID.
9. A voice interaction apparatus, comprising:
the first obtaining module is used for obtaining at least one path of first conversation results determined by the local terminal; the local end comprises at least one branch, each branch determines a first conversation result corresponding to the branch based on a user voice request of a vehicle cabin, and the calculation time delay of each branch is different;
the first processing module is used for carrying out grading processing on the first dialogue results and determining fusion grades corresponding to the first dialogue results;
the second processing module is used for determining the first dialogue result corresponding to the highest level as a target dialogue result under the condition that a second dialogue result sent by a cloud end is not received and the fusion level corresponding to the first dialogue result is determined to be the highest level;
the third processing module is used for determining a second dialogue result as a target dialogue result under the condition of receiving the second dialogue result sent by the cloud end;
the fourth processing module is used for executing voice interaction according to the target conversation result;
wherein the second dialogue result is determined by the cloud based on the user voice request.
10. A vehicle comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voice interaction method of any of claims 1-8 when executing the program.
11. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for voice interaction according to any one of claims 1 to 8.
CN202211332377.8A 2022-10-28 2022-10-28 Voice interaction method, voice interaction device, vehicle and readable storage medium Active CN115394300B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211332377.8A CN115394300B (en) 2022-10-28 2022-10-28 Voice interaction method, voice interaction device, vehicle and readable storage medium


Publications (2)

Publication Number Publication Date
CN115394300A true CN115394300A (en) 2022-11-25
CN115394300B CN115394300B (en) 2023-03-31

Family

ID=84114893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211332377.8A Active CN115394300B (en) 2022-10-28 2022-10-28 Voice interaction method, voice interaction device, vehicle and readable storage medium

Country Status (1)

Country Link
CN (1) CN115394300B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804833A (en) * 2005-01-11 2006-07-19 台达电子工业股份有限公司 Dialogue interactive multimedia entertainment system
CN102138175A (en) * 2008-07-02 2011-07-27 谷歌公司 Speech recognition with parallel recognition tasks
CN103632666A (en) * 2013-11-14 2014-03-12 华为技术有限公司 Voice recognition method, voice recognition equipment and electronic equipment
CN105824857A (en) * 2015-01-08 2016-08-03 中兴通讯股份有限公司 Voice search method, device and terminal
CN106328148A (en) * 2016-08-19 2017-01-11 上汽通用汽车有限公司 Natural speech recognition method, natural speech recognition device and natural speech recognition system based on local and cloud hybrid recognition
CN109408799A (en) * 2018-08-14 2019-03-01 优视科技(中国)有限公司 Semantic decision-making technique and system
CN109785840A (en) * 2019-03-05 2019-05-21 湖北亿咖通科技有限公司 The method, apparatus and vehicle mounted multimedia host, computer readable storage medium of natural language recognition
CN112447173A (en) * 2019-08-16 2021-03-05 阿里巴巴集团控股有限公司 Voice interaction method and device and computer storage medium
CN113314124A (en) * 2021-06-15 2021-08-27 宿迁硅基智能科技有限公司 Text output method and system, storage medium and electronic device
CN113782025A (en) * 2021-09-27 2021-12-10 北京声智科技有限公司 Voice recognition method, device, terminal and storage medium
CN114097025A (en) * 2019-06-04 2022-02-25 谷歌有限责任公司 Two-pass end-to-end speech recognition
CN114203169A (en) * 2022-01-26 2022-03-18 合肥讯飞数码科技有限公司 Method, device and equipment for determining voice recognition result and storage medium
WO2022057152A1 (en) * 2020-09-18 2022-03-24 广州橙行智动汽车科技有限公司 Voice interaction method, server, and computer-readable storage medium
CN114724564A (en) * 2020-12-18 2022-07-08 阿里巴巴集团控股有限公司 Voice processing method, device and system
CN115188377A (en) * 2022-07-04 2022-10-14 思必驰科技股份有限公司 Voice interaction method, electronic device and storage medium


Also Published As

Publication number Publication date
CN115394300B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
US11854570B2 (en) Electronic device providing response to voice input, and method and computer readable medium thereof
US9674328B2 (en) Hybridized client-server speech recognition
JP7130194B2 (en) USER INTENTION RECOGNITION METHOD, APPARATUS, ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM AND COMPUTER PROGRAM
US20200043485A1 (en) Dynamic adjustment of response thresholds in a dialogue system
CN111105800B (en) Voice interaction processing method, device, equipment and medium
EP2757493A2 (en) Natural language processing method and system
CN110956955B (en) Voice interaction method and device
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
CN111261151A (en) Voice processing method and device, electronic equipment and storage medium
US20230058437A1 (en) Method for human-computer interaction, apparatus for human-computer interaction, device, and storage medium
CN109360565A (en) A method of precision of identifying speech is improved by establishing resources bank
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN114550718A (en) Hot word speech recognition method, device, equipment and computer readable storage medium
CN112579031A (en) Voice interaction method and system and electronic equipment
CN111292749B (en) Session control method and device of intelligent voice platform
CN110838284B (en) Method and device for processing voice recognition result and computer equipment
CN115394300B (en) Voice interaction method, voice interaction device, vehicle and readable storage medium
KR20190074508A (en) Method for crowdsourcing data of chat model for chatbot
CN107180027B (en) Voice control service classification method and device
CN115240659A (en) Classification model training method and device, computer equipment and storage medium
CN115662430B (en) Input data analysis method, device, electronic equipment and storage medium
CN105824857A (en) Voice search method, device and terminal
US11893996B1 (en) Supplemental content output
WO2022270603A1 (en) A system and method for delivering domain or use-case switch suggestion for an ongoing conversation
WO2022226715A1 (en) Hybrid text to speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant