Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a voice dialog processing method for a voice dialog platform according to an embodiment of the present invention, which includes the following steps:
S11: acquiring, through speech recognition and understanding, the n most probable semantic results for voice data input by a user;
S12: when n is greater than 1, determining the domain to which each semantic result relates, and judging whether the semantic slot corresponding to each semantic result is a key semantic slot in that domain;
S13: adding the m semantic results having key semantic slots to a disambiguation candidate list, where m is less than or equal to n;
S14: when m is greater than 1, automatically disambiguating the disambiguation candidate list according to existing resources to obtain l semantic results, where the existing resources include historical context information, historical disambiguation records, voice dialog platform resources and/or a customized disambiguation rule base.
In this embodiment, the method can be applied to a device with voice interaction, for example a smart speaker or a smart phone. For example, when a user wants to play a piece of audio through the smart speaker, the user can input voice directly to the smart speaker.
For step S11, the speech data input by the user is processed by ASR (Automatic Speech Recognition) and by the corresponding NLU (Natural Language Understanding), so as to obtain the n most probable ASR hypotheses (also called the n-best or top-n input) and their corresponding semantic parsing results.
In step S12, the domain to which each semantic result relates is determined, and it is judged whether the semantic slot corresponding to each semantic result is a key semantic slot in that domain. A key semantic slot is a slot that carries the primary meaning for a certain domain when a sentence is parsed into a semantic result. Since a semantic result can contain a great deal of information, a large number of inaccurate semantic parsing results would be produced if key semantic slots were not defined.
Take, for example, "play five minutes of Zhou Jielun's music".
The parse in the music domain is:
"operation" = "play", "singer" = "Zhou Jielun", "duration" = "five minutes"
while in the radio domain it is parsed as:
"operation" = "play", "column" = "five minutes", "keyword" = "Zhou Jielun"
Although the parse in the radio domain is also correct, radio column names come in many varieties, so the "column" slot is assigned a relatively low importance when slot importance is configured. Compared with the singer name in the music domain, the radio "column" slot does not meet the key-semantic-slot requirement for an ambiguity candidate, while "singer" is a key semantic slot in the music domain. The music-domain parse is therefore retained, and the radio-domain parse is filtered out directly.
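The key-slot check of steps S12/S13 can be sketched as follows. The domain names, slot names and key-slot table here are illustrative assumptions, not the platform's actual configuration:

```python
# Hypothetical key-slot table; slot and domain names are assumptions
# for illustration only.
KEY_SLOTS = {
    "music": {"singer", "song_name"},
    "radio": {"station"},          # "column" is deliberately not a key slot
    "story": {"story_name"},
}

def has_key_slot(domain, slots):
    """Return True if any filled slot is a key semantic slot of the domain."""
    return any(name in KEY_SLOTS.get(domain, set()) for name in slots)

def build_candidates(parses):
    """Keep only the parses whose slots include a key semantic slot (step S13)."""
    return [p for p in parses if has_key_slot(p["domain"], p["slots"])]

parses = [
    {"domain": "music", "slots": {"singer": "Zhou Jielun", "duration": "five minutes"}},
    {"domain": "radio", "slots": {"column": "five minutes", "keyword": "Zhou Jielun"}},
]
candidates = build_candidates(parses)  # only the music parse survives
```

With this configuration the radio parse is filtered out before disambiguation, matching the example above.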
For another example, take "play Little Red Riding Hood".
The parse in the music domain is:
"operation" = "play", "song name" = "Little Red Riding Hood"
and the parse in the story domain is:
"operation" = "play", "story name" = "Little Red Riding Hood"
Because "story name" is a key semantic slot in the story domain and "song name" is a key semantic slot in the music domain, both semantic parsing results pass this initial ambiguity detection at the same time.
For step S13, continuing with "play Little Red Riding Hood": since both semantic parsing results pass the initial ambiguity detection in step S12 at the same time, the n = 2 most probable ASR hypotheses yield m = 2 results after preliminary ambiguity detection, and both semantic parsing results are added to the disambiguation candidate list.
For step S14, if m determined in step S13 is 1, the preliminary disambiguation has already produced a single semantic result, and the corresponding instruction is executed directly.
If m > 1 is determined in step S13, the disambiguation candidate list is processed according to the existing resources. Automatic disambiguation can be performed using the historical multi-round context information, the data services covering the related resources, the customized automatic disambiguation rule base, and the historical disambiguation records, thereby disambiguating the semantic results in the disambiguation candidate list. When only one semantic result remains in the list, the semantic result corresponding to the user's voice input is determined and the corresponding operation is performed.
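The resource-driven narrowing of step S14 can be sketched as a chain of filters. The filter ordering and the rule that an empty result never replaces the list are assumptions for illustration, not the patented implementation:

```python
def auto_disambiguate(candidates, *filters):
    """Apply each resource-based filter in turn until one candidate remains.

    A filter that would eliminate every candidate is skipped, so the
    automatic stage (step S14) never empties the list.
    """
    for keep in filters:
        if len(candidates) <= 1:
            break
        narrowed = [c for c in candidates if keep(c)]
        if narrowed:
            candidates = narrowed
    return candidates

# Example: a context filter derived from an earlier "I want to listen to a story"
prefer_story = lambda c: c["domain"] == "story"
remaining = auto_disambiguate(
    [{"domain": "music"}, {"domain": "story"}],
    prefer_story,
)
```

In this sketch the history record, context, rule base and resource search would each be passed in as one filter callable.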
According to the embodiment, the semantic slot position information contained in the semantic parsing result is fully utilized through the dialogue ambiguity detection, different importance is set for the semantic slots in different semantic fields, and only when the key semantic slot is ambiguous, an automatic disambiguation mechanism is introduced, so that the false ambiguity caused by the semantic parsing can be automatically filtered.
As an implementation manner, in this embodiment, the method further includes:
when automatically disambiguating the disambiguation candidate list according to the existing resources still yields more than one semantic result, feeding those semantic results back to the user for confirmation;
when a user inputs a confirmation instruction corresponding to the feedback, determining a semantic result corresponding to the voice data input by the user, and executing corresponding operation;
and when the user inputs an abnormal instruction, feeding back abnormal prompt information.
In the present embodiment, after automatic disambiguation, if the disambiguation candidate list has more than 1 semantic result, feedback is given to the user.
For example, continue with "play Little Red Riding Hood" from the above embodiments. If two semantic results in the ambiguity candidate list still need disambiguation after automatic disambiguation finishes, the ambiguity detection module sets the disambiguation flag information and enters a listening state, and then calls the disambiguation processing module; according to the status flag bits set by the detection module, the disambiguation processing module returns the TTS (Text To Speech) prompt to be broadcast, for example "I found a children's story and a song for you; which one would you like to listen to?", to guide the user to choose.
When a new round of input arrives, the disambiguation processing module judges whether the user's input is a selection or a new task; if it is a new task, the module directly exits the disambiguation operation and executes the new task.
If the user makes a selection, it must be determined whether the user's answer is abnormal. If it is abnormal, an abnormal-prompt flow is entered to ask the user to choose again, and the listening state is re-entered; if not, the semantic result selected by the user is taken as the final semantic result and the related operation is executed.
Since the user is replying by voice, it is quite likely that the user will not answer exactly as prompted when the disambiguation module offers guidance, so further handling is needed.
For example, when the TTS content is "I found a children's story and a song for you; which one do you want to listen to?",
the user may reply with:
"I want to hear the children's story", "children's story", "music", "song", "the first", "the second", "the former", "the latter", "not the story", "not the music".
There is also the possibility of an unexpected utterance such as "joke" or "the sixth".
Meanwhile, to ensure that the disambiguation module does not interfere with the user switching intention, when the user directly speaks a new task such as "forget it, play ...", the disambiguation module jumps according to the semantic result of the newly input voice.
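The dispatch of a pending reply (selection, new task, or abnormal utterance) might look like the sketch below. The ordinal words and the negation handling are assumptions for illustration; a real system would reuse the NLU output rather than raw strings:

```python
def handle_reply(reply, candidates, is_new_task):
    """Classify a reply: ('execute', parse), ('new_task', None) or ('abnormal', None)."""
    if is_new_task(reply):                        # user switched intention entirely
        return ("new_task", None)
    ordinal = {"first": 0, "front": 0, "second": 1, "rear": 1}
    if reply in ordinal and ordinal[reply] < len(candidates):
        return ("execute", candidates[ordinal[reply]])
    negated = reply.startswith("not ")
    for cand in candidates:                       # match by domain word, e.g. "story"
        if cand["domain"] in reply:
            if negated:                           # "not story" -> the other candidate
                rest = [c for c in candidates if c is not cand]
                return ("execute", rest[0]) if len(rest) == 1 else ("abnormal", None)
            return ("execute", cand)
    return ("abnormal", None)                     # e.g. "joke", "the sixth"

cands = [{"domain": "story"}, {"domain": "music"}]
```

Anything that matches neither an ordinal, a candidate domain, nor a new task falls through to the abnormal branch described next.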
When the user speaks an abnormal utterance, in addition to returning an abnormal TTS prompt (e.g., "Sorry, I didn't catch that. Do you want to listen to music or a story?"), the number of abnormal replies is recorded; when it exceeds a certain threshold (e.g., twice), the system prompts the user with a sentence such as "Still didn't catch that; please try again later."
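The escalating abnormal-reply prompts can be sketched as a small counter. The threshold of two comes from the example above, while the prompt strings are paraphrased assumptions:

```python
class AbnormalPrompter:
    """Track abnormal replies and escalate the prompt past a threshold."""

    def __init__(self, limit=2):
        self.count = 0
        self.limit = limit

    def prompt(self):
        """Return the re-prompt text, escalating once the limit is exceeded."""
        self.count += 1
        if self.count > self.limit:
            return "Still didn't catch that; please try again later."
        return "Sorry, I didn't understand. Do you want music or a story?"
```

The first `limit` calls return the normal re-prompt; subsequent calls return the escalation message.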
While waiting for the user to choose, the disambiguation system sets a validity duration on listening, to guard against the user not answering for a long time or having left. When the user does not answer within the validity period, the system may perform different operations according to a predetermined configuration, for example:
selecting the NLU hypothesis with the highest probability by default and prompting the user with a TTS such as "About to play the story Little Red Riding Hood for you"; or prompting the user to rephrase and turning off listening.
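The timeout behaviour above can be sketched as a configurable policy; the policy names and the `score` field are illustrative assumptions:

```python
def on_timeout(candidates, policy="default_best"):
    """Handle an expired listening window according to the configured policy."""
    if policy == "default_best":
        # Fall back to the most probable parse and announce it via TTS
        best = max(candidates, key=lambda c: c["score"])
        return ("execute", best)
    # Otherwise prompt the user to rephrase and stop listening
    return ("close", None)

cands = [{"domain": "story", "score": 0.9}, {"domain": "music", "score": 0.6}]
```

Either branch ends the disambiguation round, so the listening window is never held open indefinitely.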
The overall flow proceeds as described above.
It can be seen from this embodiment that the user is asked to make a selection interactively only when an indistinguishable "true ambiguity" occurs. Corresponding handling is provided for the different utterances the user may answer with, so as to ensure stable operation of the system.
As an embodiment, in this embodiment, when the existing resource at least includes the historical context information and/or the customized disambiguation rule base:
querying the historical context information and/or the customized disambiguation rule base for information corresponding to each semantic result in the disambiguation candidate list;
and disambiguating each semantic result in the disambiguation candidate list according to the corresponding information.
In this embodiment, continue with "I want to listen to Little Red Riding Hood" as an example. When the user explicitly expressed "I want to listen to a story" in the first round of conversation and then says "I want to listen to Little Red Riding Hood" in the second round, the semantic result in the story domain is automatically selected for the user.
Since some program names end with a domain word, e.g. "play Spring Story", the customized disambiguation rule base is used to confirm whether a key slot has a value ending in "story" (here the "song name" value "Spring Story" satisfies the rule). A semantic parsing result that does not satisfy the condition is automatically discarded: for "I want to listen to the story of the kite", if the semantic parsing module accidentally parses "kite" as a song name, automatic disambiguation will filter out this incorrect parse, since the song name does not end in "story".
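The "ends with the domain word" rule can be sketched as below; the rule-base entry format is an assumption for illustration:

```python
def suffix_rule(slot_name, suffix):
    """Build a check that keeps a parse only if its key slot value ends with `suffix`."""
    def check(parse):
        return parse["slots"].get(slot_name, "").endswith(suffix)
    return check

rule = suffix_rule("song_name", "story")
kept = rule({"slots": {"song_name": "spring story"}})      # True: retained
dropped = rule({"slots": {"song_name": "kite"}})           # False: filtered out
```

A real rule base would hold many such entries, one per domain-word pattern, all applied during the automatic stage.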
As an implementation manner, in this embodiment, when the existing resource at least includes the history disambiguation record:
inquiring whether the voice data input by the user has a historical disambiguation record within a preset time range;
and when the historical disambiguation record exists, determining a semantic result corresponding to the voice data input by the user according to the historical disambiguation record.
In this embodiment, the automatic disambiguation processing module also uses previously recorded historical disambiguation records: when the user sends the same request needing disambiguation again within a short time, the previous disambiguation result is directly selected for the user.
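A short-term record of past disambiguation choices might be kept as below. Keying by the raw utterance and the ten-minute validity window are assumed values, not taken from the text:

```python
import time

class DisambiguationHistory:
    """Remember the user's last choice for an utterance, for a limited time."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self._records = {}

    def remember(self, utterance, chosen_parse, now=None):
        """Store the chosen parse with a timestamp (injectable for testing)."""
        self._records[utterance] = (now if now is not None else time.time(), chosen_parse)

    def lookup(self, utterance, now=None):
        """Return the stored parse if still valid, else None."""
        entry = self._records.get(utterance)
        if entry is None:
            return None
        stamp, parse = entry
        if (now if now is not None else time.time()) - stamp > self.ttl:
            del self._records[utterance]   # expired: force a fresh disambiguation
            return None
        return parse
```

A hit skips the whole interactive selection, matching the behaviour described above.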
As an implementation manner, in this embodiment, when the existing resources at least include the voice dialog platform resource:
querying voice conversation platform resources corresponding to each semantic result in the disambiguation candidate list;
disambiguating semantic results that do not have a corresponding voice dialog platform resource.
In this embodiment, the voice dialog platform contains the multimedia resources that the user may want to query or play. For example, the platform resource in the music domain is a song library and the platform resource in the story domain is a story library; a specific resource is an audio or video file, e.g. the song "Heart Is Too Soft" or the story "Little Red Riding Hood" is a resource.
Since the semantic parser does not know whether the data services of the voice dialog platform contain the relevant resources, the automatic disambiguation processing module also combines the semantic results with a resource search, and disambiguates away those semantic results for which no resources are found.
For example, if the user says "I want to listen to I Am a Singer" and the audio data service provider has not included the audio collection of "I Am a Singer", the automatic disambiguation module automatically filters out the semantic parse in the audio domain, which avoids prompting the user with "audio not found" after the audio parse is selected.
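The resource check can be sketched as a catalog lookup; the catalogs, slot layout, and which titles each domain holds are stand-ins for a real data-service query:

```python
# Hypothetical per-domain catalogs; contents are assumptions for illustration.
CATALOG = {
    "music": {"Heart Is Too Soft", "I Am a Singer"},
    "audio": set(),    # provider has no "I Am a Singer" audio collection
}

def has_resource(parse):
    """Keep a parse only if the requested title exists in its domain's catalog."""
    title = parse["slots"].get("title", "")
    return title in CATALOG.get(parse["domain"], set())

parses = [
    {"domain": "music", "slots": {"title": "I Am a Singer"}},
    {"domain": "audio", "slots": {"title": "I Am a Singer"}},
]
surviving = [p for p in parses if has_resource(p)]  # the audio parse is dropped
```

Only parses whose requested content is actually available survive, so the user is never offered a choice that would later fail with "not found".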
According to this implementation, automatic disambiguation draws on historical multi-round context information and data service query results, together with a highly customizable automatic disambiguation rule base, so that invalid semantic results can be eliminated automatically even when several semantic parsing results all contain key semantic slots. In addition, the user's historical selection records are stored with a validity period: when the user requests the same content again within a short time, the disambiguation module automatically reads the history, sparing the user from making the same ambiguity selection repeatedly.
Fig. 2 is a schematic structural diagram of a voice dialog processing system for a voice dialog platform according to an embodiment of the present invention, and the technical solution of this embodiment is applicable to a voice dialog processing method for a voice dialog platform of a device, and the system can execute the voice dialog processing method for the voice dialog platform according to any of the above embodiments and is configured in a terminal.
The embodiment provides a voice dialogue processing system for a voice dialogue platform, which comprises: a semantic understanding acquisition program module 11, a key semantic slot determination program module 12, a disambiguation candidate list determination program module 13 and an automatic disambiguation program module 14.
The semantic understanding acquiring program module 11 is configured to acquire, through speech recognition and understanding, the n most probable semantic results of the voice data input by the user; the key semantic slot determining program module 12 is configured to determine, when n > 1, the domain to which each semantic result relates, and to judge whether the semantic slot corresponding to each semantic result is a key semantic slot in that domain; the disambiguation candidate list determining program module 13 is configured to add the m semantic results having key semantic slots to the disambiguation candidate list, where m ≤ n; the automatic disambiguation program module 14 is configured to, when m > 1, automatically disambiguate the disambiguation candidate list according to existing resources to obtain l semantic results, where the existing resources include historical context information, historical disambiguation records, voice dialog platform resources, and/or a customized disambiguation rule base.
Further, the system also comprises a user feedback program module configured to:
when automatically disambiguating the disambiguation candidate list according to the existing resources still yields more than one semantic result, feed those semantic results back to the user for confirmation;
when a user inputs a confirmation instruction corresponding to the feedback, determining a semantic result corresponding to the voice data input by the user, and executing corresponding operation;
and when the user inputs an abnormal instruction, feeding back abnormal prompt information.
Further, when the existing resource includes at least historical context information and/or a custom disambiguation rule base:
querying the historical context information and/or the customized disambiguation rule base for information corresponding to each semantic result in the disambiguation candidate list;
and disambiguating each semantic result in the disambiguation candidate list according to the corresponding information.
Further, when the existing resource includes at least a historical disambiguation record:
inquiring whether the voice data input by the user has a historical disambiguation record within a preset time range;
and when the historical disambiguation record exists, determining a semantic result corresponding to the voice data input by the user according to the historical disambiguation record.
Further, when the existing resources include at least voice dialog platform resources:
querying voice conversation platform resources corresponding to each semantic result in the disambiguation candidate list;
disambiguating semantic results that do not have a corresponding voice dialog platform resource.
An embodiment of the present invention also provides a non-volatile computer storage medium, which stores computer-executable instructions capable of executing the voice dialog processing method for a voice dialog platform in any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
acquiring n semantic results with highest possibility of voice data input by a user according to voice recognition and understanding;
when n is greater than 1, determining the field related to each semantic result, and judging whether the semantic slot corresponding to each semantic result is a key semantic slot in the field;
adding m semantic results with key semantic slots to a disambiguation candidate list, wherein m is less than or equal to n;
and when m is greater than 1, automatically disambiguating the disambiguation candidate list according to the existing resources to obtain l semantic results, wherein the existing resources comprise historical context information, historical disambiguation records, voice conversation platform resources and/or a customized disambiguation rule base.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-transitory computer-readable storage medium and, when executed by a processor, perform the voice dialog processing method for a voice dialog platform in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-transitory computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method for voice dialog processing for a voice dialog platform of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like, may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.