CN106649825B - Voice interaction system and creation method and device thereof - Google Patents

Voice interaction system and creation method and device thereof

Info

Publication number
CN106649825B
CN106649825B (application CN201611247830.XA)
Authority
CN
China
Prior art keywords
semantic
question
abstract
speech
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611247830.XA
Other languages
Chinese (zh)
Other versions
CN106649825A (en)
Inventor
曾永梅
李波
朱频频
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201611247830.XA
Publication of CN106649825A
Application granted
Publication of CN106649825B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a voice interaction system and a method and device for creating the same. The method for creating a voice interaction system comprises: receiving a voice user interaction flowchart, wherein the voice user interaction flowchart comprises a plurality of flows that transition according to a predetermined sequence; creating a knowledge base based on the plurality of flows, wherein the plurality of flows comprise a first flow and a second flow located downstream of the first flow, the answer of a first knowledge point corresponding to the first flow is a question-type answer, and the question of a second knowledge point corresponding to the second flow is a response to the question-type answer of the first knowledge point; providing a language model for performing speech recognition on a user's speech input; and providing knowledge points in the knowledge base for performing semantic recognition on the obtained speech recognition result. By establishing a knowledge base and using the matching of knowledge points in the knowledge base, the invention realizes the transitions between flows, thereby reducing implementation difficulty.

Description

Voice interaction system and creation method and device thereof
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a voice interaction system and a method and a device for creating the voice interaction system.
Background
Human-computer interaction is the science of studying the interactive relationships between systems and users. The system may be a variety of machines, and may be a computerized system and software. For example, various artificial intelligence systems, such as intelligent customer service systems, voice control systems, and the like, may be implemented through human-computer interaction. Artificial intelligence semantic recognition is the basis for human-machine interaction, which is capable of recognizing human language for conversion into machine-understandable language.
The intelligent question-answering system is a typical application of human-computer interaction: when a user poses a question, the system gives an answer to it. A voice interaction system is a special intelligent question-answering system in which the user's questions are input in the form of voice. Therefore, in a voice interaction system, the user question in voice form must first be converted by speech recognition into a user question in text form; the user question is then understood through semantic parsing, and a corresponding answer is given.
Conventionally, a voice interaction system is designed by developing the corresponding VoiceXML, based on a voice user interaction flowchart supplied by the client, to realize semantic understanding and subsequent processing. VoiceXML is a markup language for voice browsing based on the XML language specification. Voice WEB-based applications and services can be built using VoiceXML.
In this traditional design mode, the corpora to be recognized are written into a grammar to generate a language model, and the corpora to be understood are classified to generate a semantic model. The language and semantic models are then loaded into the voice interaction system, and vxml (voice extensible markup language) is written for each semantic classification. The language model is used to recognize the user's speech input and convert it into textual user input. The semantic model is used to understand the meaning of the textual user input so as to determine the subsequent flow. For example, the classification tag for bill queries is 'bill'. The vxml must be written so that when the semantic parsing result is identified as 'bill', the corresponding flow is followed, for example by asking: "Which month's bill do you want to query?" The system then waits for the user's input and performs recognition again, for instance recognizing a month, whose classification tag is 'month'.
The VoiceXML-based development mode requires dedicated developers to write vxml, which increases implementation difficulty; moreover, those writing the semantic model must first agree on the semantic classification tags before work can proceed, which increases communication cost. Every time a flow is added, deleted, or changed, the language model, semantic model, and vxml must be reloaded into the system, so changes cannot take effect in real time.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention provides a voice interaction system and a method and device for creating the same, so as to solve the problem that, when creating a voice interaction system, developing and implementing the transitions between flows is difficult.
In a first aspect, the present invention provides a method for creating a voice interactive system, comprising:
receiving a voice user interaction flowchart, wherein the voice user interaction flowchart comprises a plurality of flows that transition according to a predetermined sequence;
creating a knowledge base based on the plurality of flows, the knowledge base comprising a plurality of knowledge points corresponding to the plurality of flows, each knowledge point comprising a question and an answer thereto,
wherein the plurality of flows comprise a first flow and a second flow located downstream of the first flow, the answer of a first knowledge point corresponding to the first flow is a question-type answer, and the question of a second knowledge point corresponding to the second flow is a response to the question-type answer of the first knowledge point;
providing a language model for performing speech recognition on a user's speech input; and
providing knowledge points in the knowledge base for performing semantic recognition on the obtained speech recognition result.
In a second aspect, the present invention provides an apparatus for creating a voice interactive system, comprising:
a receiving module for receiving a voice user interaction flowchart, wherein the voice user interaction flowchart comprises a plurality of flows that transition according to a predetermined sequence;
a knowledge base creation module for creating a knowledge base based on the plurality of flows, the knowledge base comprising a plurality of knowledge points corresponding to the plurality of flows, each knowledge point comprising a question and an answer thereto,
wherein the plurality of flows comprise a first flow and a second flow located downstream of the first flow, the answer of a first knowledge point corresponding to the first flow is a question-type answer, and the question of a second knowledge point corresponding to the second flow is a response to the question-type answer of the first knowledge point;
a language model training module for providing a language model for performing speech recognition on a user's speech input; and
a knowledge point distribution module for providing knowledge points in the knowledge base for performing semantic recognition on the obtained speech recognition result.
In a third aspect, the present invention provides a voice interaction system, including:
a knowledge base created by the above method;
a speech recognition module for performing speech recognition on the user's speech input using the language model provided by the above method;
a semantic recognition module for performing semantic recognition on the speech recognition result using the corresponding knowledge points in the knowledge base; and
an output module for providing response output to the user based on the speech recognition result.
The invention realizes the transitions between flows by establishing a knowledge base and using the matching of knowledge points in the knowledge base. This avoids having dedicated developers write vxml, reducing implementation difficulty. Compared with the traditional VoiceXML-based design, when flows are added or deleted, only the corresponding knowledge points need to be added to or deleted from the knowledge base; changes take effect in real time, and deployment is flexible.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
FIG. 1 is a flow diagram illustrating a method for creating a voice interaction system in accordance with an aspect of the present invention;
FIG. 2 shows one example of a voice user interaction flow diagram;
FIG. 3 is a flow chart illustrating a method of extending a standard question in accordance with an aspect of the present invention;
FIG. 4 is a block diagram illustrating an apparatus for creating a voice interaction system in accordance with an aspect of the present invention;
FIG. 5 is a block diagram illustrating an expansion unit according to another aspect of the present invention; and
FIG. 6 illustrates a block diagram of a voice interaction system in accordance with an aspect of the subject invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
For a voice interaction system, the user poses questions in the form of voice input. To answer the user's question, the background processing of the voice interaction system mainly comprises two parts: a speech recognition part and a semantic recognition part. The speech recognition part performs speech recognition on the user's speech input based on a language model to obtain the user question in text form. The semantic recognition part understands the user question in text form based on a semantic model, so as to learn the user's intention and then give an answer.
The speech recognition technology is mainly composed of a language model training phase and a recognition phase using a language model. The speech recognition part is the recognition stage using the language model.
The language model training phase performs modeling of a language model through training on a large amount of corpora, for example using the SRILM toolkit. SRILM is the SRI Language Modeling Toolkit, whose main goal is to support the estimation and evaluation of language models. After the language model is built, the user's speech input is recognized using the language model. In the speech recognition process, the accuracy of the language model is crucial to the recognition result: a more refined language model provides more accurate speech recognition results.
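As an illustration of the training phase, the sketch below builds a toy bigram language model with add-one smoothing. This is only a stand-in for what a tool like SRILM does at much larger scale; the two-sentence corpus and the smoothing choice are illustrative assumptions, not the patent's method.

```python
from collections import defaultdict
import math

def train_bigram_lm(corpus):
    """Train a toy bigram language model with add-one smoothing.
    (Illustrative only -- real systems use tools such as SRILM.)"""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for i, tok in enumerate(tokens):
            unigrams[tok] += 1
            if i > 0:
                bigrams[(tokens[i - 1], tok)] += 1
    V = len(vocab)

    def log_prob(sentence):
        # Sum smoothed log-probabilities over all bigrams of the sentence.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
        return lp

    return log_prob

lm = train_bigram_lm(["i want to query my bill", "i want to open the service"])
# An in-domain word order scores higher (less negative) than a scrambled one.
assert lm("i want to query my bill") > lm("bill my query to want i")
```

The "more refined model, more accurate recognition" point shows up even in this toy: the model assigns a higher score to word sequences seen in training than to unseen orderings of the same words.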
In the present invention, a knowledge base for semantic recognition is designed, the knowledge base comprising a plurality of knowledge points. The most primitive and simplest form of a knowledge point is the familiar FAQ, commonly a "question-answer" pair. In the present invention, a "standard question" is the text used to represent a certain knowledge point, its main aim being clear expression and ease of maintenance. For example, "CRBT tariff" is a clear description of a standard question. The term "question" should not be construed narrowly as an interrogative, but broadly as an "input" that has a corresponding "output". For example, in semantic recognition for a control system, a user instruction such as "turn on the radio" should also be understood as a "question"; in that case the corresponding "answer" may be a call to a control program that executes the corresponding control.
The voice interaction system of the present invention therefore differs from the traditional VoiceXML-based design in the semantic recognition part: a standard question matching the speech recognition result is searched for in the knowledge base, and once a matched standard question is found, the semantics of the speech recognition result are considered to be "understood", so the "answer" corresponding to the matched standard question can be provided to the user.
In the invention, the matched standard questions can be determined through semantic similarity calculation between the speech recognition result and all the standard questions in the knowledge base. For example, the standard question having the highest semantic similarity may be determined to be the matched standard question, the target service that the user wishes to handle may be determined from the matched standard question, and the answer associated with the standard question may be provided to the user. For example, if the matched standard question is "CRBT tariff", the answer (e.g., CRBT tariff) associated with the standard question may be output to the user.
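A minimal sketch of this matching step, assuming a character-bigram Jaccard overlap as the similarity measure (the patent does not fix a concrete measure) and a hypothetical two-entry knowledge base:

```python
def similarity(a, b):
    """Toy semantic similarity: Jaccard overlap of character bigrams.
    (A stand-in -- the patent leaves the actual measure unspecified.)"""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def match_knowledge_point(asr_result, knowledge_base):
    """Return the knowledge point whose standard question is most similar
    to the speech recognition result."""
    return max(knowledge_base, key=lambda kp: similarity(asr_result, kp["question"]))

# Hypothetical knowledge base entries.
kb = [
    {"question": "CRBT tariff", "answer": "The CRBT service costs 5 yuan per month."},
    {"question": "bill query", "answer": "Which month's bill do you want to query?"},
]
best = match_knowledge_point("query my bill", kb)
assert best["question"] == "bill query"
```

Once the best-scoring standard question is found, its associated answer is what gets returned to the user, exactly as described above for the "CRBT tariff" example.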
The knowledge base and the knowledge points therein will be first described below.
To identify user questions more accurately and efficiently, intelligent question-answering systems have also developed the concept of abstract semantics. Abstract semantics are a further abstraction over ontology class attributes. A category of abstract semantics describes the different expressions of one class of abstract semantics through a set of abstract semantic expressions, which are abstracted over their constituent elements in order to express more abstract semantics. When these abstracted elements are given concrete values, a wide variety of specific semantics can be expressed.
Each abstract semantic expression may include primarily missing semantic components and semantic rule words. Missing semantic components are represented by semantic component identifiers, and when the missing semantic components are filled with corresponding values (i.e., content), a wide variety of specific semantics can be expressed.
The semantic component identifiers of abstract semantics may include:
[concept]: a word or phrase forming a subject or object component.
For example: "CRBT" in "how to activate CRBT".
[action]: a word or phrase representing an action component.
For example: "handle" in "how to handle a credit card".
[attribute]: a word or phrase representing an attribute component.
For example: "color" in "the color of the iphone".
[adjective]: a word or phrase representing a modifying component.
For example: "cheap" in "which brand of refrigerator is cheap".
Some examples of major abstract semantic categories are:
Concept explanation: what is [concept]
Attribute composition: what does [concept] consist of
Behavior mode: [concept] how [action]
Behavior location: [concept] where [action]
Behavior reason: [concept] why [action]
Behavior prediction: [concept] will [action] or not
Behavior judgment: [concept] whether there is [attribute]
Attribute status: whether the [attribute] of [concept] is [adjective]
Attribute judgment: whether [concept] is [attribute]
Attribute reason: why the [attribute] is so [adjective]
Concept comparison: what is the difference between [concept1] and [concept2]
Attribute comparison: what is the difference between the [attribute] of [concept1] and the [attribute] of [concept2]
The components of a question at the abstract semantic level can generally be judged by part-of-speech tagging: the part of speech corresponding to [concept] is a noun, that corresponding to [action] is a verb, that corresponding to [attribute] is a noun, and that corresponding to [adjective] is an adjective.
Taking the abstract semantic category "behavior mode", of the form "[concept] how [action]", as an example, the abstract semantic set of this category may include a plurality of abstract semantic expressions:
abstract semantic categories: behavioral patterns
Abstract semantic expression:
a. [concept] <need|should>? how <then can>? <proceed>? [action]
b. {[concept]~[action]}
c. [concept] <of>? [action] <method|manner|step>?
d. <what|is there>? <by|with>? [concept] [action] <method>?
e. how to [action] [concept]
The abstract semantic expressions a through e above all describe the abstract semantic category "behavior mode". The symbol "|" represents an "or" relationship, and the symbol "?" indicates that the component is optional. Taking the abstract semantic expression c as an example, it can be expanded into the following expressions:
c1. [concept] of [action] method
c2. [concept] of [action] manner
c3. [concept] of [action] step
c4. [concept] of [action]
c5. [concept] [action] method
c6. [concept] [action] manner
c7. [concept] [action] step
c8. [concept] [action]
In the above abstract semantic expressions, besides the semantic component identifiers, which abstract away the missing semantic components, other concrete words appear, such as "how", "should", and "method". Since these words are needed in the abstract semantic rules, they are collectively referred to as semantic rule words.
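The expansion of expression c above can be mechanized. The sketch below, under the notation reconstructed here ("|" separates alternatives inside "<...>", a trailing "?" makes the group optional, and "[...]" marks a missing semantic component), enumerates all concrete templates of an abstract semantic expression:

```python
import itertools
import re

def expand_expression(expr):
    """Expand an abstract semantic expression into concrete templates.
    Assumed notation: '|' separates alternatives inside <...>, and a
    trailing '?' makes the whole group optional."""
    # Tokenize into [component], <group> / <group>?, and literal runs.
    tokens = re.findall(r'\[[^\]]+\]|<[^>]+>\??|\S+', expr)
    choices = []
    for tok in tokens:
        if tok.startswith('<'):
            optional = tok.endswith('?')
            opts = tok.rstrip('?').strip('<>').split('|')
            if optional:
                opts.append('')  # the group may be absent entirely
            choices.append(opts)
        else:
            choices.append([tok])
    # Cartesian product over all alternative sets yields every template.
    return [''.join(c for c in combo if c) for combo in itertools.product(*choices)]

# Expression c: [concept] <of>? [action] <method|manner|step>?
variants = expand_expression('[concept]<of>?[action]<method|manner|step>?')
assert len(variants) == 8                       # matches c1..c8 above
assert '[concept][action]' in variants          # c8
assert '[concept]of[action]method' in variants  # c1
```

The 2 × 4 product of the two optional groups gives exactly the eight expansions c1 through c8 listed above, which is why automatic expansion beats manual enumeration as expressions grow.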
Some basic concepts about knowledge points for intelligent question answering are introduced above, which is helpful for understanding the contents of the present invention.
FIG. 1 is a flow chart illustrating a method 100 for creating a voice interaction system in accordance with an aspect of the present invention. As shown in fig. 1, the method 100 may include the steps of:
step 101: a voice user interaction flow chart is received, the voice user interaction flow chart comprising a plurality of flows that are circulated according to a predetermined flow.
The voice interaction flowchart is a diagram representing a flow of a flow when a user uses the voice interaction system. Each node in the flow chart represents a flow, and the flow is changed from one flow to the next flow according to different user problems. Shown in FIG. 2 is flow 1, flows 11, 12, 13 downstream of flow 1, flows 111, 112 downstream of flow 11, flow 131 downstream of flow 13, and flows 1121, 1122 downstream of flow 112.
The user's interaction with the voice interactive system may be circulated according to the relationship of the flows in the flow chart. For example, in the process 1 stage, the process 11, the process 12 or the process 13 is selectively entered through the recognition of the user input. Assuming that the process 12 is entered, the whole interactive process is ended. When the process proceeds to the process 11, the process proceeds to the process 111 or the process 112 selectively based on the recognition of the user input. Assuming that the process 111 is entered, the whole interactive process is ended. Upon entering flow 112, the process proceeds to flow 1121 or flow 1122, depending on the recognition of the user input.
Each flow in fig. 2 differs depending on a service object of the voice interaction system. For example, for a voice interactive system for a telecom operator, the flows may be "call charge query", "color ring service transaction", "traffic packet subscription", and so on.
The flowchart shown in FIG. 2 may be only part of a complete flowchart; for example, there may be another flow upstream of flow 1, from which flow 1 may be entered. Other flows not depicted may also exist downstream of flows 111, 112, 131, 1121, and 1122.
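The flowchart of FIG. 2 can be modeled as a simple adjacency map. The sketch below is one possible representation (flow labels follow the figure; an empty list marks a flow that ends the interaction):

```python
# The flowchart of FIG. 2 as an adjacency map from each flow to its
# downstream flows. An empty list means the interaction ends there.
flowchart = {
    "1":    ["11", "12", "13"],
    "11":   ["111", "112"],
    "12":   [],
    "13":   ["131"],
    "111":  [],
    "112":  ["1121", "1122"],
    "131":  [],
    "1121": [],
    "1122": [],
}

def downstream(flow):
    """Flows reachable in one transition from the given flow."""
    return flowchart[flow]

assert downstream("1") == ["11", "12", "13"]
assert downstream("12") == []  # entering flow 12 ends the interaction
```

Which downstream flow is actually entered at each step is decided by recognizing the user's input, as described above.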
After receiving the flow chart, the requirements of the system user can be known, so that the system can be customized according to the requirements.
At step 102, a knowledge base is created based on the plurality of processes in the flow chart, the knowledge base comprising a plurality of knowledge points corresponding to the plurality of processes, each knowledge point comprising a question and an answer thereto.
Namely, a knowledge point is constructed for each flow in the flow chart, so that a plurality of knowledge points are obtained correspondingly. The establishment of knowledge points particularly requires that, for an upstream process having a downstream process, the answers of the knowledge points corresponding to the upstream process are question-type answers, and the questions of the knowledge points corresponding to the downstream process are responses to the question-type answers of the knowledge points corresponding to the upstream process.
It is assumed that the plurality of processes includes a first process and a second process located downstream of the first process. The answer of the first knowledge point corresponding to the first process is a question-type answer, and the question of the second knowledge point corresponding to the second process is a response to the question-type answer of the first knowledge point. The first and second flows are only generally referred to herein to indicate relative upstream and downstream relationships between the flows.
Taking fig. 2 as an example, regarding the relationship between the process 1 and the processes 11, 12 and 13, the process 1 is an upstream process, i.e. a first process; the processes 11, 12 and 13 are all the downstream processes of the process 1, and are the second process. However, in terms of the relationship between the flow 11 and the flows 111 and 112, the flow 11 becomes an upstream flow, i.e., a first flow, and the flows 111 and 112 become downstream flows, i.e., a second flow.
As described above, the knowledge point exists in the form of "question-answer", and both the "question" and the "answer" herein should be understood in a broad sense. For example, a "question" may be directly an instruction or a statement, rather than a question in the traditional syntax, and accordingly, an "answer" may be a function or command call to execute the instruction, and an "answer" may also be a question.
Here, the "answer" itself may grammatically take the form of a question. For example, assuming that flow 13 is a "bill query" flow, the "question" of the corresponding knowledge point may be "bill query", and its "answer" is "Which month's bill do you want to query?". The "question" of the knowledge point corresponding to flow 131, downstream of flow 13, may be "query month", and its "answer" is "Your bill for month xx is yyyy yuan" (where xx is the month actually input by the user).
Therefore, when the user inputs "I want to check my bill", semantic similarity calculation finds the best-matching question "bill query", that is, flow 13 is entered; at this point the output answer is not the specific bill details but the question "Which month's bill do you want to query?". When the user then inputs a specific query month, semantic similarity calculation finds the best-matching question "query month", that is, flow 131 is entered, and the output answer is what the end user wants to know: "Your bill for month xx is yyyy yuan".
As another example, assume that the answer of the knowledge point of flow 1 is the question-type answer "What business would you like to handle?", and that the questions of the knowledge points of flow 11, flow 12, and flow 13 are "traffic package service", "CRBT service", and "bill query". Then, through semantic similarity calculation on the speech recognition result of the user's input, the knowledge point of flow 11, flow 12, or flow 13 can be located, its corresponding answer given, and the transition from flow 1 to flow 11, flow 12, or flow 13 thereby realized.
In this way, that is, by establishing the relationship between the knowledge points of upstream and downstream flows, transitions between the flows in the flowchart are accomplished by locating the next knowledge point through semantic similarity calculation.
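A minimal sketch of this mechanism for the billing example, using hypothetical knowledge points and a crude word-overlap match in place of real semantic similarity:

```python
# Hypothetical knowledge base for the billing example: the upstream
# knowledge point's answer is itself a question, and the downstream
# knowledge point's question is a response to that question-type answer.
knowledge_base = {
    "bill query": "Which month's bill do you want to query?",
    "query month": "Your bill for that month is yyyy yuan.",
}

def answer(user_input):
    """Locate the best-matching knowledge point and return its answer.
    Word overlap stands in for semantic similarity calculation."""
    def overlap(question):
        return len(set(user_input.split()) & set(question.split()))
    best = max(knowledge_base, key=overlap)
    return knowledge_base[best]

# Turn 1: entering flow 13 yields a question-type answer, not bill details.
assert answer("i want to query my bill") == "Which month's bill do you want to query?"
# Turn 2: the response lands on the downstream knowledge point.
assert answer("the query month is March") == "Your bill for that month is yyyy yuan."
```

No vxml dispatch logic appears anywhere: the flow transition falls out of which knowledge point wins the match, which is the core point of this section.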
Ideally, the user would input exactly the standard question, and the system could immediately understand the meaning. In practice, however, users often use some variant of the standard question rather than the standard question itself. For example, if the standard question for switching radio stations is "change station", a user may instead say "switch station", and the machine still needs to recognize that the user expressed the same meaning.
As described above, the present invention matches a user question to a knowledge point by semantic similarity calculation. To obtain better semantic similarity results, the standard question of each knowledge point is expanded with a plurality of extended questions. When performing semantic recognition, the user question in text form (i.e., the speech recognition result) is actually compared, via semantic similarity calculation, against each knowledge point's questions, including both the standard question and the extended questions, to obtain the best-matching question.
To this end, in the present invention, establishing each knowledge point includes establishing the knowledge point's standard question, the extended questions associated with it, and the corresponding answer. The standard questions and answers are compiled from knowledge provided by the customer (i.e., the customer commissioning the voice interaction system, e.g., a bank or a telecom operator). The customer provides a corresponding description for each flow in the flowchart, for example, what information needs to be fed back for given user input. The standard question and answer for each knowledge point can be extracted and edited from this knowledge.
Expanding each standard question by manually thinking up variants is inefficient and prone to omissions. In the present invention, extended questions for standard questions are generated automatically using abstract semantic expressions.
To this end, it is first necessary to provide an abstract semantic database comprising a plurality of abstract semantic expressions including missing semantic components, as described above.
Fig. 3 shows a flow diagram of a method 300 of expanding a standard question. As shown in fig. 3, the method 300 may include the following steps.
Step 302, performing abstract semantic recommendation processing on the standard question according to an abstract semantic database to obtain one or more abstract semantic expressions corresponding to the standard question.
For example, one standard question is: "how to look up violations".
First, the abstract semantic expressions in the abstract semantic database that correspond to the standard question need to be found. In one example, the abstract semantic recommendation first performs word segmentation on the standard question to obtain a plurality of words, where each word is either a semantic-rule word or a non-semantic-rule word.
For example, "how to look up violations" can be segmented into the words "how", "look up", and "violation". Among these, "how" is a semantic-rule word, while "look up" and "violation" are non-semantic-rule words.
Then, part-of-speech tagging is performed on each non-semantic-rule word; for example, "look up" is tagged as a verb and "violation" is tagged as a noun.
Next, word-class determination is performed on each semantic-rule word to obtain its word-class information. A word class can be understood simply as a group of words sharing some commonality; the words in a class may or may not be semantically similar.
Finally, the abstract semantic database is searched according to the part-of-speech information and the word-class information to obtain the abstract semantic expressions matching the standard question "how to look up violations".
In practice, an abstract semantic expression matching the standard question satisfies the following conditions:
1) the part of speech required by each missing semantic component of the abstract semantic expression covers the part of speech of the corresponding filling content extracted from the standard question;
2) the semantic-rule words in the abstract semantic expression and the corresponding words in the standard question are the same or belong to the same word class;
3) the component order of the abstract semantic expression is the same as the expression order of the standard question.
Take the abstract semantic expressions of the type "behavior mode" as an example. The part of speech of the missing semantic component [action] in abstract semantic expression e is verb, and the corresponding filling content "look up" from the standard question "how to look up violations" is also a verb; the part of speech of the missing semantic component [concept] is noun, and the corresponding filling content "violation" is also a noun. Therefore, condition 1) is satisfied.
Second, the semantic-rule word "how" in abstract semantic expression e and the corresponding semantic-rule word "how" in the standard question "how to look up violations" belong to the same word class, so condition 2) is met.
Finally, the component order of abstract semantic expression e is also the same as the expression order of the standard question, which meets condition 3).
Therefore, in the abstract semantic database, an abstract semantic expression e matching the standard question "how to look up violations" is found, namely <how> [action] [concept]. This abstract semantic expression belongs to the category "behavior mode", and because the abstract semantic expressions in one category express the same meaning, in the present invention the whole set of abstract semantic expressions in the category "behavior mode" is recommended for the standard question. In other words, all abstract semantic expressions in the category to which the matched abstract semantic expression belongs are recommended as the abstract semantic expressions corresponding to the standard question.
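The three matching conditions above can be sketched in code. This is a minimal illustration, not the patent's implementation: the expression encoding, tag names, and word-class table are all assumptions made for the example.

```python
# Abstract semantic expression e: semantic-rule word "how", then a verb
# slot [action], then a noun slot [concept].
EXPR_E = [("rule", "how"), ("slot", "v"), ("slot", "n")]

# A word class groups interchangeable semantic-rule words (hypothetical table).
WORD_CLASSES = {"how": {"how", "how to"}}

# Segmented standard question: (word, kind, part of speech or None).
QUESTION = [("how", "rule", None), ("look up", "word", "v"), ("violation", "word", "n")]

def matches(expr, question):
    """Check conditions 1)-3): slot parts of speech cover the filler words,
    semantic-rule words are identical or share a word class, and the
    component order equals the question's word order."""
    if len(expr) != len(question):            # condition 3): same order/length
        return False
    for (kind, value), (word, wkind, pos) in zip(expr, question):
        if kind == "slot":
            if wkind != "word" or pos != value:        # condition 1)
                return False
        else:                                          # semantic-rule word
            same_class = word in WORD_CLASSES.get(value, {value})
            if wkind != "rule" or not same_class:      # condition 2)
                return False
    return True

print(matches(EXPR_E, QUESTION))  # -> True
```

With this representation, a non-matching question (wrong order, wrong part of speech, or a different semantic-rule word) simply fails one of the three checks.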
Step 304, extracting from the standard question the content corresponding to the missing semantic components of the one or more abstract semantic expressions, and filling the extracted content into the corresponding missing semantic components to obtain one or more concrete semantic expressions corresponding to the standard question. These concrete semantic expressions serve as the expanded questions of the standard question.
Taking the above standard question "how to look up violations" as an example, the following abstract semantic expressions are recommended:
a. [concept] <need|should>? <how>? <then can>? <proceed>? [action]
b. {[concept] ~ [action]}
c. [concept] <of>? [action] <method|manner|step>?
d. <what|is there> <through|by|in> [concept] [action] <of>? <method>?
e. <how> [action] [concept]
These abstract semantic expressions are used to expand the standard question "how to look up violations".
In one example, the content corresponding to the missing semantic component of each abstract semantic expression is extracted from the standard question, and the extracted content is filled into the missing semantic component corresponding to each abstract semantic expression to obtain a concrete semantic expression corresponding to the standard question.
Take abstract semantic expression a, [concept] <need|should>? <how>? <then can>? <proceed>? [action], as an example. The content corresponding to its missing semantic components is extracted from "how", "look up", and "violation":
content corresponding to [concept]: "violation"
content corresponding to [action]: "look up"
Filling "look up" and "violation" into the corresponding missing semantic components yields the concrete semantic expression: [violation] <need|should>? <how>? <then can>? <proceed>? [look up].
Take abstract semantic expression b, {[concept] ~ [action]}, as an example. The content corresponding to its missing semantic components is extracted from "how", "look up", and "violation":
content corresponding to [concept]: "violation"
content corresponding to [action]: "look up"
Filling "look up" and "violation" into the corresponding missing semantic components yields the concrete semantic expression: {[violation] ~ [look up]}.
Take abstract semantic expression c, [concept] <of>? [action] <method|manner|step>?, as an example. The content corresponding to its missing semantic components is extracted from "how", "look up", and "violation":
content corresponding to [concept]: "violation"
content corresponding to [action]: "look up"
Filling "look up" and "violation" into the corresponding missing semantic components yields the concrete semantic expression: [violation] <of>? [look up] <method|manner|step>?.
Take abstract semantic expression d, <what|is there> <through|by|in> [concept] [action] <of>? <method>?, as an example. The content corresponding to its missing semantic components is extracted from "how", "look up", and "violation":
content corresponding to [concept]: "violation"
content corresponding to [action]: "look up"
Filling "look up" and "violation" into the corresponding missing semantic components yields the concrete semantic expression: <what|is there> <through|by|in> [violation] [look up] <of>? <method>?.
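The filling step applied above to expressions a through d can be sketched as a simple template substitution. The template strings and helper below are illustrative assumptions, simplified renderings of the recommended expressions rather than the patent's actual data format.

```python
# Fillers extracted from the standard question "how to look up violations".
fillers = {"[concept]": "violation", "[action]": "look up"}

# Simplified renderings of recommended abstract semantic expressions a, b, e.
templates = [
    "[concept] <need|should>? <how>? [action]",   # a (simplified)
    "[concept] [action]",                          # b (without the braces)
    "<how> [action] [concept]",                    # e
]

def fill(template, fillers):
    """Replace each missing semantic component with its extracted content,
    leaving optional semantic-rule markers (<...>) untouched."""
    out = template
    for slot, word in fillers.items():
        out = out.replace(slot, word)
    return out

expanded = [fill(t, fillers) for t in templates]
for question in expanded:
    print(question)
# Template b, for instance, yields "violation look up".
```

Running the same fillers through every expression in the recommended category produces the full set of concrete semantic expressions, i.e. the expanded questions of the standard question.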
The above describes how to extend the standard questions using an abstract semantic database.
The relationship between a semantic expression and a user question is very different from traditional template matching. In template matching, a template and a user question either match or do not match; here, the relationship between a semantic expression and a user question is represented by a quantized value (a similarity), and the similarities of different questions to the user question can be compared with one another.
Therefore, semantic recognition based on semantic similarity calculation achieves a very good recognition rate, improving the user experience.
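The contrast between boolean template matching and quantized similarity can be illustrated as follows. This is only a sketch: `difflib`'s character-level ratio is a stand-in for the patent's semantic similarity calculation, and the example sentences are invented.

```python
from difflib import SequenceMatcher

def template_match(template, question):
    # Traditional templates: match or no match, nothing in between.
    return template == question

def similarity(a, b):
    # A quantized value in [0, 1]; scores for different questions can be compared.
    return SequenceMatcher(None, a, b).ratio()

user = "how can I look up a violation"
q1 = "how to look up violations"
q2 = "change station"

print(template_match(q1, user))                      # a rigid template fails outright
print(similarity(user, q1) > similarity(user, q2))   # similarity still ranks q1 first
```

The point is that even when no question matches exactly, the similarity scores still impose a ranking, so the closest knowledge point can be located.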
Returning to FIG. 1, at step 103, a language model is provided for performing speech recognition on a user's speech input.
For a voice interaction system, the user's voice input must first be recognized as user input in text form. As mentioned above, speech recognition requires a language model. A language model is obtained mainly by training on a large corpus. On the one hand, the larger the corpus used, the more accurate the resulting language model. However, as the corpus grows, the computational cost of training and recognition also increases. Thus, in practice, training is performed on a corpus of a certain size based on a trade-off between cost and performance.
On the other hand, the more targeted the corpus is, the more accurate the trained language model is. For example, for sports applications, a large number of sports-related terms may be used as corpus and trained, and for financial applications, a large number of financial-related terms may be used as corpus. In this way, a more accurate language model is obtained at a certain cost.
In one aspect of the present invention, the corpus of the language model may be selected according to an application domain of the voice interactive system.
However, to further improve recognition accuracy and reduce cost, the present invention adopts a more targeted strategy: instead of using one fixed language model for all user voice input, the voice interaction system of the present invention may use different language models depending on the current position in the flow.
Specifically, for each flow, a language model specific to the downstream flows of that flow is trained and used to perform speech recognition of user voice input directed at those downstream flows. Obviously, "each flow" here refers to a flow that has downstream flows. Taking fig. 2 as an example, for flow 1, a language model specific to its downstream flows 11, 12, and 13 is trained; for flow 11, a language model specific to its downstream flows 111 and 112 is trained; and so on.
Specifically, during training, the question sentences in the knowledge points corresponding to the downstream flow(s) are used as the speech training corpus for the language model. It is easy to see that the user input to be recognized by the trained language model is very likely to correspond to a question sentence in those knowledge points, so the recognition accuracy is quite high. In practice, the SRILM tool may be employed for training.
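Building the per-flow training corpus can be sketched as below. The flow names and question sentences are invented for illustration; only the corpus-assembly step is shown, with the standard SRILM `ngram-count` invocation noted in a comment rather than executed.

```python
# Map each flow to its downstream flows, and each downstream flow to the
# question sentences of its knowledge point (all names are hypothetical).
downstream = {"flow1": ["flow11", "flow12", "flow13"]}
knowledge_points = {
    "flow11": ["check balance", "what is my balance"],
    "flow12": ["transfer money"],
    "flow13": ["report lost card"],
}

def corpus_for(flow):
    """Collect the downstream knowledge points' questions as one corpus,
    one sentence per line, ready to be written to a training file."""
    lines = []
    for child in downstream.get(flow, []):
        lines.extend(knowledge_points.get(child, []))
    return "\n".join(lines)

corpus = corpus_for("flow1")
print(len(corpus.splitlines()))  # 4 question sentences for flow 1's model
# The corpus would then be written to a file and trained with SRILM, e.g.:
#   ngram-count -text flow1_corpus.txt -order 3 -lm flow1.lm
```

At runtime, the system would load `flow1.lm` whenever the dialogue sits at flow 1, so recognition is biased toward exactly the sentences the user is likely to say next.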
In step 104, knowledge points in a knowledge base are provided for performing semantic recognition on the obtained speech recognition result.
In one example, for each user input from a user, when performing semantic recognition, a semantic similarity calculation may be performed using all knowledge points in the knowledge base.
Preferably, a more targeted strategy is employed. That is, for semantic recognition input by a user, not all knowledge points in the entire knowledge base are adopted, but different knowledge points may be adopted to perform semantic similarity calculation based on the current process position.
Specifically, for each flow, the knowledge points corresponding to the downstream flows of that flow are provided for performing semantic recognition of speech recognition results directed at those downstream flows. Taking fig. 2 as an example, for user input given at the stage of flow 1, after the voice-form user input is speech-recognized into a speech recognition result (i.e., user input in text form), only the three knowledge points corresponding to flows 11, 12, and 13 are used in the semantic similarity calculation against the user input.
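Restricting the similarity calculation to the downstream knowledge points can be sketched like this. The flow structure, questions, and the `difflib`-based similarity stand-in are all assumptions for the example, not the patent's actual components.

```python
from difflib import SequenceMatcher

downstream = {"flow1": ["flow11", "flow12", "flow13"]}
questions = {                     # one representative question per knowledge point
    "flow11": "check balance",
    "flow12": "transfer money",
    "flow13": "report lost card",
    "flow99": "some flow that is not downstream of flow 1",
}

def recognize(current_flow, speech_text):
    # Only the current flow's downstream knowledge points participate in
    # the similarity calculation; flow99 is never considered here.
    candidates = {f: questions[f] for f in downstream[current_flow]}
    return max(candidates,
               key=lambda f: SequenceMatcher(None, speech_text, candidates[f]).ratio())

print(recognize("flow1", "I want to transfer some money"))  # -> flow12
```

Scoping the candidate set this way both speeds up the calculation and avoids accidental matches against knowledge points that are unreachable from the current flow.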
According to the scheme of the present invention, transitions between flows are realized by establishing a knowledge base and matching knowledge points in the knowledge base. This avoids having a dedicated developer write VoiceXML, reducing implementation difficulty. A key advantage over traditional VoiceXML-based designs is that when flows are added or deleted, only the corresponding knowledge points need to be added or deleted in the knowledge base; the change takes effect in real time, and deployment is flexible.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
FIG. 4 is a block diagram illustrating an apparatus 400 for creating a voice interaction system in accordance with an aspect of the present invention.
As shown in fig. 4, the apparatus 400 may include a receiving module 401, a knowledge base creating module 402, a language model training module 403, a knowledge point allocating module 404, and an abstract semantic base 405.
The receiving module 401 may be configured to receive a voice user interaction flowchart comprising a plurality of flows that flow according to a predetermined flow.
The knowledge base creation module 402 may create a knowledge base based on the plurality of processes, the knowledge base including a plurality of knowledge points corresponding to the plurality of processes, each knowledge point including a question and an answer thereto.
Without loss of generality, for a first process and a second process located downstream of the first process, which are included in the plurality of processes, the answer of the first knowledge point corresponding to the first process is a question-type answer, and the question of the second knowledge point corresponding to the second process is a response to the question-type answer of the first knowledge point.
In this way, that is, by establishing the relationship between the knowledge points of upstream and downstream flows, transitions between the flows in the flowchart are accomplished by locating the next knowledge point through semantic similarity calculation.
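The upstream/downstream knowledge-point relationship can be sketched as follows: the upstream knowledge point's answer is itself a question (a question-type answer), and the downstream knowledge points' questions are the possible user responses to it. All knowledge-point contents below are invented, and the similarity function is a `difflib` stand-in.

```python
from difflib import SequenceMatcher

kb = {
    "kp_card":   {"question": "I lost my card",
                  "answer": "Is it a debit card or a credit card?"},  # question-type answer
    "kp_debit":  {"question": "a debit card",  "answer": "Freezing your debit card."},
    "kp_credit": {"question": "a credit card", "answer": "Freezing your credit card."},
}
downstream_of = {"kp_card": ["kp_debit", "kp_credit"]}

def next_knowledge_point(current_kp, user_reply):
    # Locate the downstream knowledge point whose question best matches the reply.
    children = downstream_of[current_kp]
    return max(children,
               key=lambda k: SequenceMatcher(None, user_reply, kb[k]["question"]).ratio())

print(kb["kp_card"]["answer"])                                  # system asks its question-type answer
print(next_knowledge_point("kp_card", "it is a credit card"))   # -> kp_credit
```

Each turn of the dialogue thus flows from a knowledge point's question-type answer to whichever downstream knowledge point's question best matches the user's reply.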
Ideally, when a user inputs information to the system, the system immediately understands the user's meaning. In practice, however, users often use some variant form of a standard question rather than the standard question itself. For example, if the standard form of the station-switching command for a radio is "change station", a user may instead say "switch station", and the machine needs to recognize that the user has expressed the same meaning.
As described above, the present invention matches a user question to a knowledge point by semantic similarity calculation. To obtain better semantic similarity results, the standard question of each knowledge point is expanded with a plurality of expanded questions. When performing semantic recognition, the user question in text form (i.e., the speech recognition result) is compared, by semantic similarity calculation, with the questions of each knowledge point (both the standard question and its expanded questions) to obtain the best-matching question.
To this end, in the present invention, establishing each knowledge point includes establishing the knowledge point's standard question, the expanded questions associated with it, and the corresponding answer. The standard questions and answers are compiled from knowledge provided by the customer (i.e., the customer for whom the voice interaction system is customized, e.g., a bank or a telecommunications carrier). The customer provides a description for each flow in the flowchart, for example, what information needs to be fed back according to the user's input. The standard question and answer of each knowledge point can be extracted and edited from this provided knowledge.
Expanding each standard question manually, by simply brainstorming variants, is inefficient and prone to omissions. In the present invention, the expanded questions of a standard question are automatically generated using abstract semantic expressions.
To this end, the abstract semantic database 405 included in the apparatus 400 includes a plurality of abstract semantic expressions that include missing semantic components.
In an example, the knowledge base creation module 402 can include the expansion unit 4021. The extension unit 4021 may perform abstract semantic recommendation processing on the standard question according to the abstract semantic database, extract content corresponding to missing semantic components of one or more abstract semantic expressions from the standard question when one or more abstract semantic expressions corresponding to the standard question are obtained, and fill the extracted content into the corresponding missing semantic components to obtain one or more concrete semantic expressions corresponding to the standard question, where the concrete semantic expressions are used as extension questions of the standard question.
As shown in fig. 5, the expansion unit 4021 may include a word segmentation subunit 40211, a part-of-speech tagging subunit 40212, a word-class determination subunit 40213, and a search subunit 40214.
The word segmentation subunit 40211 may be configured to perform word segmentation on the standard question to obtain a plurality of words, where each word is either a semantic-rule word or a non-semantic-rule word. The part-of-speech tagging subunit 40212 may be configured to perform part-of-speech tagging on each non-semantic-rule word to obtain the part-of-speech information of each non-semantic-rule word. The word-class determination subunit 40213 may be configured to perform word-class determination on each semantic-rule word to obtain the word-class information of each semantic-rule word. Finally, the search subunit 40214 may search the abstract semantic database according to the part-of-speech information and the word-class information to obtain the abstract semantic expressions matching the standard question.
An abstract semantic expression may also include semantic-rule words, and an abstract semantic expression matching the standard question needs to satisfy the following conditions:
the part of speech required by each missing semantic component of the abstract semantic expression covers the part of speech of the corresponding filling content extracted from the standard question;
the semantic-rule words in the abstract semantic expression and the corresponding words in the standard question are the same or belong to the same word class;
the component order of the abstract semantic expression is the same as the expression order of the standard question.
The relationship between a semantic expression and a user question is very different from traditional template matching. In template matching, a template and a user question either match or do not match; here, the relationship between a semantic expression and a user question is represented by a quantized value (a similarity), and the similarities of different questions to the user question can be compared with one another.
Therefore, semantic recognition based on semantic similarity calculation achieves a very good recognition rate, improving the user experience.
Language model training module 403 may be used to provide language models for performing speech recognition on a user's speech input.
In an example, for each flow, language model training module 403 may train a language model specific to the downstream flows of that flow, to be used for performing speech recognition of user voice input directed at those downstream flows. In training, language model training module 403 may use the question sentences in the knowledge points corresponding to the downstream flow(s) as the speech training corpus for the language model. In practice, language model training module 403 may train the language model using the SRILM tool.
Knowledge points in the knowledge base may be provided by knowledge point assignment module 404 for performing semantic recognition on the obtained speech recognition results.
In an example, the knowledge point assignment module 404 can provide, for each flow, knowledge points corresponding to flows downstream of the flow for performing semantic recognition of speech recognition results with respect to the downstream flow(s).
For a specific implementation manner of the apparatus for creating a voice interaction system in the present invention, reference may be made to the embodiment of the method for creating a voice interaction system, which is not described herein again.
According to the scheme of the present invention, transitions between flows are realized by establishing a knowledge base and matching knowledge points in the knowledge base. This avoids having a dedicated developer write VoiceXML, reducing implementation difficulty. A key advantage over traditional VoiceXML-based designs is that when flows are added or deleted, only the corresponding knowledge points need to be added or deleted in the knowledge base; the change takes effect in real time, and deployment is flexible.
The invention also provides a voice interaction system constructed by adopting the scheme.
FIG. 6 illustrates a block diagram of a voice interaction system 600 in accordance with an aspect of the subject invention.
The speech interaction system 600 may include a knowledge base 601, which knowledge base 601 may be created using the method illustrated in FIG. 1.
The speech interaction system 600 may also include a speech recognition module 602, a semantic recognition module 603, and an output module 604. Speech recognition module 602 may be used to perform speech recognition on a user's voice input using a language model provided by the method shown in FIG. 1.
Semantic recognition module 603 may be configured to perform semantic recognition on the speech recognition result using the corresponding knowledge points in knowledge base 601. The output module 604 may be used to provide a responsive output to the user based on the semantic recognition result.
In an example, the semantic recognition module 603 may include a semantic similarity calculation module 6031 that performs semantic similarity calculation between the speech recognition result and the question sentences in the corresponding knowledge points; among the questions whose semantic similarity exceeds a threshold, the question with the highest semantic similarity is determined as the matching question. The output module 604 may provide the answer associated with the matching question to the user as the responsive output.
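The selection rule of semantic similarity calculation module 6031 can be sketched as below. The similarity function (a `difflib` stand-in), the threshold value, and the knowledge-point contents are all illustrative assumptions.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # hypothetical cutoff; the patent does not fix a value

def best_match(user_text, knowledge_points):
    """knowledge_points: list of (question, answer) pairs. Among questions
    scoring above the threshold, return the answer of the highest scorer;
    return None when nothing clears the threshold."""
    scored = [(SequenceMatcher(None, user_text, q).ratio(), q, a)
              for q, a in knowledge_points]
    above = [s for s in scored if s[0] > THRESHOLD]
    if not above:
        return None                 # no match: the caller may re-prompt the user
    return max(above)[2]            # answer associated with the matching question

kps = [("how to look up violations", "Use the violations menu."),
       ("change station", "Switching station.")]
print(best_match("how can I look up my violations", kps))
```

The threshold keeps the system from answering when even the best-scoring question is a poor match, which is when a re-prompt or fallback response is more appropriate.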
Those of skill in the art would appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

1. A method for creating a voice interactive system, comprising:
receiving a voice user interaction flow chart, wherein the voice user interaction flow chart comprises a plurality of flows which are circulated according to a preset flow;
creating a knowledge base based on the plurality of processes, the knowledge base including a plurality of knowledge points corresponding to the plurality of processes, each knowledge point including a question and an answer thereto,
the plurality of processes comprise a first process and a second process positioned at the downstream of the first process, wherein the answer of a first knowledge point corresponding to the first process is a question-type answer, and the question of a second knowledge point corresponding to the second process is a response to the question-type answer of the first knowledge point;
providing a language model for performing speech recognition on a speech input of a user; and
providing knowledge points in the knowledge base for performing semantic recognition on the obtained speech recognition result;
the providing a language model includes:
for each flow, a language model specific to a downstream flow of the flow is trained to be used to perform speech recognition of user speech input with respect to the downstream flow.
2. The method of claim 1, wherein the questions in each knowledge point include a standard question and an expanded question for the standard question.
3. The method of claim 2, wherein the extension query is established by:
providing an abstract semantic database, wherein the abstract semantic database comprises a plurality of abstract semantic expressions, and the abstract semantic expressions comprise missing semantic components;
and performing abstract semantic recommendation processing on the standard questions according to the abstract semantic database, when one or more abstract semantic expressions corresponding to the standard questions are obtained, extracting contents corresponding to missing semantic components of the one or more abstract semantic expressions from the standard questions, and filling the extracted contents into the corresponding missing semantic components to obtain one or more concrete semantic expressions corresponding to the standard questions, wherein the concrete semantic expressions are used as expanded questions of the standard questions.
4. The method of claim 3, wherein the abstract semantic recommendation process comprises:
performing word segmentation processing on the standard questions to obtain a plurality of words, wherein the words are semantic regular words or non-semantic regular words;
respectively carrying out part-of-speech tagging processing on each non-semantic regular word to obtain part-of-speech information of each non-semantic regular word;
respectively carrying out word type judgment processing on each semantic rule word to obtain word type information of each semantic rule word;
and searching the abstract semantic database according to the part-of-speech information and the word-class information to obtain an abstract semantic expression matched with the standard questions.
5. The method of claim 4, wherein the abstract semantic expression further comprises semantic rule words, and wherein abstract semantic expressions that match the standard questions satisfy the following condition:
the part of speech corresponding to the missing semantic component of the abstract semantic expression comprises the part of speech of filling content corresponding to the standard question;
the corresponding semantic rule words in the abstract semantic expression and the standard questions are the same or belong to the same word class;
the order of the abstract semantic expressions is the same as the order of expression of the standard questions.
6. The method of claim 1, wherein the training comprises training a language model using questions in knowledge points corresponding to the downstream process as speech training corpora.
7. The method of claim 6, wherein the language model is trained by employing SRILM tools.
8. The method of claim 1, wherein said providing knowledge points in said knowledge base comprises:
for each flow, knowledge points corresponding to a downstream flow of the flow are provided for performing semantic recognition of speech recognition results with respect to the downstream flow.
9. An apparatus for creating a voice interactive system, comprising:
a receiving module for receiving a voice user interaction flow chart, the voice user interaction flow chart comprising a plurality of flows that proceed according to a preset procedure;
a knowledge base creation module for creating a knowledge base based on the plurality of flows, the knowledge base including a plurality of knowledge points corresponding to the plurality of flows, each knowledge point including a question and an answer thereto,
wherein the plurality of flows comprise a first flow and a second flow downstream of the first flow, the answer of a first knowledge point corresponding to the first flow being a question-type answer, and the question of a second knowledge point corresponding to the second flow being a response to the question-type answer of the first knowledge point;
a language model training module for providing a language model used to perform speech recognition on a user's speech input, wherein, for each flow, the language model training module trains a language model specific to a flow downstream of that flow, to be used for speech recognition of the user's speech input with respect to the downstream flow; and
a knowledge point assignment module for providing knowledge points in the knowledge base for performing semantic recognition on the obtained speech recognition result.
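Claim 9 (like claim 1) ties the knowledge base to the flow chart: an upstream knowledge point's answer is itself a question, and the downstream knowledge point's question is the user's response to it. The sketch below is one minimal way to represent that linkage; the `KnowledgePoint` class, flow names, and banking dialogue are all illustrative assumptions, not structures from the patent.

```python
from dataclasses import dataclass

@dataclass
class KnowledgePoint:
    flow: str
    question: str   # what the user says in this flow
    answer: str     # may itself be a question-type answer

# Hypothetical two-flow dialogue: the first knowledge point's answer
# is a question, and the second knowledge point's question is the
# user's response to that question (claim 9).
kb = [
    KnowledgePoint("ask_card_type", "I want to report a lost card",
                   "Is it a debit card or a credit card?"),
    KnowledgePoint("handle_debit", "a debit card",
                   "Your debit card has been frozen."),
]
# Flow-chart edge from the first flow to its downstream flow.
downstream = {"ask_card_type": ["handle_debit"]}
print(downstream["ask_card_type"])
```

The question-type answer is what drives the dialogue forward: each downstream flow only needs to recognize responses to the question its upstream flow just asked.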
10. The apparatus of claim 9, wherein the questions in each knowledge point include a standard question and an expanded question for the standard question.
11. The apparatus of claim 10, further comprising:
an abstract semantic database comprising a plurality of abstract semantic expressions, the abstract semantic expressions comprising missing semantic components;
wherein the knowledge base creation module comprises an expansion unit for performing abstract semantic recommendation processing on the standard question according to the abstract semantic database; when one or more abstract semantic expressions corresponding to the standard question are obtained, the expansion unit extracts, from the standard question, the content corresponding to the missing semantic components of the one or more abstract semantic expressions and fills the extracted content into the corresponding missing semantic components to obtain one or more concrete semantic expressions corresponding to the standard question, the concrete semantic expressions serving as expanded questions of the standard question.
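Claim 11 describes generating expanded questions by filling a template's missing semantic component with content extracted from the standard question. The toy sketch below shows the fill-and-expand step for one hard-coded English template; the template format, the word-class table, and the prefix-matching extraction are all simplifying assumptions, not the patent's actual matching procedure.

```python
# Hypothetical abstract semantic expression: literal semantic-rule
# words plus one missing semantic component ("{action}").
TEMPLATE_PREFIX = ["how", "to"]          # rule words before the slot
# Interchangeable semantic-rule words of the same word class (assumed).
WORD_CLASS_HOW = ["how", "what is the way"]

def expand(standard_question):
    """Return expanded questions built from the standard question.

    Extracts the content that fills the missing semantic component
    (everything after the literal "how to" prefix) and re-realizes
    the expression with same-class rule-word variants (claim 11).
    """
    tokens = standard_question.split()
    if tokens[:2] != TEMPLATE_PREFIX:
        return []                        # template does not match
    action = " ".join(tokens[2:])        # filling content
    expanded = []
    for alt in WORD_CLASS_HOW:
        concrete = f"{alt} to {action}"
        if concrete != standard_question:
            expanded.append(concrete)
    return expanded

print(expand("how to reset my password"))
```

Each concrete semantic expression produced this way becomes an expanded question attached to the same knowledge point as the standard question.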
12. The apparatus of claim 11, wherein the expansion unit comprises:
a word segmentation subunit for performing word segmentation processing on the standard question to obtain a plurality of words, each word being either a semantic rule word or a non-semantic-rule word;
a part-of-speech tagging subunit for performing part-of-speech tagging processing on each non-semantic-rule word to obtain its part-of-speech information;
a word class judgment subunit for performing word class judgment processing on each semantic rule word to obtain its word class information; and
a searching subunit for searching the abstract semantic database according to the part-of-speech information and the word class information to obtain an abstract semantic expression matching the standard question.
13. The apparatus of claim 12, wherein the abstract semantic expression further comprises semantic rule words, and wherein an abstract semantic expression matching the standard question satisfies the following conditions:
the part of speech corresponding to the missing semantic component of the abstract semantic expression includes the part of speech of the filling content extracted from the standard question;
the corresponding semantic rule words in the abstract semantic expression and the standard question are identical or belong to the same word class;
the expression order of the abstract semantic expression is the same as that of the standard question.
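The three matching conditions of claim 13 can be checked as a single predicate over an aligned walk through the template and the question. The representation below (literal rule words mixed with `('SLOT', allowed_POS)` items, plus toy POS and word-class tables) is an assumed encoding for illustration only.

```python
def matches(template, question_tokens, pos_of, word_class):
    """Check the three matching conditions of claim 13, in order:
    the slot's part of speech covers the filling content's part of
    speech, the semantic rule words agree (identical or same word
    class), and the expression order is the same."""
    i = 0
    for part in template:
        if i >= len(question_tokens):
            return False                       # order/length mismatch
        if isinstance(part, tuple):            # missing semantic component
            _, allowed_pos = part
            if pos_of[question_tokens[i]] not in allowed_pos:
                return False                   # condition 1 fails
        else:                                  # semantic rule word
            same = part == question_tokens[i]
            same_class = (word_class.get(part)
                          == word_class.get(question_tokens[i], object()))
            if not (same or same_class):
                return False                   # condition 2 fails
        i += 1
    return i == len(question_tokens)           # condition 3: same order

template = ["how", "to", ("SLOT", {"v"})]      # slot accepts a verb
pos = {"reset": "v", "password": "n"}
wc = {"how": "interrogative"}
print(matches(template, ["how", "to", "reset"], pos, wc))
```

Walking both sequences in lockstep enforces the order condition for free: any insertion, deletion, or reordering breaks the alignment and the predicate returns `False`.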
14. The apparatus of claim 9, wherein the language model training module trains the language model using the questions in the knowledge points corresponding to the downstream flow as the training corpus.
15. The apparatus of claim 14, wherein the language model training module trains the language model using the SRILM toolkit.
16. The apparatus of claim 9, wherein the knowledge point assignment module provides, for each flow, knowledge points corresponding to a flow downstream of the flow for performing semantic recognition of speech recognition results with respect to the downstream flow.
17. A voice interaction system, comprising:
a knowledge base created by the method of any one of claims 1-8;
a speech recognition module for performing speech recognition on a user speech input using a language model provided by the method of any one of claims 1-8;
a semantic recognition module for performing semantic recognition on the speech recognition result using the corresponding knowledge points in the knowledge base; and
an output module for providing a response output to the user based on the speech recognition result.
18. The voice interaction system of claim 17, wherein the semantic recognition module comprises:
a semantic similarity calculation module for performing semantic similarity calculation between the speech recognition result and the questions in the corresponding knowledge points, wherein, among the questions whose semantic similarity is higher than a threshold, the question with the highest semantic similarity is determined as a matching question;
wherein the output module provides an answer associated with the matching question to the user as the response output.
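Claim 18's selection rule, "highest similarity among those above the threshold", is a simple argmax with a floor. The sketch below uses `difflib.SequenceMatcher` purely as a stand-in for the unspecified semantic similarity computation; the threshold value and sample questions are invented.

```python
from difflib import SequenceMatcher

def match_question(asr_result, questions, threshold=0.6):
    """Return the question most similar to the speech recognition
    result, considering only questions whose similarity exceeds the
    threshold (claim 18); return None if nothing clears it."""
    best, best_score = None, threshold
    for q in questions:
        # Character-level ratio as a placeholder similarity measure.
        score = SequenceMatcher(None, asr_result, q).ratio()
        if score > best_score:
            best, best_score = q, score
    return best

qs = ["what is my account balance", "how do I transfer money"]
print(match_question("what is my balance", qs))
```

Because only the knowledge points of the current downstream flow are searched, the candidate set stays small and a brute-force scan like this is usually adequate.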
CN201611247830.XA 2016-12-29 2016-12-29 Voice interaction system and creation method and device thereof Active CN106649825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611247830.XA CN106649825B (en) 2016-12-29 2016-12-29 Voice interaction system and creation method and device thereof


Publications (2)

Publication Number Publication Date
CN106649825A CN106649825A (en) 2017-05-10
CN106649825B true CN106649825B (en) 2020-03-24

Family

ID=58835909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611247830.XA Active CN106649825B (en) 2016-12-29 2016-12-29 Voice interaction system and creation method and device thereof

Country Status (1)

Country Link
CN (1) CN106649825B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423363B (en) * 2017-06-22 2021-02-19 百度在线网络技术(北京)有限公司 Artificial intelligence based word generation method, device, equipment and storage medium
CN110019739A (en) * 2017-11-30 2019-07-16 上海智臻智能网络科技股份有限公司 Answering method and device, computer equipment and storage medium based on necessary condition
CN109960490B (en) * 2017-12-25 2024-01-02 上海智臻智能网络科技股份有限公司 Method, device, equipment, medium and question-answering system for generating intelligent question-answering system
CN110019730A (en) * 2017-12-25 2019-07-16 上海智臻智能网络科技股份有限公司 Automatic interaction system and intelligent terminal
CN108172226A (en) * 2018-01-27 2018-06-15 上海萌王智能科技有限公司 A kind of voice control robot for learning response voice and action
CN108510292A (en) * 2018-03-26 2018-09-07 国家电网公司客户服务中心 Automatic flow householder method for fault scenes problem in electric power calling service
CN108509591B (en) * 2018-03-29 2020-12-08 上海智臻智能网络科技股份有限公司 Information question-answer interaction method and system, storage medium, terminal and intelligent knowledge base
CN109189906A (en) * 2018-08-17 2019-01-11 国家电网有限公司客户服务中心 Intelligent customer service is to the complete semantic recognition methods of more question sentences under coherent context
CN111400458A (en) * 2018-12-27 2020-07-10 上海智臻智能网络科技股份有限公司 Automatic generalization method and device
CN111400459B (en) * 2018-12-27 2024-03-05 上海智臻智能网络科技股份有限公司 Method and device for generating optimal sample
CN111382984A (en) * 2018-12-27 2020-07-07 上海智臻智能网络科技股份有限公司 Interactive process creating method and device
CN110138983A (en) * 2019-04-24 2019-08-16 北京讯鸟软件有限公司 A kind of telephone outbound call voice-robot service process building method
CN110234032B (en) 2019-05-07 2022-02-25 百度在线网络技术(北京)有限公司 Voice skill creating method and system
CN110347787B (en) * 2019-06-12 2022-12-09 平安科技(深圳)有限公司 Interview method and device based on AI auxiliary interview scene and terminal equipment
CN112800189A (en) * 2019-11-14 2021-05-14 科沃斯商用机器人有限公司 Human-computer interaction method and device, intelligent robot and storage medium
CN113920794A (en) * 2021-09-09 2022-01-11 江西台德智慧科技有限公司 Network interactive education method and intelligent system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398835A (en) * 2007-09-30 2009-04-01 日电(中国)有限公司 Service selecting system and method, and service enquiring system and method based on natural language
CN105117387A (en) * 2015-09-21 2015-12-02 上海智臻智能网络科技股份有限公司 Intelligent robot interaction system
CN105306281A (en) * 2015-12-03 2016-02-03 腾讯科技(深圳)有限公司 Information processing method and client
CN105608218A (en) * 2015-12-31 2016-05-25 上海智臻智能网络科技股份有限公司 Intelligent question answering knowledge base establishment method, establishment device and establishment system
CN105912629A (en) * 2016-04-07 2016-08-31 上海智臻智能网络科技股份有限公司 Intelligent question and answer method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9542854B2 (en) * 2000-09-11 2017-01-10 Indu M. Anand Reverse-multiple choice method for knowledge engineering and expert system implementation
US9043205B2 (en) * 2012-06-21 2015-05-26 Google Inc. Dynamic language model
CN103577386B (en) * 2012-08-06 2018-02-13 腾讯科技(深圳)有限公司 A kind of method and device based on user's input scene dynamic load language model
CN104424290A (en) * 2013-09-02 2015-03-18 佳能株式会社 Voice based question-answering system and method for interactive voice system
CN105931263B (en) * 2016-03-31 2019-09-20 纳恩博(北京)科技有限公司 A kind of method for tracking target and electronic equipment


Also Published As

Publication number Publication date
CN106649825A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106649825B (en) Voice interaction system and creation method and device thereof
CN106601237B (en) Interactive voice response system and voice recognition method thereof
CN107818781B (en) Intelligent interaction method, equipment and storage medium
TWI746690B (en) Method, device and server for generating natural language question answer
CN106776936B (en) Intelligent interaction method and system
US9582757B1 (en) Scalable curation system
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN111708869B (en) Processing method and device for man-machine conversation
CN111159385B (en) Template-free general intelligent question-answering method based on dynamic knowledge graph
CN108875059B (en) Method and device for generating document tag, electronic equipment and storage medium
CN109325040B (en) FAQ question-answer library generalization method, device and equipment
CN102165518A (en) System and method for generating natural language phrases from user utterances in dialog systems
CN110008319A (en) Model training method and device based on dialog template
CN114757176B (en) Method for acquiring target intention recognition model and intention recognition method
CN106485328B (en) Information processing system and method
CN111209363B (en) Corpus data processing method, corpus data processing device, server and storage medium
CN111194401B (en) Abstraction and portability of intent recognition
CN111104803B (en) Semantic understanding processing method, device, equipment and readable storage medium
CN112579733B (en) Rule matching method, rule matching device, storage medium and electronic equipment
CN111400458A (en) Automatic generalization method and device
CN108846138A (en) A kind of the problem of fusion answer information disaggregated model construction method, device and medium
CN113722457A (en) Intention recognition method and device, storage medium, and electronic device
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN103106211A (en) Emotion recognition method and emotion recognition device for customer consultation texts
CN117216222A (en) Intelligent question-answering method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Voice interaction system and its creation method and device

Effective date of registration: 20230223

Granted publication date: 20200324

Pledgee: China Construction Bank Corporation Shanghai No.5 Sub-branch

Pledgor: SHANGHAI XIAOI ROBOT TECHNOLOGY Co.,Ltd.

Registration number: Y2023980033272
