CN111680144A

CN111680144A - Method and system for multi-turn dialogue voice interaction, storage medium and electronic equipment

Info

Publication number: CN111680144A
Application number: CN202010496351.1A
Authority: CN
Inventors: 李林峰; 黄海荣
Original assignee: Hubei Ecarx Technology Co Ltd
Current assignee: Ecarx Hubei Tech Co Ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-09-18

Abstract

The invention provides a method and a system for multi-turn dialogue voice interaction, a storage medium and electronic equipment. After a user's voice query information is received and converted into text query information, the text query information is classified by intention, and the word slots it contains and their word slot types are extracted. The preset word slot types included in the intention corresponding to the text query information are matched against the word slot types included in the text query information, so that optional or necessary supplementary word slots can be acquired when the text query information lacks word slots. The supplementary word slots are then matched into the word slots corresponding to the preset word slot types that the intention lacks, and response information for the text query information is generated, thereby realizing intelligent voice interaction.

Description

Method and system for multi-turn dialogue voice interaction, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of intelligent voice interaction, in particular to a method and a system for multi-turn dialogue voice interaction, a storage medium and electronic equipment.

Background

With the development of the internet and artificial intelligence, intelligent voice interaction is used more and more widely, and semantic recognition and information processing are extremely important to it. Traditional intelligent voice interaction technology usually recognizes only a single sentence of the user's voice information and cannot carry over or fuse semantic information across sentences.

For example, the user first says "Help me find a train ticket to Shanghai" and then says "What about a plane ticket?" With the second sentence the user wants to ask about a plane ticket to Shanghai, but the system cannot understand this and may ask "Where do you want a ticket to?" Because the system keeps asking in follow-up dialogue for information the user has already given, the user experience suffers.

For another example, the user books "a train ticket to Beijing" by voice interaction and then asks "How is the weather there?" The "there" in this sentence refers to Beijing, but since a conventional voice interaction system can only process one task at a time, the destination information from the ticket booking task cannot be carried over to the weather query task when the system switches from one to the other, which also affects the user experience.

Disclosure of Invention

The present invention provides a method and system, storage medium, and electronic device for multi-turn conversational speech interaction to overcome the above problems or at least partially solve the above problems.

According to one aspect of the invention, a method for multi-turn dialogue voice interaction is provided, which comprises the following steps:

receiving voice query information of a user, and converting the voice query information into text query information;

classifying intentions of the text query information, and determining intentions corresponding to the text query information, wherein the intentions comprise a plurality of preset word slot types and word slot attributes corresponding to the word slot types, and the word slot attributes comprise necessary word slot types and optional word slot types;

extracting word slots and word slot types of the word slots included in the text query information;

matching a plurality of preset word slot types included by the intention corresponding to the text query information with word slot types included by the text query information, and judging whether the text query information lacks the preset word slot types;

if yes, determining the word slot attribute of the missing preset word slot type;

if the word slot attribute of the missing preset word slot type is the necessary word slot type, generating an additional query instruction so as to determine a first supplementary word slot corresponding to the necessary word slot type according to the voice information fed back by the user;

if the word slot attribute of the missing preset word slot type is the optional word slot type, acquiring a second supplementary word slot corresponding to the optional word slot type from preset word slot information;

matching the first supplementary word slot and the second supplementary word slot, according to their respective word slot types, into the word slots corresponding to the preset word slot types lacking in the intention, and determining a vertical field to which the intention belongs;

and generating response information aiming at the text query information according to the vertical field.

Optionally, performing intent classification on the text query information, and determining an intent corresponding to the text query information includes:

inputting the text query information into a trained intention classification neural network model, and determining possible intentions corresponding to the text query information and probability values of the possible intentions through the intention classification neural network model;

and selecting the possible intention with the maximum probability value as the intention corresponding to the text query information.

Optionally, generating an additional query instruction to determine a first supplemental word slot corresponding to the necessary word slot type according to the voice information fed back by the user includes:

determining a follow-up question sentence pattern according to the intention corresponding to the text query information and the preset word slot type lacking in the text query information;

generating an additional query instruction according to the follow-up question sentence pattern;

receiving voice information fed back by a user based on the additional query instruction;

and determining, as a first supplementary word slot, the word slot that corresponds to the missing preset word slot type and is determined according to the voice information fed back by the user.

Optionally, the preset word slot information includes general word slot information and user private information;

the step of obtaining a second supplementary word slot corresponding to the optional word slot type from preset word slot information comprises:

judging whether the general word slot information contains a second supplementary word slot corresponding to the optional word slot type;

if the general word slot information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

if the general word slot information does not contain a second supplementary word slot corresponding to the optional word slot type, judging whether the user private information contains the second supplementary word slot corresponding to the optional word slot type;

if the user private information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

and if the user private information does not contain the second supplementary word slot corresponding to the optional word slot type, ending the acquisition process of the second supplementary word slot.

Optionally, the generating response information for the text query information according to the vertical domain includes:

searching and executing an intention processing function corresponding to the intention in the vertical field, and outputting an execution result of the intention processing function as response information; or

searching the vertical field for response information for the text query information, and outputting the response information.

Optionally, the response message includes a text message and/or a voice message.

According to another aspect of the present invention, there is also provided a system for multi-turn dialogue voice interaction, comprising:

the information conversion module is configured to receive voice query information of a user and convert the voice query information into text query information;

the intention classification module is configured to classify the text query information and determine an intention corresponding to the text query information, wherein the intention comprises a plurality of preset word slot types and word slot attributes corresponding to the word slot types, and the word slot attributes comprise necessary word slot types and optional word slot types;

the word slot extraction module is configured to extract word slots and word slot types of the word slots included in the text query information;

a word slot matching module configured to match a plurality of preset word slot types included in the intention corresponding to the text query information with word slot types included in the text query information, and determine whether the text query information lacks a preset word slot type;

the attribute determining module is configured to determine the word slot attribute of the missing preset word slot type when the text query information lacks the preset word slot type;

the first obtaining module is configured to generate an additional query instruction if the word slot attribute of the missing preset word slot type is the necessary word slot type, so as to determine a first supplementary word slot corresponding to the necessary word slot type according to the voice information fed back by the user;

the second obtaining module is configured to obtain a second supplementary word slot corresponding to the optional word slot type from preset word slot information if the word slot attribute of the missing preset word slot type is the optional word slot type;

a vertical domain determining module configured to match the first supplemental word slot and the second supplemental word slot to a word slot corresponding to a preset word slot type lacking in the intention according to respective word slot types, and determine a vertical domain to which the intention belongs;

and the information generation module is configured to generate response information aiming at the text query information according to the vertical field.

According to another aspect of the present invention, there is also provided a computer-readable storage medium, characterized in that at least one instruction, at least one program, set of codes, or set of instructions is stored in the storage medium, and the at least one instruction, at least one program, set of codes, or set of instructions is loaded by a processor and performs the method of multi-turn conversational speech interaction as described in any one of the above.

According to another aspect of the present invention, there is also provided an electronic device, characterized by comprising a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for multi-turn conversational speech interaction as described in any one of the above.

The invention provides a method and a system for multi-turn dialogue voice interaction. The method receives a user's voice query information, converts it into text query information, classifies the intention of the text query information, and extracts the word slots and word slot types it includes. The preset word slot types included in the corresponding intention are then matched against the word slot types included in the text query information, so that optional or necessary supplementary word slots can be obtained when the text query information lacks word slots. The supplementary word slots are matched into the word slots corresponding to the preset word slot types that the intention lacks, and response information for the text query information is generated, realizing intelligent voice interaction. Because the method matches word slots and word slot types against the intention corresponding to the text query information, it can quickly determine which word slot types the text query information lacks.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow diagram illustrating a method for multiple rounds of conversational speech interaction, according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a hierarchy of an intent classification neural network model according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a system for multi-turn dialogue voice interaction according to an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

FIG. 1 is a schematic flowchart of a method for multi-turn dialogue voice interaction according to an embodiment of the present invention. As can be seen from FIG. 1, the method provided by the embodiment of the present invention may include steps S102 to S118.

Step S102, receiving the voice query information of the user, and converting the voice query information into text query information.

The voice query information in this embodiment may be any voice data uttered by the user, and it may be converted into text query information by speech recognition technology. For example, if the voice query information is "please navigate to Shanghai", it can be converted by speech recognition technology into the corresponding text query information.

Step S104, performing intention classification on the text query information, and determining the intention corresponding to the text query information. The intention comprises a plurality of preset word slot types and word slot attributes corresponding to the word slot types, wherein the word slot attributes comprise necessary word slot types and optional word slot types.

An intention describes what the user specifically wants. Generally, the text query information can be processed further only after the user's intention corresponding to it has been recognized. Intention recognition can be based on rules, or on neural network methods such as FastText and TextCNN.

In embodiments of the present invention, the classification of intent may be achieved by an intent classification neural network model. The intention classification neural network model can be obtained by training based on batch data in advance, and the intention of the text query information can be quickly and accurately output by inputting the text information into the intention classification neural network model. The intention classification neural network model can comprise an intention recognition module, a word slot extraction module, a dialogue management module, a vertical domain processing module, a response control module, a natural language generation module, a dictionary, a database, a knowledge base and the like.

The intention classification neural network model in the present embodiment may also be referred to as a Natural Language Understanding (NLU) model. As shown in fig. 2, the model is composed of a plurality of vertical domains, and the vertical domains are a set composed of algorithms, models, data, and the like, for completing tasks of a certain domain. Each vertical domain performs a task such as listening to music, ordering a meal, navigating, etc. Further, each vertical domain contains several intents, each intent containing some sentence patterns, fixed suffixes, and zero or more word slot types.

For example, assume that the text query information is "please navigate to Shanghai". The intention is navigating to a destination; "Shanghai" is the specific destination of the navigation, that is, "Shanghai" is a specific word slot whose word slot type is destination. When intention recognition is performed on the text query information with the intention classification neural network model, the intention of the text is recognized according to the word slots contained in the text, the associations between the word slots, the word slot types corresponding to the word slots, and so on. This differs from the entity extraction performed later with a named entity model: the intention classification neural network model recognizes the intention, whereas the named entity model recognizes the word slots contained in the text.

In practical application, an intention classification neural network model personalized for each user can be generated, wherein the model comprises a plurality of vertical fields, each vertical field comprises a plurality of intentions, and each intention comprises a plurality of word slot types or does not comprise a word slot type. Wherein, the attribute of the word slot type can be a necessary word slot type and an optional word slot type.

The relationship of the model, vertical domain, intent, word-slot type is shown in FIG. 2. Taking the intent classification neural network model as model 1 for example, the configuration of the end user will form a table as shown in table 1, which contains all models, verticals, intents, word-slot types, and their relationships.

TABLE 1

Model	Vertical field	Intention	Word slot type
Model 1	Vertical field 1	Intention 1	Word slot type 1
Model 1	Vertical field 1	Intention 1	Word slot type 2
Model 1	Vertical field 1	Intention 2	Word slot type 1
…	…	…	…
Model 1	Vertical field 2	Intention 1	Word slot type 1

The following are commonly used word slot types:

Number (word slots of this type are, for example, "how much money", "how many people", "the first", etc.)

Time (word slots of this type are, for example, "yesterday", "Monday", "the 1st of next month", etc.)

Location (word slots of this type are, for example, "Shanghai", "near the science and technology center", "Lujiazui", etc.)

City (word slots of this type are, for example, "Shanghai", "Beijing", etc.)

Person name (word slots of this type are, for example, "Zhang San", "Li Si", etc.)

Furthermore, in the intention classification neural network model provided by the embodiment of the invention, the vertical domain, the intention and the word slot are each abstracted and defined as abstract classes in the implementation, and a template and an input interface are provided so that a user can create a new vertical domain, which greatly improves development efficiency. The definition of each abstract class may be as follows:

Vertical domain (domain) abstract class:

Attributes:

String name; // name
Int index; // index
List<intent>* intents; // intent list

Methods:

String getName();
Int getIndex();
List<intent>* getIntent();

Intent (intent) abstract class:

Attributes:

String name; // name
Int index; // index
List<slot>* slots; // word slot list
Bool actOrResp; // whether an action function or a reply is used
void* action; // function pointer
String response; // reply content

Methods:

String getName();
Int getIndex();
List<slot>* getSlot();
Bool isActOrResp();
void* getAction();
String getResponse();

Word slot (slot) abstract class:

Attributes:

String name; // name
Int index; // index
Bool must; // necessary or optional
Int tryCount; // number of follow-up attempts
String ask; // follow-up question content
Int order; // follow-up order

Methods:

String getName();
Int getIndex();
Bool isMust();
Int getTryCount();
String getAsk();
Int getOrder();

The above definitions of the abstract classes are only one possible implementation. In practical application, the abstract classes may be defined in other ways, which are not described here.
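As an illustration only, the three abstract classes above could be sketched in Python as plain data classes. The snake_case field names, the default values, the reading of `act_or_resp` as "True means run the action function", and the ticket-booking configuration are all assumptions made for this sketch and are not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Slot:
    """Word slot type definition (mirrors the slot abstract class)."""
    name: str
    index: int
    must: bool            # True: necessary, False: optional
    try_count: int = 1    # maximum number of follow-up attempts
    ask: str = ""         # follow-up question content
    order: int = 0        # follow-up order; smaller is asked first


@dataclass
class Intent:
    """Intent definition (mirrors the intent abstract class)."""
    name: str
    index: int
    slots: List[Slot] = field(default_factory=list)
    act_or_resp: bool = False  # assumed: True runs `action`, False returns `response`
    action: Optional[Callable[[dict], str]] = None
    response: str = ""


@dataclass
class Domain:
    """Vertical domain definition (mirrors the domain abstract class)."""
    name: str
    index: int
    intents: List[Intent] = field(default_factory=list)


# Hypothetical configuration: a ticket-booking vertical domain with one intent.
book_train = Intent(
    name="book_train_ticket",
    index=0,
    slots=[
        Slot("destination", 0, must=True, try_count=2, ask="Where are you going?"),
        Slot("departure", 1, must=False),
    ],
    act_or_resp=True,
    action=lambda f: f"Searching tickets from {f['departure']} to {f['destination']}",
)
tickets = Domain(name="tickets", index=0, intents=[book_train])
```

A user-created vertical domain then amounts to one `Domain` value holding its intents and their slot definitions.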

In practical application, because the neural network is a mathematical model, before inputting text query information, the text query information needs to be converted into numerical indexes and then input into the intent classification neural network model for further processing.
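The conversion to numerical indexes can be sketched as a vocabulary lookup. The embodiment does not say how the text is split or how unknown symbols are handled, so the character-level split, the reserved `PAD` and `UNK` indices and the fixed length `max_len` below are assumptions.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # assumed reserved indices for padding and unknown characters


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign an integer index to every distinct character in the corpus."""
    vocab: Dict[str, int] = {}
    for sentence in corpus:
        for ch in sentence:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # 0 and 1 are reserved
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map text to a fixed-length list of indices, truncating or padding as needed."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["navigate to Shanghai"])
ids = encode("to Shanghai!", vocab, max_len=16)
```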

In general, when intent classification is performed, a plurality of corresponding intents may be recognized. In the embodiment of the present invention, when determining the intention corresponding to the text query information, the text query information may be input into the trained intention classification neural network model, the possible intention corresponding to the text query information and the probability values of the possible intentions are determined by the intention classification neural network model, and then the possible intention with the maximum probability value is selected as the intention corresponding to the text query information.

For example, if there are N intents, the output is N floating point values, which represent the probabilities of intents 1 to N respectively. Each output probability value lies between 0 and 1; the larger the value, the more likely the text belongs to that intent category, and the smaller the value, the less likely.
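Selecting the intent with the maximum probability value can be sketched as follows. The embodiment does not state how the N values are normalized; the softmax used here is an assumption, and the intent names and scores are made up for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities in [0, 1] that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_intent(probs: List[float], names: List[str]) -> Tuple[str, float]:
    """Return the intent with the largest probability value."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]


names = ["navigate", "book_ticket", "query_weather"]  # hypothetical intents
probs = softmax([2.0, 0.5, -1.0])                     # made-up model scores
intent, p = pick_intent(probs, names)
```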

Step S106, extracting the word slots included in the text query information and the word slot type of each word slot.

In this embodiment, word slot extraction may be performed on the text query information based on a named entity recognition model. Word slot extraction extracts the key entities in a sentence, such as names of people, names of places, names of organizations, times, locations, categories and numbers. The extraction method may be rule-based, or may be based on a statistical model or a neural network model. In practical application, a neural-network-based named entity recognition model is preferably used, because, given training on a large-scale corpus, it can extract the word slots and their types from the text information easily and accurately.

For example, if the text query information is "navigate to Beijing", the extracted word slot is "Beijing": its word slot type is place, its starting position is the 4th character, and its length is 2 characters.

For another example, if the text query information is "I want to buy a train ticket from Hangzhou to Beijing", the extracted word slots are "Hangzhou", whose word slot type is departure place, starting position is the 5th character and length is 2 characters, and "Beijing", whose word slot type is destination, starting position is the 8th character and length is 2 characters. (Positions and lengths here are counted in the characters of the original Chinese query.)

In the embodiment of the invention, not only can the word slots and word slot types in the text query information be extracted, but information such as the starting position and length of each word slot in the text query information can also be obtained.
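The output of word slot extraction (word slot, type, starting position, length) can be sketched with a small rule-based extractor, which is one of the options the description mentions; it is not the neural named entity recognition model the embodiment prefers. The gazetteer, the prefix rule and the Chinese query strings are assumptions, reconstructed so that the character positions match the two examples above.

```python
from dataclasses import dataclass
from typing import Dict, List

CITIES = ["北京", "上海", "杭州"]  # assumed gazetteer: Beijing, Shanghai, Hangzhou
# Assumed rule: the character before a city decides its type ("从" = from, "到" = to).
PREFIX_TO_TYPE: Dict[str, str] = {"从": "departure", "到": "destination"}


@dataclass
class ExtractedSlot:
    text: str
    slot_type: str
    start: int   # 1-based position of the first character
    length: int  # length in characters


def extract_slots(query: str) -> List[ExtractedSlot]:
    """Find known city names and type them by the character that precedes them."""
    found: List[ExtractedSlot] = []
    for city in CITIES:
        pos = query.find(city)
        while pos != -1:
            prefix = query[pos - 1] if pos > 0 else ""
            slot_type = PREFIX_TO_TYPE.get(prefix, "place")
            found.append(ExtractedSlot(city, slot_type, pos + 1, len(city)))
            pos = query.find(city, pos + len(city))
    return sorted(found, key=lambda s: s.start)


# Assumed Chinese form of "I want to buy a train ticket from Hangzhou to Beijing".
slots = extract_slots("我要买从杭州到北京的火车票")
```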

Step S108, matching a plurality of preset word slot types included by the intention corresponding to the text query information with the word slot types included by the text query information, and judging whether the text query information lacks the preset word slot types.

Take the above-mentioned intention classification neural network model as an example: each vertical domain may include several intentions, and each intention may include several word slot types. In practical application, a user may create one or more vertical domains in advance as required; each vertical domain contains several intentions, and each intention contains one or more word slot types or no word slot type at all.

After the intention corresponding to the text query information is identified, the word slot type included in the text query information may be matched with a number of preset word slot types included in the intention to determine whether the text query information lacks a preset word slot type.
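A minimal sketch of this matching step, assuming the preset word slot types of each intention are held in a dictionary that also records whether each type is necessary. The intent name and slot types are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical intent configuration: preset word slot type -> True if necessary.
PRESET_SLOTS: Dict[str, Dict[str, bool]] = {
    "book_train_ticket": {"destination": True, "departure": False, "time": False},
}


def find_missing(intent: str, extracted_types: List[str]) -> Tuple[List[str], List[str]]:
    """Return (missing necessary types, missing optional types) for the intent."""
    present = set(extracted_types)
    necessary: List[str] = []
    optional: List[str] = []
    for slot_type, must in PRESET_SLOTS[intent].items():
        if slot_type not in present:
            (necessary if must else optional).append(slot_type)
    return necessary, optional


missing_necessary, missing_optional = find_missing("book_train_ticket", ["time"])
```

Splitting the result by attribute prepares the two branches that follow: necessary types lead to a follow-up question, optional types to a lookup in preset word slot information.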

Step S110, if yes, determining the word slot attribute of the missing preset word slot type. That is, if a preset word slot type is indeed missing, it is further determined whether the missing word slot type is an optional word slot type or a necessary word slot type.

Step S112, if the word slot attribute of the missing preset word slot type is the necessary word slot type, generating an additional query instruction, so as to determine a first supplementary word slot corresponding to the necessary word slot type according to the voice information fed back by the user.

The word slot attribute "must" indicates whether the word slot must be extracted from the user's sentence; if it must be but is not present, the user is asked a follow-up question. The content of the follow-up question is the "ask" (follow-up question content) attribute of the word slot.

In an optional embodiment of the present invention, determining the first supplemental word slot may include:

S1-1, determining a follow-up question sentence pattern according to the intention corresponding to the text query information and the preset word slot type lacking in the text query information;

S1-2, generating an additional query instruction according to the follow-up question sentence pattern;

S1-3, receiving voice information fed back by the user based on the additional query instruction;

and S1-4, determining, as a first supplementary word slot, the word slot that corresponds to the missing preset word slot type and is determined according to the voice information fed back by the user.

For a missing necessary word slot type, an additional query instruction based on a preset follow-up question sentence pattern may be generated, and its content may be taken from the "ask" (follow-up question content) attribute of the word slot configured by the user. If several word slots need to be asked for, they may be ranked by priority; for example, the user may be asked in the order given by the "order" attribute of the word slots. A smaller "order" value indicates a higher priority, that is, the word slot is asked for first. Since "order" is a private attribute, its value is obtained by calling the "getOrder()" method of the word slot.

The additional query instruction may be audio content containing the question. After hearing the audio played by the system, the user replies with voice information.

In practical applications, the voice information fed back by the user may be wrong. For example, the follow-up question asks for a number but the user's reply is not of the number type, in which case the follow-up fails. The question may then be repeated until the limit on the number of attempts is reached. The limit is defined in the "tryCount" (number of follow-up attempts) attribute of the word slot and may be set in advance according to different requirements, which is not limited by the present invention.
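The follow-up behaviour described above (ask in "order" priority, repeat up to "tryCount" times) can be sketched as below. The `parse` callback, the use of `print` in place of audio playback, the iterator of replies in place of recognized speech, and returning `None` once the attempts are exhausted are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class MissingSlot:
    name: str
    ask: str          # follow-up question content
    order: int        # smaller value is asked first
    try_count: int    # maximum number of attempts
    parse: Callable[[str], Optional[str]]  # returns None when the reply is unusable


def parse_number(reply: str) -> Optional[str]:
    """Accept the reply only if it is a plain number."""
    reply = reply.strip()
    return reply if reply.isdigit() else None


def ask_missing(slots: List[MissingSlot], replies: Iterator[str]) -> Optional[Dict[str, str]]:
    """Ask for each missing necessary slot in priority order, retrying up to try_count."""
    filled: Dict[str, str] = {}
    for slot in sorted(slots, key=lambda s: s.order):
        for _ in range(slot.try_count):
            print(slot.ask)                    # stands in for playing the query audio
            value = slot.parse(next(replies))  # stands in for recognizing the spoken reply
            if value is not None:
                filled[slot.name] = value
                break
        else:
            return None                        # attempts exhausted: the follow-up failed
    return filled


slots = [
    MissingSlot("passengers", "How many passengers?", order=1, try_count=2, parse=parse_number),
    MissingSlot("destination", "Where are you going?", order=0, try_count=1,
                parse=lambda r: r.strip() or None),
]
result = ask_missing(slots, iter(["Beijing", "a few", "3"]))
```

In this run the destination is asked first because its `order` is smaller, and the passenger count succeeds on the second attempt after a non-numeric reply.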

In an optional embodiment of the present invention, after voice information is converted into text information, it may be determined whether the text should be handled by the main process, which processes text query information converted from a new voice query, or by the sub-process, which fills missing word slots for existing text query information. Specifically, it can be judged whether the previous round of the conversation is in a state of waiting for a word slot. If so, the word slot follow-up sub-process is entered so as to obtain the first supplementary word slot; otherwise, the main processing flow is entered.
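The choice between the main process and the word slot follow-up sub-process can be sketched with a small dialogue state. The `DialogueState` structure and the `route` function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueState:
    waiting_slot: Optional[str] = None  # slot type being asked for, if any
    filled: Dict[str, str] = field(default_factory=dict)


def route(state: DialogueState, text: str) -> str:
    """Send the text to the slot follow-up sub-process or to the main process."""
    if state.waiting_slot is not None:
        # The previous round asked a follow-up question: treat the text as its answer.
        state.filled[state.waiting_slot] = text
        state.waiting_slot = None
        return "sub_process"
    # Otherwise the text is a new query for intent classification and slot extraction.
    return "main_process"


state = DialogueState(waiting_slot="destination")
first = route(state, "Beijing")
second = route(state, "How is the weather there?")
```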

Step S114, if the word slot attribute of the missing preset word slot type is the optional word slot type, obtaining a second supplementary word slot corresponding to the optional word slot type from preset word slot information. Optionally, the preset word slot information includes general word slot information and user private information.

In an optional embodiment of the present invention, the obtaining a second supplemental word slot corresponding to the optional word slot type from the preset word slot information includes:

S2-1, judging whether the general word slot information contains a second supplementary word slot corresponding to the optional word slot type;

S2-2, if the general word slot information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

S2-3, if the general word slot information does not contain a second supplementary word slot corresponding to the optional word slot type, judging whether the user private information contains the second supplementary word slot corresponding to the optional word slot type;

S2-4, if the user private information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

and S2-5, if the user private information does not contain the second supplementary word slot corresponding to the optional word slot type, ending the acquisition process of the second supplementary word slot.

As shown in Table 2, the "general word slot information" stores general word slot types that can be used by multiple intentions in multiple vertical domains, such as time, city and person. At initialization, the word slot for each word slot type in the general word slot information is empty; once processing of text query information begins, each general word slot that is obtained is stored under the corresponding word slot type in the general word slot information.

TABLE 2

Word slot type	Word slot
Time	Today
City	None (representing empty)
...	...
Person	Zhang San

If the missing optional word slot type exists in the general word slot information and its word slot is not empty, that word slot is determined as the second supplementary word slot; otherwise, the user private information is queried.

The user private information is specific to each user. As shown in Table 3 below, it contains items such as the vehicle model, the city of registration, and owner information.

TABLE 3

Name	Description	Word slot
Vehicle model	Vehicle model	Botupro
City	Vehicle home location	Shanghai
...	...	...
Sex	Sex of vehicle owner	Male

If the required optional word slot type exists in the user private information, the word slot corresponding to that word slot type is determined as the second supplementary word slot; otherwise, there is not enough information to process the user instruction, the information processing fails, and the whole flow ends.
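The lookup order of steps S2-1 to S2-5 can be sketched as follows. This is a minimal sketch under stated assumptions: both stores are modeled as plain dictionaries keyed by word slot type, an empty word slot is represented by `None`, and the function name `find_optional_slot` is hypothetical.

```python
from typing import Optional


def find_optional_slot(slot_type: str,
                       general_slots: dict,
                       user_private: dict) -> Optional[str]:
    """Look up an optional word slot: general information first, then private."""
    # S2-1 / S2-2: use the general word slot information if it holds a value.
    value = general_slots.get(slot_type)
    if value is not None:
        return value
    # S2-3 / S2-4: otherwise fall back to the user private information.
    value = user_private.get(slot_type)
    if value is not None:
        return value
    # S2-5: nothing found, so the acquisition ends without a word slot.
    return None
```

A caller can treat a `None` result as the failure case described above, in which there is not enough information to process the user instruction.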

The user private information holds different data for different users, so that each user receives personalized behavior. When an optional word slot is being obtained, the user private information is queried if the general word slot information does not contain a word slot of the missing optional word slot type. The user private information stores information unique to the user, such as the user's home location, which is determined when the user buys the car and is stored in a database; the home location can serve as the default place of departure for intentions such as train ticket booking and navigation.

For different users, the default values of the optional word slots are different. For example, if the home location of the user is "Shanghai", the default values of the "departure place" word slot for ticket booking and of the "place" word slot for a weather query are both "Shanghai". For an optional word slot type without a default value, the method provided by this embodiment may keep in memory the word slots extracted in the most recent rounds of conversation, and use the stored word slot if the user does not provide one the next time it is needed.

For example, the previous round of conversation is a train ticket booking, and the destination word slot extracted from the user's sentence is "Beijing". In the next round, the user initiates a weather query and only says "How is the weather there?". Here "there" refers to "Beijing", so "Beijing" can be used as the query city when looking up the weather. The "general word slot information" thus facilitates cross-task multi-round conversation: the user does not need to be asked for every word slot, which improves the user experience.
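The cross-task example above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the memory is a plain dictionary, and it is assumed that the destination of the ticket booking is stored under the general word slot type "city".

```python
def remember_slots(general_slots: dict, extracted: dict) -> None:
    """Store word slots extracted from the current utterance in shared memory."""
    for slot_type, value in extracted.items():
        general_slots[slot_type] = value


def resolve_city(general_slots: dict, extracted: dict):
    """Prefer a city named in the current utterance, else the remembered one."""
    return extracted.get("city") or general_slots.get("city")


# Round 1: train ticket booking; the destination is remembered as the city
# (assumption: the destination maps to the general "city" word slot type).
memory = {"time": None, "city": None}
remember_slots(memory, {"city": "Beijing"})

# Round 2: "How is the weather there?" carries no explicit city.
query_city = resolve_city(memory, {})
```

Because the current utterance takes precedence, a later query that names a city explicitly is not overridden by the remembered value.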

Step S116, matching the first supplementary word slot and the second supplementary word slot, according to their respective word slot types, into the word slots corresponding to the preset word slot types lacking in the intention, and determining the vertical field to which the intention belongs.

So far, the first supplementary word slot and the second supplementary word slot have been obtained, and further, the first supplementary word slot and the second supplementary word slot may be matched to a word slot corresponding to a preset word slot type lacking in the intention, so as to determine a vertical field to which the intention belongs.
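The matching of supplementary word slots into the intention can be sketched as follows. This is a minimal sketch under stated assumptions: word slots are modeled as dictionaries keyed by word slot type, and the function name `fill_missing_slots` and the example word slot types are hypothetical.

```python
def fill_missing_slots(preset_types, extracted, first_supp, second_supp):
    """Merge supplementary word slots into the slots an intention still lacks."""
    filled = dict(extracted)  # copy so the caller's dictionary is not modified
    for supplement in (first_supp, second_supp):
        for slot_type, value in supplement.items():
            # Only fill preset word slot types that are still missing.
            if slot_type in preset_types and slot_type not in filled:
                filled[slot_type] = value
    still_missing = [t for t in preset_types if t not in filled]
    return filled, still_missing


filled, missing = fill_missing_slots(
    ["departure", "destination", "time"],
    {"destination": "Beijing"},     # extracted from the user's sentence
    {"time": "tomorrow 9 am"},      # first supplementary slot (follow-up answer)
    {"departure": "Shanghai"},      # second supplementary slot (default value)
)
```

Word slots already extracted from the user's sentence are never overwritten in this sketch, so an explicit value always wins over a default.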

Step S118, generating response information aiming at the text query information according to the vertical field. The response message may include text message and/or voice message.

In an alternative embodiment of the present invention, the step S118 of generating response information for the text query information according to the vertical domain may include: searching and executing an intention processing function corresponding to the intention in the vertical field, and outputting an execution result of the intention processing function as response information; or searching response information aiming at the text query information in the vertical field and outputting the response information.

That is, it can be determined whether to perform further processing by the intention processing function or to directly return preset response information according to the vertical domain of intention.

Both the intention processing function and the preset response information can be predefined in the intentions included in each vertical domain of the model; for example, the "action" defined in the above-mentioned intention abstract class is a processing function, and the "response" is a reply statement. A web address (URL, uniform resource locator) and its parameters may also be defined, so that the processing function can directly call a remote device or server to perform the corresponding action; for example, in the train ticket booking vertical domain, it can go directly to the 12306 train ticket website to query and order a train ticket.

That is, after the complete text query information is obtained, the corresponding vertical field and the corresponding intention item are first looked up, and it is checked whether an intention processing function is defined. If one is defined, it is executed; if not, the response message is generated from the predefined content. It should be noted that the intention processing function and the reply message cannot both be empty.
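The choice between the processing function and the predefined reply can be sketched as follows. The field names `action` and `response` follow the intention abstract class mentioned above; everything else, including the `Intent` dataclass, the `generate_response` function, and the example intentions, is a hypothetical illustration rather than the embodiment's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Intent:
    name: str
    action: Optional[Callable[[dict], str]] = None  # intention processing function
    response: Optional[str] = None                  # predefined reply statement

    def __post_init__(self):
        # The processing function and the reply cannot both be empty.
        if self.action is None and self.response is None:
            raise ValueError("action and response cannot both be empty")


def generate_response(intent: Intent, slots: dict) -> str:
    """Run the processing function if defined, else return the preset reply."""
    if intent.action is not None:
        return intent.action(slots)
    return intent.response


greeting = Intent("greeting", response="Hello")
booking = Intent(
    "book_train_ticket",
    action=lambda s: f"Booked a ticket from {s['departure']} to {s['destination']}",
)
```

Validating the constraint when the intention is created means a misconfigured intention is rejected before any user query reaches it.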

For example, for "order a train ticket from Shanghai to Beijing", the ticket information is returned after the ticket is ordered successfully, and a reply text is generated, such as "A train ticket from Shanghai to Beijing at 9 am tomorrow has been ordered for you; the train number is G105, the carriage is No. 6, and the seat number is 06A".

The embodiment of the invention provides a multi-turn dialogue voice interaction method. After voice query information of a user is received and converted into text query information, intention classification is performed on the text query information, and the word slots and word slot types included in it are extracted. The preset word slot types included in the intention corresponding to the text query information are then matched against the word slot types included in the text query information, so that optional or necessary supplementary word slots are obtained when the text query information lacks word slots. The supplementary word slots are matched into the word slots corresponding to the preset word slot types lacking in the intention, and response information for the text query information is generated, realizing intelligent voice interaction. Based on the method provided by the embodiment of the invention, word slots and word slot types are matched against the intention corresponding to the text query information, so that the word slot types lacking in the text query information can be quickly determined.

In addition, the method provided by the embodiment of the invention supports creating a vertical field or an intention from a template without rewriting code, which makes it convenient for a client to develop new vertical fields. It also supports multi-round session context memory across vertical fields, memory of very long session information, and personalized word slot default values that differ from user to user.

The method provided by the embodiment of the invention can be applied to the in-vehicle head unit system of an automobile or to other scenarios that implement intelligent voice interaction, and the invention is not limited thereto.

Based on the same inventive concept, an embodiment of the present invention further provides a system for multi-turn dialog voice interaction, as shown in fig. 3, the system may include:

an information conversion module 310 configured to receive voice query information of a user and convert the voice query information into text query information;

the intention classification module 320 is configured to perform intention classification on the text query information, determine an intention corresponding to the text query information, wherein the intention includes a plurality of preset word slot types and word slot attributes corresponding to the word slot types, and the word slot attributes include a necessary word slot type and an optional word slot type;

a word slot extracting module 330 configured to extract word slots included in the text query information and word slot types of the word slots;

the word slot matching module 340 is configured to match a plurality of preset word slot types included in the intention corresponding to the text query information with word slot types included in the text query information, and determine whether the text query information lacks the preset word slot types;

an attribute determining module 350, configured to determine the word slot attribute of the missing preset word slot type when it is determined that the text query information lacks the preset word slot type;

the first obtaining module 360 is configured to generate an additional query instruction if the missing word slot attribute of the preset word slot type is the necessary word slot type, so as to determine a first supplementary word slot corresponding to the necessary word slot type according to the voice information fed back by the user;

a second obtaining module 370, configured to obtain, from the preset word slot information, a second supplementary word slot corresponding to the optional word slot type if the missing word slot attribute of the preset word slot type is the optional word slot type;

a vertical domain determining module 380 configured to match the first supplemental word slot and the second supplemental word slot to a word slot corresponding to a preset word slot type lacking in the intention according to the respective word slot types, and determine a vertical domain to which the intention belongs;

the information generating module 390 is configured to generate response information for the text query information according to the vertical domain.

In an alternative implementation of the present invention, the intent classification module 320 may be further configured to:

In an optional implementation of the present invention, the first obtaining module 360 may be further configured to:

determining a question-chasing sentence pattern according to the corresponding intention of the text query information and the preset word slot type lacking in the text query information;

and determining a word slot corresponding to the missing preset word slot type according to the voice information fed back by the user as a first supplementary word slot.

In an optional implementation of the present invention, the preset word slot information includes general word slot information and user private information; the second acquisition module 370 may be further configured to:

the step of obtaining a second supplementary word slot corresponding to the optional word slot type from the preset word slot information includes:

if the general word slot information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

if the user private information contains a second supplementary word slot corresponding to the optional word slot type, acquiring the second supplementary word slot;

In an optional implementation of the present invention, the information generating module 390 may be further configured to:

Wherein the response message comprises text message and/or voice message.

In an alternative implementation of the present invention, there is further provided a computer-readable storage medium, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded by a processor and executes the method for multi-turn dialogue voice interaction according to any one of the embodiments.

In an alternative implementation of the present invention, there is further provided an electronic device, including a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for multi-turn conversational speech interaction as described in any one of the embodiments above.

It can be clearly understood by those skilled in the art that the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiments, and for the sake of brevity, no further description is provided herein.

Those of ordinary skill in the art will understand that the above-described method, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions that, when executed, cause a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, and other various media capable of storing program code.

Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

Claims

1. A method of multi-turn conversational voice interaction, comprising:

if yes, determining the word slot attribute of the default word slot type;

2. The method of claim 1, wherein the text query information is subjected to intent classification, and determining the intent corresponding to the text query information comprises:

3. The method of claim 1, wherein generating an additional query instruction to determine a first additional word slot corresponding to the necessary word slot type according to the voice information fed back by the user comprises:

4. The method of claim 1, wherein the preset word slot information includes general word slot information and user private information;

5. The method of claim 1, wherein the generating response information for the text query information according to the vertical domain comprises:

6. The method according to any of claims 1-5, wherein the response message comprises a text message and/or a voice message.

7. A system for multiple rounds of conversational voice interaction, comprising:

8. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded by a processor and which performs a method of multi-turn conversational speech interaction according to any one of claims 1-6.

9. An electronic device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of multi-turn conversational speech interaction of any one of claims 1-6.