CN111428483B - Voice interaction method and device and terminal equipment - Google Patents


Info

Publication number
CN111428483B
Authority
CN
China
Prior art keywords
entity information
statement
target
user
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010244784.8A
Other languages
Chinese (zh)
Other versions
CN111428483A (en)
Inventor
刘杰
张晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010244784.8A
Publication of CN111428483A
Priority to PCT/CN2021/079479 (published as WO2021196981A1)
Application granted
Publication of CN111428483B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application is applicable to the technical field of artificial intelligence, and provides a voice interaction method, a voice interaction device and terminal equipment, wherein the method comprises the following steps: when a user statement to be replied is received, acquiring historical dialogue data; identifying target entity information in the user statement and identifying historical entity information in the historical dialogue data; extracting key entity information associated with the user statement from the historical entity information; generating a target interactive statement according to the target entity information and the key entity information; and outputting a reply sentence corresponding to the target interactive sentence. By adopting the method, the accuracy of dialog state tracking and user intention identification can be improved, the natural language processing capability of the dialog system can be strengthened, and the reasonableness of the replies given by the dialog system in a multi-turn dialog process can be enhanced.

Description

Voice interaction method and device and terminal equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a voice interaction method, a voice interaction device and terminal equipment.
Background
Natural Language Processing (NLP) is an important component of Artificial Intelligence (AI), and its typical application scenarios include task-oriented dialog systems and machine translation. In a multi-turn dialog scenario based on natural language dialogs, how to track and determine the intent of a user is a crucial step. In the process of Dialog State Tracking (DST), a model needs to be dynamically adjusted in combination with the historical dialog to extract the key information implied in a user statement, so as to determine the user intention and complete a corresponding response in combination with the dialog system.
In the prior art, machine-learning-based DST methods require a model that understands the content of multiple rounds of conversation well, which places extremely high demands on the model and greatly limits the precision of such methods. However, due to the high abstraction of natural language and the complexity of multiple rounds of dialog, current machine learning techniques find it difficult, in practical application scenarios, to understand multiple rounds of dialog completely and accurately, that is, to accurately track the state of multiple rounds of dialog and determine the user intent.
Disclosure of Invention
The embodiments of the application provide a voice interaction method, a voice interaction device and terminal equipment, which can solve the prior-art problems that tracking the state of multiple rounds of conversation is difficult and the intention of a user cannot be accurately determined.
In a first aspect, an embodiment of the present application provides a voice interaction method, including:
when a user statement to be replied is received, historical dialogue data is acquired, and target entity information in the user statement and historical entity information in the historical dialogue data are identified through a named entity recognition model. Then, key entity information associated with the user statement is extracted from the historical entity information, so that the current user statement can be rewritten according to the target entity information and the key entity information to generate a target interactive statement. By outputting the reply sentence corresponding to the target interactive sentence, the interaction requirement of the user can be met. In this embodiment, the entity information in the historical conversation turns is acquired, and the statement of the current conversation turn is rewritten in combination with the entity information of the current turn, so that the dialog state tracking problem in a multi-turn conversation is converted into a single-turn conversation problem. The user intention can then be replied to using existing, mature single-turn conversation technology, which improves the accuracy of user intention identification and the language processing capability of the dialog system.
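For illustration only, the overall flow described above can be sketched as follows. This is a minimal, self-contained toy sketch in Python; the lexicon, slot types and template used here are illustrative assumptions, not part of the claimed method, which uses a named entity recognition model, knowledge bases and a pointer-generator network instead.

    # A minimal sketch of the claimed flow. The toy lexicon and template
    # below are assumptions; a real implementation would use an NER model,
    # knowledge bases and a pointer-generator network as described above.

    ENTITY_LEXICON = {"Friday": "date", "temperature": "metric", "Beijing": "city"}

    def recognize_entities(sentence):
        """Toy NER: look up known entity words in the sentence."""
        return {w: t for w, t in ENTITY_LEXICON.items() if w in sentence}

    def extract_key_entities(target, history):
        """Toy screening: keep historical entities whose slot type is still
        missing from the current sentence (a stand-in for the probability-
        based screening described above)."""
        return {w: t for w, t in history.items() if t not in target.values()}

    def rewrite(target, key):
        """Toy rewrite: merge current and key entities into one full query."""
        slots = {t: w for w, t in {**key, **target}.items()}
        return "What is the %s in %s on %s?" % (
            slots["metric"], slots["city"], slots["date"])

    history_sentence = "What is the temperature on Friday?"
    current_sentence = "Beijing"
    target = recognize_entities(current_sentence)
    key = extract_key_entities(target, recognize_entities(history_sentence))
    print(rewrite(target, key))  # -> What is the temperature in Beijing on Friday?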
In a possible implementation manner of the first aspect, when the key entity information associated with the user statement is extracted from the historical entity information, a candidate user intention matched with the user statement may first be preliminarily determined according to the target entity information and the historical entity information; the distribution probability of each piece of historical entity information in the historical dialogue data is then calculated, so that the key entity information can be extracted from the historical entity information according to the distribution probabilities and the candidate user intention.
In a possible implementation manner of the first aspect, the distribution probability of each piece of historical entity information in the historical dialogue data may be calculated by calling a preset pointer-generator network model. The pointer-generator network model includes a coding module, which may be used to encode each piece of historical entity information separately to obtain the distribution probability corresponding to it.
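As a rough illustration of what such an encoding step produces, the sketch below scores each historical entity against a current-sentence representation and normalizes the scores with a softmax. The vectors are toy stand-ins for what a trained encoder would output; this is not the patented model itself.

    import math

    def softmax(scores):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Toy embeddings (assumed): one vector for the current sentence and one
    # per historical entity, as a real encoder would produce.
    query_vec = [0.9, 0.1, 0.4]
    entity_vecs = {
        "Friday":      [0.8, 0.2, 0.5],
        "temperature": [0.9, 0.0, 0.3],
        "restaurant":  [0.1, 0.9, 0.1],  # redundant entity, should score low
    }

    scores = [dot(query_vec, v) for v in entity_vecs.values()]
    probs = dict(zip(entity_vecs, softmax(scores)))
    print(probs)  # most probability mass on "Friday" and "temperature"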
In a possible implementation manner of the first aspect, when the key entity information is extracted, candidate entity information associated with any candidate user intention may first be extracted from the historical entity information; candidate entity information whose distribution probability is greater than a preset probability threshold is then extracted as the key entity information.
In a possible implementation manner of the first aspect, for candidate entity information whose target probability value is smaller than the preset probability threshold but differs from it by less than a preset difference value, an inquiry statement may be generated according to that candidate entity information and the key entity information, inviting the user to confirm the candidate entity information corresponding to the target probability value. If confirmation information of the user for the inquiry statement is received, the candidate entity information corresponding to the target probability value can be determined as key entity information. Here, the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.
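A minimal sketch of this three-way threshold rule follows; the threshold and margin values are illustrative assumptions, since the application only requires a preset probability threshold and a preset difference value.

    THRESHOLD = 0.8  # preset probability threshold (assumed value)
    MARGIN = 0.1     # preset difference value (assumed value)

    def classify(prob):
        if prob > THRESHOLD:
            return "key"      # use directly as key entity information
        if THRESHOLD - prob < MARGIN:
            return "ask"      # generate an inquiry statement for the user
        return "discard"      # treat as a redundant entity

    for entity, p in [("temperature", 0.86), ("Friday", 0.72), ("restaurant", 0.10)]:
        print(entity, classify(p))
    # temperature key / Friday ask / restaurant discard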
In a possible implementation manner of the first aspect, when the target interactive statement is generated according to the target entity information and the key entity information, the target basic statement may be determined first, and then the target basic statement is rewritten by using the target entity information and the key entity information to obtain the target interactive statement, so that difficulty in directly generating the target interactive statement is reduced.
In a possible implementation manner of the first aspect, a plurality of basic sentences may be obtained from the user sentence containing the target entity information and the historical dialogue data containing the key entity information; the matching degree between each basic sentence and the entity information to be evaluated is then calculated, and the basic sentence with the maximum matching degree is identified as the current target basic sentence, where the entity information to be evaluated includes all of the target entity information and the key entity information.
In a possible implementation manner of the first aspect, any basic statement may include a plurality of semantic slots, and the matching degree between the entity information to be evaluated and a basic statement may be determined based on the ratio between the number of key slots and the number of semantic slots in that basic statement. Therefore, for any basic statement, the number of semantic slots in the statement and the number of pieces of entity information to be evaluated can be counted first; the number of key slots in the basic statement that match the respective pieces of entity information to be evaluated is then determined; after the ratio between the number of key slots and the number of semantic slots in the basic sentence is calculated, this ratio can be used as the matching degree between the entity information to be evaluated and the basic sentence.
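For illustration, the ratio-based matching degree can be sketched as follows. The slot annotations attached to the candidate sentences are assumptions made for the example, not output of the claimed method.

    def matching_degree(slots, entities_to_evaluate):
        """Ratio of key slots (slots matched by some entity to be evaluated)
        to all semantic slots in the basic sentence."""
        if not slots:
            return 0.0
        key_slots = sum(1 for s in slots if s in entities_to_evaluate)
        return key_slots / len(slots)

    candidates = {
        "What is the temperature on Friday?":     ["metric", "date"],
        "Please tell me the nearest restaurant.": ["proximity", "venue"],
    }
    to_evaluate = {"metric", "date", "city"}  # target + key entity slot types

    best = max(candidates, key=lambda s: matching_degree(candidates[s], to_evaluate))
    print(best)  # the weather sentence wins: 2/2 slots matched versus 0/2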
In a possible implementation manner of the first aspect, the pointer-generator network model may further include a decoding module, which is obtained by training on a plurality of types of training data, where the plurality of types of training data include a plurality of pieces of entity information and a basic statement corresponding to each piece of entity information. The target interactive statement can therefore be generated by rewriting the target basic statement with the decoding module. Specifically, if the target basic statement is the current user statement, the decoding module may be adopted to decode the key entity information and the target basic statement and output the target interactive statement; if the target basic statement is a user statement in the historical dialogue data, the decoding module may be adopted to decode the target entity information, the key entity information and the target basic statement, and output the target interactive statement.
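The input-selection rule for this decoding step can be sketched as follows; the actual decoder is the pointer-generator network's decoding module, so this snippet only shows which inputs would be fed to it in each case.

    def decoder_inputs(base, current, target_entities, key_entities):
        if base == current:
            # The base sentence already carries the target entities.
            return key_entities + [base]
        # The base sentence comes from history: add current-round entities too.
        return target_entities + key_entities + [base]

    print(decoder_inputs("What is the temperature on Friday?", "Beijing",
                         ["Beijing"], ["Friday", "temperature"]))
    # -> ['Beijing', 'Friday', 'temperature', 'What is the temperature on Friday?']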
In a possible implementation manner of the first aspect, after the target interactive statement is obtained, whether the rewritten target interactive statement is correct may be further verified. This embodiment provides a double-layer verification mechanism: a plurality of pieces of entity information in the target interactive statement are extracted, and whether they match the semantic slots of a target user intention in a preset knowledge base is verified, where the target user intention is any one of the candidate user intentions. If the pieces of entity information in the target interactive statement match the semantic slots of the target user intention, the generated target interactive statement can be judged to be correct, and the step of outputting a reply statement corresponding to the target interactive statement is executed. If the entity information in the target interactive statement does not match the semantic slots of the target user intention, the target interactive statement can be verified a second time according to its statement type. In the second verification, a preset natural language understanding model may be called to judge whether the target interactive statement is a task-type statement. If it is a task-type statement, a corresponding reply statement can be output for it; if not, the user is prompted to re-input the user statement and restate the user intention, and the target interactive statement is generated again according to the re-input user statement.
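A toy sketch of this double-layer verification is given below. The knowledge base entry and the question-mark test are illustrative stand-ins for the preset knowledge base and the preset natural language understanding model.

    KNOWLEDGE_BASE = {"query_weather": {"metric", "date", "city"}}  # assumed entry

    def entities_match_intent(entities, intent):
        return entities <= KNOWLEDGE_BASE[intent]        # first-layer check

    def looks_like_task_sentence(sentence):
        return sentence.rstrip().endswith("?")           # crude NLU stand-in

    def verify(sentence, entities, intent):
        if entities_match_intent(entities, intent):
            return "reply"                               # first layer passed
        if looks_like_task_sentence(sentence):
            return "reply"                               # second layer passed
        return "re-ask user"                             # prompt re-input

    print(verify("What is the temperature in Beijing on Friday?",
                 {"metric", "date", "city"}, "query_weather"))  # -> reply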
In a second aspect, an embodiment of the present application provides a voice interaction apparatus, including:
the historical dialogue data acquisition module is used for acquiring historical dialogue data when receiving a user statement to be replied;
the target entity information identification module is used for identifying target entity information in the user statement; and
the historical entity information identification module is used for identifying historical entity information in the historical dialogue data;
a key entity information extraction module, configured to extract key entity information associated with the user statement from the historical entity information;
the target interactive statement generating module is used for generating a target interactive statement according to the target entity information and the key entity information;
and the reply statement output module is used for outputting a reply statement corresponding to the target interactive statement.
In a possible implementation manner of the second aspect, the key entity information extraction module may specifically include the following sub-modules:
a candidate user intention determining submodule, configured to determine a candidate user intention matched with the user statement according to the target entity information and the historical entity information;
the distribution probability calculation submodule is used for calculating the distribution probability of each historical entity information in the historical dialogue data;
And the key entity information extraction submodule is used for extracting key entity information from the historical entity information according to the distribution probability and the candidate user intention.
In a possible implementation manner of the second aspect, the distribution probability calculation sub-module may specifically include the following units:
the first pointer generation network model calling unit is used for calling a preset pointer generation network model, and a coding module adopting the pointer generation network model respectively codes each historical entity information to obtain the distribution probability corresponding to each historical entity information.
In a possible implementation manner of the second aspect, the key entity information extraction sub-module may specifically include the following units:
a candidate entity information extracting unit for extracting candidate entity information associated with any candidate user intention from the historical entity information;
and the key entity information extraction unit is used for extracting candidate entity information of which the probability value of the distribution probability is greater than a preset probability threshold value as the key entity information.
In a possible implementation manner of the second aspect, the key entity information extraction sub-module may further include the following units:
an inquiry statement generating unit, configured to generate an inquiry statement according to the candidate entity information corresponding to the target probability value and the key entity information if the difference between the target probability value and the preset probability threshold is smaller than a preset difference value and the target probability value is smaller than the preset probability threshold, so as to invite the user to confirm the candidate entity information corresponding to the target probability value;
and the key entity information determining unit is used for determining candidate entity information corresponding to the target probability value as the key entity information when receiving confirmation information of a user for the inquiry statement, wherein the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.
In a possible implementation manner of the second aspect, the target interactive statement generating module may specifically include the following sub-modules:
the target basic statement determining submodule is used for determining a target basic statement;
and the target interactive statement generation submodule is used for adopting the target entity information and the key entity information to rewrite the target basic statement and generate the target interactive statement.
In a possible implementation manner of the second aspect, the target base statement determination sub-module may specifically include the following units:
A basic statement acquisition unit, configured to acquire a plurality of basic statements from a user statement that includes the target entity information and historical dialogue data that includes the key entity information;
the matching degree calculation unit is used for calculating the matching degrees between the basic sentences and the entity information to be evaluated respectively, wherein the entity information to be evaluated comprises the target entity information and the key entity information;
and the target basic statement identification unit is used for identifying the basic statement corresponding to the maximum matching degree as the current target basic statement.
In a possible implementation manner of the second aspect, any base statement may include a plurality of semantic slots, and the matching degree calculation unit may specifically include the following sub-units:
the statistic subunit is used for counting the number of semantic slots in any basic statement and the number of the entity information to be evaluated;
the determining submodule is used for determining the number of key slot positions respectively matched with the entity information to be evaluated in the basic statement;
and the calculating subunit is used for calculating a ratio of the number of the key slots to the number of the semantic slots in the basic statement, and taking the ratio as the matching degree between the entity information to be evaluated and the basic statement.
In a possible implementation manner of the second aspect, the pointer-generator network model may further include a decoding module, where the decoding module is obtained by training on a plurality of types of training data, the plurality of types of training data including a plurality of pieces of entity information and a basic statement corresponding to each piece of entity information; the target interactive statement generation submodule may specifically include the following units:
a second pointer generation network model calling unit, configured to decode, by using the decoding module, the key entity information and the target basic statement and output a target interactive statement if the target basic statement is the user statement; and if the target basic statement is the historical dialogue data, decoding the target entity information and the target basic statement by adopting the decoding module, and outputting a target interactive statement.
In a possible implementation manner of the second aspect, the target interactive statement generation sub-module may further include the following units:
the target interactive statement entity information extraction unit is used for extracting a plurality of entity information in the target interactive statement;
a target interactive statement verification unit, configured to verify whether a plurality of pieces of entity information in the target interactive statement match semantic slots of a target user intention in a preset knowledge base, where the target user intention is any one of the candidate user intentions; if the plurality of entity information in the target interactive statement matches the semantic slot of the target user intention, judging that the generated target interactive statement is correct, and outputting a reply statement corresponding to the target interactive statement; and if the plurality of entity information in the target interactive statement do not match the semantic slot of the target user intention, verifying the target interactive statement according to the statement type of the target interactive statement.
In a possible implementation manner of the second aspect, the target interactive statement verification unit is further configured to call a preset natural language understanding model to determine whether the target interactive statement is a task-based statement, and if the target interactive statement is a task-based statement, output a reply statement corresponding to the target interactive statement; and if the target interactive statement is not a task statement, prompting the user to re-input the user statement, and re-generating the target interactive statement according to the re-input user statement.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the voice interaction method according to any one of the above first aspects when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor of a terminal device, implements the voice interaction method according to any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the voice interaction method described in any one of the above first aspects.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
according to the embodiment of the application, the target entity information in the current conversation turn is identified, the key entity information is extracted from the historical conversation data, the actual intention of the user can be determined according to the two entity information, the user statement of the current turn is rewritten according to the intention, and the target interactive statement is generated, so that the application programs such as a voice assistant in the terminal equipment can reply according to the target interactive statement. The embodiment converts the DST problem in the multi-turn conversation into the single-turn conversation problem to a certain extent, can utilize the existing mature single-turn conversation technology to reply the user intention, improves the accuracy of conversation state tracking and user intention identification, improves the natural language processing capacity of the conversation system, enhances the rationality of the conversation system reply in the multi-turn conversation process, enables the system reply to more match the actual demand of the user, and reduces the interaction times between the user and the conversation system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram of the operation of a prior art knowledge-base inference based multi-turn dialog state tracking scheme;
FIG. 2 is a schematic diagram of a prior art learning model based multi-turn dialog state tracking scheme;
FIG. 3 is a schematic diagram illustrating an operation process of a voice interaction method according to an embodiment of the present application;
fig. 4 is a schematic application scenario diagram of a voice interaction method according to an embodiment of the present application;
fig. 5 is a schematic hardware structure diagram of a mobile phone to which a voice interaction method according to an embodiment of the present application is applied;
fig. 6 is a schematic software structure diagram of a mobile phone to which a voice interaction method according to an embodiment of the present application is applied;
FIG. 7 is a flow chart illustrating exemplary steps of a voice interaction method provided by an embodiment of the present application;
FIG. 8 is a flow chart of illustrative steps of a method of voice interaction as provided by another embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a process of calculating a distribution probability of entity information according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an operation process of a voice interaction method according to another embodiment of the present application;
FIG. 11 is a block diagram of a voice interaction apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
For ease of understanding, several exemplary prior-art multi-turn dialog state tracking schemes are first described.
Fig. 1 is a schematic diagram illustrating an operation process of a multi-turn dialog state tracking scheme in the prior art. The scheme is a scheme based on Question and Answer (QA) knowledge base reasoning, and the specific operation process is as follows:
First, taking the multiple dialog rounds so far and the current input as the input of the current dialog state, the keywords of the multiple dialog rounds are determined, and the knowledge base is then searched with these keywords according to predefined rules; this step is shown as block 101 in FIG. 1.
Then, upon retrieval, a corresponding set of candidate multi-turn conversations may be obtained, as shown in block 102 of FIG. 1.
Finally, according to predefined similarity calculation rules, the similarity between the current conversation and each candidate conversation is calculated. The specific strategy is as follows: the semantic similarity between the current input and the candidate question is calculated as the first similarity; the semantic similarity between the context of the current input and the context of each candidate question is calculated as the second similarity; and the similarity between the summary information of the current multi-turn conversation and that of each candidate multi-turn conversation is calculated as the third similarity. The three similarities are weighted and summed to obtain the similarity between each candidate question and the current input, and the reply corresponding to the candidate question with the maximum similarity is taken as the output reply; this step is shown as block 103 in FIG. 1.
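In other words, this prior-art scheme computes a weighted sum of the three similarities; a one-line sketch with assumed weights:

    def overall_similarity(s1, s2, s3, w=(0.5, 0.3, 0.2)):
        # s1: current input vs. candidate question
        # s2: current input's context vs. candidate question's context
        # s3: multi-turn conversation summaries
        return w[0] * s1 + w[1] * s2 + w[2] * s3  # weights are assumed values

    print(overall_similarity(0.9, 0.6, 0.7))  # -> 0.77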
In the scheme shown in FIG. 1, key information extraction over the multiple rounds of dialog makes no distinction between primary and secondary information; that is, the key information related to the current round of input is not specifically extracted, and the extracted redundant information affects the accuracy of dialog state tracking. Secondly, because the accuracy of dialog state tracking depends to a great extent on the coverage of the knowledge base, and considering the complexity of natural language dialogs in real scenes, an ideal knowledge base with wide coverage is in fact difficult to obtain. Thirdly, the way the state tracking result is obtained in this scheme depends on various predefined rules, which also greatly limits the generalization capability and robustness of the model.
Fig. 2 is a schematic diagram showing the operation process of another multi-turn dialog state tracking scheme in the prior art. The scheme is a DST scheme based on a learning model: the state information of each round of conversation is tracked in sequence, and the state of the current round of conversation is updated through a copy-flow mechanism, thereby realizing tracking of the long-term conversation state. The specific operation process includes steps S201 to S204:
firstly, extracting key information in a current round of conversation and a previous round of conversation through a semi-supervised neural network model, and generating a keyword sequence corresponding to the two rounds of sentences.
The dialog state information is then explicitly represented as a lexical sequence using a novel encoder-decoder network based on a copy-flow mechanism. The copy-flow mechanism transfers the information flow of the conversation history through replication, which ultimately participates in generating the target sentence of the current round's reply.
Finally, based on the state information of the current round of conversation obtained above, a decoder module automatically generates the target sentence of the current round's reply, thereby completing the response to the user's query.
In the scheme shown in FIG. 2, extracting the key information of the historical dialog based on a semi-supervised neural network model may cause key information to be lost or extracted by mistake, which affects the understanding of the historical dialog. Secondly, tracking historical conversations round by round and updating the conversation state easily leads to high time complexity of the model and to error accumulation. Thirdly, the scheme depends too much on the model's ability to comprehend historical conversations, and current encoder-decoder networks have difficulty meeting the requirements of actual scenes with high precision.
In order to solve the above problems, the core concept of the embodiments of the present application is, during multi-turn dialog state tracking in a dialog system, to rewrite the current turn's sentence based on the key information in the historical dialog and to complement the information omitted in the current turn, thereby converting the multi-turn dialog problem into a single-turn dialog problem. The voice interaction method tracks the user intention based on the key information of the historical conversation, overcomes the defects of the prior-art schemes to a certain extent, and improves the state tracking precision in multi-turn conversations and the response accuracy of the dialog system.
In the voice interaction method provided in the embodiments of the present application, after the entity information in the historical dialog is extracted by a Named Entity Recognition (NER) module, the attention distribution over the entity information in the historical dialog and the entities in the current-round dialog is calculated using Knowledge Bases (KBs) and a Pointer-Generator Network (PGN) model. By screening the entity information in the historical dialog, redundant entities are discarded and the key information participating in current-round dialog state tracking is determined. This processing not only reduces the influence of redundant information on dialog state tracking, but also provides effective key information for the subsequent steps.
The contribution of each piece of key entity information to the dialog state (expressed as a distribution probability) is then computed in conjunction with a knowledge base and a supervised feed-forward neural network, and the feed-forward neural network is used as part of the overall model to determine whether that information directly affects the dialog state. This avoids tracking the dialog state round by round and reduces the accumulation of errors. After the current round's sentence is rewritten based on the decoding step of the PGN model, a mature single-turn dialog module is used to generate the reply sentences of the multi-turn dialog, improving the response accuracy of the dialog system.
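As a rough illustration of the feed-forward scoring step, the sketch below pushes an assumed per-entity feature vector through a tiny two-layer network; the weights are random stand-ins, since a real model would be trained on labelled dialog data.

    import math

    def relu(x):
        return max(0.0, x)

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def feed_forward(features, w1, b1, w2, b2):
        hidden = [relu(sum(f * w for f, w in zip(features, row)) + b)
                  for row, b in zip(w1, b1)]
        return sigmoid(sum(h * w for h, w in zip(hidden, w2)) + b2)

    # Assumed feature vector for one entity: [attention prob, KB slot match, recency].
    features = [0.86, 1.0, 0.5]
    w1 = [[1.2, 0.8, 0.3], [0.5, 1.0, 0.7]]
    b1 = [0.0, -0.2]
    w2 = [1.0, 0.6]
    b2 = -1.0
    print(feed_forward(features, w1, b1, w2, b2))  # contribution score in (0, 1)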
Fig. 3 is a schematic diagram illustrating an operation process of the voice interaction method according to an embodiment of the present application. According to the operation process shown in fig. 3, the method firstly extracts key information directly related to the dialogue state in the historical dialogue, and then rewrites the current round of statements by combining the model to complete the tracking and fusion of the dialogue state information. Then, the existing single-round dialogue processing module is utilized to complete the corresponding reply of the inquiry of the user in the multi-round dialogue on the basis of rewriting the sentence.
Based on the operation process, the operation process of the method can be realized by a plurality of modules as follows:
1. entity extraction module
The method is used for extracting entity information in the historical conversation as candidate key information.
2. Screening entity module
Screens out key information related to the dialog state of the current round by combining the attention distribution over entity information, calculated from the KBs and the PGN model, with the entities in the current-round sentence.
3. Key information distribution prediction module
Using the KBs and the key information, a probability distribution of the key information is calculated based on the supervised feed forward network.
4. Statement rewriting module
Based on the probability distribution of the key information, the decoder stage of the PGN model is used to complement or rewrite the current round's sentence.
5. Reply generation module
Generates the reply sentences of the multi-turn dialog by using existing, mature single-turn dialog processing technology.
Fig. 4 is a schematic view of an application scenario of the voice interaction method according to an embodiment of the present application. In the exemplary application scenario shown in FIG. 4, in the first round of dialog, the user's query statement is: "Please tell me the nearest restaurant." For this query, the dialog system replies: "The nearest restaurant is the Haidilao on Nongda South Road in Haidian District." Immediately after, in the second round of dialog, the user replies: "Good." and continues to ask: "What is the temperature on Friday?" At this point, the user's query sentence does not include any words related to location, and the dialog system asks the user which place's temperature needs to be queried, i.e., "May I ask which city's Friday temperature you want to query?" The user replies: "Beijing." This round can be regarded as the current dialog, and the previous two rounds are historical dialogs.
For the historical dialogs, the entities in them, including "nearest", "restaurant", "Nongda South Road in Haidian District", "Haidilao", "Friday" and "temperature", may be extracted by the entity extraction module. On this basis, the screening entity module, in combination with the predefined KBs and the PGN model, calculates the probability distribution of the entities in the historical conversation to obtain the key entities related to the dialog state, namely "Friday" and "temperature"; the other entities in the historical conversation, such as "nearest", "restaurant", "Nongda South Road in Haidian District" and "Haidilao", are redundant entities. The dialog system may determine the basic rewrite statement according to the obtained key entities, i.e., "What is the temperature on Friday?"
Specifically, the probability distribution of the key information related to the current dialog state can be further predicted by the key information distribution prediction module based on the feed-forward network, giving a probability of 0.86 for "temperature" and 0.72 for "Friday". For entities with a probability value near but below the threshold of 0.8, the dialog system may invite the user to participate in confirmation; that is, the dialog system in FIG. 4 may ask the user: "Is the 'Friday' temperature you want to query the temperature in Beijing?" Then, using the sentence rewriting module, the rewritten sentence "What is the temperature in Beijing on Friday?" is generated.
For the rewritten user statement, the dialog system may generate a reply by using the reply generation module, i.e., the reply of the dialog system in FIG. 4: "The temperature in Beijing on Friday is …"
It should be noted that the basic rewrite statement may come from the historical dialog data or from the current dialog. That is, the basic sentence may be selected from a user sentence in the historical dialog, or be the current user sentence. The criterion can be the amount of target entity information and key entity information contained in the statement. For example, in the above example, the historical dialog sentence "What is the temperature on Friday?" includes the two pieces of key entity information "Friday" and "temperature", while the current dialog "Beijing" includes only one piece of target entity information, so "What is the temperature on Friday?" from the historical dialog can be selected as the basic rewrite statement.
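This selection criterion can be sketched by simply counting how many of the target and key entities each candidate sentence contains (an illustrative simplification of the slot-ratio computation described earlier):

    def contained_count(sentence, entities):
        return sum(1 for e in entities if e in sentence)

    entities = ["Friday", "temperature", "Beijing"]
    candidates = ["What is the temperature on Friday?", "Beijing"]
    base = max(candidates, key=lambda s: contained_count(s, entities))
    print(base)  # -> "What is the temperature on Friday?" (contains 2 entities vs. 1)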
The following describes the voice interaction method of the present application in detail with reference to specific embodiments.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of the associated objects, indicating that three relationships may exist; for example, "A and/or B" may represent: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The voice interaction method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA) and the like, and the embodiment of the application does not limit the specific type of the terminal device at all.
Take a terminal device as a mobile phone as an example. Fig. 5 is a block diagram illustrating a partial structure of a mobile phone according to an embodiment of the present disclosure. Referring to fig. 5, the handset includes: radio Frequency (RF) circuitry 510, memory 520, input unit 530, display unit 540, sensor 550, audio circuitry 560, wireless fidelity (Wi-Fi) module 570, processor 580, and power supply 590. Those skilled in the art will appreciate that the handset configuration shown in fig. 5 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 5:
RF circuit 510 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, it delivers the information to processor 580 for processing, and it transmits uplink data to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 520 may be used to store software programs and modules, and the processor 580 executes various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, etc. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone 500. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect touch operations of a user on or near it (for example, operations performed by the user on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 580, and can also receive and execute commands sent by the processor 580. The touch panel 531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. In addition to the touch panel 531, the input unit 530 may include other input devices 532. In particular, other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 540 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 540 may include a display panel 541, and optionally, the display panel 541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the operation is transmitted to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although the touch panel 531 and the display panel 541 are shown as two separate components in FIG. 5 to implement the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.
Cell phone 500 can also include at least one sensor 550, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuit 560, speaker 561, and microphone 562 may provide an audio interface between the user and the mobile phone. The audio circuit 560 may transmit the electrical signal converted from received audio data to the speaker 561, which converts it into a sound signal for output; conversely, the microphone 562 converts collected sound signals into electrical signals, which are received by the audio circuit 560 and converted into audio data. The audio data is then output to processor 580 for processing and subsequently sent via the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
Wi-Fi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the Wi-Fi module 570, and provides wireless broadband internet access for the user. Although fig. 5 shows the Wi-Fi module 570, it is understood that it does not belong to the essential constitution of the cellular phone 500 and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 580 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 520 and calling data stored in the memory 520, thereby performing overall monitoring of the mobile phone. Alternatively, processor 580 may include one or more processing units; preferably, the processor 580 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 580.
Handset 500 also includes a power supply 590 (e.g., a battery) for powering the various components, which may preferably be logically coupled to processor 580 via a power management system that may be used to manage charging, discharging, and power consumption.
Although not shown, handset 500 may also include a camera. Optionally, the position of the camera on the mobile phone 500 may be front-located or rear-located, which is not limited in this embodiment of the application.
Optionally, the mobile phone 500 may include a single camera, a dual camera, or a triple camera, which is not limited in this embodiment.
For example, the cell phone 500 may include three cameras, one being a main camera, one being a wide camera, and one being a tele camera.
Optionally, when the mobile phone 500 includes a plurality of cameras, the plurality of cameras may be all front-mounted, all rear-mounted, or a part of the cameras front-mounted and another part of the cameras rear-mounted, which is not limited in this embodiment of the present application.
In addition, although not shown, the mobile phone 500 may further include a bluetooth module, etc., which will not be described herein.
Fig. 6 is a schematic diagram of a software structure of a mobile phone 500 according to an embodiment of the present application. Taking the operating system of the mobile phone 500 as an Android system as an example, in some embodiments, the Android system is divided into four layers, which are an application layer, an application Framework (FWK) layer, a system layer and a hardware abstraction layer, and the layers communicate with each other through a software interface.
As shown in fig. 6, the application layer may include a series of application packages, which may include short message, calendar, camera, video, navigation, gallery, call, and other applications.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.
As shown in fig. 6, the application framework layer may include a window manager, a resource manager, and a notification manager, among others.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, capture the screen, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar; it can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify of download completion, message alerts, and the like. The notification manager may also present notifications that appear in the top status bar of the system in the form of a chart or scroll-bar text, such as notifications of background running applications, or notifications that appear on the screen in the form of a dialog window, for example prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, or flashing an indicator light.
The application framework layer may further include:
A view system, which includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may be composed of one or more views. For example, a display interface including a short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the handset 500. Such as management of call status (including on, off, etc.).
The system layer may include a plurality of functional modules. For example: a sensor service module, a physical state identification module, a three-dimensional graphics processing library (such as OpenGL ES), and the like.
The sensor service module is used for monitoring sensor data uploaded by various sensors in a hardware layer and determining the physical state of the mobile phone 500;
the physical state recognition module is used for analyzing and recognizing user gestures, human faces and the like;
the three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The system layer may further include:
the surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.
The hardware abstraction layer is a layer between hardware and software. The hardware abstraction layer may include a display driver, a camera driver, a sensor driver, etc. for driving the relevant hardware of the hardware layer, such as a display screen, a camera, a sensor, etc.
The following embodiments may be implemented on the handset 500 having the above-described hardware/software architecture. The following embodiment will take the mobile phone 500 as an example to explain the voice interaction method provided in the embodiment of the present application.
Referring to fig. 7, a schematic step flow chart of a voice interaction method provided in an embodiment of the present application is shown, and by way of example and not limitation, the method may be applied to the mobile phone 500, and the method may specifically include the following steps:
S701, acquiring historical dialogue data when a user statement to be replied is received;
In the embodiment of the present application, the user sentence may be a sentence spoken directly by the user when using an application such as a voice assistant in the terminal device. For example, if a user wishes to query tomorrow's weather, the user may wake up the voice assistant in the mobile phone and speak "how is the weather tomorrow" or a similar sentence.
In general, a user may, through multiple rounds of dialog with a voice assistant, prompt the voice assistant to fully and accurately understand the user's intent and return information that satisfies the intent. The user sentence to be replied in this embodiment may be a sentence or a word spoken by the user in a round other than the first round of dialog; that is, the voice assistant has completed at least one round of dialog with the user before receiving the user sentence to be replied.
In the embodiment of the application, in order to better understand the user intention, after receiving the user statement, a program such as a voice assistant can acquire the dialog data of the several turns of dialog between the user and the voice assistant before the current turn, and determine the real intention of the user in the current turn of dialog in combination with the historical dialogue data.
In a specific implementation, the historical dialog data may be all dialog data after the user wakes up the voice assistant this time, or may also be dialog data in a previous specific turn, such as data of three previous turns of the dialog, which is not limited in this embodiment.
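As an illustrative sketch only (the class name and turn limit below are assumptions for illustration, not part of the claimed method), the history window described above might be maintained as follows:

```python
from collections import deque

class DialogHistory:
    """Minimal sketch: keep only the last N turns of the current conversation."""

    def __init__(self, max_turns=3):  # e.g. the three previous turns mentioned above
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_utterance, assistant_reply):
        self.turns.append({"user": user_utterance, "assistant": assistant_reply})

    def get_history(self):
        # Returned when a new user statement to be replied is received (step S701)
        return list(self.turns)

history = DialogHistory(max_turns=3)
history.add_turn("how is the weather tomorrow", "Tomorrow will be sunny.")
print(history.get_history())
```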
S702, identifying target entity information in the user statement and identifying historical entity information in the historical dialogue data;
An entity is a term often used in the information field to denote a conceptual thing. Generally, most entities can be represented by nouns, such as names of people, places, organizations, and the like; a small number of entities can also be represented by words of other parts of speech, such as adjectives.
In an embodiment of the application, entity information in user statements and historical dialogue data may be identified based on a named entity recognition (NER) model.
In a specific implementation, a received user sentence may first be segmented into words; whether each segmented word is an entity word is then determined one by one, and each entity word is labeled.
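A minimal sketch of this segment-and-label step is given below. A real implementation would use a trained NER model; the dictionary lookup and the entity types here are purely illustrative stand-ins.

```python
# Toy stand-in for the NER model: a lexicon lookup that segments a sentence
# and labels each entity word with an assumed entity type.
ENTITY_LEXICON = {
    "friday": "time",
    "beijing": "place",
    "temperature": "weather_condition",
}

def label_entities(sentence):
    labeled = []
    for word in sentence.lower().split():        # segment the sentence into words
        entity_type = ENTITY_LEXICON.get(word)   # judge whether the word is an entity word
        if entity_type is not None:
            labeled.append((word, entity_type))  # label the entity word
    return labeled

print(label_entities("what is the temperature in Beijing on Friday"))
# [('temperature', 'weather_condition'), ('beijing', 'place'), ('friday', 'time')]
```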
Of course, the entity information identified from the user sentence of the current round of conversation may be used as the historical entity information in the next round of conversation and later. Therefore, for the entity information in the historical dialogue data, after each statement in the historical dialogue data is obtained, the statement is segmented to find out the entity information; or, words that have been marked as entity information in the previous rounds may be directly extracted as historical entity information, which is not limited in this embodiment.
S703, extracting key entity information associated with the user statement from the historical entity information;
Since the user statement to be replied is the dialog statement of the current turn, and each piece of entity information it contains is basically closely related to the user intention, all the target entity information contained in the user statement can be retained. For the historical entity information, it is necessary to distinguish which is useful information for the current turn of conversation and which is redundant information.
Therefore, after extracting the historical entity information, key entity information associated with the dialog sentences of the current round can be screened out from the historical entity information, and the key entity information can be regarded as information with obvious benefits for identifying the intention of the user.
In the embodiment of the application, a plurality of user intentions can be set in the voice assistant according to different application scenarios, and a plurality of pieces of associated entity information can be configured for each user intention. After the target entity information in the user statement is identified, the intentions containing the target entity information can be selected, the other entity information those intentions may contain can be determined, and the key entity information can then be identified from the historical entity information accordingly. For example, for the intention of "weather forecast", a plurality of pieces of entity information such as "time", "place" and "weather condition" may be configured; if the target entity information is "weather condition", the entity information in the historical entity information that satisfies the "time" and "place" requirements may be identified as the key entity information.
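A sketch of this intent-based screening is shown below. The intent names and slot types are illustrative assumptions; the later embodiments additionally weigh distribution probabilities when screening.

```python
# Assumed intent/slot configuration: each user intention lists the entity
# types (semantic slots) associated with it.
INTENTS = {
    "weather_forecast": {"time", "place", "weather_condition"},
    "restaurant_search": {"place", "cuisine"},
}

def screen_key_entities(target_entities, historical_entities):
    """Keep historical entities whose type fills a slot of an intent that
    already contains one of the target entity types."""
    target_types = {etype for _, etype in target_entities}
    key = []
    for slots in INTENTS.values():
        if target_types & slots:  # this intent contains the target entity information
            for word, etype in historical_entities:
                if etype in slots and etype not in target_types:
                    key.append((word, etype))
    return key

target = [("temperature", "weather_condition")]
history = [("friday", "time"), ("beijing", "place"), ("haidilao", "restaurant")]
print(screen_key_entities(target, history))
# [('friday', 'time'), ('beijing', 'place')]
```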
Of course, according to different actual usage requirements, the key entity information may also be determined in other manners, which is not limited in this embodiment.
S704, generating a target interactive statement according to the target entity information and the key entity information;
In the embodiment of the application, after the target entity information in the user statement of the current turn and the key entity information in the historical dialogue statements are determined, the target interactive statement matching the actual intention of the user can be generated according to the two kinds of information.
For example, if the target entity information and the key entity information include the time information "Friday", the location information "Beijing", and the weather condition information "temperature", it may be recognized that what the user currently desires to query is the temperature in Beijing on Friday, and the corresponding target interactive statement may be "what is the temperature in Beijing on Friday" or another similar statement. The target interactive sentence is a sentence expressing the information that the user wants to query.
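Purely as an illustration of what the rewriting produces (the later embodiments generate the statement with a pointer generation network rather than a fixed template, so the template below is a hypothetical simplification):

```python
# Hypothetical template fill: express the combined target and key entity
# information as a target interactive statement.
def build_query(slots):
    return "what is the {weather_condition} in {place} on {time}".format(**slots)

slots = {"time": "Friday", "place": "Beijing", "weather_condition": "temperature"}
print(build_query(slots))  # what is the temperature in Beijing on Friday
```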
S705, outputting a reply sentence corresponding to the target interactive sentence.
The function of the voice assistant is to make it convenient for the user to query information by voice. Therefore, after the target interactive sentence matching the actual intention of the user is identified, the voice assistant can perform a search based on that sentence to find the corresponding reply sentence.
For example, for the interactive statement "what is the temperature in Beijing on Friday", the corresponding reply statement may be "the temperature in Beijing on Friday is 17 degrees Celsius". The reply sentence may be broadcast to the user by voice, may be displayed in the mobile phone interface as text, or may be sent to the user's mobile phone in another information format, which is not limited in this embodiment.
In the embodiment of the application, by identifying the target entity information in the current conversation turn and extracting the key entity information from the historical conversation data, the actual intention of the user can be determined according to the two kinds of entity information, and the user statement of the current turn is rewritten according to that intention to generate the target interactive statement, so that an application such as a voice assistant in the terminal device can reply according to the target interactive statement. In this way, the dialog state tracking (DST) problem in multi-turn dialog is converted, to a certain extent, into a single-turn dialog problem, so that existing mature single-turn dialog techniques can be used to reply to the user's intention. This improves the accuracy of dialog state tracking and user intention recognition, improves the natural language processing capability of the dialog system, enhances the reasonableness of the dialog system's replies in a multi-turn dialog process, makes the system replies better match the user's actual demand, and reduces the number of interactions between the user and the dialog system.
Referring to fig. 8, a flowchart illustrating schematic steps of a voice interaction method provided in another embodiment of the present application is shown, where the method specifically includes the following steps:
S801, acquiring historical dialogue data when a user statement to be replied is received;
It should be noted that the method may be applied to terminal devices such as a mobile phone and a tablet computer, and the specific type of the terminal device is not limited in this embodiment.
For ease of understanding, the terminal device is taken to be a mobile phone in the following description of this embodiment. That is, when a user uses an application such as a voice assistant in a mobile phone, the application identifies the entity information of the user in the current round and in each round before the current round, determines the corresponding user intention, rewrites the user sentence of the current round based on that intention, and outputs a reply sentence corresponding to the rewritten user sentence, so as to meet the actual demand of the user.
In the embodiment of the present application, the user sentence to be replied may refer to a certain sentence that is directly spoken by the user in the process of interacting with the voice assistant, and the sentence may be a sentence capable of completely expressing a certain user intention, or may be one or more words.
When the voice assistant receives a sentence from the user, it may first determine whether a corresponding reply can be given for the sentence. If the voice assistant can give a reply directly from the sentence, no further processing is needed, and the reply sentence can be provided directly to the user. For example, if the user's statement is "what is the temperature in Beijing this week", since it can be directly determined from the statement that the user's intention is to ask about this week's weather conditions in Beijing, the voice assistant can directly output the query result to the user.
If the corresponding result cannot be directly inquired according to the statement of the current turn of the user, the intention of the user can be redetermined by combining the expressions of the user in each previous turn. At this point, historical conversation data between the user and the voice assistant may be obtained. The historical dialog data may be dialog data of all rounds until the current round after the user wakes up the voice assistant, or dialog data of a plurality of consecutive rounds before the current round, which is not limited in this embodiment.
S802, identifying target entity information in the user statement, and identifying historical entity information in the historical dialogue data;
In an embodiment of the application, entity information in user statements and historical dialogue data may be identified based on the NER model.
It should be noted that the historical entity information in the historical dialogue data may include entity information in a sentence spoken by the user in a certain turn, and may also include entity information in a reply sentence when the voice assistant replies to the user.
For example, in a historical conversation turn, the user asks the voice assistant "which restaurant is the nearest", and the voice assistant replies "the nearest restaurant is the seafloor fishing in the south of the Yangtze district". For the historical conversation data of that turn, the historical entity information may include "restaurant" in the user sentence, and may also include entity information such as "south of the Yangtze district" and "seafloor fishing" in the voice assistant's reply sentence.
S803, determining a candidate user intention matched with the user statement according to the target entity information and the historical entity information;
It should be noted that, since the target entity information included in the user statement and the historical entity information included in the historical dialogue data may both be numerous, a plurality of candidate user intentions may be preliminarily determined according to the target entity information and the historical entity information.
In the embodiment of the present application, after the target entity information and the historical entity information are identified, the knowledge bases (KBs) may be consulted to preliminarily determine what the user's current intention may be.
In a specific implementation, multiple user intents may be preset in the KBs, and each user intent may include multiple semantic slots. After the target entity information and the historical entity information are identified, the slots corresponding to each user intent can be matched against the two kinds of entity information, so that a user intent whose slots contain part of the identified entity information is preliminarily determined as a candidate user intention.
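A sketch of this preliminary slot matching is given below; the knowledge-base structure and the intent names are assumptions made for illustration only.

```python
# Assumed KBs: each preset user intent maps to its semantic slots.
KBS = {
    "weather_forecast": {"time", "place", "weather_condition"},
    "restaurant_search": {"place", "cuisine"},
    "navigation": {"place", "transport_mode"},
}

def candidate_intents(identified_entity_types):
    """An intent becomes a candidate if its slots contain part of the
    identified entity information (target or historical)."""
    return [intent for intent, slots in KBS.items()
            if identified_entity_types & slots]

identified = {"time", "place", "weather_condition"}
print(candidate_intents(identified))
# ['weather_forecast', 'restaurant_search', 'navigation'] -- several candidates,
# since "place" alone already partially matches every intent above
```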
S804, respectively calculating the distribution probability of each historical entity information in the historical dialogue data;
In the embodiment of the present application, in order to accurately determine the actual intention of the user, the distribution probability of each piece of historical entity information in the historical dialogue data may be calculated first.
In a specific implementation, the distribution probability of each piece of historical entity information may be determined based on a pointer generation network (PGN) model. Each piece of historical entity information is first symbolized; the PGN model is then called, the encoding module of the PGN model is adopted to encode each piece of symbolized historical entity information, and the distribution probability of each piece of historical entity information is calculated at the encoding stage.
Fig. 9 is a schematic diagram of a process for calculating the distribution probability of entity information based on a PGN model according to an embodiment of the present application. First, the prediction model can be trained by combining training data with the KBs, so as to strengthen the key-information extraction capability of the PGN model. The training data may be pre-collected multi-turn dialogue data, including entity information in the current turn and entity information in the history (the turns before the current turn). For the historical entity information currently to be calculated, the corresponding attention distribution can be output after the historical entity information is converted into a text vector; meanwhile, the generation probability of the historical entity information is obtained by combining the encoding module and the decoding module of the PGN model. The above probabilities are added, and the final distribution probability can be output. In another implementation, when determining the distribution probability, the user's confirmation information can be combined to improve the reliability of the output distribution probability and of the identified key entity information.
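The patent text does not reproduce the exact formula; assuming the PGN model follows the standard pointer-generator formulation (See et al., 2017), the final distribution for a token $w$ combines the generation probability and the attention (copy) distribution roughly as follows, where $a_i^t$ is the attention weight of source position $i$ at decoding step $t$ and $p_{\mathrm{gen}}$ is the generation probability:

```latex
% Assumed standard pointer-generator combination, not quoted from the patent:
P(w) \;=\; p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
      \;+\; \bigl(1 - p_{\mathrm{gen}}\bigr) \sum_{i \,:\, w_i = w} a_i^{t}
```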
S805, extracting key entity information from the historical entity information according to the distribution probability and the candidate user intention;
In the embodiment of the application, the key entity information associated with the user intention is found according to the distribution probability of each piece of historical entity information; that is, the entity information most relevant to the user intention is screened out of all the historical entity information.
In a specific implementation, candidate entity information associated with any candidate user intention may be extracted from the historical entity information, and then candidate entity information having a probability value of a distribution probability greater than a certain preset probability threshold is extracted as key entity information related to the intention.
As an example of this embodiment, the probability threshold may be set to 0.8. Therefore, candidate entity information having a probability value of the distribution probability of more than 0.8 may be identified as the key entity information.
In the embodiment of the present application, for entity information whose probability value is not greater than the probability threshold but is near the probability threshold, the user may be invited to identify the entity information.
In a specific implementation, if the difference between a target probability value and the probability threshold is smaller than a preset difference, and the target probability value is smaller than the probability threshold, an inquiry statement may be generated according to the candidate entity information corresponding to the target probability value and the key entity information, so as to instruct the user to identify the candidate entity information corresponding to the target probability value. Here, the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.
When confirmation information of the user for the query statement is received, the entity information may be considered to be approved by the user, and at this time, the candidate entity information corresponding to the target probability value may be identified as the key entity information.
For example, in a certain conversation turn between the user and the voice assistant, the probability value of the historical entity information "temperature" is calculated to be 0.86, which is greater than the set probability threshold 0.8, so the entity information "temperature" can be identified as key entity information. Meanwhile, the probability value of the historical entity information "Friday" is calculated to be 0.72, which is smaller than the probability threshold 0.8 but near the threshold. Assuming that the target entity information in the current turn's user statement is "Beijing", a corresponding query statement such as "would you like to query the temperature for Friday in Beijing?" can be generated by combining the target entity information with the existing key entity information; if a confirmation reply from the user is received, the historical entity information "Friday" can also be identified as key entity information.
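The threshold logic of step S805 and the confirmation branch could be sketched as follows. The 0.8 threshold mirrors the example above, while the width of the "near the threshold" band and the confirmation callback are assumptions.

```python
PROB_THRESHOLD = 0.8  # from the example above
NEAR_MARGIN = 0.1     # assumed width of the "near the threshold" band

def select_key_entities(candidates, ask_user):
    """candidates: {entity: distribution probability}; ask_user: callback that
    returns True when the user confirms the generated query statement."""
    key = []
    for entity, prob in candidates.items():
        if prob > PROB_THRESHOLD:
            key.append(entity)            # confident enough: keep directly
        elif PROB_THRESHOLD - prob < NEAR_MARGIN:
            question = f"Would you like to query information for '{entity}'?"
            if ask_user(question):
                key.append(entity)        # identified as key by the user
    return key

candidates = {"temperature": 0.86, "friday": 0.72, "haidilao": 0.31}
print(select_key_entities(candidates, ask_user=lambda q: True))
# ['temperature', 'friday']
```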
S806, determining a target basic statement;
In the embodiment of the application, after the target entity information in the user statement of the current turn and the key entity information in the historical dialogue statements are determined, the target interactive statement matching the actual intention of the user can be generated according to the two kinds of information.
In a specific implementation, in order to reduce the generation difficulty of the target interactive statement, the target basic statement may be determined first, and then rewritten on the basis of the target basic statement to obtain a final target interactive statement.
In an embodiment of the present application, the target base sentence may be determined based on the key entity information and/or the target entity information.
In a specific implementation, a plurality of basic sentences are obtained from the user sentence containing the target entity information and from the historical dialogue data containing the key entity information; the matching degrees between the basic sentences and the entity information to be evaluated are then calculated respectively, and the basic sentence corresponding to the maximum matching degree is identified as the current target basic sentence. The entity information to be evaluated comprises all the target entity information and the key entity information.
In the embodiment of the present application, the basic sentence may be the current user sentence, or may be a user sentence in the historical dialogue data. The matching degree between the entity information to be evaluated and each basic statement can be determined according to the degree to which the entity information to be evaluated matches the semantic slots of the basic statement.
Specifically, for any basic statement, the number of semantic slots in the basic statement and the number of pieces of entity information to be evaluated may be counted; that is, how many slots the basic statement contains and how many pieces of entity information to be evaluated have been identified. The number of key slots in the basic statement that are respectively matched by the entity information to be evaluated is then determined, and finally the ratio of the number of key slots to the number of semantic slots in the basic statement is calculated; this ratio can be used as the matching degree between the entity information to be evaluated and the basic statement.
For example, if a basic statement includes four semantic slots, and the entity information to be evaluated, "Friday" and "temperature", matches the time slot and the weather condition slot respectively, the matching degree between the entity information to be evaluated and the basic statement is 50%.
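A sketch of this matching-degree calculation, and of picking the target basic statement with the maximum matching degree, follows; the slot sets per basic statement are illustrative assumptions.

```python
# Matching degree = matched key slots / semantic slots in the basic statement.
def matching_degree(base_slots, entities_to_evaluate):
    matched = sum(1 for slot in base_slots if slot in entities_to_evaluate)
    return matched / len(base_slots)

def pick_target_base(base_sentences, entities_to_evaluate):
    # Identify the basic statement with the maximum matching degree.
    return max(base_sentences,
               key=lambda s: matching_degree(base_sentences[s], entities_to_evaluate))

base_sentences = {                        # assumed slot sets per basic statement
    "how is the weather": {"time", "place", "weather_condition", "unit"},
    "find a restaurant": {"place", "cuisine"},
}
entities = {"time", "weather_condition"}  # e.g. "Friday" and "temperature"
print(matching_degree(base_sentences["how is the weather"], entities))  # 0.5
print(pick_target_base(base_sentences, entities))  # how is the weather
```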
S807, rewriting the target basic statement by adopting the target entity information and the key entity information to generate a target interactive statement;
After the target basic statement is determined, the statement can be rewritten by adopting the target entity information and the key entity information to obtain the final target interactive statement.
Specifically, whether the target entity information or the key entity information is adopted for statement rewriting depends on whether the target basic statement is the current user statement or a user statement in the historical conversation. If the target basic statement is the current user statement, since that statement already contains all the target entity information, the target basic statement can be rewritten using the key entity information identified from the historical dialogue; if the target basic statement is a statement in the historical dialogue, since that statement may contain only part of the key entity information, it can be rewritten using all the key entity information together with the target entity information.
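The selection described above might reduce to something like the sketch below, a simplification of the actual PGN-based rewriting:

```python
# Choose which entity information feeds the rewriting, depending on where the
# target basic statement came from.
def entities_for_rewrite(base_is_current_statement, target_entities, key_entities):
    if base_is_current_statement:
        # the current user statement already contains all target entity information
        return set(key_entities)
    # a historical statement may contain only part of the key entity information
    return set(target_entities) | set(key_entities)

print(entities_for_rewrite(True, {"weather_condition"}, {"time", "place"}))
# {'time', 'place'} (set order may vary)
```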
In the embodiment of the present application, the target interactive statement may be output based on a PGN model. The PGN model may further include a decoding module, in addition to the encoding module, where the decoding module may be obtained by training a plurality of types of training data, and the plurality of types of training data may include a plurality of pieces of entity information and a base sentence corresponding to each piece of entity information.
Therefore, after the target basic statement is determined, the target entity information, the key entity information and the target basic statement can be decoded by using a decoding module of the PGN model, and the target interactive statement is output.
In the embodiment of the application, for a target interactive statement output by a PGN model, whether the statement is rewritten correctly or not may be verified.
In the embodiment of the application, whether the target interactive statement is rewritten correctly or not can be judged in a double-layer verification mode.
Specifically, the plurality of pieces of entity information in the target interactive statement may be extracted first, and it is verified whether these pieces of entity information match the semantic slots of a target user intention in a preset knowledge base. The target user intention is any one of the candidate user intentions.
If the plurality of entity information in the target interactive statement matches the semantic slots of the target user intention, it may be determined that the generated target interactive statement is correct, and step S808 is performed to output a reply statement corresponding to the target interactive statement.
If the entity information in the target interactive statement does not match the semantic slots of the target user intention, the target interactive statement can be verified a second time according to the statement type of the target interactive statement.
The secondary verification may be performed based on a natural language understanding model: a preset natural language understanding model is called to judge whether the target interactive statement is a task-type statement. If it is a task-type statement, the user's intention can be specifically recognized from the current statement and a response can be made to that intention; at this point, step S808 may also be executed to output a reply statement corresponding to the target interactive statement. If the target interactive statement is not a task-type statement, this indicates that the voice assistant cannot perform specific intention recognition on the statement, or that the recognized intention lacks sufficiently definite information; in this case, the user may be prompted to re-input the user statement, and the voice assistant may recognize the user intention again according to the re-input statement and generate a new target interactive statement.
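The double-layer verification could be sketched as below; `nlu_is_task_statement` stands in for the preset natural language understanding model and is an assumed callback, not a real API.

```python
# Double-layer verification of a rewritten target interactive statement.
def verify_rewrite(statement, statement_entity_types, intent_slots,
                   nlu_is_task_statement):
    # Layer 1: do the statement's entities match the target intent's semantic slots?
    if all(etype in intent_slots for etype in statement_entity_types):
        return "correct"       # proceed to output the reply statement (S808)
    # Layer 2: statement-type check via the (assumed) NLU model.
    if nlu_is_task_statement(statement):
        return "correct"       # task-type statement: the intention can be served
    return "retry"             # prompt the user to re-input the user statement

result = verify_rewrite(
    "what is the temperature in Beijing on Friday",
    {"time", "place", "weather_condition"},
    intent_slots={"time", "place", "weather_condition"},
    nlu_is_task_statement=lambda s: True,
)
print(result)  # correct
```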
And S808, outputting a reply sentence corresponding to the target interactive sentence.
In the embodiment of the application, the user statement in the current turn can be rewritten by combining entity information in historical conversation data, so that the dialog state tracking problem in multi-turn dialog is converted, to a certain extent, into a single-turn dialog problem, and existing single-turn dialog technology can be used to reply to the user's intention. This improves the natural language processing capability of the dialog system, ensures the accuracy of user intention recognition, enhances the reasonableness of the dialog system's replies in a multi-turn dialog process, makes the system replies better match the user's actual demand, and reduces the number of interactions between the user and the dialog system.
For ease of understanding, the voice interaction method of the present application is described below with reference to a specific example. As shown in fig. 10, which is a schematic view of an operation process of the voice interaction method provided in an embodiment of the present application, according to the operation process shown in fig. 10, the whole voice interaction may include the following steps:
1. For an input sentence in a multi-turn dialog process, whether the input sentence of the current turn needs to be rewritten can be judged first. If no rewriting is needed, the reply sentence can be output directly; if rewriting is needed, the NER module is used to extract entity information in the historical dialogue data so as to determine all the historical entity information therein. Since an entity may consist of multiple words, each piece of entity information needs to be symbolized to facilitate the subsequent encoding and generation processes in the PGN model.
2. Based on the PGN model, the attention distribution of each piece of historical entity information is calculated at the encoding stage.
3. A prediction model is trained by combining the target entity information in the current conversation turn, the distribution probabilities of the historical entity information, and the knowledge bases KBs; the entity information with the largest probability values related to the user intention is screened out of the historical entity information as key entity information, and redundant entity information in the historical conversation data is discarded. On this basis, the corresponding basic rewriting statement is determined according to the key entity information, and the target interactive statement is then generated by the decoding module of the PGN model on the basis of the basic rewriting statement. Determining the basic rewriting statement in advance is significant in reducing the difficulty for the PGN model of generating the output statement.
4. A neural network model is trained by taking the KBs as priors and combining the historical entity information with the target entity information in the current round of conversation, so as to enhance the ability of the PGN model to extract key information from the historical conversation. It should be noted that, since the entities in the sentences of the current conversation turn are basically closely related to the user's intention, all of those entities are retained. The output of the neural network is then fused into the loss function of the PGN model to calculate the output probability corresponding to each piece of historical entity information.
5. On the basis of the output probability of each piece of historical entity information, the final distribution probability of each piece of historical entity information is calculated in combination with the pointer generation network (PGN) model, namely the probability that the entity can ultimately reflect the user's intention.
6. If the distribution probability is smaller than the threshold, the model still cannot determine whether the entity should appear in the output statement, and at this time the user may be invited to participate in configuring the key entities; the statement is then generated by the decoding module of the PGN model on the basis of the determined basic rewriting statement, in combination with the key entities whose probabilities are greater than the threshold. It should be noted that the user is invited to participate in entity configuration because the threshold is generally set relatively high to ensure the quality of the key information extracted by the model. However, too high a threshold may also cause part of the key information to be lost, so for entity information close to the threshold the user needs to be invited to participate in configuration, thereby further increasing the recall of key-information extraction. Moreover, by inviting the user to participate in configuring the entity information, the resulting output statements are more reliable; they can be used as training corpora for iterative optimization of the model, which partially alleviates the difficulty of obtaining high-quality multi-turn dialogue corpora.
7. In order to determine the validity of the output target interactive statement, the embodiment designs a double-layer feedback mechanism, and the specific manner is as follows:
The rewritten sentence is matched against the slot values corresponding to the intents in the KBs; if the matching succeeds, the rewriting is considered correct. If the matching is unsuccessful, natural language understanding technology can further be used to verify the rewritten sentence: if the sentence is recognized as a task-type sentence, the rewriting can be considered correct; if it is recognized as not being a task-type sentence, the rewriting is considered incorrect. In that case, the user may be guided to restate the intention, which also serves as subsequent training corpus.
8. Based on the rewritten target interactive statements, the DST problem in multi-turn dialogue can be converted, to a certain extent, into a single-turn dialogue problem; existing mature single-turn dialogue technology is used to reply to the user's intention, improving the capability and user experience of the task-oriented multi-turn dialogue system.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 11 shows a structural block diagram of a voice interaction apparatus provided in an embodiment of the present application, and for convenience of description, only the parts related to the embodiment of the present application are shown.
Referring to fig. 11, the apparatus may be applied to a terminal device, and specifically may include the following modules:
a historical dialogue data acquisition module 1101, configured to acquire historical dialogue data when receiving a user statement to be replied;
a target entity information identification module 1102, configured to identify target entity information in the user statement; and
a historical entity information identification module 1103, configured to identify historical entity information in the historical dialogue data;
a key entity information extraction module 1104, configured to extract key entity information associated with the user statement from the historical entity information;
a target interactive statement generating module 1105, configured to generate a target interactive statement according to the target entity information and the key entity information;
a reply statement output module 1106, configured to output a reply statement corresponding to the target interactive statement.
In this embodiment of the present application, the key entity information extraction module 1104 may specifically include the following sub-modules:
The candidate user intention determining submodule is used for determining candidate user intentions matched with the user sentences according to the target entity information and the historical entity information;
the distribution probability calculation submodule is used for calculating the distribution probability of each historical entity information in the historical dialogue data;
and the key entity information extraction submodule is used for extracting key entity information from the historical entity information according to the distribution probability and the candidate user intention.
In this embodiment, the distribution probability calculation sub-module may specifically include the following units:
the first pointer generation network model calling unit is used for calling a preset pointer generation network model, and a coding module adopting the pointer generation network model respectively codes each historical entity information to obtain the distribution probability corresponding to each historical entity information.
In this embodiment, the key entity information extraction sub-module may specifically include the following units:
a candidate entity information extracting unit for extracting candidate entity information associated with any candidate user intention from the historical entity information;
And the key entity information extraction unit is used for extracting candidate entity information of which the probability value of the distribution probability is greater than a preset probability threshold value as the key entity information.
In this embodiment, the key entity information extraction sub-module may further include the following units:
an inquiry statement generating unit, configured to generate an inquiry statement according to the candidate entity information corresponding to the target probability value and the key entity information if a difference between the target probability value and the preset probability threshold is smaller than a preset difference and the target probability value is smaller than the preset probability threshold, so as to instruct a user to identify the candidate entity information corresponding to the target probability value;
and a key entity information determining unit, configured to determine, when receiving confirmation information of a user for the query statement, candidate entity information corresponding to the target probability value as the key entity information, where the target probability value is a probability value of a distribution probability of any candidate entity information in the historical dialog data.
In this embodiment of the present application, the target interactive statement generating module 1105 may specifically include the following sub-modules:
the target basic statement determining submodule is used for determining a target basic statement;
And the target interactive statement generating submodule is used for adopting the target entity information and the key entity information to rewrite the target basic statement and generate the target interactive statement.
In this embodiment of the present application, the target basic statement determination sub-module may specifically include the following units:
a basic statement acquisition unit, configured to acquire a plurality of basic statements from a user statement that includes the target entity information and historical dialogue data that includes the key entity information;
the matching degree calculation unit is used for calculating the matching degrees between the basic sentences and the entity information to be evaluated respectively, wherein the entity information to be evaluated comprises the target entity information and the key entity information;
and the target basic statement identification unit is used for identifying the basic statement corresponding to the maximum matching degree as the current target basic statement.
In this embodiment of the present application, each basic sentence includes a plurality of semantic slots, and the matching degree calculating unit may specifically include the following sub-units:
the statistic subunit is used for counting the number of semantic slots in any basic statement and the number of the entity information to be evaluated;
The determining subunit is used for determining the number of key slot positions in the basic statement that are respectively matched by the entity information to be evaluated;
and the calculating subunit is used for calculating the ratio of the number of the key slots to the number of the semantic slots in the basic statement, and taking the ratio as the matching degree between the entity information to be evaluated and the basic statement.
In this embodiment of the present application, the pointer generation network model further includes a decoding module, where the decoding module is obtained by training a plurality of types of training data, where the plurality of types of training data include a plurality of entity information and a basic sentence corresponding to each entity information; the target interactive statement generation submodule may specifically include the following units:
and the second pointer generation network model calling unit is used for decoding the target entity information, the key entity information and the target basic statement by adopting the decoding module and outputting a target interactive statement.
In this embodiment of the present application, the target interactive statement generating submodule may further include the following unit:
the target interactive statement entity information extraction unit is used for extracting a plurality of entity information in the target interactive statement;
A target interactive statement verification unit, configured to verify whether multiple pieces of entity information in the target interactive statement match a semantic slot of a target user intention in a preset knowledge base, where the target user intention is any one of the candidate user intentions; if the plurality of entity information in the target interactive statement matches the semantic slot of the target user intention, judge that the generated target interactive statement is correct, and execute the step of outputting a reply statement corresponding to the target interactive statement; and if the plurality of entity information in the target interactive statement do not match the semantic slot of the target user intention, verify the target interactive statement according to the statement type of the target interactive statement.
In an embodiment of the present application, the target interactive statement verifying unit is further configured to: call a preset natural language understanding model to judge whether the target interactive statement is a task-type statement; if the target interactive statement is a task-type statement, call the reply statement output module 1106 to output a reply statement corresponding to the target interactive statement; and if the target interactive statement is not a task-type statement, prompt the user to re-input the user statement and re-generate the target interactive statement according to the re-input user statement.
For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the description of the method embodiment for relevant points.
Referring to fig. 12, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 12, the terminal device 1200 of the present embodiment includes: a processor 1210, a memory 1220, and a computer program 1221 stored in the memory 1220 and operable on the processor 1210. When the processor 1210 executes the computer program 1221, the steps in the embodiments of the voice interaction method described above, for example, steps S701 to S705 shown in fig. 7, are implemented. Alternatively, the processor 1210, when executing the computer program 1221, implements the functions of each module/unit in each device embodiment described above, for example, the functions of the modules 1101 to 1106 shown in fig. 11.
Illustratively, the computer program 1221 may be partitioned into one or more modules/units that are stored in the memory 1220 and executed by the processor 1210 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which may be used to describe the execution process of the computer program 1221 in the terminal device 1200. For example, the computer program 1221 may be divided into a historical dialogue data acquisition module, a target entity information identification module, a historical entity information identification module, a key entity information extraction module, a target interactive statement generation module, and a reply statement output module, where the specific functions of the modules are as follows:
The historical dialogue data acquisition module is used for acquiring historical dialogue data when receiving a user statement to be replied;
the target entity information identification module is used for identifying target entity information in the user statement; and
the historical entity information identification module is used for identifying historical entity information in the historical dialogue data;
a key entity information extraction module, configured to extract key entity information associated with the user statement from the historical entity information;
the target interactive statement generating module is used for generating a target interactive statement according to the target entity information and the key entity information;
and the reply statement output module is used for outputting a reply statement corresponding to the target interactive statement.
The terminal device 1200 may be a desktop computer, a notebook, a palm computer, or another computing device. The terminal device 1200 may include, but is not limited to, the processor 1210 and the memory 1220. Those skilled in the art will appreciate that fig. 12 is only one example of the terminal device 1200 and does not constitute a limitation of the terminal device 1200, which may include more or fewer components than those shown, or combine some components, or use different components; for example, the terminal device 1200 may further include input and output devices, network access devices, buses, and the like.
The Processor 1210 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The storage 1220 may be an internal storage unit of the terminal device 1200, such as a hard disk or a memory of the terminal device 1200. The memory 1220 may also be an external storage device of the terminal device 1200, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the terminal device 1200. Further, the memory 1220 may also include both an internal storage unit and an external storage device of the terminal device 1200. The memory 1220 is used for storing the computer program 1221 and other programs and data required by the terminal device 1200. The memory 1220 may also be used to temporarily store data that has been output or is to be output.
The embodiment of the application also discloses a computer readable storage medium, which stores a computer program, and the computer program can implement the foregoing voice interaction method when executed by a processor.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed voice interaction method, apparatus and terminal device may be implemented in other ways. For example, the division of the modules or units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a voice interaction device or terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.

Claims (12)

1. A method for voice interaction, comprising:
when receiving a user statement to be replied, acquiring historical dialogue data;
identifying target entity information in the user statement and identifying historical entity information in the historical dialog data;
extracting key entity information associated with the user statement from the historical entity information;
determining a target basic statement, and rewriting the target basic statement by adopting the target entity information and the key entity information to generate a target interactive statement;
outputting a reply sentence corresponding to the target interactive sentence;
Wherein the determining the target base statement comprises:
acquiring a plurality of basic sentences from the user sentences containing the target entity information and the historical dialogue data containing the key entity information;
respectively calculating the matching degrees between the plurality of basic sentences and entity information to be evaluated, wherein the entity information to be evaluated comprises the target entity information and the key entity information;
and identifying the basic statement corresponding to the maximum matching degree as the current target basic statement.
2. The method of claim 1, wherein extracting key entity information associated with the user statement from the historical entity information comprises:
determining candidate user intentions matched with the user sentences according to the target entity information and the historical entity information;
respectively calculating the distribution probability of each historical entity information in the historical dialogue data;
and extracting key entity information from the historical entity information according to the distribution probability and the candidate user intention.
3. The method of claim 2, wherein the calculating the distribution probability of each historical entity information in the historical dialogue data comprises:
calling a preset pointer generation network model, wherein the pointer generation network model comprises a coding module;
and respectively coding each historical entity information by adopting the coding module to obtain the distribution probability corresponding to each historical entity information.
4. The method of claim 2 or 3, wherein the extracting key entity information from the historical entity information according to the distribution probability and the candidate user intent comprises:
extracting candidate entity information associated with any candidate user intent from the historical entity information;
and extracting candidate entity information with the probability value of the distribution probability being larger than a preset probability threshold value as key entity information.
5. The method of claim 4, further comprising:
if the difference value between the target probability value and the preset probability threshold value is smaller than the preset difference value, and the target probability value is smaller than the preset probability threshold value, generating an inquiry statement according to the candidate entity information corresponding to the target probability value and the key entity information so as to indicate a user to identify the candidate entity information corresponding to the target probability value;
When confirmation information of a user for the inquiry statement is received, determining candidate entity information corresponding to the target probability value as the key entity information, wherein the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.
6. The method of claim 1, wherein any one of the base sentences includes a plurality of semantic slots, and the calculating the matching degrees between the plurality of base sentences and the entity information to be evaluated includes:
counting, for any basic statement, the number of semantic slots in the basic statement and the number of pieces of entity information to be evaluated;
determining the number of key slot positions respectively matched with the entity information to be evaluated in the basic statement;
and calculating the ratio of the number of the key slots to the number of the semantic slots in the basic statement, and taking the ratio as the matching degree between the entity information to be evaluated and the basic statement.
7. The method of claim 3, wherein the pointer generation network model further comprises a decoding module, the decoding module being obtained by training a plurality of types of training data, the plurality of types of training data comprising a plurality of entity information and a base sentence corresponding to each entity information;
The rewriting the target basic statement by using the target entity information and the key entity information to generate a target interactive statement includes:
and decoding the target entity information, the key entity information and the target basic statement by adopting the decoding module, and outputting a target interactive statement.
8. The method of claim 7, further comprising:
extracting a plurality of entity information in the target interactive statement;
verifying whether a plurality of entity information in the target interactive statement matches a semantic slot of a target user intention in a preset knowledge base, wherein the target user intention is any one of the candidate user intentions;
if the plurality of entity information in the target interactive statement matches the semantic slot of the target user intention, judging that the generated target interactive statement is correct, and executing a step of outputting a reply statement corresponding to the target interactive statement;
and if the plurality of entity information in the target interactive statement do not match the semantic slot of the target user intention, verifying the target interactive statement according to the statement type of the target interactive statement.
9. The method of claim 8, wherein the verifying the target interactive statement according to the statement type of the target interactive statement comprises:
calling a preset natural language understanding model to judge whether the target interactive statement is a task-type statement;
if the target interactive statement is a task-type statement, executing the step of outputting a reply statement corresponding to the target interactive statement;
and if the target interactive statement is not a task-type statement, prompting the user to re-input a user statement, and generating the target interactive statement again according to the re-input user statement.
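Editorial illustration (not part of the claims): the fallback is a single branch on the NLU model's task-type judgment. In this sketch every callable is a hypothetical stand-in for the trained model and the surrounding dialogue pipeline:

```python
# Hypothetical wiring of the claim-9 fallback flow.

def verify_by_statement_type(statement, is_task_statement, output_reply,
                             reprompt_user, regenerate):
    if is_task_statement(statement):        # preset NLU model says task-type
        return output_reply(statement)      # -> output the reply statement
    # Not task-type: prompt the user to re-input, then regenerate the
    # target interactive statement from the new user statement.
    return regenerate(reprompt_user("Sorry, could you say that again?"))

# Toy stand-ins so the sketch runs end to end:
result = verify_by_statement_type(
    "play some jazz",
    is_task_statement=lambda s: s.startswith("play"),
    output_reply=lambda s: f"OK, {s}.",
    reprompt_user=lambda msg: "play some blues",
    regenerate=lambda s: f"regenerated: {s}",
)
print(result)  # -> "OK, play some jazz."
```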
10. A voice interaction apparatus, comprising:
a historical dialogue data acquisition module, configured to acquire historical dialogue data when a user statement to be replied is received;
a target entity information identification module, configured to identify target entity information in the user statement;
a historical entity information identification module, configured to identify historical entity information in the historical dialogue data;
a key entity information extraction module, configured to extract key entity information associated with the user statement from the historical entity information;
a target interactive statement generation module, configured to determine a target basic statement, and to rewrite the target basic statement by using the target entity information and the key entity information to generate a target interactive statement; and
a reply statement output module, configured to output a reply statement corresponding to the target interactive statement;
wherein the target interactive statement generation module comprises:
a basic statement acquisition unit, configured to acquire a plurality of basic statements from the user statement that includes the target entity information and the historical dialogue data that includes the key entity information;
a matching degree calculation unit, configured to calculate the matching degrees between the plurality of basic statements and the entity information to be evaluated, respectively, wherein the entity information to be evaluated comprises the target entity information and the key entity information;
and a target basic statement identification unit, configured to identify the basic statement corresponding to the maximum matching degree as the current target basic statement.
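Editorial illustration (not part of the claims): one possible skeleton mirroring the module decomposition of claim 10. The class and method names are hypothetical and the bodies are placeholders, not the patented algorithms:

```python
# Hypothetical skeleton of the claim-10 apparatus.

class VoiceInteractionApparatus:
    def acquire_historical_dialogue(self):
        """Historical dialogue data acquisition module: runs when a user
        statement to be replied is received."""
        raise NotImplementedError

    def identify_target_entities(self, user_statement):
        """Target entity information identification module."""
        raise NotImplementedError

    def identify_historical_entities(self, history):
        """Historical entity information identification module."""
        raise NotImplementedError

    def extract_key_entities(self, historical_entities, user_statement):
        """Key entity information extraction module: keeps only the
        historical entity information associated with the user statement."""
        raise NotImplementedError

    def generate_target_statement(self, target_entities, key_entities):
        """Target interactive statement generation module: selects the
        basic statement with the maximum matching degree (claim 6) and
        rewrites it with the entity information (claim 7)."""
        raise NotImplementedError

    def output_reply(self, target_interactive_statement):
        """Reply statement output module."""
        raise NotImplementedError
```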
11. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the voice interaction method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice interaction method according to any one of claims 1 to 9.
CN202010244784.8A 2020-03-31 2020-03-31 Voice interaction method and device and terminal equipment Active CN111428483B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010244784.8A CN111428483B (en) 2020-03-31 2020-03-31 Voice interaction method and device and terminal equipment
PCT/CN2021/079479 WO2021196981A1 (en) 2020-03-31 2021-03-08 Voice interaction method and apparatus, and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010244784.8A CN111428483B (en) 2020-03-31 2020-03-31 Voice interaction method and device and terminal equipment

Publications (2)

Publication Number Publication Date
CN111428483A CN111428483A (en) 2020-07-17
CN111428483B true CN111428483B (en) 2022-05-24

Family

ID=71557320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010244784.8A Active CN111428483B (en) 2020-03-31 2020-03-31 Voice interaction method and device and terminal equipment

Country Status (2)

Country Link
CN (1) CN111428483B (en)
WO (1) WO2021196981A1 (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428483B (en) * 2020-03-31 2022-05-24 华为技术有限公司 Voice interaction method and device and terminal equipment
CN111966803B (en) * 2020-08-03 2024-04-12 深圳市欢太科技有限公司 Dialogue simulation method and device, storage medium and electronic equipment
CN112084768A (en) * 2020-08-06 2020-12-15 珠海格力电器股份有限公司 Multi-round interaction method and device and storage medium
CN111949793B (en) * 2020-08-13 2024-02-27 深圳市欢太科技有限公司 User intention recognition method and device and terminal equipment
CN112183105A (en) * 2020-08-28 2021-01-05 华为技术有限公司 Man-machine interaction method and device
CN112100349B (en) * 2020-09-03 2024-03-19 深圳数联天下智能科技有限公司 Multi-round dialogue method and device, electronic equipment and storage medium
CN112256229B (en) * 2020-09-11 2024-05-14 北京三快在线科技有限公司 Man-machine voice interaction method and device, electronic equipment and storage medium
CN112183097B (en) * 2020-09-27 2024-06-21 深圳追一科技有限公司 Entity recall method and related device
CN112199473A (en) * 2020-10-16 2021-01-08 上海明略人工智能(集团)有限公司 Multi-turn dialogue method and device in knowledge question-answering system
CN112331201A (en) * 2020-11-03 2021-02-05 珠海格力电器股份有限公司 Voice interaction method and device, storage medium and electronic device
CN112395887A (en) * 2020-11-05 2021-02-23 北京文思海辉金信软件有限公司 Dialogue response method, dialogue response device, computer equipment and storage medium
CN112382290B (en) * 2020-11-20 2023-04-07 北京百度网讯科技有限公司 Voice interaction method, device, equipment and computer storage medium
CN112527998A (en) * 2020-12-22 2021-03-19 深圳市优必选科技股份有限公司 Reply recommendation method, reply recommendation device and intelligent device
CN112632251B (en) * 2020-12-24 2023-12-29 北京百度网讯科技有限公司 Reply content generation method, device, equipment and storage medium
CN112735374B (en) * 2020-12-29 2023-01-06 北京三快在线科技有限公司 Automatic voice interaction method and device
CN112699228B (en) * 2020-12-31 2023-07-14 青岛海尔科技有限公司 Service access method, device, electronic equipment and storage medium
CN112650846A (en) * 2021-01-13 2021-04-13 北京智通云联科技有限公司 Question-answer intention knowledge base construction system and method based on question frame
CN112783324B (en) * 2021-01-14 2023-12-01 科大讯飞股份有限公司 Man-machine interaction method and device and computer storage medium
CN112836030B (en) * 2021-01-29 2023-04-25 成都视海芯图微电子有限公司 Intelligent dialogue system and method
CN112989008A (en) * 2021-04-21 2021-06-18 上海汽车集团股份有限公司 Multi-turn dialog rewriting method and device and electronic equipment
CN113436752B (en) * 2021-05-26 2023-04-28 山东大学 Semi-supervised multi-round medical dialogue reply generation method and system
CN113536788B (en) * 2021-07-28 2023-12-05 平安科技(上海)有限公司 Information processing method, device, storage medium and equipment
CN113590750A (en) * 2021-07-30 2021-11-02 北京小米移动软件有限公司 Man-machine conversation method, device, electronic equipment and storage medium
CN113806508A (en) * 2021-09-17 2021-12-17 平安普惠企业管理有限公司 Multi-turn dialogue method and device based on artificial intelligence and storage medium
CN114861680B (en) * 2022-05-27 2023-07-25 马上消费金融股份有限公司 Dialogue processing method and device
CN115346690B (en) * 2022-07-08 2023-12-01 中国疾病预防控制中心慢性非传染性疾病预防控制中心 System for guiding operator to ask help seeker
CN115329206B (en) * 2022-10-13 2022-12-20 深圳市人马互动科技有限公司 Voice outbound processing method and related device
CN115759122A (en) * 2022-11-03 2023-03-07 支付宝(杭州)信息技术有限公司 Intention identification method, device, equipment and readable storage medium
CN115545002B (en) * 2022-11-29 2023-03-31 支付宝(杭州)信息技术有限公司 Model training and business processing method, device, storage medium and equipment
CN115579008B (en) * 2022-12-05 2023-03-31 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN115934922B (en) * 2023-03-09 2024-01-30 杭州心识宇宙科技有限公司 Dialogue service execution method and device, storage medium and electronic equipment
CN117076620A (en) * 2023-06-25 2023-11-17 北京百度网讯科技有限公司 Dialogue processing method and device, electronic equipment and storage medium
CN116521850B (en) * 2023-07-04 2023-12-01 北京红棉小冰科技有限公司 Interaction method and device based on reinforcement learning
CN117172732A (en) * 2023-07-31 2023-12-05 北京五八赶集信息技术有限公司 Recruitment service system, method, equipment and storage medium based on AI
CN116975654B (en) * 2023-08-22 2024-01-05 腾讯科技(深圳)有限公司 Object interaction method and device, electronic equipment and storage medium
CN116933800B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Template-based generation type intention recognition method and device
CN117078270B (en) * 2023-10-17 2024-02-02 彩讯科技股份有限公司 Intelligent interaction method and device for network product marketing
CN117421416B (en) * 2023-12-19 2024-03-26 数据空间研究院 Interactive search method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704596A (en) * 2019-09-29 2020-01-17 北京百度网讯科技有限公司 Topic-based conversation method and device and electronic equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885756B (en) * 2016-09-30 2020-05-08 华为技术有限公司 Deep learning-based dialogue method, device and equipment
CN107369443B (en) * 2017-06-29 2020-09-25 北京百度网讯科技有限公司 Dialog management method and device based on artificial intelligence
CN109697282B (en) * 2017-10-20 2023-06-06 阿里巴巴集团控股有限公司 Sentence user intention recognition method and device
CN108228764A (en) * 2017-12-27 2018-06-29 神思电子技术股份有限公司 A kind of single-wheel dialogue and the fusion method of more wheel dialogues
US10593350B2 (en) * 2018-04-21 2020-03-17 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN109086329B (en) * 2018-06-29 2021-01-05 出门问问信息科技有限公司 Topic keyword guide-based multi-turn conversation method and device
CN109101492A (en) * 2018-07-25 2018-12-28 南京瓦尔基里网络科技有限公司 Usage history conversation activity carries out the method and system of entity extraction in a kind of natural language processing
CN109461039A (en) * 2018-08-28 2019-03-12 厦门快商通信息技术有限公司 A kind of text handling method and intelligent customer service method
CN110162675B (en) * 2018-09-25 2023-05-02 腾讯科技(深圳)有限公司 Method and device for generating answer sentence, computer readable medium and electronic device
CN109582767B (en) * 2018-11-21 2024-05-17 北京京东尚科信息技术有限公司 Dialogue system processing method, device, equipment and readable storage medium
CN109918673B (en) * 2019-03-14 2021-08-03 湖北亿咖通科技有限公司 Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN110263330B (en) * 2019-05-22 2024-06-25 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for rewriting problem statement
CN110209791B (en) * 2019-06-12 2021-03-26 百融云创科技股份有限公司 Multi-round dialogue intelligent voice interaction system and device
CN110442676A (en) * 2019-07-02 2019-11-12 北京邮电大学 Patent retrieval method and device based on more wheel dialogues
CN110334201B (en) * 2019-07-18 2021-09-21 中国工商银行股份有限公司 Intention identification method, device and system
CN110390108B (en) * 2019-07-29 2023-11-21 中国工商银行股份有限公司 Task type interaction method and system based on deep reinforcement learning
CN111428483B (en) * 2020-03-31 2022-05-24 华为技术有限公司 Voice interaction method and device and terminal equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704596A (en) * 2019-09-29 2020-01-17 北京百度网讯科技有限公司 Topic-based conversation method and device and electronic equipment

Also Published As

Publication number Publication date
WO2021196981A1 (en) 2021-10-07
CN111428483A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428483B (en) Voice interaction method and device and terminal equipment
CN108304846B (en) Image recognition method, device and storage medium
CN102971725B (en) The words level of phonetic entry is corrected
US9176944B1 (en) Selectively processing user input
US20180061401A1 (en) Automating natural language task/dialog authoring by leveraging existing content
CN111177371B (en) Classification method and related device
CN109902296B (en) Natural language processing method, training method and data processing equipment
CN110164421B (en) Voice decoding method, device and storage medium
CN111044045B (en) Navigation method and device based on neural network and terminal equipment
CN108984535B (en) Statement translation method, translation model training method, device and storage medium
CN108021572A (en) Return information recommends method and apparatus
CN110717026B (en) Text information identification method, man-machine conversation method and related devices
CN110825863B (en) Text pair fusion method and device
CN111597804B (en) Method and related device for training entity recognition model
CN113239157B (en) Method, device, equipment and storage medium for training conversation model
CN114692639A (en) Text error correction method and electronic equipment
CN114328908A (en) Question and answer sentence quality inspection method and device and related products
US12008988B2 (en) Electronic apparatus and controlling method thereof
CN112328783A (en) Abstract determining method and related device
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
CN114840563B (en) Method, device, equipment and storage medium for generating field description information
US20240038223A1 (en) Speech recognition method and apparatus
CN114840499B (en) Method, related device, equipment and storage medium for generating table description information
CN115659959A (en) Image text error correction method and device, electronic equipment and storage medium
CN113703883A (en) Interaction method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant