WO2021196981A1 - Voice interaction method and apparatus, and terminal device

Info

Publication number: WO2021196981A1
Application number: PCT/CN2021/079479
Authority: WO
Inventors: 刘杰; 张晴
Original assignee: 华为技术有限公司
Priority date: 2020-03-31
Filing date: 2021-03-08
Publication date: 2021-10-07
Also published as: CN111428483A; CN111428483B

Abstract

A voice interaction method and apparatus and a terminal device, applicable to the technical field of artificial intelligence. The method comprises: obtaining historical dialogue data when a user statement to be replied is received; identifying target entity information in the user statement, and identifying historical entity information in the historical dialogue data; extracting key entity information associated with the user statement from the historical entity information; generating a target interaction statement according to the target entity information and the key entity information; and outputting a reply statement corresponding to the target interaction statement. According to the method, the accuracy of dialogue state tracking and user intention recognition can be improved, the natural language processing capacity of the dialogue system is improved, and the reply rationality of the dialogue system in the multi-round dialogue process is enhanced.

Description

Voice interaction method, device and terminal equipment

This application claims priority to Chinese patent application No. 202010244784.8, entitled "Voice Interaction Method, Apparatus and Terminal Equipment", filed with the State Intellectual Property Office on March 31, 2020, the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a voice interaction method, device and terminal equipment.

Background

Natural language processing (Natural Language Processing, NLP) is an important part of artificial intelligence (Artificial Intelligence, AI), and its typical application scenarios include task-oriented dialogue systems and machine translation. In a multi-round dialogue scenario based on natural language dialogue, how to track and determine the user's intention is a crucial link. In the process of Dialogue State Tracking (DST), it is necessary to dynamically adjust the model in combination with historical dialogue to extract the key information contained in the user's sentence, and then determine the user's intention, and complete the corresponding response in conjunction with the dialogue system.

In the prior art, machine-learning-based DST methods require the model to understand the content of multiple rounds of dialogue well, which places extremely high demands on the model and largely limits the accuracy of this type of method. Because natural language is highly abstract and multi-round dialogue is complex, current machine learning technology struggles to understand multiple rounds of dialogue fully and accurately in practical application scenarios; that is, it is difficult to accurately track the state of multiple rounds of dialogue and determine the user's intent.

Summary of the invention

The embodiments of the present application provide a voice interaction method, device, and terminal device, which can solve the problem of the difficulty in tracking the state of multiple rounds of dialogue in the prior art and the inability to accurately determine the user's intention.

In the first aspect, an embodiment of the present application provides a voice interaction method, including:

When a user sentence to be replied to is received, historical dialogue data is obtained, and a named entity recognition model identifies the target entity information in the user sentence and the historical entity information in the historical dialogue data. Key entity information associated with the user sentence is then extracted from the historical entity information, and the current user sentence is rewritten according to the target entity information and the key entity information to generate a target interactive sentence. Outputting a reply sentence corresponding to the target interactive sentence meets the user's interaction needs. In this embodiment, by acquiring entity information from historical dialogue rounds and combining it with the entity information in the current dialogue round to rewrite the sentence of the current round, the dialogue state tracking problem in multiple rounds of dialogue can be converted into a single-round dialogue problem. This makes it convenient to use existing mature single-round dialogue technology to respond to the user's intention, which helps to improve the accuracy of user intention recognition and enhances the language processing ability of the dialogue system.

In a possible implementation of the first aspect, to extract the key entity information associated with the user sentence from the historical entity information, candidate user intents that match the user sentence can first be determined based on the target entity information and the historical entity information; the distribution probability of each piece of historical entity information in the historical dialogue data is then calculated, so that the key entity information can be extracted from the historical entity information according to the distribution probabilities and the candidate user intents.

In a possible implementation of the first aspect, the distribution probability of each piece of historical entity information in the historical dialogue data can be calculated by calling a preset pointer generation network model. The pointer generation network model includes an encoding module, which can be used to encode each piece of historical entity information separately to obtain the corresponding distribution probability.

In a possible implementation of the first aspect, when extracting key entity information, candidate entity information associated with any candidate user intent can first be extracted from the historical entity information; the candidate entity information whose distribution probability is greater than a preset probability threshold is then extracted as key entity information.
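
The two-step screening described above can be illustrated with a minimal Python sketch. The function name `extract_key_entities`, the dictionary of probabilities, and the default threshold of 0.5 are assumptions made for illustration; the application does not specify concrete values or data structures.

```python
from typing import Dict, List, Set


def extract_key_entities(
    distribution_prob: Dict[str, float],
    intent_entities: Set[str],
    threshold: float = 0.5,  # hypothetical preset probability threshold
) -> List[str]:
    """Return candidate entities whose distribution probability exceeds the threshold."""
    # Step 1: keep only historical entities associated with a candidate user intent.
    candidates = [e for e in distribution_prob if e in intent_entities]
    # Step 2: keep candidates whose probability is strictly greater than the threshold.
    return [e for e in candidates if distribution_prob[e] > threshold]
```

For example, with probabilities `{"Beijing": 0.8, "Haidilao": 0.2, "Friday": 0.6}` and intent-related entities `{"Beijing", "Haidilao"}`, only `"Beijing"` passes both steps under the assumed threshold.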

In a possible implementation of the first aspect, for candidate entity information whose target probability value is less than the preset probability threshold, and for which the difference between the target probability value and the preset probability threshold is less than a preset difference, a query sentence can be generated according to that candidate entity information and the key entity information, asking the user to confirm the candidate entity information corresponding to the target probability value. If the user's confirmation of the query sentence is received, the candidate entity information corresponding to the target probability value can be determined to be key entity information. Here, the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.
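
As a hedged illustration of this near-threshold handling, the sketch below sorts a candidate into one of three outcomes. The names `classify_candidate` and `build_query`, the threshold of 0.5, the preset difference of 0.1, and the wording of the query are all hypothetical; the application only states the comparison logic.

```python
from typing import List


def classify_candidate(prob: float, threshold: float = 0.5, max_gap: float = 0.1) -> str:
    """Decide how to treat one candidate entity from its target probability value."""
    if prob > threshold:
        return "key"  # directly accepted as key entity information
    if prob < threshold and (threshold - prob) < max_gap:
        return "ask_user"  # close to the threshold: confirm with the user
    return "discard"


def build_query(candidate: str, key_entities: List[str]) -> str:
    """Build a confirmation question (hypothetical template, not from the application)."""
    context = ", ".join(key_entities)
    return f"Did you mean {candidate} (in the context of {context})?"
```

If the user confirms the query, the candidate would be promoted to key entity information; otherwise it would be dropped.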

In a possible implementation of the first aspect, when generating the target interactive sentence based on the target entity information and the key entity information, a target basic sentence can be determined first, and the target entity information and the key entity information can then be used to rewrite the target basic sentence to obtain the target interactive sentence, which reduces the difficulty of generating the target interactive sentence directly.

In a possible implementation of the first aspect, multiple basic sentences can be obtained from the user sentence containing the target entity information and from the historical dialogue data containing the key entity information. The matching degree between each basic sentence and the entity information to be evaluated is then calculated, and the basic sentence with the maximum matching degree is identified as the current target basic sentence. The entity information to be evaluated includes all of the target entity information and the key entity information.

In a possible implementation of the first aspect, each basic sentence may include multiple semantic slots, and the matching degree between the entity information to be evaluated and a basic sentence may be determined from the ratio between the number of key slots and the number of semantic slots in the basic sentence. Therefore, for any basic sentence, the number of semantic slots in the basic sentence and the number of pieces of entity information to be evaluated can be counted first; the number of key slots in the basic sentence that match the respective pieces of entity information to be evaluated is then determined; and the ratio between the number of key slots and the number of semantic slots in the basic sentence is calculated and used as the matching degree between the entity information to be evaluated and the basic sentence.
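
The ratio described above can be sketched in a few lines of Python. The representation is an assumption: each basic sentence is mapped to a list of slot type names, and each piece of entity information to be evaluated is reduced to the slot type it would fill. The function names are hypothetical.

```python
from typing import Dict, List, Set


def matching_degree(semantic_slots: List[str], entity_slot_types: Set[str]) -> float:
    """Ratio of key slots (slots matched by an entity) to all semantic slots."""
    if not semantic_slots:
        return 0.0  # a sentence without slots cannot match anything
    key_slots = [slot for slot in semantic_slots if slot in entity_slot_types]
    return len(key_slots) / len(semantic_slots)


def pick_target_basic_sentence(
    basic_sentences: Dict[str, List[str]], entity_slot_types: Set[str]
) -> str:
    """Return the basic sentence with the maximum matching degree."""
    return max(
        basic_sentences,
        key=lambda s: matching_degree(basic_sentences[s], entity_slot_types),
    )
```

With entities of types `{"date", "location"}`, a sentence whose slots are `["date", "location"]` scores 2/2, while one whose slots are `["location", "cuisine", "price"]` scores 1/3, so the former would be chosen as the target basic sentence.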

In a possible implementation of the first aspect, the pointer generation network model may also include a decoding module, which may be obtained by training on multiple kinds of training data. The training data include multiple pieces of entity information and the basic sentence corresponding to each piece of entity information. The decoding module can therefore be used to rewrite the target basic sentence and generate the target interactive sentence. Specifically, if the target basic sentence is the current user sentence, the decoding module can be used to decode the key entity information and the target basic sentence and output the target interactive sentence; if the target basic sentence is a user sentence in the historical dialogue data, the decoding module can be used to decode the target entity information, the key entity information and the target basic sentence, and output the target interactive sentence.
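
The branching on the origin of the target basic sentence can be sketched as follows. This only illustrates how the decoder inputs might be assembled; the function name `decoder_inputs` and the flat list representation are assumptions, and the decoding module itself (a trained model) is not shown.

```python
from typing import List


def decoder_inputs(
    target_basic_sentence: str,
    current_user_sentence: str,
    target_entities: List[str],
    key_entities: List[str],
) -> List[str]:
    """Assemble the items handed to the decoding module for rewriting."""
    if target_basic_sentence == current_user_sentence:
        # The current user sentence is the basic sentence, so only the key
        # entity information from the history is decoded along with it.
        return list(key_entities) + [target_basic_sentence]
    # The basic sentence comes from the historical dialogue, so the target
    # entity information from the current round is decoded as well.
    return list(target_entities) + list(key_entities) + [target_basic_sentence]
```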

In a possible implementation of the first aspect, after the target interactive sentence is obtained, it is also possible to verify whether the rewritten target interactive sentence is correct. This embodiment provides a two-layer verification mechanism. First, the multiple pieces of entity information in the target interactive sentence are extracted, and it is verified whether they match the preset semantic slots of a target user intent in the knowledge base, where the target user intent is any one of the candidate user intents. If the entity information in the target interactive sentence matches the semantic slots of the target user intent, the generated target interactive sentence is determined to be correct, and the step of outputting a reply sentence corresponding to the target interactive sentence is executed. If it does not match, the target interactive sentence is verified a second time according to its sentence type. In the second verification, a preset natural language understanding model is called to judge whether the target interactive sentence is a task-type sentence. If it is, a corresponding reply sentence is output; if it is not, the user is prompted to re-enter the user sentence and restate the intention, and the target interactive sentence is generated again from the re-entered user sentence.
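
A minimal sketch of the two-layer verification follows. Two points are assumptions rather than details from the application: "matching" is modelled as the entity slot types being a non-empty subset of the intent's preset slots, and the natural language understanding model is replaced by a caller-supplied function `is_task_sentence`.

```python
from typing import Callable, Set


def verify_target_sentence(
    sentence: str,
    entity_slot_types: Set[str],
    intent_slots: Set[str],
    is_task_sentence: Callable[[str], bool],
) -> str:
    """Return "reply" if the rewritten sentence passes either check, else "re_enter"."""
    # Layer 1: do the extracted entities fit the preset semantic slots of the intent?
    # (Subset test is an illustrative assumption for "match".)
    if entity_slot_types and entity_slot_types <= intent_slots:
        return "reply"
    # Layer 2: fall back to the sentence type judged by the NLU model stand-in.
    if is_task_sentence(sentence):
        return "reply"
    # Neither check passed: prompt the user to re-enter the sentence.
    return "re_enter"
```

The second layer is only consulted when the slot check fails, which mirrors the order of the two verifications described above.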

In the second aspect, an embodiment of the present application provides a voice interaction device, including:

The historical dialogue data acquisition module is used to acquire historical dialogue data when the user sentence to be replied is received;

The target entity information identification module is used to identify the target entity information in the user sentence; and,

The historical entity information identification module is used to identify the historical entity information in the historical dialogue data;

A key entity information extraction module for extracting key entity information associated with the user sentence from the historical entity information;

A target interactive sentence generating module, configured to generate a target interactive sentence according to the target entity information and the key entity information;

The reply sentence output module is used to output the reply sentence corresponding to the target interactive sentence.

In a possible implementation of the second aspect, the key entity information extraction module may specifically include the following submodules:

A candidate user intention determination sub-module, configured to determine a candidate user intention that matches the user sentence according to the target entity information and the historical entity information;

The distribution probability calculation sub-module is used to separately calculate the distribution probability of each historical entity information in the historical dialogue data;

The key entity information extraction sub-module is used to extract key entity information from the historical entity information according to the distribution probability and the candidate user's intention.

In a possible implementation of the second aspect, the distribution probability calculation submodule may specifically include the following units:

The first pointer generation network model calling unit is configured to call a preset pointer generation network model, and use the encoding module of the pointer generation network model to encode each piece of historical entity information separately to obtain the distribution probability corresponding to each piece of historical entity information.

In a possible implementation of the second aspect, the key entity information extraction submodule may specifically include the following units:

A candidate entity information extraction unit, configured to extract candidate entity information associated with any candidate user's intention from the historical entity information;

The key entity information extraction unit is configured to extract candidate entity information whose distribution probability is greater than a preset probability threshold as key entity information.

In a possible implementation of the second aspect, the key entity information extraction submodule may further include the following units:

The query sentence generating unit is configured to: if the difference between the target probability value and the preset probability threshold is less than the preset difference, and the target probability value is less than the preset probability threshold, generate a query sentence according to the candidate entity information corresponding to the target probability value and the key entity information, to instruct the user to confirm the candidate entity information corresponding to the target probability value;

The key entity information determining unit is configured to determine the candidate entity information corresponding to the target probability value as the key entity information when the user's confirmation information for the query sentence is received, where the target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.

In a possible implementation of the second aspect, the target interactive sentence generation module may specifically include the following submodules:

The target basic sentence determination sub-module is used to determine the target basic sentence;

The target interactive sentence generating sub-module is used to use the target entity information and the key entity information to rewrite the target basic sentence to generate a target interactive sentence.

In a possible implementation of the second aspect, the target basic sentence determination submodule may specifically include the following units:

The basic sentence obtaining unit is configured to obtain a plurality of basic sentences from the user sentence containing the target entity information and the historical dialogue data containing the key entity information;

A matching degree calculation unit, configured to calculate the matching degree between the plurality of basic sentences and the entity information to be evaluated, where the entity information to be evaluated includes the target entity information and the key entity information;

The target basic sentence identification unit is used to identify the basic sentence corresponding to the maximum matching degree as the current target basic sentence.

In a possible implementation of the second aspect, any basic sentence may respectively include multiple semantic slots, and the matching degree calculation unit may specifically include the following sub-units:

The statistics subunit is used to count the number of semantic slots in the basic sentence and the number of entity information to be evaluated for any basic sentence;

The determining subunit is used to determine the number of key slots in the basic sentence that respectively match the entity information to be evaluated;

The calculation subunit is used to calculate the ratio between the number of key slots and the number of semantic slots in the basic sentence, and use the ratio as the matching degree between the entity information to be evaluated and the basic sentence.

In a possible implementation of the second aspect, the pointer generation network model may also include a decoding module, which is obtained by training on multiple kinds of training data. The training data include multiple pieces of entity information and the basic sentence corresponding to each piece of entity information. The aforementioned target interactive sentence generating sub-module may specifically include the following units:

The second pointer generation network model calling unit is configured to: if the target basic sentence is the user sentence, use the decoding module to decode the key entity information and the target basic sentence, and output a target interactive sentence; if the target basic sentence is the historical dialogue data, use the decoding module to decode the target entity information and the target basic sentence, and output a target interactive sentence.

In a possible implementation of the second aspect, the target interactive sentence generation submodule may further include the following units:

The target interactive sentence entity information extraction unit is used to extract multiple entity information in the target interactive sentence;

The target interactive sentence verification unit is used to verify whether the multiple entity information in the target interactive sentence matches a preset semantic slot of the target user's intention in the knowledge base, where the target user's intention is any one of the candidate user intentions; if the multiple entity information in the target interactive sentence matches the semantic slot of the target user's intention, it is determined that the generated target interactive sentence is correct, and a reply sentence corresponding to the target interactive sentence is output; if the multiple entity information in the target interactive sentence does not match the semantic slot of the target user's intention, the target interactive sentence is verified according to the sentence type of the target interactive sentence.

In a possible implementation of the second aspect, the target interactive sentence verification unit is further configured to call a preset natural language understanding model to determine whether the target interactive sentence is a task-type sentence; if the target interactive sentence is a task-type sentence, the reply sentence corresponding to the target interactive sentence is output; if the target interactive sentence is not a task-type sentence, the user is prompted to re-enter the user sentence, and the target interactive sentence is generated again according to the re-entered user sentence.

In the third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the voice interaction method described in any one of the implementations of the foregoing first aspect is implemented.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of a terminal device, the voice interaction method described in any one of the implementations of the foregoing first aspect is implemented.

In a fifth aspect, the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the voice interaction method described in any one of the above-mentioned first aspects.

Compared with the prior art, the embodiments of the present application include the following beneficial effects:

In the embodiments of this application, by identifying the target entity information in the current dialogue round and extracting key entity information from the historical dialogue data, the actual intention of the user can be determined based on these two kinds of entity information, and the user sentence of the current round can be rewritten according to that intention to generate a target interactive sentence, so that applications such as the voice assistant in the terminal device can respond according to the target interactive sentence. By converting the DST problem in multiple rounds of dialogue, to a certain extent, into a single-round dialogue problem, this embodiment can use existing mature single-round dialogue technology to respond to the user's intention. This improves the accuracy of dialogue state tracking and user intention recognition, improves the natural language processing capability of the dialogue system, and makes the system's replies during multiple rounds of dialogue more reasonable, so that they better match the user's actual needs and reduce the number of interactions between the user and the dialogue system.

Description of the drawings

FIG. 1 is a schematic diagram of the operation process of a multi-round dialogue state tracking scheme based on knowledge base reasoning in the prior art;

FIG. 2 is a schematic diagram of the operation process of a multi-round dialogue state tracking solution based on a learning model in the prior art;

FIG. 3 is a schematic diagram of the operation process of a voice interaction method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an application scenario of a voice interaction method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the hardware structure of a mobile phone to which the voice interaction method provided by an embodiment of the present application is applicable;

FIG. 6 is a schematic diagram of the software structure of a mobile phone to which the voice interaction method provided by an embodiment of the present application is applicable;

FIG. 7 is a schematic step flowchart of a voice interaction method provided by an embodiment of the present application;

FIG. 8 is a schematic step flowchart of a voice interaction method provided by another embodiment of the present application;

FIG. 9 is a schematic diagram of a calculation process of the distribution probability of entity information provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of the operation process of a voice interaction method provided by another embodiment of the present application;

FIG. 11 is a structural block diagram of a voice interaction device provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed description

In order to facilitate understanding, firstly, several typical multi-round dialogue state tracking solutions in the prior art are introduced.

As shown in FIG. 1, it is a schematic diagram of the operation process of a multi-round dialogue state tracking solution in the prior art. This scheme is a scheme based on question and answer (Question&Answering, QA) knowledge base reasoning, and its specific operation process is as follows:

First, determine the keywords of the multiple rounds of dialogue according to the current multiple rounds of dialogue and the current input, as the input of the current dialogue state, and then search in the knowledge base according to predefined rules. This step is shown in box 101 in FIG. 1.

Then, after searching, the corresponding candidate multi-round dialogue set can be obtained, as shown in box 102 in FIG. 1.

Finally, according to the predefined similarity calculation rules, the similarity between the current dialogue and each candidate dialogue is calculated. The specific strategies include: calculating the semantic similarity between the current input and the candidate question as the first similarity; calculating the semantic similarity between the context of the current input and the context of each candidate question as the second similarity; and calculating the similarity between the summary information of the current multiple rounds of dialogue and that of each candidate multi-round dialogue as the third similarity. The weighted sum of the three similarities gives the similarity between each candidate question and the current input, and the response corresponding to the candidate question with the largest similarity is used as the output response. This step is shown in box 103 in FIG. 1.
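
The weighted summation used by this prior-art scheme can be sketched as below. The weights `(0.5, 0.3, 0.2)`, the function names, and the candidate record layout are assumptions for illustration only; the scheme as described does not give concrete weights or say how the three similarities themselves are computed.

```python
from typing import Dict, List, Sequence


def combined_similarity(
    s1: float, s2: float, s3: float, weights: Sequence[float] = (0.5, 0.3, 0.2)
) -> float:
    """Weighted sum of the first, second and third similarities."""
    w1, w2, w3 = weights
    return w1 * s1 + w2 * s2 + w3 * s3


def best_response(candidates: List[Dict], weights: Sequence[float] = (0.5, 0.3, 0.2)) -> str:
    """Return the response of the candidate question with the largest combined similarity."""
    best = max(candidates, key=lambda c: combined_similarity(*c["sims"], weights))
    return best["response"]
```

Each candidate is assumed to carry a `"sims"` tuple holding its three similarities and a `"response"` string; a candidate that scores moderately on all three can outrank one that scores highly on only the first.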

In the scheme shown in FIG. 1, first, the key information extracted from multiple rounds of dialogue has no primary or secondary distinction; that is, the key information related to the current round of input is not singled out, and the redundant information that is extracted affects the accuracy of dialogue state tracking. Second, the accuracy of dialogue state tracking depends largely on the coverage of the knowledge base, and given the complexity of natural language dialogue in real scenarios, it is difficult in practice to obtain an ideal knowledge base with extensive coverage. Third, the way the state tracking results are obtained in this scheme depends on various predefined rules, which also greatly affects the generalization ability and robustness of the model.

As shown in FIG. 2, it is a schematic diagram of the operation process of another multi-round dialogue state tracking solution in the prior art. This solution is a DST solution based on a learning model. It tracks the state information of each round of dialogue in turn and updates the state of the current round of dialogue through a copy flow mechanism, thereby tracking the long-term dialogue state. The specific operation process includes steps S201-S204:

First, the key information in the current round of dialogue and the previous round of dialogue is extracted through the semi-supervised neural network model, and the keyword sequences corresponding to the above two rounds of sentences are generated.

Then, a new encoder-decoder network based on the copy flow mechanism is adopted to express the dialogue state information as an explicit sequence of words. The copy flow mechanism can pass the information flow of the dialogue history along by copying, so that it finally participates in generating the target sentence for the current round's reply.

Finally, according to the status information of the current round of dialogue obtained above, the decoder module is used to automatically generate the target sentence of the current round of dialogue reply, and then complete the response to the user's inquiry.

In the scheme shown in FIG. 2, first, the key information of the historical dialogue is extracted by a semi-supervised neural network model, which may lead to key information being lost or extracted incorrectly and thus affect the understanding of the historical dialogue. Second, tracking the historical dialogue and updating the dialogue state round by round easily leads to higher time complexity of the model and to error accumulation. Third, this solution relies too heavily on the model's ability to understand the historical dialogue, and it is difficult for current encoder-decoder networks to achieve an accuracy high enough to meet the needs of practical scenarios.

In order to solve the above-mentioned problems, the core idea of the embodiments of the present application is that, in the multi-round dialogue state tracking process of the dialogue system, the sentence of the current round is rewritten based on the key information in the historical dialogue to fill in the information omitted from the current round of dialogue, thereby converting the multi-round dialogue problem into a single-round dialogue problem. The voice interaction method provided by the embodiments of the present application tracks user intentions based on key information from historical dialogues, overcomes the shortcomings of the prior-art solutions to a certain extent, and improves both the accuracy of state tracking in multiple rounds of dialogue and the accuracy of the dialogue system's responses.

The voice interaction method provided by the embodiments of this application uses a named entity recognition (Named Entity Recognition, NER) module to extract entity information from historical dialogues, and then uses knowledge bases (Knowledge Bases, KBs) and a pointer-generator network (Pointer-Generator Networks, PGN) model to calculate the attention distribution between the entity information in the historical dialogue and the entities in the current round of dialogue. By filtering the entity information in the historical dialogue, redundant entities are discarded and the key information that participates in state tracking for the current round of dialogue is determined. This not only reduces the impact of redundant information on dialogue state tracking, but also provides effective key information for subsequent steps.

Then, combined with the knowledge base and a supervised feedforward neural network, the role of the key entity information in the dialogue state (represented by the distribution probability) is calculated; the feedforward neural network serves as part of the overall model and determines whether the key entity information directly affects the dialogue state. This avoids tracking the dialogue state round by round and reduces the accumulation of errors. After the current round's sentence is rewritten in the decoding step of the PGN model, a mature single-round dialogue module is used to generate the reply sentence for the multi-round dialogue, improving the accuracy of the dialogue system's response.

As shown in FIG. 3, it is a schematic diagram of the operation process of the voice interaction method provided by an embodiment of the present application. According to the running process shown in Figure 3, this method first extracts key information directly related to the dialogue state in the historical dialogue, and then rewrites the current round of sentences in combination with the model to complete the tracking and fusion of dialogue state information. Then, using the existing single-round dialogue processing module, on the basis of rewriting the sentence, the corresponding reply to the user's inquiry in the multi-round dialogue is completed.

Based on the above operation process, the operation process of this method can be realized by the following multiple modules:

1. Entity extraction module

Used to extract entity information in the historical dialogue as candidate key information.

2. Entity screening module

Used to filter out the key information related to the dialogue state of the current round, by combining the KBs and the PGN model to calculate the attention distribution between the entity information and the entities in the current round's sentence.

3. Key information distribution prediction module

Using KBs and key information, the probability distribution of key information is calculated based on a supervised feedforward network.

4. Statement rewriting module

Based on the probability distribution of key information, the decoder link of the PGN model is used to complete or rewrite the current round of sentences.

5. Reply generation module

Used to generate reply sentences in the multi-round dialogue by utilizing existing mature single-round dialogue system processing technology.
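
The five modules above can be read as a simple pipeline. The following is a minimal sketch of that wiring, assuming each module is supplied as a plain Python callable. The function name run_pipeline and all parameter names are hypothetical and are not taken from this application; the actual modules rely on the NER, KBs, PGN and feedforward models described above.

```python
from typing import Callable, Dict, List


def run_pipeline(
    history: List[str],
    current: str,
    extract: Callable[[List[str]], List[str]],
    screen: Callable[[List[str], str], List[str]],
    predict: Callable[[List[str]], Dict[str, float]],
    rewrite: Callable[[str, Dict[str, float]], str],
    reply: Callable[[str], str],
) -> str:
    """Chain the five modules in the order listed above (hypothetical wiring)."""
    candidates = extract(history)                # 1. entity extraction
    key_entities = screen(candidates, current)   # 2. entity screening
    distribution = predict(key_entities)         # 3. key information distribution prediction
    rewritten = rewrite(current, distribution)   # 4. sentence rewriting
    return reply(rewritten)                      # 5. reply generation
```

Passing the modules in as callables keeps the sketch independent of any particular model; each stage can be replaced without changing the order of operations.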

FIG. 4 is a schematic diagram of an application scenario of a voice interaction method provided by an embodiment of the present application. In the typical application scenario shown in Figure 4, in the first round of dialogue, the user's query sentence is: "Please tell me the nearest restaurant." In response to the query, the dialogue system replies: "The nearest restaurant is Haidilao on Nongda South Road, Haidian District." Then, in the second round of dialogue, the user replies: "Okay." and continues to ask: "What is the temperature on Friday?" At this point, the user's query sentence does not contain any location-related information, so the dialogue system asks the user which place the temperature should be queried for, that is: "Which city do you want to check the temperature on Friday?" The user replies: "Beijing." This reply can be regarded as the current round of dialogue, and the first two rounds of dialogue are the historical dialogue.

For the historical dialogue, the entities it contains can be extracted by the entity extraction module, including "Nearest", "Restaurant", "Nongda South Road in Haidian District", "Haidilao", "Friday" and "Temperature". On this basis, the entity screening module, combined with the predefined KBs and the PGN model, calculates the probability distribution of the above entities in the historical dialogue and obtains the key entities related to the dialogue state, namely "Friday" and "Temperature". The other entities in the historical dialogue, such as "Nearest", "Restaurant", "Nongda South Road in Haidian District" and "Haidilao", are redundant entities. The dialogue system can determine the basic rewrite sentence based on the key entities obtained, that is, "What is the temperature on Friday?"

Specifically, the key information distribution prediction module can be used to further predict, based on the feedforward network, the probability distribution of the key information related to the current dialogue state, obtaining a probability of 0.86 for "temperature" and 0.72 for "Friday". For entities whose probability value is near the threshold of 0.8 but lower than it, the dialogue system can invite the user to confirm; that is, the dialogue system in Figure 4 can ask the user: "Do you want to query the temperature information of "Friday" in Beijing?" The sentence rewriting module is then used to generate the rewritten sentence "What is the temperature in Beijing on Friday?"
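
The threshold handling described above can be sketched as follows. This is only an illustration: the probabilities 0.86 and 0.72 and the threshold 0.8 come from the example, while the function name split_by_threshold and the margin parameter (how far below the threshold still counts as "near") are assumptions, since the application does not quantify that band.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    probabilities: Dict[str, float],
    threshold: float = 0.8,
    margin: float = 0.1,  # assumed width of the "near the threshold" band
) -> Tuple[List[str], List[str]]:
    """Return (accepted entities, entities the user should be asked to confirm)."""
    accepted: List[str] = []
    to_confirm: List[str] = []
    for entity, p in probabilities.items():
        if p >= threshold:
            accepted.append(entity)       # confident enough to use directly
        elif p >= threshold - margin:
            to_confirm.append(entity)     # near but below the threshold: ask the user
        # entities far below the threshold are treated as redundant and dropped
    return accepted, to_confirm


accepted, to_confirm = split_by_threshold({"temperature": 0.86, "Friday": 0.72})
```

With the values from the example, "temperature" is accepted directly and "Friday" is placed in the list for user confirmation.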

For the rewritten user sentence, the dialogue system can use the reply generation module to generate a reply to the rewritten sentence, that is, the reply of the dialogue system in Figure 4: "The temperature in Beijing on Friday is..."

It should be noted that the basic rewrite sentence may come from the historical dialogue data or from the current dialogue. That is, the basic sentence may be selected from a user sentence in the historical dialogue or from the current user sentence. The selection criterion can be the number of pieces of target entity information and key entity information contained in the sentence. For example, in the above example, the historical sentence "What is the temperature on Friday?" contains two pieces of key entity information, "Friday" and "Temperature", while the current sentence "Beijing" contains only one piece of target entity information, so the historical sentence "What is the temperature on Friday?" can be chosen as the basis for rewriting.

The voice interaction method of the present application will be described in detail below in conjunction with specific embodiments.
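
The selection criterion above, which picks the sentence containing the most entity information, can be illustrated with a short sketch. The function names are hypothetical, and the case-insensitive substring match is a simplification that stands in for real entity recognition.

```python
from typing import Iterable, List


def count_entities(sentence: str, entities: Iterable[str]) -> int:
    """Count how many of the given entities occur in the sentence (substring match)."""
    lowered = sentence.lower()
    return sum(1 for e in entities if e.lower() in lowered)


def choose_basis_sentence(candidates: List[str], entities: Iterable[str]) -> str:
    """Pick the candidate with the most entities; ties go to the earliest candidate."""
    entity_list = list(entities)
    return max(candidates, key=lambda s: count_entities(s, entity_list))


basis = choose_basis_sentence(
    ["What is the temperature on Friday?", "Beijing"],
    ["Friday", "temperature", "Beijing"],
)
```

Here the historical sentence contains two entities and the current sentence contains one, so the historical sentence is selected, matching the example in the text.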

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are proposed for a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted to avoid unnecessary details from obstructing the description of this application.

The terms used in the following embodiments are only for the purpose of describing specific embodiments, and are not intended to limit the application. As used in the specification and appended claims of this application, the singular expressions "a", "an", "said", "above", "the" and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates to the contrary. It should also be understood that in the embodiments of the present application, "one or more" refers to one, two or more than two; "and/or" describes the association relationship of the associated objects, indicating that there may be three relationships; for example, A and/or B can mean the situation where A exists alone, A and B exist at the same time, and B exists alone, where A and B can be singular or plural. The character "/" generally indicates that the associated objects before and after are in an "or" relationship.

The voice interaction method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (ultra-mobile personal computer, UMPC), netbooks, and personal digital assistants (personal digital assistant, PDA). The embodiments of this application do not impose any restrictions on the specific types of terminal devices.

Take the terminal device being a mobile phone as an example. Fig. 5 shows a block diagram of a part of the structure of a mobile phone provided by an embodiment of the present application. Referring to FIG. 5, the mobile phone includes: a radio frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (Wi-Fi) module 570, a processor 580, and a power supply 590. Those skilled in the art can understand that the structure of the mobile phone shown in FIG. 5 does not constitute a limitation on the mobile phone, which may include more or fewer components than those shown in the figure, or a combination of certain components, or different component arrangements.

The following is a detailed introduction to each component of the mobile phone in conjunction with Figure 5:

The RF circuit 510 can be used for receiving and sending signals during information transmission or communication. In particular, after downlink information of the base station is received, it is passed to the processor 580 for processing; in addition, uplink data is sent to the base station. Generally, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 510 can also communicate with the network and other devices through wireless communication. The above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS), etc.

The memory 520 may be used to store software programs and modules. The processor 580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created through the use of the mobile phone (such as audio data, phone book, etc.), etc. In addition, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.

The input unit 530 may be used to receive inputted digital or character information, and generate key signal input related to user settings and function control of the mobile phone 500. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 531 using any suitable object or accessory such as a finger or a stylus), and drive the corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 580, and can receive and execute the commands sent by the processor 580. In addition, the touch panel 531 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 531, the input unit 530 may also include other input devices 532. Specifically, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as a volume control button, a switch button, etc.), a trackball, a mouse, and a joystick.

The display unit 540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 540 may include a display panel 541. Optionally, the display panel 541 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), etc. Further, the touch panel 531 can cover the display panel 541. When the touch panel 531 detects a touch operation on or near it, the operation is transmitted to the processor 580 to determine the type of the touch event, and the processor 580 then provides corresponding visual output on the display panel 541 according to the type of the touch event. Although in FIG. 5 the touch panel 531 and the display panel 541 are used as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 can be integrated to realize the input and output functions of the mobile phone.

The mobile phone 500 may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor can include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 541 according to the brightness of the ambient light. The proximity sensor can turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when stationary. It can be used in applications that identify the posture of the mobile phone (such as switching between horizontal and vertical screens, related games, and magnetometer posture calibration) and in functions related to vibration recognition (such as a pedometer and tapping). Other sensors that can also be configured in the mobile phone, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, are not described in detail here.

The audio circuit 560, the speaker 561, and the microphone 562 can provide an audio interface between the user and the mobile phone. The audio circuit 560 can convert the received audio data into an electric signal and transmit it to the speaker 561, which converts it into a sound signal for output; on the other hand, the microphone 562 converts the collected sound signal into an electric signal, which the audio circuit 560 receives and converts into audio data; the audio data is then output to the processor 580 for processing and sent to, for example, another mobile phone via the RF circuit 510, or the audio data is output to the memory 520 for further processing.

Wi-Fi is a short-distance wireless transmission technology. Through the Wi-Fi module 570, mobile phones can help users send and receive emails, browse web pages, and access streaming media. It provides users with wireless broadband Internet access. Although FIG. 5 shows the Wi-Fi module 570, it is understandable that it is not a necessary component of the mobile phone 500, and can be omitted as needed without changing the essence of the invention.

The processor 580 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data so as to monitor the mobile phone as a whole. Optionally, the processor 580 may include one or more processing units; preferably, the processor 580 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, user interface, application programs, etc., and the modem processor mainly deals with wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 580.

The mobile phone 500 also includes a power source 590 (such as a battery) for supplying power to various components. Preferably, the power source can be logically connected to the processor 580 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.

Although not shown, the mobile phone 500 may also include a camera. Optionally, the position of the camera on the mobile phone 500 may be front or rear, which is not limited in the embodiment of the present application.

Optionally, the mobile phone 500 may include a single camera, a dual camera, or a triple camera, etc., which is not limited in the embodiment of the present application.

For example, the mobile phone 500 may include three cameras, of which one is a main camera, one is a wide-angle camera, and one is a telephoto camera.

Optionally, when the mobile phone 500 includes multiple cameras, the multiple cameras may be all front-mounted, or all rear-mounted, or partly front-mounted and another part rear-mounted, which is not limited in the embodiment of the present application.

In addition, although not shown, the mobile phone 500 may also include a Bluetooth module, etc., which will not be repeated here.

FIG. 6 is a schematic diagram of the software structure of a mobile phone 500 according to an embodiment of the present application. Taking the case where the operating system of the mobile phone 500 is the Android system as an example, in some embodiments, the Android system is divided into four layers, namely the application layer, the application framework layer (framework, FWK), the system layer, and the hardware abstraction layer. The layers communicate with each other through software interfaces.

As shown in Figure 6, the application layer may include a series of application packages, which may include applications such as short message, calendar, camera, video, navigation, gallery, and call.

The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.

As shown in Figure 6, the application framework layer can include a window manager, a resource manager, and a notification manager.

The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc. The content provider is used to store and retrieve data and make these data accessible to applications. The data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.

The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.

The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and so on. A notification may also appear in the status bar at the top of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or appear on the screen in the form of a dialog window. For example, a text message is prompted in the status bar, a prompt sound is played, the electronic device vibrates, or an indicator light flashes.

The application framework layer can also include:

A view system, which includes visual controls, such as controls that display text, controls that display pictures, and so on. The view system can be used to build applications. The display interface can be composed of one or more views. For example, a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.

The phone manager is used to provide the communication function of the mobile phone 500, for example, the management of the call status (including connecting, hanging up, etc.).

The system layer can include multiple functional modules. For example: sensor service module, physical state recognition module, 3D graphics processing library (for example: OpenGL ES), etc.

The sensor service module is used to monitor the sensor data uploaded by various sensors at the hardware layer and determine the physical state of the mobile phone 500;

Physical state recognition module, used to analyze and recognize user gestures, faces, etc.;

The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.

The system layer can also include:

The surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.

The hardware abstraction layer is the layer between hardware and software. The hardware abstraction layer can include display drivers, camera drivers, sensor drivers, etc., used to drive related hardware at the hardware layer, such as display screens, cameras, sensors, and so on.

The following embodiments can be implemented on the mobile phone 500 having the above hardware structure/software structure. The following embodiments will take the mobile phone 500 as an example to describe the voice interaction method provided by the embodiments of the present application.

Referring to FIG. 7, a schematic step flowchart of a voice interaction method provided by an embodiment of the present application is shown. As an example and not a limitation, the method may be applied to the above-mentioned mobile phone 500, and the method may specifically include the following steps:

S701: When a user sentence to be replied is received, historical dialogue data is obtained.

In the embodiment of the present application, the user sentence may be a certain sentence directly spoken by the user when using an application such as a voice assistant in a terminal device. For example, if a user wants to inquire about the weather tomorrow, the user can wake up the voice assistant in the phone and say "what is the weather tomorrow" or a similar sentence.

Generally, the user can conduct multiple rounds of dialogue with the voice assistant so that the voice assistant fully and accurately understands the user's intention and returns information that satisfies the intention. The user sentence to be replied in this embodiment may be a sentence or word spoken by the user during a round of dialogue other than the first, that is, the voice assistant has completed at least one round of dialogue with the user before receiving the user sentence to be replied.

In the embodiment of the present application, in order to better understand the user's intention, after receiving the user sentence, a program such as the voice assistant can obtain the dialogue data of the previous rounds of dialogue between the user and the voice assistant in the current dialogue process, and determine the real intention of the user in this round of dialogue in combination with the historical dialogue data.

In a specific implementation, the historical dialogue data can be all the dialogue data since the user woke up the voice assistant this time, or it can be the dialogue data of a specific number of previous rounds, such as the three rounds preceding the current round of dialogue; this embodiment places no restriction on this.
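
The two options for the scope of the historical dialogue data can be sketched as below. The Turn type, the function name select_history and the last_n parameter are hypothetical names introduced only for this illustration.

```python
from typing import List, Optional, Tuple

# One completed round: (user sentence, assistant reply). Hypothetical representation.
Turn = Tuple[str, str]


def select_history(turns: List[Turn], last_n: Optional[int] = None) -> List[Turn]:
    """Return all completed rounds, or only the most recent last_n rounds."""
    if last_n is None:
        return list(turns)      # everything since the assistant was woken up
    if last_n <= 0:
        return []               # guard: turns[-0:] would otherwise return every round
    return turns[-last_n:]      # e.g. the three rounds before the current one
```

For example, select_history(turns, last_n=3) keeps only the three rounds immediately before the current round, while select_history(turns) keeps the whole session.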

S702: Identify the target entity information in the user sentence, and identify the historical entity information in the historical dialogue data.

Entity is a term often used in the information world to represent a conceptual thing. Generally, nouns can be used to represent entity information, such as names of persons, places, organizations, etc.; a small amount of entity information can also be represented by other part-of-speech words, such as adjectives.

In the embodiment of the present application, user sentences and entity information in historical dialogue data can be identified based on the NER model.

In a specific implementation, for the received user sentence, the sentence can be segmented first, and then each word after the segmentation is judged one by one whether it belongs to an entity word, and each entity word is labeled.
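
The segment-then-label procedure can be illustrated with a toy example. This sketch replaces the NER model with a small hand-written lexicon, which is purely an assumption for illustration; the lexicon contents, the label names and the function name label_entities are hypothetical. The label "O" marks words that are not entity words.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical entity lexicon; a real system would use a trained NER model instead.
ENTITY_LEXICON: Dict[str, str] = {
    "tomorrow": "time",
    "friday": "time",
    "beijing": "location",
    "weather": "weather_condition",
    "temperature": "weather_condition",
}


def label_entities(sentence: str) -> List[Tuple[str, str]]:
    """Segment the sentence into words, then label each word one by one."""
    words = re.findall(r"\w+", sentence.lower())                 # step 1: segmentation
    return [(w, ENTITY_LEXICON.get(w, "O")) for w in words]      # step 2: labeling


labels = label_entities("What is the weather tomorrow")
```

In this example, "weather" and "tomorrow" receive entity labels and the remaining words are labeled "O".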

Of course, the entity information identified from the user sentences of the current round of dialogue can be used as historical entity information in the next round and subsequent dialogue rounds. Therefore, for the entity information in the historical dialogue data, each sentence in the historical dialogue data can be obtained and segmented to find the entity information; or the words that have already been marked as entity information in previous rounds can be directly extracted as historical entity information, which is not limited in this embodiment.

S703: Extract key entity information associated with the user sentence from the historical entity information.

Since the user sentence to be replied is the dialogue sentence of the current round, and the entity information contained therein is basically closely related to the user's intention, all the target entity information contained in the user sentence can be retained. For historical entity information, it is necessary to distinguish which is useful information for the current round of dialogue, and which is redundant information.

Therefore, after the historical entity information is extracted, the key entity information associated with the current round of dialogue sentences can be filtered from the historical entity information. These key entity information can be regarded as information that has obvious benefits in identifying the user's intention.

In the embodiments of the present application, multiple user intentions can be set in the voice assistant according to different application scenarios, and multiple pieces of associated entity information can be configured for each user intention. After the target entity information in the user sentence is identified, the other entity information that may be included in an intention containing the target entity information can be determined, and the key entity information can then be identified from the historical entity information accordingly. For example, for the intention of "weather forecast", multiple pieces of entity information such as "time", "location", and "weather conditions" can be configured for it. If the target entity information is "weather conditions", the entity information in the historical entity information that meets the "time" and "location" requirements can be identified as key entity information.
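
The intention-based screening in the "weather forecast" example can be sketched as follows. The INTENT_SLOTS table, the function name select_key_entities and the rule that a later mention overrides an earlier one are assumptions made for this illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical configuration: the entity types associated with each user intention.
INTENT_SLOTS: Dict[str, Set[str]] = {
    "weather_forecast": {"time", "location", "weather_condition"},
}


def select_key_entities(
    intent: str,
    target: Dict[str, str],               # entity type -> value from the current sentence
    historical: List[Tuple[str, str]],    # (entity type, value), oldest first
) -> Dict[str, str]:
    """Keep historical entities whose type is required by the intention but missing now."""
    missing = INTENT_SLOTS[intent] - set(target)
    key: Dict[str, str] = {}
    for entity_type, value in historical:
        if entity_type in missing:
            key[entity_type] = value      # assumed rule: later mentions override earlier ones
    return key


key_entities = select_key_entities(
    "weather_forecast",
    {"weather_condition": "temperature"},
    [("time", "Friday"), ("location", "Beijing"), ("restaurant", "Haidilao")],
)
```

Here "Friday" and "Beijing" are kept because they fill the "time" and "location" requirements, and "Haidilao" is discarded as redundant for this intention.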

Of course, other methods may be used to determine the key entity information according to different actual usage requirements, which is not limited in this embodiment.

S704: Generate a target interaction sentence according to the target entity information and the key entity information.

In the embodiment of the present application, after determining the target entity information in the user sentence of the current round and the key entity information in the historical dialogue sentence, the target interaction sentence matching the actual intention of the user can be generated based on the above two kinds of information.

For example, if the target entity information and key entity information include time information "Friday", location information "Beijing", and weather condition information "Temperature", it can be recognized that the user currently wants to query the temperature of Beijing on Friday. The target interaction sentence corresponding to this can be "What is the temperature in Beijing this Friday", or other similar sentences. The above-mentioned target interaction sentence is also the expression sentence pattern of the information that the user wants to query.
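
As an illustration of combining the two kinds of entity information into one sentence, the sketch below fills a fixed template. This is a deliberately simple stand-in, not the PGN-based rewriting described in this application; the template text, the TEMPLATES table and the function name build_target_sentence are hypothetical.

```python
from typing import Dict

# Hypothetical sentence template per intention (a stand-in for model-based rewriting).
TEMPLATES: Dict[str, str] = {
    "weather_forecast": "What is the {weather_condition} in {location} on {time}?",
}


def build_target_sentence(
    intent: str,
    target: Dict[str, str],   # entity information from the current round
    key: Dict[str, str],      # key entity information from the historical dialogue
) -> str:
    """Merge both entity sources and fill the template for the intention."""
    slots = {**key, **target}  # current-round values take precedence on conflict
    return TEMPLATES[intent].format(**slots)


sentence = build_target_sentence(
    "weather_forecast",
    {"location": "Beijing"},
    {"time": "Friday", "weather_condition": "temperature"},
)
```

If a required entity type is missing from both sources, format raises a KeyError, which in a dialogue system would correspond to asking the user a follow-up question.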

S705: Output a reply sentence corresponding to the target interactive sentence.

The function of the voice assistant is to facilitate users to query certain information by voice. Therefore, after identifying the target interaction sentence that matches the user's actual intention, the voice assistant can search for the sentence and find the corresponding reply sentence.

For example, for the interactive sentence "What is the temperature in Beijing this Friday", the corresponding reply sentence may be "The temperature in Beijing on Friday is 17 degrees Celsius". The reply sentence can be broadcast to the user by voice, or displayed in the mobile phone interface in the form of text, or sent to the user's mobile phone in other information formats, which is not limited in this embodiment.

In the embodiment of this application, by identifying the target entity information in the current dialogue round and extracting key entity information from the historical dialogue data, the actual intention of the user can be determined based on the above two kinds of entity information, and the user sentence of the current round can be rewritten according to that intention to generate the target interaction sentence, so that applications such as the voice assistant in the terminal device can respond according to the target interaction sentence. In this embodiment, by converting the DST problem in multi-round dialogue, to a certain extent, into a single-round dialogue problem, the existing mature single-round dialogue technology can be used to reply to the user's intention. This can improve the accuracy of dialogue state tracking and user intention recognition, improve the natural language processing capability of the dialogue system, and enhance the rationality of the dialogue system's replies during multi-round dialogue, so that the system's replies better match the actual needs of the user and the number of interactions between the user and the dialogue system is reduced.

Referring to FIG. 8, there is shown a schematic step flowchart of a voice interaction method provided by another embodiment of the present application. The method may specifically include the following steps:

S801: When a user sentence to be replied is received, historical dialogue data is obtained.

It should be noted that this method can be applied to terminal devices such as mobile phones and tablet computers, and this embodiment does not limit the specific types of terminal devices.

For ease of understanding, this embodiment takes the terminal device being a mobile phone as an example for subsequent introduction. That is, when a user uses an application such as a voice assistant in a mobile phone, the application identifies the entity information in the user's current round and previous rounds to determine the corresponding user intention, rewrites the user sentence of the current round based on that intention, and outputs a reply sentence corresponding to the rewritten user sentence to meet the actual needs of the user.

In the embodiments of this application, the user sentence to be replied may refer to a sentence directly uttered by the user during the interaction with the voice assistant. This sentence may be a sentence that fully expresses an intention of the user, or it may be one or more words.

When the voice assistant receives a sentence from the user, it can first determine whether it can give a corresponding reply for the sentence. If the voice assistant can directly give a reply based on the sentence, no other processing is required, and the reply sentence can be directly provided to the user. For example, if the user's sentence is "What is the temperature in Beijing this Friday", because it can be determined directly from the sentence that the user's intention is to inquire about the weather in Beijing this Friday, the voice assistant can directly output the query result to the user.

If the corresponding result cannot be queried directly according to the user's sentence in the current round, the user's intention can be re-determined by combining the user's expressions in the previous rounds. At this time, the historical dialogue data between the user and the voice assistant can be obtained. The aforementioned historical dialogue data can be the dialogue data of all rounds from the time the user woke up the voice assistant until the current round, or it can be the dialogue data of several consecutive rounds before the current round, which is not limited in this embodiment.

S802: Identify the target entity information in the user sentence, and identify the historical entity information in the historical dialogue data.

It should be noted that the historical entity information in the historical dialogue data may include the entity information in the sentence spoken by the user in a certain round, and may also include the entity information in the reply sentence when the voice assistant replies to the user.

For example, in a certain historical dialogue round, the user asked the voice assistant "Which is the nearest restaurant", and the voice assistant replied "The nearest restaurant is Haidilao on Nongda South Road, Haidian District". For the historical dialogue data of this round, the historical entity information may include "restaurant" in the user's sentence, as well as entity information such as "Nongda South Road, Haidian District" and "Haidilao" in the voice assistant's reply sentence.
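The collection of historical entity information from both sides of the dialogue can be sketched as follows. This is a minimal illustration only: the function name `collect_historical_entities`, the representation of a round as a `(user_sentence, assistant_reply)` pair, and the substring lookup against a fixed lexicon are all assumptions made for the example, standing in for whatever entity recognition the embodiment actually uses.

```python
from typing import List, Tuple

# Hypothetical representation: one dialogue round = (user sentence, assistant reply).
Round = Tuple[str, str]


def collect_historical_entities(rounds: List[Round], lexicon: List[str]) -> List[str]:
    """Collect entities from user sentences AND assistant replies, in order of first appearance.

    A plain substring lookup stands in for a real entity recognizer (assumption).
    """
    found: List[str] = []
    for user_sentence, assistant_reply in rounds:
        for text in (user_sentence, assistant_reply):
            for entity in lexicon:
                if entity in text and entity not in found:
                    found.append(entity)
    return found


history = [(
    "Which is the nearest restaurant",
    "The nearest restaurant is Haidilao on Nongda South Road, Haidian District",
)]
lexicon = ["restaurant", "Haidilao", "Nongda South Road, Haidian District"]
entities = collect_historical_entities(history, lexicon)
```

With the restaurant example above, the result contains the entity from the user's sentence as well as the two entities that appear only in the assistant's reply.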

S803: According to the target entity information and the historical entity information, determine a candidate user intention that matches the user sentence.

It should be noted that, since the user sentence may contain multiple pieces of target entity information and the historical dialogue data may contain multiple pieces of historical entity information, multiple candidate user intentions may be preliminarily determined based on the target entity information and the historical entity information.

In the embodiment of the present application, after the target entity information and historical entity information are identified, the user's possible current intentions can be preliminarily determined in combination with the KBs.

In specific implementation, multiple user intentions can be preset in the KBs, and each user intention can include multiple semantic slots. After the target entity information and historical entity information are identified, the slots corresponding to each user intention can be matched against these two kinds of entity information, and each user intention whose slots contain part of the identified entity information is preliminarily determined as a candidate user intention.
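The slot matching described above can be sketched as follows. The knowledge base layout (intention name mapped to slot name mapped to admissible values), the intention and slot names, and the function `candidate_intents` are hypothetical and chosen only for illustration; the embodiment does not specify how the KBs are stored.

```python
from typing import Dict, List, Set

# Hypothetical knowledge base: intention -> semantic slot -> values that can fill the slot.
KnowledgeBase = Dict[str, Dict[str, Set[str]]]

KB: KnowledgeBase = {
    "query_weather": {
        "city": {"Beijing", "Shanghai"},
        "time": {"Friday", "today"},
        "weather_condition": {"temperature", "rain"},
    },
    "find_restaurant": {
        "place_type": {"restaurant"},
        "brand": {"Haidilao"},
    },
}


def candidate_intents(target_entities: List[str],
                      historical_entities: List[str],
                      kb: KnowledgeBase) -> List[str]:
    """Return every intention for which at least one slot is filled by an identified entity."""
    entities = set(target_entities) | set(historical_entities)
    candidates = []
    for intent, slots in kb.items():
        # An intention is a candidate if any of its slots contains part of the entities.
        if any(entities & values for values in slots.values()):
            candidates.append(intent)
    return candidates
```

Because partial matches are enough at this stage, a history that mentions both "temperature" and "restaurant" yields two candidate intentions, which is why a later step has to narrow them down.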

S804: Calculate the distribution probability of each historical entity information in the historical dialogue data.

In the embodiment of the present application, in order to accurately determine the actual intention of the user, the distribution probability of each historical entity information in the historical dialogue data may be calculated first.

In specific implementation, the distribution probability of each historical entity information can be determined based on the PGN model. First, each historical entity information is symbolized; then the PGN model is called, the encoding module of the PGN model is used to encode each symbolized historical entity information, and the distribution probability of each historical entity information is calculated in the encoding stage.
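One possible reading of the symbolization step is sketched below: each entity, which may span several words, is replaced by a single placeholder symbol so that the model handles it as one unit. The placeholder format `<ENTn>` and the functions `symbolize` and `restore` are assumptions for illustration, not details given in this application.

```python
from typing import Dict, List, Tuple


def symbolize(text: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each entity found in the text with a single placeholder symbol.

    Longer entities are replaced first so that an entity contained in a longer
    one is not replaced prematurely.
    """
    mapping: Dict[str, str] = {}
    for i, entity in enumerate(sorted(entities, key=len, reverse=True)):
        symbol = f"<ENT{i}>"  # hypothetical placeholder format
        if entity in text:
            text = text.replace(entity, symbol)
            mapping[symbol] = entity
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original entities back in place of their symbols."""
    for symbol, entity in mapping.items():
        text = text.replace(symbol, entity)
    return text
```

After symbolization, a multi-word entity such as "Nongda South Road, Haidian District" occupies a single position in the encoder input, and `restore` recovers the surface form once a sentence has been generated.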

FIG. 9 is a schematic diagram of the calculation process of the distribution probability of entity information based on the PGN model provided by an embodiment of the present application. First, the prediction model can be trained with training data and the KBs to enhance the key information extraction capability of the PGN model. The training data may be pre-collected multi-round dialogue data, including the entity information in a certain round (the current round) of a dialogue and the historical entity information in the rounds before the current round. Each piece of historical entity information to be calculated is converted into a text vector, from which the corresponding attention distribution is output; at the same time, the encoding module and decoding module of the PGN model are combined to obtain the generation probability of the historical entity information. These probabilities can be added together to output the final distribution probability. In another implementation, the user's confirmation information can also be taken into account when determining the distribution probability, to improve the reliability of the output distribution probability and of the identified key entity information.
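The text states only that the attention distribution and the generation probability are added together. The sketch below shows one common way of doing this in pointer-generator networks, where a scalar `p_gen` weights the two terms; this weighting, the function name `final_distribution`, and all numbers are assumptions for illustration and are not taken from this application.

```python
from typing import Dict, List


def final_distribution(source_tokens: List[str],
                       attention: List[float],
                       vocab_dist: Dict[str, float],
                       p_gen: float) -> Dict[str, float]:
    """Combine the generation distribution with the copy (attention) distribution.

    final[w] = p_gen * vocab_dist[w] + (1 - p_gen) * (attention mass on positions holding w)
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Illustrative numbers only.
dist = final_distribution(
    source_tokens=["temperature", "Friday", "restaurant"],
    attention=[0.6, 0.3, 0.1],
    vocab_dist={"temperature": 0.5, "Beijing": 0.5},
    p_gen=0.4,
)
```

If both input distributions sum to 1, the combined distribution also sums to 1, so the resulting per-entity values can be compared against a single probability threshold.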

S805: Extract key entity information from the historical entity information according to the distribution probability and the candidate user intentions.

In the embodiment of the present application, the key entity information associated with the user's intention is found according to the distribution probability of each historical entity information, that is, the entity information that is more strongly correlated with the user's intention is selected from all the historical entity information.

In specific implementation, the candidate entity information associated with any candidate user intention can be extracted from the historical entity information, and then the candidate entity information whose distribution probability is greater than a preset probability threshold can be extracted as the key entity information related to that intention.

As an example of this embodiment, the probability threshold may be set to 0.8. Therefore, candidate entity information whose distribution probability is greater than 0.8 can be identified as key entity information.

In the embodiment of the present application, for entity information whose probability value is not greater than the above-mentioned probability threshold but is close to it, the user may be invited to confirm the entity information.

In specific implementation, if the difference between a target probability value and the aforementioned probability threshold is less than a preset difference, and the target probability value is less than the probability threshold, a query sentence can be generated from the candidate entity information corresponding to the target probability value and the key entity information, to instruct the user to confirm the candidate entity information corresponding to the target probability value. The target probability value is the probability value of the distribution probability of any candidate entity information in the historical dialogue data.

When receiving the user's confirmation information for the above query sentence, it can be considered that the user has approved the entity information. At this time, the candidate entity information corresponding to the target probability value can be identified as key entity information.

For example, in a certain round of dialogue between the user and the voice assistant, the probability value of the historical entity information "temperature" is calculated to be 0.86, which is greater than the set probability threshold of 0.8. In this case, the entity information "temperature" can be identified as key entity information. On the other hand, the calculated probability value of the historical entity information "Friday" is 0.72, which is less than the probability threshold of 0.8 but close to it. Assuming that the target entity information in the current-round user sentence is "Beijing", the target entity information and the existing key entity information can be combined to generate the corresponding query sentence "Do you want to check the temperature information for Beijing on "Friday"?". If the user's confirmation reply is received, the historical entity information "Friday" can also be identified as key entity information.
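The threshold rule and the confirmation band can be sketched as follows. The threshold of 0.8 comes from the example above; the width of the band (`margin=0.1`) is a hypothetical value, since the preset difference is not specified, and the function names are illustrative.

```python
from typing import Dict, List, Tuple


def split_by_threshold(probs: Dict[str, float],
                       threshold: float = 0.8,
                       margin: float = 0.1) -> Tuple[List[str], List[str]]:
    """Split candidate entities into key entities and entities needing user confirmation.

    margin is an assumed value for the "preset difference".
    """
    key_entities, to_confirm = [], []
    for entity, p in probs.items():
        if p > threshold:
            key_entities.append(entity)   # clearly above the threshold
        elif threshold - p < margin:
            to_confirm.append(entity)     # not above, but close to the threshold
        # anything else is discarded as redundant
    return key_entities, to_confirm


def apply_confirmation(key_entities: List[str], entity: str, confirmed: bool) -> List[str]:
    """Promote a near-threshold entity to key entity only if the user confirms it."""
    return key_entities + [entity] if confirmed else list(key_entities)


key, pending = split_by_threshold({"temperature": 0.86, "Friday": 0.72, "restaurant": 0.15})
key_after_yes = apply_confirmation(key, pending[0], confirmed=True)
```

With the probabilities from the example, "temperature" is accepted immediately, "Friday" is queued for a confirmation question, and "restaurant" is dropped.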

S806: Determine the target basic sentence.

In specific implementation, in order to reduce the difficulty of generating the target interactive sentence, the target basic sentence can be determined first, and then rewritten on the basis of the target basic sentence to obtain the final target interactive sentence.

In the embodiment of the present application, the target basic sentence may be determined based on key entity information and/or target entity information.

In specific implementation, multiple basic sentences can first be obtained from the user sentence containing the target entity information and from the historical dialogue data containing the key entity information. The matching degree between each of the multiple basic sentences and the entity information to be evaluated is then calculated, and the basic sentence with the maximum matching degree is identified as the current target basic sentence. The entity information to be evaluated includes all the target entity information and key entity information.

In the embodiment of the present application, a basic sentence may be the current user sentence or a user sentence in the historical dialogue data. The matching degree between the entity information to be evaluated and each basic sentence can be determined according to the degree of matching between the entity information to be evaluated and the semantic slots.

Specifically, for any basic sentence, the number of semantic slots in the basic sentence and the number of pieces of entity information to be evaluated can be counted, that is, how many slots the basic sentence contains and how many pieces of entity information are to be evaluated. Then, the number of key slots in the basic sentence that match the entity information to be evaluated is determined, and finally the ratio between the number of key slots and the number of semantic slots in the basic sentence is calculated. This ratio is used as the matching degree between the entity information to be evaluated and the basic sentence.

For example, if a basic sentence includes four semantic slots, and the entity information to be evaluated, "Friday" and "temperature", matches the time slot and the weather condition slot respectively, then the matching degree between the entity information to be evaluated and the basic sentence is 50%.
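The ratio described above, and the selection of the basic sentence with the highest ratio, can be sketched as follows. The slot names other than the time and weather condition slots, the mapping from each entity to the slot it fills, and the function names are hypothetical.

```python
from typing import Dict, List


def matching_degree(sentence_slots: List[str], entity_to_slot: Dict[str, str]) -> float:
    """Number of key slots (slots filled by an evaluated entity) / number of slots in the sentence."""
    slots = set(sentence_slots)
    if not slots:
        return 0.0
    key_slots = {slot for slot in entity_to_slot.values() if slot in slots}
    return len(key_slots) / len(slots)


def pick_target_basic_sentence(candidates: Dict[str, List[str]],
                               entity_to_slot: Dict[str, str]) -> str:
    """Return the candidate basic sentence with the maximum matching degree."""
    return max(candidates, key=lambda s: matching_degree(candidates[s], entity_to_slot))


# Four slots, two of them matched by "Friday" and "temperature".
four_slots = ["city", "time", "weather_condition", "unit"]  # "city" and "unit" are assumed names
evaluated = {"Friday": "time", "temperature": "weather_condition"}
degree = matching_degree(four_slots, evaluated)  # 2 key slots / 4 slots
```

This reproduces the 50% figure from the example. When several basic sentences are available, `pick_target_basic_sentence` simply keeps the one with the largest ratio.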

S807: Using the target entity information and the key entity information, rewrite the target basic sentence to generate a target interactive sentence.

After the target basic sentence is determined, the target entity information and key entity information can be used to rewrite the sentence to obtain the final target interactive sentence.

Whether target entity information or key entity information is used for sentence rewriting depends on whether the target basic sentence is the current user sentence or a user sentence in the historical dialogue. If the target basic sentence is the current user sentence, the sentence already contains all the target entity information, so the key entity information identified from the historical dialogue can be used for rewriting; if the target basic sentence is a sentence in the historical dialogue, the sentence may contain only part of the key entity information, so all the key entity information and the target entity information can be used to rewrite the sentence.

In the embodiment of the present application, the target interactive sentence may be output based on the PGN model. In addition to the encoding module, the PGN model may also include a decoding module. The decoding module can be obtained by training various types of training data. The various types of training data can include multiple entity information and basic sentences corresponding to each entity information.

Therefore, after the target basic sentence is determined, the decoding module of the PGN model can be used to decode the target entity information, key entity information, and target basic sentence, and output the target interactive sentence.

In the embodiment of the present application, for the target interactive sentence output by the PGN model, it can be verified whether the sentence is rewritten correctly.

In the embodiment of the present application, it is possible to determine whether the target interactive sentence is rewritten correctly by means of double-layer verification.

Specifically, multiple entity information in the target interaction sentence may be extracted first, and it is verified whether the multiple entity information in the target interaction sentence matches the preset semantic slot of the target user's intention in the knowledge base. Among them, the target user intention is any one of all candidate user intentions.

If multiple entity information in the target interaction sentence matches the semantic slot intended by the target user, it can be determined that the generated target interaction sentence is correct, and step S808 is executed to output a reply sentence corresponding to the target interaction sentence.

If multiple entity information in the target interactive sentence does not match the semantic slot intended by the target user, the target interactive sentence can be verified a second time according to the sentence type of the target interactive sentence.

The second verification can be performed based on a natural language understanding model. By calling a preset natural language understanding model, it can be judged whether the target interactive sentence is a task-type sentence. If the sentence is a task-type sentence, the user's intention can be specifically identified from the current sentence and a response can be made to the intention. In this case, step S808 can also be executed to output a reply sentence corresponding to the target interactive sentence. If the target interactive sentence is not a task-type sentence, the voice assistant cannot perform specific intention recognition on the sentence, or the recognized intention lacks clear information. In this case, the user can be prompted to re-enter the user sentence, and the voice assistant can recognize the user's intention again according to the re-entered user sentence and generate a new target interactive sentence.
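The double-layer verification can be sketched as the control flow below. How "matching" a semantic slot is decided is not specified here, so the sketch assumes that every extracted entity must fill some slot of the target user intention; the natural language understanding model is represented by a caller-supplied function `is_task_sentence`, which is a hypothetical stand-in.

```python
from typing import Callable, Dict, List, Set


def verify_rewrite(sentence: str,
                   entities: List[str],
                   intent_slots: Dict[str, Set[str]],
                   is_task_sentence: Callable[[str], bool]) -> str:
    """Return "reply" if the rewritten sentence passes either layer, else "ask_again"."""
    # Layer 1: check the extracted entities against the intention's preset slots.
    known_values = set().union(*intent_slots.values())
    if entities and all(entity in known_values for entity in entities):
        return "reply"
    # Layer 2: fall back to the sentence type judged by the NLU model (stand-in).
    if is_task_sentence(sentence):
        return "reply"
    # Both layers failed: prompt the user to re-enter the sentence.
    return "ask_again"


weather_slots = {"city": {"Beijing"}, "time": {"Friday"}, "weather_condition": {"temperature"}}
outcome = verify_rewrite(
    "What is the temperature in Beijing this Friday",
    ["Beijing", "Friday", "temperature"],
    weather_slots,
    is_task_sentence=lambda s: False,  # not consulted when layer 1 succeeds
)
```

The second layer is only consulted when the slot check fails, which keeps the cheaper knowledge base lookup on the common path.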

S808: Output a reply sentence corresponding to the target interactive sentence.

In the embodiment of the present application, the user sentence in the current round can be rewritten by combining the entity information in the historical dialogue data, so that the dialogue state tracking problem in multi-round dialogue is converted, to a certain extent, into a single-round dialogue problem. Existing single-round dialogue technology can then be used to reply to the user's intention. This improves the natural language processing capability of the dialogue system, ensures the accuracy of user intention recognition, and enhances the rationality of the dialogue system's replies during multi-round dialogue, so that the system's replies better match the actual needs of the user and the number of interactions between the user and the dialogue system is reduced.

For ease of understanding, the following describes the voice interaction method of the present application in conjunction with a specific example. As shown in FIG. 10, it is a schematic diagram of the operation process of the voice interaction method provided by an embodiment of the present application. According to the operation process shown in FIG. 10, the entire voice interaction may include the following steps:

1. Regarding the input sentences in the process of multiple rounds of dialogue, you can first determine whether it is necessary to rewrite the input sentences of the current round of dialogue. If there is no need to rewrite, you can directly output the reply sentence; if you need to rewrite, first use the NER module to extract the entity information in the historical dialogue data, and determine all the historical entity information in the historical dialogue data. Since entities may be composed of multiple words, each entity information needs to be symbolized to facilitate subsequent encoding and generation in the PGN model.

2. Based on the PGN model, calculate the attention distribution of each historical entity information in the coding process.

3. Combine the target entity information in the current dialogue round, the distribution probability of the historical entity information, and the knowledge bases (KBs) to train the prediction model, and select from the historical entity information the entity information with the highest probability of being related to the user's intention as the key entity information, while discarding redundant entity information in the historical dialogue data. On this basis, the corresponding basic rewritten sentence is determined according to the key entity information. Then, on the basis of the basic rewritten sentence, the decoding module of the PGN model is used to generate the target interactive sentence. The purpose of determining the basic rewritten sentence in advance is to reduce the difficulty of generating output sentences with the PGN model.

4. Using the KBs as prior knowledge, combine the historical entity information and the target entity information in the current round of dialogue to train the neural network model, to enhance the ability of the PGN model to extract key information from the historical dialogue. It should be noted that, since the entities in the sentences of the current dialogue round are generally closely related to the user's intention, they are all retained. Then, the output of the neural network is integrated into the loss function of the PGN model to calculate the output probability corresponding to the historical entity information.

5. Based on the output probability of each historical entity information, combined with the PGN model, calculate the final distribution probability of each historical entity information, that is, the probability that the entity in the model can ultimately reflect the user's intention.

6. If the distribution probability is less than the threshold, the improved model still cannot determine whether the entity should appear in the output sentence. In this case, the user may be invited to participate in the configuration of the key entities; then, combined with the key entities whose probability is greater than the threshold, the decoding module of the PGN model is used to generate the sentence on the basis of the basic rewritten sentence determined above. It should be noted that users are invited to participate in entity configuration in order to increase the recall rate of the key information extracted by the model. Generally, the threshold can be set relatively high, but too high a threshold may also cause some key information to be lost. Therefore, for entity information close to the threshold, it is necessary to invite users to participate in the configuration to further improve the recall rate of the key information extracted by the model. In addition, output sentences obtained with user participation are highly reliable and can be used as training corpus to iteratively optimize the model, which partially solves the problem that high-quality multi-round dialogue corpus is difficult to obtain.

7. In order to determine the validity of the output target interactive sentence, this embodiment designs a two-layer feedback mechanism, and the specific method is as follows:

The rewritten sentence is matched with the slot values corresponding to the intention in the KBs. If the match is successful, the rewriting is considered correct; if the match is unsuccessful, the rewritten sentence can be verified by using natural language understanding technology. If natural language understanding technology recognizes that the sentence is a task-type sentence, the rewriting can be considered correct; if it recognizes that the sentence is not a task-type sentence, the rewriting can be considered wrong. In this case, the user can be guided to restate the intention, and the restated sentence can serve as subsequent training corpus.

8. Based on the rewritten target interactive sentence, the DST problem in multi-round dialogue can be converted, to a certain extent, into a single-round dialogue problem, and existing mature single-round dialogue technology can be used to respond to the user's intention, improving the capabilities and user experience of the task-oriented multi-round dialogue system.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

Corresponding to the voice interaction method described in the above embodiment, FIG. 11 shows a structural block diagram of a voice interaction device provided by an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.

Referring to FIG. 11, the device can be applied to terminal equipment, and specifically can include the following modules:

The historical dialogue data acquisition module 1101 is used to acquire historical dialogue data when a user sentence to be replied is received;

The target entity information identification module 1102 is used to identify the target entity information in the user sentence; and,

The historical entity information identification module 1103 is used to identify historical entity information in the historical dialogue data;

The key entity information extraction module 1104 is configured to extract key entity information associated with the user sentence from the historical entity information;

The target interactive sentence generating module 1105 is configured to generate a target interactive sentence according to the target entity information and the key entity information;

The reply sentence output module 1106 is used to output a reply sentence corresponding to the target interactive sentence.
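The way modules 1101 to 1106 cooperate can be sketched as a simple composition. The class name `VoiceInteractionDevice`, the field names, and the function signatures are assumptions chosen for illustration; the application defines the modules by function only and does not prescribe an interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoiceInteractionDevice:
    """Hypothetical wiring of the six modules; each field is a pluggable callable."""
    acquire_history: Callable[[], List[str]]                          # module 1101
    identify_target_entities: Callable[[str], List[str]]              # module 1102
    identify_historical_entities: Callable[[List[str]], List[str]]    # module 1103
    extract_key_entities: Callable[[str, List[str]], List[str]]       # module 1104
    generate_target_sentence: Callable[[List[str], List[str]], str]   # module 1105
    output_reply: Callable[[str], str]                                # module 1106

    def handle(self, user_sentence: str) -> str:
        """Run the modules in the order in which they are listed."""
        history = self.acquire_history()
        target = self.identify_target_entities(user_sentence)
        historical = self.identify_historical_entities(history)
        key = self.extract_key_entities(user_sentence, historical)
        sentence = self.generate_target_sentence(target, key)
        return self.output_reply(sentence)
```

Keeping each module behind a callable makes it possible to replace, for example, the key entity extraction step without touching the rest of the flow.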

In the embodiment of the present application, the key entity information extraction module may specifically include the following submodules:

In the embodiment of the present application, the distribution probability calculation sub-module may specifically include the following units:

In the embodiment of the present application, the key entity information extraction submodule may specifically include the following units:

In the embodiment of the present application, the key entity information extraction submodule may further include the following units:

In the embodiment of the present application, the target interactive sentence generation module may specifically include the following sub-modules:

In the embodiment of the present application, the target basic sentence determination submodule may specifically include the following units:

In the embodiment of the present application, any basic sentence includes multiple semantic slots, and the matching degree calculation unit may specifically include the following subunits:

In the embodiment of the present application, the pointer generation network model further includes a decoding module, which is obtained by training on various types of training data, and the various types of training data include multiple pieces of entity information and the basic sentence corresponding to each piece of entity information; the target interactive sentence generation sub-module may specifically include the following units:

The second pointer generation network model calling unit is configured to use the decoding module to decode the target entity information, the key entity information, and the target basic sentence, and output a target interactive sentence.

In the embodiment of the present application, the target interactive sentence generation submodule may further include the following units:

The target interactive sentence verification unit is used to verify whether the multiple entity information in the target interactive sentence matches a preset semantic slot of the target user intention in the knowledge base, the target user intention being any one of the candidate user intentions; if the multiple entity information in the target interactive sentence matches the semantic slot of the target user intention, it is determined that the generated target interactive sentence is correct, and the step of outputting the reply sentence corresponding to the target interactive sentence is executed; if the multiple entity information in the target interactive sentence does not match the semantic slot of the target user intention, the target interactive sentence is verified according to the sentence type of the target interactive sentence.

In the embodiment of the present application, the target interactive sentence verification unit is further configured to: call a preset natural language understanding model to determine whether the target interactive sentence is a task-type sentence; if the target interactive sentence is a task-type sentence, call the reply sentence output module to output a reply sentence corresponding to the target interactive sentence; if the target interactive sentence is not a task-type sentence, prompt the user to re-enter the user sentence, and regenerate the target interactive sentence according to the re-entered user sentence.

As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the description of the method embodiment part.

Referring to FIG. 12, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in FIG. 12, the terminal device 1200 of this embodiment includes: a processor 1210, a memory 1220, and a computer program 1221 that is stored in the memory 1220 and can run on the processor 1210. When the processor 1210 executes the computer program 1221, the steps in the various embodiments of the voice interaction method described above are implemented, for example, steps S701 to S705 shown in FIG. 7. Alternatively, when the processor 1210 executes the computer program 1221, the functions of the modules/units in the foregoing device embodiments are implemented, for example, the functions of the modules 1101 to 1106 shown in FIG. 11.

Exemplarily, the computer program 1221 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 1220 and executed by the processor 1210 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments may be used to describe the execution process of the computer program 1221 in the terminal device 1200. For example, the computer program 1221 can be divided into a historical dialogue data acquisition module, a target entity information identification module, a historical entity information identification module, a key entity information extraction module, a target interactive sentence generation module, and a reply sentence output module. The specific functions of each module are as follows:

The target entity information identification module is used to identify the target entity information in the user sentence;

The terminal device 1200 may be a computing device such as a desktop computer, a notebook, or a palmtop computer. The terminal device 1200 may include, but is not limited to, a processor 1210 and a memory 1220. Those skilled in the art can understand that FIG. 12 is only an example of the terminal device 1200, and does not constitute a limitation on the terminal device 1200. It may include more or less components than those shown in the figure, or combine some components, or different components. For example, the terminal device 1200 may also include input and output devices, network access devices, buses, and so on.

The processor 1210 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 1220 may be an internal storage unit of the terminal device 1200, such as a hard disk or a memory of the terminal device 1200. The memory 1220 may also be an external storage device of the terminal device 1200, such as a plug-in hard disk equipped on the terminal device 1200, a smart memory card (Smart Media Card, SMC), and a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 1220 may also include both an internal storage unit of the terminal device 1200 and an external storage device. The memory 1220 is used to store the computer program 1221 and other programs and data required by the terminal device 1200. The memory 1220 can also be used to temporarily store data that has been output or will be output.

The embodiment of the present application also discloses a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the aforementioned voice interaction method can be realized.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail or recorded in an embodiment, reference may be made to related descriptions of other embodiments.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed voice interaction method, device, and terminal device can be implemented in other ways. For example, the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the implementation of all or part of the processes in the above-mentioned embodiment methods in the present application can be accomplished by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium. When executed by the processor, the steps of the foregoing method embodiments can be implemented. Wherein, the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a voice interaction device or terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access Memory (RAM, Random Access Memory), electric carrier signal, telecommunications signal, and software distribution medium. For example, U disk, mobile hard disk, floppy disk or CD-ROM, etc. In some jurisdictions, according to legislation and patent practices, computer-readable media cannot be electrical carrier signals and telecommunication signals.

Finally, it should be noted that the above are only specific implementations of this application, and the scope of protection of this application is not limited thereto. Any change or substitution within the technical scope disclosed in this application shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims

1. A voice interaction method, characterized in that it comprises:

obtaining historical dialogue data when a user sentence to be replied to is received;

identifying target entity information in the user sentence, and identifying historical entity information in the historical dialogue data;

extracting key entity information associated with the user sentence from the historical entity information;

generating a target interaction sentence according to the target entity information and the key entity information; and

outputting a reply sentence corresponding to the target interaction sentence.

2. The method according to claim 1, wherein the extracting key entity information associated with the user sentence from the historical entity information comprises:

determining, according to the target entity information and the historical entity information, candidate user intentions that match the user sentence;

separately calculating the distribution probability of each piece of historical entity information in the historical dialogue data; and

extracting the key entity information from the historical entity information according to the distribution probabilities and the candidate user intentions.

3. The method according to claim 2, wherein the separately calculating the distribution probability of each piece of historical entity information in the historical dialogue data comprises:

calling a preset pointer generation network model, the pointer generation network model comprising an encoding module; and

separately encoding each piece of historical entity information by using the encoding module, to obtain the distribution probability corresponding to each piece of historical entity information.

4. The method according to claim 2 or 3, wherein the extracting key entity information from the historical entity information according to the distribution probabilities and the candidate user intentions comprises:

extracting, from the historical entity information, candidate entity information associated with any one of the candidate user intentions; and

extracting, as the key entity information, the candidate entity information whose distribution probability is greater than a preset probability threshold.

5. The method according to claim 4, further comprising:

if a target probability value is less than the preset probability threshold, and the difference between the target probability value and the preset probability threshold is less than a preset difference, generating a query sentence according to the candidate entity information corresponding to the target probability value and the key entity information, the query sentence being used to prompt the user to confirm the candidate entity information corresponding to the target probability value; and

when the user's confirmation information for the query sentence is received, determining the candidate entity information corresponding to the target probability value as key entity information, wherein the target probability value is the probability value of the distribution probability of any piece of candidate entity information in the historical dialogue data.
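
The threshold-based extraction and the near-threshold confirmation step described in the two claims above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the names `Candidate`, `select_key_entities`, and `confirm_entity`, as well as the example threshold (0.5) and preset difference (0.1), are hypothetical and do not come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical container: a candidate entity and its distribution probability."""
    entity: str
    probability: float


def select_key_entities(
    candidates: List[Candidate],
    threshold: float = 0.5,  # assumed preset probability threshold
    margin: float = 0.1,     # assumed preset difference
) -> Tuple[List[str], List[str]]:
    """Return (key entities, entities that need user confirmation)."""
    key_entities: List[str] = []
    to_confirm: List[str] = []
    for c in candidates:
        if c.probability > threshold:
            # Above the threshold: taken directly as key entity information.
            key_entities.append(c.entity)
        elif c.probability < threshold and threshold - c.probability < margin:
            # Just below the threshold: a query sentence would be generated
            # so that the user can confirm this candidate.
            to_confirm.append(c.entity)
    return key_entities, to_confirm


def confirm_entity(key_entities: List[str], entity: str, user_confirms: bool) -> List[str]:
    """Promote a near-threshold candidate to key entity once the user confirms it."""
    return key_entities + [entity] if user_confirms else list(key_entities)
```

With candidates scored 0.8, 0.45, and 0.1 under these assumed settings, the first is extracted directly, the second is put to the user for confirmation, and the third is discarded.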

6. The method according to claim 3, wherein the generating a target interaction sentence according to the target entity information and the key entity information comprises:

determining a target basic sentence; and

using the target entity information and the key entity information to rewrite the target basic sentence, to generate the target interaction sentence.

7. The method according to claim 6, wherein the determining a target basic sentence comprises:

obtaining a plurality of basic sentences from the user sentence containing the target entity information and from the historical dialogue data containing the key entity information;

separately calculating the matching degree between each of the plurality of basic sentences and entity information to be evaluated, the entity information to be evaluated comprising the target entity information and the key entity information; and

identifying the basic sentence with the maximum matching degree as the target basic sentence.

8. The method according to claim 7, wherein each basic sentence comprises a plurality of semantic slots, and the separately calculating the matching degree between the plurality of basic sentences and the entity information to be evaluated comprises:

for any basic sentence, counting the number of semantic slots in the basic sentence and the number of pieces of entity information to be evaluated;

determining the number of key slots in the basic sentence that respectively match the entity information to be evaluated; and

calculating the ratio of the number of key slots to the number of semantic slots in the basic sentence, and using the ratio as the matching degree between the entity information to be evaluated and the basic sentence.
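
The matching-degree calculation and the selection of the basic sentence with the maximum matching degree can be illustrated with a short sketch. It assumes, purely for illustration, that a semantic slot "matches" a piece of entity information when the slot name equals the entity type; the function names and the example sentences are hypothetical.

```python
from typing import Dict, List, Set


def matching_degree(slots: List[str], entity_types: Set[str]) -> float:
    """Ratio of key slots (slots matched by an entity to be evaluated) to all slots."""
    if not slots:
        return 0.0
    key_slots = sum(1 for slot in slots if slot in entity_types)
    return key_slots / len(slots)


def pick_target_basic_sentence(basic_sentences: List[Dict], entity_types: Set[str]) -> Dict:
    """Return the basic sentence with the highest matching degree."""
    return max(basic_sentences, key=lambda s: matching_degree(s["slots"], entity_types))


# Hypothetical example: entity types found in the user sentence and the history.
entity_types = {"city", "date"}
basic_sentences = [
    {"text": "Book a [room_type] hotel in [city]", "slots": ["room_type", "city"]},
    {"text": "What is the weather in [city] on [date]", "slots": ["city", "date"]},
]
target = pick_target_basic_sentence(basic_sentences, entity_types)
```

In this example the first sentence has one key slot out of two (matching degree 0.5) and the second has two out of two (matching degree 1.0), so the second is chosen as the target basic sentence.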

9. The method according to claim 7 or 8, wherein the pointer generation network model further comprises a decoding module, the decoding module being obtained by training on multiple types of training data, the multiple types of training data comprising multiple pieces of entity information and a basic sentence corresponding to each piece of entity information; and

the using the target entity information and the key entity information to rewrite the target basic sentence, to generate the target interaction sentence, comprises:

decoding the target entity information, the key entity information, and the target basic sentence by using the decoding module, and outputting the target interaction sentence.

10. The method according to claim 9, further comprising:

extracting multiple pieces of entity information from the target interaction sentence;

verifying whether the multiple pieces of entity information in the target interaction sentence match a preset semantic slot of a target user intention in a knowledge base, the target user intention being any one of the candidate user intentions;

if the multiple pieces of entity information in the target interaction sentence match the semantic slot of the target user intention, determining that the generated target interaction sentence is correct, and executing the step of outputting a reply sentence corresponding to the target interaction sentence; and

if the multiple pieces of entity information in the target interaction sentence do not match the semantic slot of the target user intention, verifying the target interaction sentence according to the sentence type of the target interaction sentence.

11. The method according to claim 10, wherein the verifying the target interaction sentence according to the sentence type of the target interaction sentence comprises:

calling a preset natural language understanding model to determine whether the target interaction sentence is a task-type sentence;

if the target interaction sentence is a task-type sentence, executing the step of outputting a reply sentence corresponding to the target interaction sentence; and

if the target interaction sentence is not a task-type sentence, prompting the user to re-enter the user sentence, and generating the target interaction sentence again according to the re-entered user sentence.
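
The two-stage verification described in the two claims above (a semantic-slot check against the knowledge base, then a fallback check on the sentence type) can be sketched as below. This is a simplified sketch under stated assumptions: the knowledge base is modeled as a plain dictionary from intention name to slot names, "matching" is taken to mean that every extracted entity type is one of the intention's slots, and the boolean `is_task_type` stands in for the output of the natural language understanding model. All names are hypothetical.

```python
from typing import Dict, List, Set


def verify_interaction_sentence(
    entity_types: List[str],
    candidate_intents: List[str],
    knowledge_base: Dict[str, Set[str]],
    is_task_type: bool,
) -> str:
    """Return "reply" if a reply may be output, otherwise "reprompt"."""
    # Stage 1: do the extracted entities fit the semantic slots of any
    # candidate user intention recorded in the knowledge base?
    for intent in candidate_intents:
        slots = knowledge_base.get(intent, set())
        if entity_types and set(entity_types) <= slots:
            return "reply"
    # Stage 2: the slot check failed, so fall back to the sentence type.
    # A task-type sentence is still answered; otherwise the user is asked
    # to re-enter the sentence.
    return "reply" if is_task_type else "reprompt"


# Hypothetical knowledge base with a single intention.
kb = {"query_weather": {"city", "date"}}
```

For instance, entity types `["city", "date"]` pass the slot check for `query_weather` directly, while `["song"]` passes only if the sentence is classified as task-type.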

12. A voice interaction device, characterized in that it comprises:

a historical dialogue data acquisition module, configured to acquire historical dialogue data when a user sentence to be replied to is received;

a target entity information identification module, configured to identify target entity information in the user sentence;

a historical entity information identification module, configured to identify historical entity information in the historical dialogue data;

a key entity information extraction module, configured to extract key entity information associated with the user sentence from the historical entity information;

a target interaction sentence generation module, configured to generate a target interaction sentence according to the target entity information and the key entity information; and

a reply sentence output module, configured to output a reply sentence corresponding to the target interaction sentence.

13. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice interaction method according to any one of claims 1 to 11.

14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the voice interaction method according to any one of claims 1 to 11.