CN113012687B - Information interaction method and device and electronic equipment


Info

Publication number: CN113012687B
Authority: CN (China)
Prior art keywords: current, vector, voice, text information, current voice
Legal status: Active
Application number: CN202110247302.9A
Other languages: Chinese (zh)
Other versions: CN113012687A
Inventors: 赵瀚, 贾朝阳, 颜廷旭, 丁宁
Current Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202110247302.9A
Publication of CN113012687A
Application granted
Publication of CN113012687B

Classifications

    • G10L 15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/1815 — Speech classification or search using natural language modelling; semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/26 — Speech recognition; speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses an information interaction method, an information interaction device and electronic equipment. In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.

Description

Information interaction method and device and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to an information interaction method, an information interaction device and electronic equipment.
Background
Dialog management refers to a system guiding a dialog in a certain way; its main task is to identify the current dialog intention from the user's input, combined with the context and the historical interaction information, and to execute the next action based on that intention. In the prior art, the accuracy of intention recognition is low due to noisy conversation environments, complex user accents, and/or insufficient training corpora for the models.
Disclosure of Invention
In view of this, embodiments of the present invention provide an information interaction method, an information interaction device and an electronic device, so as to improve the accuracy of intention determination and make a correct response based on the intention, thereby improving the user experience.
In a first aspect, an embodiment of the present invention provides an information interaction method, where the method includes:
receiving current voice input by a target user;
determining at least one behavior type corresponding to the current voice;
extracting entity word slots in each behavior type;
recalling at least one similar sentence corresponding to the current voice from sentences corresponding to each entity word slot;
inputting the obtained current feature information into a pre-trained intention determining model for processing, and obtaining an intention corresponding to the current voice, wherein the current feature information at least comprises similar sentences;
and executing corresponding operation according to the intention and controlling to return an execution result.
In a second aspect, an embodiment of the present invention provides an information interaction apparatus, where the apparatus includes:
a receiving unit configured to receive a current voice input by a target user;
the type determining unit is configured to determine at least one behavior type corresponding to the current voice;
a word slot extracting unit configured to extract an entity word slot in each of the behavior types;
a sentence recalling unit configured to recall at least one similar sentence corresponding to the current voice from sentences corresponding to each of the entity word slots;
the intention determining unit is configured to input the obtained current feature information into a pre-trained intention determining model for processing, and obtain an intention corresponding to the current voice, wherein the current feature information at least comprises each similar sentence;
and the execution unit is configured to execute the corresponding operation according to the intention and control to return an execution result.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method according to the first aspect of the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method according to the first aspect of the embodiment of the present invention.
In a fifth aspect, embodiments of the present invention provide a computer program product, which when run on a computer causes the computer to perform the method according to the first aspect of embodiments of the present invention.
In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dialog management module according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a portion of a drawing board of an embodiment of the present invention;
FIG. 3 is a flow chart of a method of information interaction according to an embodiment of the present invention;
FIG. 4 is a flow chart of a similar statement recall method according to an embodiment of the present invention;
FIG. 5 is a flow chart of a feature vector determination method of an embodiment of the present invention;
FIG. 6 is a schematic diagram of an intent recognition data processing procedure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an information interaction system of an embodiment of the present invention;
FIG. 8 is a schematic diagram of an information interaction device according to an embodiment of the present invention;
fig. 9 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Furthermore, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the following embodiments, a dialog management scenario of booking a car by voice (for example, by telephone or another communication mode) is mainly described. It should be understood that this does not limit the information interaction method and the intention recognition method to this scenario; they may be applied to any dialog management scenario, for example a logistics application scenario, an online car-hailing application scenario, and the like.
In current online car-hailing practice, a car is usually booked through an APP on a smart device, and some users (for example, the elderly) find it inconvenient or impossible to use such an APP, which makes it difficult for them to get a taxi. Therefore, a dialogue management method and system can be provided so that a user can place an order by telephone voice. For example, the specific process may be: the intelligent server analyzes the voice input by the user to perform the car-booking operation, and at the same time informs the user, through voice communication, of the order-taking situation, the information of the vehicle that took the order, the vehicle's estimated arrival time and its arrival state, and confirms whether the user has boarded or alighted safely. Therefore, in the dialogue application scenario of online car-hailing, it is crucial to accurately recognize the user's input voice and intention; otherwise, the user's starting place and destination cannot be determined and the booking fails. The information interaction method provided by this embodiment can therefore be adopted to improve the accuracy of intention determination, make a correct response based on the intention, and further improve the user experience.
In an alternative implementation manner, the information interaction system of the embodiment of the invention may include a dialogue management module and a semantic understanding module. The dialogue management module determines the next possible state transition according to the current task state, the historical interaction information and the like. The semantic understanding module identifies the user's intention according to the current task state, the historical interaction information and the user's current voice, controls the dialogue management module to take the corresponding action, and controls the answer corresponding to the intention to be returned.
In this embodiment, the dialogue management module is constructed based on a finite state machine; that is, this embodiment creates the transition relationships between information interaction states based on a state machine. The structure of the dialogue management module consists of a drawing board (graph), sub-graph flows (flow), nodes (node) and edges (edge), where the drawing board corresponds to the whole project, a sub-graph flow corresponds to a skill, a node represents the corresponding action to be executed, and an edge corresponds to a judgment condition. Optionally, when the user interacts with the information interaction system, the dialogue management module maintains the user's state; for example, it stores the user's information, the current task state and the historical interaction information in a database cache. When the user interacts with the system again, the dialogue management module retrieves and refers to this information and, based on the preset drawing board and the intention returned by the semantic understanding module, flows to the next node through the corresponding edge, that is, transfers to the next state.
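As a rough illustration only (not the patent's implementation), the drawing-board structure described above could be represented as in the following sketch; all class and field names are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Edge:
    condition: Callable[[str], bool]        # judgment condition, checked against the intention
    target: "Node"

@dataclass
class Node:
    name: str
    action: Callable[[dict], str]           # the action to execute; returns the reply to speak
    edges: List[Edge] = field(default_factory=list)

@dataclass
class Flow:                                 # a sub-graph flow corresponds to one skill
    name: str
    entry: Node

@dataclass
class Graph:                                # the drawing board corresponds to the whole project
    flows: Dict[str, Flow]

def transition(current: Node, intention: str, context: dict) -> Node:
    """Flow to the next node through the first edge whose condition matches the intention."""
    for edge in current.edges:
        if edge.condition(intention):
            context["last_reply"] = edge.target.action(context)
            context.setdefault("track", []).append(edge.target.name)  # track list for later audit
            return edge.target
    return current                          # no edge fires: stay in the current state
```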
FIG. 1 is a schematic diagram of a dialogue management module according to an embodiment of the present invention. As shown in fig. 1, the dialogue management module 1 obtains the voice information sent by the user terminal and the current state of the target task through the user management unit 11 and the user state tracking unit 12, and determines the node n1 corresponding to the current state. The dialogue management module 1 calls the semantic understanding module 2 to perform semantic recognition on the acquired voice information and obtain the corresponding semantic recognition result (i.e., the intention corresponding to the voice information). That is, the dialogue management module 1 sends a semantic recognition request to the semantic understanding module 2, and the semantic understanding module 2, in response, performs semantic recognition on the voice information, obtains the corresponding semantic recognition result, and returns it to the dialogue management module 1. The dialogue management module 1 then moves between nodes according to the intention corresponding to the semantic recognition result and the condition information between nodes, and updates the state of the target task. As shown in fig. 1, node n2 is determined as the current node based on the intention corresponding to the semantic recognition result and the condition information between nodes. Meanwhile, the dialogue management module 1 transmits the state update information to the state tracking unit 13, and the state tracking unit 13 transmits the new state to the user management unit 11, so that the user management unit 11 updates and saves the new state of the target task. In addition, the dialogue management module 1 executes the corresponding action according to the intention corresponding to the semantic recognition result and returns the corresponding answer. Meanwhile, the system obtains a track list by recording the track information of the state transitions. The state transition process of the target task can then be determined from the track list corresponding to the target task, so that the target task can be audited later.
In this way, the transition relationships between information interaction states are established based on a finite state machine, which improves the accuracy of state transitions and improves the user experience.
Fig. 2 is a schematic diagram of a part of the drawing board of the embodiment of the invention. As shown in fig. 2, the "confirm boarding" part of calling a taxi by telephone is taken as an example, where whether the user has boarded is determined at node 21 through information interaction. The states corresponding to node 21 may include "not boarded", "boarded" and "other", where "other" characterizes the case in which whether the user has boarded cannot be derived from the user's input voice. When the state determination result at node 21 is "boarded", the system is controlled to return a "boarding closing phrase", for example "We confirm that you have boarded; please ride safely", and the like. When the result is "not boarded", the system is controlled to return a "not-boarded closing phrase", for example "The vehicle has arrived; please board as soon as possible", and the like. When the result is "other", the system is controlled to return an "other-case phrase", for example a phrase asking the user again to confirm whether they have boarded, and whether the user has boarded is then determined from the voice input by the user. If the state determination result of this confirmation is "boarded", the "boarding closing phrase" is returned. If it is "not boarded", the "not-boarded closing phrase" is returned. If it is still "other", a "manual hand-over phrase" is returned, for example "We could not identify your intention; transferring you to a human customer-service agent", and the like. In this way, the transition relationships between information interaction states are established based on a finite state machine, which improves the accuracy of state transitions and improves the user experience.
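For the "confirm boarding" node of FIG. 2, the response selection could look roughly like the sketch below; the reply wording and function names are illustrative assumptions.

```python
BOARDING_REPLIES = {
    "boarded": "We confirm that you have boarded; please ride safely.",
    "not_boarded": "The vehicle has arrived; please board as soon as possible.",
    "other": "Sorry, could you confirm again whether you have boarded?",
}

def confirm_boarding_reply(result: str, already_retried: bool = False) -> str:
    """Return the reply for node 21; a second 'other' result hands over to a human agent."""
    if result == "other" and already_retried:
        return "We could not identify your intention; transferring you to a human customer-service agent."
    return BOARDING_REPLIES.get(result, BOARDING_REPLIES["other"])
```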
Fig. 3 is a flowchart of an information interaction method according to an embodiment of the present invention. As shown in fig. 3, the information interaction method according to the embodiment of the present invention includes the following steps:
step S110, receiving a current voice input by the target user. In the car booking process in a network car booking application scene, after call connection between a user terminal and a server is established, a user sends a voice 'I want to make a car and go to a place B from a place A' to the server through the user terminal. Optionally, the user may dial the network car-booking fixed phone through the user terminal to establish a call connection with the server, or dial the network phone through the network car-booking applet embedded in the network car-booking APP or other APPs to establish a call connection with the server, which is not limited in this embodiment.
Step S120, determining at least one behavior type corresponding to the current voice. Taking the online car-hailing application scenario as an example, the behavior types may include the various scenarios in the car-hailing task, for example a car-booking scenario, an order modification scenario, an order cancellation scenario, an order query scenario, and the like.
In an optional implementation manner, step S120 may specifically be: inputting the text information corresponding to the current voice into a type determination model for processing to obtain the behavior type of the current voice.
Optionally, in this embodiment, speech recognition is performed on the current voice by an ASR (Automatic Speech Recognition) method to obtain the text information corresponding to the current voice. Further, the text directly recognized from the current voice may contain errors due to environmental noise, the user's accent, and so on. Therefore, in this embodiment, ASR is used to obtain an initial text corresponding to the current voice, and the initial text is then corrected to obtain the text information corresponding to the current voice. Optionally, correcting the initial text includes error correction and stop-word removal: the error correction fixes errors in the recognized text, and this embodiment uses an error correction model based on an n-gram algorithm for this purpose; stop-word removal deletes punctuation marks and mood words that do not contribute to semantic understanding. In this way, the accuracy of the acquired text information can be improved, and the accuracy of intention recognition is further improved.
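A minimal sketch of this preprocessing step is shown below, assuming the n-gram error-correction model is available as a separate component (stubbed here); the stop-word list is purely illustrative.

```python
import re

MOOD_WORDS = {"嗯", "啊", "呃", "哦"}          # illustrative mood words with no semantic content

def ngram_correct(text: str) -> str:
    """Stub for the n-gram-based error-correction model described above."""
    return text

def preprocess_asr_text(initial_text: str) -> str:
    """Correct the initial ASR text and remove stop words (punctuation and mood words)."""
    corrected = ngram_correct(initial_text)
    no_punct = re.sub(r"[^\w\s]", "", corrected)            # drop punctuation marks
    return "".join(ch for ch in no_punct if ch not in MOOD_WORDS)
```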
In another optional implementation manner, step S120 may specifically be: inputting the text information corresponding to the current voice, the feature vector of the pinyin of the text information, the feature vector of each character in the text information and a randomly initialized character vector into the type determination model for processing, and obtaining a predetermined number of behavior types to which the current voice belongs. In this embodiment, adding the pinyin vector of the text information mitigates the loss of accuracy caused by wrongly written characters that may appear in the recognized text, while adding the pre-trained feature vector of each character in the text information and the randomly initialized word vector mitigates the out-of-vocabulary (OOV) problem (for example, when unregistered words appear) and enhances the generalization capability of the type determination model. The type recognition accuracy of the type determination model, and hence the accuracy of subsequent intention recognition, can thus be improved.
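The combined input described above might be assembled roughly as follows; the embedding dimensions, lookup tables and the `to_pinyin` helper are assumptions introduced for illustration.

```python
import numpy as np

def build_type_model_input(chars, char_emb, pinyin_emb, rand_emb, to_pinyin):
    """Per character, concatenate the pre-trained character vector, the pinyin vector and a
    randomly initialized vector, then stack them into the type determination model's input."""
    rows = []
    for ch in chars:
        rows.append(np.concatenate([
            char_emb.get(ch, np.zeros(128)),                   # pre-trained character vector
            pinyin_emb.get(to_pinyin(ch), np.zeros(32)),       # pinyin vector (robust to homophone typos)
            rand_emb.setdefault(ch, np.random.randn(32)),      # randomly initialized vector (helps with OOV)
        ]))
    return np.stack(rows)                                      # shape: (sequence length, 192)
```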
In an alternative implementation, the type determination model is obtained by pre-training on type training data, where the type training data may include voice data of each behavior type and a type label for each piece of voice data. In another optional implementation manner, during the training of the type determination model, the pinyin vector of the text corresponding to the voice data, the pre-trained word vector of each word and a randomly initialized word vector are added to improve the accuracy of data recognition and enhance the generalization capability of the model.
Optionally, the type recognition model of the present embodiment is a TextCNN model to balance data processing efficiency and recognition effect. In other alternative implementations, an RNN model or the like may also be used to perform type classification, and the type of the model is not limited in this embodiment.
In the present embodiment, the TextCNN model includes an input layer (i.e., a word embedding layer), a convolutional layer, a pooling layer and a fully connected (classification) layer. Optionally, in this embodiment, each entity is input to the word embedding layer as a whole, as a single token, to obtain its word vector, which enhances the model's recognition of specific types. Optionally, the pooling layer uses k-max pooling to retain more local information and outputs k behavior types and their corresponding scores. Optionally, k may be 3; it should be understood that the value of k may be set according to the specific application scenario, and this embodiment does not limit it.
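A minimal PyTorch sketch of such a TextCNN with k-max pooling is given below; the hyper-parameters are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, num_filters=64,
                 kernel_sizes=(2, 3, 4), num_types=10, k=3):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, emb_dim)                        # word embedding layer
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, num_filters, ks)
                                   for ks in kernel_sizes)                    # convolutional layer
        self.fc = nn.Linear(num_filters * len(kernel_sizes) * k, num_types)   # classification layer

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, emb_dim, seq_len)
        pooled = []
        for conv in self.convs:
            feat = torch.relu(conv(x))                  # (batch, num_filters, L)
            pooled.append(feat.topk(self.k, dim=-1).values.flatten(1))  # k-max pooling
        logits = self.fc(torch.cat(pooled, dim=1))
        scores = logits.softmax(dim=-1)
        return scores.topk(self.k, dim=-1)              # top-k behavior types and their scores
```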
Step S130, an entity word slot corresponding to the current voice is obtained from each behavior type. Optionally, in the word slot extraction process of this embodiment, the BIO labeling method is used for sequence labeling, so as to reduce the number of introduced tag types and improve data processing efficiency. Here, B denotes the beginning of a noun phrase, I denotes the inside of a noun phrase, and O denotes a token that is not part of a noun phrase. It should be understood that other sequence labeling schemes, such as BIOES, may also be used; this embodiment is not limited in this respect.
In an optional implementation manner, word slot extraction is performed on the sentences of each behavior type according to a word slot extraction model to obtain a plurality of entity word slots corresponding to the current voice. Optionally, word slot extraction is first performed on the sentences of each behavior type according to the word slot extraction model to obtain a plurality of entity word slots, and the obtained entity word slots are then filtered based on the current voice to obtain the entity word slots corresponding to the current voice, as sketched below. For example, the text information of the current voice is segmented to obtain entity words, semantic similarity is computed between the entity words of the current voice and the extracted entity word slots, and the entity word slots whose similarity satisfies a condition are determined as the entity word slots corresponding to the current voice. Taking the car-booking behavior type in the online car-hailing application scenario as an example, the sentences of this type may include "I want to take a car from place A to place B", "I want to take a car", "I want to go to place B", "I want to go to the square beside the train station", and the extracted entity word slots may include "take a car", "place A", "place B", "train station", "square", and so on. If the current voice is "Get a car and go to the train station from place A", the entity word slots corresponding to the current voice may include "get a car", "train station", and so on.
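The filtering of extracted word slots against the current voice could be sketched as below; the similarity function and threshold are assumptions for illustration.

```python
def slots_for_current_voice(current_entities, extracted_slots, similarity, threshold=0.8):
    """Keep only the entity word slots that are semantically similar to an entity word
    segmented from the current voice."""
    return [slot for slot in extracted_slots
            if any(similarity(entity, slot) >= threshold for entity in current_entities)]

# e.g. slots_for_current_voice(["get a car", "train station"],
#                              ["take a car", "place A", "place B", "train station", "square"],
#                              similarity=some_semantic_similarity)
#      might return ["take a car", "train station"]
```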
Sequence labeling is a basic task of natural language processing and includes part-of-speech tagging, Chinese word segmentation, slot recognition and the like. Optionally, the word slot extraction model in this embodiment is a Bert-Crf model: a pre-trained Bert model is fine-tuned and combined with a Crf layer, which improves the sequence labeling effect. Taking the online car-hailing application scenario as an example, the voice interaction data between the server and the client during the creation and execution of car-hailing tasks can be collected as training corpora with which to fine-tune the pre-trained Bert model. Because the pre-trained Bert model has already learned a large amount of language information in a high-dimensional space, a good effect can be achieved for the corresponding type with only a small amount of training corpus.
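A minimal Bert-CRF tagger could be sketched as below. The patent does not name any library; the sketch assumes the `transformers` and `pytorch-crf` packages and a Chinese Bert checkpoint.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF                      # from the pytorch-crf package (an assumption)

class BertCrfTagger(nn.Module):
    """Bert encoder + CRF layer for BIO word-slot labeling."""
    def __init__(self, num_tags, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.emissions = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(hidden)
        mask = attention_mask.bool()
        if tags is not None:                                   # fine-tuning: minimize -log p(tags)
            return -self.crf(scores, tags, mask=mask)
        return self.crf.decode(scores, mask=mask)              # inference: best BIO tag sequences
```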
Step S140, recalling at least one similar sentence corresponding to the current voice from the sentences corresponding to each entity word slot.
FIG. 4 is a flowchart of a similar statement recall method according to an embodiment of the present invention. In an alternative implementation, step S140 includes:
in step S141, a first feature vector is acquired. The first feature vector represents the feature vector of the text information corresponding to the current voice. Optionally, the input text information corresponding to the current voice is obtained by processing the current voice. In an alternative implementation manner, an initial text corresponding to the current voice is determined based on an automatic speech recognition method (e.g., ASR), and the initial text is corrected to obtain the input text information corresponding to the current voice. Optionally, correcting the initial text includes error correction and stop-word removal: the error correction fixes errors in the recognized text, and this embodiment uses an error correction model based on an n-gram algorithm for this purpose; stop-word removal deletes punctuation marks and mood words that do not contribute to semantic understanding.
Fig. 5 is a flowchart of a feature vector determination method according to an embodiment of the present invention. In an alternative implementation, as shown in fig. 5, step S141 includes:
step S141A, inputting the text information corresponding to the current speech into the first vector calculation model for processing, and obtaining a first vector. Optionally, the first vector calculation model is a model based on the BM25 algorithm. In the embodiment, the feature vector of the input text information based on the characteristic information statistics is determined according to the model of the BM25 algorithm, so that the calculation amount can be reduced, and the data processing speed can be increased. The BM25 algorithm is an algorithm for evaluating the relevance between search terms and documents, and is an algorithm proposed based on a probabilistic search model.
Step S141B, the text information corresponding to the current voice is input into a second vector calculation model for processing to obtain a second vector. Optionally, the second vector calculation model is a deep learning model obtained by unsupervised learning. Optionally, the second vector calculation model is a model based on the Bert-Ada algorithm. The Bert-Ada-based model is a task-adaptive small model obtained by compressing the Bert model through Differentiable Neural Architecture Search (DNAS); the structure and knowledge of the small model can be adapted to the task to be executed. In this way, the complexity of the model can be reduced and the data processing efficiency improved while preserving the semantic expression capability of the feature vector.
Step S141C, concatenating the first vector and the second vector to obtain a first feature vector.
In this way, the statistics-based feature vector and the deep semantic vector representation are spliced to obtain the feature vector of the text information corresponding to the current voice, which further improves the semantic expression capability of the feature vector and thus the accuracy of intention recognition.
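One possible reading of steps S141A-S141C is sketched below: a BM25-style term-weight vector over a fixed vocabulary as the statistics-based part, concatenated with a dense sentence embedding. The exact construction is not specified in the patent, so this should be treated as an assumption.

```python
import math
import numpy as np

def bm25_term_vector(tokens, vocab, doc_freq, n_docs, k1=1.5, b=0.75, avg_len=10.0):
    """BM25-style weight for each vocabulary term occurring in the query text."""
    vec = np.zeros(len(vocab))
    for i, term in enumerate(vocab):
        tf = tokens.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        vec[i] = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
    return vec

def first_feature_vector(tokens, dense_encoder, vocab, doc_freq, n_docs):
    """Splice the statistics-based vector (S141A) with the deep semantic vector (S141B)."""
    sparse = bm25_term_vector(tokens, vocab, doc_freq, n_docs)
    dense = dense_encoder(tokens)              # e.g. a compressed Bert-style sentence encoder
    return np.concatenate([sparse, dense])
```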
In step S142, a plurality of second feature vectors are obtained. The second feature vectors respectively represent the feature vectors of the sentences corresponding to the entity word slots, or the feature vectors of the sentences in the intention sentence library. The intention sentence library stores at least one sentence for each intention; that is, each intention has at least one corresponding sentence. For example, an intention expressing "boarded" may correspond to sentences such as "OK, I am in the car", "Yes, I have boarded", "Right, already boarded", and so on.
Optionally, all sentences in the intention sentence library may be used to improve accuracy, or only the sentences containing at least one entity word slot corresponding to the current voice may be obtained from the intention sentence library to reduce the amount of computation. Alternatively, the feature vectors of the sentences in the intention sentence library may be computed in advance and stored, based on a method similar to steps S141A-S141C, which is not repeated here.
Step S143, the similarity between the first feature vector and each second feature vector is calculated. Optionally, the similarity may be computed as the cosine similarity, the Euclidean distance, the Chebyshev distance, the Manhattan distance, and the like between the first feature vector and each second feature vector, or it may be calculated by a neural network model; this embodiment does not limit this.
Step S144, at least one similar sentence is recalled according to the similarities. In an optional implementation manner, the sentences in the intention sentence library, or the sentences corresponding to the obtained entity word slots, are sorted in descending order of similarity, and the k sentences with the highest similarity are taken as the similar sentences, where k is greater than or equal to 1.
In this way, this embodiment recalls at least one similar sentence by calculating the similarity between the text information corresponding to the current voice and the sentences corresponding to each obtained entity word slot (or each sentence in the intention sentence library), which improves recall efficiency.
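Steps S142-S144 amount to a nearest-neighbour recall by cosine similarity, roughly as in the sketch below (assuming the vectors are stored as NumPy arrays).

```python
import numpy as np

def recall_similar_sentences(first_vec, second_vecs, sentences, k=3):
    """Return the k sentences whose feature vectors are most similar to the first feature vector."""
    q = first_vec / (np.linalg.norm(first_vec) + 1e-8)
    m = second_vecs / (np.linalg.norm(second_vecs, axis=1, keepdims=True) + 1e-8)
    sims = m @ q                                   # cosine similarities (step S143)
    top = np.argsort(-sims)[:k]                    # descending order, keep top-k (step S144)
    return [(sentences[i], float(sims[i])) for i in top]
```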
Step S150, the obtained current feature information is input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, where the current feature information at least includes each similar sentence and the entity word slot information. Optionally, the entity word slot information may include the entity word slots and their corresponding scores from the word slot extraction model.
In an optional implementation manner, the current feature information further includes the text information corresponding to the current voice, each behavior type of the current voice with its corresponding score, and the similarity ranking information of each similar sentence (that is, the similarities between the text information and each similar sentence computed in the recall step S140, and their ranking). This embodiment performs a posterior re-ranking based on the coarse-grained behavior type classification result, the extraction result of the entity word slots corresponding to the current voice and the text similarity calculation result to obtain the intention corresponding to the current voice, which improves the accuracy of intention recognition.
In another optional implementation manner, the current feature information further includes the current task state of the target task and the historical interaction information corresponding to the target task. Taking an online car-hailing order as an example, the current task state of the target task may include a dispatching state, a driver order-accepted state, a boarding state, an alighting state, and the like. Taking the current task state as the "boarding state", the historical interaction information includes the dialogue record after the user calls the car-hailing reservation number, for example: "I am at the east gate of xx residential compound and want to go to the west gate of xx university", "OK, dispatching an order for you", "Dear passenger, the white xx car with license plate number xxxx has accepted your order and is currently 1 km away from you; please wait", "Dear passenger, the car you booked has arrived at the pick-up point", and the like. In this way, the accuracy of intention recognition can be further improved according to the current task state of the target task and the historical interaction information corresponding to the target task, further improving the user experience.
Optionally, the intention determination model of this embodiment is an ensemble model. The ensemble model is a model integration framework that includes a plurality of classifiers; each classifier may use a different machine learning method or the same machine learning method, and the classifiers can complement one another and compensate for their individual weaknesses. In this embodiment, the text information corresponding to the current voice, the obtained behavior types and their scores, the obtained entity word slots and their scores, the obtained similar sentences and their similarities, the current task state of the target task and the historical interaction information corresponding to the target task are input into the ensemble model; each classifier in the ensemble model processes the feature information to comprehensively rank the similar sentences, and the intention corresponding to the highest-scoring similar sentence is output. In this way, the accuracy of intention recognition can be further improved, further improving the user experience.
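The ensemble re-ranking could be sketched, in a greatly simplified form, as a weighted combination of several scorers over each candidate similar sentence; how the patent actually combines the classifiers is not specified, so the combination rule below is an assumption.

```python
def ensemble_rerank(candidates, scorers, weights=None):
    """Score each candidate (a dict bundling the current feature information for one similar
    sentence) with every scorer, and return the intention of the highest-scoring candidate."""
    weights = weights or [1.0] * len(scorers)
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(w * s(cand) for w, s in zip(weights, scorers))
        if score > best_score:
            best, best_score = cand, score
    return best["intent"], best_score
```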
Step S160, the corresponding operation is executed according to the intention, and the execution result is controlled to be returned. Taking the online car-hailing application scenario as an example, if the intention obtained in step S150 is "query order status", the order query operation is executed and the queried order status is controlled to be returned to the user terminal, for example "Your order is being dispatched", or "Your order has been accepted by driver xxx, who is currently 1 km away from you and is expected to arrive in 2 minutes", and so on. For another example, if the obtained intention is "boarded", a "boarding closing phrase" is returned to the target user, for example "We confirm that you have boarded; please ride safely", and so on.
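Executing the operation for a recognized intention is essentially a dispatch step; a hypothetical sketch (the order-service and TTS interfaces are assumptions) is:

```python
def execute_intent(intent, order_service, tts):
    """Execute the operation corresponding to the intention and control the result to be returned."""
    if intent == "query_order_status":
        tts.say(f"Your order status: {order_service.query_status()}")
    elif intent == "boarded":
        tts.say("We confirm that you have boarded; please ride safely.")
    elif intent == "book_car":
        order_id = order_service.create_order()
        tts.say(f"Order {order_id} has been created for you and is being dispatched.")
    else:
        tts.say("Sorry, we could not identify your intention; transferring you to a human agent.")
```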
In an optional implementation manner, the information interaction method of this embodiment further includes: jumping to a new task state according to the obtained intention and the current task state of the target task. Taking the online car-hailing application scenario as an example, if the obtained intention is "boarded", the current task state of the target task jumps to a new node whose corresponding states may include "alighted", "not alighted" and others.
In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.
FIG. 6 is a schematic diagram of an intention recognition data processing procedure according to an embodiment of the invention. As shown in fig. 6, taking the car-booking application scenario as an example, the user dials the car-hailing fixed-line number through the user terminal and, after the call is connected, inputs the current voice. The speech recognition unit 61 in the server performs speech recognition on the current voice input by the user and obtains the text information "I want to take a car from place A to place B" corresponding to the current voice, and the type determination model 62 determines the corresponding behavior types from this text information, including the "car-booking behavior type", the "order query behavior type" and the "order modification behavior type". The type determination model 62 determines the probability (i.e., score) that the text information belongs to each behavior type and outputs the three behavior types with the highest scores. The word slot extraction model 63 performs word slot extraction on the sentences of each behavior type to obtain a plurality of entity word slots, and filters the obtained entity word slots based on the current voice to obtain the entity word slots corresponding to the current voice, including "book a car", "place A", "place B", and so on. The sentence acquisition unit 64 acquires the sentences corresponding to the entity word slots "book a car", "place A", "place B", and so on, including "I want to take a car from place A to place B", "I want to query an order from place A to place B", "I want to modify an order from place A to place B", and so on. The sentence recall model 65 calculates the similarity between the text information of the current voice and each sentence, sorts the sentences by similarity, and obtains the K most similar sentences, including "I want to take a car from place A to place B", and so on. The ensemble model 66 processes the input similar sentences and their similarities, the text information corresponding to the current voice, the obtained behavior types and their scores, the obtained entity word slots and their scores, the current task state and the historical interaction information, and outputs the corresponding intention, namely "book a car: from place A to place B". The execution unit 67 creates an online car-hailing order according to the intention and returns the execution result to the user terminal 68, for example, by sending the voice broadcast "An order from place A to place B has been created for you and is being dispatched" to the user terminal.
In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.
FIG. 7 is a schematic diagram of an information interaction system according to an embodiment of the invention. As shown in fig. 7, the information interaction system 7 of the embodiment of the present invention includes a dialogue management module 71 and a semantic understanding module 72. The dialogue management module 71 is configured to receive the current voice sent by the user terminal, determine the answer corresponding to the intention determined by the semantic understanding module 72 (that is, determine the target voice), and send the target voice to the user terminal. The semantic understanding module 72 recognizes the user's intention according to the current task state, the historical interaction information and the user's current voice, controls the dialogue management module to take the corresponding action, and controls the answer corresponding to the intention to be returned. The semantic understanding module 72 may determine the intention corresponding to the current input sentence based on steps S110 to S150, which are not repeated here.
In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.
FIG. 8 is a diagram of an information interaction apparatus according to an embodiment of the present invention. As shown in fig. 8, the information interaction apparatus 8 according to the embodiment of the present invention includes a receiving unit 81, a type determining unit 82, a word slot extracting unit 83, a sentence recalling unit 84, an intention determining unit 85, and an executing unit 86.
The receiving unit 81 is configured to receive a current voice input by a target user.
The type determining unit 82 is configured to determine at least one behavior type corresponding to the current speech. In an optional implementation manner, the type determining unit 82 is further configured to input text information corresponding to the current speech into a type determination model for processing, and obtain a predetermined number of behavior types to which the current speech belongs. In another optional implementation manner, the type determining unit 82 is further configured to input the text information corresponding to the current voice, the feature vector of the pinyin of the text information, the feature vector of each word in the text information, and the random initialization word vector into a type determination model for processing, and obtain a predetermined number of behavior types to which the current voice belongs.
The word slot extracting unit 83 is configured to extract an entity word slot in each of the behavior types. In an optional implementation manner, the word slot extraction unit 83 is further configured to perform word slot extraction on the sentences in each behavior type according to a word slot extraction model, so as to obtain a plurality of entity word slots.
The sentence recalling unit 84 is configured to recall at least one similar sentence corresponding to the current voice from the sentences corresponding to each of the entity word slots.
In an alternative implementation, the statement recall unit 84 includes a first vector acquisition subunit, a second vector acquisition subunit, a similarity calculation subunit, and a sentence recall subunit. The first vector acquisition subunit is configured to acquire a first feature vector, where the first feature vector represents the feature vector of the text information corresponding to the current voice. The second vector acquisition subunit is configured to acquire a plurality of second feature vectors, where the second feature vectors respectively represent the feature vectors of the sentences corresponding to the entity word slots or the feature vectors of the sentences in the intention sentence library. The similarity calculation subunit is configured to calculate the similarity between the first feature vector and each second feature vector. The sentence recall subunit is configured to recall at least one similar sentence according to the similarities.
In an optional implementation manner, the first vector obtaining subunit includes a first vector obtaining module, a second vector obtaining module, and a first feature vector obtaining module. The first vector acquisition module is configured to input the text information corresponding to the current voice into a first vector calculation model for processing, and acquire a first vector. The second vector acquisition module is configured to input the text information corresponding to the current voice into a second vector calculation model for processing, and acquire a second vector. The first feature vector acquisition module is configured to splice the first vector and the second vector to acquire the first feature vector.
The intention determining unit 85 is configured to input the obtained current feature information into a pre-trained intention determining model for processing, and obtain an intention corresponding to the current speech, where the current feature information at least includes each of the similar sentences. Optionally, the current feature information further includes text information corresponding to the current speech, each behavior type and corresponding score, and similarity ranking information of each similar sentence. Optionally, the current feature information further includes a current task state of a target task and historical interaction information corresponding to the target task.
The execution unit 86 is configured to execute the corresponding operation according to the intent and control the return of the execution result.
In an alternative implementation, the information interaction device 8 further includes a voice processing unit. The voice processing unit is configured to process the current voice and acquire the text information corresponding to the current voice. Optionally, the voice processing unit includes a speech recognition subunit and a correction subunit. The speech recognition subunit is configured to perform speech recognition on the current voice by a speech recognition method and acquire the corresponding initial text. The correction subunit is configured to correct the initial text and acquire the text information.
In an alternative implementation, the information interaction device 8 further includes a state transition unit. The state transition unit is configured to jump to a new task state based on the intent and a current task state of the target task.
In an alternative implementation manner, the information interaction apparatus 8 further includes a transfer relationship creating unit. The transfer relation creating unit is configured to create a transfer relation between the information interaction states based on the state machine.
In this embodiment, at least one behavior type corresponding to the current voice input by the target user is determined, an entity word slot is extracted from each behavior type, and at least one similar sentence corresponding to the current voice is recalled from the sentences corresponding to each entity word slot; at least the similar sentences are then input into a pre-trained intention determination model for processing to obtain the intention corresponding to the current voice, and the corresponding operation is executed according to the intention, with the execution result controlled to be returned. In this way, the accuracy of intention determination can be improved, a correct response can be made based on the intention, and the user experience can be improved.
Fig. 9 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 9, the electronic device 9 is a general-purpose data processing apparatus comprising a general-purpose computer hardware structure including at least a processor 91 and a memory 92. The processor 91 and the memory 92 are connected by a bus 93. The memory 92 is adapted to store instructions or programs executable by the processor 91. The processor 91 may be a stand-alone microprocessor or may be a collection of one or more microprocessors. Thus, the processor 91 implements the processing of data and the control of other devices by executing instructions stored by the memory 92 to perform the method flows of embodiments of the present invention as described above. The bus 93 connects the above components together, and also connects the above components to a display controller 94 and a display device and an input/output (I/O) device 95. Input/output (I/O) devices 95 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, the input/output devices 95 are coupled to the system through an input/output (I/O) controller 96.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention relates to a computer program product for causing a computer to perform some or all of the above method embodiments when the computer program product runs on a computer.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be accomplished by specifying the relevant hardware through a program, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (26)

1. An information interaction method, characterized in that the method comprises:
receiving current voice input by a target user;
recognizing the text information of the current voice to determine at least one behavior type corresponding to the current voice;
performing word slot extraction on the sentences in each behavior type to obtain an entity word slot corresponding to the current voice;
recalling at least one similar sentence corresponding to the current voice, wherein the similar sentence comprises at least one entity word slot corresponding to the current voice;
inputting the obtained current feature information into a pre-trained intention determining model for processing, and obtaining an intention corresponding to the current voice, wherein the current feature information at least comprises information of each similar sentence and each entity word slot;
and executing corresponding operation according to the intention and controlling to return an execution result.
2. The method of claim 1, wherein determining at least one behavior type corresponding to the current voice comprises:
and inputting the text information corresponding to the current voice into a type determination model for processing, and acquiring a predetermined number of behavior types to which the current voice belongs.
3. The method of claim 1, wherein determining at least one behavior type corresponding to the current voice comprises:
inputting the text information corresponding to the current voice, the characteristic vector of the pinyin of the text information, the characteristic vector of each word in the text information and the random initialization word vector into a type determination model for processing, and acquiring a predetermined number of behavior types to which the current voice belongs.
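
A minimal sketch of how the four inputs named in claim 3 (text information, the pinyin feature vector, per-word feature vectors, and a randomly initialized word vector) could be combined before classification. Averaging the per-word vectors and using a linear scorer are assumptions that go beyond the claim wording.

import numpy as np

rng = np.random.default_rng(0)

def assemble_type_model_input(text_vec: np.ndarray,
                              pinyin_vec: np.ndarray,
                              word_vecs: np.ndarray,
                              random_dim: int = 16) -> np.ndarray:
    # Concatenate the text-level vector, the pinyin feature vector, the
    # (here averaged) per-word feature vectors, and a randomly initialized word vector.
    random_init = rng.normal(size=random_dim)
    word_avg = np.mean(word_vecs, axis=0)
    return np.concatenate([text_vec, pinyin_vec, word_avg, random_init])

def top_k_behavior_types(features: np.ndarray,
                         weights: np.ndarray,
                         type_names: list,
                         k: int = 2) -> list:
    # Toy linear scorer returning a predetermined number (k) of behavior types with scores.
    scores = weights @ features
    best = np.argsort(scores)[::-1][:k]
    return [(type_names[i], float(scores[i])) for i in best]
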
4. The method of claim 1, wherein obtaining the entity word slot corresponding to the current voice from each behavior type comprises:
and performing word slot extraction on the sentences in each behavior type according to a word slot extraction model to obtain a plurality of entity word slots.
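
Claim 4 leaves the word slot extraction model unspecified; the sketch below only shows a common post-processing of a BIO-tagged model output into entity word slots, and is an assumption rather than the patented method.

from typing import List, Tuple

def decode_entity_word_slots(tokens: List[str], bio_tags: List[str]) -> List[Tuple[str, str]]:
    # Collect (slot_type, value) pairs from a BIO-tagged token sequence.
    slots: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):
            if current_type is not None:
                slots.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type is not None:
                slots.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        slots.append((current_type, "".join(current_tokens)))
    return slots

# Example with made-up tags: a destination slot spanning two tokens.
# decode_entity_word_slots(["去", "首都", "机场"], ["O", "B-destination", "I-destination"])
# -> [("destination", "首都机场")]
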
5. The method of claim 1, wherein recalling information of at least one similar sentence corresponding to the current voice comprises:
acquiring a first feature vector, wherein the first feature vector represents a feature vector of text information corresponding to the current voice;
obtaining a plurality of second feature vectors, wherein the plurality of second feature vectors respectively represent feature vectors of sentences corresponding to the entity word slots or feature vectors of the sentences in the intention word library;
calculating the similarity between the first feature vector and each second feature vector;
recalling at least one similar sentence according to each similarity.
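
As a sketch of the recall step of claim 5, assuming cosine similarity (the claim only requires a similarity measure) and precomputed feature vectors:

import numpy as np

def recall_similar_sentences(first_vec: np.ndarray,
                             second_vecs: np.ndarray,
                             sentences: list,
                             top_k: int = 3) -> list:
    # Cosine similarity between the first feature vector and every second feature vector.
    first = first_vec / np.linalg.norm(first_vec)
    seconds = second_vecs / np.linalg.norm(second_vecs, axis=1, keepdims=True)
    similarities = seconds @ first
    order = np.argsort(similarities)[::-1][:top_k]
    return [(sentences[i], float(similarities[i])) for i in order]
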
6. The method of claim 5, wherein obtaining the first feature vector comprises:
inputting the text information corresponding to the current voice into a first vector calculation model for processing to obtain a first vector;
inputting the text information corresponding to the current voice into a second vector calculation model for processing to obtain a second vector;
and splicing the first vector and the second vector to obtain the first feature vector.
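
Claim 6 obtains the first feature vector by splicing the outputs of two vector calculation models; the models themselves are not described here, so the sketch simply concatenates two precomputed vectors under the assumption that "splicing" means concatenation.

import numpy as np

def splice_first_feature_vector(vector_from_model_one: np.ndarray,
                                vector_from_model_two: np.ndarray) -> np.ndarray:
    # Splice (concatenate) the two model outputs into the first feature vector.
    return np.concatenate([vector_from_model_one, vector_from_model_two])
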
7. The method according to any one of claims 1 to 6, wherein the current feature information further includes text information corresponding to the current speech, each behavior type and corresponding score, and similarity ranking information of each similar sentence.
8. The method according to any one of claims 1-6, wherein the current feature information further includes a current task state of a target task and historical interaction information corresponding to the target task.
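
Claims 7 and 8 enumerate additional fields of the current feature information; one hypothetical way to assemble them is shown below. The dictionary layout and field names are assumptions, not part of the claims.

from typing import Any, Dict, List, Tuple

def build_current_feature_information(
    similar_sentences: List[Tuple[str, float]],   # (sentence, similarity)
    entity_word_slots: List[str],
    text_information: str,
    behavior_types_with_scores: List[Tuple[str, float]],
    current_task_state: str,
    historical_interactions: List[Dict[str, Any]],
) -> Dict[str, Any]:
    # Rank the recalled sentences by similarity and gather all fields into one structure.
    ranked = sorted(similar_sentences, key=lambda item: item[1], reverse=True)
    return {
        "similar_sentences": [sentence for sentence, _ in ranked],
        "similarity_ranking": {sentence: rank for rank, (sentence, _) in enumerate(ranked, start=1)},
        "entity_word_slots": entity_word_slots,
        "text_information": text_information,
        "behavior_types": behavior_types_with_scores,
        "current_task_state": current_task_state,
        "history": historical_interactions,
    }
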
9. The method according to any one of claims 1-6, further comprising:
and processing the current voice to acquire text information corresponding to the current voice.
10. The method of claim 9, wherein processing the current voice and obtaining text information corresponding to the current voice comprises:
performing voice recognition on the current voice by adopting a voice recognition method to obtain a corresponding initial text;
and correcting the initial text to acquire the text information.
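
Claim 10 only states that the initial ASR text is corrected to obtain the text information; the correction strategy below, a plain substitution table over common mis-recognitions, is an illustrative assumption.

def correct_initial_text(initial_text: str, correction_table: dict) -> str:
    # Replace known mis-recognitions in the ASR output to obtain the text information.
    corrected = initial_text
    for wrong, right in correction_table.items():
        corrected = corrected.replace(wrong, right)
    return corrected

# Illustrative usage with a made-up confusion pair.
# correct_initial_text("导航到北经站", {"北经": "北京"})  # -> "导航到北京站"
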
11. The method of claim 1, further comprising:
and jumping to a new task state according to the intention and the current task state of the target task.
12. The method of claim 1, further comprising:
and creating a transition relation between the information interaction states based on the state machine.
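
For claims 11 and 12, a state machine that stores transition relations between information interaction states and jumps to a new task state from the current task state and the recognized intention could look like the following sketch; the states and intentions shown are made up for illustration.

from typing import Dict, Tuple

class InteractionStateMachine:
    # Transition relations map a (current task state, intention) pair to a new task state.

    def __init__(self) -> None:
        self.transitions: Dict[Tuple[str, str], str] = {}

    def add_transition(self, current_state: str, intention: str, new_state: str) -> None:
        self.transitions[(current_state, intention)] = new_state

    def jump(self, current_state: str, intention: str) -> str:
        # Stay in the current state when no transition relation is defined.
        return self.transitions.get((current_state, intention), current_state)

# Illustrative usage with made-up states and intentions.
state_machine = InteractionStateMachine()
state_machine.add_transition("awaiting_destination", "confirm_destination", "awaiting_departure_time")
new_state = state_machine.jump("awaiting_destination", "confirm_destination")
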
13. An information interaction apparatus, the apparatus comprising:
a receiving unit configured to receive a current voice input by a target user;
the type determining unit is configured to recognize text information of the current voice to determine at least one behavior type corresponding to the current voice;
a word slot extraction unit configured to perform word slot extraction on the sentences in each behavior type to obtain an entity word slot corresponding to the current voice;
a sentence recalling unit configured to recall at least one similar sentence corresponding to the current voice from sentences corresponding to each of the entity word slots;
the intention determining unit is configured to input the obtained current feature information into a pre-trained intention determining model for processing, and obtain an intention corresponding to the current voice, wherein the current feature information at least comprises each similar sentence;
and the execution unit is configured to execute the corresponding operation according to the intention and control return of an execution result.
14. The apparatus according to claim 13, wherein the type determining unit is further configured to input text information corresponding to the current voice into a type determination model for processing, and obtain a predetermined number of behavior types to which the current voice belongs.
15. The apparatus of claim 13, wherein the type determining unit is further configured to input text information corresponding to the current voice, the feature vector of the pinyin of the text information, the feature vector of each word in the text information, and a random initialization word vector into a type determination model for processing, and obtain a predetermined number of behavior types to which the current voice belongs.
16. The apparatus according to claim 13, wherein the word slot extraction unit is further configured to perform word slot extraction on the sentences in each behavior type according to a word slot extraction model, and obtain a plurality of entity word slots.
17. The apparatus of claim 13, wherein the sentence recall unit comprises:
a first vector obtaining subunit, configured to obtain a first feature vector, where the first feature vector represents a feature vector of text information corresponding to the current voice;
a second vector obtaining subunit, configured to obtain a plurality of second feature vectors, where the plurality of second feature vectors respectively represent feature vectors of sentences corresponding to each entity word slot or feature vectors of each sentence in the intention word library;
a similarity calculation subunit configured to calculate the similarity between the first feature vector and each of the second feature vectors;
a sentence recalling subunit configured to recall at least one of the similar sentences according to each of the similarities.
18. The apparatus of claim 17, wherein the first vector obtaining subunit comprises:
the first vector acquisition module is configured to input the text information corresponding to the current voice into a first vector calculation model for processing to acquire a first vector;
the second vector acquisition module is configured to input the text information corresponding to the current voice into a second vector calculation model for processing to acquire a second vector;
a first feature vector obtaining module configured to splice the first vector and the second vector to obtain the first feature vector.
19. The apparatus according to any one of claims 13-18, wherein the current feature information further includes text information corresponding to the current voice, each behavior type and corresponding score, and similarity ranking information of each similar sentence.
20. The apparatus according to any one of claims 13-18, wherein the current feature information further comprises a current task state of a target task and historical interaction information corresponding to the target task.
21. The apparatus according to any one of claims 13-18, further comprising:
and the voice processing unit is configured to process the current voice and acquire text information corresponding to the current voice.
22. The apparatus of claim 21, wherein the speech processing unit comprises:
the voice recognition subunit is configured to perform voice recognition on the current voice by adopting a voice recognition method to acquire a corresponding initial text;
and the correcting subunit is configured to correct the initial text and acquire the text information.
23. The apparatus of claim 13, further comprising:
and the state transition unit is configured to jump to a new task state according to the intention and the current task state of the target task.
24. The apparatus of claim 13, further comprising:
and the transition relation creating unit is configured to create a transition relation between the information interaction states based on the state machine.
25. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-12.
26. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1 to 12.
CN202110247302.9A 2021-03-05 2021-03-05 Information interaction method and device and electronic equipment Active CN113012687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247302.9A CN113012687B (en) 2021-03-05 2021-03-05 Information interaction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247302.9A CN113012687B (en) 2021-03-05 2021-03-05 Information interaction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113012687A CN113012687A (en) 2021-06-22
CN113012687B (en) 2022-05-13

Family

ID=76407414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247302.9A Active CN113012687B (en) 2021-03-05 2021-03-05 Information interaction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113012687B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326365B (en) * 2021-06-24 2023-11-07 中国平安人寿保险股份有限公司 Reply sentence generation method, device, equipment and storage medium
CN114067391A (en) * 2021-10-22 2022-02-18 北京金茂教育科技有限公司 Method and device for identifying behaviors in classroom teaching video

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776936B (en) * 2016-12-01 2020-02-18 上海智臻智能网络科技股份有限公司 Intelligent interaction method and system
CN110674287A (en) * 2018-06-07 2020-01-10 阿里巴巴集团控股有限公司 Method and device for establishing hierarchical intention system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046667A (en) * 2019-11-14 2020-04-21 深圳市优必选科技股份有限公司 Sentence recognition method, sentence recognition device and intelligent equipment
CN111368049A (en) * 2020-02-26 2020-07-03 京东方科技集团股份有限公司 Information acquisition method and device, electronic equipment and computer readable storage medium
CN111783425A (en) * 2020-06-28 2020-10-16 中国平安人寿保险股份有限公司 Intention identification method based on syntactic analysis model and related device
CN112256845A (en) * 2020-09-14 2021-01-22 北京三快在线科技有限公司 Intention recognition method, device, electronic equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Service Task Generation and Cognitive Model Construction Based on Intention Recognition in a Home Environment; 李洁 (Li Jie); China Masters' Theses Full-text Database, Information Science and Technology Series (《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》); 2021-02-15; I138-2768 *

Also Published As

Publication number Publication date
CN113012687A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
CN110765244B (en) Method, device, computer equipment and storage medium for obtaining answering operation
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN108255934B (en) Voice control method and device
US9911412B2 (en) Evidence-based natural language input recognition
CN112800170A (en) Question matching method and device and question reply method and device
CN111191450B (en) Corpus cleaning method, corpus input device and computer readable storage medium
CN113012687B (en) Information interaction method and device and electronic equipment
CN110414005B (en) Intention recognition method, electronic device and storage medium
CN110879837B (en) Information processing method and device
JPWO2007138875A1 (en) Word dictionary / language model creation system, method, program, and speech recognition system for speech recognition
CN110309504B (en) Text processing method, device, equipment and storage medium based on word segmentation
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN111737990B (en) Word slot filling method, device, equipment and storage medium
CN111611358A (en) Information interaction method and device, electronic equipment and storage medium
CN110428816B (en) Method and device for training and sharing voice cell bank
CN115440221A (en) Vehicle-mounted intelligent voice interaction method and system based on cloud computing
CN110750626B (en) Scene-based task-driven multi-turn dialogue method and system
CN113988195A (en) Private domain traffic clue mining method and device, vehicle and readable medium
CN112988992B (en) Information interaction method and device and electronic equipment
CN113051384A (en) User portrait extraction method based on conversation and related device
CN111680514B (en) Information processing and model training method, device, equipment and storage medium
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN116304046A (en) Dialogue data processing method and device, storage medium and electronic equipment
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant