CN115186071A

CN115186071A - Intention recognition method and device, electronic equipment and readable storage medium

Info

Publication number: CN115186071A
Application number: CN202110368045.4A
Authority: CN
Inventors: 王宁宁; 王一秋
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Priority date: 2021-04-06
Filing date: 2021-04-06
Publication date: 2022-10-14

Abstract

The application provides an intention identification method, an intention identification device, an electronic device and a readable storage medium. The method comprises the following steps: acquiring a first text of a plurality of turns of conversations; acquiring a first intention which is identified through a trained first model and corresponds to the first text; under the condition that the first text contains a preset first keyword, performing semantic similarity calculation on the first keyword and the first intention to obtain a first similarity value; and determining a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold value. According to the method and the device, the target intention corresponding to the first text can be combined with the context information of multiple turns of conversations, and therefore the accuracy of intention identification of the text in the multiple turns of conversations can be improved.

Description

Intention recognition method and device, electronic equipment and readable storage medium

Technical Field

The embodiment of the application relates to the field of natural language processing, in particular to an intention identification method, an intention identification device, electronic equipment and a readable storage medium.

Background

With the development of natural language processing technology, intention recognition technology is increasingly applied to man-machine interaction systems in many different vertical fields such as intelligent customer service, intelligent assistant and the like, and plays a great role in life and work of people.

Because people tend to speak in multiple short sentences and to omit information when communicating, it is difficult to convey the content of a conversation clearly in a single turn, and therefore multi-turn dialog technology has great commercial value. In the prior art, common intention recognition is primarily directed to a single-turn conversation. However, if the intention recognition method for a single-turn conversation is used to recognize the intention of texts in a multi-turn conversation, the following problems easily arise: semantic consistency cannot be guaranteed; the extracted semantic features are limited and generalize poorly; and the context is not well understood. As a result, the reliability of intention recognition for texts in multi-turn conversations is low.

Disclosure of Invention

The embodiment of the application provides an intention identification method, an intention identification device, electronic equipment and a readable storage medium, and aims to solve the problem that in the prior art, the reliability of intention identification of texts in multiple rounds of conversations is low because the intention identification mode of a single round of conversations is adopted to identify the intentions of the texts in multiple rounds of conversations.

To solve the above problems, the present application is implemented as follows:

In a first aspect, an embodiment of the present application provides an intention identification method, including:

acquiring a first text of a plurality of turns of conversations;

acquiring a first intention which is identified through a trained first model and corresponds to the first text;

performing semantic similarity calculation on the first keyword and the first intention under the condition that the first text contains a preset first keyword to obtain a first similarity value;

and determining a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold value.

In a second aspect, an embodiment of the present application further provides an intention identifying apparatus, including:

the first acquisition module is used for acquiring first texts of multiple turns of conversations;

the second acquisition module is used for acquiring a first intention which is identified by the trained first model and corresponds to the first text;

the calculation module is used for performing semantic similarity calculation on the first keyword and the first intention under the condition that the first text contains a preset first keyword to obtain a first similarity value;

and the first determining module is used for determining the target intention corresponding to the first text according to the comparison result between the first similarity value and the first threshold value.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor; wherein the processor is configured to read the program in the memory to implement the steps of the method according to the first aspect.

In a fourth aspect, embodiments of the present application further provide a readable storage medium for storing a program, where the program, when executed by a processor, implements the steps in the method according to the foregoing first aspect.

In the embodiment of the application, after a first intention corresponding to a first text is identified through a first model, semantic similarity calculation is performed on the keyword contained in the first text and the first intention to obtain a first similarity value. Based on a comparison result between the first similarity value and a first threshold, it is judged whether the first intention was obtained directly from the first text alone or was obtained by combining the first text with context information of the multiple rounds of conversations, and a target intention corresponding to the first text is then determined according to the judgment result. In this way, the target intention corresponding to the first text can be combined with the context information of the multiple rounds of conversations, so that the accuracy and reliability of intention recognition of the text in the multiple rounds of conversations can be improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the description of the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from these drawings without inventive labor.

FIG. 1 is a first schematic flowchart of an intention recognition method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of training of a first model provided in an embodiment of the present application;

FIG. 3 is a second schematic flowchart of an intention recognition method provided in an embodiment of the present application;

FIG. 4 is a third schematic flowchart of an intention recognition method provided in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a first model provided in an embodiment of the present application;

FIG. 6 is a fourth schematic flowchart of an intention recognition method provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an intention recognition device provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

The terms "first," "second," and the like in the embodiments of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Further, as used herein, "and/or" means at least one of the connected objects; e.g., "A and/or B and/or C" covers 7 cases: A alone, B alone, C alone, A and B together, B and C together, A and C together, and A, B, and C together.

In the embodiment of the present application, the terminal may conduct multiple rounds of conversations with the user. In practical applications, the user may converse with the terminal by inputting text or by inputting voice; this may be determined according to the actual situation and is not limited in the embodiment of the present application.

The terminal may communicate with the electronic device. After acquiring current dialog data (which may be text or voice) of a user, the terminal may send the dialog data or text data corresponding to the dialog data to the electronic device, so as to identify a target intention corresponding to the dialog data through the electronic device, and determine reply content of the dialog data according to the target intention. After determining the reply content of the dialogue data, the electronic equipment sends the reply content to the terminal so that the terminal can output the reply content in the form of voice or text.

In practical application, the terminal can be a mobile phone, a computer, a robot and the like; the electronic device may be a cloud server or the like.

The intention identifying method provided in the embodiment of the present application is explained below. The intention identifying method of the embodiment of the application can be executed by the electronic device.

Referring to fig. 1, an intention identification method of an embodiment of the present application may include the steps of:

Step 101, acquiring a first text in a plurality of turns of conversation.

In a specific implementation, the first text may be the text corresponding to the dialogue data of the user in the i-th turn of the multi-turn conversation, where i is a positive integer. When the dialogue data of the user is voice, the text corresponding to the dialogue data is the text obtained by converting the voice input by the user; when the dialogue data of the user is text, the text corresponding to the dialogue data is the text input by the user.

Step 102, acquiring a first intention which is identified through the trained first model and corresponds to the first text.

The first model may be any model that can be used to identify text intention, such as: a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), a bidirectional Long Short-Term Memory network combined with an attention mechanism (BiLSTM + attention), a bidirectional attention neural network (BERT), and so forth.

In an actual scenario, users have many types of intentions, but there are few samples under any one intention, that is, little labeled data. To address this small-sample problem, the first model may optionally be an induction network based on few-shot learning, so that even with little labeled data the first model can identify the intention of the user accurately and quickly, which suits vertical fields where professional data is scarce.

Step 103, performing semantic similarity calculation on the first keyword and the first intention under the condition that the first text contains a preset first keyword to obtain a first similarity value.

In the embodiment of the application, the electronic device may be preset with P keywords, where the P keywords are used to detect whether the text intention identified by the first model is combined with context information of multiple rounds of conversations. When P is greater than 1, the P keywords may correspond to at least one level. When the P keywords correspond to at least two levels, the at least two levels may have a subordination relationship; for example, if the P keywords correspond to a first level and a second level, one keyword of the first level may include at least one keyword of the second level. It is understood that the first keyword may be any keyword among the P keywords.

After acquiring the first intention predicted by the first model for the first text, the electronic device may detect whether the first text includes any of the P keywords, so as to determine the next step according to the detection result. In implementation, if the first text includes one of the P keywords, step 103 may be executed; if the first text includes none of the P keywords, the first intention may be directly determined as the target intention corresponding to the first text.
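As a non-limiting illustration, the keyword detection described above may be sketched as follows. The keyword table, its level numbers, and the function names `find_keyword` and `same_level` are assumptions introduced here for illustration; the application does not prescribe a data structure. The three keywords are placed on one level only because the Table 1 example later treats them as interchangeable.

```python
from typing import Dict, List, Optional

# Hypothetical preset keywords mapped to a level number (assumption:
# all three example keywords share one level).
PRESET_KEYWORDS: Dict[str, int] = {
    "mobile banking": 1,
    "popular version": 1,
    "professional version": 1,
}


def find_keyword(text: str, keywords: Dict[str, int] = PRESET_KEYWORDS) -> Optional[str]:
    """Return the longest preset keyword occurring in the text, or None."""
    lowered = text.lower()
    hits = [kw for kw in keywords if kw in lowered]
    return max(hits, key=len) if hits else None


def same_level(keyword: str, keywords: Dict[str, int] = PRESET_KEYWORDS) -> List[str]:
    """Return all preset keywords whose level equals that of the given keyword."""
    level = keywords[keyword]
    return sorted(kw for kw, lv in keywords.items() if lv == level)
```

If `find_keyword` returns `None`, the first intention would be taken directly as the target intention; otherwise step 103 would follow.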

Optionally, the electronic device may further detect whether the first text is the initial text of the multiple rounds of dialog, so as to determine the next step by combining that detection result with the detection result of whether the first text includes any of the P keywords. In specific implementation, the two detection actions may be executed simultaneously or sequentially, which is not limited in this embodiment of the present application.

Detection result one: the first text contains a preset first keyword, and the first text is not the initial text of the multi-turn conversation.

In this case, since the first intention predicted by the first model for the first text may not be combined with the context information of the multiple rounds of conversations, the electronic device may perform semantic similarity calculation on the first keyword and the first intention to obtain a first similarity value, so as to determine from the first similarity value whether the first intention is combined with the context information of the multiple rounds of conversations.

In specific implementation, the electronic device may perform semantic similarity calculation on the first keyword and the first intention by using a cosine distance and the like, and the semantic similarity calculation mode is not limited in the embodiment of the present application.
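A minimal sketch of a cosine-based similarity between a keyword and an intention is given below. It assumes, purely for illustration, that both are represented as bag-of-words count vectors; the application does not specify the vector representation, and a practical system would more likely use learned embeddings. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Assumed representation: lower-cased whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine_similarity(keyword: str, intention: str) -> float:
    """Cosine similarity between the count vectors of two strings."""
    va, vb = bag_of_words(keyword), bag_of_words(intention)
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this assumed representation, an intention that is identical to the keyword yields a value close to 1, while an intention that carries additional words beyond the keyword yields a lower value.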

Detection result two: the first text does not contain any keyword of the P keywords, or the first text is the initial text of the multi-turn dialog.

In this case, optionally, the electronic device may determine the first intention as the target intention corresponding to the first text.

When the detection result is that the first text is the initial text of the multiple rounds of conversations, the conversations have not yet generated any context information, and the intention of the initial text is the current intention of the multiple rounds of conversations, so the electronic device can directly determine the first intention as the target intention corresponding to the first text.

When the first text does not contain any keyword of the P keywords, the user is very likely starting a new topic, that is, the first text and the preceding i-1 texts of the multiple rounds of conversations do not discuss the same topic. The newly started topic has not yet generated context information, and the intention of the first text is the current intention of that topic, so the electronic device can directly determine the first intention as the target intention corresponding to the first text.

Therefore, the target intention determined in the above manner is the real intention of the user for expressing the first text, so that the accuracy of intention identification of the first text can be improved, and the reliability of intention identification of texts in multiple rounds of conversations is improved.
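The two detection results can be illustrated with the following sketch. The function name, the 1-based `turn_index` convention, and the substring test used for keyword matching are assumptions made here for illustration.

```python
from typing import Iterable


def needs_similarity_check(first_text: str, turn_index: int, keywords: Iterable[str]) -> bool:
    """Return True for detection result one, False for detection result two.

    turn_index is assumed to be 1 for the initial text of the conversation.
    """
    contains_keyword = any(kw in first_text.lower() for kw in keywords)
    is_initial_text = turn_index == 1
    # Result one: a keyword is present and the text is not the initial text.
    # Result two: no keyword is present, or the text is the initial text.
    return contains_keyword and not is_initial_text
```

When the function returns `False`, the first intention would be used directly as the target intention; when it returns `True`, the similarity calculation of step 103 would be performed.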

Step 104, determining a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold value.

The result of the comparison between the first similarity value and the first threshold value may be used to characterize whether the first intention coincides with the user's true intention.

Specifically, when the first similarity value is smaller than the first threshold, the first intention is regarded as consistent with the real intention of the user, for the following reason: the semantic similarity between the first keyword and the first intention is low, which indicates that the first intention most likely includes context information of the multiple turns of conversations in addition to the semantic information of the first keyword. The electronic device may therefore determine that the first intention corresponds to the real intention of the user.

When the first similarity value is greater than or equal to the first threshold, the first intention is regarded as inconsistent with the real intention of the user, for the following reason: the semantic similarity between the first keyword and the first intention is high, which indicates that the first intention most likely contains only the semantic information of the first keyword and no context information of the multiple turns of conversations. The electronic device may therefore determine that the first intention does not correspond to the real intention of the user.

Therefore, the electronic device may determine the target intention of the first text, i.e. the real intention of the user after combining the context information, according to the comparison result between the first similarity value and the first threshold. It is to be understood that the target intent may or may not be the first intent.
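The comparison in step 104 may be sketched as below. The enum and function names are hypothetical, and no particular value of the first threshold is given in the application; any value passed in is an assumption of the caller.

```python
from enum import Enum


class Decision(Enum):
    KEEP_FIRST_INTENTION = "keep_first_intention"  # first intention is the target intention
    REDETERMINE = "redetermine"                    # target intention must be determined again


def compare_with_threshold(first_similarity: float, first_threshold: float) -> Decision:
    """Map the comparison result to the next action.

    A similarity below the threshold suggests the first intention already
    carries context beyond the keyword; otherwise it likely only echoes it.
    """
    if first_similarity < first_threshold:
        return Decision.KEEP_FIRST_INTENTION
    return Decision.REDETERMINE
```

Note that a similarity exactly equal to the threshold falls on the "redetermine" side, matching the "greater than or equal to" condition in the text.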

For ease of understanding, the following is illustrated in connection with Table 1:

The multi-turn dialog in Table 1 includes 4 texts, listed from top to bottom in order of acquisition time, earliest first. The single-turn user intention may be understood as the aforementioned first intention, and the multi-turn dialog intention after combining the context may be understood as the aforementioned target intention.

The keywords preset by the electronic device are assumed to include: popular version, professional version, mobile banking.

Table 1: multi-turn conversation for purchasing financial products

As can be seen from Table 1, for the first question of the user, the single-turn user intention is the target intention.

For the second question of the user, the single-turn user intention predicted by the first model is 'popular version', which is identical to the keyword 'popular version' contained in the question. The similarity is therefore very high, and the single-turn user intention predicted by the first model differs from the real intention of the user.

For the third question of the user, the single-turn user intention predicted by the first model is 'how to buy a financial product with mobile banking', whose similarity to the keyword 'mobile banking' contained in the question is low. This is because the predicted single-turn user intention covers 'buying a financial product' in addition to 'mobile banking'; the intention is complete and reflects the real intention of the user in asking the question.

According to the intention identification method, after a first intention corresponding to a first text is identified through a first model, semantic similarity calculation is carried out on keywords contained in the first text and the first intention to obtain a first similarity value, whether the first intention is obtained directly based on the first text or obtained by combining context information of multiple rounds of conversations besides the first text is judged based on a comparison result of the first similarity value and a first threshold value, and then a target intention corresponding to the first text is determined according to a judgment result. In this way, the target intention corresponding to the first text can be combined with the context information of the multiple rounds of conversations, so that the accuracy of intention recognition of the text in the multiple rounds of conversations can be improved, and the reliability of intention recognition of the text in the multiple rounds of conversations can be improved.

The following describes specific implementations of step 104:

Embodiment one

Optionally, the determining a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold includes:

determining the first intention as a target intention corresponding to the first text if the first similarity value is less than the first threshold.

In this embodiment, the first similarity value is smaller than the first threshold, which indicates that the first intention is consistent with the real intention of the user, so the electronic device may directly determine the first intention as the target intention.

As for the third question in Table 1, since the single-turn user intention predicted by the first model is 'how to buy a financial product with mobile banking', whose similarity to the keyword 'mobile banking' contained in the question is low, the electronic device may directly determine the single-turn user intention as the multi-turn dialog intention after combining the context.

Therefore, through the first embodiment, the target intention corresponding to the first text can be combined with the context information of the multiple rounds of conversations, so that the accuracy of intention recognition of the text in the multiple rounds of conversations can be improved, and the reliability of intention recognition of the text in the multiple rounds of conversations can be improved.

Embodiment two

Optionally, the determining a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold includes:

when the first similarity value is greater than or equal to the first threshold, detecting whether a second text of the multiple rounds of conversations contains a preset second keyword, wherein the acquisition time of the second text is earlier than that of the first text, and the second keyword and the first keyword are preset keywords of the same level;

when the second text contains the second keyword, replacing the second keyword in the second text with the first keyword to obtain a first target text;

acquiring a second intention which is identified through the first model and corresponds to the first target text;

determining the second intention as the target intention corresponding to the first text.

In embodiment two, the first similarity value is greater than or equal to the first threshold, indicating that the first intention is not in accordance with the user's true intention. Accordingly, the electronic device may forgo determining the first intent as the target intent and re-determine the target intent.

In this embodiment, the electronic device may determine, in conjunction with the first text and the second text of the multiple turns of conversations, a target intent corresponding to the first text. The second text may be regarded as a history text of the multi-turn dialog with respect to the first text, and optionally, the second text may be a text corresponding to the dialog data of the user in the i-1 th turn of the multi-turn dialog, but is not limited thereto.

In specific implementation, the electronic device may detect whether the second text contains a preset second keyword. It is to be understood that the second keyword may be any keyword among the P keywords that has the same level as the first keyword.

In a case that the second text includes the second keyword, the electronic device may replace the second keyword of the second text with the first keyword to obtain a first target text, where it is understood that the first target text includes the first keyword and other information of the second text except for the second keyword. Then, the intention of the first target text is identified through the first model, and the intention is determined as the target intention of the first text. The target intention corresponding to the first text can be combined with the context information of the multiple turns of conversations, so that the accuracy of intention recognition of the text in the multiple turns of conversations can be improved, and the reliability of intention recognition of the text in the multiple turns of conversations can be improved.
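The keyword replacement that produces the first target text may be sketched as follows. The function name is hypothetical, and the example sentence in the usage note is invented for illustration because the wording of the questions in Table 1 is not reproduced here.

```python
from typing import Iterable, Optional


def build_first_target_text(
    second_text: str,
    first_keyword: str,
    same_level_keywords: Iterable[str],
) -> Optional[str]:
    """Replace a same-level keyword found in the earlier text with the first keyword.

    Returns None when the second text contains none of the candidate
    keywords, in which case this branch of embodiment two does not apply.
    """
    for second_keyword in same_level_keywords:
        if second_keyword in second_text:
            # Everything except the second keyword is kept from the second text.
            return second_text.replace(second_keyword, first_keyword)
    return None
```

For instance, with the hypothetical second text "how to buy a financial product with mobile banking" and the first keyword "professional version", the result keeps the earlier sentence and swaps only the keyword. The resulting text would then be passed to the first model to obtain the second intention.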

For ease of understanding, the fourth question of Table 1 is taken as an example:

The similarity value between the intention 'professional version' identified by the first model for the fourth question and the first keyword 'professional version' contained in the fourth question is greater than the first threshold. The electronic device can detect that the third question contains the second keyword 'mobile banking', so the second keyword 'mobile banking' in the third question can be replaced by the first keyword 'professional version' to obtain the first target text 'how to buy a financial product with the professional version'. The intention of the first target text identified by the first model is 'how to buy a financial product with the professional version', which matches the real intention of the user.

Further, after detecting whether the second text of the multiple rounds of conversations contains the preset second keyword, the method further includes:

when the second text does not contain the second keyword, splicing the first text and the second text to obtain a second target text;

acquiring a third intention corresponding to the second target text and identified by the first model;

determining that the third intent is a target intent corresponding to the first text.

It is to be understood that the second target text includes the first text and the second text.

For ease of understanding, the second question of Table 1 is taken as an example:

The similarity value between the intention 'popular version' identified by the first model for the second question and the first keyword 'popular version' contained in the question is greater than the first threshold. The electronic device can detect that the first question does not contain a keyword, so the first question and the second question can be spliced to obtain the second target text 'how to purchase financing with the popular version'. The intention of the second target text identified by the first model is 'how to purchase a financial product with the popular version', which matches the real intention of the user.
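The splicing fallback may be sketched as below. The order of concatenation (earlier text first) and the single-space separator are assumptions, since the application only states that the two texts are spliced; `recognize` is a hypothetical callable standing in for the trained first model.

```python
from typing import Callable


def third_intention(
    first_text: str,
    second_text: str,
    recognize: Callable[[str], str],
) -> str:
    """Splice the earlier and current texts and recognize the result.

    Assumption: the earlier (second) text is placed before the current
    (first) text so that the spliced text follows the conversation order.
    """
    second_target_text = f"{second_text} {first_text}".strip()
    return recognize(second_target_text)
```

The value returned by `recognize` for the spliced text would be taken as the target intention of the first text.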

Therefore, through the second embodiment, the target intention corresponding to the first text can be combined with the context information of the multiple rounds of conversations, so that the accuracy of intention recognition of the text in the multiple rounds of conversations can be improved, and the reliability of intention recognition of the text in the multiple rounds of conversations can be improved.

The following explains the principle of recognizing text intention by the first model of the embodiment of the present application:

Optionally, the first model comprises a coding layer, an induction module, a relation module and an intention classification result output module;

wherein the coding layer is used for: acquiring a feature vector corresponding to the first text;

the induction module is connected with the coding layer and is used for: acquiring the feature vector from the coding layer, determining at least two class vectors according to the feature vector, and acquiring a target class vector according to the at least two class vectors;

the relation module is respectively connected with the coding layer and the induction module and is used for: determining a first intention corresponding to the first text according to the feature vector acquired from the coding layer and the target class vector acquired from the induction module;

the intention classification result output module is connected with the relation module and is used for acquiring the first intention from the relation module and outputting the first intention.

In this alternative embodiment, the first model is an induction model. It should be noted that the induction model may be trained in the same manner as in the related art. Alternatively, the idea of meta-learning may be employed to construct multiple episodes, each of which may be considered a task. In each training process, for each episode, C categories are randomly selected from the training data, K samples are randomly selected from each of the C categories to serve as a support set, and Q samples are randomly selected from the remaining samples to serve as a query set; each time the model is trained on the support set, the loss is computed on the query set. For each episode in the training stage, parameters are updated according to the loss through the back propagation algorithm and the AdamW optimization algorithm, so that the model learns to learn. Training of the whole network is thereby completed, and the network with the best effect is then stored.
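The episode construction described above may be sketched as follows. The function name and the dictionary layout of the training data are assumptions, as is the reading that the query set is drawn from the remaining samples of the same C selected categories.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (text, intention label)


def sample_episode(
    data: Dict[str, List[str]],
    c: int,
    k: int,
    q: int,
    rng: random.Random,
) -> Tuple[List[Sample], List[Sample]]:
    """Build one C-way K-shot episode: a support set and a query set."""
    classes = rng.sample(sorted(data), c)  # randomly select C categories
    support: List[Sample] = []
    remaining: List[Sample] = []
    for label in classes:
        texts = list(data[label])
        rng.shuffle(texts)
        support.extend((t, label) for t in texts[:k])    # K samples per category
        remaining.extend((t, label) for t in texts[k:])  # candidates for the query set
    query = rng.sample(remaining, q)  # Q samples from the remaining samples
    return support, query
```

Each call yields one task; a training loop would compute the loss on `query` after processing `support` and update the parameters accordingly.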

The induction model of the embodiment of the present application differs from the existing induction model mainly in that its induction module adopts a processing mode of category feature fusion: the induction module can obtain W class vectors in different ways and obtain a target class vector for determining the text intention by fusing the W class vectors, where W is an integer greater than 1. In this way, noise interference caused by different expressions of the same intention is avoided, and the semantic information of the different expressions is taken into account, so that a class vector representation containing more useful semantic information is obtained, the accuracy of intention identification is improved, and a more accurate single-turn user intention is obtained.

Optionally, the at least two class vectors may include, but are not limited to, a first class vector and a second class vector;

the first class vector is obtained by performing nonlinear conversion on the feature vectors and performing weighted summation on the feature vectors after the nonlinear conversion;

the second class vector is obtained by averaging the feature vectors.

The concrete description is as follows:

For the first class vector: first, the sample representation matrix (namely the feature vector) of the j-th support-set sample of the i-th category is nonlinearly transformed by formula (1) using the squash activation function,

where s denotes the support set, W_s denotes the weight applied to the feature vector, and b_s denotes the bias.

Then, the transformed feature vectors are weighted and summed by formula (2) to obtain the first class vector,

where d_i = softmax(b_i), and b_i holds the logits of the coupling coefficients of the i-th class.

For the second class vector: the K samples in each class are averaged by formula (3) to obtain the second class vector,

where N_i is the total number of samples in category i.

After obtaining the first class vector and the second class vector, the induction module may fuse them to obtain the target class vector. Optionally, the fused class-level feature may be calculated by formula (4),

where ω denotes the weight of a class vector, with ω_1 + ω_2 = 1. In practical applications, a class vector that represents rich semantic information well may be given a larger weight, and otherwise a smaller weight; the weights may be determined according to the actual situation, which is not limited in this embodiment of the present application.

The fused class vector may then be compressed by the squash activation function through formula (5), so that the class vector is nonlinearly mapped to the interval [0, 1], giving the target class vector c_i.

It can be understood that, in practical applications, the dynamic routing value b_ij may be updated by formula (6), and the weight w may be updated through formula (6) and formula (1), so that an optimal class vector is obtained after multiple iterations.
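Under the symbol definitions above, formulas (1) to (6) can plausibly be written as follows. This is a reconstruction that assumes the standard dynamic-routing form of an induction module; the symbols $e^{s}_{ij}$, $\hat{e}^{s}_{ij}$, $\hat{c}^{(1)}_{i}$, $\hat{c}^{(2)}_{i}$ and $\hat{c}_{i}$ are notation assumed here. Formula (3) averages the untransformed feature vectors, following the wording "averaging the feature vectors", which in turn assumes that $W_s$ preserves the vector dimension.

```latex
\begin{align}
\hat{e}^{s}_{ij} &= \mathrm{squash}\left(W_s e^{s}_{ij} + b_s\right) \tag{1}\\
\hat{c}^{(1)}_{i} &= \sum_{j} d_{ij}\,\hat{e}^{s}_{ij}, \qquad d_i = \mathrm{softmax}(b_i) \tag{2}\\
\hat{c}^{(2)}_{i} &= \frac{1}{N_i}\sum_{j=1}^{N_i} e^{s}_{ij} \tag{3}\\
\hat{c}_{i} &= \omega_1\,\hat{c}^{(1)}_{i} + \omega_2\,\hat{c}^{(2)}_{i}, \qquad \omega_1 + \omega_2 = 1 \tag{4}\\
c_{i} &= \mathrm{squash}(\hat{c}_{i}) = \frac{\lVert \hat{c}_{i}\rVert^{2}}{1 + \lVert \hat{c}_{i}\rVert^{2}}\,\frac{\hat{c}_{i}}{\lVert \hat{c}_{i}\rVert} \tag{5}\\
b_{ij} &\leftarrow b_{ij} + \hat{e}^{s}_{ij}\cdot c_{i} \tag{6}
\end{align}
```

With this form, the squash in (5) keeps the direction of the fused vector and maps its length into [0, 1), and (6) raises the routing logit of a support sample whose transformed vector agrees with the current class vector.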

The other modules of the first model may be implemented on the same principle as in the related art, or in the following manner:

In a specific implementation, before identifying the first intention corresponding to the first text with the first model, the electronic device may first preprocess the first text, for example by Chinese word segmentation, special symbol removal, and stop-word removal, and then input the processed text into the coding layer of the first model.

The coding layer may include a Bidirectional Gated Recurrent Unit (BiGRU) + Attention network. The coding layer first converts the processed text into numerical vectors by embedding, and then extracts high-level features of the numerical vectors, namely the feature vectors, through the BiGRU + Attention network. Specifically, in the BiGRU + Attention network, the BiGRU network obtains the state at time t in each direction, the states of the two directions are concatenated to give the hidden-layer state of the BiGRU network, and the hidden-layer outputs at all time steps are passed through the attention layer to obtain the deep semantic features of the text, namely the feature vectors.
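The concatenation and attention step can be illustrated with a small sketch. It assumes the forward and backward GRU states have already been computed elsewhere, and the scoring function (a dot product between a `context` vector and the tanh of each hidden state) is an assumed form, since the text does not specify the attention computation. The name `attention_pool` is hypothetical.

```python
import math


def attention_pool(forward_states, backward_states, context):
    """Concatenate both directions per time step, then attention-pool.

    forward_states / backward_states: one vector (list of floats) per time step.
    context: assumed attention parameter with the concatenated dimension.
    """
    # Hidden-layer state at each time step: the two directions spliced together.
    hidden = [f + b for f, b in zip(forward_states, backward_states)]
    # Assumed scoring: context . tanh(h_t)
    scores = [sum(c * math.tanh(x) for c, x in zip(context, h_t)) for h_t in hidden]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]                 # attention weights
    dim = len(hidden[0])
    pooled = [sum(a * h_t[d] for a, h_t in zip(alphas, hidden)) for d in range(dim)]
    return pooled, alphas
```

The returned `pooled` vector plays the role of the feature vector passed on to the induction module.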

In the relation module, a neural tensor layer may be used to model the relationship between the class vector c_i and the semantic feature vector e^q of a sample in the query set (where q denotes the query set). The tensor layer outputs a relation vector

in which M^k ∈ R^(2u×2u), k ∈ [1, ..., h]. A final relation score is then obtained through a sigmoid function, and the first intention is determined accordingly.
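A minimal sketch of such a relation module is given below. It assumes that the k-th entry of the relation vector is a bilinear form of c_i and e^q through M^k followed by a ReLU, and that the score is a sigmoid over a linear layer; the nonlinearity, the output-layer parameters `weights` and `bias`, and the function name `relation_score` are assumptions not stated in the text.

```python
import math


def relation_score(c, e, tensors, weights, bias):
    """Neural tensor layer followed by a sigmoid.

    c, e: class vector and query feature vector of equal length.
    tensors: h square matrices M^k.
    weights, bias: hypothetical output-layer parameters.
    """
    v = []
    for m in tensors:
        bilinear = sum(c[a] * m[a][b] * e[b]
                       for a in range(len(c)) for b in range(len(e)))
        v.append(max(0.0, bilinear))        # ReLU, an assumed nonlinearity
    z = sum(w * x for w, x in zip(weights, v)) + bias
    return 1.0 / (1.0 + math.exp(-z))       # relation score in (0, 1)
```

The intention whose class vector gives the highest score for a query sample would then be taken as the recognized intention.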

It should be noted that, in the embodiments of the present application, various optional implementations that are described in the embodiments may be implemented in combination with each other or may be implemented separately without conflicting with each other, and the embodiments of the present application are not limited to this.

In the embodiment of the application, for multi-turn dialogs, an induction network based on few-shot learning is provided to identify the intention of the current sentence. A category feature fusion method is innovatively introduced into the induction module of the induction network to fuse the class vector obtained by the dynamic routing algorithm with the class vector obtained by averaging, which improves the accuracy of intention recognition for the current sentence. Meanwhile, in combination with preset scene keywords of different levels (or grades), an intention inheritance method based on semantic matching is provided to obtain the user intention combined with the dialog history. Where professional data in the vertical field is scarce, this effectively improves the accuracy and speed of intention recognition in multi-turn dialogs, which helps the human-machine dialog proceed smoothly and provides a better user experience.

The following describes the embodiments of the present application in detail with reference to fig. 2 and 3.

FIG. 2 illustrates the process of training the induction network, which mainly involves corpus acquisition, data preprocessing, model construction, model training, and storage.

Fig. 3 shows an intention recognition process for performing multiple rounds of conversations in practical applications by using a trained network model, which mainly includes preset keywords, collection of conversation data, data preprocessing, AI network recognition, keyword-intention semantic matching, and user intention recognition results.

The concrete description is as follows:

1. Corpus acquisition.

The corpus used to train the model may be public data in the vertical field. In the model testing stage, human-machine dialog content from the actual scene is acquired mainly through a terminal, such as a computer, a mobile phone, or a robot display screen.

2. Keyword construction.

According to different scenes in the vertical field, keywords of different levels, such as first-level keywords and second-level keywords, are preset, wherein zero or more second-level keywords may be preset under the first-level keywords.
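One possible way to represent such preset keywords is sketched below. The dictionary layout, the helper names `keyword_levels` and `find_keywords`, and every keyword other than "mass edition" are hypothetical and serve only as illustration.

```python
# First-level keywords map to their (zero or more) second-level keywords.
SCENE_KEYWORDS = {
    "mass edition": [],                            # no second-level keywords
    "professional edition": ["fund", "stock"],     # hypothetical entries
}


def keyword_levels(hierarchy):
    """Flatten the hierarchy into a keyword -> level lookup."""
    levels = {}
    for first, seconds in hierarchy.items():
        levels[first] = 1
        for second in seconds:
            levels[second] = 2
    return levels


def find_keywords(text, levels):
    """Return (keyword, level) pairs for every preset keyword found in the text."""
    return [(kw, lvl) for kw, lvl in levels.items() if kw in text]
```

Keeping the level next to each keyword makes it straightforward to check later whether two sentences contain keywords of the same level.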

3. Data preprocessing.

For public data, operations such as Chinese word segmentation and stop-word removal need to be performed on the text. For multi-turn dialogs acquired in an actual scene, when part of the dialog content is missing, the whole multi-turn dialog sample with the missing content needs to be deleted so that the user intention recognition result is not affected. In addition, the user's dialog text in each dialog is extracted and labeled with its intention, and operations such as Chinese word segmentation, special symbol removal, and stop-word removal are then performed on the dialog text.
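The preprocessing described above might look like the following sketch. The stop-word list and the function names are hypothetical, and whitespace splitting stands in for a real Chinese word segmenter, which could be passed in through the `tokenize` parameter.

```python
import re

STOP_WORDS = {"the", "a", "to", "in"}   # hypothetical stop-word list


def preprocess(text, tokenize=str.split):
    """Remove special symbols, tokenize, and drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", text)      # special symbol removal
    tokens = tokenize(cleaned.lower())           # word segmentation (stand-in)
    return [t for t in tokens if t not in STOP_WORDS]


def drop_incomplete_dialogs(dialogs):
    """Delete every multi-turn sample in which any turn is missing or empty."""
    return [d for d in dialogs if d and all(turn and turn.strip() for turn in d)]
```

Deleting the whole sample rather than only the missing turn keeps the remaining dialogs internally consistent, which matters once intentions are inherited across turns.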

4. Algorithm modeling.

Because users in an actual scene have many types of intentions but few samples under each intention, that is, little labeled data, an induction network based on few-shot learning is adopted to address the small-sample problem. The preprocessed public text data is embedded to digitize the text, high-level features of the samples are extracted through a BiGRU + Attention network, and the features of the dialog text are input into the induction module of the induction network. A category feature fusion method is innovatively provided in the induction module, which avoids noise interference from different expressions of the same intention while taking the semantic information of those expressions into account, so that a class vector representation containing more useful semantic information is obtained, the accuracy of intention recognition is improved, and a more accurate single-turn user intention is obtained.

5. Multi-turn dialog intention recognition in a real-world scene.

A support set is constructed from user data in the actual scene to facilitate subsequent intention recognition. For a question input by the user in the actual scene, the model trained in the algorithm modeling step is used to obtain the current user intention. Because this intention does not yet take context information into account, the preset scene keywords are brought in: the keywords contained in the sentence are semantically matched against the current user intention using cosine distance, and a threshold is set to judge whether the keyword in the current sentence needs to replace a keyword in the preceding text or be combined with the preceding text by inheritance. The result is the intention combined with the context, namely the target intention.
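The cosine-based matching can be sketched as below, assuming the keyword and the intention have already been mapped to vectors by some embedding step that is not shown. The function names and the idea of passing the threshold as a plain float are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_matches_intent(keyword_vec, intent_vec, threshold):
    """True when the keyword is semantically close enough to the intention."""
    return cosine_similarity(keyword_vec, intent_vec) >= threshold
```

A high similarity suggests the sentence says little beyond the keyword itself, which is the case in which the earlier context has to be inherited.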

As shown in fig. 4, the intention identifying method of the embodiment of the present application may include the steps of:

S1, acquiring a text corpus.

S2, preprocessing the data.

Specifically, preprocessing such as Chinese word segmentation, special symbol removal, and stop-word removal is performed on the corpus.

S3, constructing a network model.

S4, training and storing the model.

S5, setting scene keywords.

S6, acquiring and processing user text data in an actual scene.

S7, acquiring the user intention of the single-turn dialog without combining the dialog history.

S8, combining the scene keywords to obtain the target intention of the user.

Specifically, depending on whether the current user input contains a scene keyword, a target intention combined with the dialog history information is acquired based on semantic matching.

The concrete description is as follows:

In step S1, a public intention recognition data set associated with the vertical domain may be obtained for subsequent data preprocessing.

In step S2, in order to improve the final recognition effect, the public data set may be preprocessed by Chinese word segmentation, special symbol removal, stop-word removal, and the like.

In step S3, considering that in an actual scene users have many types of intentions in each dialog but few samples under each intention, that is, little labeled data, the text preprocessed in step S2 is modeled with an induction network to address the small-sample problem; the model structure is shown in fig. 5. In the encoding layer, the processed text is digitized by embedding, and high-level features of the samples are then extracted through a BiGRU + Attention network. These features are used as the input of the induction module, a class vector is obtained through the induction module, and a relation score between the feature vector and the class vector of each sample in a support set is calculated through the relation module, thereby achieving classification.

In the encoding layer, for the numerical vector of any text, a bidirectional GRU (BiGRU) network obtains the state at time t in each direction; the states of the two directions are concatenated to give the hidden-layer state of the BiGRU network, and the hidden-layer outputs at all time steps are passed through an attention layer to obtain the deep semantic features of the text.

In the induction module, a weighted fusion mode is innovatively designed, which avoids noise interference from different expressions of the same intention while taking the semantic information of those expressions into account, so that a class vector representation containing more useful semantic information is obtained and the accuracy of intention recognition is improved. The details are as follows:

First, the sample representation matrix of the j-th support-set sample of the i-th category is nonlinearly transformed using the squash activation function (where s denotes the support set). The transformed sample representations are then weighted and summed to obtain a class vector, where d_i = softmax(b_i) and b_i holds the logits of the coupling coefficients of the i-th class.

To improve the model effect, the K samples in each class are also averaged to represent the class vector, where N_i is the total number of samples in category i.

The two class vectors are then fused into a class-level feature by a weighted sum, where ω denotes the weight of a class feature and ω_1 + ω_2 = 1. A class vector that represents rich semantic information well is given a larger weight; otherwise it is given a smaller weight.

The fused class vector is compressed by the squash activation function so that it is nonlinearly mapped to the interval [0, 1], giving a new class vector c_i.

Finally, the dynamic routing value b_ij and the weight w are updated, and an optimal class vector is obtained after multiple iterations.
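The fusion inside the induction module can be sketched for a single category as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the squash form is the standard one from dynamic routing, the transformation weight is assumed square so that the averaged vector has the same dimension as the routed one, the routing update adds the dot product of each transformed sample with the current class vector, and the names `squash`, `softmax` and `fused_class_vector` are hypothetical. Learning the weight and bias by back propagation is not shown.

```python
import math


def squash(v):
    """Keep the direction of v and map its length into [0, 1)."""
    n2 = sum(x * x for x in v)
    if n2 == 0.0:
        return [0.0] * len(v)
    scale = n2 / (1.0 + n2) / math.sqrt(n2)
    return [scale * x for x in v]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fused_class_vector(samples, weight, bias, omega1=0.5, iterations=3):
    """Fuse a dynamic-routing class vector with a mean class vector."""
    dim = len(samples[0])
    # Nonlinear transformation of each support sample: squash(W e + b).
    transformed = [
        squash([sum(weight[r][c] * e[c] for c in range(dim)) + bias[r]
                for r in range(dim)])
        for e in samples
    ]
    # Second class vector: plain average of the feature vectors.
    mean_vec = [sum(e[d] for e in samples) / len(samples) for d in range(dim)]
    logits = [0.0] * len(samples)            # dynamic routing values b_ij
    c_i = [0.0] * dim
    for _ in range(iterations):
        d = softmax(logits)                  # coupling coefficients
        routed = [sum(d[j] * transformed[j][k] for j in range(len(samples)))
                  for k in range(dim)]       # first class vector
        fused = [omega1 * r + (1.0 - omega1) * m for r, m in zip(routed, mean_vec)]
        c_i = squash(fused)
        # Samples that agree with the current class vector get larger logits.
        logits = [b + sum(x * y for x, y in zip(t, c_i))
                  for b, t in zip(logits, transformed)]
    return c_i
```

Setting `omega1` close to 1 leans on the routed vector, while a value close to 0 leans on the plain average, which mirrors the idea of giving the more informative class vector the larger weight.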

In the relation module, a neural tensor layer is used to model the relationship between the class vector c_i and the semantic feature vector e^q of a sample in the query set (where q denotes the query set). The tensor layer outputs a relation vector

in which M^k ∈ R^(2u×2u), k ∈ [1, ..., h], and a final relation score is then obtained through a sigmoid function.

In step S4, the constructed network model is trained and stored. Following the idea of meta-learning, a plurality of episodes are constructed, and each episode can be regarded as one task. In each training iteration, for each episode, C categories are randomly selected from the training data, K samples are randomly selected from the C categories to serve as a support set, and Q samples are randomly drawn from the remaining samples to serve as a query set; each time the model is trained on the support set, the loss is computed on the query set. For each episode in the training stage, the parameters are updated according to the loss through the back propagation algorithm and the AdamW optimization algorithm, so that the model learns how to learn. Training of the whole network is thereby completed, and the network with the best effect is stored.

In step S5, in order to make the finally identified intention combine with the history information and better conform to the real intention of the user in a specific scene, we manually construct keywords of different levels in combination with the business processes of different scenes, for example: in a certain scene, a first-level keyword and a second-level keyword are set, and zero or more second-level keywords are set under the first-level keyword.

In step S6, the user's dialog content is obtained in an actual application scenario. Any multi-turn dialog sample with missing dialog content needs to be deleted as a whole; for the remaining dialog samples, the user's dialog text in each dialog needs to be extracted and labeled with its intention.

In step S7, the obtained texts are grouped by category to construct a support set, and the text input online by the user is used as the query set. The query set and the support set are cleaned and their features are extracted according to steps S3 and S4, and the trained model outputs the intention of the text currently input online by the user, without combining the dialog history.

In step S8, because the current user intention does not take context information into account, the preset scene keywords need to be combined to determine the target intention in the multi-turn dialog, where the target intention refers to the intention combined with the history information. For the first sentence input by the user, the expression of the user intention is clear and complete, and the current intention is the target intention. When the user inputs a further dialog turn, the recognition flow is as shown in fig. 6, and the specific intention recognition method is as follows:

Step 601, acquiring a first text input by a user.

Step 602, determining an intent of the first text.

Step 603, judging whether the first text contains keywords.

If not, go to step 604; if so, go to step 605.

Step 604, determining the intention of the first text as the target intention.

Step 605, performing semantic similarity calculation on the keyword contained in the first text and the intention of the first text to obtain a first similarity value.

Step 606, determine whether the first similarity value is greater than or equal to the first threshold.

If not, the current user intention contains information beyond the semantic information of the keyword. For example, for the user input "How to buy financing in the mass edition?", the intention of the sentence is "buy financing in mass edition" and the keyword is "mass edition". The similarity between the keyword "mass edition" and the intention "buy financing in mass edition" is smaller than the set threshold, that is, the two are not similar, because the intention covers buying financing in addition to the mass edition. The intention is therefore already complete and is itself the intention combined with the context, namely the target intention, and step 604 is executed.

If yes, go to step 607.

Step 607, determining whether a second text, input by the user before the first text, contains a keyword of the same level as the keyword in the first text.

If yes, go to step 608; if not, go to step 609.

Step 608, replacing the keyword in the second text with the keyword in the first text to obtain a first target text, and determining the intention of the first target text as the target intention.

Step 609, splicing the first text and the second text to obtain a second target text, and determining the intention of the second target text as the target intention.
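Steps 601 to 609 can be summarized in the following sketch. The callables `recognize`, `find_keyword` and `similarity` are hypothetical stand-ins for the trained first model, the preset keyword lookup, and the semantic similarity calculation; splicing the earlier text in front of the current text is also an assumption, since the order of splicing is not specified.

```python
def resolve_target_intent(first_text, second_text, recognize, find_keyword,
                          similarity, threshold):
    """Return the target intention for first_text.

    recognize(text) -> intention (stands in for the trained first model)
    find_keyword(text) -> (keyword, level) or None
    similarity(keyword, intention) -> float
    second_text: the earlier user text, or None for the first sentence.
    """
    intent = recognize(first_text)                         # step 602
    found = find_keyword(first_text)                       # step 603
    if found is None or second_text is None:
        return intent                                      # step 604
    keyword, level = found
    if similarity(keyword, intent) < threshold:            # steps 605 and 606
        return intent                                      # intention already complete
    previous = find_keyword(second_text)                   # step 607
    if previous is not None and previous[1] == level:
        # Step 608: the current keyword takes the place of the earlier one.
        target_text = second_text.replace(previous[0], keyword)
    else:
        # Step 609: splice the earlier text and the current text.
        target_text = second_text + " " + first_text
    return recognize(target_text)
```

Passing the model and the similarity measure in as parameters keeps the control flow independent of how either is implemented.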

In the embodiment of the application: 1) a category feature fusion method is innovatively provided in the induction module of the network to fuse the class vector obtained by the dynamic routing algorithm with the class vector obtained by averaging, which avoids noise interference from different expressions of the same intention while taking the semantic information of those expressions into account, so that class features containing more useful semantic information are obtained and the accuracy of intention recognition is improved; 2) in intention recognition for multi-turn dialogs in an actual scene, keywords of different levels in a specific scene are semantically matched with the dialog intention, and the keyword of the previous sentence is replaced or the sentences are spliced according to the matching result to achieve semantic inheritance, thereby obtaining a target intention combined with the history information.

Compared with the existing intention recognition scheme, the intention recognition method and the intention recognition system have the advantages that under the condition that professional data in the vertical field are lacked, accuracy of intention recognition in multiple rounds of conversation is effectively improved, smooth human-computer conversation is facilitated, and better user experience is provided.

Referring to fig. 7, fig. 7 is a structural diagram of an intention recognition apparatus provided in an embodiment of the present application. As shown in fig. 7, the intention identifying apparatus 700 includes:

a first obtaining module 701, configured to obtain a first text of multiple turns of conversations;

a second obtaining module 702, configured to obtain a first intention, corresponding to the first text, identified by the trained first model;

a calculating module 703, configured to perform semantic similarity calculation on the first keyword and the first intention to obtain a first similarity value when the first text contains a preset first keyword;

a first determining module 704, configured to determine a target intention corresponding to the first text according to a comparison result between the first similarity value and a first threshold.

Optionally, the first determining module 704 is configured to:

Optionally, the first determining module 704 includes:

a detecting unit, configured to detect whether a second text of the multiple rounds of conversations includes a preset second keyword when the first similarity value is greater than or equal to the first threshold, where the second text is acquired before the first text, and the second keyword and the first keyword are preset keywords at the same level;

the replacing unit is used for controlling the first keywords to replace the second keywords of the second text under the condition that the second text contains the second keywords to obtain a first target text;

a first acquisition unit configured to acquire a second intention corresponding to the first target text recognized by the first model;

a first determination unit to determine that the second intention is a target intention corresponding to the first text.

Optionally, the first determining module 704 further includes:

the splicing unit is used for splicing the first text and the second text to obtain a second target text under the condition that the second text does not contain the second keyword;

a second acquisition unit configured to acquire a third intention corresponding to the second target text, which is recognized by the first model;

a second determination unit configured to determine that the third intention is a target intention corresponding to the first text.

Optionally, the intention recognition device further comprises:

a second determining module, configured to determine the first intention as a target intention corresponding to the first text if the first text does not include any keyword of the P keywords, or i is equal to 1.

the induction module is connected with the coding layer and is used for: acquiring the feature vectors from the coding layer, determining at least two class vectors according to the feature vectors, and acquiring target class vectors according to the at least two class vectors;

Optionally, the at least two class vectors include a first class vector and a second class vector;

the first type of vector is obtained by carrying out nonlinear conversion on the characteristic vector and carrying out weighted summation on the characteristic vector after the nonlinear conversion;

the second class of vectors is obtained by averaging the feature vectors.

The intention identifying apparatus 700 can implement each process of the method embodiment in fig. 2 in the embodiment of the present application and achieve the same beneficial effects, and is not described herein again to avoid repetition.

The embodiment of the application also provides the electronic equipment. Referring to fig. 8, an electronic device may include a processor 801, a memory 802, and a program 8021 stored on the memory 802 and executable on the processor 801. When the program 8021 is executed by the processor 801, any steps in the method embodiment corresponding to fig. 2 may be implemented and the same advantageous effects may be achieved, which are not described herein again.

Those skilled in the art will appreciate that all or part of the steps of the method according to the above embodiments may be implemented by hardware associated with program instructions, and the program may be stored in a readable medium. An embodiment of the present application further provides a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, any step in the method embodiment corresponding to fig. 3 may be implemented, and the same technical effect may be achieved, and in order to avoid repetition, details are not repeated here.

The storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

While the foregoing is directed to the preferred embodiment of the present application, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the principles of the disclosure, and it is intended that such changes and modifications be considered as within the scope of the disclosure.

Claims

1. An intent recognition method, characterized in that the method comprises:

acquiring a first text of a plurality of turns of conversations;

under the condition that the first text contains a preset first keyword, performing semantic similarity calculation on the first keyword and the first intention to obtain a first similarity value;

2. The method of claim 1, wherein determining the target intent corresponding to the first text based on the comparison between the first similarity value and a first threshold comprises:

3. The method of claim 1, wherein determining the target intent corresponding to the first text based on the comparison between the first similarity value and a first threshold comprises:

4. The method according to claim 3, wherein after detecting whether the second text of the multiple rounds of conversations contains a preset second keyword, the method further comprises:

5. The method of claim 1, wherein after obtaining the first intent corresponding to the first text identified by the first model, the method further comprises:

and determining the first intention as a target intention corresponding to the first text under the condition that the first text does not contain preset keywords.

6. The method of claim 1, wherein the first model comprises an encoding layer, a generalization module, a relationship module, and an intent classification result output module;

7. The method of claim 6, wherein the at least two class vectors comprise a first class vector and a second class vector;

the second class of vectors is obtained by averaging the feature vectors.

8. An intention recognition apparatus, comprising:

the first acquisition module is used for acquiring first texts of multiple rounds of conversations;

9. The intent recognition device of claim 8, wherein said first determining module comprises:

a first acquisition unit configured to acquire a second intention corresponding to the first target text, which is recognized by the first model;

a first determination unit configured to determine that the second intention is a target intention corresponding to the first text.

10. The intent recognition device of claim 9, wherein said first determining module further comprises:

11. The intent recognition apparatus according to claim 8, wherein the first model comprises an encoding layer, a generalization module, a relationship module, and an intent classification result output module;

12. An electronic device, comprising: a transceiver, a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that the processor, which is adapted to read a program in a memory, implements the steps in the intention-recognition method of any one of claims 1 to 7.

13. A readable storage medium storing a program, characterized in that the program, when executed by a processor, implements the steps in the intent recognition method of any of claims 1-7.