CN115017313A - Intention recognition and model training method, electronic device and computer storage medium

Intention recognition and model training method, electronic device and computer storage medium

Info

Publication number
CN115017313A
Authority
CN
China
Prior art keywords
text
coding sequence
mask
masked
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210617940.XA
Other languages
Chinese (zh)
Inventor
Liu Che (刘澈)
Li Yongbin (李永彬)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210617940.XA priority Critical patent/CN115017313A/en
Publication of CN115017313A publication Critical patent/CN115017313A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide an intention recognition and model training method, an electronic device, and a computer storage medium. The intention recognition method comprises the following steps: acquiring an original feature coding sequence corresponding to a text to be recognized; performing mask processing on the marks in the text to be recognized to obtain a mask text and a masked feature coding sequence corresponding to the mask text; determining structural information among the marks in the text to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information; and aggregating the original feature coding sequence and the structural feature vector, and recognizing the intent of the text to be recognized according to the aggregation result. Through the scheme provided by the embodiments of the present application, more accurate intent recognition can be achieved, the cost of intent recognition is reduced, and the efficiency of intent recognition is improved overall.

Description

Intention recognition and model training method, electronic device and computer storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an intention recognition method, an intention recognition model training method, corresponding electronic equipment and a computer storage medium.
Background
Intent recognition is a fundamental natural language processing task that identifies the intent expressed by a piece of text, and it has wide application in the field of artificial intelligence.
At present, intent recognition is mostly performed by feeding the text into a pre-trained language model for encoding and then computing the recognition result from the encoding with a Multi-Layer Perceptron (MLP). Although this approach solves short-sentence intent recognition to some extent, it is still insufficient for long, difficult sentences. The reason is that the logical relationships among the components of such sentences are complex and may involve long-distance connections, so existing pre-trained language models cannot resolve the complex logic they contain, and recognition accuracy on long, difficult sentence patterns is poor.
Therefore, how to accurately recognize the intent of long, difficult sentences has become a problem to be solved urgently.
Disclosure of Invention
In view of the above, embodiments of the present application provide an intention recognition and model training scheme to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, there is provided an intention recognition method, including: acquiring an original feature coding sequence corresponding to a text to be recognized; performing mask processing on the marks in the text to be recognized to obtain a mask text and a masked feature coding sequence corresponding to the mask text; determining structural information among the marks in the text to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information; and aggregating the original feature coding sequence and the structural feature vector, and recognizing the intent of the text to be recognized according to the aggregation result.
According to a second aspect of the embodiments of the present application, there is provided an intention recognition model training method, including: acquiring a training sample for training an intention recognition model, wherein the training sample comprises a text sample to be recognized and an intention label corresponding to the text sample to be recognized; inputting the training sample into the intention recognition model to be trained, and acquiring an original feature coding sequence corresponding to the text sample to be recognized; performing mask processing on the marks in the text sample to be recognized to obtain a mask text sample and a masked feature coding sequence corresponding to the mask text sample; determining structural information among the marks in the text sample to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information; aggregating the original feature coding sequence and the structural feature vector, and performing intent prediction on the aggregation result; and training the intention recognition model according to the difference between the intent prediction result and the intention label.
According to a third aspect of embodiments of the present application, there is provided an intention identification method including: obtaining a dialogue text corresponding to voice dialogue data of a user and an original feature coding sequence of the dialogue text; performing mask processing on the marks in the dialog text to obtain a mask text and a masked feature coding sequence corresponding to the mask text; determining structural information among all marks in the dialog text based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information; and aggregating the original feature coding sequence and the structural feature vector, and identifying the conversation intention according to an aggregation result.
According to a fourth aspect of embodiments of the present application, there is provided an intention identification method including: acquiring an original characteristic coding sequence corresponding to a search request of a user; performing mask processing on the marks in the search request to obtain a mask text and a masked feature coding sequence corresponding to the mask text; determining structural information among all marks in the search request based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information; and aggregating the original feature coding sequence and the structural feature vector, and identifying the search intention aiming at the search request according to an aggregation result.
According to a fifth aspect of the embodiments of the present application, there is provided an electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus; and the memory is used for storing at least one executable instruction that causes the processor to perform the operations corresponding to the method described above.
According to a sixth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
According to a seventh aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to execute operations corresponding to the method as described above.
According to the scheme provided by the embodiments of the present application, when recognizing the intent of a text, not only the semantic information carried by the original feature coding sequence is considered, but the structural information of the text is also obtained. Because the structural information of a text effectively represents the logical relationships among its parts (such as characters, words, and phrases), intent recognition that combines the structural information with the semantic information can more accurately capture the dependency and influence relationships among those parts, achieving more accurate recognition. In addition, since the structural information is generated from the original feature coding sequence and the masked feature coding sequence, it can be obtained accurately without relying on external information, which reduces the cost of intent recognition and improves its efficiency overall.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some of the embodiments of the present application, and those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of an exemplary system to which an intention recognition method according to an embodiment of the present application is applied;
FIG. 2A is a flowchart illustrating steps of a method for training an intent recognition model according to a first embodiment of the present disclosure;
FIG. 2B is a schematic diagram of an intent recognition model in the embodiment of FIG. 2A;
FIG. 3A is a flowchart illustrating steps of a method for identifying intentions according to a second embodiment of the present application;
FIG. 3B is a diagram illustrating an example of a scenario in the embodiment shown in FIG. 3A;
FIG. 3C is a diagram illustrating another exemplary scenario in the embodiment shown in FIG. 3A;
fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, these solutions will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the scope of protection of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Fig. 1 illustrates an exemplary system to which embodiments of the present application may be applied. As shown in fig. 1, the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, illustrated in fig. 1 as a plurality of user devices.
Server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content. In some embodiments, server 102 may perform any suitable functions. For example, in some embodiments, the server 102 may be used for intent recognition of text. As an alternative example, in some embodiments, the server 102 may perform text intent recognition based on the original feature encoding sequence carrying semantic information and the structural feature vector carrying the structural information of the text. As another example, in some embodiments, when obtaining the structural information of a text, the server 102 may mask a mark (token) in the text and derive the structural information from the masked feature encoding sequence of the masked text together with the original feature encoding sequence of the original text. In some embodiments, an intention recognition model may be provided in the server 102 so that intent recognition of the text is carried out by the model. Optionally, in some embodiments, the server 102 may also train the intention recognition model. In some embodiments, after performing intent recognition on a text and obtaining the recognition result, the server 102 may send the result to the user device.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 may include any one or more of the following: the Internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 by one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the server 102 via one or more communication links (e.g., communication link 114). The communication link may be any link suitable for communicating data between the user device 106 and the server 102, such as a network link, a dial-up link, a wireless link, a hard-wired link, any other suitable communication link, or any suitable combination of such links.
User device 106 may be any device suitable for interacting with a user and receiving user input. In some embodiments, user devices 106 may comprise any suitable type of device. For example, in some embodiments, the user device 106 may include a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device. In some embodiments, the user device 106 may send information or data entered by the user to the server 102 for intent recognition by the server 102. In some embodiments, the user device 106 may also receive the intent recognition results returned by the server 102.
Although server 102 is illustrated as one device, in some embodiments, any suitable number of devices may be used to perform the functions performed by server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by the server 102. Alternatively, the functionality of the server 102 may be implemented using a cloud service.
Based on the above system, the present application will be described below with reference to a plurality of embodiments. For the sake of understanding, the following embodiments describe the training of the intention recognition model first, and then describe the application of the intention recognition model after the training is completed.
Example one
Referring to fig. 2A, a flowchart illustrating steps of a method for training an intention recognition model according to a first embodiment of the present application is shown.
The intention recognition model training method of the embodiment comprises the following steps:
step S202: training samples for training the intention recognition model are obtained.
The training samples comprise text samples to be recognized and intention labels corresponding to the text samples to be recognized.
The text sample to be recognized may be a text sample from any domain, including but not limited to the human-computer dialogue domain, the online customer service domain, the online search domain, the electronic commerce domain, the online medical domain, and the like.
The intention label corresponding to the text sample to be recognized can be set by those skilled in the art according to actual needs. For example, in the field of electronic commerce the intention labels may include a price-inquiry intention, a model intention, a material intention, an order intention, and the like; the online search domain may include richer intent labels, such as an address-lookup intent, a person-lookup intent, a news-lookup intent, a transaction-lookup intent, and the like. It should be noted that these intention labels are only exemplary; in practical applications the labels will be richer and more complex, and they may also span multiple levels, with each lower-level label characterizing a subdivision of the intent of the label above it, and so on.
Step S204: and inputting the training sample into the intention recognition model to be trained, and acquiring an original characteristic coding sequence corresponding to the text sample to be recognized.
In order to more clearly illustrate the training of the intention recognition model, in this step, an exemplary description is given to the structure of the intention recognition model, as shown in fig. 2B.
As can be seen from fig. 2B, the intention recognition model in the embodiment of the present application includes a first deep neural network encoder part, a second deep neural network encoder part, a graph neural network part, and a prediction output part. The first deep neural network encoder part can be implemented by an encoder with a Transformer structure (including but not limited to a Transformer Encoder); the second deep neural network encoder part can be implemented by an encoder with a Transformer structure (including but not limited to a Transformer Encoder) that carries a self-supervised Masked Language Model (MLM) task; the graph neural network part can be implemented by models such as GNN, GCN, and GAT; the prediction output part can comprise a splicing layer and an MLP layer, where the splicing layer splices the original feature coding sequence and the structural feature vector corresponding to the text sample to be recognized so as to aggregate the two, and the MLP layer outputs an intent prediction result based on the splicing result. In the training stage, the MLP layer also calculates the difference between the prediction result and the intention label, and the intention recognition model is trained based on this difference; in the application stage, the MLP directly outputs the intent prediction result. As for the second deep neural network encoder, an MLM is a model trained by corrupting (masking) some positions in a text sequence and requiring the model to recover them; based on the MLM, masking and self-supervised learning on the text sequence can be performed effectively without relying on external label information. In this embodiment, the MLM is assumed to have already been trained.
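For orientation only, the four parts named above might be assembled roughly as follows. This is a sketch under our own assumptions: the module names, layer sizes, and the choice of PyTorch are illustrative and are not taken from the patent.

    import torch.nn as nn

    class IntentRecognitionModel(nn.Module):
        """Sketch of the four-part architecture; forward wiring would follow steps S204-S210."""
        def __init__(self, vocab_size=1000, d_model=64, num_intents=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            # First deep neural network encoder: yields the original feature coding sequence.
            layer1 = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
            self.original_encoder = nn.TransformerEncoder(layer1, num_layers=2)
            # Second deep neural network encoder: in the patent this one carries a
            # pre-trained MLM task; here it is just another Transformer encoder stack.
            layer2 = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
            self.masked_encoder = nn.TransformerEncoder(layer2, num_layers=2)
            # Placeholder for the graph neural network part (GNN/GCN/GAT in the patent).
            self.graph_net = nn.Linear(d_model, d_model)
            # Prediction output part: the splicing layer is a concatenation in the
            # forward pass; the MLP maps the spliced vector to one value per intent label.
            self.mlp = nn.Sequential(
                nn.Linear(2 * d_model, d_model),
                nn.ReLU(),
                nn.Linear(d_model, num_intents),
            )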
Based on this, in this step, after the training sample is input into the intention recognition model, the text sample to be recognized is first encoded by the first deep neural network encoder part to obtain a corresponding coding sequence, referred to as the original feature coding sequence, which effectively represents the semantic information of the text sample to be recognized.
Step S206: and performing mask processing on the marks in the text sample to be recognized to obtain a mask text sample and a masked feature coding sequence corresponding to the mask text sample.
In one possible approach, this step can be implemented as: masking each mark in the text sample to be recognized in turn, obtaining as many mask text samples as there are marks; and encoding the mask text samples respectively to obtain the corresponding masked feature coding sequences.
Illustratively, let A, B, C, D, E represent the five tokens of a long, difficult sentence (e.g., a text whose word count exceeds a preset number, or a text containing more than two punctuation marks, or a text containing multiple grammatical relations and/or sentence patterns); a token may be a word or a phrase. Masking the sentence once per token yields five new token sequences (namely, mask text samples), respectively:
[MASK] B C D E;  A [MASK] C D E;  A B [MASK] D E;  A B C [MASK] E;  A B C D [MASK]
specifically, in the intention recognition model shown in fig. 2B, the masking process may be performed by the second deep neural network encoder to obtain the five mask text samples; further, the five mask text samples are respectively encoded to obtain five corresponding new encoding vectors, namely, the masked feature encoding sequences, which respectively correspond to the notations a ', b ', c ', d ', e '.
Step S208: and determining structural information among all marks in the text sample to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information.
In one possible approach, this step can be implemented as: for each masked mark, determining the similarity between the codes of the non-masked marks in the masked feature coding sequence of the mask text sample to which that mark belongs and the corresponding codes in the original feature coding sequence; and determining, according to the similarity, the structural information between the masked mark and the marks that were not masked. In this way, the structural information among tokens can be obtained accurately without relying on external structure annotation, reducing the acquisition cost of the structural information.
When the intention recognition model shown in fig. 2B is used, the structural information can be obtained by the second deep neural network encoder. After the second deep neural network encoder obtains the masked feature coding sequence corresponding to a certain masked token, it performs a token-wise similarity calculation between this sequence and the original feature coding sequence (the similarity at the masked position itself can be ignored), for example a cosine similarity calculation, to determine the similarity of the tokens other than the masked one across the two sequences. If the masked token has a strong association with one or more of the other, non-masked tokens, the codes of those tokens will differ more from their original codes, i.e., show low similarity, because the information of the masked token is missing. Conversely, if the masked token is only weakly associated with a non-masked token, that token's masked feature code will differ little from its original code, i.e., show high similarity. On this basis, for each masked token, the syntactic dependency, namely the structural information, between it and the other tokens can be determined. After traversing the similarities at all positions of all sequences, the syntactic dependency between any two of the tokens corresponding to the text sample to be recognized can be obtained.
Illustratively, when calculating the syntactic dependencies between the Nth token and the other tokens, the original feature coding sequence of the text sample to be recognized and the masked feature coding sequence of the Nth mask text sample (the one in which the Nth token is masked) are selected. For example, to calculate the syntactic dependency between tokens A and B, the original feature coding sequence and the masked feature coding sequence of the sample in which A is masked are selected.
The syntactic dependency between tokens is then computed as follows: for the Nth and Mth tokens, calculate the similarity, such as the cosine similarity, between the feature codes at the Mth position of the two feature coding sequences. For example, the syntactic dependency between A and B is obtained from the cosine similarity between the codes at B's position in the original feature coding sequence and in the masked feature coding sequence of the sample in which A is masked.
After traversing all positions of all sequences, the syntactic dependency between any two tokens of the original text sample to be recognized is obtained. In a feasible manner, once these pairwise dependencies are available, edges whose scores are greater than or equal to a preset similarity threshold may be retained and edges below the threshold removed, completing the construction of the syntactic dependency tree associated with the text sample to be recognized.
Alternatively, for each masked token, the similarities obtained for the non-masked tokens may be screened against a preset similarity threshold, and structural information generated between the masked token and the tokens whose similarities pass the screening.
In the above process, the similarity threshold may be set by those skilled in the art according to actual requirements, such as a normalized value of 0.7, which is not limited by the embodiments of the present application. In this way, data that does not bear on the decisive syntactic dependencies among tokens is filtered out, so that the dependencies can be determined effectively and accurately while greatly reducing the data-processing load and the adverse effect of noisy data.
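Putting the similarity calculation and the threshold screening together, one plausible reading can be sketched as follows. Note the hedge: following the reasoning above, a low positional cosine similarity is treated as a strong dependency, so an edge is kept when the similarity falls below the threshold; the text also admits the opposite convention (retaining high-scoring edges), so this direction, like the 0.7 default, is an assumption.

    import torch
    import torch.nn.functional as F

    def dependency_edges(original_encoding, masked_encodings, threshold=0.7):
        """Build dependency edges (i, j): token j depends on the masked token i."""
        n = original_encoding.size(0)
        edges = []
        for i, masked_encoding in enumerate(masked_encodings):
            # Per-position cosine similarity between the masked and original sequences.
            sim = F.cosine_similarity(masked_encoding, original_encoding, dim=-1)
            for j in range(n):
                if j == i:
                    continue  # ignore the masked position itself
                # Assumption: low similarity means token j lost information when
                # token i was masked, i.e. a strong syntactic dependency, so the
                # edge is kept when similarity falls below the threshold.
                if sim[j].item() < threshold:
                    edges.append((i, j))
        return edges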
Further, based on the structural information, the corresponding structural feature vector may be obtained. When the structural information is represented in the form of a syntactic dependency tree, its structural feature vector can be extracted by a graph neural network model. That is, a syntactic dependency tree of the text sample to be recognized may be generated according to the structural information, and feature extraction may then be performed on the tree through the graph neural network model to obtain the corresponding structural feature vector. The structural feature vector is easier to process and to aggregate subsequently with the original feature coding sequence.
Illustratively, when the intent recognition model shown in fig. 2B is employed, its graph neural network part (including but not limited to particular forms such as GNN, GCN, and GAT) can be applied to extract the structural information of the syntactic dependency tree. The extracted information comprises the information of each graph node, and average pooling over all node information yields the structural information finally associated with the text sample to be recognized, namely the structural feature vector.
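A hand-rolled stand-in for this graph step might look like the following: one simplified GCN-style propagation over the dependency edges plus average pooling of the node states. The real model would use a proper GNN/GCN/GAT, so this is a sketch only.

    import torch

    def structural_feature(node_features, edges):
        """One simplified GCN-style propagation over the dependency edges,
        followed by average pooling of all node states."""
        n = node_features.size(0)
        adj = torch.eye(n)                      # self-loops
        for i, j in edges:
            adj[i, j] = adj[j, i] = 1.0         # undirected dependency edge
        deg = adj.sum(dim=1, keepdim=True)
        hidden = torch.relu((adj / deg) @ node_features)  # neighbor averaging
        return hidden.mean(dim=0)               # average pooling over graph nodes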
Step S210: and aggregating the original feature coding sequence and the structural feature vector, and performing intent prediction on an aggregation result.
In this embodiment, the aggregation is implemented by concatenation, but it is not limited thereto; other forms of vector fusion are also applicable to the scheme of the embodiments of the present application.
Illustratively, the structural feature vector and the original feature coding sequence can be spliced, so that the spliced vector carries both the semantic information and the structural information of the text sample to be recognized. Intent prediction is then performed on the spliced vector through a non-linear Multi-Layer Perceptron (MLP), which maps it to a C-dimensional vector, where C equals the number of intent labels.
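The prediction head can be sketched as follows. How the original feature coding sequence is reduced to a single vector before splicing is not specified in the text, so the stand-in vectors and sizes below are assumptions.

    import torch
    import torch.nn as nn

    d_model, num_intents = 8, 4  # illustrative sizes; C = num_intents
    mlp = nn.Sequential(
        nn.Linear(2 * d_model, 32),
        nn.ReLU(),                   # the non-linear multilayer perceptron
        nn.Linear(32, num_intents),  # C-dimensional output, one value per intent label
    )

    pooled_original = torch.randn(d_model)  # summary of the original feature coding sequence (stand-in)
    structural_vec = torch.randn(d_model)   # structural feature vector (stand-in)
    spliced = torch.cat([pooled_original, structural_vec], dim=-1)  # aggregation by splicing
    logits = mlp(spliced)                   # intent prediction over the C labels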
Step S212: and training an intention recognition model according to the difference between the intention prediction result and the intention label.
After the intention prediction result is obtained, a loss may be computed between the prediction result and the intention label of the text sample to be recognized based on a preset loss function; the resulting loss value, i.e., the difference between the intention prediction result and the intention label, is then used to train the intention recognition model.
In one example, the intent label of the text sample to be recognized may be implemented as a one-hot code, in which case the cross-entropy loss between the intent prediction result and the true one-hot intent label can be calculated and optimized, and model training performed on that basis. The cross-entropy loss is determined by a corresponding cross-entropy loss function; a conventional cross-entropy loss function may be used, and its detailed implementation is not elaborated in the embodiments of the present application.
The training is executed iteratively until a termination condition is reached, such as a preset number of training iterations or the loss value reaching a set threshold.
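A minimal sketch of this training loop, with a stand-in model and random data in place of the real aggregated features and labels:

    import torch
    import torch.nn as nn

    num_intents, feat_dim = 4, 16
    model = nn.Linear(feat_dim, num_intents)  # stand-in for the full recognition model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()           # cross-entropy against the intent labels

    features = torch.randn(32, feat_dim)           # aggregated features (stand-in)
    labels = torch.randint(0, num_intents, (32,))  # intent label indices

    for step in range(1000):                  # e.g., a preset number of training iterations
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
        if loss.item() < 0.05:                # ...or the loss value reaching a set threshold
            break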
In this way, the trained intention recognition model can comprehensively consider the semantic information and the structural information of a text, and thus predict its intent more accurately. Moreover, when predicting the structural information, the model determines the coding differences at corresponding positions between the original feature coding sequence and the masked feature coding sequences of the tokens, and so predicts the structural information among tokens without depending on externally annotated structure data, making the prediction more accurate and reducing the overall cost of realizing the model.
Example two
Referring to fig. 3A, a flowchart illustrating steps of an intention identification method according to a second embodiment of the present application is shown.
In this embodiment, the intention recognition method of the embodiments of the present application is described from the application perspective, based on an intention recognition model trained by the method of the first embodiment.
The intention identifying method of the embodiment comprises the following steps:
step S302: and acquiring an original characteristic coding sequence corresponding to the text to be recognized.
The text to be recognized may be text in any field that needs to be subjected to intent recognition, including but not limited to a human-computer conversation field, an online customer service field, an online search field, an electronic commerce field, an online medical field, and the like.
When the scheme of this embodiment is implemented based on the trained intention recognition model of the first embodiment, the text to be recognized may be encoded by the first deep neural network encoder shown in fig. 2B, so as to obtain the original feature encoding sequence. For specific implementation, reference may be made to the description of relevant parts in the first embodiment, which is not described herein again.
Step S304: and performing mask processing on the marks in the text to be recognized to obtain a mask text and a masked feature coding sequence corresponding to the mask text.
In one possible approach, this step can be implemented as: masking each token in the text to be recognized in turn, obtaining as many mask texts as there are tokens; and encoding the mask texts respectively to obtain the corresponding masked feature coding sequences.
When the scheme of this embodiment is implemented based on the intention recognition model that is trained in the first embodiment, masking is performed on each token in the text to be recognized by using the second deep neural network encoder shown in fig. 2B, so as to obtain a plurality of mask texts and a plurality of post-mask feature coding sequences corresponding to the plurality of mask texts. For specific implementation, reference may be made to the description of relevant parts in the first embodiment, which is not described herein again.
Step S306: and determining structural information among all marks in the text to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information.
In one possible approach, this step can be implemented as: for each masked token, determining the similarity between the codes of the non-masked tokens in the masked feature coding sequence of the mask text to which that token belongs and the corresponding codes in the original feature coding sequence; and determining, according to the similarity, the structural information between the masked token and the tokens that were not masked.
Further optionally, according to the similarity, determining the structural information between the masked tokens and the tokens that are not masked may be implemented as: according to a preset similarity threshold, screening the similarity corresponding to each token which is not processed by the mask; and according to the screening result, generating structural information between the token subjected to mask processing and the token corresponding to the screened similarity.
When the scheme of this embodiment is implemented based on the trained intention recognition model of embodiment one, the structural information between tokens in the text to be recognized can still be obtained by the second deep neural network encoder shown in fig. 2B. For specific implementation, reference may be made to the description of relevant parts in the first embodiment, which is not described herein again.
Further, a syntactic dependency tree of the text to be recognized can be generated according to the structure information among tokens in the text to be recognized; and then, performing feature extraction on the syntactic dependency tree through a graph neural network model to obtain a corresponding structural feature vector.
When the solution of the present embodiment is implemented based on the trained intention recognition model of the first embodiment, the structural feature vector corresponding to the structural information can be obtained through the graph neural network part shown in fig. 2B. For specific implementation, reference may be made to the description of relevant parts in the first embodiment, which is not described herein again.
Step S308: and aggregating the original feature coding sequence and the structural feature vector, and identifying the text intentions to be identified according to the aggregation result.
The original feature coding sequence and the structural feature vector are aggregated by splicing or fusion, so that the semantic information and the structural information of the text to be recognized can be effectively combined to accurately recognize its intent.
When the solution of the present embodiment is implemented based on the intention recognition model trained in the first embodiment, vector aggregation and intent recognition can be performed by the prediction output section shown in fig. 2B. For specific implementation, reference may be made to the description of the relevant parts in the first embodiment, which is not repeated here. Different from the training phase in the first embodiment, in this embodiment the C-dimensional vector generated by the MLP may be normalized into a probability for each of the preset intent categories, and the category with the highest probability taken as the final intent recognition result.
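As a small illustration of this inference-time readout (the label set and logits below are hypothetical, and softmax is one common choice of normalization, not necessarily the patent's):

    import torch

    intent_labels = ["price inquiry", "model", "material", "order"]  # hypothetical label set
    logits = torch.tensor([0.2, 2.1, -0.3, 0.5])  # stand-in C-dimensional MLP output
    probs = torch.softmax(logits, dim=-1)         # probability for each intent category
    result = intent_labels[int(probs.argmax())]   # highest-probability category wins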
As can be seen from the above, the scheme of this embodiment first acquires the original feature coding sequence corresponding to the text to be recognized. The text is then masked position by position: a new text sequence is generated for each masked position and encoded by the intention recognition model to obtain its surface representation. Next, the syntactic dependency between the masked position and every other position in the sequence is analyzed; if it exceeds a certain threshold (namely, the preset similarity threshold) the dependency edge is kept, otherwise it is pruned. Once the dependencies for all positions have been computed, the syntactic dependency tree of the whole text to be recognized is obtained. A graph neural network is then applied to the syntactic dependency tree to obtain the structural feature vector of the text, which is combined with the original feature coding sequence of the original text to jointly perform intent recognition and obtain the recognition result.
Hereinafter, the above-described procedure is exemplified by way of a scenario example.
As shown in fig. 3B, an intent recognition method in a human-machine dialog scenario is shown.
Specifically, the smart device receives voice dialogue data of the user (which may be question data or simple non-question conversational data; this example does not limit the domain or content of the dialogue data) and sends it to the server. After receiving the voice dialogue data, the server converts it into the corresponding dialogue text (alternatively, the smart device converts the voice into text and sends the text to the server), inputs the text into the intention recognition model, and obtains the original feature coding sequence of the dialogue text through the first deep neural network encoder of the model; the marks in the dialogue text are masked by the second deep neural network encoder of the model to obtain mask texts and the masked feature coding sequences corresponding to them; the second deep neural network encoder then determines the structural information among the tokens in the dialogue text based on the original feature coding sequence and the masked feature coding sequences; the structural feature vector corresponding to the structural information is obtained through the graph neural network of the model; and the prediction output part of the model aggregates the original feature coding sequence and the structural feature vector, performs dialogue-intent recognition according to the aggregation result, and outputs the recognition result. Further, subsequent processing may be performed based on the recognition result, such as determining a reply text for the user's voice dialogue data and feeding it back to the smart device to be converted into voice and played to the user, and so on.
In another scenario, as shown in FIG. 3C, an intent recognition method in an online search scenario is shown.
Specifically, the smart device receives a search request input by the user and sends it to the server. After receiving the search request, the server inputs it into the intention recognition model and obtains the original feature coding sequence corresponding to the search request through the first deep neural network encoder of the model; the tokens in the search request are masked by the second deep neural network encoder to obtain mask texts and the masked feature coding sequences corresponding to them; the second deep neural network encoder determines the structural information among the tokens in the search request based on the original feature coding sequence and the masked feature coding sequences; the structural feature vector corresponding to the structural information is obtained through the graph neural network of the model; and the prediction output part aggregates the original feature coding sequence and the structural feature vector, performs search-intent recognition for the search request according to the aggregation result, and outputs the recognition result. Further, subsequent processing may be performed based on the recognition result, such as determining search results for the request and feeding them back to the smart device for presentation to the user, and so forth.
Therefore, through this embodiment, when recognizing the intent of a text, not only the semantic information carried by the original feature coding sequence is considered, but the structural information of the text is also obtained. The structural information of a text effectively represents the logical relationships among its parts (such as characters, words, and phrases), so intent recognition that combines the structural information with the semantic information can more accurately capture the dependency and influence relationships among those parts, achieving more accurate recognition. In addition, since the structural information is generated from the original feature coding sequence and the masked feature coding sequence, it can be obtained accurately without relying on external information, which reduces the cost of intent recognition and improves its efficiency overall.
EXAMPLE III
Referring to fig. 4, a schematic structural diagram of an electronic device according to a third embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 4, the electronic device may include: a processor (processor)402, a Communications Interface 404, a memory 406, and a Communications bus 408.
Wherein:
the processor 402, communication interface 404, and memory 406 communicate with each other via a communication bus 408.
A communication interface 404 for communicating with other electronic devices or servers.
The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in any of the above method embodiments.
In particular, program 410 may include program code comprising computer operating instructions.
The processor 402 may be a CPU, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device may comprise one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 406 for storing a program 410. Memory 406 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 410 may be specifically configured to enable the processor 402 to execute the operations corresponding to the intention recognition method or the intention recognition model training method described in any of the foregoing method embodiments.
For specific implementation of each step in the program 410, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing method embodiments, and corresponding beneficial effects are provided, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The present application further provides a computer program product, which includes computer instructions for instructing a computing device to execute an operation corresponding to any one of the intent recognition methods or the intent recognition model training method in the above multiple method embodiments.
It should be noted that the embodiments of the present application take the intent recognition of long, difficult sentences as an example, but it should be understood by those skilled in the art that, in practical applications, the solution of the embodiments may be applied not only to long, difficult sentences but also to other sentences, such as standard sentences or short sentences.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. An intent recognition method comprising:
acquiring an original characteristic coding sequence corresponding to a text to be recognized;
performing mask processing on the marks in the text to be recognized to obtain a mask text and a masked feature coding sequence corresponding to the mask text;
determining structural information among all marks in the text to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining structural feature vectors corresponding to the structural information;
and aggregating the original feature coding sequence and the structural feature vector, and recognizing the intent of the text to be recognized according to the aggregation result.
2. The method according to claim 1, wherein the masking the mark in the text to be recognized to obtain a masked text and a masked feature encoding sequence corresponding to the masked text includes:
respectively performing mask processing on each mark in the text to be recognized to obtain a plurality of mask texts with the same number as the marks;
and respectively coding the mask texts to obtain a plurality of corresponding post-mask feature coding sequences.
3. The method according to claim 2, wherein the determining structural information between the respective marks in the text to be recognized based on the original feature encoding sequence and the masked feature encoding sequence comprises:
for each masked mark, determining the similarity between the codes of the marks that were not masked in the masked feature coding sequence corresponding to the mask text to which the masked mark belongs and the corresponding codes in the original feature coding sequence;
and determining structural information between the marks which are subjected to masking processing and the marks which are not subjected to masking processing according to the similarity.
4. The method of claim 3, wherein the determining structure information between the masked tokens and the respective tokens that are not masked according to the similarity comprises:
according to a preset similarity threshold value, screening the similarity corresponding to each mark which is not processed by the mask;
and according to the screening result, generating structure information between the marks subjected to mask processing and the marks corresponding to the screened similarity.
5. The method according to any one of claims 1-4, wherein the obtaining of the structural feature vector corresponding to the structural information includes:
generating a syntactic dependency tree of the text to be recognized according to the structure information;
and performing feature extraction on the syntactic dependency tree through a graph neural network model to obtain a corresponding structural feature vector.
6. An intent recognition model training method, comprising:
acquiring a training sample for training an intention recognition model, wherein the training sample comprises a text sample to be recognized and an intention label corresponding to the text sample to be recognized;
inputting the training sample into an intention recognition model to be trained, and acquiring an original characteristic coding sequence corresponding to the text sample to be recognized;
performing mask processing on the marks in the text sample to be recognized to obtain a mask text sample and a masked feature coding sequence corresponding to the mask text sample;
determining structural information among all marks in the text sample to be recognized based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information;
aggregating the original feature coding sequence and the structural feature vector, and performing intent prediction on an aggregation result;
and training the intention recognition model according to the difference between the intention prediction result and the intention label.
7. The method according to claim 6, wherein the masking the mark in the text sample to be recognized to obtain a masked text sample and a masked feature encoding sequence corresponding to the masked text sample includes:
respectively performing mask processing on each mark in the text sample to be recognized to obtain a plurality of mask text samples with the same number as the marks;
and respectively coding the mask text samples to obtain a plurality of corresponding masked feature coding sequences.
8. The method according to claim 7, wherein the determining structural information between the respective marks in the text sample to be recognized based on the original feature encoding sequence and the masked feature encoding sequence comprises:
for each masked mark, determining the similarity between the codes of the marks that were not masked in the masked feature coding sequence corresponding to the mask text sample to which the masked mark belongs and the corresponding codes in the original feature coding sequence;
and determining structural information between the marks which are subjected to masking processing and the marks which are not subjected to masking processing according to the similarity.
9. The method according to any one of claims 6-8, wherein the obtaining of the structural feature vector corresponding to the structural information comprises:
generating a syntactic dependency tree of the text sample to be recognized according to the structure information;
and performing feature extraction on the syntactic dependency tree through a graph neural network model to obtain a corresponding structural feature vector.
10. An intent recognition method comprising:
obtaining a dialogue text corresponding to voice dialogue data of a user and an original feature coding sequence of the dialogue text;
performing mask processing on the marks in the dialog text to obtain a mask text and a masked feature coding sequence corresponding to the mask text;
determining structural information among all marks in the dialog text based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information;
and aggregating the original feature coding sequence and the structural feature vector, and identifying the conversation intention according to an aggregation result.
11. An intent recognition method comprising:
acquiring an original characteristic coding sequence corresponding to a search request of a user;
performing mask processing on the marks in the search request to obtain a mask text and a masked feature coding sequence corresponding to the mask text;
determining structural information among all marks in the search request based on the original feature coding sequence and the masked feature coding sequence, and obtaining a structural feature vector corresponding to the structural information;
and aggregating the original feature coding sequence and the structural feature vector, and identifying the search intention aiming at the search request according to an aggregation result.
12. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction which causes the processor to execute the corresponding operation of the method according to any one of claims 1-11.
13. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 11.
14. A computer program product comprising computer instructions to instruct a computing device to perform operations corresponding to the method of any of claims 1-11.
CN202210617940.XA 2022-06-01 2022-06-01 Intention recognition and model training method, electronic device and computer storage medium Pending CN115017313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210617940.XA CN115017313A (en) 2022-06-01 2022-06-01 Intention recognition and model training method, electronic device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210617940.XA CN115017313A (en) 2022-06-01 2022-06-01 Intention recognition and model training method, electronic device and computer storage medium

Publications (1)

Publication Number Publication Date
CN115017313A true CN115017313A (en) 2022-09-06

Family

ID=83072424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210617940.XA Pending CN115017313A (en) 2022-06-01 2022-06-01 Intention recognition and model training method, electronic device and computer storage medium

Country Status (1)

Country Link
CN (1) CN115017313A (en)

Similar Documents

Publication Publication Date Title
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
JP6541673B2 (en) Real time voice evaluation system and method in mobile device
CN111858843B (en) Text classification method and device
CN111738016A (en) Multi-intention recognition method and related equipment
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN113656547B (en) Text matching method, device, equipment and storage medium
CN110309301B (en) Enterprise category classification method and device and intelligent terminal
CN114490953B (en) Method for training event extraction model, method, device and medium for extracting event
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN111739520A (en) Speech recognition model training method, speech recognition method and device
CN111538809A (en) Voice service quality detection method, model training method and device
CN114398881A (en) Transaction information identification method, system and medium based on graph neural network
CN114529903A (en) Text refinement network
CN112417878A (en) Entity relationship extraction method, system, electronic equipment and storage medium
CN111581346A (en) Event extraction method and device
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113887169A (en) Text processing method, electronic device, computer storage medium, and program product
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN116976341A (en) Entity identification method, entity identification device, electronic equipment, storage medium and program product
CN115718889A (en) Industry classification method and device for company profile
CN115017987A (en) Language model fine-tuning method, text classification method, device and equipment
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN115017313A (en) Intention recognition and model training method, electronic device and computer storage medium
CN114298032A (en) Text punctuation detection method, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination