WO2021204017A1 - Text intent recognition method and apparatus, and related device - Google Patents

Text intent recognition method and apparatus, and related device

Info

Publication number
WO2021204017A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
recognized
word
features
feature
Application number
PCT/CN2021/083876
Other languages
French (fr)
Chinese (zh)
Inventor
李�杰
王健宗
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2021204017A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a text intent recognition method, apparatus and related device.
  • Text intent recognition mainly operates on the recognized text obtained by speech recognition of the customer's voice in an intelligent customer service system; by recognizing the intent of this text, the system determines the meaning the customer expresses and then replies to the customer with the text matching that intent.
  • The inventor realized that using only a single sentence for intent recognition may identify the wrong intent. For example, the customer's current utterance may take the previous few sentences as its premise; when the premise is not satisfied, the intent expressed by the current sentence may be completely different, leading the customer service robot to reply incorrectly, which not only degrades the customer experience but also provides customers with the wrong services.
  • The present application provides a text intent recognition method and system, which effectively addresses the intent recognition errors caused by the complexity and diversity of dialogue content when intent is recognized from a single dialogue sentence.
  • An embodiment of the present application provides a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature with an intent classification model to obtain the intent corresponding to the text to be recognized.
  • An embodiment of the present application provides a text intent recognition device, including: an acquiring unit for acquiring voice information and a text queue; and a preprocessing unit for converting the voice information into text to be recognized and adding it to the text queue;
  • a feature extraction unit for extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text;
  • a fusion unit for obtaining the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text;
  • and a classification unit for classifying the fusion feature through the intent classification model to obtain the intent corresponding to the text to be recognized.
  • An embodiment of the present application provides a text intent recognition device, including a processor and a memory, where the processor executes the code in the memory to perform a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature with the intent classification model to obtain the intent corresponding to the text to be recognized.
  • An embodiment of the present application provides a computer-readable storage medium that includes instructions.
  • When the instructions are run, the computer executes a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts;
  • extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text;
  • obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature with the intent classification model to obtain the intent corresponding to the text to be recognized.
  • The embodiment of the application captures context matching information from the word level to the sentence level between the text to be recognized and each text in the text queue, so that feature fusion of different texts at different granularities can make full use of historical semantic information and achieve contextual information fusion.
  • Combining word-level features and sentence-level features yields a more discriminative feature, which improves the accuracy of text intent recognition.
  • FIG. 1 is a schematic diagram of a work flow of a text intent recognition intelligent customer service system provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a method for text intent recognition provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a model for extracting sentence-level text features provided by an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a text intention recognition device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of feature extraction provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a text intention recognition device provided by an embodiment of the present application.
  • The technical solution of the present application may involve the fields of artificial intelligence and/or big data technology, and may be used to realize intent recognition in financial technology scenarios such as intelligent question answering in a banking system.
  • the data involved in this application such as voice, text, and/or intention information, can be stored in a database, or can be stored in a blockchain, which is not limited in this application.
  • the text mentioned in this embodiment includes words or sentences.
  • A word is the collective name for words and phrases, covering single words, compound words, and phrases; words constitute the smallest structural units that make up sentences and articles.
  • A sentence is the basic unit of language use. It is composed of words and phrases, and can express a complete meaning, such as telling someone something, asking a question, making a request or giving an order, expressing a certain emotion, or indicating the continuation or omission of a statement.
  • Figure 1 shows a schematic diagram of the workflow of a text intent recognition system.
  • the framework describes the overall workflow of the intelligent customer service system.
  • The text features include word-level features and sentence-level features; the fusion feature corresponding to the text to be recognized is obtained according to the text features of the text to be recognized and of each text in the text queue; the fusion feature is classified by intent to obtain the intent corresponding to the text to be recognized; finally, the intelligent customer service system can select an appropriate reply utterance according to the current process step and the customer's intent category.
  • FIG. 2 is a flowchart of a text intent recognition method, described here as applied to the intelligent customer service system of FIG. 1. The method includes the following steps:
  • S101 Acquire voice information and a text queue, and convert the voice information into text to be recognized.
  • The voice information input by the customer is acquired, and the intelligent customer service system converts it into the text to be recognized so as to classify the text's intent, that is, the corresponding customer demand.
  • For example, the user says "I want to listen to Jay Chou's songs"; the intelligent customer service system converts the customer's voice input into the text to be recognized so as to obtain the intent of listening to a song.
  • In one embodiment, the speech recognition algorithm wav2letter++ is used to convert the customer's voice input into the corresponding text to be recognized.
  • a text queue is obtained, where the text queue includes one or more texts.
  • the text queue can hold k pieces of text.
  • the way to add text in the text queue is: after the voice information is converted into the text to be recognized, when the number of texts in the text queue is less than k, the text to be recognized is added to the text queue.
  • the k pieces of text are arranged in the order of adding time; when the number of texts in the text queue is equal to k, the first text added to the text queue is deleted, and the text to be recognized is added to the text queue.
  • For example, with k = 5, the texts in the text queue are sorted in order of entry as {1, 2, 3, 4, 5}, where 1 represents the first text added to the queue, and 2, 3, 4 and 5 follow in the same way.
  • When the queue is not yet full, texts to be recognized are added directly to the text queue in order.
  • When the queue is full, text 1 is deleted first, and then the text to be recognized is added.
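The queue maintenance described above can be sketched as follows; this is an illustrative sketch only, and the variable names and the choice k = 5 are assumptions rather than part of the embodiment:

```python
from collections import deque

# Fixed-size text queue holding at most k texts, ordered by adding time.
# deque(maxlen=k) deletes the first-added text automatically when a new
# text is appended to a full queue, matching the behavior described above.
k = 5
text_queue = deque(maxlen=k)

for i in range(1, 7):                 # add six recognized texts in order
    text_queue.append(f"text {i}")    # "text 1" is evicted on the sixth add

print(list(text_queue))               # ['text 2', 'text 3', 'text 4', 'text 5', 'text 6']
```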
  • For the text to be recognized and each text in the text queue, the word-level features are extracted first; then m attention models are used to extract the sentence-level features; finally, the word-level features and sentence-level features corresponding to the text to be recognized are combined as its features.
  • the specific steps for extracting features of the text to be recognized and each text in the text queue include:
  • the first step is to extract features at the word level
  • A word segmentation tool is used to perform word segmentation to obtain the words x, where the word segmentation tool can be jieba, SnowNLP, THULAC, NLPIR, etc.
  • The words x_i are obtained after word segmentation processing.
  • Each word is mapped to a word vector, and the n word vectors are concatenated to obtain the word vector matrix W_i of the i-th text as its word-level feature.
  • Processing the k texts in this way yields k word vector matrices {W_1, W_2, ..., W_k}. It is understandable that, after the same processing, the word-level feature W_{k+1} of the text to be recognized can be obtained.
  • The word embedding matrix V may be obtained by training the Word2vec model on 3 million pieces of text data, or by training another model, which is not limited in the embodiment of the present application.
  • Preprocessing may further include corpus cleaning, part-of-speech tagging, and stop-word removal, such as deleting noise data and removing modal particles according to a preset modal particle table, which is not limited in the embodiment of this application.
  • For example, the output after jieba word segmentation can be: "today", "every day", "weather", "how", "how", "ah", "?"; after part-of-speech tagging, the output can be: "today n", "every day v", "weather n", "how r", "how r", "ah y", "? vv", where n denotes a noun, v a verb, r a pronoun, y a modal particle, and vv a punctuation mark.
  • After removing modal particles and punctuation according to the preset modal particle list, the output can be: "today n", "every day v", "weather n", "how r", "how r". In this way, the n words of one sentence are obtained.
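The word-level extraction above can be sketched as follows. The sentence is assumed to be pre-segmented (in practice a tool such as jieba would produce the words), and the embedding matrix V here is a random stand-in rather than one trained with Word2vec:

```python
import numpy as np

# Toy vocabulary and a random stand-in for the word embedding matrix V.
vocab = {"today": 0, "weather": 1, "how": 2}
d = 8                                   # word-vector dimension (assumed)
rng = np.random.default_rng(0)
V = rng.normal(size=(len(vocab), d))

def word_level_feature(words):
    """Map each word to its vector via V and stack the n vectors into
    the word-vector matrix W_i (one row per word)."""
    return np.stack([V[vocab[w]] for w in words])

W_i = word_level_feature(["today", "weather", "how"])
print(W_i.shape)    # (n, d) = (3, 8)
```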
  • the second step is to extract features at the sentence level
  • m attention models are used to process the word vector matrix W_i to obtain m sentence-level features at different levels: u_{i,1} to u_{i,m}.
  • the output of the i-th attention model is used as the input of the i+1-th attention model, and i is a positive integer greater than or equal to 1 and less than m.
  • the output of the previous attention model in the m attention modules is used as the input of the next attention model.
  • The k texts can be processed in this way to obtain k sentence-level features {y_1, y_2, ..., y_k}.
  • Specifically, the word-level feature word vector matrix W_i is used as the input of the first attention model; the output of the first attention model is used as the input of the second attention model, and so on, each model's output in turn serving as the input of the next, until the m sentence-level features u_{i,1} to u_{i,m} at different levels are obtained.
  • Processing with m attention models in this way can obtain deeper semantic information.
  • The attention model can be understood as follows: the constituent elements of the Source are imagined as a series of <Key, Value> pairs. Given an element Query in the Target, the weight coefficient of each Key's corresponding Value is obtained by calculating the similarity or correlation between the Query and each Key; the Values are then weighted and summed to obtain the final Attention value.
  • the calculation process is as follows:
  • In the first step, the similarity or correlation is calculated from the Query and each Key, for example S(Q_t, K_i) = Q[t] · K[i]^T / √d, where t, i and j respectively index the words in Query, Key and Value, and d represents the dimension of the word vectors.
  • Q[t] · K[i]^T represents the dot product of Q[t] and K[i]^T, and the result S(Q_t, K_i) represents the similarity between the element Q[t] in the target and K[i] (with its corresponding V[j]) in the source, which captures the dependency relationships between the input words.
  • The method of calculating similarity or correlation above is only an example. In practical applications, the similarity or correlation can be computed as the vector dot product of the two, as the cosine similarity of the two vectors, or by introducing an additional neural network for evaluation, which is not limited in the embodiment of the present application.
  • In the second step, the raw score from the first step is normalized to obtain the weight coefficient.
  • Since the value range of the score produced in the first step varies with the generation method, the second step introduces the softmax calculation to transform the score: on the one hand, the raw scores are normalized into a probability distribution in which all element weights sum to 1; on the other hand, softmax's internal mechanism also highlights the weights of important elements.
  • The specific calculation is a_i = softmax(S(Q_t, K_i)), where a_i represents the weight coefficient of the Value corresponding to K_i.
  • In the third step, the Values are weighted and summed according to the weight coefficients, V_att = Σ_i a_i · V[i], where n_Q represents the number of words in Q and V_att represents the final Attention value for the element Q[t].
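The three steps can be sketched as scaled dot-product attention in Python; the scaling by √d follows the dimension d mentioned in the first step, and the shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Step 1: similarity scores S = Q K^T / sqrt(d).
    Step 2: softmax turns each row of S into weights summing to 1.
    Step 3: weighted sum of the Value rows gives the Attention values."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    a = softmax(S, axis=-1)
    return a @ V

rng = np.random.default_rng(1)
W_i = rng.normal(size=(3, 8))      # word-vector matrix of one text (3 words, d = 8)
u = attention(W_i, W_i, W_i)       # self-attention: Q = K = V = W_i
print(u.shape)                     # (3, 8)
```

Feeding each output back in as the next model's input (u becoming the new Q, K and V) and repeating m times corresponds to the stacked processing that yields the m sentence-level features u_{i,1} to u_{i,m}.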
  • the third step is to obtain text features based on word-level features and sentence-level features
  • The word-level feature and the sentence-level feature of the i-th text are combined as the feature of the i-th text: [W_i, y_i]; for the k texts in the text queue, k text features are obtained. It is understandable that, for the text to be recognized, its word-level feature and sentence-level feature can likewise be combined as its feature: [W_{k+1}, y_{k+1}].
  • The embodiment of the application uses m attention models to process the word vector matrix of each text and the word vector matrix of the text to be recognized, with the output of the previous attention model used as the input of the next model, to obtain the sentence-level feature of each text.
  • The Deep Attention Matching (DAM) algorithm is then applied to the k obtained text features and the feature of the text to be recognized, matching on the word-level feature W and the sentence-level feature y, to obtain the fusion feature of the text to be recognized.
  • The word-level features of the text to be recognized are matched with the word-level features of the k obtained texts to obtain word-level matching results, and the word-level matching results are merged to obtain the first fusion feature;
  • the sentence-level features of the text to be recognized are matched with the sentence-level features of the k texts to obtain sentence-level matching results, and the sentence-level matching results are fused to obtain the second fusion feature.
  • The first fusion feature and the second fusion feature are then fused to obtain the fusion feature corresponding to the text to be recognized.
  • The idea of the DAM algorithm is to select the best-matching response from a set of candidate responses given a dialogue context. Specifically, first, each word in the context text or the response text is regarded as the central meaning of an abstract semantic segment, and stacked attention is used to construct text representations of different granularities; second, taking text relevance and dependency information into account, segment matching at different granularities is used to match each text in the context against the response.
  • The DAM algorithm captures matching information between the context and the response from the word level to the sentence level, and then extracts the important information through convolution and max-pooling operations.
  • The matching features are finally fused into a single matching score through a single-layer perceptron. In this way, feature fusion of different texts at different granularities makes full use of historical semantic information and achieves contextual information fusion.
  • The specific steps of the DAM algorithm are as follows: first, layered attention is used to construct text representations of different granularities; second, the truly matched segments are extracted from the entire context and response.
  • the DAM algorithm model framework can be: representation-matching-aggregation.
  • the following uses sentence-level feature matching as an example to introduce the DAM algorithm.
  • The first layer of the DAM algorithm is the word embedding layer, which takes the sentence-level feature y_{k+1} of the text to be recognized and the sentence-level features y_1, y_2, ..., y_k of the k texts as its input.
  • The columns of the matrix y correspond to the word-vector dimension, and its rows correspond to the text length.
  • the second layer of the DAM algorithm is the presentation layer, and the role of the presentation layer is to construct semantic representations of different granularities.
  • The presentation layer has L layers: L identical self-attention layers are stacked for processing.
  • The input of the l-th layer is the output of the (l-1)-th layer, and the input semantic vectors can be combined into a multi-granularity representation.
  • the multi-granularity representation process is as follows:
  • Attentive(·) represents the attention function; the multi-granularity representations of y_i and y_{k+1} are gradually constructed as y_i^l and y_{k+1}^l, where l ∈ {0, ..., L-1} represents the different granularities.
  • The third layer of the DAM algorithm is the matching layer: for the multi-granularity representations y_i^l and y_{k+1}^l output by the presentation layer, a self-attention matching matrix and a cross-attention matching matrix are constructed at each granularity l, and multi-granularity matching is performed to obtain the matching features.
  • the self-attention matching process is as follows:
  • Each element is computed from the k-th embedding in y_i^l and the t-th embedding in y_{k+1}^l, and reflects the textual relevance of the k-th segment in y_i and the t-th segment in y_{k+1} at the l-th granularity.
  • the cross-attention matching matrix is based on the cross-attention module, and the specific process is as follows:
  • the fourth layer of the DAM algorithm is the aggregation layer.
  • DAM finally aggregates all the segment matching degrees between the k texts in the text queue and the text to be recognized into a 3D matching image Q.
  • The specific process is as follows: f_match(·) represents the matching function, M and b are learnable parameters, and σ(·) is the sigmoid function.
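A toy sketch of the matching-layer computation may help; the shapes, the single granularity, and the max operation standing in for the convolution, pooling and perceptron stages are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
y_i = rng.normal(size=(4, 8))   # multi-granularity representation of one queue text (4 segments)
y_q = rng.normal(size=(5, 8))   # representation of the text to be recognized (5 segments)

# Self-attention matching matrix at one granularity l: element [k, t] relates
# the k-th embedding of y_i to the t-th embedding of y_q, reflecting the
# textual relevance of the corresponding segments.
M_self = y_i @ y_q.T

# Stacking such matrices over all k texts and all L granularities gives the
# 3D matching image Q; convolution and max pooling would then extract the
# important information. A plain max is used here as a crude stand-in.
score = float(M_self.max())
print(M_self.shape)             # (4, 5)
```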
  • The fusion feature is then classified by intent to obtain the intent corresponding to the text to be recognized.
  • In one embodiment, a two-layer convolutional neural network is further applied to the fused feature for deeper feature extraction and dimensionality reduction, and finally the softmax function is used for intent classification to obtain the intent corresponding to the text to be recognized.
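A minimal stand-in for this classification head is sketched below; for brevity the two convolutional layers are replaced by a single dense layer, and the intent labels and weights are assumptions drawn from the examples in this application:

```python
import numpy as np

INTENTS = ["check weather", "set alarm", "order meal", "order ticket", "play song"]

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_intent(fused, W1, W2):
    """Deeper feature extraction and dimensionality reduction (a ReLU dense
    layer as a stand-in for the two conv layers), then softmax classification."""
    h = np.maximum(0.0, fused @ W1)
    probs = softmax(h @ W2)
    return INTENTS[int(np.argmax(probs))]

rng = np.random.default_rng(3)
fused = rng.normal(size=64)              # fusion feature of the text to be recognized
W1 = rng.normal(size=(64, 16))           # parameters would be learned; random here
W2 = rng.normal(size=(16, len(INTENTS)))
print(classify_intent(fused, W1, W2))
```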
  • the type of intent is preset in the intelligent customer service system.
  • The intent categories can be set to, but are not limited to, checking the weather, setting an alarm clock, ordering meals, booking tickets, playing songs, and so on.
  • If the customer inputs "I want to listen to Jay Chou's songs", it can be classified as the play-song intent; if the customer inputs "How is the weather today", it can be classified as the check-weather intent; and if the customer inputs "Help me set an alarm for 6 o'clock tomorrow morning", it can be classified as the set-alarm intent.
  • S105 Perform a corresponding action according to the intent corresponding to the text to be recognized.
  • the customer service system selects an appropriate reply utterance in the corpus to reply according to the current process link and the customer's intention category.
  • the utterances in the corpus are preset by the system.
  • For example, if the customer enters "I am in a great mood today", the intent can be classified as a mood intent.
  • The customer service system finds the corpus for the mood intent in the preset corpus and selects appropriate words to reply to the customer, such as "What put you in such a good mood? Hurry up and share it with me.".
  • The smart customer service system in the embodiment of the present application is merely an example; it does not constitute any limitation on the function and application scope of the present application.
  • the text intention recognition method provided in this application can also be applied to electronic devices such as mobile phones and computers.
  • the text intent recognition method provided in this application is also suitable for recognizing user query intent based on one or more voices input by the user.
  • FIG. 4 is a schematic structural diagram of a text intention recognition apparatus provided by an embodiment of the present application.
  • the system 400 of this embodiment includes:
  • the acquiring unit 401 is used to acquire voice information and text queues
  • the preprocessing unit 402 is used to convert voice information into text to be recognized and add it to the text queue;
  • the feature extraction unit 403 is configured to extract features of the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text;
  • the fusion unit 404 is configured to obtain the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text;
  • the classification unit 405 is configured to classify the fusion features through the intent classification model for intent classification to obtain the intent corresponding to the text to be recognized.
  • FIG. 5 is a schematic structural diagram of a feature extraction unit provided by an embodiment of the present application.
  • The feature extraction unit 403 includes a first extraction unit 4031, a second extraction unit 4032, and a merging unit 4033.
  • the first extraction unit 4031 is used to extract the word-level features using the word embedding matrix for the text to be recognized and each text in the text queue;
  • the second extraction unit 4032 is configured to use multiple attention models to extract sentence-level features for the text to be recognized and each text in the text queue;
  • The merging unit 4033 is used to combine the word-level features and sentence-level features as the features of the text to be recognized.
  • The preprocessing unit 402 is configured to perform speech recognition with the wav2letter++ algorithm after the customer's voice information is acquired, converting the customer's voice input into the corresponding text to be recognized.
  • a text queue is obtained, where the text queue includes one or more texts.
  • the text queue can hold k pieces of text.
  • the way to add text in the text queue is: after the voice information is converted into the text to be recognized, when the number of texts in the text queue is less than k, the text to be recognized is added to the text queue.
  • the k pieces of text are arranged in the order of adding time; when the number of texts in the text queue is equal to k, the first text added to the text queue is deleted, and the text to be recognized is added to the text queue.
  • The first extraction unit 4031 is used to: first, use a word segmentation tool to perform word segmentation on each text in the text queue to obtain the words x, where the word segmentation tool can be jieba, SnowNLP, THULAC, NLPIR, etc.
  • The words x_i are obtained after word segmentation processing.
  • Each word is mapped to a word vector, and the n word vectors are concatenated to obtain the word vector matrix W_i of the i-th text as its word-level feature.
  • Processing the k texts in this way yields k word vector matrices {W_1, W_2, ..., W_k}. It is understandable that, after the same processing, the word-level feature W_{k+1} of the text to be recognized can be obtained.
  • The word embedding matrix V may be obtained by training the Word2vec model on 3 million pieces of text data, or by training another model, which is not limited in the embodiment of the present application.
  • Preprocessing may further include corpus cleaning, part-of-speech tagging, and stop-word removal, such as deleting noise data and removing modal particles according to a preset modal particle table, which is not limited in the embodiment of this application.
  • The second extraction unit 4032 is configured to process the word-level feature word vector matrix W_i extracted from the i-th text with m attention models to obtain m sentence-level features at different levels: u_{i,1} to u_{i,m}.
  • the output of the i-th attention model is used as the input of the i+1-th attention model, and i is a positive integer greater than or equal to 1 and less than m.
  • the output of the previous attention model in the m attention modules is used as the input of the next attention model.
  • The k texts can be processed in this way to obtain k sentence-level features {y_1, y_2, ..., y_k}.
  • Specifically, the word-level feature word vector matrix W_i is used as the input of the first attention model; the output of the first attention model is used as the input of the second attention model, and so on, each model's output in turn serving as the input of the next, until the m sentence-level features u_{i,1} to u_{i,m} at different levels are obtained.
  • Processing with m attention models in this way can obtain deeper semantic information.
  • The merging unit 4033 is used to combine the word-level feature and the sentence-level feature of the i-th text as the feature of the i-th text: [W_i, y_i]; for the k texts in the text queue, k text features are obtained. It is understandable that, for the text to be recognized, its word-level feature and sentence-level feature can likewise be combined as its feature: [W_{k+1}, y_{k+1}].
  • The fusion unit 404 is configured to use the DAM algorithm to match the k obtained text features and the feature of the text to be recognized on the word-level feature W and the sentence-level feature y, to obtain the fusion feature of the text to be recognized.
  • the word-level features of the text to be recognized are matched with the word-level features of the obtained k texts to obtain word-level matching results, and the word-level matching results are merged to obtain the first fusion feature;
  • the sentence-level features of the text to be recognized are matched with the sentence-level features of the k texts to obtain sentence-level matching results, and the sentence-level matching results are fused to obtain the second fusion feature.
  • the first fusion feature and the second fusion feature are fused to obtain the fusion feature corresponding to the text to be recognized.
  • In one embodiment, a two-layer convolutional neural network is further applied to the fused feature for deeper feature extraction and dimensionality reduction, and finally the softmax function is used for intent classification to obtain the intent corresponding to the text to be recognized.
  • the type of intention is preset in the intelligent customer service system.
  • the intention classification is set to, but not limited to, checking the weather, setting an alarm clock, ordering meals, ordering tickets, broadcasting songs, and so on.
  • If the customer inputs "I want to listen to Jay Chou's songs", it can be classified as the play-song intent; if the customer inputs "How is the weather today", it can be classified as the check-weather intent; and if the customer inputs "Help me set an alarm for 6 o'clock tomorrow morning", it can be classified as the set-alarm intent.
  • An embodiment of the present application provides an electronic device, which may perform the text intent recognition method of any of the foregoing embodiments of the present application.
  • the electronic device may be, for example, a terminal device or a server or other devices.
  • the embodiment of the present application also provides another electronic device, including:
  • the processor and the memory, where the processor executes the code in the memory, thereby completing the operations of the text intent recognition method according to any of the foregoing embodiments of the present application.
  • FIG. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device may be the aforementioned text intent recognition device.
  • FIG. 6 shows a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server in the embodiments of the present application.
  • the electronic device includes: one or more processors 601; one or more input devices 602, one or more output devices 603, and a memory 604.
  • the aforementioned processor 601, input device 602, output device 603, and memory 604 are connected via a bus 605.
  • the memory 604 is used to store instructions, and the processor 601 is used to execute the instructions stored in the memory 604.
  • the processor 601 is configured to call program instructions to execute:
  • the fusion feature corresponding to the text to be recognized is obtained;
  • the fusion features are classified by the intention classification model, and the intent corresponding to the text to be recognized is obtained.
  • the processor 601 may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the input device 602 may include a camera, where the camera has a function of storing image files and a function of transmitting image files.
  • the output device 603 may include a display, a hard disk, a USB flash drive, and the like.
  • the memory 604 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601. A part of the memory 604 may also include a non-volatile random access memory. For example, the memory 604 may also store device type information.
  • the processor 601, the input device 602, and the output device 603 described in the embodiments of the present application can execute the implementations described in the various embodiments of the text intent recognition method and system provided in the embodiments of the present application; details are not repeated here.
  • a computer-readable storage medium stores a computer program.
  • the computer program includes program instructions which, when executed, perform the following: acquire voice information and a text queue, and convert the voice information into text to be recognized, where the text queue includes one or more texts; extract features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtain the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and classify the fusion feature through the intent classification model to obtain the intent corresponding to the text to be recognized.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • the computer-readable storage medium may be an internal storage unit of the electronic device of any of the foregoing embodiments, such as a hard disk or memory of a terminal.
  • the computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk equipped on the terminal, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), and the like.
  • the computer-readable storage medium may also include both an internal storage unit of an electronic device and an external storage device.
  • the computer-readable storage medium is used to store computer programs and other programs and data required by electronic devices.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
  • the disclosed server, device, and method may be implemented in other ways.
  • the server embodiments described above are only illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • when the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

A text intent recognition method, comprising: acquiring speech information and a text queue, and converting the speech information into text to be recognized (S101); extracting features from the text to be recognized and each piece of text in the text queue, so as to obtain text features of the text to be recognized and text features corresponding to each piece of text (S102); according to the text features of the text to be recognized and the text features of each piece of text, obtaining fused features of the text to be recognized (S103); and performing intent classification on the fused features to obtain an intent corresponding to the text to be recognized (S104). Context matching information between the text to be recognized and each piece of text in the text queue is captured from the word level to the sentence level. In this way, feature fusion is performed on different pieces of text at different granularities, such that historical semantic information can be fully used to fuse context information, and word-level and sentence-level features are combined to obtain a more discriminative feature, thereby improving the accuracy of text intent recognition.

Description

Text intent recognition method, apparatus, and related equipment
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 20, 2020 with application number 202011309413.X and entitled "Text Intent Recognition Method, Apparatus, and Related Equipment", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a text intent recognition method, apparatus, and related equipment.
Background
With the increasing popularity of computer network technology, text intent recognition is widely used in products such as intelligent voice assistants and intelligent dialogue robots. In order to better understand customer needs, respond more accurately, and improve customer satisfaction, a machine dialogue system is required to accurately and completely identify the actual intent of a passage sent by the customer.
At present, text intent recognition mainly operates on the recognized text obtained by performing speech recognition on the customer's voice in an intelligent customer service system; the system determines the meaning expressed by the customer by recognizing the intent of the text, and then replies to the customer with the text matching that intent. The inventor realized that using only a single sentence for intent recognition may identify a wrong intent. For example, the customer's current utterance may take the previous few sentences as its premise; when the premise is not satisfied, the intent expressed by the current sentence may be completely different. This leads to wrong replies from the customer service robot, which not only degrades the customer experience but also provides customers with wrong services.
Summary
The present application provides a text intent recognition method and system, which effectively solves the problem of intent recognition errors caused by the complexity and diversity of dialogue content in previous single-sentence dialogue intent recognition.
In the first aspect, an embodiment of the present application provides a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
In the second aspect, an embodiment of the present application provides a text intent recognition apparatus, including: an acquisition unit for acquiring voice information and a text queue; a preprocessing unit for converting the voice information into text to be recognized and adding it to the text queue; a feature extraction unit for extracting features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; a fusion unit for obtaining the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and a classification unit for classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
In the third aspect, an embodiment of the present application provides a text intent recognition device, including a processor and a memory, where the processor executes the code in the memory to perform a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
In the fourth aspect, a computer-readable storage medium includes instructions which, when run on a computer, cause the computer to execute a text intent recognition method, including: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text; and classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
The embodiments of the application capture context matching information between the text to be recognized and each text in the text queue from the word level to the sentence level. Performing feature fusion on different texts at different granularities in this way makes full use of historical semantic information and achieves the fusion of context information; combining word-level features and sentence-level features yields a more discriminative feature, which improves the accuracy of text intent recognition.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or the background art more clearly, the drawings needed in the embodiments of the present application or the background art are described below.
FIG. 1 is a schematic diagram of the workflow of an intelligent customer service system for text intent recognition provided by an embodiment of the present application;

FIG. 2 is a flowchart of a text intent recognition method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a model for extracting sentence-level features of text provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a text intent recognition apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of feature extraction provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a text intent recognition device provided by an embodiment of the present application.
Detailed Description
The terms used in the embodiments of the application are only used to explain the specific embodiments of the application and are not intended to limit the application.
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "including" and "comprising" indicate the existence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" are intended to include the plural forms.
It should be further understood that the term "and/or" used in the specification and appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
The technical solution of the present application may involve the fields of artificial intelligence and/or big data technology, and may be used in scenarios such as financial technology, for example intelligent question answering in a banking system, to realize intent recognition. Optionally, the data involved in this application, such as voice, text, and/or intent information, may be stored in a database or in a blockchain, which is not limited in this application.
The text mentioned in this embodiment includes words or sentences. "Words" is a collective name for words and phrases, including words (single words and compound words) and word groups (also known as phrases), the smallest word-composition units of a sentence. A sentence is the basic unit of language use; it is composed of words and phrases and can express a complete meaning, such as telling someone something, asking a question, expressing a request or prohibition, expressing some emotion, or expressing the continuation or omission of a passage.
First, the intelligent customer service system for text intent recognition involved in the embodiments of the present application is described.
FIG. 1 shows a schematic diagram of the workflow of a text intent recognition system; this framework describes the overall workflow of the intelligent customer service system. In this embodiment, the customer's voice information is first acquired and speech recognition is performed to obtain the text to be recognized, and the text to be recognized is added to a text queue, where the text queue includes one or more texts to be recognized. Then features are extracted from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature of each text in the text queue, where the text features include word-level features and sentence-level features. According to the text feature of the text to be recognized and the text feature of each text in the text queue, the fusion feature corresponding to the text to be recognized is obtained; the fusion feature is classified by intent to obtain the intent corresponding to the text to be recognized. Finally, the intelligent customer service system can select appropriate reply words according to the current process link and the customer intent category.
In a specific embodiment, as shown in FIG. 2, a flowchart of a text intent recognition method is provided. Taking the application of this method to the intelligent customer service system in FIG. 1 as an example, the method includes the following steps:
S101: Acquire voice information and a text queue, and convert the voice information into text to be recognized.
In the specific embodiment of the present application, the voice information input by the customer is acquired; the intelligent customer service system converts the voice information into text to be recognized and obtains the text classification intent, that is, the corresponding customer demand intent. For example, the user inputs "I want to listen to Jay Chou's songs", and the intelligent customer service system converts the voice input by the customer into text to be recognized, obtaining the intent of listening to a song. After the customer's voice information is acquired, the wav2letter++ speech recognition algorithm is used to convert the voice input by the customer into the corresponding text to be recognized. At the same time, a text queue is acquired, where the text queue includes one or more texts. The text queue can hold k texts, and text is added to the queue as follows: after the voice information is converted into text to be recognized, if the number of texts in the text queue is less than k, the text to be recognized is added to the text queue, and the k texts in the queue are arranged in the order in which they were added; if the number of texts in the text queue is equal to k, the text that was added to the queue first is deleted, and the text to be recognized is then added to the queue.
For example, if the size of the text queue is 5, the texts in the queue, sorted in order of entry, are {1, 2, 3, 4, 5}, where 1 represents the first text added to the queue, and 2, 3, 4, 5 follow by analogy. When the number of texts in the queue does not exceed 5, the text to be recognized is added directly to the queue in order; when the number of texts in the queue is 5, text 1 is deleted first, and then the text to be recognized is added.
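The queue behavior in this example can be sketched directly. The capacity K = 5 and the helper name `add_text` are illustrative choices:

```python
from collections import deque

K = 5  # capacity k of the text queue

def add_text(queue: deque, text: str) -> None:
    """Add a newly recognized text; evict the oldest text when the queue is full."""
    if len(queue) == K:
        queue.popleft()        # delete the text that entered the queue first
    queue.append(text)

queue = deque()
for t in ["1", "2", "3", "4", "5", "6"]:
    add_text(queue, t)
# after adding six texts to a queue of capacity 5, text "1" has been evicted
```

The same behavior can also be obtained with `deque(maxlen=5)`, which evicts the oldest element automatically on append.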
S102: Extract features from the text to be recognized and each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text.
In a specific embodiment, for the text to be recognized and each text in the text queue: first, word-level features are extracted; then, m attention models are used to extract sentence-level features; finally, the word-level features and sentence-level features corresponding to the text to be recognized are combined as the features of the text to be recognized.
The specific steps for extracting features from the text to be recognized and each text in the text queue include:
The first step is to extract word-level features.
Specifically, first, for each text in the text queue, a word segmentation tool is used to perform word segmentation to obtain the words x, where the word segmentation tool can be jieba, SnowNLP, THULAC, NLPIR, etc.; this does not constitute any limitation in the embodiments of the present application. For the i-th recognized text in the text queue, the words x_i can be obtained after word segmentation. Then, the n words x_i in the i-th recognized text are mapped to the word embedding matrix V to obtain n word vectors V(x_i). Finally, the n word vectors are concatenated to obtain the word vector matrix W_i of the i-th recognized text as the word-level feature. After the k texts are processed, k word vector matrices {W_1, W_2, …, W_k} can be obtained. It is understandable that, after the above processing of the text to be recognized, the word-level feature W_{k+1} of the text to be recognized can be obtained.
The word embedding matrix V may be obtained by training a Word2vec model on 3 million pieces of text data, or by training other models; the embodiments of this application do not limit this. Before or after word segmentation, corpus cleaning, part-of-speech tagging, and stop-word removal may be performed, such as deleting noise data or removing modal particles according to a preset modal particle table, which the embodiments of this application likewise do not limit.
For example, taking jieba word segmentation as an example, when the input recognized text is "今天天气怎么样呀?" ("How is the weather today?"), the output after jieba word segmentation can be: "today", "every day", "weather", "how", "how about", "ya", "?". After part-of-speech tagging, the output can be: "today n", "every day v", "weather n", "how r", "how about r", "ya y", "? vv", where n denotes a noun, v a verb, r a pronoun, y a modal particle, and vv a punctuation mark. After stop-word removal, the output can be: "today n", "every day v", "weather n", "how r", "how about r". The modal particles can be removed according to a preset modal particle table. In this way, the n words of one sentence can be obtained.
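The word-level extraction described above (segment the text, look up each word in the embedding matrix V, stack the vectors into W_i) might be sketched as follows. The toy vocabulary, the random embedding matrix, and the English tokens are stand-ins for a trained Word2vec matrix and real segmenter output:

```python
import numpy as np

# toy stand-in for the trained Word2vec embedding matrix V (hypothetical vocab)
vocab = {"today": 0, "every day": 1, "weather": 2, "how": 3, "how about": 4}
rng = np.random.default_rng(0)
V = rng.normal(size=(len(vocab), 8))      # |vocab| x d embedding matrix, d = 8

def word_vector_matrix(tokens):
    """Map the n segmented words of one text to V and stack the
    resulting word vectors into the matrix W_i, shape (n, d)."""
    return np.stack([V[vocab[t]] for t in tokens if t in vocab])  # skip OOV

W_i = word_vector_matrix(["today", "weather", "how about"])  # shape (3, 8)
```

In a real system the lookup table would come from the trained model (e.g. gensim's Word2vec), and out-of-vocabulary handling would be more careful than silently skipping tokens.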
The second step is to extract sentence-level features.
Specifically, for the word-level feature word vector matrix W_i extracted from the i-th text, m attention models are used to process W_i, yielding m sentence features at different levels: u_{i,1} to u_{i,m}, where the output of the j-th attention model serves as the input of the (j+1)-th attention model, and j is a positive integer greater than or equal to 1 and less than m. That is, among the m attention modules, the output of the previous attention model is the input of the next. y_i = {u_{i,1}, …, u_{i,m}} is taken as the sentence-level feature of one text; after the k texts are processed, k sentence-level features {y_1, y_2, …, y_k} can be obtained. As shown in FIG. 3, the word-level feature word vector matrix W_i is the input of the first attention model, the output of the first attention model is the input of the second, and so on, with each model's output feeding the next, finally yielding m sentence features u_{i,1} to u_{i,m} at different levels. Processing with m attention models in this way can obtain deeper semantic information. It is understandable that, after the above processing of the text to be recognized, the sentence-level feature y_{k+1} = {u_{k+1,1}, …, u_{k+1,m}} of the text to be recognized can be obtained.
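The chained attention models might be sketched as below. Using plain self-attention (Q = K = V = the block input) for each model and mean pooling to obtain each u_{i,j} are assumptions, since the text does not fix these details:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_block(X):
    """One attention model; Q = K = V = X here (a simplifying assumption)."""
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def sentence_level_features(W_i, m=3):
    """Chain m attention models (the output of one feeds the next) and
    mean-pool each model's output into one sentence-level vector u_{i,j}."""
    feats, X = [], W_i
    for _ in range(m):
        X = attention_block(X)
        feats.append(X.mean(axis=0))      # u_{i,j}, shape (d,)
    return feats                          # y_i = {u_{i,1}, ..., u_{i,m}}

W_i = np.random.default_rng(0).normal(size=(6, 8))  # 6 words, d = 8
y_i = sentence_level_features(W_i, m=3)
```

Each of the m pooled vectors captures progressively higher-level interactions between the words, which is the "deeper semantic information" the passage refers to.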
The attention model can be understood as follows: imagine that the constituent elements in the Source are composed of a series of <Key, Value> pairs. Given an element Query in the Target, the similarity or correlation between the Query and each Key is computed to obtain the weight coefficient of the Value corresponding to each Key; the Values are then summed with these weights to obtain the final Attention value. The calculation process is as follows:
Step 1: Calculate the similarity or correlation between the Query and the Key;
S(Q_t, K_i) = Q[t] · K[i]^T / √d
where t, i, and j index the words in Query, Key, and Value respectively, and d is the word dimension. Q[t]·K[i]^T denotes the dot product of Q[t] and K[i]^T, and the result S(Q_t, K_i) represents the similarity between an element Q[t] in the Target and the Value V[j] corresponding to K[i] in the Source, capturing the dependency between input words. It should be understood that the way similarity or correlation is computed in this embodiment is only an example; in practice, similarity or correlation may be computed as the vector dot product of the two, as the cosine similarity of their vectors, by introducing an additional neural network for scoring, and so on, none of which is limited by the embodiments of this application.
Step 2: normalize the raw scores from step 1 to obtain the weight coefficients;
Specifically, the value range of the scores produced in step 1 differs depending on how they are generated, so step 2 introduces a softmax computation to transform them: on the one hand it normalizes the raw scores into a probability distribution in which the element weights sum to 1; on the other hand, softmax's intrinsic mechanism further accentuates the weights of important elements. The specific computation is as follows:
a_{t,i} = softmax(S(Q_t, K_i)) = exp(S(Q_t, K_i)) / Σ_j exp(S(Q_t, K_j))
where a_{t,i} is the weight matrix, representing the weight coefficient of the Value corresponding to each Key.
Step 3: compute the weighted sum of the Values according to the weight coefficients.
V_att = Σ_i a_{t,i} · V[i], for each t = 1, …, n_Q
where n_Q is the number of words in Q, and V_att is the final Attention value for the element Q[t].
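The three steps above can be sketched with toy Q, K, V values; in this sketch the step-1 similarity is a dot product scaled by √d (the √d scaling is an assumption consistent with d being the word dimension described above):

```python
import math

def attention(Q, K, V):
    """For each query Q[t]: score every key, softmax-normalize the scores,
    and return the weighted sum of the values (the final Attention value)."""
    d = len(K[0])
    result = []
    for q in Q:
        # Step 1: similarity S(Q_t, K_i) = Q[t]·K[i]^T / sqrt(d)
        scores = [sum(qv * kv for qv, kv in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 2: softmax normalization -> weight coefficients a_{t,i}
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        a = [e / sum(exps) for e in exps]
        # Step 3: weighted sum of the Values -> V_att
        result.append([sum(w * v[j] for w, v in zip(a, V)) for j in range(len(V[0]))])
    return result

Q = [[1.0, 0.0]]                    # one query word
K = [[1.0, 0.0], [0.0, 1.0]]        # two keys
V = [[1.0, 2.0], [3.0, 4.0]]        # their values
V_att = attention(Q, K, V)          # one attention vector per query word
```

The query aligns more strongly with the first key, so the output leans toward V[0] while still mixing in V[1] according to the softmax weights.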
Third step: obtain the text features from the word-level features and the sentence-level features
Specifically, the word-level features and the sentence-level features of the i-th recognized text are combined as the features of the i-th recognized text: [W_i, y_i]; for the k texts in the text queue, k text features are obtained. It can be understood that for the text to be recognized, combining its word-level features and sentence-level features as its features yields [W_{k+1}, y_{k+1}].
In this embodiment of the application, the word vector matrix of each text and the word vector matrix of the text to be recognized are processed by m attention models, with the output of each attention model serving as the input of the next, so that features at m different levels are obtained for each text and for the text to be recognized, yielding richer, multi-level, multi-granularity features.
S103: obtain the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text in the text queue.
In a specific embodiment, the k obtained text features and the feature of the text to be recognized are matched by the Deep Attention Matching (DAM) algorithm, separately on the word-level features W and the sentence-level features y, to obtain the fusion feature of the text to be recognized. Specifically, the word-level features of the text to be recognized are matched with the word-level features of the k obtained texts to produce word-level matching results, which are fused into a first fusion feature; the sentence-level features of the text to be recognized are matched with the sentence-level features of the k texts to produce sentence-level matching results, which are fused into a second fusion feature. The first fusion feature and the second fusion feature are then fused to obtain the fusion feature corresponding to the text to be recognized.
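As a rough sketch of this two-track matching and fusion (a heavy simplification: real DAM matching operates on matrices at stacked granularities, flattened here to single vectors and dot products for brevity), the toy code below matches the features of the text to be recognized against the k queued texts at the word level and at the sentence level, then concatenates the two matching results:

```python
def match_scores(query_feat, history_feats):
    # Dot-product matching of one feature vector against k history vectors.
    return [sum(a * b for a, b in zip(query_feat, h)) for h in history_feats]

def fuse(text_feat, history):
    # text_feat / history entries: (word_vec, sent_vec) pairs.
    w_q, s_q = text_feat
    word_match = match_scores(w_q, [w for w, _ in history])  # first fusion feature
    sent_match = match_scores(s_q, [s for _, s in history])  # second fusion feature
    return word_match + sent_match  # fuse the two into one feature

# k = 2 queued texts, each with toy (word-level, sentence-level) features
history = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [0.2, 0.8])]
fused = fuse(([1.0, 1.0], [0.4, 0.6]), history)
```

The fused vector carries one word-level and one sentence-level matching score per queued text, so the downstream classifier sees both granularities of context.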
The idea of the DAM algorithm is to select the best-matching response from a set of candidate responses given a dialogue context. Specifically, each word in the context text or the response text is first taken as the central meaning of an abstract semantic segment, and stacked attention is used to construct text representations at different granularities. Second, taking text relevance and dependency information into account, segment matching at different granularities is used to match each text in the context and the response; in this way, the DAM algorithm captures matching information between the context and the response from the word level up to the sentence level. Important matching features are then extracted by convolution and max-pooling operations, and finally fused into a single matching score by a single-layer perceptron. Fusing the features of different texts at different granularities in this way makes full use of historical semantic information and achieves contextual information fusion.
The specific steps of the DAM algorithm are: first, use stacked attention to construct text representations at different granularities; second, extract the truly paired segments across the entire context and response.
Specifically, the DAM model framework can be summarized as representation, matching, and aggregation. The DAM algorithm is introduced below using sentence-level feature matching as an example.
The first layer of the DAM algorithm is the word embedding layer, which takes the sentence-level feature y_{k+1} of the text to be recognized and the k sentence-level features y_1, y_2, …, y_k as input. The columns of a matrix y correspond to the word vector dimensions, and its rows to the length of the text.
The second layer of the DAM algorithm is the representation layer, whose role is to construct semantic representations at different granularities. The representation layer consists of L identical stacked self-attention layers; the input of the l-th layer is the output of the (l-1)-th layer, so the input semantic vectors can be combined into multi-granularity representations. The multi-granularity representation process is as follows:
y_i^l = Attentive(y_i^(l-1), y_i^(l-1), y_i^(l-1)), l = 1, …, L

y_{k+1}^l = Attentive(y_{k+1}^(l-1), y_{k+1}^(l-1), y_{k+1}^(l-1)), l = 1, …, L
where Attentive denotes the attention function. The multi-granularity representations of y_i and y_{k+1} are constructed step by step as {y_i^0, y_i^1, …, y_i^L} and {y_{k+1}^0, y_{k+1}^1, …, y_{k+1}^L}, where l ∈ {0, 1, …, L} indexes the different granularities (y^0 being the input of the first self-attention layer, so there are L+1 granularities in total).
The third layer of the DAM algorithm is the matching layer. For the multi-granularity representations y_i^l and y_{k+1}^l of each text output by the representation layer, a self-attention matching matrix M_self^{i,l} and a cross-attention matching matrix M_cross^{i,l} are constructed at each granularity l, and multi-granularity matching is performed to obtain the matching features.
The self-attention matching process is as follows:

M_self^{i,l} = { y_i^l[s]^T · y_{k+1}^l[t] }, s = 1, …, n_i, t = 1, …, n_{k+1}

where n_i and n_{k+1} denote the number of words in each sentence text. Each element of the matrix M_self^{i,l} is the dot product of the s-th embedding in y_i^l and the t-th embedding in y_{k+1}^l, reflecting the textual relevance, at the l-th granularity, between the s-th segment of y_i and the t-th segment of y_{k+1} (the segment index is written s here to avoid confusion with the queue size k).
The cross-attention matching matrix is based on a cross-attention module; the specific process is as follows:

ỹ_i^l = Attentive(y_i^l, y_{k+1}^l, y_{k+1}^l)

ỹ_{k+1}^l = Attentive(y_{k+1}^l, y_i^l, y_i^l)

M_cross^{i,l} = { ỹ_i^l[s]^T · ỹ_{k+1}^l[t] }, s = 1, …, n_i, t = 1, …, n_{k+1}

Through the attention module, y_i^l and y_{k+1}^l attend to each other, constructing two new representations ỹ_i^l and ỹ_{k+1}^l. M_cross^{i,l} captures the semantic structure spanning the k texts in the text queue and the text to be recognized. As a result, mutually dependent segments within the dialogue text lie close to each other in the representation, and the dot products between these latent internal dependencies are increased, providing dependency-aware matching information.
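A minimal sketch of the two matching matrices at one granularity l follows; the AttentiveModule is simplified here to scaled dot-product cross-attention, and all shapes and values are toy assumptions:

```python
import math

def dot_matrix(A, B):
    # Matching matrix: element [s][t] is the dot product A[s]·B[t].
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def attentive(Q, K, V):
    # Simplified AttentiveModule: scaled dot-product cross-attention.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d) for k in K]
        mx = max(scores)
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

y_i = [[1.0, 0.0], [0.0, 1.0]]                  # history text i: 2 words
y_next = [[1.0, 1.0], [0.5, 0.5], [0.0, 1.0]]   # text to be recognized: 3 words

# Self-attention matching: raw dot products between the two representations
M_self = dot_matrix(y_i, y_next)
# Cross-attention matching: each side first attends over the other, then match
M_cross = dot_matrix(attentive(y_i, y_next, y_next),
                     attentive(y_next, y_i, y_i))
```

Both matrices have one row per word of the history text and one column per word of the text to be recognized, so stacking them over granularities yields the multi-channel "image" aggregated in the next step.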
The fourth layer of the DAM algorithm is the aggregation layer. DAM finally aggregates all segment matching degrees between the k texts in the text queue and the text to be recognized into a 3D matching image Q; the specific process is as follows:

Q = { Q_{i,s,t} }, i = 1, …, k, s = 1, …, n_i, t = 1, …, n_{k+1}

Q_{i,s,t} = [ M_self^{i,l}[s,t] ⊕ M_cross^{i,l}[s,t] ], l = 0, …, L

where ⊕ denotes the concatenation operation, so each pixel has 2(L+1) channels storing the matching degree between a specific pair of segments at the different granularity levels. The DAM algorithm then uses two convolution layers with max-pooling to extract the important matching features f_match(y_i, y_{k+1}) from the whole image. Finally, a single-layer perceptron computes the matching score g(y_i, y_{k+1}) from the extracted matching features f_match(y_i, y_{k+1}); the specific process is as follows:

g(y_i, y_{k+1}) = σ(M · f_match(y_i, y_{k+1}) + b)

where f_match(·) denotes the matching function, M and b are learned parameters, and σ is the sigmoid function.
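The final scoring step can be sketched as follows; M, b, and the pooled feature vector are toy assumptions standing in for the learned parameters and the convolution/max-pooling output:

```python
import math

def matching_score(f_match, M, b):
    # g = sigmoid(M·f_match + b): single-layer perceptron on pooled features.
    z = sum(m * f for m, f in zip(M, f_match)) + b
    return 1.0 / (1.0 + math.exp(-z))

f = [0.2, 0.7, 0.1]                        # pooled matching features (toy)
g = matching_score(f, M=[1.0, 1.0, 1.0], b=0.0)  # score in (0, 1)
```

Because σ squashes the perceptron output, g always lies strictly between 0 and 1 and can be read as a matching probability.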
For brevity, only the matching and fusion of sentence-level features based on the DAM algorithm is described above; the word-level features are matched and fused based on the DAM algorithm in a similar way, which is not repeated here.
S104: classify the fusion feature by intent to obtain the intent corresponding to the recognized text.
In a specific embodiment, a two-layer convolutional neural network is further applied to the fused features for deeper feature extraction and dimensionality reduction, and finally the softmax function is used for intent classification to obtain the intent corresponding to the recognized text. The intent categories are preset in the intelligent customer service system.
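A hedged sketch of the classification step: a single linear layer plus softmax stands in for the two-layer CNN described above, and the weights and intent labels are invented for illustration:

```python
import math

def classify_intent(fused_feature, weights, intents):
    # Linear scoring per preset intent, then softmax over the scores.
    logits = [sum(w * f for w, f in zip(row, fused_feature)) for row in weights]
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return intents[probs.index(max(probs))], probs

intents = ["check_weather", "set_alarm", "play_song"]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy weights, one row per intent
intent, probs = classify_intent([2.0, 0.1], W, intents)
```

The softmax output is a probability distribution over the preset intent categories; the highest-probability category is taken as the recognized intent.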
Optionally, in a task-oriented robot customer service system, the intent categories are set to, but are not limited to, checking the weather, setting an alarm, ordering meals, booking tickets, playing songs, and so on. For example, if the customer enters "I want to listen to Jay Chou's songs", it can be classified as a play-song intent; if the customer enters "How is the weather today", it can be classified as a check-weather intent; if the customer enters "Set an alarm for 6 o'clock tomorrow morning for me", it can be classified as a set-alarm intent.
S105: perform the corresponding action according to the intent corresponding to the text to be recognized.
In a specific embodiment, after the classified intent is obtained, the customer service system selects an appropriate reply utterance from the corpus according to the current stage of the process flow and the customer's intent category. The utterances in the corpus are preset by the system. For example, after intent classification, the customer input "I am in a great mood today" can be classified as a mood intent; the customer service system finds the mood-intent material in the preset corpus and selects an appropriate utterance to reply to the customer, such as "What put you in such a good mood? Hurry up and share it with me."
It can be understood that the intelligent customer service system in the embodiments of this application is merely an example, and this example does not impose any specific limitation on the functions and scope of application of this application. The text intent recognition method provided by this application can also be applied to electronic devices such as mobile phones and computers. For example, in a search engine, the text intent recognition method provided by this application is also suitable for recognizing a user's query intent from one or more voice inputs by the user.
An embodiment of this application further provides a text intent recognition apparatus, which can be used to implement the foregoing text intent recognition method embodiments of this application. Specifically, referring to FIG. 4, FIG. 4 is a schematic structural diagram of a text intent recognition apparatus provided by an embodiment of this application. The system 400 of this embodiment includes:
an acquiring unit 401, configured to acquire voice information and a text queue;

a preprocessing unit 402, configured to convert the voice information into text to be recognized and add it to the text queue;

a feature extraction unit 403, configured to extract features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text;

a fusion unit 404, configured to obtain the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text;

a classification unit 405, configured to classify the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
In a specific implementation, referring to FIG. 5, FIG. 5 is a schematic structural diagram of a feature extraction unit provided by an embodiment of this application. The feature extraction unit 403 includes a first extraction unit 4031, a second extraction unit 4032, and a merging unit 4033:
the first extraction unit 4031 is configured to extract word-level features from the text to be recognized and from each text in the text queue using a word embedding matrix;

the second extraction unit 4032 is configured to extract sentence-level features from the text to be recognized and from each text in the text queue using multiple attention models;

the merging unit 4033 is configured to combine the word-level features and the sentence-level features as the features of the recognized text.
In a specific embodiment of the text intent recognition apparatus of this application, after the acquiring unit 401 obtains the customer's voice information, the wav2letter++ speech recognition algorithm is used for speech recognition to convert the voice input by the customer into the corresponding text to be recognized. At the same time, a text queue is acquired, where the text queue includes one or more texts. The text queue can hold k texts, and texts are added to it as follows: after the voice information is converted into the text to be recognized, if the number of texts in the text queue is less than k, the text to be recognized is added to the text queue, and the k texts in the queue are arranged in the order in which they were added; if the number of texts in the text queue is equal to k, the text that was added to the queue earliest is deleted, and the text to be recognized is added to the queue.
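The queue behavior described here maps directly onto a fixed-capacity FIFO; a minimal sketch (class and method names are illustrative, not from the patent):

```python
from collections import deque

class TextQueue:
    """Fixed-capacity text queue: holds at most k texts in arrival order;
    when full, the earliest-added text is dropped before appending."""
    def __init__(self, k):
        self.texts = deque(maxlen=k)  # deque evicts the oldest automatically

    def add(self, recognized_text):
        self.texts.append(recognized_text)

    def contents(self):
        return list(self.texts)

q = TextQueue(k=3)
for t in ["t1", "t2", "t3", "t4"]:
    q.add(t)
# "t1" (the earliest-added text) has been evicted to make room for "t4"
```

`deque(maxlen=k)` implements exactly the delete-oldest-then-append rule, so no explicit size check is needed.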
In a specific embodiment, the first extraction unit 4031 is configured to: first, perform word segmentation on each text in the text queue using a segmentation tool to obtain the words x, where the tool may be jieba, SnowNLP, THULAC, NLPIR, etc., none of which constitutes any limitation in the embodiments of this application. For the i-th recognized text in the text queue, the words x_i are obtained after segmentation. Then, the n words x_i of the i-th recognized text are mapped into the word embedding matrix V to obtain n word vectors V(x_i). Finally, the n word vectors are concatenated to obtain the word vector matrix W_i of the i-th recognized text as its word-level feature. Processing the k texts yields k word vector matrices {W_1, W_2, …, W_k}. It can be understood that, after the same processing, the word-level feature W_{k+1} of the text to be recognized is obtained.
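A toy sketch of this word-level feature extraction: the embedding dictionary V and whitespace segmentation below are stand-ins for a trained Word2vec embedding matrix and a Chinese segmenter such as jieba:

```python
# Hypothetical word embedding matrix V: word -> 3-dimensional vector.
V = {
    "check":   [0.1, 0.0, 0.2],
    "weather": [0.3, 0.1, 0.0],
    "today":   [0.0, 0.2, 0.1],
}

def word_level_features(text):
    # Whitespace split stands in for the word segmentation step.
    words = text.split()
    # Map each word into the embedding matrix and stack the vectors into W_i;
    # unknown words fall back to a zero vector (an assumption for this sketch).
    return [V.get(w, [0.0, 0.0, 0.0]) for w in words]

W_i = word_level_features("check weather today")  # 3-word x 3-dim matrix
```

Each row of `W_i` is one word vector V(x_i), and stacking them gives the word-level feature matrix described above.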
The word embedding matrix V may be obtained by training a Word2vec model on 3 million pieces of text data, or by training on other models, which is not limited in the embodiments of this application. Before or after word segmentation, corpus cleaning, part-of-speech tagging, and stop-word removal may be performed, such as deleting noisy data or removing modal particles according to a preset modal-particle table, which is likewise not limited in the embodiments of this application.
In a specific embodiment, the second extraction unit 4032 is configured to: for the word-level feature matrix W_i extracted from the i-th text, apply m attention models to the word vector matrix W_i to obtain sentence features at m different levels: u_{i,1} to u_{i,m}, where the output of each attention model serves as the input of the next. Taking y_i = {u_{i,1}, …, u_{i,m}} as the sentence-level feature of one text, processing the k texts yields k sentence-level features {y_1, y_2, …, y_k}. As shown in Figure 3, the word-level feature matrix W_i is the input of the first attention model, the output of the first attention model is the input of the second, and so on, finally yielding the m sentence features u_{i,1} to u_{i,m} at different levels; processing with m stacked attention models in this way captures deeper semantic information. It can be understood that, after the same processing, the sentence-level feature of the text to be recognized is obtained as y_{k+1} = {u_{k+1,1}, …, u_{k+1,m}}.
In a specific embodiment, the merging unit 4033 is configured to combine the word-level features and the sentence-level features of the i-th recognized text as the features of the i-th recognized text: [W_i, y_i]; for the k texts in the text queue, k text features are obtained. It can be understood that for the text to be recognized, combining its word-level features and sentence-level features yields [W_{k+1}, y_{k+1}].
In a specific embodiment, the fusion unit 404 is configured to match the k obtained text features and the feature of the text to be recognized using the DAM algorithm, separately on the word-level features W and the sentence-level features y, to obtain the fusion feature of the text to be recognized. Specifically, the word-level features of the text to be recognized are matched with the word-level features of the k obtained texts to produce word-level matching results, which are fused into a first fusion feature; the sentence-level features of the text to be recognized are matched with the sentence-level features of the k texts to produce sentence-level matching results, which are fused into a second fusion feature. The first fusion feature and the second fusion feature are then fused to obtain the fusion feature corresponding to the text to be recognized.
In a specific embodiment, a two-layer convolutional neural network is further applied to the fused features for deeper feature extraction and dimensionality reduction, and finally the softmax function is used for intent classification to obtain the intent corresponding to the recognized text. The intent categories are preset in the intelligent customer service system.
Optionally, in a task-oriented robot customer service system, the intent categories are set to, but are not limited to, checking the weather, setting an alarm, ordering meals, booking tickets, playing songs, and so on. For example, if the customer enters "I want to listen to Jay Chou's songs", it can be classified as a play-song intent; if the customer enters "How is the weather today", it can be classified as a check-weather intent; if the customer enters "Set an alarm for 6 o'clock tomorrow morning for me", it can be classified as a set-alarm intent.
In addition, an embodiment of this application provides an electronic device, which may implement the text intent recognition method of any of the foregoing embodiments of this application. Specifically, the electronic device may be, for example, a terminal device or a server.
An embodiment of this application further provides another electronic device, including:

a processor and a memory, where the processor executes code in the memory to perform the operations of the text intent recognition method of any of the foregoing embodiments of this application.
FIG. 6 is a structural block diagram of an electronic device provided by an embodiment of this application. The electronic device may be the aforementioned text intent recognition device. Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device suitable for implementing a terminal device or server of the embodiments of this application. As shown in FIG. 6, the electronic device includes: one or more processors 601; one or more input devices 602; one or more output devices 603; and a memory 604. The processor 601, input device 602, output device 603, and memory 604 are connected by a bus 605. The memory 604 is used to store instructions, and the processor 601 is used to execute the instructions stored in the memory 604. The processor 601 is configured to call program instructions to execute:
acquiring voice information and a text queue, and converting the voice information into text to be recognized;

extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text;

obtaining the fusion feature corresponding to the text to be recognized according to the text feature of the text to be recognized and the text feature corresponding to each text;

classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
It should be understood that in the embodiments of this application, the processor 601 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 602 may include a camera, where the camera has the functions of storing image files and transmitting image files; the output device 603 may include a display, a hard disk, a USB flash drive, and the like.
The memory 604 may include a read-only memory and a random access memory, and provides instructions and data to the processor 601. A part of the memory 604 may also include a non-volatile random access memory. For example, the memory 604 may also store information about the device type.
In specific implementations, the processor 601, input device 602, and output device 603 described in the embodiments of this application can execute the implementations described in the various embodiments of the text intent recognition method and system provided by the embodiments of this application, which are not repeated here.
In another embodiment of this application, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, implement: acquiring voice information and a text queue, and converting the voice information into text to be recognized, where the text queue includes one or more texts; extracting features from the text to be recognized and from each text in the text queue to obtain the text feature of the text to be recognized and the text feature corresponding to each text; obtaining the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and classifying the fusion feature through an intent classification model to obtain the intent corresponding to the text to be recognized.
Optionally, when executed by the processor, the program instructions may also implement the other steps of the methods in the foregoing embodiments, which are not repeated here. Further optionally, a storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
The computer-readable storage medium may be an internal storage unit of the electronic device of any of the foregoing embodiments, such as the hard disk or memory of a terminal. The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal. Further, the computer-readable storage medium may include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
A person of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the servers, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, or the implementations of the electronic device described in the embodiments of the invention may be performed, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed server, device, and method may be implemented in other ways. For example, the server embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist separately and physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A text intent recognition method, comprising:
    acquiring voice information and a text queue, and converting the voice information into text to be recognized, the text queue including one or more texts;
    extracting features from the text to be recognized and from each text in the text queue, to obtain a text feature of the text to be recognized and a text feature corresponding to each text;
    obtaining a fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and
    classifying the fusion feature with an intent classification model to obtain an intent corresponding to the text to be recognized.
  2. The method according to claim 1, wherein the text queue holds k texts, and after the converting of the voice information into the text to be recognized, the method further comprises:
    when the number of texts in the text queue is less than k, adding the text to be recognized to the text queue, the k texts in the text queue being arranged in the order in which they were added; and
    when the number of texts in the text queue is equal to k, deleting the text that was added to the text queue first, and adding the text to be recognized to the text queue.
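The queue maintenance described in this claim is exactly the behaviour of a fixed-capacity FIFO buffer, and can be sketched with Python's bounded deque (the capacity k = 3 is an arbitrary illustrative value, not one taken from the disclosure):

```python
from collections import deque

k = 3                          # queue capacity; illustrative value only
text_queue = deque(maxlen=k)   # a bounded deque evicts the oldest entry on append

for recognized_text in ["text1", "text2", "text3", "text4"]:
    # While fewer than k texts are present, the text is simply appended;
    # once the queue holds k texts, appending drops the earliest-added one.
    text_queue.append(recognized_text)

print(list(text_queue))        # remaining texts, in order of addition
```

After the loop, "text1" has been evicted and the queue holds the three most recently added texts, oldest first, matching the ordering required by the claim.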
  3. The method according to claim 1, wherein the extracting of features from the text to be recognized and each text in the text queue, to obtain the text feature of the text to be recognized and the text feature of each text, comprises:
    for the text to be recognized, extracting word-level features to obtain word-level features of the text to be recognized, extracting sentence-level features using m attention models to obtain sentence-level features of the text to be recognized, and using the word-level features of the text to be recognized and the sentence-level features of the text to be recognized as the text feature of the text to be recognized, wherein m is a positive integer greater than 1; and
    for each text, extracting word-level features to obtain word-level features of the text, extracting sentence-level features using the m attention models to obtain sentence-level features of the text, and using the word-level features of the text and the sentence-level features of the text as the text feature of the text.
  4. The method according to claim 3, wherein the extracting of word-level features for the text to be recognized, to obtain the word-level features of the text to be recognized, comprises:
    performing word segmentation on the text to be recognized using a word segmentation tool to obtain n words, mapping the n words into a word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text to be recognized as the word-level features of the text to be recognized; and
    wherein the extracting of word-level features for each text, to obtain the word-level features of each text, comprises:
    performing word segmentation on each text using the word segmentation tool to obtain n words, mapping the n words of the text into the word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text as the word-level features of the text.
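The word-level step of claims 3 and 4 (segment, look up each of the n words in the embedding matrix V, stack the n word vectors into an n × d matrix) can be sketched as follows. The whitespace split, toy vocabulary, and random matrix are placeholders for a real segmentation tool and trained embeddings:

```python
import random

random.seed(0)
# Hypothetical vocabulary; a real system would use a segmenter's vocabulary.
vocab = {"how": 0, "do": 1, "i": 2, "reset": 3, "my": 4, "password": 5}
d = 4  # embedding dimension; illustrative value
# Hypothetical word-embedding matrix V: one d-dimensional row per vocabulary word.
V = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in vocab]

def word_level_features(text):
    # Whitespace split stands in for the word-segmentation tool.
    words = text.lower().split()
    ids = [vocab[w] for w in words if w in vocab]
    # Stacking the n looked-up word vectors yields the n x d word vector
    # matrix used as the word-level feature.
    return [V[i] for i in ids]

matrix = word_level_features("How do I reset my password")
print(len(matrix), len(matrix[0]))  # n rows, d columns
```

For the six-word example sentence this produces a 6 × 4 matrix; in practice n varies per text while d is fixed by the embedding table.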
  5. The method according to claim 3, wherein the extracting of sentence-level features using m attention models, to obtain the sentence-level features of the text to be recognized, comprises:
    processing the word vector matrix of the text to be recognized with the m attention models to obtain features at m different levels, and using the features at the m different levels as the sentence-level features of the text to be recognized, wherein the output of the i-th attention model serves as the input of the (i+1)-th attention model, each of the m attention models outputs features at one level, and i is a positive integer greater than or equal to 1 and less than m.
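The chaining in this claim (the output of model i feeds model i+1, and each of the m models contributes one level of features) can be illustrated with a minimal scaled dot-product self-attention block. The real models are trained attention networks; this parameter-free toy version only shows the data flow:

```python
import math

def attention(X):
    # Minimal scaled dot-product self-attention over an n x d matrix X
    # (no learned projections, for illustration only).
    n, d = len(X), len(X[0])
    scores = [[sum(X[i][k] * X[j][k] for k in range(d)) / math.sqrt(d)
               for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        exps = [math.exp(s) for s in scores[i]]     # softmax over row i
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(weights[j] * X[j][k] for j in range(n))
                    for k in range(d)])
    return out

def sentence_level_features(X, m=3):
    levels = []
    for _ in range(m):       # the i-th model's output feeds the (i+1)-th model
        X = attention(X)
        levels.append(X)     # each model yields one level of features
    return levels

X = [[1.0, 0.0], [0.0, 1.0]]             # toy 2 x 2 word vector matrix
levels = sentence_level_features(X, m=3)
print(len(levels))                        # m levels of sentence features
```

Collecting the intermediate outputs, rather than keeping only the last one, is what gives the m distinct levels of sentence features the claim refers to.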
  6. The method according to claim 5, wherein the extracting of sentence-level features using m attention models, to obtain the sentence-level features of each text, comprises:
    processing the word vector matrix of each text with the m attention models to obtain features at m different levels, and using the features at the m different levels as the sentence-level features of the text.
  7. The method according to claim 3, wherein the obtaining of the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text in the text queue comprises:
    matching and fusing, at different granularities, the word-level features of the text to be recognized and the word-level features of each text using a deep attention matching (DAM) algorithm, to obtain a first fusion feature;
    matching and fusing, at different granularities, the sentence-level features of the text to be recognized and the sentence-level features of each text using the deep attention matching (DAM) algorithm, to obtain a second fusion feature; and
    fusing the first fusion feature and the second fusion feature to obtain the fusion feature of the text to be recognized.
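The three fusion steps of claim 7 can be illustrated with a toy similarity-based stand-in for DAM: match at the word level (first fusion feature), match at the sentence level (second fusion feature), then fuse the two, here simply by concatenation. The dot-product similarity and the concatenation are placeholders for the actual deep attention matching computation:

```python
def dot(u, v):
    # Toy similarity standing in for DAM's multi-granularity matching.
    return sum(a * b for a, b in zip(u, v))

def fuse(target_word, target_sent, ctx_words, ctx_sents):
    # First fusion feature: word-level match against each queued text.
    first = [dot(target_word, w) for w in ctx_words]
    # Second fusion feature: sentence-level match against each queued text.
    second = [dot(target_sent, s) for s in ctx_sents]
    # Final fusion of the two fusion features (concatenation, for illustration).
    return first + second

fused = fuse([1.0, 0.0], [0.5, 0.5],            # target word/sentence features
             [[1.0, 1.0], [0.0, 1.0]],          # word features of two queued texts
             [[1.0, 0.0], [0.0, 2.0]])          # sentence features of the same texts
print(fused)
```

The resulting vector carries one word-level and one sentence-level match score per queued text, which is the shape of information the intent classifier of claim 1 consumes.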
  8. A text intent recognition apparatus, comprising:
    an acquisition unit, configured to acquire voice information and a text queue;
    a preprocessing unit, configured to convert the voice information into text to be recognized and add it to the text queue;
    a feature extraction unit, configured to extract features from the text to be recognized and each text in the text queue, to obtain a text feature of the text to be recognized and a text feature corresponding to each text;
    a fusion unit, configured to obtain a fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and
    a classification unit, configured to classify the fusion feature with an intent classification model to obtain an intent corresponding to the text to be recognized.
  9. A text intent recognition device, comprising a processor and a memory, the processor executing code in the memory to perform the following method:
    acquiring voice information and a text queue, and converting the voice information into text to be recognized, the text queue including one or more texts;
    extracting features from the text to be recognized and from each text in the text queue, to obtain a text feature of the text to be recognized and a text feature corresponding to each text;
    obtaining a fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and
    classifying the fusion feature with an intent classification model to obtain an intent corresponding to the text to be recognized.
  10. The text intent recognition device according to claim 9, wherein the text queue holds k texts, and after the converting of the voice information into the text to be recognized, the processor is further configured to perform:
    when the number of texts in the text queue is less than k, adding the text to be recognized to the text queue, the k texts in the text queue being arranged in the order in which they were added; and
    when the number of texts in the text queue is equal to k, deleting the text that was added to the text queue first, and adding the text to be recognized to the text queue.
  11. The text intent recognition device according to claim 9, wherein the extracting of features from the text to be recognized and each text in the text queue, to obtain the text feature of the text to be recognized and the text feature of each text, comprises:
    for the text to be recognized, extracting word-level features to obtain word-level features of the text to be recognized, extracting sentence-level features using m attention models to obtain sentence-level features of the text to be recognized, and using the word-level features of the text to be recognized and the sentence-level features of the text to be recognized as the text feature of the text to be recognized, wherein m is a positive integer greater than 1; and
    for each text, extracting word-level features to obtain word-level features of the text, extracting sentence-level features using the m attention models to obtain sentence-level features of the text, and using the word-level features of the text and the sentence-level features of the text as the text feature of the text.
  12. The text intent recognition device according to claim 11, wherein the extracting of word-level features for the text to be recognized, to obtain the word-level features of the text to be recognized, comprises:
    performing word segmentation on the text to be recognized using a word segmentation tool to obtain n words, mapping the n words into a word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text to be recognized as the word-level features of the text to be recognized; and
    wherein the extracting of word-level features for each text, to obtain the word-level features of each text, comprises:
    performing word segmentation on each text using the word segmentation tool to obtain n words, mapping the n words of the text into the word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text as the word-level features of the text.
  13. The text intent recognition device according to claim 11, wherein the extracting of sentence-level features using m attention models, to obtain the sentence-level features of the text to be recognized, comprises:
    processing the word vector matrix of the text to be recognized with the m attention models to obtain features at m different levels, and using the features at the m different levels as the sentence-level features of the text to be recognized, wherein the output of the i-th attention model serves as the input of the (i+1)-th attention model, each of the m attention models outputs features at one level, and i is a positive integer greater than or equal to 1 and less than m.
  14. The text intent recognition device according to claim 11, wherein the obtaining of the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text in the text queue comprises:
    matching and fusing, at different granularities, the word-level features of the text to be recognized and the word-level features of each text using a deep attention matching (DAM) algorithm, to obtain a first fusion feature;
    matching and fusing, at different granularities, the sentence-level features of the text to be recognized and the sentence-level features of each text using the deep attention matching (DAM) algorithm, to obtain a second fusion feature; and
    fusing the first fusion feature and the second fusion feature to obtain the fusion feature of the text to be recognized.
  15. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the following method:
    acquiring voice information and a text queue, and converting the voice information into text to be recognized, the text queue including one or more texts;
    extracting features from the text to be recognized and from each text in the text queue, to obtain a text feature of the text to be recognized and a text feature corresponding to each text;
    obtaining a fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text; and
    classifying the fusion feature with an intent classification model to obtain an intent corresponding to the text to be recognized.
  16. The computer-readable storage medium according to claim 15, wherein the text queue holds k texts, and after the converting of the voice information into the text to be recognized, the instructions, when run on the computer, further cause the computer to perform:
    when the number of texts in the text queue is less than k, adding the text to be recognized to the text queue, the k texts in the text queue being arranged in the order in which they were added; and
    when the number of texts in the text queue is equal to k, deleting the text that was added to the text queue first, and adding the text to be recognized to the text queue.
  17. The computer-readable storage medium according to claim 15, wherein the extracting of features from the text to be recognized and each text in the text queue, to obtain the text feature of the text to be recognized and the text feature of each text, comprises:
    for the text to be recognized, extracting word-level features to obtain word-level features of the text to be recognized, extracting sentence-level features using m attention models to obtain sentence-level features of the text to be recognized, and using the word-level features of the text to be recognized and the sentence-level features of the text to be recognized as the text feature of the text to be recognized, wherein m is a positive integer greater than 1; and
    for each text, extracting word-level features to obtain word-level features of the text, extracting sentence-level features using the m attention models to obtain sentence-level features of the text, and using the word-level features of the text and the sentence-level features of the text as the text feature of the text.
  18. The computer-readable storage medium according to claim 17, wherein the extracting of word-level features for the text to be recognized, to obtain the word-level features of the text to be recognized, comprises:
    performing word segmentation on the text to be recognized using a word segmentation tool to obtain n words, mapping the n words into a word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text to be recognized as the word-level features of the text to be recognized; and
    wherein the extracting of word-level features for each text, to obtain the word-level features of each text, comprises:
    performing word segmentation on each text using the word segmentation tool to obtain n words, mapping the n words of the text into the word embedding matrix V to obtain n word vectors, and concatenating the n word vectors to obtain a word vector matrix of the text as the word-level features of the text.
  19. The computer-readable storage medium according to claim 17, wherein the extracting of sentence-level features using m attention models, to obtain the sentence-level features of the text to be recognized, comprises:
    processing the word vector matrix of the text to be recognized with the m attention models to obtain features at m different levels, and using the features at the m different levels as the sentence-level features of the text to be recognized, wherein the output of the i-th attention model serves as the input of the (i+1)-th attention model, each of the m attention models outputs features at one level, and i is a positive integer greater than or equal to 1 and less than m.
  20. The computer-readable storage medium according to claim 17, wherein the obtaining of the fusion feature of the text to be recognized according to the text feature of the text to be recognized and the text feature of each text in the text queue comprises:
    matching and fusing, at different granularities, the word-level features of the text to be recognized and the word-level features of each text using a deep attention matching (DAM) algorithm, to obtain a first fusion feature;
    matching and fusing, at different granularities, the sentence-level features of the text to be recognized and the sentence-level features of each text using the deep attention matching (DAM) algorithm, to obtain a second fusion feature; and
    fusing the first fusion feature and the second fusion feature to obtain the fusion feature of the text to be recognized.
PCT/CN2021/083876 2020-11-20 2021-03-30 Text intent recognition method and apparatus, and related device WO2021204017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011309413.XA CN112417855A (en) 2020-11-20 2020-11-20 Text intention recognition method and device and related equipment
CN202011309413.X 2020-11-20

Publications (1)

Publication Number Publication Date
WO2021204017A1 (en)

Family

ID=74774313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083876 WO2021204017A1 (en) 2020-11-20 2021-03-30 Text intent recognition method and apparatus, and related device

Country Status (2)

Country Link
CN (1) CN112417855A (en)
WO (1) WO2021204017A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947700A (en) * 2021-10-18 2022-01-18 北京百度网讯科技有限公司 Model determination method and device, electronic equipment and memory
CN114201970A (en) * 2021-11-23 2022-03-18 国家电网有限公司华东分部 Method and device for capturing power grid scheduling event detection based on semantic features
WO2023088280A1 (en) * 2021-11-19 2023-05-25 北京有竹居网络技术有限公司 Intention recognition method and apparatus, readable medium, and electronic device
CN117131902A (en) * 2023-10-26 2023-11-28 北京布局未来教育科技有限公司 Student intention recognition method based on intelligent teaching and computer equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417855A (en) * 2020-11-20 2021-02-26 平安科技(深圳)有限公司 Text intention recognition method and device and related equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376361A (en) * 2018-11-16 2019-02-22 北京九狐时代智能科技有限公司 A kind of intension recognizing method and device
CN110147445A (en) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 Intension recognizing method, device, equipment and storage medium based on text classification
CN110309287A (en) * 2019-07-08 2019-10-08 北京邮电大学 The retrieval type of modeling dialog round information chats dialogue scoring method
CN111221944A (en) * 2020-01-13 2020-06-02 平安科技(深圳)有限公司 Text intention recognition method, device, equipment and storage medium
US20200242486A1 (en) * 2019-01-29 2020-07-30 Ricoh Company, Ltd. Method and apparatus for recognizing intention, and non-transitory computer-readable recording medium
CN112417855A (en) * 2020-11-20 2021-02-26 平安科技(深圳)有限公司 Text intention recognition method and device and related equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947700A (en) * 2021-10-18 2022-01-18 北京百度网讯科技有限公司 Model determination method and device, electronic equipment and memory
WO2023088280A1 (en) * 2021-11-19 2023-05-25 北京有竹居网络技术有限公司 Intention recognition method and apparatus, readable medium, and electronic device
CN114201970A (en) * 2021-11-23 2022-03-18 国家电网有限公司华东分部 Method and device for power grid scheduling event detection based on semantic features
CN117131902A (en) * 2023-10-26 2023-11-28 北京布局未来教育科技有限公司 Student intention recognition method based on intelligent teaching and computer equipment
CN117131902B (en) * 2023-10-26 2024-02-27 北京布局未来科技发展有限公司 Student intention recognition method based on intelligent teaching and computer equipment

Also Published As

Publication number Publication date
CN112417855A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
WO2021204017A1 (en) Text intent recognition method and apparatus, and related device
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
WO2020244073A1 (en) Speech-based user classification method and device, computer apparatus, and storage medium
WO2020077895A1 (en) Signing intention determining method and apparatus, computer device, and storage medium
WO2021114840A1 (en) Scoring method and apparatus based on semantic analysis, terminal device, and storage medium
CN110263160B (en) Question classification method in computer question-answering system
WO2021000497A1 (en) Retrieval method and apparatus, and computer device and storage medium
WO2021190259A1 (en) Slot identification method and electronic device
WO2020233131A1 (en) Question-and-answer processing method and apparatus, computer device and storage medium
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
WO2022252636A1 (en) Artificial intelligence-based answer generation method and apparatus, device, and storage medium
CN111401077A (en) Language model processing method and device and computer equipment
CN111967264B (en) Named entity identification method
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN110377733B (en) Text-based emotion recognition method, terminal equipment and medium
WO2023045605A1 (en) Data processing method and apparatus, computer device, and storage medium
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN112487827A (en) Question answering method, electronic equipment and storage device
WO2024098524A1 (en) Text and video cross-searching method and apparatus, model training method and apparatus, device, and medium
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
CN111310462A (en) User attribute determination method, device, equipment and storage medium
CN112906368B (en) Industry text increment method, related device and computer program product
WO2021217866A1 (en) Method and apparatus for ai interview recognition, computer device and storage medium
Andriyanov Combining Text and Image Analysis Methods for Solving Multimodal Classification Problems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 21784383; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 21784383; Country of ref document: EP; Kind code of ref document: A1