CN113297364A - Natural language understanding method and device for dialog system

Natural language understanding method and device for dialog system

Info

Publication number
CN113297364A
Authority
CN
China
Prior art keywords
model
layer
training
representation
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110632046.5A
Other languages
Chinese (zh)
Other versions
CN113297364B (en)
Inventor
刘露
王乃钰
包铁
张雪松
彭涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202110632046.5A priority Critical patent/CN113297364B/en
Publication of CN113297364A publication Critical patent/CN113297364A/en
Application granted granted Critical
Publication of CN113297364B publication Critical patent/CN113297364B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of intelligent dialogue, and in particular relates to a natural language understanding method and device for a dialogue system, comprising a word embedding layer, an encoding representation layer and a joint learning layer. On a collected domain-specific data set, comparative experiments were carried out on four models: 1) the original BERT-WWM model, 2) the original ERNIE model, 3) the pre-trained joint learning model, and 4) a 3-layer BERT-WWM model obtained by knowledge distillation. On the domain-specific data set, model 3) outperforms models 1) and 2) on both performance indicators, intent classification accuracy and slot recognition F1; after knowledge distillation, model 4) has a greatly reduced parameter scale and effectively reduced inference latency with only a small performance loss.

Description

Natural language understanding method and device for dialog system
Technical Field
The invention relates to the technical field of intelligent dialogue, and in particular to a natural language understanding method and device for a dialogue system.
Background
The rapid development and widespread adoption of the internet has made the 21st century an era of data explosion. People's demand for information has grown sharply and broadened in scope. When users face large-scale, complex information, effectively searching for and acquiring it becomes the key to using it, which places higher requirements on information retrieval. Traditional retrieval has two limitations: (1) it only matches keywords and does not consider the user's needs at the semantic level; (2) search results typically return a large number of texts and web pages, requiring further selection by the user. Domain-specific Dialogue Systems and Question Answering Systems are research topics aimed at improving traditional retrieval. Compared with traditional retrieval, a dialogue question-answering system understands the question at the semantic level rather than by simple keyword matching. It can also filter the content of web pages and documents on the user's behalf, returning more precise results: the answer to the question, rather than a web page or document.
Dialogue question-answering systems can be divided into four types according to application scenario: (1) Frequently Asked Questions (FAQ) type: such a system is given questions and corresponding answers in advance; it analyzes and processes the user input with models and algorithms, finds the most similar question in the question bank using a similarity metric, and returns the corresponding answer. (2) Task-oriented type: such a system is designed to assist the user in completing a task; it analyzes the user input, identifies the user's intent, and takes a series of actions under the guidance of a dialogue policy model to fulfill the user's needs. (3) Common-sense type: a knowledge graph generally serves as the system's knowledge base; its triples contain real-world common knowledge stored in natural language form, and answers are retrieved from the knowledge graph according to the user input. (4) Chit-chat type: such a system aims to hold many rounds of open-domain dialogue with the user, which places high demands on the system's intelligence and semantic consistency.
In intelligent dialogue, applications of dialogue systems face problems such as user privacy, low user acceptance, and mediocre user experience, so acquiring large numbers of public, high-quality dialogue data sets and unsupervised corpora is very difficult; this lack of data sets limits the development of dialogue systems and poses a challenge. On the other hand, user input in a dialogue system is often spoken language, with high semantic ambiguity and grammatical randomness, unfixed sentence-length distribution, and divergent content. All of these features make the intent classification task much harder. In addition, user input may contain multiple intents with certain correlations among them, and identifying whether multiple intents exist and classifying them accurately is another challenge facing the intent classification task.
A dialogue system comprises four key modules: semantic understanding, dialogue state tracking, dialogue management, and dialogue generation. The natural language understanding task generally includes three subtasks: domain classification, intent classification, and slot recognition. Domain classification uses a model or algorithm to assign the user input to a domain category, and intent classification identifies the intent of the user input. Slot recognition is typically addressed as a sequence labeling task that identifies and tags entities in the user input. The present patent proposes a domain-specific natural language understanding method in which domain recognition and intent recognition are modeled as a single subtask within the natural language understanding component: the user input is first divided into two parts, one unrelated to the education domain and one related to it, and the domain-related input is then classified at a finer granularity.
Currently, dialogue systems in various domains attract more and more attention, and continuous progress in deep learning has greatly promoted their development. For dialogue systems, deep learning techniques can exploit large amounts of data to learn feature representations and to classify and identify user intents, requiring only a small amount of manual work.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the problems occurring in the existing intelligent dialog systems.
Therefore, an object of the present invention is to provide a natural language understanding method and device for a dialogue system that improves the accuracy of intent classification and slot recognition in an intelligent dialogue system. It improves the semantic representation capability of the natural language understanding module by introducing a pre-trained model and a new pre-training task, and improves the module's domain-specific performance by introducing domain-adaptive and task-adaptive pre-training. In addition, knowledge distillation is applied to the model, speeding up inference and alleviating the latency of the dialogue system.
To solve the above technical problem, according to an aspect of the present invention, the present invention provides the following technical solutions:
A natural language understanding method and device for a dialogue system, comprising a word embedding layer, an encoding representation layer and a joint learning layer;
wherein:
(1) Word Embedding Layer
In FIG. 1, X_1, ..., X_5 denote the words of the input sequence, and e(X_1), ..., e(X_5) denote their embedded representations. The embeddings are generated by pre-training; this layer mainly completes the mapping from text to vectors;
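As an illustration of this layer, a minimal sketch using the HuggingFace transformers library; the checkpoint name hfl/chinese-bert-wwm is an assumed stand-in for the BERT-WWM model named later in this document, not a value fixed by the patent:

```python
# Minimal sketch of the word embedding layer, assuming the HuggingFace
# "transformers" library; "hfl/chinese-bert-wwm" is an assumed stand-in
# checkpoint, not one specified by the patent.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm")

inputs = tokenizer("明天六年一班的课程信息是什么?", return_tensors="pt")
# The embedding layer maps token ids X_1..X_n to vectors e(X_1)..e(X_n).
embeddings = model.embeddings(inputs["input_ids"])  # (1, seq_len, hidden)
```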
(2) Encoding Representation Layer
The embedded word vectors are input into a pre-trained language model formed by stacking multiple Transformer layers, which performs high-level feature encoding and extraction;
the input sequence is recorded as X1,...,X5In a multi-head Attention layer in an Encoder part of a pre-training model based on a Transformer, a Scaled Dot Product Attention mechanism (Scaled Dot-Product Attention) is adopted, a multi-head mechanism is introduced to participate in calculation by using a plurality of different Attention moment arrays, and then a feedforward network is input to complete nonlinear transformation, wherein the formula of the process is as follows:
e(X)×WQ=Q (1)
e(X)×WK=K (2)
e(X)×WV=V (3)
Figure RE-GDA0003181355870000041
MultiHead(Q,K,V)=Concat(head1,...,headh)Wo (5)
FFN(x)=max(0,xW1+b1)W2+b2 (6)
in the formula (d)kDenotes the dimension of K, and FFN denotes the fully connected sublayer in BERT Encoder.
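A compact sketch of equations (1)-(6) in PyTorch; this mirrors the standard Transformer computation rather than any patent-specific variant, and the projection matrices are assumed to be supplied by the caller:

```python
# Sketch of equations (1)-(6): scaled dot-product attention, the multi-head
# combination, and the position-wise feed-forward sublayer.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V               # eq. (4)

def multi_head(e_X, W_Q, W_K, W_V, W_O, h):
    Q, K, V = e_X @ W_Q, e_X @ W_K, e_X @ W_V          # eqs. (1)-(3)
    heads = [scaled_dot_product_attention(q, k, v)
             for q, k, v in zip(Q.chunk(h, -1), K.chunk(h, -1), V.chunk(h, -1))]
    return torch.cat(heads, dim=-1) @ W_O              # eq. (5)

def ffn(x, W1, b1, W2, b2):
    return torch.relu(x @ W1 + b1) @ W2 + b2           # eq. (6)
```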
The representation encoded by the pre-trained model introduces the rich syntactic and semantic knowledge the model learned without supervision during pre-training, improving its classification and sequence labeling capability;
(3) Joint Learning Layer
In the field of image processing, the CNN module is an indispensable building block, and it has been shown to work well in natural language processing too. To capture features from lower to higher layers, researchers proposed the VDCNN model, which stacks up to tens of convolutional blocks and introduces residual connections to alleviate vanishing gradients. Among recurrent networks, both LSTM and GRU mitigate vanishing and exploding gradients, and adding bidirectional modeling and max pooling on top of an RNN strengthens the model's handling of long-distance dependencies while retaining important semantic information. For slot recognition, the bidirectional long short-term memory network has clear advantages for sequence encoding and shows excellent representation learning in sequence tasks, while the conditional random field can exploit information between labels at the output stage; therefore, the slot recognition task is modeled jointly with a bidirectional LSTM and a conditional random field (CRF);
therefore, a mixed network formed by three models of CNN, RCNN and VDCNN is adopted for prediction during intention classification, and a BilSTM + conditional random field mode is adopted for sequence marking during slot position identification.
As a preferred solution of the natural language understanding method and apparatus in the dialog system according to the present invention, wherein: the method comprises the following steps:
the method comprises the following steps: firstly, realizing vectorization representation of a user input text by using an embedded layer of a pre-trained language model which is pre-trained on a general field corpus;
step two: then, performing high-level feature extraction and semantic representation on the quantitative representation by using a multilayer Transformer of a pre-training language model, and integrating context information to complete input coding representation;
in the joint learning layer, the intention classification is realized by using a mixed network model containing TextCNN, RCNN and VDCNN, and the specific calculation process is as follows:
(1) CNN module
The CNN module adopts the TextCNN model. In natural language processing, the CNN performs feature extraction in one-dimensional form: after the text is vectorized, convolution and pooling are applied along the text sequence direction, and each convolution selects a fixed-size window to learn interactions in sliding-window fashion. The model is formulated as:

e(w_{1:n}) = e(w_1) \oplus e(w_2) \oplus \cdots \oplus e(w_n) \quad (7)

C_i = f(w \cdot e(w_{i:i+h-1}) + b) \quad (8)

c = [C_1, C_2, \ldots, C_{n-h+1}] \quad (9)

R_1 = \max\{c\} \quad (10)

where w_1, ..., w_n is the input text sequence, equation (7) is the splicing (concatenation) of the previous layer's encoded representation, and w in equation (8) is the convolution kernel matrix. Equations (9) and (10) are the concatenation of the convolved context representations and the max pooling operation;
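A minimal PyTorch sketch of this TextCNN branch; filter counts and kernel sizes are illustrative assumptions, not values given by the patent:

```python
# Sketch of the TextCNN branch (eqs. (7)-(10)): 1-D convolution over the
# token dimension followed by max-over-time pooling. Sizes are illustrative.
import torch
import torch.nn as nn

class TextCNNBranch(nn.Module):
    def __init__(self, hidden=768, n_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes)

    def forward(self, x):                 # x: (batch, seq_len, hidden)
        x = x.transpose(1, 2)             # Conv1d expects (batch, hidden, seq)
        # eq. (8): C_i = f(w . e(w_{i:i+h-1}) + b); eq. (10): max pooling
        pooled = [torch.relu(c(x)).max(dim=-1).values for c in self.convs]
        return torch.cat(pooled, dim=-1)  # R_1
```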
(2) RCNN module
The RCNN model introduces the idea of modeling both contexts simultaneously: encoding and feature extraction consider each word's left and right context, and RNNs serve as the feature extractor to capture long-range dependencies in the word sequence. The implementation uses bidirectional LSTM units. Max pooling over the LSTM-encoded output effectively learns the semantic importance of words within the sentence. RCNN is formulated as:

c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})) \quad (11)

c_r(w_i) = f(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})) \quad (12)

h_i = [c_l(w_i); e(w_i); c_r(w_i)] \quad (13)

y_i = \tanh(W^{(2)} h_i + b^{(2)}) \quad (14)

R_2 = \max_i \, y_i \quad (15)

where c_l(w_i) denotes the left context of word w_i and, correspondingly, c_r(w_i) denotes its right context, computed by equations (11) and (12) respectively; W^{(l)} is the transformation matrix from the previous hidden state to the current one, and W^{(sl)} fuses the semantics of the current word with the left context passed to the next word. W^{(r)} and W^{(sr)} are analogous. Here c_l(w_i), e(w_i) and c_r(w_i) are spliced to obtain the final encoded representation, a nonlinear transformation is applied according to equation (14), and equation (15) is the max pooling operation;
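A hedged PyTorch sketch of the RCNN branch, approximating the left/right context recurrences of equations (11)-(12) with one bidirectional LSTM; all dimensions are illustrative:

```python
# Sketch of the RCNN branch (eqs. (11)-(15)): a bidirectional LSTM supplies
# left/right context c_l, c_r, which are spliced with e(w_i), transformed
# nonlinearly, and max-pooled over the sequence.
import torch
import torch.nn as nn

class RCNNBranch(nn.Module):
    def __init__(self, hidden=768, ctx=256, out=128):
        super().__init__()
        self.bilstm = nn.LSTM(hidden, ctx, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(hidden + 2 * ctx, out)

    def forward(self, x):                      # x: (batch, seq_len, hidden)
        h_ctx, _ = self.bilstm(x)               # forward/backward states ~ c_l, c_r
        d = h_ctx.size(-1) // 2
        h = torch.cat([h_ctx[..., :d], x, h_ctx[..., d:]], dim=-1)  # eq. (13)
        y = torch.tanh(self.proj(h))            # eq. (14)
        return y.max(dim=1).values              # eq. (15): max pooling -> R_2
```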
(3) VDCNN module
VDCNN (Very Deep Convolutional Networks) was originally proposed for image recognition in computer vision. Its main idea is to use small convolution kernels (3 × 3) in all convolutional layers of the model and stack them to a very large depth, up to 19 layers; the output of this module is denoted R_3.
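A sketch of one VDCNN-style convolutional block with the residual connection mentioned above; channel counts and the block layout are illustrative assumptions:

```python
# Sketch of one VDCNN convolutional block with a residual connection;
# channel counts are illustrative.
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, channels, seq_len)
        return self.relu(self.block(x) + x)    # residual connection
```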
Step three: finally, splice the encoded representations R_1, R_2, R_3 obtained from the three models, as in equation (16); after one linear transformation, compute the probability of each category at the softmax layer and use cross entropy as the loss function:

R = [R_1; R_2; R_3] \quad (16)

S = W_s R + b \quad (17)

p = \mathrm{softmax}(S) \quad (18)

L_{intent} = -\sum_i y_i \log p_i \quad (19)
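A minimal sketch of equations (16)-(19), with illustrative shapes and random placeholder tensors standing in for the three branch outputs:

```python
# Sketch of eqs. (16)-(19): splice R_1, R_2, R_3, apply one linear layer,
# and compute softmax cross entropy. Shapes and sizes are illustrative.
import torch
import torch.nn as nn

num_intents = 10                                       # illustrative
R1, R2, R3 = (torch.randn(4, 128) for _ in range(3))   # branch outputs, batch=4
y = torch.randint(0, num_intents, (4,))                # gold intent labels

R = torch.cat([R1, R2, R3], dim=-1)                    # eq. (16)
S = nn.Linear(R.size(-1), num_intents)(R)              # eq. (17): S = W_s R + b
loss_intent = nn.CrossEntropyLoss()(S, y)              # eqs. (18)-(19)
```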
the Bi-LSTM network designed for slot identification mainly comprises two sublayers of Bi-LSTM and CRF:
(1) Bi-LSTM layer
The vector representation encoded by the pre-trained model is fed into a bidirectional LSTM network; the vectorized text sequence is processed in the forward (left-to-right) and backward (right-to-left) directions, formally as in equations (11) and (12), finally obtaining a forward encoded representation c_l(w_i) and a backward encoded representation c_r(w_i), which are spliced according to equation (13);
(2) CRF layer
The CRF layer learns dependencies between adjacent labels to constrain label combinations and decodes the best path among all possible label paths. Let H be the encoded representation from the Bi-LSTM, n the number of characters in the sentence and m the number of slot label types; then H_{i,j} is the score of the j-th label for the i-th character. In the CRF layer, the current output is determined both by the transition from the previous output and by the state corresponding to the current input, giving a transition score and a state score respectively; H serves as the state matrix of the CRF layer, while the score of transitioning from state i to state j is given by a transition matrix T_{i,j}. The computation proceeds as follows:

s(X, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} H_{i, y_i} \quad (20)

p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}} \quad (21)

L_{slot} = -\log p(y \mid X) \quad (22)

y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}) \quad (23)

where y in equation (21) denotes the true label sequence; equations (22) and (23) give the model objective function and the Viterbi decoding of the optimal label sequence;
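A hedged sketch of this Bi-LSTM + CRF head, assuming the third-party pytorch-crf package (the patent does not name a library); tag counts and dimensions are illustrative:

```python
# Sketch of the slot-filling head (eqs. (20)-(23)) using the third-party
# pytorch-crf package; the library choice is an assumption.
import torch
import torch.nn as nn
from torchcrf import CRF

class SlotTagger(nn.Module):
    def __init__(self, hidden=768, ctx=256, num_tags=12):
        super().__init__()
        self.bilstm = nn.LSTM(hidden, ctx, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * ctx, num_tags)    # H: per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)  # learns transition matrix T

    def loss(self, x, tags, mask):
        H = self.emit(self.bilstm(x)[0])
        return -self.crf(H, tags, mask=mask)        # eq. (22): -log p(y|X)

    def decode(self, x, mask):
        H = self.emit(self.bilstm(x)[0])
        return self.crf.decode(H, mask=mask)        # eq. (23): Viterbi
```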
in the joint learning model, the overall loss function of the model is the weighted sum of losses of the intention classification model and the slot position identification model, namely:
L=αLintent+(1-α)Lslot (24)
the model adopts an Adam optimizer minimized objective function with linear preheating and weight attenuation to update parameters.
As a preferred solution of the natural language understanding method and apparatus in the dialog system according to the present invention, wherein: the method aims at Chinese intention recognition and slot position recognition, so that when a proper coding presentation layer pre-training model is selected, two pre-training language models which are more suitable for a Chinese scene and based on a Transformer are selected. The two models are respectively an ERNIE model proposed by Baidu and BERT-WWM proposed by the union of Harvard and fly.
As a preferred solution of the natural language understanding method and apparatus in the dialog system according to the present invention, wherein: in order to enhance the intention recognition and slot position recognition performance of the model facing the education field, the invention provides a pre-training target task integrated with field dictionary information by constructing a field dictionary, and continuously pre-trains the pre-training language model used by the code expression layer.
As a preferred solution of the natural language understanding method and apparatus in the dialog system according to the present invention, wherein: in order to further improve the representation learning capacity of the pre-training model in the face of two subtasks in the education field, the invention provides that the pre-training model based on the pre-training model is further trained in two other stages, namely the model comprises four stages of general field pre-training, field adaptation pre-training, task adaptation pre-training and fine adjustment.
As a preferred solution of the natural language understanding method and apparatus in the dialog system according to the present invention, wherein: and (3) carrying out knowledge distillation on the joint learning model based on pre-training, and distilling to three layers of BERT-WWM respectively.
Compared with the prior art, the invention has the following beneficial effects. On the collected domain-specific data set, comparative experiments were carried out on four models: 1) the original BERT-WWM model, 2) the original ERNIE model, 3) the pre-trained joint learning model, and 4) a 3-layer BERT-WWM model obtained by knowledge distillation. On the domain-specific data set, model 3) outperforms models 1) and 2) on both performance indicators, intent classification accuracy and slot recognition F1; after knowledge distillation, model 4) has a greatly reduced parameter scale and effectively reduced inference latency with only a small performance loss.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without inventive effort. Wherein:
FIG. 1 is a pre-training-based joint learning model according to the present invention.
FIG. 2 is a process of dictionary construction in the field of the present invention.
FIG. 3 is a multi-stage pre-training process of the present invention.
FIG. 4 is a 3-layer BERT-WWM distillation framework of the present invention.
FIG. 5 is a flowchart showing the steps of embodiment 1 of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, the invention may be practiced in ways other than those described here, and those of ordinary skill in the art can make similar generalizations without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Example 1
The invention can be used in single-turn, question-answer-pair, retrieval-based intelligent dialogue systems for the education domain, for intent classification and slot recognition of user input.
1. First, the user input text is embedded and semantically encoded. For example, the user inputs: "What is the course information for Class 1 of Grade 6 tomorrow?" After the system receives the input, the text is fed into the pre-training-based joint learning model, which forms an encoded input representation as a two-dimensional matrix of 64-bit binary numbers, carrying the semantic and syntactic information captured by the pre-trained language model.
2. User intent and slots are predicted. The two-dimensional matrix representation is input into the joint learning model, where the hybrid network model predicts the intent of the input and BiLSTM + CRF identifies the slots. The intent of "What is the course information for Class 1 of Grade 6 tomorrow?" is recognized as "query course information", with two semantic slots: "date: tomorrow" and "class: Class 1 of Grade 6".
3. Having obtained the user intent and slot information, the dialog state tracking module is used to collect user input, historical dialog, context, and user intent and slot values, forming a current dialog state that can be learned by the dialog state tracking module.
4. The dialogue policy module selects the action to execute according to the current dialogue state. The dialogue policy is the core function of the dialogue system, equivalent to its brain: it determines which specific action to execute according to the current user's feedback, and how to update the dialogue state information. The available actions are preset in the program and realized by programming; the policy itself can be given as hand-written rules or trained as a policy model through machine learning and deep learning.
5. The dialogue response module matches question-answer pairs stored in the knowledge base according to the selected dialogue action. The knowledge base stores structured course information as well as knowledge in the form of unstructured question-answer pairs. Structured information is output directly after retrieval; for unstructured information, cosine similarity is used as the measure of matching degree. If the match meets the threshold requirement, the corresponding answer is output; otherwise a follow-up question is asked or a generic answer is returned, as in the sketch below.
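A minimal sketch of this cosine-similarity matching step, assuming the stored questions and the user query have already been encoded into vectors; the threshold value is an illustrative assumption:

```python
# Sketch of the unstructured QA matching step: cosine similarity between the
# encoded user question and stored question vectors, with a threshold gate.
import torch
import torch.nn.functional as F

def match_answer(query_vec, question_vecs, answers, threshold=0.8):
    sims = F.cosine_similarity(query_vec.unsqueeze(0), question_vecs, dim=-1)
    best = int(sims.argmax())
    if sims[best] >= threshold:
        return answers[best]
    return None   # fall back to a follow-up question or a generic reply
```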
While the invention has been described above with reference to an embodiment, various modifications may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In particular, the various features of the disclosed embodiments of the invention may be used in any combination, provided that no structural conflict exists, and the combinations are not exhaustively described in this specification merely for the sake of brevity and resource conservation. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (6)

1. A natural language understanding device for a dialogue system, characterized in that it comprises a word embedding layer, an encoding representation layer and a joint learning layer;
wherein:
(1) Word Embedding Layer
In FIG. 1, X_1, ..., X_5 denote the words of the input sequence, and e(X_1), ..., e(X_5) denote their embedded representations. The embeddings are generated by pre-training; this layer mainly completes the mapping from text to vectors;
(2) Encoding Representation Layer
The embedded word vectors are input into a pre-trained language model formed by stacking multiple Transformer layers, which performs high-level feature encoding and extraction;
The representation encoded by the pre-trained model introduces the rich syntactic and semantic knowledge the model learned without supervision during pre-training, improving its classification and sequence labeling capability;
(3) Joint Learning Layer
In the field of image processing, the CNN module is an indispensable building block, and it has been shown to work well in natural language processing too. To capture features from lower to higher layers, researchers proposed the VDCNN model, which stacks up to tens of convolutional blocks and introduces residual connections to alleviate vanishing gradients. Among recurrent networks, both LSTM and GRU mitigate vanishing and exploding gradients, and adding bidirectional modeling and max pooling on top of an RNN strengthens the model's handling of long-distance dependencies while retaining important semantic information. For slot recognition, the bidirectional long short-term memory network has clear advantages for sequence encoding and shows excellent representation learning in sequence tasks, while the conditional random field can exploit information between labels at the output stage; therefore, the slot recognition task is modeled jointly with a bidirectional LSTM and a conditional random field (CRF);
therefore, a mixed network formed by three models of CNN, RCNN and VDCNN is adopted for prediction during intention classification, and a BilSTM + conditional random field mode is adopted for sequence marking during slot position identification.
2. A method of using the natural language understanding device of claim 1, characterized in that it comprises the following steps:
the method comprises the following steps: firstly, realizing vectorization representation of a user input text by using an embedded layer of a pre-trained language model which is pre-trained on a general field corpus;
step two: then, performing high-level feature extraction and semantic representation on the quantitative representation by using a multilayer Transformer of a pre-training language model, and integrating context information to complete input coding representation;
in the joint learning layer, the intention classification is realized by using a mixed network model containing TextCNN, RCNN and VDCNN, and the specific calculation process is as follows:
(1) CNN module
The CNN module adopts the TextCNN model. In natural language processing, the CNN performs feature extraction in one-dimensional form: after the text is vectorized, convolution and pooling are applied along the text sequence direction, and each convolution selects a fixed-size window to learn interactions in sliding-window fashion;
(2) RCNN module
The RCNN model introduces the idea of modeling both contexts simultaneously: encoding and feature extraction consider each word's left and right context, and RNNs serve as the feature extractor to capture long-range dependencies in the word sequence. The implementation uses bidirectional LSTM units. Max pooling over the LSTM-encoded output effectively learns the semantic importance of words within the sentence;
(3) VDCNN module
VDCNN (Very Deep Convolutional Networks) was originally proposed for image recognition in computer vision. Its main idea is to use small convolution kernels (3 × 3) in all convolutional layers of the model and stack them to a very large depth, up to 19 layers; the output of this module is denoted R_3;
Step three: finally, splice the encoded representations R_1, R_2, R_3 obtained from the three models.
3. A method of using the natural language understanding device for a dialogue system according to any one of claims 1-2, characterized in that: the method targets Chinese intent recognition and slot recognition, so when selecting a pre-trained model for the encoding representation layer, two Transformer-based pre-trained language models better suited to Chinese scenarios are chosen: the ERNIE model proposed by Baidu, and BERT-WWM proposed by the joint laboratory of Harbin Institute of Technology and iFLYTEK.
4. A method of using the natural language understanding device for a dialogue system according to any one of claims 1-3, characterized in that: in order to enhance the model's intent recognition and slot recognition performance in the education domain, the invention constructs a domain dictionary, proposes a pre-training target task that integrates domain dictionary information, and continues to pre-train the pre-trained language model used by the encoding representation layer.
5. The method of using the natural language understanding device for a dialogue system according to any one of claims 1-4, characterized in that: in order to further improve the representation learning capability of the pre-trained model on the two education-domain subtasks, the invention trains it in two additional stages, so that the model passes through four stages in total: general-domain pre-training, domain-adaptive pre-training, task-adaptive pre-training, and fine-tuning.
6. The method of using the natural language understanding device for a dialogue system according to any one of claims 1-4, characterized in that: knowledge distillation is performed on the pre-training-based joint learning model, distilling it into a 3-layer BERT-WWM.
CN202110632046.5A 2021-06-07 2021-06-07 Natural language understanding method and device in dialogue-oriented system Active CN113297364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632046.5A CN113297364B (en) 2021-06-07 2021-06-07 Natural language understanding method and device in dialogue-oriented system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110632046.5A CN113297364B (en) 2021-06-07 2021-06-07 Natural language understanding method and device in dialogue-oriented system

Publications (2)

Publication Number Publication Date
CN113297364A true CN113297364A (en) 2021-08-24
CN113297364B CN113297364B (en) 2023-06-09

Family

ID=77327526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632046.5A Active CN113297364B (en) 2021-06-07 2021-06-07 Natural language understanding method and device in dialogue-oriented system

Country Status (1)

Country Link
CN (1) CN113297364B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449528A (en) * 2021-08-30 2021-09-28 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114490968A (en) * 2021-12-29 2022-05-13 北京百度网讯科技有限公司 Dialog state tracking method, model training method and device and electronic equipment
CN115168593A (en) * 2022-09-05 2022-10-11 深圳爱莫科技有限公司 Intelligent dialogue management system, method and processing equipment capable of self-learning
CN115599918A (en) * 2022-11-02 2023-01-13 吉林大学(Cn) Mutual learning text classification method and system based on graph enhancement
CN115964471A (en) * 2023-03-16 2023-04-14 成都安哲斯生物医药科技有限公司 Approximate query method for medical data
CN116821287A (en) * 2023-08-28 2023-09-29 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137404A1 (en) * 2016-11-15 2018-05-17 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
US20180157638A1 (en) * 2016-12-02 2018-06-07 Microsoft Technology Licensing, Llc Joint language understanding and dialogue management
CN108763542A (en) * 2018-05-31 2018-11-06 中国华戎科技集团有限公司 A kind of Text Intelligence sorting technique, device and computer equipment based on combination learning
CN109299253A (en) * 2018-09-03 2019-02-01 华南理工大学 A kind of social text Emotion identification model construction method of Chinese based on depth integration neural network
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN109684511A (en) * 2018-12-10 2019-04-26 上海七牛信息技术有限公司 A kind of video clipping method, video aggregation method, apparatus and system
CN109840279A (en) * 2019-01-10 2019-06-04 山东亿云信息技术有限公司 File classification method based on convolution loop neural network
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Short text sensibility classification method and device neural network based
CN110674639A (en) * 2019-09-24 2020-01-10 拾音智能科技有限公司 Natural language understanding method based on pre-training model
CN110807332A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium
CN111078833A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Text classification method based on neural network
CN111309915A (en) * 2020-03-03 2020-06-19 爱驰汽车有限公司 Method, system, device and storage medium for training natural language of joint learning
CN111625641A (en) * 2020-07-30 2020-09-04 浙江大学 Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137404A1 (en) * 2016-11-15 2018-05-17 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
US20180157638A1 (en) * 2016-12-02 2018-06-07 Microsoft Technology Licensing, Llc Joint language understanding and dialogue management
CN108763542A (en) * 2018-05-31 2018-11-06 中国华戎科技集团有限公司 A kind of Text Intelligence sorting technique, device and computer equipment based on combination learning
CN109299253A (en) * 2018-09-03 2019-02-01 华南理工大学 A kind of social text Emotion identification model construction method of Chinese based on depth integration neural network
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN109684511A (en) * 2018-12-10 2019-04-26 上海七牛信息技术有限公司 A kind of video clipping method, video aggregation method, apparatus and system
CN109840279A (en) * 2019-01-10 2019-06-04 山东亿云信息技术有限公司 File classification method based on convolution loop neural network
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Short text sensibility classification method and device neural network based
CN110674639A (en) * 2019-09-24 2020-01-10 拾音智能科技有限公司 Natural language understanding method based on pre-training model
CN110807332A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium
CN111078833A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Text classification method based on neural network
CN111309915A (en) * 2020-03-03 2020-06-19 爱驰汽车有限公司 Method, system, device and storage medium for training natural language of joint learning
CN111625641A (en) * 2020-07-30 2020-09-04 浙江大学 Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
战保行: "Research on Joint Algorithms for Intent Recognition and Slot Filling in Task-Oriented Dialogue", China Masters' Theses Full-text Database, Information Science and Technology Series *
王乃钰 et al.: "Research Progress on Language Models Based on Deep Learning", Journal of Software *
王堃 et al.: "A Survey of Joint Intent and Semantic Slot Recognition in End-to-End Dialogue Systems", Computer Engineering and Applications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449528A (en) * 2021-08-30 2021-09-28 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN113449528B (en) * 2021-08-30 2021-11-30 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114490968A (en) * 2021-12-29 2022-05-13 北京百度网讯科技有限公司 Dialog state tracking method, model training method and device and electronic equipment
CN115168593A (en) * 2022-09-05 2022-10-11 深圳爱莫科技有限公司 Intelligent dialogue management system, method and processing equipment capable of self-learning
CN115599918A (en) * 2022-11-02 2023-01-13 吉林大学(Cn) Mutual learning text classification method and system based on graph enhancement
CN115964471A (en) * 2023-03-16 2023-04-14 成都安哲斯生物医药科技有限公司 Approximate query method for medical data
CN116821287A (en) * 2023-08-28 2023-09-29 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method
CN116821287B (en) * 2023-08-28 2023-11-17 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method

Also Published As

Publication number Publication date
CN113297364B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN110781680B (en) Semantic similarity matching method based on twin network and multi-head attention mechanism
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN110134946B (en) Machine reading understanding method for complex data
CN112487820B (en) Chinese medical named entity recognition method
CN110390397A (en) A kind of text contains recognition methods and device
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
CN112559706B (en) Training method of dialogue generating model, dialogue method, device and storage medium
CN110597968A (en) Reply selection method and device
CN113254604A (en) Reference specification-based professional text generation method and device
CN115080715B (en) Span extraction reading understanding method based on residual structure and bidirectional fusion attention
CN116028604A (en) Answer selection method and system based on knowledge enhancement graph convolution network
CN112925918A (en) Question-answer matching system based on disease field knowledge graph
CN115964459B (en) Multi-hop reasoning question-answering method and system based on food safety cognition spectrum
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN116662502A (en) Method, equipment and storage medium for generating financial question-answer text based on retrieval enhancement
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN114328866A (en) Strong anthropomorphic intelligent dialogue robot with smooth and accurate response
CN113011196B (en) Concept-enhanced representation and one-way attention-containing subjective question automatic scoring neural network model
CN114020900A (en) Chart English abstract generation method based on fusion space position attention mechanism
CN117648469A (en) Cross double-tower structure answer selection method based on contrast learning
CN113641809A (en) XLNET-BiGRU-CRF-based intelligent question answering method
CN112579739A (en) Reading understanding method based on ELMo embedding and gating self-attention mechanism
CN113807079A (en) End-to-end entity and relation combined extraction method based on sequence-to-sequence
CN116681078A (en) Keyword generation method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant