CN112395876A - Knowledge distillation and multitask learning-based discourse relation recognition method and device - Google Patents
- Publication number
- CN112395876A (application number CN202110078740.7A, filed as CN202110078740A)
- Authority
- CN
- China
- Prior art keywords
- model
- cost function
- classification
- discourse relation
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a discourse relation recognition method and device based on knowledge distillation and multi-task learning. On one hand, knowledge is shared between the connective classification auxiliary task and the implicit discourse relation recognition main task through parameter sharing; on the other hand, based on knowledge distillation, knowledge in the connective-enhanced teacher model is transferred at both the feature extraction layer and the classification layer to the corresponding implicit discourse relation recognition model. By fully utilizing the connectives inserted during corpus annotation, the recognition performance of the student model is improved.
Description
Technical Field
The invention relates to the technical field of intelligent computer analysis and processing, and in particular to a discourse relation recognition method and device based on knowledge distillation and multitask learning.
Background
A discourse generally refers to a complete language unit composed of a series of structurally connected, semantically coherent units (sentences or clauses), organized according to certain semantic relationships or a hierarchical structure. The semantic relationships between sentences or clauses are called discourse relations, e.g., causal relations, contrast relations, etc. Discourse relation recognition refers to automatically judging the semantic relationship between two discourse arguments (sentences or clauses); it is one of the core subtasks of discourse structure analysis and also its performance bottleneck. Therefore, improving the performance of discourse relation recognition not only promotes the development of discourse structure analysis, but also benefits many downstream natural language processing tasks, such as machine translation, sentiment analysis, question answering, and text summarization.
Discourse connectives (e.g., because, but) are among the most important features for discourse relation recognition. When two arguments are joined by a discourse connective, explicit discourse relation recognition can achieve over 90% classification accuracy using the connective as a feature. In contrast, when the connective is omitted between two arguments, implicit discourse relation recognition must infer the relation from the semantics of the two arguments alone, and the current accuracy is only about 60%. For example, as shown in fig. 1, the connective "so" is omitted between the two arguments of the implicit instance, and it is very difficult to infer the causal relation between them purely from the argument texts about water accumulation and not playing basketball. In fact, even corpus annotators often rely on connective information to assist the annotation of implicit discourse relations. For example, when annotating the Penn Discourse Treebank (PDTB), currently the largest discourse corpus, annotators are required to first insert an appropriate connective between the two arguments of an implicit instance, and then judge the discourse relation of the instance by combining the information of both arguments and the inserted connective. That is, discourse corpus annotators routinely use the (inserted) connective information to assist the annotation of implicit discourse relations.
From the above analysis it can be seen that, on one hand, there is a large performance gap (90% vs. 60%) between connective-based explicit discourse relation recognition and argument-semantics-based implicit discourse relation recognition; on the other hand, the corpus annotation process shows that connective information is helpful for implicit discourse relation recognition. Therefore, some researchers have attempted to exploit connective information in implicit discourse relation recognition models to improve performance. At present, researchers use adversarial-learning-based methods to exploit the connectives inserted during corpus annotation to help implicit discourse relation recognition.
However, existing adversarial-learning-based methods do not make full use of the connective information: they transfer knowledge only at the feature extraction layer, and the recognition performance is not ideal.
Disclosure of Invention
In view of the above, there is a need to address the problem that conventional adversarial-learning-based methods transfer knowledge only at the feature extraction layer, so that the recognition performance is not ideal.
The embodiment of the invention provides a discourse relation recognition method based on knowledge distillation and multitask learning, wherein the method comprises the following steps:
taking implicit discourse relation instances labeled with connectives and implicit discourse relation categories as training instances;
constructing a connective-enhanced teacher model based on a bidirectional attention mechanism classification model, taking the connective as additional input, and iteratively minimizing the cost function corresponding to the connective-enhanced teacher model until convergence to obtain a trained teacher model;
constructing a multi-task learning student model based on the bidirectional attention mechanism classification model, introducing connective classification as an auxiliary task to determine a cost function based on multi-task learning, using the trained teacher model to calculate the features and prediction results of the training instances to determine a cost function based on knowledge distillation, and then determining the total cost function of the student model;
and iteratively minimizing the total cost function of the student model until convergence, so as to output the trained student model and then recognize the implicit discourse relation of test instances.
The invention provides a discourse relation recognition method based on knowledge distillation and multi-task learning, which takes implicit discourse relation instances labeled with connectives and categories as training instances in order to fully utilize the connectives inserted during corpus annotation. First, a connective-enhanced teacher model is constructed based on a bidirectional attention mechanism classification model, and with the connective as additional input its cost function is iteratively minimized until convergence to obtain a trained teacher model. Then the constructed multi-task learning student model is trained: a total cost function is built from the multi-task learning and knowledge distillation methods and iteratively minimized until convergence, thereby outputting the trained multi-task learning student model. On one hand, the method shares knowledge between the connective classification auxiliary task and the implicit discourse relation recognition main task through parameter sharing (a shared feature extraction layer); on the other hand, based on knowledge distillation, knowledge in the connective-enhanced teacher model is transferred at both the feature extraction layer and the classification layer to the corresponding implicit discourse relation recognition model (the multi-task learning student model). By fully utilizing the connectives inserted during corpus annotation, the recognition performance of the student model is improved. The proposed method achieves better recognition performance than comparable methods on the top-level and second-level implicit discourse relations of the widely used PDTB data set.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the implicit discourse relation instance labeled with a connective and an implicit discourse relation category is represented as the training instance $(a_1, a_2, c, r)$;
where $a_1$ and $a_2$ denote the two arguments of the implicit discourse relation training instance, $c$ denotes the labeled connective, and $r$ denotes the labeled implicit discourse relation category.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the input of the connective-enhanced teacher model is $(a_1, a_2, c)$, and the corresponding cost function is expressed as:

$$J_T(\theta_T) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^T$$

where $\theta_T$ denotes the parameters of the teacher model, $\mathbf{y}_r$ is the one-hot encoding corresponding to the labeled implicit discourse relation category $r$, the term $\mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^T$ is the expected value of the log prediction with respect to the labeled category, $\hat{\mathbf{y}}^T$ denotes the prediction result obtained after the classification layer of the connective-enhanced teacher model, and $D$ is the training instance set.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the total cost function of the multi-task learning student model is expressed as:

$$J(\theta_S) = \alpha \left( J_{rel}(\theta_S) + J_{conn}(\theta_S) \right) + \beta \left( J_{feat}(\theta_S) + J_{cls}(\theta_S) \right)$$

where $J(\theta_S)$ is the total cost function of the student model, $\theta_S$ denotes the parameters of the student model, and $\alpha$ and $\beta$ are the weight coefficients of the cost function based on multitask learning and the cost function based on knowledge distillation, respectively;
the cost function based on multi-task learning comprises two parts: $J_{rel}$, the cross-entropy cost function corresponding to implicit discourse relation recognition, and $J_{conn}$, the cross-entropy cost function corresponding to connective classification; the cost function based on knowledge distillation comprises two parts: $J_{feat}$, the cost function corresponding to feature-extraction-layer knowledge distillation, and $J_{cls}$, the cost function corresponding to classification-layer knowledge distillation.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the input of the multi-task learning student model is $(a_1, a_2)$, and the cross-entropy cost function corresponding to implicit discourse relation recognition is expressed as:

$$J_{rel}(\theta_S) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^{S_1}$$

where $\theta_S$ denotes the parameters of the student model, $\mathbf{y}_r$ is the one-hot encoding corresponding to the labeled implicit discourse relation category $r$, the term $\mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^{S_1}$ is the expected value of the log prediction with respect to the labeled category, $\hat{\mathbf{y}}^{S_1}$ denotes the prediction result for implicit discourse relation recognition obtained after classification layer 1 of the student model, and $D$ is the training instance set.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the cross-entropy cost function corresponding to connective classification in the multi-task learning student model is expressed as:

$$J_{conn}(\theta_S) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_c^{\top} \log \hat{\mathbf{y}}^{S_2}$$

where $\theta_S$ denotes the parameters of the student model, $\mathbf{y}_c$ is the one-hot encoding corresponding to the labeled connective $c$, the term $\mathbf{y}_c^{\top} \log \hat{\mathbf{y}}^{S_2}$ is the expected value of the log prediction with respect to the labeled connective, $\hat{\mathbf{y}}^{S_2}$ denotes the prediction result for connective classification obtained after classification layer 2 of the student model, and $D$ is the training instance set.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the cost function corresponding to feature-extraction-layer knowledge distillation in the multi-task learning student model is expressed as:

$$J_{feat}(\theta_S) = \sum_{(a_1, a_2, c, r) \in D} \mathrm{MSE}\!\left(\mathbf{f}^T, \mathbf{f}^S\right)$$

where $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean squared error, $\mathbf{f}^T$ denotes the features obtained after the feature extraction layer of the connective-enhanced teacher model, $\mathbf{f}^S$ denotes the features obtained after the feature extraction layer of the multi-task learning student model, and $D$ is the training instance set.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the cost function corresponding to classification-layer knowledge distillation in the multi-task learning student model is expressed as:

$$J_{cls}(\theta_S) = \sum_{(a_1, a_2, c, r) \in D} \mathrm{KL}\!\left(\hat{\mathbf{y}}^T \,\|\, \hat{\mathbf{y}}^{S_1}\right)$$

where $\mathrm{KL}(\cdot\|\cdot)$ denotes the KL distance between two probability distributions, $\hat{\mathbf{y}}^T$ denotes the prediction result obtained after the classification layer of the connective-enhanced teacher model, and $\hat{\mathbf{y}}^{S_1}$ denotes the prediction result obtained after classification layer 1 of the multi-task learning student model.
In the discourse relation recognition method based on knowledge distillation and multitask learning, the bidirectional attention mechanism classification model comprises an encoding layer, an interaction layer, an aggregation layer and a classification layer, wherein the encoding layer is used to learn the representation of the words of each argument in context, and is expressed as:

$$\mathbf{h}_i^1 = \mathrm{BiLSTM}_1(\mathbf{x}_i^1), \quad i = 1, \ldots, m$$
$$\mathbf{h}_j^2 = \mathrm{BiLSTM}_2(\mathbf{x}_j^2), \quad j = 1, \ldots, n$$

where $\mathbf{x}_i^1$ and $\mathbf{h}_i^1$ are the word vector of the $i$-th word in argument 1 and its representation in context, respectively, $\mathbf{x}_j^2$ and $\mathbf{h}_j^2$ are the word vector of the $j$-th word in argument 2 and its representation in context, respectively, $m$ and $n$ are the numbers of words in the two arguments, and $\mathrm{BiLSTM}_1$ and $\mathrm{BiLSTM}_2$ are both bidirectional long short-term memory networks.
The invention also provides a discourse relation recognition device based on knowledge distillation and multitask learning, wherein the device comprises:
a training input module, used for taking implicit discourse relation instances labeled with connectives and implicit discourse relation categories as training instances;
a first construction module, used for constructing a connective-enhanced teacher model based on a bidirectional attention mechanism classification model, taking the connective as additional input, and iteratively minimizing the cost function corresponding to the connective-enhanced teacher model until convergence to obtain a trained teacher model;
a second construction module, used for constructing a multi-task learning student model based on the bidirectional attention mechanism classification model, introducing connective classification as an auxiliary task to determine a cost function based on multi-task learning, using the trained teacher model to calculate the features and prediction results of the training instances to determine a cost function based on knowledge distillation, and then determining the total cost function of the student model;
and a training output module, used for iteratively minimizing the total cost function of the student model until convergence, so as to output the trained student model and then recognize the implicit discourse relation of test instances.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a diagram of an implicit discourse relation instance labeled with a connective and an implicit discourse relation category;
FIG. 2 is a flow chart of the discourse relation recognition method based on knowledge distillation and multitask learning according to the present invention;
FIG. 3 is a schematic diagram of the concept of the discourse relation recognition method based on knowledge distillation and multitask learning according to the present invention;
FIG. 4 is a schematic diagram of the classification model based on the bidirectional attention mechanism;
FIG. 5 is a schematic structural diagram of the discourse relation recognition device based on knowledge distillation and multitask learning according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Existing adversarial-learning-based methods make insufficient use of connective information, transfer knowledge only at the feature extraction layer, and achieve unsatisfactory recognition performance.
In order to solve this technical problem, the present invention provides a discourse relation recognition method based on knowledge distillation and multitask learning; referring to fig. 1 to 3, the method includes the following steps:
s101, taking the implicit discourse relation example marked with the connection words and the implicit discourse relation category as a training example.
Specifically, any implicit discourse relation training instance in the corpus labeled with a connective and a relation category can be represented as $(a_1, a_2, c, r)$, where $a_1$ and $a_2$ denote the two arguments of the implicit discourse relation training instance, $c$ denotes the connective inserted during annotation, i.e., the true connective label, and $r$ denotes the annotated implicit discourse relation category, i.e., the true category label.
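As a small illustration, one such labeled instance can be held in memory as follows; the field names and data structure are illustrative assumptions, and the argument texts paraphrase the fig. 1 example (water accumulation, not playing basketball, inserted connective "so", causal relation):

```python
# Hypothetical in-memory encoding of one labeled implicit instance.
instance = {
    "arg1": "There is water accumulation on the court.",   # a1
    "arg2": "We did not play basketball.",                 # a2
    "connective": "so",     # c: connective inserted by the annotator
    "relation": "causal",   # r: labeled implicit discourse relation
}

def as_tuple(inst):
    """Return the (a1, a2, c, r) tuple used throughout the cost functions."""
    return (inst["arg1"], inst["arg2"], inst["connective"], inst["relation"])
```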
And S102, constructing a connective-enhanced teacher model based on the bidirectional attention mechanism classification model, taking the connective as additional input, and iteratively minimizing the cost function corresponding to the connective-enhanced teacher model until convergence to obtain a trained teacher model.
It should be noted that the teacher model is a connective-enhanced implicit discourse relation recognition model that takes the arguments $a_1, a_2$ and the connective $c$ inserted during annotation as input. The features of the teacher model obtained after the feature extraction layer are denoted $\mathbf{f}^T$, and the prediction result of the teacher model obtained after the classification layer is denoted $\hat{\mathbf{y}}^T$.
When training the teacher model, the teacher model cost function (a cross-entropy classification cost function) is minimized on the training corpus, where the teacher model cost function is expressed as:

$$J_T(\theta_T) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^T$$

where $\theta_T$ denotes the parameters of the teacher model, $r$ is the annotated implicit discourse relation category, $\mathbf{y}_r$ is the one-hot encoding corresponding to $r$, $c$ denotes the labeled connective, the term $\mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^T$ is the expected value of the log prediction with respect to the labeled category, $\hat{\mathbf{y}}^T$ denotes the prediction result obtained after the classification layer of the connective-enhanced teacher model, and $D$ is the training instance set.
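A minimal numpy sketch of this cross-entropy cost over a toy batch (not the patent's implementation; the class count and logit values below are invented for the example):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, onehot):
    """-y^T log(softmax(logits)), summed over the batch."""
    probs = softmax(logits)
    return -(onehot * np.log(probs + 1e-12)).sum()

# Toy batch: 2 instances, 4 relation classes.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 3.0, 0.1]])
onehot = np.eye(4)[[0, 2]]   # labeled classes are 0 and 2
loss = cross_entropy(logits, onehot)
```

The loss is small when the predicted distribution concentrates on the labeled class and grows when it does not.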
It should be added that the connective-enhanced teacher model simulates the process by which humans annotate implicit discourse relations. With the assistance of the inserted connective $c$, its recognition performance is far higher than that of the multi-task learning student model, which takes only the arguments $a_1, a_2$ as input (for example, its accuracy on the top-level implicit discourse relation classification task of the PDTB corpus can exceed 85%), which fully shows that the connective-enhanced teacher model can effectively fuse the information of the connectives inserted during corpus annotation.
S103, constructing a multi-task learning student model based on the bidirectional attention mechanism classification model, introducing connective classification as an auxiliary task to determine a cost function based on multi-task learning, using the trained teacher model to calculate the features and prediction results of the training instances to determine a cost function based on knowledge distillation, and then determining the total cost function of the student model.
The multi-task learning student model is a discourse relation recognition model based on multi-task learning. Connective classification is taken as the auxiliary task, i.e., given an implicit discourse relation instance $(a_1, a_2)$, predicting a connective suitable for joining the two arguments; implicit discourse relation recognition is taken as the main task. The models of the two related tasks (the implicit discourse relation recognition task and the connective classification task) share a feature extraction layer and use their own classification layers. Specifically, referring to fig. 3, classification layer 1 is used for the implicit discourse relation recognition task, and classification layer 2 is used for the connective classification task. Through the shared feature extraction layer, the models of the two related tasks can exchange information and thereby promote each other. The multi-task learning student model takes only the arguments $a_1, a_2$ as input; the student model features obtained after the shared feature extraction layer are denoted $\mathbf{f}^S$, the prediction result obtained after classification layer 1 (corresponding to implicit discourse relation recognition) is denoted $\hat{\mathbf{y}}^{S_1}$, and the prediction result obtained after classification layer 2 (corresponding to connective classification) is denoted $\hat{\mathbf{y}}^{S_2}$.
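The shared-feature-layer / two-classification-head structure can be sketched as below; all dimensions, the tanh feature layer, and the random weights are illustrative assumptions, not the disclosed architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: d_in argument-pair features, d_f shared features,
# 4 relation classes (head 1), 10 connective classes (head 2).
d_in, d_f, n_rel, n_conn = 16, 8, 4, 10
W_shared = rng.standard_normal((d_in, d_f)) * 0.1   # shared feature extraction layer
W_rel = rng.standard_normal((d_f, n_rel)) * 0.1     # classification layer 1
W_conn = rng.standard_normal((d_f, n_conn)) * 0.1   # classification layer 2

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def student_forward(x):
    f = np.tanh(x @ W_shared)     # f^S: features shared by both tasks
    y_rel = softmax(f @ W_rel)    # main task: implicit discourse relation
    y_conn = softmax(f @ W_conn)  # auxiliary task: connective classification
    return f, y_rel, y_conn

x = rng.standard_normal((2, d_in))  # a toy batch of 2 argument-pair encodings
f, y_rel, y_conn = student_forward(x)
```

Both heads read the same feature vector, which is what lets gradients from the auxiliary task shape the features used by the main task.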
When training the multi-task learning student model, in order to fit the training instances as well as possible, it is desirable to minimize the cost function based on multitask learning, i.e., to simultaneously minimize the cross-entropy classification cost function corresponding to implicit discourse relation recognition and the cross-entropy classification cost function corresponding to connective classification.
Specifically, the cross-entropy classification cost function corresponding to implicit discourse relation recognition is expressed as:

$$J_{rel}(\theta_S) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_r^{\top} \log \hat{\mathbf{y}}^{S_1}$$

where $\theta_S$ denotes the parameters of the student model, $r$ is the annotated implicit discourse relation category, $\mathbf{y}_r$ is the one-hot encoding corresponding to $r$, $\hat{\mathbf{y}}^{S_1}$ denotes the prediction result for the implicit discourse relation obtained after classification layer 1 of the student model, and $D$ is the training instance set.
The cross-entropy classification cost function corresponding to connective classification is expressed as:

$$J_{conn}(\theta_S) = -\sum_{(a_1, a_2, c, r) \in D} \mathbf{y}_c^{\top} \log \hat{\mathbf{y}}^{S_2}$$

where $\theta_S$ denotes the parameters of the multi-task learning student model, $c$ is the annotated connective, $\mathbf{y}_c$ is the one-hot encoding corresponding to $c$, $\hat{\mathbf{y}}^{S_2}$ denotes the prediction result for the connective obtained after classification layer 2 of the student model, and $D$ is the training instance set.
In order to learn, from the teacher model, the classification knowledge fused with connective information, the invention adopts a knowledge distillation method, the basic idea of which is to make the student model imitate the behavior of the teacher model as closely as possible.
On the one hand, it is desirable that the features learned by the multi-task learning student model and by the connective-enhanced teacher model, $\mathbf{f}^S$ and $\mathbf{f}^T$, be as close as possible, thereby transferring knowledge between the two models at the feature extraction layer. Since the recognition performance of the teacher model on the PDTB data set is much higher than that of the student model, the teacher features $\mathbf{f}^T$ contain more information useful for implicit discourse relation recognition than the student features $\mathbf{f}^S$.
Specifically, the cost function corresponding to feature-extraction-layer knowledge distillation in the student model is defined as:

$$J_{feat}(\theta_S) = \sum_{(a_1, a_2, c, r) \in D} \mathrm{MSE}\!\left(\mathbf{f}^T, \mathbf{f}^S\right)$$

where $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean squared error, $\theta_S$ denotes the parameters of the student model, $\mathbf{f}^T$ denotes the features obtained after the feature extraction layer of the connective-enhanced teacher model, $\mathbf{f}^S$ denotes the features obtained after the feature extraction layer of the multi-task learning student model, and $D$ is the training instance set.
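A minimal numpy sketch of the per-instance feature-distillation term (the feature values below are invented for the example):

```python
import numpy as np

def feature_distillation_loss(f_teacher, f_student):
    """Mean squared error between teacher and student feature vectors."""
    return np.mean((f_teacher - f_student) ** 2)

f_t = np.array([0.5, -0.2, 0.9])   # f^T: teacher features (toy values)
f_s = np.array([0.4, -0.1, 0.7])   # f^S: student features (toy values)
loss = feature_distillation_loss(f_t, f_s)
```

Minimizing this term pulls the student's feature vector toward the teacher's for each training instance.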
On the other hand, it is desirable that the final prediction results of the multi-task learning student model and the connective-enhanced teacher model, $\hat{\mathbf{y}}^{S_1}$ and $\hat{\mathbf{y}}^T$, be as close as possible, thereby transferring knowledge between the two models at the classification layer. The true category label represented by one-hot encoding, $\mathbf{y}_r$, can be regarded as a hard label, while the prediction result of the teacher model, $\hat{\mathbf{y}}^T$, can be regarded as a soft label; a soft label is generally considered to contain more category information, for example similarity information between categories. Specifically, the cost function corresponding to classification-layer knowledge distillation in the multi-task learning student model is defined as:

$$J_{cls}(\theta_S) = \sum_{(a_1, a_2, c, r) \in D} \mathrm{KL}\!\left(\hat{\mathbf{y}}^T \,\|\, \hat{\mathbf{y}}^{S_1}\right)$$

where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) distance between two probability distributions, $(a_1, a_2, c, r)$ is an implicit discourse relation training instance with connective information, $\hat{\mathbf{y}}^T$ denotes the prediction result obtained after the classification layer of the connective-enhanced teacher model, and $\hat{\mathbf{y}}^{S_1}$ denotes the prediction result obtained after classification layer 1 of the multi-task learning student model.
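A minimal numpy sketch of the classification-layer distillation term, with invented three-class distributions (the patent does not disclose concrete values):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; the teacher's prediction p is the soft label."""
    p = np.asarray(p)
    q = np.asarray(q)
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))  # small epsilon avoids log(0)

teacher_pred = np.array([0.7, 0.2, 0.1])   # soft label from the teacher
student_pred = np.array([0.5, 0.3, 0.2])   # student classification layer 1
loss = kl_divergence(teacher_pred, student_pred)
```

Unlike a one-hot hard label, the soft label keeps non-zero mass on the wrong classes, which is how between-category similarity information reaches the student.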
Finally, the total cost function of the multi-task learning student model is defined as a linear combination of the cost function based on multitask learning and the cost function based on knowledge distillation.

Specifically, the total cost function of the multi-task learning student model is expressed as:

$$J(\theta_S) = \alpha \left( J_{rel}(\theta_S) + J_{conn}(\theta_S) \right) + \beta \left( J_{feat}(\theta_S) + J_{cls}(\theta_S) \right)$$

where $\theta_S$ denotes the parameters of the student model, and $\alpha$ and $\beta$ are the weight coefficients of the cost function based on multitask learning and the cost function based on knowledge distillation, respectively. The cost function based on multi-task learning comprises two parts: $J_{rel}$, the cross-entropy cost function corresponding to implicit discourse relation recognition, and $J_{conn}$, the cross-entropy cost function corresponding to connective classification. The cost function based on knowledge distillation comprises two parts: $J_{feat}$, the cost function corresponding to feature-extraction-layer knowledge distillation, and $J_{cls}$, the cost function corresponding to classification-layer knowledge distillation.
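The combination can be sketched as a plain weighted sum; the loss values and the weights alpha and beta below are invented (the patent treats the weights as coefficients but does not disclose their values):

```python
def total_cost(j_rel, j_conn, j_feat, j_cls, alpha=1.0, beta=1.0):
    """Weighted sum of the multitask losses and the distillation losses.

    alpha and beta are assumed scalar trade-off weights; their values are
    not disclosed, so they are treated as tunable hyperparameters here.
    """
    j_mtl = j_rel + j_conn   # multitask-learning part
    j_kd = j_feat + j_cls    # knowledge-distillation part
    return alpha * j_mtl + beta * j_kd

loss = total_cost(0.52, 1.10, 0.02, 0.09, alpha=1.0, beta=0.5)
```

Setting beta to 0 recovers a pure multitask model, which makes the distillation terms easy to ablate.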
And S104, iteratively minimizing the total cost function of the student model until convergence, so as to output the trained student model and then recognize the implicit discourse relation of test instances.
Algorithm 1 describes the training process of the discourse relation identification method based on knowledge distillation and multitask learning.
Specifically, the whole training process is divided into two stages: the first stage is based on a cost functionTraining a teacher model reinforced by connecting words (steps 1-5), and in the second stage, based on a cost functionTraining a multitask student model (step 6-12). For simplicity, the step of judging whether the model converges or not based on the verification data set is omitted in the algorithm 1, and the finally trained multi-task learning student model is the required implicit discourse relation identification model.
Algorithm 1: training algorithm
Output: the trained multi-task learning student model
2. Repeat:
7. Repeat:
9. Compute the corresponding features $f^T$ with the trained connecting-word-enhanced teacher model
10. Compute the corresponding predictions $\hat{y}^T$ with the trained connecting-word-enhanced teacher model
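The two-stage procedure of Algorithm 1 can be sketched as below; the `step`, `features` and `predict` interfaces are hypothetical stand-ins for the actual optimization and inference routines:

```python
def train(teacher, student, data, epochs=10):
    # Stage 1 (steps 1-5): train the connecting-word-enhanced teacher model
    # on instances (arguments x, connecting word c, relation label y).
    for _ in range(epochs):
        for x, c, y in data:
            teacher.step(x, c, y)              # minimize the teacher cost J_T
    # Stage 2 (steps 6-12): train the multi-task student model; the trained
    # teacher supplies features and predictions as distillation targets.
    for _ in range(epochs):
        for x, c, y in data:
            f_t = teacher.features(x, c)       # step 9: teacher features
            y_t = teacher.predict(x, c)        # step 10: teacher predictions
            student.step(x, c, y, f_t, y_t)    # minimize the student cost J_S
    return student
```

Convergence checks against a validation set, omitted in Algorithm 1, would replace the fixed epoch counts in practice.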
The two-way attention mechanism classification model adopted in the present invention is commonly used to model the semantic relationship between two sentences, for example in textual entailment recognition, automatic question answering, and sentence semantic matching.
Referring to fig. 4, the two-way attention mechanism classification model comprises an encoding layer, an interaction layer, an aggregation layer and a classification layer, of which the encoding, interaction and aggregation layers together constitute the feature extraction layer. The encoding layer learns the contextual representation of each word in an argument and is expressed as follows:
$\bar{a}_i = \mathrm{BiLSTM}(a_1, \dots, a_{\ell_a}, i), \quad \bar{b}_j = \mathrm{BiLSTM}(b_1, \dots, b_{\ell_b}, j)$

where $a_i$ and $\bar{a}_i$ are respectively the word vector of the $i$-th word in argument 1 and its representation in context, $b_j$ and $\bar{b}_j$ are respectively the word vector of the $j$-th word in argument 2 and its representation in context, $\ell_a$ and $\ell_b$ are respectively the numbers of words in the two arguments, and both encoders are bidirectional long short-term memory (BiLSTM) networks.
The interaction layer is represented as:
$e_{ij} = F(\bar{a}_i)^{\top} F(\bar{b}_j)$

$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})}\, \bar{b}_j, \quad \tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})}\, \bar{a}_i$

$m^{a}_i = G([\bar{a}_i; \tilde{a}_i]), \quad m^{b}_j = G([\bar{b}_j; \tilde{b}_j])$

where $F$ is a fully connected multi-layer feed-forward neural network and $e_{ij}$ is the relevance weight between the $i$-th word of argument 1 and the $j$-th word of argument 2; $\tilde{a}_i$ is the representation of the words in argument 2 related to the $i$-th word of argument 1, $\tilde{b}_j$ is the representation of the words in argument 1 related to the $j$-th word of argument 2, $G$ is another fully connected multi-layer feed-forward neural network, $[\cdot;\cdot]$ is the concatenation of representation vectors, and $m^{a}_i$ and $m^{b}_j$ can be regarded as the learned local semantic relation representations.
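A minimal pure-Python sketch of the bidirectional attention computation, using raw dot products as relevance weights in place of the feed-forward networks $F$ and $G$ (an assumption made here for brevity):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bidirectional_attention(A, B):
    """A, B: contextual word representations of argument 1 / argument 2
    (lists of equal-length vectors). Returns the aligned representations
    of the interaction layer."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    E = [[dot(a, b) for b in B] for a in A]        # relevance weights e_ij
    dim = len(A[0])
    # For each word of argument 1, attend over argument 2.
    A_aligned = []
    for i in range(len(A)):
        w = softmax(E[i])
        A_aligned.append([sum(w[j] * B[j][k] for j in range(len(B)))
                          for k in range(dim)])
    # For each word of argument 2, attend over argument 1.
    B_aligned = []
    for j in range(len(B)):
        w = softmax([E[i][j] for i in range(len(A))])
        B_aligned.append([sum(w[i] * A[i][k] for i in range(len(A)))
                          for k in range(dim)])
    return A_aligned, B_aligned
```

In the full model each aligned vector would then be concatenated with its original representation and passed through $G$ to obtain the local relation vectors.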
The aggregation layer calculates the global semantic relation based on the local semantic relation expression. The expression is specifically as follows:
$f = \big[\max_i m^{a}_i;\ \tfrac{1}{\ell_a}\textstyle\sum_i m^{a}_i;\ \max_j m^{b}_j;\ \tfrac{1}{\ell_b}\textstyle\sum_j m^{b}_j\big]$

where $f$ denotes the features extracted by the feature extraction layer, written as $f^{S}$ in the student model and $f^{T}$ in the teacher model.
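The max- and average-pooling aggregation can be sketched as follows; the local relation vectors are assumed to be plain Python lists, and the function name is illustrative:

```python
def aggregate(M_a, M_b):
    """Concatenate max- and average-pooled local relation vectors of the
    two arguments into one global feature vector f."""
    def max_pool(M):
        return [max(row[k] for row in M) for k in range(len(M[0]))]
    def avg_pool(M):
        return [sum(row[k] for row in M) / len(M) for k in range(len(M[0]))]
    return max_pool(M_a) + avg_pool(M_a) + max_pool(M_b) + avg_pool(M_b)

f = aggregate([[1.0, 2.0], [3.0, 0.0]], [[2.0, 2.0]])
```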
in addition, the classification layer is used to calculate the final classification result. The details are as follows:
$\hat{y} = H(f)$

where $H$ consists of a fully connected multi-layer feed-forward neural network and a softmax layer, and $\hat{y}$ is the final classification result.
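A single linear layer followed by softmax, as a minimal one-layer stand-in for the classification network $H$ (the weights and input here are illustrative assumptions):

```python
import math

def classify(f, W, b):
    """Linear layer plus softmax: a one-layer stand-in for H."""
    logits = [sum(w * x for w, x in zip(row, f)) + bk
              for row, bk in zip(W, b)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]   # classification distribution y_hat

y_hat = classify([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```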
The connecting-word-enhanced teacher model can be constructed directly from the two-way attention mechanism classification model; it only needs the connecting word as enhanced input, i.e., the model input is $(x, c)$. Specifically, the connecting word $c$ is spliced at the beginning of argument 2 in $x$ to form the new argument 2. The learned features are denoted $f^{T}$ and the prediction result is denoted $\hat{y}^{T}$.
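Constructing the teacher input by splicing the connecting word at the beginning of argument 2 can be sketched as below; the token lists and the example connective are illustrative:

```python
def teacher_input(arg1_tokens, arg2_tokens, connective):
    """Return (argument 1, new argument 2), where the connecting word is
    spliced at the beginning of argument 2."""
    return arg1_tokens, [connective] + arg2_tokens

arg1 = ["it", "rained", "heavily"]
arg2 = ["the", "match", "was", "cancelled"]
_, new_arg2 = teacher_input(arg1, arg2, "so")
```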
For the multi-task learning student model, the two-way attention mechanism classification model only needs a simple extension: the implicit discourse relation identification task and the connecting-word classification task share the feature extraction layer but use separate classification layers. Specifically, for an input instance $x$, the features obtained from the shared feature extraction layer are $f^{S}$. The prediction corresponding to implicit discourse relation identification is then computed by classification layer 1 as:

$\hat{y}^{S_1} = H_1(f^{S})$

where $H_1$ consists of a fully connected multi-layer feed-forward neural network and a softmax layer. The prediction corresponding to connecting-word classification is computed by classification layer 2 as:

$\hat{y}^{S_2} = H_2(f^{S})$

where $H_2$ likewise consists of a fully connected multi-layer feed-forward neural network and a softmax layer.
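The shared-feature, two-head structure of the student can be sketched as below; `extract`, `head_relation` and `head_connective` are placeholder callables, not the patent's actual networks:

```python
def make_student(extract, head_relation, head_connective):
    """Multi-task student: one shared feature extraction layer and two
    separate classification layers (relation head, connecting-word head)."""
    def forward(x):
        f_s = extract(x)                       # shared features f^S
        return head_relation(f_s), head_connective(f_s)
    return forward
```

Because both heads consume the same `f_s`, gradients from the auxiliary connecting-word task flow back into the shared extractor during training, which is exactly the parameter-sharing mechanism described above.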
The invention provides a discourse relation identification method based on knowledge distillation and multi-task learning. It takes implicit discourse relation instances annotated with connecting words and categories as training instances, aiming to make full use of the connecting words inserted during corpus annotation. First, a connecting-word-enhanced teacher model is constructed on the basis of the two-way attention mechanism classification model; with the connecting words as additional input, its cost function is iteratively minimized until convergence to obtain the trained teacher model. Then the multi-task student model is constructed and trained: a total cost function is built from the multi-task learning and knowledge distillation methods and iteratively minimized until convergence, yielding the trained multi-task student model.
In the discourse relation identification method based on knowledge distillation and multitask learning, knowledge is shared between the connecting-word classification auxiliary task and the implicit discourse relation identification main task through parameter sharing (a shared feature extraction layer). At the same time, knowledge in the connecting-word-enhanced teacher model is transferred, at both the feature extraction layer and the classification layer, to the corresponding implicit discourse relation identification model (the multi-task learning student model) by knowledge distillation. The connecting-word information inserted during corpus annotation is thus fully exploited to improve the identification performance of the student model. On the commonly used PDTB data set, the proposed method achieves better identification performance on first-level and second-level implicit discourse relations than comparable methods.
Referring to fig. 5, the discourse relation identification device based on knowledge distillation and multitask learning according to the second embodiment of the present invention includes a training input module 111, a first construction module 112, a second construction module 113 and a training output module 114, which are connected in sequence;
wherein the training input module 111 is specifically configured to:
taking an implicit discourse relation instance labeled with the connection words and implicit discourse relation categories as a training instance;
the first construction module 112 is specifically configured to:
constructing a teacher model reinforced by connecting words based on a bidirectional attention mechanism classification model, and carrying out iterative minimization processing on a cost function corresponding to the teacher model reinforced by the connecting words by taking the connecting words as additional input until convergence to obtain a trained teacher model;
the second construction module 113 is specifically configured to:
constructing a multi-task learning student model based on the two-way attention mechanism classification model, introducing connection word classification as an auxiliary task to determine a cost function based on multi-task learning, calculating the characteristics and the prediction result of a training example by using the trained teacher model to determine a cost function based on knowledge distillation, and then determining a total cost function of the student model;
the training output module 114 is specifically configured to:
and iteratively minimizing the total cost function of the student model until convergence so as to output the trained student model, and further identifying the implicit discourse relation of the test case.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A discourse relation identification method based on knowledge distillation and multitask learning is characterized by comprising the following steps:
taking an implicit discourse relation instance labeled with the connection words and implicit discourse relation categories as a training instance;
constructing a teacher model reinforced by connecting words based on a bidirectional attention mechanism classification model, and carrying out iterative minimization processing on a cost function corresponding to the teacher model reinforced by the connecting words by taking the connecting words as additional input until convergence to obtain a trained teacher model;
constructing a multi-task learning student model based on the two-way attention mechanism classification model, introducing connection word classification as an auxiliary task to determine a cost function based on multi-task learning, calculating the characteristics and the prediction result of a training example by using the trained teacher model to determine a cost function based on knowledge distillation, and then determining a total cost function of the student model;
and iteratively minimizing the total cost function of the student model until convergence so as to output the trained student model, and further identifying the implicit discourse relation of the test case.
2. The knowledge distillation and multitask learning-based discourse relation identification method as claimed in claim 1, wherein in the training instances, an implicit discourse relation instance labeled with the connecting word and the implicit discourse relation category is represented as $(x, c, y)$, where $x$ denotes the argument pair, $c$ the annotated connecting word, and $y$ the annotated implicit discourse relation category;
3. The knowledge distillation and multitask learning-based discourse relation identification method as claimed in claim 2, wherein in said connecting-word-enhanced teacher model, the input is $(x, c)$ and the corresponding cost function is expressed as:

$J_T(\theta_T) = -\sum_{(x, c, y) \in D} \mathbf{e}_y^{\top} \log \hat{y}^{T}$

where $\theta_T$ are the parameters of the teacher model, $\mathbf{e}_y$ is the one-hot code corresponding to the annotated implicit discourse relation category $y$, $\mathbf{e}_y^{\top} \log \hat{y}^{T}$ is the expected value of the log prediction with respect to the annotated category, $\hat{y}^{T}$ is the prediction obtained from the classification layer of the connecting-word-enhanced teacher model, and $D$ is the training instance set.
4. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 2, wherein in the multitask learning student model, the total cost function of the student model is expressed as:

$J_S(\theta_S) = \lambda_1 (J_{\mathrm{idr}} + J_{\mathrm{con}}) + \lambda_2 (J_{\mathrm{fea}} + J_{\mathrm{cls}})$

where $J_S(\theta_S)$ is the total cost function of the student model, $\theta_S$ are the parameters of the student model, and $\lambda_1$ and $\lambda_2$ are respectively the weight coefficients of the cost function based on multi-task learning and the cost function based on knowledge distillation;

the cost function based on multi-task learning comprises two parts: $J_{\mathrm{idr}}$, the cross-entropy cost function corresponding to implicit discourse relation identification, and $J_{\mathrm{con}}$, the cross-entropy cost function corresponding to connecting-word classification; the cost function based on knowledge distillation comprises two parts: $J_{\mathrm{fea}}$, the cost function corresponding to knowledge distillation at the feature extraction layer, and $J_{\mathrm{cls}}$, the cost function corresponding to knowledge distillation at the classification layer.
5. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 4, wherein in said multitask learning student model, the input is $x$ and the cross-entropy cost function corresponding to implicit discourse relation identification is expressed as:

$J_{\mathrm{idr}} = -\sum_{(x, c, y) \in D} \mathbf{e}_y^{\top} \log \hat{y}^{S_1}$

where $\mathbf{e}_y$ is the one-hot code corresponding to the annotated implicit discourse relation category $y$, $\mathbf{e}_y^{\top} \log \hat{y}^{S_1}$ is the expected value of the log prediction with respect to the annotated category, $\hat{y}^{S_1}$ is the prediction corresponding to implicit discourse relation identification obtained from classification layer 1 of the multi-task learning student model, and $D$ is the training instance set.
6. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 4, wherein the cross-entropy cost function corresponding to connecting-word classification in the multitask learning student model is expressed as:

$J_{\mathrm{con}} = -\sum_{(x, c, y) \in D} \mathbf{e}_c^{\top} \log \hat{y}^{S_2}$

where $\mathbf{e}_c$ is the one-hot code corresponding to the annotated connecting word $c$, $\mathbf{e}_c^{\top} \log \hat{y}^{S_2}$ is the expected value of the log prediction with respect to the annotated connecting word, $\hat{y}^{S_2}$ is the prediction corresponding to connecting-word classification obtained from classification layer 2 of the student model, and $D$ is the training instance set.
7. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 4, wherein the cost function corresponding to knowledge distillation at the feature extraction layer in the multitask learning student model is expressed as:

$J_{\mathrm{fea}} = \sum_{(x, c, y) \in D} \mathrm{MSE}(f^{T}, f^{S})$

where $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean squared error, $f^{T}$ denotes the features obtained from the feature extraction layer of the connecting-word-enhanced teacher model, $f^{S}$ denotes the features obtained from the feature extraction layer of the multi-task learning student model, and $D$ is the training instance set.
8. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 4, wherein the cost function corresponding to knowledge distillation at the classification layer in the multitask learning student model is expressed as:

$J_{\mathrm{cls}} = \sum_{(x, c, y) \in D} \mathrm{KL}(\hat{y}^{T} \,\|\, \hat{y}^{S_1})$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence between two probability distributions, $\hat{y}^{T}$ is the prediction obtained from the classification layer of the connecting-word-enhanced teacher model, and $\hat{y}^{S_1}$ is the prediction obtained from classification layer 1 of the multi-task learning student model.
9. The knowledge distillation and multitask learning-based discourse relation identification method according to claim 1, wherein the bidirectional attention mechanism classification model comprises an encoding layer, an interaction layer, an aggregation layer and a classification layer, wherein the encoding layer is used for learning the representation of words in arguments in context, and the encoding layer is expressed as:

$\bar{a}_i = \mathrm{BiLSTM}(a_1, \dots, a_{\ell_a}, i), \quad \bar{b}_j = \mathrm{BiLSTM}(b_1, \dots, b_{\ell_b}, j)$

where $a_i$ and $\bar{a}_i$ are respectively the word vector of the $i$-th word in argument 1 and its representation in context, $b_j$ and $\bar{b}_j$ are respectively the word vector of the $j$-th word in argument 2 and its representation in context, $\ell_a$ and $\ell_b$ are respectively the numbers of words in the two arguments, and both encoders are bidirectional long short-term memory networks.
10. An apparatus for identifying discourse relation based on knowledge distillation and multitask learning, the apparatus comprising:
the training input module is used for taking an implicit discourse relation example marked with the connection words and the implicit discourse relation category as a training example;
the first construction module is used for constructing a teacher model reinforced by connecting words based on a bidirectional attention mechanism classification model, taking the connecting words as additional input, and performing iterative minimization processing on a cost function corresponding to the teacher model reinforced by the connecting words until convergence to obtain a trained teacher model;
the second construction module is used for constructing a multi-task learning student model based on the two-way attention mechanism classification model, introducing connection word classification as an auxiliary task to determine a cost function based on multi-task learning, calculating the characteristics and the prediction result of a training example by using the trained teacher model to determine a cost function based on knowledge distillation, and then determining a total cost function of the student model;
and the training output module is used for iteratively minimizing the total cost function of the student model until convergence so as to output the trained student model and further identify the implicit discourse relation of the test case.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110078740.7A CN112395876B (en) | 2021-01-21 | 2021-01-21 | Knowledge distillation and multitask learning-based chapter relationship identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112395876A true CN112395876A (en) | 2021-02-23 |
CN112395876B CN112395876B (en) | 2021-04-13 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113177415A (en) * | 2021-04-30 | 2021-07-27 | 科大讯飞股份有限公司 | Semantic understanding method and device, electronic equipment and storage medium |
CN113377915A (en) * | 2021-06-22 | 2021-09-10 | 厦门大学 | Dialogue chapter analysis method |
CN115271272A (en) * | 2022-09-29 | 2022-11-01 | 华东交通大学 | Click rate prediction method and system for multi-order feature optimization and mixed knowledge distillation |
CN116028630A (en) * | 2023-03-29 | 2023-04-28 | 华东交通大学 | Implicit chapter relation recognition method and system based on contrast learning and Adapter network |
CN116432752A (en) * | 2023-04-27 | 2023-07-14 | 华中科技大学 | Construction method and application of implicit chapter relation recognition model |
CN113177415B (en) * | 2021-04-30 | 2024-06-07 | 科大讯飞股份有限公司 | Semantic understanding method and device, electronic equipment and storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8265925B2 (en) * | 2001-11-15 | 2012-09-11 | Texturgy As | Method and apparatus for textual exploration discovery |
CN107273358A (en) * | 2017-06-18 | 2017-10-20 | 北京理工大学 | A kind of end-to-end English structure of an article automatic analysis method based on pipe modes |
CN107330032A (en) * | 2017-06-26 | 2017-11-07 | 北京理工大学 | A kind of implicit chapter relationship analysis method based on recurrent neural network |
US10303771B1 (en) * | 2018-02-14 | 2019-05-28 | Capital One Services, Llc | Utilizing machine learning models to identify insights in a document |
CN110633473A (en) * | 2019-09-25 | 2019-12-31 | 华东交通大学 | Implicit discourse relation identification method and system based on conditional random field |
US10606952B2 (en) * | 2016-06-24 | 2020-03-31 | Elemental Cognition Llc | Architecture and processes for computer learning and understanding |
CN111428525A (en) * | 2020-06-15 | 2020-07-17 | 华东交通大学 | Implicit discourse relation identification method and system and readable storage medium |
CN111538841A (en) * | 2020-07-09 | 2020-08-14 | 华东交通大学 | Comment emotion analysis method, device and system based on knowledge mutual distillation |
EP3699753A1 (en) * | 2019-02-25 | 2020-08-26 | Atos Syntel, Inc. | Systems and methods for virtual programming by artificial intelligence |
CN111651974A (en) * | 2020-06-23 | 2020-09-11 | 北京理工大学 | Implicit discourse relation analysis method and system |
CN111695341A (en) * | 2020-06-16 | 2020-09-22 | 北京理工大学 | Implicit discourse relation analysis method and system based on discourse structure diagram convolution |
EP3593262A4 (en) * | 2017-03-10 | 2020-12-09 | Eduworks Corporation | Automated tool for question generation |
Non-Patent Citations (1)
Title |
---|
朱珊珊等: "面向不平衡数据的隐式篇章关系分类方法研究", 《中文信息学报》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||