CN112732879B - Downstream task processing method and model of question-answering task - Google Patents
Info
- Publication number
- CN112732879B (application CN202011539404.XA)
- Authority
- CN
- China
- Prior art keywords
- context
- question
- representation
- vector
- ckey
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a downstream task processing method and model for a question-answering task. A key-information-aware context representation H_CKey and a key-information-aware question representation H_QKey are obtained, and a question-aware context representation G is generated; an update vector z and a memory weight g are computed from G, and G is updated to obtain an output vector G_g; a context-granularity vector G_C and a sequence-granularity vector G_CLS are generated, and an output vector C_out is produced; softmax is then used to compute, for each word in the context, the probability of being the start or end position of the answer, and the continuous subsequence with the highest probability is extracted as the answer. The invention proposes a bidirectional stacked attention mechanism, constructs a mechanism that integrates perusal and skimming, and builds a multi-granularity module based on the idea of granular computing, so that the model effectively attends to and filters useful information, better understands the text at multiple granularities, gives more accurate answers, and achieves a new level of performance over the baseline model.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a downstream task processing method and model for a question-answering task.
Background
Machine reading comprehension is a very challenging task in natural language processing, which aims to determine the correct answer to a question from a given context. According to the answer form, common machine reading comprehension tasks are divided into cloze-style, multiple-choice, span-extraction and free-answer tasks. Recently developed pre-trained language models have achieved a series of successes on various natural language understanding tasks by virtue of their powerful text representation capability. Pre-trained language models are used as the encoders of deep learning language models to extract the language association features of the relevant text, and are fine-tuned together with a downstream processing structure specific to a particular task. With the great success of pre-trained language models, attention has focused on the encoder side of deep learning language models, and the development of downstream processing techniques customized for specific tasks has entered a bottleneck. Although one can directly benefit from a variety of powerful encoders with similar structures, applying the general knowledge implicit in large-scale corpora to language models with very large numbers of parameters is time- and resource-consuming. Moreover, language representation encoding techniques are currently developing slowly, which limits further improvement of the performance of pre-trained language models. All of this highlights the importance of developing downstream processing techniques for specific tasks.
In summary, the existing deep learning language models have the following shortcomings: (1) unimportant parts of the text are treated as important while important parts are ignored; (2) there is an over-stability phenomenon, i.e. the model is easily misled by interfering sentences in the text that share many words with the question, matching only on the surface words themselves rather than on the semantics.
Therefore, how to make the model focus on the key information in the text and help the model escape its preference for over-attending to local information in the text has become an urgent problem for those skilled in the art.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the problem to be solved by the invention is: to make the model focus on the key information in the text and to help the model escape its preference for over-attending to local information in the text.
In order to solve the technical problems, the invention adopts the following technical scheme:
a downstream task processing method of a question-answering task comprises the following steps:
S1, input the question and the context into a pre-training language module to obtain the language association features of the context;
S2, using a bidirectional attention mechanism, obtain the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, using bidirectional attention flow, obtain the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, using a gate mechanism, compute an update vector z and a memory weight g from the question-aware context representation G, and update G with z and g to obtain the output vector G_g;
S5, using granular computing, generate a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generate, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after the output vector C_out is processed by a linear layer, use softmax to compute the probability of each word in the context being the start or end position of the answer, and extract the continuous subsequence with the highest probability as the answer.
Preferably, the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, where h_1 to h_s denote the encoded representations of the sequence formed by concatenating the question and the context, and s denotes the length of that sequence. Step S2 comprises:
S201, based on the positions of the question and the context in H, extract the question part H_Q and the context part H_C, where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
S202, construct a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
S203, apply a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2; S_1 denotes, for each context word, the relevance of all question words to it; S_2 denotes, for each question word, the relevance of all context words to it; S_1 = softmax_→(S), S_2 = softmax_↓(S);
S204, highlight the weights of the question keywords and the context keywords;
S205, generate the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
Preferably, step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained, recomputed in the same way as in S202; W_S′ is a trainable matrix;
S302, compute a context representation A based on the question words (this refers to the representations of all question words; averaging over all question words highlights the weights of the keywords among them) and a question-word representation B based on the context words;
S303, splice H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
Preferably, step S3 further includes:
S304, take the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeat steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
Preferably, in step S4:
z = tanh(W_z · G + b_z)
g = sigmoid(W_g[G; A] + b_g)
where W_z and W_g are trainable matrices, and b_z and b_g are biases.
Preferably, step S5 includes:
S501, remove the [PAD] padding part of H_C and average the remainder to obtain the context-granularity vector G_C;
S502, extract the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generate the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively.
A downstream task processing model for a question-answering task, used to implement the above downstream task processing method, comprising:
a pre-training language module for generating the language association features of the context based on the question and the context;
a skimming module for obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
a perusal module for obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
a gate mechanism module for computing an update vector z and a memory weight g from the question-aware context representation G using the gate mechanism, and updating G with z and g to obtain the output vector G_g;
a granular computing module for generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
an answer prediction module for processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
Preferably, the loss function of the downstream task processing model of the question-answering task during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
In summary, compared with the prior art, the invention discloses a downstream task processing method for a question-answering task, which adds a downstream processing structure on top of a pre-trained model, comprising a skimming module, a perusal module and a gate mechanism module, and can simulate the human behaviour of reading a passage several times and comprehensively filtering information when completing a reading comprehension task. The skimming module helps the model determine the key-information-aware context representation and the key-information-aware question representation; the perusal module feeds the vectors output from the encoder into a bidirectional attention flow layer to establish a complete association between the question and the context. Meanwhile, following the idea of granular computing, a multi-granularity module computing the context granularity and the sequence granularity is added to the model and placed in parallel with the word granularity already obtained, so that the model can simulate the human behaviour of comprehending a text from words to sentences and from the local to the whole.
Drawings
FIG. 1 is a flow chart of a method for processing a downstream task of a question-answering task according to the present invention;
FIG. 2 is a block diagram of a downstream task processing model for a question-answering task as disclosed herein;
FIG. 3 is a diagram of the skimming module (bidirectional stacked attention mechanism);
FIG. 4 is a line chart comparing the F1 of RoBERTa and of the present invention;
FIG. 5 is a line chart comparing the EM of RoBERTa and of the present invention;
FIG. 6 is a diagram of the question keywords and their associated context key parts;
FIG. 7 is a diagram of the context keywords and their associated question-word key parts.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention discloses a method for processing a downstream task of a question-answering task, which comprises the following steps:
S1, input the question and the context into a pre-training language module to obtain the language association features of the context;
specifically, the sequence formed by concatenating the question and the context is fed into the pre-training language module, i.e. the encoder, for encoding.
S2, using a bidirectional attention mechanism, obtain the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, using bidirectional attention flow, obtain the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, using a gate mechanism, compute an update vector z and a memory weight g from the question-aware context representation G, and update G with z and g to obtain the output vector G_g;
S5, using granular computing, generate a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generate, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after the output vector C_out is processed by a linear layer, use softmax to compute the probability of each word in the context being the start or end position of the answer, and extract the continuous subsequence with the highest probability as the answer.
Aiming at the prior-art problem that unimportant parts of the text are treated as important while important parts are ignored, the invention proposes a bidirectional stacked attention mechanism, so that the model can perceive the question keywords and their associated context key parts, as well as the context keywords and their associated question key parts. The gate mechanism module lets the model automatically retain the parts of the context related to the question and forget the parts unrelated to it. Through these two mechanisms, the model is forced to focus on the key information in the text.
Aiming at the over-stability problem in the prior art, the granular computing module proposed by the invention adds the context granularity and the sequence granularity to the model, so that the model can understand the text from both local and global perspectives, helping it escape the preference for over-attending to local information in the text.
The skimming module proposed by the invention is inspired by the idea of a stacked attention mechanism, as shown in fig. 3. After the similarity matrix is obtained from the context and question representations, softmax is applied to the matrix row-wise and column-wise, the keyword weights are highlighted, and the attention over the context key parts associated with the question keywords and the attention over the question key parts associated with the context keywords are computed. The context representation and the question-word representation are each multiplied by the corresponding attention matrix and then added to the original representation, yielding the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey.
In a specific implementation, the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, where h_1 to h_s denote the encoded representations of the sequence formed by concatenating the question and the context, and s denotes the length of that sequence. Step S2 comprises:
S201, based on the positions of the question and the context in H, extract the question part H_Q and the context part H_C (the sequence is formed by concatenating the question and the context, so after encoding, the encoded parts of the question and the context must be extracted from the corresponding positions in the sequence for subsequent operations), where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
all questions and contexts are brought to lengths n and m respectively: if a sequence is too short it is padded with [PAD], and if it is too long it is truncated. When encoding with BERT, the sequence must be spliced in the form [CLS] + H_Q + [SEP] + H_C + [SEP]. Subsequent operations only require H_Q and H_C, so [CLS] (i.e. h_1) and the two [SEP] tokens (i.e. h_{n+2} and h_{n+m+3}) are discarded. The sequence consists of the question, the context, [CLS] and two [SEP] tokens; the question length is n and the context length is m, so the sequence length is n + m + 3.
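The following is a minimal, illustrative sketch of this input layout in Python; the token strings and the example lengths are hypothetical, and the real model would use the BERT tokenizer's vocabulary ids rather than raw strings.

```python
# Illustrative sketch of the [CLS] + question + [SEP] + context + [SEP] layout.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def build_sequence(question_tokens, context_tokens, n, m):
    """Pad/truncate the question to n tokens and the context to m tokens, then splice."""
    q = (question_tokens + [PAD] * n)[:n]
    c = (context_tokens + [PAD] * m)[:m]
    seq = [CLS] + q + [SEP] + c + [SEP]      # total length n + m + 3
    h_q = seq[1:n + 1]                       # h_2 ... h_{n+1}
    h_c = seq[n + 2:n + m + 2]               # h_{n+3} ... h_{n+m+2}
    return seq, h_q, h_c

seq, h_q, h_c = build_sequence(["what", "is", "the", "price"],
                               ["the", "price", "is", "low"], n=16, m=24)
assert len(seq) == 16 + 24 + 3 and len(h_q) == 16 and len(h_c) == 24
```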
S202, construct a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
in simplified form, the concrete computation is S = W_a*H_C + W_b*H_Q + W_c*H_C*H_Q + bias, where * denotes matrix multiplication and bias is a bias term. W_S here stands for W_a, W_b and W_c together, and the final shape of the S matrix is [b, m, n], where b denotes the batch size.
S203, apply a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2; S_1 denotes, for each context word, the relevance of all question words to it; S_2 denotes, for each question word, the relevance of all context words to it; S_1 = softmax_→(S), S_2 = softmax_↓(S);
S204, highlight the weights of the question keywords and the context keywords;
for each context word, S_1 gives the relevance of all question words to it, so averaging along the context dimension highlights the weights of the keywords among the question words: the more critical a question word is, the greater its relevance to every context word and hence the greater its average value, which yields the question keyword weights S_Qkey. Averaging S_2 along the question dimension in the same way highlights the keyword weights S_Ckey in the context.
S205, generate the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
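The skimming module (steps S202–S205) can be sketched in PyTorch roughly as follows. The factorisation of W_S into three linear maps and the exact tensor shapes are assumptions made for illustration; they follow the simplified form S = W_a*H_C + W_b*H_Q + W_c*H_C*H_Q + bias given above, not a verified reference implementation.

```python
import torch
import torch.nn as nn

class SkimmingAttention(nn.Module):
    """Bidirectional stacked attention (steps S202-S205), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        # W_S factorised into three weights: S = Wa*Hc + Wb*Hq + Wc*(Hc x Hq) + bias
        self.w_a = nn.Linear(hidden, 1)               # also contributes the bias term
        self.w_b = nn.Linear(hidden, 1, bias=False)
        self.w_c = nn.Linear(hidden, 1, bias=False)

    def forward(self, h_c, h_q):                      # h_c: [b, m, d], h_q: [b, n, d]
        # S202: similarity matrix S of shape [b, m, n]
        s = (self.w_a(h_c)                            # [b, m, 1], broadcast over n
             + self.w_b(h_q).transpose(1, 2)          # [b, 1, n], broadcast over m
             + torch.matmul(h_c * self.w_c.weight, h_q.transpose(1, 2)))
        # S203: row-wise and column-wise softmax
        s1 = torch.softmax(s, dim=2)                  # question-word relevance per context word
        s2 = torch.softmax(s, dim=1)                  # context-word relevance per question word
        # S204: highlight keyword weights by averaging
        s_qkey = s1.mean(dim=1)                       # [b, n], question keyword weights
        s_ckey = s2.mean(dim=2)                       # [b, m], context keyword weights
        # S205: A_C = S_2 . S_Qkey, A_Q = S_1 . S_Ckey, then residual re-weighting
        a_c = torch.matmul(s2, s_qkey.unsqueeze(-1))                  # [b, m, 1]
        a_q = torch.matmul(s1.transpose(1, 2), s_ckey.unsqueeze(-1))  # [b, n, 1]
        h_ckey = h_c + h_c * a_c
        h_qkey = h_q + h_q * a_q
        return h_ckey, h_qkey
```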
In a specific implementation, step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained, recomputed in the same way as in S202; W_S′ is a trainable matrix;
S302, compute a context representation A based on the question words and a question-word representation B based on the context words;
S303, splice H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
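A sketch of the perusal module (step S3) is given below. The patent text does not spell out exactly how the attended representations A and B are formed from S′, so the sketch assumes BiDAF-style context-to-question and question-to-context attention; this is an assumption, not the definitive construction.

```python
import torch
import torch.nn as nn

class PerusalAttention(nn.Module):
    """Bidirectional attention flow (step S3), a sketch; A and B follow BiDAF-style
    attention because their exact composition is not spelled out in the text."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w_a = nn.Linear(hidden, 1)
        self.w_b = nn.Linear(hidden, 1, bias=False)
        self.w_c = nn.Linear(hidden, 1, bias=False)
        self.w3 = nn.Linear(3 * hidden, hidden)

    def forward(self, h_ckey, h_qkey):                # [b, m, d], [b, n, d]
        # recompute the similarity matrix S' on the key-aware representations (as in S202)
        s = (self.w_a(h_ckey)
             + self.w_b(h_qkey).transpose(1, 2)
             + torch.matmul(h_ckey * self.w_c.weight, h_qkey.transpose(1, 2)))  # [b, m, n]
        s1 = torch.softmax(s, dim=2)                  # per context word over question words
        s2 = torch.softmax(s, dim=1)                  # per question word over context words
        # S302: A = context representation based on question words,
        #       B = question-to-context summary (assumed BiDAF-style)
        a = torch.matmul(s1, h_qkey)                                    # [b, m, d]
        b = torch.matmul(torch.matmul(s1, s2.transpose(1, 2)), h_ckey)  # [b, m, d]
        # S303: splice and project to the question-aware context representation G
        g = self.w3(torch.cat([h_ckey, h_ckey * a, h_ckey * b], dim=-1))
        return g, a                                   # A is reused later by the gate module
```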
When a human being completes a reading comprehension task, he or she often reads the passage several times to deepen the understanding of the text. The model simulates this behaviour by passing through the skimming module and the perusal module multiple times: skimming grasps the key information in the question and the passage, while perusal further grasps the gist of the text and filters the important information that matches the question. By reading repeatedly and continually adjusting the key information it has identified, the model obtains a more comprehensive context representation and finally determines the answer to the question. The invention uses a multi-hop loop mechanism to simulate the human behaviour of reading a text repeatedly, helping the model deepen its understanding of the text. The experimental data below also show that the multi-hop loop mechanism helps to improve model performance.
In specific implementation, step S3 further includes:
S304, take the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeat steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
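A compact sketch of this multi-hop loop, reusing the module sketches above; the default of three hops mirrors the Multihop-3 setting reported in the experiments, and the wiring of which outputs are fed back is an assumption based on step S304.

```python
def multi_hop(h_c, h_q, skim, peruse, hops=3):
    """Multi-hop reading (S304): feed H_QKey and G back in as the next hop's H_Q and H_C.
    `skim` and `peruse` are instances of the SkimmingAttention / PerusalAttention sketches."""
    g, a = None, None
    for _ in range(hops):
        h_ckey, h_qkey = skim(h_c, h_q)   # skimming module (S2)
        g, a = peruse(h_ckey, h_qkey)     # perusal module (S3)
        h_c, h_q = g, h_qkey              # recycle for the next hop
    return g, a
```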
In the reading comprehension process, various structures adopt gate-like mechanisms, such as LSTM, GRU and various Reader models, to simulate the human behaviour of filtering and memorizing important content and ignoring unimportant content after repeated reading. The model judges which parts need to be memorized or forgotten, generates an update vector, and updates its memorized result. In the invention, the question-aware context representation G and the key-information-aware question representation H_QKey are fed into the gate mechanism so that the model can judge which parts need to be memorized or forgotten; an update vector z is generated from G, and the model's memorized result is updated. G and A are merged and fed into a linear layer with a sigmoid, so that when a part of G is more relevant to the question content, the memory weight g approaches 1 and more of the relevant information is retained.
In the specific implementation, in step S4:
z = tanh(W_z · G + b_z)
g = sigmoid(W_g[G; A] + b_g)
where W_z and W_g are trainable matrices, and b_z and b_g are biases.
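A sketch of the gate mechanism module follows. The two formulas above fix z and g, but the text does not write out the rule that combines them into the output G_g, so the convex combination used here (g keeps G, 1 - g admits the update z) is an assumption consistent with the surrounding description of retaining question-relevant content.

```python
import torch
import torch.nn as nn

class GateModule(nn.Module):
    """Gate mechanism (step S4), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w_z = nn.Linear(hidden, hidden)          # W_z, b_z
        self.w_g = nn.Linear(2 * hidden, hidden)      # W_g, b_g

    def forward(self, g_repr, a):                     # G and A, both [b, m, d]
        z = torch.tanh(self.w_z(g_repr))                                # update vector z
        gate = torch.sigmoid(self.w_g(torch.cat([g_repr, a], dim=-1)))  # memory weight g
        g_out = gate * g_repr + (1.0 - gate) * z      # assumed combination rule for G_g
        return g_out
```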
The invention simultaneously adopts a bidirectional stacked attention mechanism and a gate mechanism to force the model to pay more attention to the key information in the text. The hyper-parameters in the experiments can be adjusted according to the performance of the available hardware, and the performance of the model differs under different hyper-parameter settings. The experimental data in this patent are the model performance results obtained under the hyper-parameter settings given here. Under the same hyper-parameter settings, the model of the invention outperforms the other models in the comparison experiments.
Granular computing is an effective approach to structured problem solving. One recognized feature of human intelligence is the ability to observe and analyze the same problem at very different granularities: people can not only solve problems in worlds of different granularity, but also jump quickly from one granularity world to another, and this ability to handle different granularity worlds is a powerful manifestation of human problem solving. A granular computing model divides the object of study into several layers of different granularity, which are inter-related and form a unified whole; different granularities represent different angles and ranges of information. The idea of granular computing helps the model solve problems at multiple granularities and understand the relationship between the local and the whole of the text. The invention understands the text and the relationship between its whole and its parts in terms of word granularity, context granularity and sequence granularity.
In specific implementation, step S5 includes:
S501, remove the [PAD] padding part of H_C and average the remainder to obtain the context-granularity vector G_C;
S502, extract the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generate the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively.
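A sketch of the granular computing module (step S5) in the same PyTorch style; the masking of [PAD] positions via an explicit boolean mask and the broadcasting of the pooled vectors over the token dimension are implementation assumptions, and C_g is taken to be the gated output G_g from step S4.

```python
import torch
import torch.nn as nn

class GranularityModule(nn.Module):
    """Multi-granularity fusion (step S5), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w4 = nn.Linear(hidden, hidden)           # W_4, b_4

    def forward(self, h, h_c, c_g, context_mask):
        # h: full sequence encoding [b, s, d]; h_c: context part [b, m, d]
        # c_g: gated context output from step S4, [b, m, d]
        # context_mask: 1 for real context tokens, 0 for [PAD] positions, [b, m]
        mask = context_mask.unsqueeze(-1).float()
        # S501: context-granularity vector - mean over non-[PAD] context tokens
        g_c = (h_c * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)   # [b, d]
        # S502: sequence-granularity vector - the [CLS] position of H
        g_cls = h[:, 0, :]                                               # [b, d]
        # S503: fuse word, context and sequence granularity
        c_out = self.w4(c_g + g_c.unsqueeze(1) + g_cls.unsqueeze(1))     # [b, m, d]
        return c_out
```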
In addition, the invention also discloses a downstream task processing model for a question-answering task, used to implement the above downstream task processing method, comprising:
a pre-training language module for generating the language association features of the context based on the question and the context;
a skimming module for obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
a perusal module for obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
a gate mechanism module for computing an update vector z and a memory weight g from the question-aware context representation G using the gate mechanism, and updating G with z and g to obtain the output vector G_g;
a granular computing module for generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
an answer prediction module for processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
The invention serves a span-extraction reading comprehension task, and the main model architecture is shown in fig. 2. The downstream structure mainly comprises the following four parts: the skimming module, the perusal module, the gate mechanism module and the granular computing module, where the skimming module and the perusal module are wrapped in a multi-hop mechanism. The skimming module judges the question keywords and their associated context key parts, as well as the context keywords and their associated question-word key parts; the perusal module aligns the question and context information to establish a complete association; the gate mechanism filters, memorizes and updates the key information; and the granular computing module, in parallel with this structure, enables the model to understand the text from multiple angles at the context granularity and the sequence granularity.
In a specific implementation, the loss function of the downstream task processing model of the question-answering task during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
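The answer prediction head and this loss can be sketched as follows; the use of a single linear layer producing separate start and end logits per token is the standard span-extraction head and is assumed here rather than taken verbatim from the patent.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Answer prediction (step S6), a sketch: start/end logits per context token."""
    def __init__(self, hidden: int):
        super().__init__()
        self.span = nn.Linear(hidden, 2)

    def forward(self, c_out):                          # [b, m, d]
        logits = self.span(c_out)                      # [b, m, 2]
        return logits[..., 0], logits[..., 1]          # start logits, end logits

def span_loss(start_logits, end_logits, y_start, y_end):
    """Negative log-likelihood of the true start/end positions, averaged over the batch,
    matching the loss described above."""
    log_p_start = torch.log_softmax(start_logits, dim=-1)
    log_p_end = torch.log_softmax(end_logits, dim=-1)
    nll = -(log_p_start.gather(1, y_start.unsqueeze(1)).squeeze(1)
            + log_p_end.gather(1, y_end.unsqueeze(1)).squeeze(1))
    return nll.mean()
```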
The effect of the technical scheme disclosed by the invention can be verified through the following experiments:
the pre-training language model RoBERTA is used as the Encoder of the model and is used as a baseline model, and the pre-training models BERT and ALBERT with the same super parameters and ALBERT-Large with larger super parameters are used for carrying out comparison experiments. The experiment was carried out using DuReader2.0 under Tensorflow-1.12.0, SQuADv1.1 under Pytrch 1.0.1 and NVIDIA GTX 1080Ti using the hyper-parameters shown in Table 1.
TABLE 1 Hyper-parameters of this experiment

| Hyper Parameters | Values |
| --- | --- |
|  | 4 |
|  | 3 |
| max query length (DuReader 2.0) | 16 |
| max query length (SQuAD v1.1) | 24 |
| max sequence length | 512 |
|  | 3×10⁻⁵ |
| doc stride | 384 |
| warmup rate | 0.1 |
|  | 3 |
Fuzzy match (F1) and exact match (EM) are used as the evaluation metrics in the experiments. EM measures whether the answer predicted by the model matches the true answer exactly. F1 measures the degree of lexical-level overlap between the predicted answer and the true answer, and is computed from the lexical-level Precision and Recall.
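For reference, the two metrics can be computed as in the sketch below; whitespace tokenisation is an illustrative simplification (character-level tokens would be the natural choice for Chinese data such as DuReader), not the exact scoring script used in the experiments.

```python
def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 if the predicted answer equals the true answer exactly, else 0.0."""
    return float(pred.strip() == gold.strip())

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 from precision and recall, as described above."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:        # count overlapping tokens (with multiplicity)
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```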
Table 2 compares the evaluation results of several pre-trained models on the DuReader 2.0 and SQuAD v1.1 development sets. On top of the baseline model, this experiment further improves F1 (+0.94%; +0.526%) and EM (+0.918%; +0.464%).
TABLE 2 model results for DuReader2.0 and SQuAD 1.1
Table 3 is a comparison of model parameters, where EM boosting is significant, indicating that the model can deepen understanding of the text and help predict more accurate answers.
TABLE 3 comparison of the parameters of the models
| Model | Params (M) |
| --- | --- |
| BERT | 110 |
| RoBERTa | 110 |
| S&IReader | 119 |
In the DuReader 2.0 experiment, the model was trained for 10890 steps; a checkpoint was saved and the performance recorded every 2000 steps. The changes in F1 and EM with training steps for this experiment and for the baseline model are shown in fig. 4 and fig. 5. Owing to the increased number of parameters, the performance is slightly lower than the baseline model early in training, but after sufficient training it is essentially superior to the baseline model.
The experiments show that the method can understand text semantics more deeply. As shown in Table 4, using a sample from the DuReader 2.0 development set as an example, the baseline model misunderstands the question and context text in the sample and cannot accurately locate the predicted answer.
TABLE 4 A comparative example of machine reading comprehension
Meanwhile, the method can to some extent alleviate the previously existing over-stability problem. As shown in Table 5, a sample from the DuReader 2.0 development set is selected as an example; the baseline model matches only on the surface words (the marked part in Table 5), i.e. it matches only at the position where the question text appears verbatim in the context and obtains an incorrect answer, whereas the method can match according to the semantics of the question and the context and find the correct answer.
TABLE 5 An example of over-stability on the development set
In order to analyze the influence of the skimming module, the perusal module, the gate mechanism, the multi-granularity module and the number of multi-hop iterations on model performance, ablation experiments were carried out on DuReader 2.0. Table 6 shows the performance of the model under the different ablation settings.
TABLE 6 ablation results
The experiments show that, compared with the 7th setting in Table 6, the first setting shows that the bidirectional stacked attention mechanism helps the model attend to the key content and can improve performance to a certain extent. The second setting shows that further establishing a more complete association between the question and the context helps improve performance. The third setting shows that the gate mechanism helps the model filter out unimportant information and thus significantly improves performance. The fourth setting shows that processing the text at multiple granularities allows the model to understand the text information at several levels and further improves performance.
Meanwhile, the 5th to 9th settings show that appropriately increasing the number of multi-hop iterations helps the model understand the text semantics more deeply, alleviating insufficient learning and over-stability and improving the accuracy of the predicted answers. However, increasing the number of hops also increases the number of parameters and the amount of computation, which affects the performance and efficiency of the model. The experiments show that the best performance in this work is achieved with Multihop-3.
To further verify and illustrate the effectiveness of the skimming module, a sample from the DuReader 2.0 development set is selected; when this sample enters the skimming module of the model, the question keywords and their associated context key parts, as well as the context keywords and their associated question-word key parts, are identified.
In the corresponding heat maps, shown in fig. 6 and fig. 7, the horizontal and vertical axes represent the question and the context text respectively. It can be seen from fig. 6 and fig. 7 that the question keyword "price" is accurately identified. Fig. 6 also shows that the skimming module identifies key parts such as "income", "market", "region" and "company profit" that are strongly associated with the question keyword "price" in the context semantics. Thus, as further verified and illustrated by this sample and the heat maps, the skimming module of the invention is able to identify keywords and the corresponding associated key parts semantically.
The above is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several changes and modifications can be made without departing from the technical solution, and the technical solution of the changes and modifications should be considered as falling within the scope of the claims of the present application.
Claims (8)
1. A method for processing a downstream task of a question-answering task is characterized by comprising the following steps:
S1, inputting the question and the context into a pre-training language module to obtain the language association features of the context;
S2, obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, computing an update vector z and a memory weight g from the question-aware context representation G using a gate mechanism, and updating G with z and g to obtain the output vector G_g;
S5, generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
2. The method of claim 1, wherein the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, h_1 to h_s denoting the encoded representations of the sequence formed by concatenating the question and the context, and s denoting the length of that sequence, and wherein step S2 comprises:
S201, based on the positions of the question and the context in H, extracting the question part H_Q and the context part H_C, where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
S202, constructing a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
S203, applying a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2, where S_1 denotes, for each context word, the relevance of all question words to it, S_2 denotes, for each question word, the relevance of all context words to it, S_1 = softmax_→(S), and S_2 = softmax_↓(S);
S204, highlighting the weights of the question keywords and the context keywords;
S205, generating the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
3. The downstream task processing method of a question-answering task according to claim 2, wherein step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained; W_S′ is a trainable matrix;
S302, computing a context representation A based on the question words and a question-word representation B based on the context words;
S303, splicing H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
4. The method for processing the downstream task of the question-answering task according to claim 3, wherein the step S3 further includes:
S304, taking the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeating steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
6. The method for processing the downstream task of the question-answering task according to claim 5, wherein the step S5 includes:
S501, removing the [PAD] padding part of H_C and averaging the remainder to obtain the context-granularity vector G_C;
S502, extracting the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generating the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively;
wherein the lengths of the question and the context need to be n and m respectively; if a sequence is too short it is padded with [PAD], and if it is too long it is truncated; when encoded by BERT, the sequence needs to be spliced in the form [CLS] + H_Q + [SEP] + H_C + [SEP]; the sequence consists of the question, the context, [CLS] and two [SEP] tokens, the question length is n and the context length is m, so the sequence length is n + m + 3.
7. A downstream task processing model of a question-answering task, characterized in that a downstream task processing method for implementing a question-answering task according to any one of claims 1 to 6, comprises:
a pre-training language module for generating language-associated features of a context based on a question and the context;
a skimming module for deriving a key information-aware context representation H using a context-based language-dependent feature of a bidirectional attention mechanismCKeyAnd problem representation of key information perception HQKey;
A perusal module for context representation H based on key information perception using bi-directional attention flowCKeyAnd problem representation of key information perception HQKeyObtaining a problem-aware context representation G;
the door mechanism module is used for calculating an update vector z and a memory weight G by utilizing the door mechanism based on the context representation G of the problem perception, and obtaining an output vector G by utilizing the update vector z and the memory weight G to update the context representation G of the problem perceptiong;
A particle computation module for generating a context granularity vector G using particle computation context-based language-dependent featuresCAnd sequence granularity vector GCLSBased on context granularity vector GCSequence size vector GCLSAnd output vector GgGenerating an output vector C of a multi-angle understanding context and a context global and local relationout;
An answer prediction module for generating an output vector C of the multi-angle understanding context and the relation between the context and the local part based on the language association characteristics of the contextoutAfter linear layer processing, the probability of each word in the context as the start-stop position of the answer is calculated by using softmax, and the continuous subsequence with the highest probability is extracted as the answer.
8. The downstream task processing model of a question-answering task according to claim 7, wherein the loss function of the model during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011539404.XA CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011539404.XA CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112732879A CN112732879A (en) | 2021-04-30 |
CN112732879B true CN112732879B (en) | 2022-05-10 |
Family
ID=75604645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011539404.XA Active CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112732879B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115080715B (en) * | 2022-05-30 | 2023-05-30 | 重庆理工大学 | Span extraction reading understanding method based on residual structure and bidirectional fusion attention |
CN114780707B (en) * | 2022-06-21 | 2022-11-22 | 浙江浙里信征信有限公司 | Multi-hop question answering method based on multi-hop reasoning joint optimization |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134771A (en) * | 2019-04-09 | 2019-08-16 | 广东工业大学 | A kind of implementation method based on more attention mechanism converged network question answering systems |
CN111611361A (en) * | 2020-04-01 | 2020-09-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent reading, understanding, question answering system of extraction type machine |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126596B (en) * | 2016-06-20 | 2019-08-23 | 中国科学院自动化研究所 | A kind of answering method based on stratification memory network |
EP3385862A1 (en) * | 2017-04-03 | 2018-10-10 | Siemens Aktiengesellschaft | A method and apparatus for performing hierarchical entity classification |
CN109947912B (en) * | 2019-01-25 | 2020-06-23 | 四川大学 | Model method based on intra-paragraph reasoning and joint question answer matching |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN110717431B (en) * | 2019-09-27 | 2023-03-24 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
CN110929515B (en) * | 2019-11-21 | 2023-04-18 | 中国民航大学 | Reading understanding method and system based on cooperative attention and adaptive adjustment |
CN111814982B (en) * | 2020-07-15 | 2021-03-16 | 四川大学 | Multi-hop question-answer oriented dynamic reasoning network system and method |
CN112100348A (en) * | 2020-09-01 | 2020-12-18 | 武汉纺织大学 | Knowledge base question-answer relation detection method and system of multi-granularity attention mechanism |
-
2020
- 2020-12-23 CN CN202011539404.XA patent/CN112732879B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN112732879A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210390271A1 (en) | Neural machine translation systems | |
CN110390397B (en) | Text inclusion recognition method and device | |
CN106502985A (en) | A kind of neural network modeling approach and device for generating title | |
US20210125516A1 (en) | Answer training device, answer training method, answer generation device, answer generation method, and program | |
Nagaraj et al. | Kannada to English Machine Translation Using Deep Neural Network. | |
CN108845990A (en) | Answer selection method, device and electronic equipment based on two-way attention mechanism | |
CN111524593B (en) | Medical question-answering method and system based on context language model and knowledge embedding | |
CN109858046B (en) | Learning long-term dependencies in neural networks using assistance loss | |
CN112732879B (en) | Downstream task processing method and model of question-answering task | |
CN114297399B (en) | Knowledge graph generation method, system, storage medium and electronic equipment | |
CN111079018A (en) | Exercise personalized recommendation method, exercise personalized recommendation device, exercise personalized recommendation equipment and computer readable storage medium | |
CN114218379A (en) | Intelligent question-answering system-oriented method for attributing questions which cannot be answered | |
CN110852071A (en) | Knowledge point detection method, device, equipment and readable storage medium | |
Kumari et al. | Context-based question answering system with suggested questions | |
CN117057414B (en) | Text generation-oriented multi-step collaborative prompt learning black box knowledge distillation method and system | |
CN118467706A (en) | Retrieval enhancement question-answering method and system combined with historical data | |
Arifin et al. | Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory | |
CN117235347A (en) | Teenager algorithm code aided learning system and method based on large language model | |
KR20240128104A (en) | Generating output sequences with inline evidence using language model neural networks | |
CN112580365B (en) | Chapter analysis method, electronic equipment and storage device | |
CN115357712A (en) | Aspect level emotion analysis method and device, electronic equipment and storage medium | |
CN114139535A (en) | Keyword sentence making method and device, computer equipment and readable medium | |
CN114358579A (en) | Evaluation method, evaluation device, electronic device, and computer-readable storage medium | |
Ratna et al. | Hybrid deep learning cnn-bidirectional lstm and manhattan distance for japanese automated short answer grading: Use case in japanese language studies | |
Liu et al. | Investigating the Robustness of Natural Language Generation from Logical Forms via Counterfactual Samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20230705
Address after: No. 1811, 18th Floor, Building 19, Section 1201, Lushan Avenue, Wan'an Street, Tianfu New District, Chengdu, Sichuan, China (Sichuan) Pilot Free Trade Zone, 610213
Patentee after: Sichuan Jiulai Technology Co., Ltd.
Address before: No. 69 Hongguang Avenue, Lijiatuo, Banan District, Chongqing 400054
Patentee before: Chongqing University of Technology