CN112732879B - Downstream task processing method and model of question-answering task - Google Patents
Info
- Publication number
- CN112732879B (application CN202011539404.XA)
- Authority
- CN
- China
- Prior art keywords
- context
- question
- representation
- vector
- ckey
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a downstream task processing method and model for a question-answering task. A key-information-aware context representation H_CKey and a key-information-aware question representation H_QKey are obtained, and a question-aware context representation G is generated; an update vector z and a memory weight g are computed from G, and G is updated to obtain an output vector G_g; a context-granularity vector G_C and a sequence-granularity vector G_CLS are generated, and an output vector C_out is produced; softmax is then used to compute, for each word in the context, the probability of being the start or end position of the answer, and the continuous subsequence with the highest probability is extracted as the answer. The invention proposes a bidirectional stacked attention mechanism, constructs a mechanism that integrates perusal and skimming, and builds a multi-granularity module based on the idea of granular computing, so that the model effectively attends to and filters useful information, better understands the text at multiple granularities, gives more accurate answers, and achieves a new level of performance over the baseline model.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a downstream task processing method and model for a question-answering task.
Background
Machine reading comprehension is a very challenging task in natural language processing, which aims to determine the correct answer to a question from a given context. According to the answer form, common machine reading comprehension tasks are divided into cloze-style, multiple-choice, span-extraction and free-answer tasks. Recently developed pre-trained language models have achieved a series of successes on various natural language understanding tasks by virtue of their powerful text representation capability. Pre-trained language models are used as the encoders of deep learning language models to extract the language association features of the relevant text, and are fine-tuned together with a downstream processing structure specific to a particular task. With the great success of pre-trained language models, attention has focused on the encoder side of deep learning language models, and the development of downstream processing techniques customized for specific tasks has entered a bottleneck. Although one can directly benefit from a variety of powerful encoders with similar structures, applying the general knowledge implicit in large-scale corpora to language models with very large numbers of parameters is time- and resource-consuming. Moreover, language representation encoding techniques are currently developing slowly, which limits further improvement of the performance of pre-trained language models. All of this highlights the importance of developing downstream processing techniques for specific tasks.
In summary, the existing deep learning language models have the following shortcomings: (1) unimportant parts of the text are treated as important while important parts are ignored; (2) there is an over-stability phenomenon, i.e. the model is easily misled by interfering sentences in the text that share many words with the question, matching only on the surface words themselves rather than on the semantics.
Therefore, how to make the model focus on the key information in the text and help the model escape its preference for over-attending to local information in the text has become an urgent problem for those skilled in the art.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the problem to be solved by the invention is: to make the model focus on the key information in the text and to help the model escape its preference for over-attending to local information in the text.
In order to solve the technical problems, the invention adopts the following technical scheme:
a downstream task processing method of a question-answering task comprises the following steps:
S1, input the question and the context into a pre-training language module to obtain the language association features of the context;
S2, using a bidirectional attention mechanism, obtain the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, using bidirectional attention flow, obtain the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, using a gate mechanism, compute an update vector z and a memory weight g from the question-aware context representation G, and update G with z and g to obtain the output vector G_g;
S5, using granular computing, generate a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generate, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after the output vector C_out is processed by a linear layer, use softmax to compute the probability of each word in the context being the start or end position of the answer, and extract the continuous subsequence with the highest probability as the answer.
Preferably, the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, where h_1 to h_s denote the encoded representations of the sequence formed by concatenating the question and the context, and s denotes the length of that sequence. Step S2 comprises:
S201, based on the positions of the question and the context in H, extract the question part H_Q and the context part H_C, where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
S202, construct a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
S203, apply a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2; S_1 denotes, for each context word, the relevance of all question words to it; S_2 denotes, for each question word, the relevance of all context words to it; S_1 = softmax_→(S), S_2 = softmax_↓(S);
S204, highlight the weights of the question keywords and the context keywords;
S205, generate the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
Preferably, step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained, recomputed in the same way as in S202; W_S′ is a trainable matrix;
S302, compute a context representation A based on the question words (this refers to the representations of all question words; averaging over all question words highlights the weights of the keywords among them) and a question-word representation B based on the context words;
S303, splice H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
Preferably, step S3 further includes:
S304, take the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeat steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
Preferably, in step S4:
z = tanh(W_z · G + b_z)
g = sigmoid(W_g[G; A] + b_g)
where W_z and W_g are trainable matrices, and b_z and b_g are biases.
Preferably, step S5 includes:
S501, remove the [PAD] padding part of H_C and average the remainder to obtain the context-granularity vector G_C;
S502, extract the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generate the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively.
A downstream task processing model for a question-answering task, used to implement the above downstream task processing method, comprising:
a pre-training language module for generating the language association features of the context based on the question and the context;
a skimming module for obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
a perusal module for obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
a gate mechanism module for computing an update vector z and a memory weight g from the question-aware context representation G using the gate mechanism, and updating G with z and g to obtain the output vector G_g;
a granular computing module for generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
an answer prediction module for processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
Preferably, the loss function of the downstream task processing model of the question-answering task during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
In summary, compared with the prior art, the invention discloses a downstream task processing method for a question-answering task, which adds a downstream processing structure on top of a pre-trained model, comprising a skimming module, a perusal module and a gate mechanism module, and can simulate the human behaviour of reading a passage several times and comprehensively filtering information when completing a reading comprehension task. The skimming module helps the model determine the key-information-aware context representation and the key-information-aware question representation; the perusal module feeds the vectors output from the encoder into a bidirectional attention flow layer to establish a complete association between the question and the context. Meanwhile, following the idea of granular computing, a multi-granularity module computing the context granularity and the sequence granularity is added to the model and placed in parallel with the word granularity already obtained, so that the model can simulate the human behaviour of comprehending a text from words to sentences and from the local to the whole.
Drawings
FIG. 1 is a flow chart of a method for processing a downstream task of a question-answering task according to the present invention;
FIG. 2 is a block diagram of a downstream task processing model for a question-answering task as disclosed herein;
FIG. 3 is a diagram of the skimming module (bidirectional stacked attention mechanism);
FIG. 4 is a line chart comparing the F1 of RoBERTa and of the present invention;
FIG. 5 is a line chart comparing the EM of RoBERTa and of the present invention;
FIG. 6 is a diagram of the question keywords and their associated context key parts;
FIG. 7 is a diagram of the context keywords and their associated question-word key parts.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention discloses a method for processing a downstream task of a question-answering task, which comprises the following steps:
S1, input the question and the context into a pre-training language module to obtain the language association features of the context;
specifically, the sequence formed by concatenating the question and the context is fed into the pre-training language module, i.e. the encoder, for encoding.
S2, using a bidirectional attention mechanism, obtain the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, using bidirectional attention flow, obtain the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, using a gate mechanism, compute an update vector z and a memory weight g from the question-aware context representation G, and update G with z and g to obtain the output vector G_g;
S5, using granular computing, generate a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generate, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after the output vector C_out is processed by a linear layer, use softmax to compute the probability of each word in the context being the start or end position of the answer, and extract the continuous subsequence with the highest probability as the answer.
Aiming at the prior-art problem that unimportant parts of the text are treated as important while important parts are ignored, the invention proposes a bidirectional stacked attention mechanism, so that the model can perceive the question keywords and their associated context key parts, as well as the context keywords and their associated question key parts. The gate mechanism module lets the model automatically retain the parts of the context related to the question and forget the parts unrelated to it. Through these two mechanisms, the model is forced to focus on the key information in the text.
Aiming at the over-stability problem in the prior art, the granular computing module proposed by the invention adds the context granularity and the sequence granularity to the model, so that the model can understand the text from both local and global perspectives, helping it escape the preference for over-attending to local information in the text.
The skimming module proposed by the invention is inspired by the idea of a stacked attention mechanism, as shown in fig. 3. After the similarity matrix is obtained from the context and question representations, softmax is applied to the matrix row-wise and column-wise, the keyword weights are highlighted, and the attention over the context key parts associated with the question keywords and the attention over the question key parts associated with the context keywords are computed. The context representation and the question-word representation are each multiplied by the corresponding attention matrix and then added to the original representation, yielding the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey.
In a specific implementation, the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, where h_1 to h_s denote the encoded representations of the sequence formed by concatenating the question and the context, and s denotes the length of that sequence. Step S2 comprises:
S201, based on the positions of the question and the context in H, extract the question part H_Q and the context part H_C (the sequence is formed by concatenating the question and the context, so after encoding, the encoded parts of the question and the context must be extracted from the corresponding positions in the sequence for subsequent operations), where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
all questions and contexts are brought to lengths n and m respectively: if a sequence is too short it is padded with [PAD], and if it is too long it is truncated. When encoding with BERT, the sequence must be spliced in the form [CLS] + H_Q + [SEP] + H_C + [SEP]. Subsequent operations only require H_Q and H_C, so [CLS] (i.e. h_1) and the two [SEP] tokens (i.e. h_{n+2} and h_{n+m+3}) are discarded. The sequence consists of the question, the context, [CLS] and two [SEP] tokens; the question length is n and the context length is m, so the sequence length is n + m + 3.
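The following is a minimal, illustrative sketch of this input layout in Python; the token strings and the example lengths are hypothetical, and the real model would use the BERT tokenizer's vocabulary ids rather than raw strings.

```python
# Illustrative sketch of the [CLS] + question + [SEP] + context + [SEP] layout.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def build_sequence(question_tokens, context_tokens, n, m):
    """Pad/truncate the question to n tokens and the context to m tokens, then splice."""
    q = (question_tokens + [PAD] * n)[:n]
    c = (context_tokens + [PAD] * m)[:m]
    seq = [CLS] + q + [SEP] + c + [SEP]      # total length n + m + 3
    h_q = seq[1:n + 1]                       # h_2 ... h_{n+1}
    h_c = seq[n + 2:n + m + 2]               # h_{n+3} ... h_{n+m+2}
    return seq, h_q, h_c

seq, h_q, h_c = build_sequence(["what", "is", "the", "price"],
                               ["the", "price", "is", "low"], n=16, m=24)
assert len(seq) == 16 + 24 + 3 and len(h_q) == 16 and len(h_c) == 24
```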
S202, construct a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
in simplified form, the concrete computation is S = W_a*H_C + W_b*H_Q + W_c*H_C*H_Q + bias, where * denotes matrix multiplication and bias is a bias term. W_S here stands for W_a, W_b and W_c together, and the final shape of the S matrix is [b, m, n], where b denotes the batch size.
S203, apply a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2; S_1 denotes, for each context word, the relevance of all question words to it; S_2 denotes, for each question word, the relevance of all context words to it; S_1 = softmax_→(S), S_2 = softmax_↓(S);
S204, highlight the weights of the question keywords and the context keywords;
for each context word, S_1 gives the relevance of all question words to it, so averaging along the context dimension highlights the weights of the keywords among the question words: the more critical a question word is, the greater its relevance to every context word and hence the greater its average value, which yields the question keyword weights S_Qkey. Averaging S_2 along the question dimension in the same way highlights the keyword weights S_Ckey in the context.
S205, generate the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
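The skimming module (steps S202–S205) can be sketched in PyTorch roughly as follows. The factorisation of W_S into three linear maps and the exact tensor shapes are assumptions made for illustration; they follow the simplified form S = W_a*H_C + W_b*H_Q + W_c*H_C*H_Q + bias given above, not a verified reference implementation.

```python
import torch
import torch.nn as nn

class SkimmingAttention(nn.Module):
    """Bidirectional stacked attention (steps S202-S205), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        # W_S factorised into three weights: S = Wa*Hc + Wb*Hq + Wc*(Hc x Hq) + bias
        self.w_a = nn.Linear(hidden, 1)               # also contributes the bias term
        self.w_b = nn.Linear(hidden, 1, bias=False)
        self.w_c = nn.Linear(hidden, 1, bias=False)

    def forward(self, h_c, h_q):                      # h_c: [b, m, d], h_q: [b, n, d]
        # S202: similarity matrix S of shape [b, m, n]
        s = (self.w_a(h_c)                            # [b, m, 1], broadcast over n
             + self.w_b(h_q).transpose(1, 2)          # [b, 1, n], broadcast over m
             + torch.matmul(h_c * self.w_c.weight, h_q.transpose(1, 2)))
        # S203: row-wise and column-wise softmax
        s1 = torch.softmax(s, dim=2)                  # question-word relevance per context word
        s2 = torch.softmax(s, dim=1)                  # context-word relevance per question word
        # S204: highlight keyword weights by averaging
        s_qkey = s1.mean(dim=1)                       # [b, n], question keyword weights
        s_ckey = s2.mean(dim=2)                       # [b, m], context keyword weights
        # S205: A_C = S_2 . S_Qkey, A_Q = S_1 . S_Ckey, then residual re-weighting
        a_c = torch.matmul(s2, s_qkey.unsqueeze(-1))                  # [b, m, 1]
        a_q = torch.matmul(s1.transpose(1, 2), s_ckey.unsqueeze(-1))  # [b, n, 1]
        h_ckey = h_c + h_c * a_c
        h_qkey = h_q + h_q * a_q
        return h_ckey, h_qkey
```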
In a specific implementation, step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained, recomputed in the same way as in S202; W_S′ is a trainable matrix;
S302, compute a context representation A based on the question words and a question-word representation B based on the context words;
S303, splice H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
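A sketch of the perusal module (step S3) is given below. The patent text does not spell out exactly how the attended representations A and B are formed from S′, so the sketch assumes BiDAF-style context-to-question and question-to-context attention; this is an assumption, not the definitive construction.

```python
import torch
import torch.nn as nn

class PerusalAttention(nn.Module):
    """Bidirectional attention flow (step S3), a sketch; A and B follow BiDAF-style
    attention because their exact composition is not spelled out in the text."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w_a = nn.Linear(hidden, 1)
        self.w_b = nn.Linear(hidden, 1, bias=False)
        self.w_c = nn.Linear(hidden, 1, bias=False)
        self.w3 = nn.Linear(3 * hidden, hidden)

    def forward(self, h_ckey, h_qkey):                # [b, m, d], [b, n, d]
        # recompute the similarity matrix S' on the key-aware representations (as in S202)
        s = (self.w_a(h_ckey)
             + self.w_b(h_qkey).transpose(1, 2)
             + torch.matmul(h_ckey * self.w_c.weight, h_qkey.transpose(1, 2)))  # [b, m, n]
        s1 = torch.softmax(s, dim=2)                  # per context word over question words
        s2 = torch.softmax(s, dim=1)                  # per question word over context words
        # S302: A = context representation based on question words,
        #       B = question-to-context summary (assumed BiDAF-style)
        a = torch.matmul(s1, h_qkey)                                    # [b, m, d]
        b = torch.matmul(torch.matmul(s1, s2.transpose(1, 2)), h_ckey)  # [b, m, d]
        # S303: splice and project to the question-aware context representation G
        g = self.w3(torch.cat([h_ckey, h_ckey * a, h_ckey * b], dim=-1))
        return g, a                                   # A is reused later by the gate module
```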
When a human being completes a reading comprehension task, he or she often reads the passage several times to deepen the understanding of the text. The model simulates this behaviour by passing through the skimming module and the perusal module multiple times: skimming grasps the key information in the question and the passage, while perusal further grasps the gist of the text and filters the important information that matches the question. By reading repeatedly and continually adjusting the key information it has identified, the model obtains a more comprehensive context representation and finally determines the answer to the question. The invention uses a multi-hop loop mechanism to simulate the human behaviour of reading a text repeatedly, helping the model deepen its understanding of the text. The experimental data below also show that the multi-hop loop mechanism helps to improve model performance.
In specific implementation, step S3 further includes:
S304, take the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeat steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
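A compact sketch of this multi-hop loop, reusing the module sketches above; the default of three hops mirrors the Multihop-3 setting reported in the experiments, and the wiring of which outputs are fed back is an assumption based on step S304.

```python
def multi_hop(h_c, h_q, skim, peruse, hops=3):
    """Multi-hop reading (S304): feed H_QKey and G back in as the next hop's H_Q and H_C.
    `skim` and `peruse` are instances of the SkimmingAttention / PerusalAttention sketches."""
    g, a = None, None
    for _ in range(hops):
        h_ckey, h_qkey = skim(h_c, h_q)   # skimming module (S2)
        g, a = peruse(h_ckey, h_qkey)     # perusal module (S3)
        h_c, h_q = g, h_qkey              # recycle for the next hop
    return g, a
```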
In the reading comprehension process, various structures adopt gate-like mechanisms, such as LSTM, GRU and various Reader models, to simulate the human behaviour of filtering and memorizing important content and ignoring unimportant content after repeated reading. The model judges which parts need to be memorized or forgotten, generates an update vector, and updates its memorized result. In the invention, the question-aware context representation G and the key-information-aware question representation H_QKey are fed into the gate mechanism so that the model can judge which parts need to be memorized or forgotten; an update vector z is generated from G, and the model's memorized result is updated. G and A are merged and fed into a linear layer with a sigmoid, so that when a part of G is more relevant to the question content, the memory weight g approaches 1 and more of the relevant information is retained.
In the specific implementation, in step S4:
z = tanh(W_z · G + b_z)
g = sigmoid(W_g[G; A] + b_g)
where W_z and W_g are trainable matrices, and b_z and b_g are biases.
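A sketch of the gate mechanism module follows. The two formulas above fix z and g, but the text does not write out the rule that combines them into the output G_g, so the convex combination used here (g keeps G, 1 - g admits the update z) is an assumption consistent with the surrounding description of retaining question-relevant content.

```python
import torch
import torch.nn as nn

class GateModule(nn.Module):
    """Gate mechanism (step S4), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w_z = nn.Linear(hidden, hidden)          # W_z, b_z
        self.w_g = nn.Linear(2 * hidden, hidden)      # W_g, b_g

    def forward(self, g_repr, a):                     # G and A, both [b, m, d]
        z = torch.tanh(self.w_z(g_repr))                                # update vector z
        gate = torch.sigmoid(self.w_g(torch.cat([g_repr, a], dim=-1)))  # memory weight g
        g_out = gate * g_repr + (1.0 - gate) * z      # assumed combination rule for G_g
        return g_out
```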
The invention simultaneously adopts a bidirectional stacked attention mechanism and a gate mechanism to force the model to pay more attention to the key information in the text. The hyper-parameters in the experiments can be adjusted according to the performance of the available hardware, and the performance of the model differs under different hyper-parameter settings. The experimental data in this patent are the model performance results obtained under the hyper-parameter settings given here. Under the same hyper-parameter settings, the model of the invention outperforms the other models in the comparison experiments.
Granular computing is an effective approach to structured problem solving. One recognized feature of human intelligence is the ability to observe and analyze the same problem at very different granularities: people can not only solve problems in worlds of different granularity, but also jump quickly from one granularity world to another, and this ability to handle different granularity worlds is a powerful manifestation of human problem solving. A granular computing model divides the object of study into several layers of different granularity, which are inter-related and form a unified whole; different granularities represent different angles and ranges of information. The idea of granular computing helps the model solve problems at multiple granularities and understand the relationship between the local and the whole of the text. The invention understands the text and the relationship between its whole and its parts in terms of word granularity, context granularity and sequence granularity.
In specific implementation, step S5 includes:
S501, remove the [PAD] padding part of H_C and average the remainder to obtain the context-granularity vector G_C;
S502, extract the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generate the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively.
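A sketch of the granular computing module (step S5) in the same PyTorch style; the masking of [PAD] positions via an explicit boolean mask and the broadcasting of the pooled vectors over the token dimension are implementation assumptions, and C_g is taken to be the gated output G_g from step S4.

```python
import torch
import torch.nn as nn

class GranularityModule(nn.Module):
    """Multi-granularity fusion (step S5), a sketch."""
    def __init__(self, hidden: int):
        super().__init__()
        self.w4 = nn.Linear(hidden, hidden)           # W_4, b_4

    def forward(self, h, h_c, c_g, context_mask):
        # h: full sequence encoding [b, s, d]; h_c: context part [b, m, d]
        # c_g: gated context output from step S4, [b, m, d]
        # context_mask: 1 for real context tokens, 0 for [PAD] positions, [b, m]
        mask = context_mask.unsqueeze(-1).float()
        # S501: context-granularity vector - mean over non-[PAD] context tokens
        g_c = (h_c * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)   # [b, d]
        # S502: sequence-granularity vector - the [CLS] position of H
        g_cls = h[:, 0, :]                                               # [b, d]
        # S503: fuse word, context and sequence granularity
        c_out = self.w4(c_g + g_c.unsqueeze(1) + g_cls.unsqueeze(1))     # [b, m, d]
        return c_out
```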
In addition, the invention also discloses a downstream task processing model for a question-answering task, used to implement the above downstream task processing method, comprising:
a pre-training language module for generating the language association features of the context based on the question and the context;
a skimming module for obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
a perusal module for obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
a gate mechanism module for computing an update vector z and a memory weight g from the question-aware context representation G using the gate mechanism, and updating G with z and g to obtain the output vector G_g;
a granular computing module for generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
an answer prediction module for processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
The invention serves a span-extraction reading comprehension task, and the main model architecture is shown in fig. 2. The downstream structure mainly comprises the following four parts: the skimming module, the perusal module, the gate mechanism module and the granular computing module, where the skimming module and the perusal module are wrapped in a multi-hop mechanism. The skimming module judges the question keywords and their associated context key parts, as well as the context keywords and their associated question-word key parts; the perusal module aligns the question and context information to establish a complete association; the gate mechanism filters, memorizes and updates the key information; and the granular computing module, in parallel with this structure, enables the model to understand the text from multiple angles at the context granularity and the sequence granularity.
In a specific implementation, the loss function of the downstream task processing model of the question-answering task during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
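The answer prediction head and this loss can be sketched as follows; the use of a single linear layer producing separate start and end logits per token is the standard span-extraction head and is assumed here rather than taken verbatim from the patent.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Answer prediction (step S6), a sketch: start/end logits per context token."""
    def __init__(self, hidden: int):
        super().__init__()
        self.span = nn.Linear(hidden, 2)

    def forward(self, c_out):                          # [b, m, d]
        logits = self.span(c_out)                      # [b, m, 2]
        return logits[..., 0], logits[..., 1]          # start logits, end logits

def span_loss(start_logits, end_logits, y_start, y_end):
    """Negative log-likelihood of the true start/end positions, averaged over the batch,
    matching the loss described above."""
    log_p_start = torch.log_softmax(start_logits, dim=-1)
    log_p_end = torch.log_softmax(end_logits, dim=-1)
    nll = -(log_p_start.gather(1, y_start.unsqueeze(1)).squeeze(1)
            + log_p_end.gather(1, y_end.unsqueeze(1)).squeeze(1))
    return nll.mean()
```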
The effect of the technical scheme disclosed by the invention can be verified through the following experiments:
the pre-training language model RoBERTA is used as the Encoder of the model and is used as a baseline model, and the pre-training models BERT and ALBERT with the same super parameters and ALBERT-Large with larger super parameters are used for carrying out comparison experiments. The experiment was carried out using DuReader2.0 under Tensorflow-1.12.0, SQuADv1.1 under Pytrch 1.0.1 and NVIDIA GTX 1080Ti using the hyper-parameters shown in Table 1.
TABLE 1 Hyper-parameters of this experiment

| Hyper Parameters | Values |
| --- | --- |
|  | 4 |
|  | 3 |
| max query length (DuReader 2.0) | 16 |
| max query length (SQuAD v1.1) | 24 |
| max sequence length | 512 |
|  | 3×10⁻⁵ |
| doc stride | 384 |
| warmup rate | 0.1 |
|  | 3 |
Fuzzy match (F1) and exact match (EM) are used as the evaluation metrics in the experiments. EM measures whether the answer predicted by the model matches the true answer exactly. F1 measures the degree of lexical-level overlap between the predicted answer and the true answer, and is computed from the lexical-level Precision and Recall.
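For reference, the two metrics can be computed as in the sketch below; whitespace tokenisation is an illustrative simplification (character-level tokens would be the natural choice for Chinese data such as DuReader), not the exact scoring script used in the experiments.

```python
def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 if the predicted answer equals the true answer exactly, else 0.0."""
    return float(pred.strip() == gold.strip())

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 from precision and recall, as described above."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:        # count overlapping tokens (with multiplicity)
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```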
Table 2 compares the evaluation results of several pre-trained models on the DuReader 2.0 and SQuAD v1.1 development sets. On top of the baseline model, this experiment further improves F1 (+0.94%; +0.526%) and EM (+0.918%; +0.464%).
TABLE 2 model results for DuReader2.0 and SQuAD 1.1
Table 3 is a comparison of model parameters, where EM boosting is significant, indicating that the model can deepen understanding of the text and help predict more accurate answers.
TABLE 3 comparison of the parameters of the models
| Model | Params (M) |
| --- | --- |
| BERT | 110 |
| RoBERTa | 110 |
| S&IReader | 119 |
In the DuReader 2.0 experiment, the model was trained for 10890 steps; a checkpoint was saved and the performance recorded every 2000 steps. The changes in F1 and EM with training steps for this experiment and for the baseline model are shown in fig. 4 and fig. 5. Owing to the increased number of parameters, the performance is slightly lower than the baseline model early in training, but after sufficient training it is essentially superior to the baseline model.
The experiments show that the method can understand text semantics more deeply. As shown in Table 4, using a sample from the DuReader 2.0 development set as an example, the baseline model misunderstands the question and context text in the sample and cannot accurately locate the predicted answer.
TABLE 4 A comparative example of machine reading comprehension
Meanwhile, the method can to some extent alleviate the previously existing over-stability problem. As shown in Table 5, a sample from the DuReader 2.0 development set is selected as an example; the baseline model matches only on the surface words (the marked part in Table 5), i.e. it matches only at the position where the question text appears verbatim in the context and obtains an incorrect answer, whereas the method can match according to the semantics of the question and the context and find the correct answer.
TABLE 5 An example of over-stability on the development set
In order to analyze the influence of the skimming module, the perusal module, the gate mechanism, the multi-granularity module and the number of multi-hop iterations on model performance, ablation experiments were carried out on DuReader 2.0. Table 6 shows the performance of the model under the different ablation settings.
TABLE 6 ablation results
The experiments show that, compared with the 7th setting in Table 6, the first setting shows that the bidirectional stacked attention mechanism helps the model attend to the key content and can improve performance to a certain extent. The second setting shows that further establishing a more complete association between the question and the context helps improve performance. The third setting shows that the gate mechanism helps the model filter out unimportant information and thus significantly improves performance. The fourth setting shows that processing the text at multiple granularities allows the model to understand the text information at several levels and further improves performance.
Meanwhile, the 5th to 9th settings show that appropriately increasing the number of multi-hop iterations helps the model understand the text semantics more deeply, alleviating insufficient learning and over-stability and improving the accuracy of the predicted answers. However, increasing the number of hops also increases the number of parameters and the amount of computation, which affects the performance and efficiency of the model. The experiments show that the best performance in this work is achieved with Multihop-3.
To further verify and illustrate the effectiveness of the skimming module, a sample from the DuReader 2.0 development set is selected; when this sample enters the skimming module of the model, the question keywords and their associated context key parts, as well as the context keywords and their associated question-word key parts, are identified.
In the corresponding heat maps, shown in fig. 6 and fig. 7, the horizontal and vertical axes represent the question and the context text respectively. It can be seen from fig. 6 and fig. 7 that the question keyword "price" is accurately identified. Fig. 6 also shows that the skimming module identifies key parts such as "income", "market", "region" and "company profit" that are strongly associated with the question keyword "price" in the context semantics. Thus, as further verified and illustrated by this sample and the heat maps, the skimming module of the invention is able to identify keywords and the corresponding associated key parts semantically.
The above is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several changes and modifications can be made without departing from the technical solution, and the technical solution of the changes and modifications should be considered as falling within the scope of the claims of the present application.
Claims (8)
1. A method for processing a downstream task of a question-answering task is characterized by comprising the following steps:
S1, inputting the question and the context into a pre-training language module to obtain the language association features of the context;
S2, obtaining, using a bidirectional attention mechanism, the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey from the language association features of the context;
S3, obtaining, using bidirectional attention flow, the question-aware context representation G from the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey;
S4, computing an update vector z and a memory weight g from the question-aware context representation G using a gate mechanism, and updating G with z and g to obtain the output vector G_g;
S5, generating, using granular computing, a context-granularity vector G_C and a sequence-granularity vector G_CLS from the language association features of the context, and generating, from G_C, G_CLS and the output vector G_g, an output vector C_out that understands the context from multiple angles and captures the global and local relations of the context;
S6, after processing the output vector C_out through a linear layer, computing with softmax the probability of each word in the context being the start or end position of the answer, and extracting the continuous subsequence with the highest probability as the answer.
2. The method of claim 1, wherein the language association feature of the context is H, H = {h_1, h_2, h_3, ..., h_s}, h_1 to h_s denoting the encoded representations of the sequence formed by concatenating the question and the context, and s denoting the length of that sequence, and wherein step S2 comprises:
S201, based on the positions of the question and the context in H, extracting the question part H_Q and the context part H_C, where H_Q = {h_2, h_3, h_4, ..., h_{n+1}}, H_C = {h_{n+3}, h_{n+4}, ..., h_{n+m+2}}, n denotes the length of the question words, and m denotes the length of the context words;
S202, constructing a similarity matrix S:
S = W_S(H_C, H_Q, H_C · H_Q)
where W_S is a trainable matrix;
S203, applying a softmax operation to each row and each column of the similarity matrix S to obtain S_1 and S_2, where S_1 denotes, for each context word, the relevance of all question words to it, S_2 denotes, for each question word, the relevance of all context words to it, S_1 = softmax_→(S), and S_2 = softmax_↓(S);
S204, highlighting the weights of the question keywords and the context keywords;
S205, generating the key-information-aware context representation H_CKey and the key-information-aware question representation H_QKey based on the following formulas:
H_CKey = H_C + H_C ⊙ A_C
H_QKey = H_Q + H_Q ⊙ A_Q
where A_C denotes the attention over the context key parts associated with the question keywords, and A_Q denotes the attention over the question key parts associated with the context keywords;
A_C = S_2 · S_Qkey
A_Q = S_1 · S_Ckey
where S_Qkey denotes the question weights highlighting the keywords, and S_Ckey denotes the context weights highlighting the keywords;
S_Qkey = mean_↓(S_1)
S_Ckey = mean_→(S_2).
3. The downstream task processing method of a question-answering task according to claim 2, wherein step S3 comprises:
S′ = W_S′(H_CKey, H_QKey, H_CKey · H_QKey)
where the row-wise and column-wise softmax of S′ give, for each context word, the relevance of all question words to it and, for each question word, the relevance of all context words to it; S′ represents the correlation between the question words and the context words after the key information has been obtained; W_S′ is a trainable matrix;
S302, computing a context representation A based on the question words and a question-word representation B based on the context words;
S303, splicing H_CKey, A and B to obtain the question-aware context representation G:
G = W_3([H_CKey; H_CKey ⊙ A; H_CKey ⊙ B]) + b_3
where W_3 and b_3 are a trainable matrix and bias, respectively.
4. The method for processing the downstream task of the question-answering task according to claim 3, wherein the step S3 further includes:
S304, taking the key-information-aware question representation H_QKey and the question-aware context representation G as H_Q and H_C in step S2, and repeating steps S2 and S3; after cycling a preset number of times, the final question-aware context representation G is obtained.
6. The method for processing the downstream task of the question-answering task according to claim 5, wherein the step S5 includes:
S501, removing the [PAD] padding part of H_C and averaging the remainder to obtain the context-granularity vector G_C;
S502, extracting the [CLS] identifier in H as the sequence-granularity vector G_CLS;
S503, generating the output vector C_out that understands the context from multiple angles and captures the global and local relations of the context, based on the following formula:
C_out = W_4 · (C_g + G_C + G_CLS) + b_4
where W_4 and b_4 are a trainable matrix and bias, respectively;
wherein the lengths of the question and the context need to be n and m respectively; if a sequence is too short it is padded with [PAD], and if it is too long it is truncated; when encoded by BERT, the sequence needs to be spliced in the form [CLS] + H_Q + [SEP] + H_C + [SEP]; the sequence consists of the question, the context, [CLS] and two [SEP] tokens, the question length is n and the context length is m, so the sequence length is n + m + 3.
7. A downstream task processing model of a question-answering task, characterized in that a downstream task processing method for implementing a question-answering task according to any one of claims 1 to 6, comprises:
a pre-training language module for generating language-associated features of a context based on a question and the context;
a skimming module for deriving a key information-aware context representation H using a context-based language-dependent feature of a bidirectional attention mechanismCKeyAnd problem representation of key information perception HQKey;
A perusal module for context representation H based on key information perception using bi-directional attention flowCKeyAnd problem representation of key information perception HQKeyObtaining a problem-aware context representation G;
the door mechanism module is used for calculating an update vector z and a memory weight G by utilizing the door mechanism based on the context representation G of the problem perception, and obtaining an output vector G by utilizing the update vector z and the memory weight G to update the context representation G of the problem perceptiong;
A particle computation module for generating a context granularity vector G using particle computation context-based language-dependent featuresCAnd sequence granularity vector GCLSBased on context granularity vector GCSequence size vector GCLSAnd output vector GgGenerating an output vector C of a multi-angle understanding context and a context global and local relationout;
An answer prediction module for generating an output vector C of the multi-angle understanding context and the relation between the context and the local part based on the language association characteristics of the contextoutAfter linear layer processing, the probability of each word in the context as the start-stop position of the answer is calculated by using softmax, and the continuous subsequence with the highest probability is extracted as the answer.
8. The downstream task processing model of a question-answering task according to claim 7, wherein the loss function of the model during training is:
L = -(1/N) Σ_{i=1}^{N} [log p(y_i^1) + log p(y_i^2)]
where y_i^1 and y_i^2 denote the start position and the end position of the true answer of the i-th sample respectively, N is the total number of samples, p(y_i^1) denotes the predicted probability of the start position of the true answer at model inference, and p(y_i^2) denotes the predicted probability of the end position of the true answer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011539404.XA CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011539404.XA CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112732879A CN112732879A (en) | 2021-04-30 |
CN112732879B true CN112732879B (en) | 2022-05-10 |
Family
ID=75604645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011539404.XA Active CN112732879B (en) | 2020-12-23 | 2020-12-23 | Downstream task processing method and model of question-answering task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112732879B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115080715B (en) * | 2022-05-30 | 2023-05-30 | 重庆理工大学 | Span extraction reading understanding method based on residual structure and bidirectional fusion attention |
CN114780707B (en) * | 2022-06-21 | 2022-11-22 | 浙江浙里信征信有限公司 | Multi-hop question answering method based on multi-hop reasoning joint optimization |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134771A (en) * | 2019-04-09 | 2019-08-16 | 广东工业大学 | A kind of implementation method based on more attention mechanism converged network question answering systems |
CN111611361A (en) * | 2020-04-01 | 2020-09-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent reading, understanding, question answering system of extraction type machine |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126596B (en) * | 2016-06-20 | 2019-08-23 | 中国科学院自动化研究所 | A kind of answering method based on stratification memory network |
EP3385862A1 (en) * | 2017-04-03 | 2018-10-10 | Siemens Aktiengesellschaft | A method and apparatus for performing hierarchical entity classification |
CN109947912B (en) * | 2019-01-25 | 2020-06-23 | 四川大学 | Model method based on intra-paragraph reasoning and joint question answer matching |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN110717431B (en) * | 2019-09-27 | 2023-03-24 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
CN110929515B (en) * | 2019-11-21 | 2023-04-18 | 中国民航大学 | Reading understanding method and system based on cooperative attention and adaptive adjustment |
CN111814982B (en) * | 2020-07-15 | 2021-03-16 | 四川大学 | Multi-hop question-answer oriented dynamic reasoning network system and method |
CN112100348A (en) * | 2020-09-01 | 2020-12-18 | 武汉纺织大学 | Knowledge base question-answer relation detection method and system of multi-granularity attention mechanism |
-
2020
- 2020-12-23 CN CN202011539404.XA patent/CN112732879B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN112732879A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210390271A1 (en) | Neural machine translation systems | |
CN110390397B (en) | Text inclusion recognition method and device | |
CN106502985A (en) | A kind of neural network modeling approach and device for generating title | |
US20210125516A1 (en) | Answer training device, answer training method, answer generation device, answer generation method, and program | |
Nagaraj et al. | Kannada to English Machine Translation Using Deep Neural Network. | |
CN108845990A (en) | Answer selection method, device and electronic equipment based on two-way attention mechanism | |
CN111524593B (en) | Medical question-answering method and system based on context language model and knowledge embedding | |
CN109858046B (en) | Learning long-term dependencies in neural networks using assistance loss | |
CN112732879B (en) | Downstream task processing method and model of question-answering task | |
CN114297399B (en) | Knowledge graph generation method, system, storage medium and electronic equipment | |
CN111079018A (en) | Exercise personalized recommendation method, exercise personalized recommendation device, exercise personalized recommendation equipment and computer readable storage medium | |
CN114218379A (en) | Intelligent question-answering system-oriented method for attributing questions which cannot be answered | |
CN110852071A (en) | Knowledge point detection method, device, equipment and readable storage medium | |
Kumari et al. | Context-based question answering system with suggested questions | |
CN117057414B (en) | Text generation-oriented multi-step collaborative prompt learning black box knowledge distillation method and system | |
CN118467706A (en) | Retrieval enhancement question-answering method and system combined with historical data | |
Arifin et al. | Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory | |
CN117235347A (en) | Teenager algorithm code aided learning system and method based on large language model | |
KR20240128104A (en) | Generating output sequences with inline evidence using language model neural networks | |
CN112580365B (en) | Chapter analysis method, electronic equipment and storage device | |
CN115357712A (en) | Aspect level emotion analysis method and device, electronic equipment and storage medium | |
CN114139535A (en) | Keyword sentence making method and device, computer equipment and readable medium | |
CN114358579A (en) | Evaluation method, evaluation device, electronic device, and computer-readable storage medium | |
Ratna et al. | Hybrid deep learning cnn-bidirectional lstm and manhattan distance for japanese automated short answer grading: Use case in japanese language studies | |
Liu et al. | Investigating the Robustness of Natural Language Generation from Logical Forms via Counterfactual Samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20230705
Address after: No. 1811, 18th Floor, Building 19, Section 1201, Lushan Avenue, Wan'an Street, Tianfu New District, Chengdu, Sichuan, China (Sichuan) Pilot Free Trade Zone, 610213
Patentee after: Sichuan Jiulai Technology Co., Ltd.
Address before: No. 69 Hongguang Avenue, Lijiatuo, Banan District, Chongqing 400054
Patentee before: Chongqing University of Technology