CN114238649A - Common sense concept enhanced language model pre-training method - Google Patents


Info

Publication number
CN114238649A
Authority
CN
China
Prior art keywords
common sense
training
language model
concept
sentence
Prior art date
Legal status
Pending
Application number
CN202111375338.1A
Other languages
Chinese (zh)
Inventor
胡明昊
罗威
罗准辰
谭玉珊
叶宇铭
田昌海
宋宇
毛彬
周纤
Current Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date
Filing date
Publication date
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority to CN202111375338.1A
Publication of CN114238649A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a common sense concept enhanced language model pre-training method, which comprises the following steps: step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence; step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model; and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model. The invention effectively strengthens the common sense comprehension capability of the pre-trained language model, and experiments show that fine-tuning the common sense concept enhanced pre-trained language model on a common sense question-answering task significantly improves the question-answering accuracy.

Description

Common sense concept enhanced language model pre-training method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a common sense concept enhanced language model pre-training method.
Background
In recent years, pre-trained language model technology has developed rapidly in the field of natural language processing, forming a new-generation natural language processing paradigm represented by pre-training and fine-tuning. The technique performs unsupervised pre-training with a Masked Language Modeling (MLM) loss function on large-scale unlabeled corpora and then performs supervised fine-tuning on small-scale annotated datasets. Taking "location: I am wearing a wrist watch" as an example, the input sequence "location: [MASK] am wearing [MASK] wrist watch" can be obtained after randomly masking certain words, and the pre-training goal of the model is to predict the masked words "I" and "a". Currently, methods that fine-tune pre-trained language models have achieved leading performance on most natural language processing tasks, such as text classification, named entity recognition, machine reading comprehension, machine translation and text summarization.
However, despite great advances, pre-trained language models remain deficient in common sense understanding. In particular, these models neglect explicit modeling of common sense concept information during pre-training, resulting in poor performance on tasks involving common sense understanding, such as common sense question answering. For example, in the above example, using the MLM loss function may lead the model to mask and predict simple words such as "I" and "a", which somewhat reduces the difficulty of the pre-training task. On the other hand, the sentence involves several common sense concepts such as "wearing" and "wrist watch"; masking and predicting these words not only increases the task difficulty, but also enables the model to learn more prior common sense knowledge. However, the current MLM pre-training objective cannot explicitly learn such common sense concept knowledge.
To address the above challenges, some techniques have been proposed to improve the MLM pre-training method. For example, researchers have proposed a knowledge-enhanced pre-training method (see document [1] Sun Y, Wang S, Li Y, et al. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint, 2019.), which adds a phrase-level masking task and an entity-level masking task on the basis of the original MLM task, i.e., masking phrases or entities appearing in sentences, so as to increase the difficulty of the pre-training task and enable the model to learn more prior knowledge. However, this approach does not mask the common sense concepts that may appear in sentences, so the resulting models still face suboptimal performance on common sense understanding tasks.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a language model pre-training method based on common sense concepts. The invention aims to solve the problem that common sense concept information cannot be explicitly modeled in existing pre-trained language models, and provides a common sense concept enhanced language model pre-training method.
In order to achieve the above object, the present invention provides a common sense concept enhanced language model pre-training method, which comprises:
step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence;
step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model;
and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) obtaining a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts based on an existing common sense knowledge graph G, wherein c_i is the i-th common sense concept and n is the number of common sense concepts;
step 1-2) collecting a plurality of texts to form a corpus set T = {t_1, …, t_j, …, t_m}, wherein t_j is the j-th sentence and m is the number of sentences; each sentence in the corpus set is hard-matched against each common sense concept in the common sense concept list to obtain a single training sample of the j-th sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j is the set of common sense concepts appearing in the j-th sentence, and S_j and E_j respectively denote the set of start positions and the set of end positions of the common sense concept set C_j in the j-th sentence;
step 1-3) obtaining an unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts from the single training samples of all sentences in the corpus set.
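For illustration only, a minimal Python sketch of the hard-matching in steps 1-1) to 1-3), assuming whitespace-tokenized sentences and word-level start/end positions (the concept list and sentence below are toy stand-ins, not the patent's data):

```python
from typing import List, Tuple

def build_unsupervised_corpus(concepts: List[str], sentences: List[str]) -> List[Tuple]:
    """Hard-match every concept against every sentence and record word-level
    start/end positions, yielding samples u_j = (t_j, C_j, S_j, E_j)."""
    concept_tokens = [c.split() for c in concepts]
    samples = []
    for sent in sentences:
        words = sent.split()
        C_j, S_j, E_j = [], [], []
        for concept, ctoks in zip(concepts, concept_tokens):
            span = len(ctoks)
            for start in range(len(words) - span + 1):
                if words[start:start + span] == ctoks:   # exact ("hard") match
                    C_j.append(concept)
                    S_j.append(start)
                    E_j.append(start + span - 1)
        samples.append((sent, C_j, S_j, E_j))
    return samples

# toy usage with the running example sentence
U = build_unsupervised_corpus(["wearing", "wrist watch"],
                              ["I am wearing a wrist watch"])
print(U[0])   # ('I am wearing a wrist watch', ['wearing', 'wrist watch'], [2, 4], [2, 5])
```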
As an improvement of the above method, the step 2) specifically includes:
step 2-1) traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample;
step 2-2) for each training sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the i-th common sense concept in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching the maximum number of masked words o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining a masked sentence x_j;
concatenating the masked sentence x_j with special symbols as separators to obtain an input sequence, and obtaining the pre-trained word embedding representation H_0 of the sequence as the sum of the token embedding, position embedding and segment embedding;
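For illustration, a Python sketch of the concept-first masking of step 2-2), assuming word-level concept spans from the hard-matching step; the budget o_j = p * w_j is rounded down, and every selected position is simply replaced by "[MASK]" here (the 80/10/10 replacement rule described later in the embodiment is omitted for brevity):

```python
import random

MASK = "[MASK]"

def mask_sample(sentence: str, concept_spans, p: float = 0.15, seed: int = 0) -> str:
    """Mask whole common sense concepts first, then pad with random words until
    the budget o_j is used up. concept_spans holds (start, end) word indices,
    inclusive, as produced by the hard-matching step."""
    rng = random.Random(seed)
    words = sentence.split()
    budget = max(1, int(p * len(words)))      # o_j = p * w_j, rounded down
    masked = list(words)
    covered = set()

    spans = list(concept_spans)
    rng.shuffle(spans)                        # random traversal of C_j
    for start, end in spans:
        if len(covered) >= budget:
            break
        for i in range(start, end + 1):
            if len(covered) < budget:
                masked[i] = MASK
                covered.add(i)

    # concept set exhausted but budget not reached: mask random remaining words
    remaining = [i for i in range(len(words)) if i not in covered]
    rng.shuffle(remaining)
    for i in remaining[: budget - len(covered)]:
        masked[i] = MASK

    return " ".join(masked)

print(mask_sample("I am wearing a wrist watch", [(2, 2), (4, 5)], p=0.5))
```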
Step 2-3) initializing the parameters of the model M to be trained by using the existing pre-training language model;
step 2-4) using a model M to be trained to encode the covered sentences to obtain a hidden state representation H of the input sequenceKWherein H isKThe dimension of (l +2) x d, wherein l represents the length of an input sentence, and d represents a hidden state dimension;
step 2-5) representing the hidden state of the input sequence by HKInputting into a multi-layer perceptron MLP to obtain the probability distribution Y of language modelingP
Yp=softmax(WpHK)
YPIs a probability distribution with one dimension of (l +2) x v, v representing the size of the dictionary, WpIs a trainable parameter with dimension v × d, softmax () represents a normalized exponential function;
step (ii) of2-6) calculating the predicted probability distribution YPWith the authentic tag sequence YgCross entropy loss function between
Figure RE-GDA0003485644900000031
Based on the loss function
Figure RE-GDA0003485644900000032
And training the language model M until the preset condition is met, and finishing training to obtain the pre-training language model M' with the enhanced common sense concept.
As an improvement of the above method, the model M to be trained in step 2-3) comprises K sequentially connected pre-trained Transformer blocks with identical structure, which encode the pre-trained word embedding representation H_0 layer by layer to finally obtain the hidden state representation H_K, wherein each Transformer block receives the hidden state representation H_{k-1} of the previous Transformer block and outputs the encoded hidden state representation H_k of the current Transformer block, satisfying the following formula:
H_k = TransformerBlock(H_{k-1})
wherein TransformerBlock() denotes the Transformer block function.
As an improvement of the above method, each Transformer block comprises a multi-head attention layer, a first residual connection layer, a feed-forward layer and a second residual connection layer connected in series, and the processing of a Transformer block specifically includes:
passing the hidden state representation H_{k-1} of the previous Transformer block through the multi-head attention layer to capture the interaction information between words, obtaining the intermediate representation Z_{k-1}:
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
head_j = Attention(Q_j, K_j, V_j)
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
wherein Concat() denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention() denotes the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
feeding the intermediate representation Z_{k-1} into the first residual connection layer, so that gradients can propagate, to obtain the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm() denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d;
encoding Z'_{k-1} with the feed-forward layer and the second residual connection layer to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
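A compact PyTorch sketch of one such Transformer block in the simplified form given above (single weight matrices W_Z, W_O, W_G without biases); it is a sketch of the described structure, not the exact BERT implementation:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """One block: multi-head attention -> first residual/LayerNorm ->
    feed-forward (W_O, gelu, W_G) -> second residual/LayerNorm."""
    def __init__(self, d: int = 768, h: int = 12):
        super().__init__()
        assert d % h == 0
        self.h, self.d_head = h, d // h
        self.W_q = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^Q
        self.W_k = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^K
        self.W_v = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^V
        self.W_z = nn.Linear(d, d, bias=False)   # W_Z
        self.W_o = nn.Linear(d, d, bias=False)   # W_O
        self.W_g = nn.Linear(d, d, bias=False)   # W_G
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, H_prev: torch.Tensor) -> torch.Tensor:    # H_{k-1}: (batch, seq, d)
        b, l, d = H_prev.shape
        def split(x):    # (batch, seq, d) -> (batch, h, seq, d_head)
            return x.view(b, l, self.h, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(H_prev)), split(self.W_k(H_prev)), split(self.W_v(H_prev))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = F.softmax(scores, dim=-1) @ V                    # Attention(Q_j, K_j, V_j)
        Z = heads.transpose(1, 2).reshape(b, l, d)               # Concat(head_1, ..., head_h)
        Z_res = self.ln1(self.W_z(Z) + H_prev)                   # first residual connection
        return self.ln2(self.W_g(F.gelu(self.W_o(Z_res))) + Z_res)  # feed-forward + second residual

H_prev = torch.randn(1, 8, 768)
H_next = TransformerBlock()(H_prev)      # H_k = TransformerBlock(H_{k-1})
```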
As an improvement of the above method, the method further comprises performing fine-tuning training of the common sense concept enhanced pre-trained language model M' on a downstream common sense question-answering task to obtain a common sense question-answering model for verifying the effectiveness of M'.
Compared with the prior art, the invention has the advantages that:
1. Aiming at the problem that the common sense comprehension capability of current pre-trained language models is weak, the invention provides a language model pre-training method that explicitly models common sense concepts: an unsupervised corpus set based on common sense concepts is constructed from a common sense knowledge graph, and common sense concept masking pre-training is performed on this corpus set to obtain a common sense concept enhanced pre-trained language model, which effectively strengthens the common sense comprehension capability of the pre-trained language model;
2. The method alleviates the poor common sense comprehension capability of pre-trained language models. Experiments show that fine-tuning the common sense concept enhanced pre-trained language model on a common sense question-answering task significantly improves the question-answering accuracy: on the CommonsenseQA common sense question-answering evaluation task, the method reaches an accuracy of 84.1%, 2.4 percentage points higher than the performance of the baseline model.
Drawings
FIG. 1 is a flow chart of a common sense concept enhanced language model pre-training method of the present invention;
FIG. 2 is a schematic diagram of a network structure of a common sense concept enhanced pre-trained language model of the present invention;
FIG. 3 is a diagram of the internal network architecture of a Transformer block;
FIG. 4 is a schematic diagram of a common sense question-answering model network structure.
Detailed Description
The invention provides a common sense concept enhanced language model pre-training method, which comprises the following steps:
step 1) constructing an unsupervised corpus set based on common sense concepts: given a common sense knowledge graph and a text corpus set, extracting the common sense concepts appearing in the corpus, so that the obtained corpus set comprises a series of sentences, each sentence containing a plurality of common sense concepts, and obtaining the positions of the common sense concepts in the original sentence by hard matching;
step 2) based on this corpus set, randomly masking the common sense concepts to form training samples as the input of a language model, and pre-training the language model with the training objective of predicting the masked words, obtaining a common sense concept enhanced pre-trained language model;
and step 3) performing fine-tuning training of the pre-trained language model on a downstream common sense question-answering task to obtain a common sense question-answering model, in order to verify the effectiveness of the pre-training method.
In the above technical solution, the step 1) specifically includes:
step 1-1) traversing a common sense knowledge graph to obtain a list containing all common sense concepts;
step 1-2) traversing the collected unlabeled corpus set, and hard-matching each sentence in the corpus set against each common sense concept in the common sense concept list to obtain the set of all common sense concepts appearing in the sentence together with the start position and end position of each common sense concept in the sentence, finally obtaining a single training sample;
step 1-3) traversing all sentences in the corpus set to obtain an unsupervised corpus set based on common sense concepts.
in the above technical solution, the step 2) specifically includes:
step 2-1) traversing each sample in the corpus set, and calculating the maximum number of masked words according to the predefined word masking probability and the number of words of the sentence in the sample;
step 2-2) randomly traversing the common sense concept set of each sample, and masking the words at the corresponding positions in the sentence according to the position information of the common sense concepts until the maximum number of masked words is reached; if the common sense concept set has been fully traversed without reaching the maximum number of masked words, randomly masking the still unmasked words in the sentence until the maximum number of masked words is reached, finally obtaining the masked sentence;
step 2-3) initializing the parameters of the model M to be trained with an existing pre-trained language model such as BERT;
step 2-4) encoding the masked sentence with the model to obtain the hidden state representation of the input sequence;
step 2-5) feeding the sequence hidden state representation into an MLP (multi-layer perceptron) layer to obtain the language modeling probability distribution;
and step 2-6) calculating the cross entropy loss function between the predicted probability distribution and the ground-truth label sequence, training the language model M based on the loss function, and obtaining the common sense concept enhanced pre-trained language model M' after training.
The masking of a word in step 2-2) is specifically implemented as follows:
step 2-2-1) with a probability of 80%, the word is replaced by the special identifier [MASK];
step 2-2-2) with a probability of 10%, the word is left unchanged;
step 2-2-3) with a probability of 10%, the word is replaced by a random other word.
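A small Python sketch of this replacement rule for one position chosen for masking (the toy vocabulary is an assumption):

```python
import random

def replace_masked_word(word: str, vocab, rng: random.Random) -> str:
    """Apply the 80% / 10% / 10% replacement rule to a word chosen for masking."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"            # 80%: replace with the special identifier
    if r < 0.9:
        return word                # 10%: keep the word unchanged
    return rng.choice(vocab)       # 10%: replace with a random other word

rng = random.Random(0)
vocab = ["watch", "shirt", "book", "shoe"]
print([replace_masked_word("wrist", vocab, rng) for _ in range(5)])
```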
In the above technical solution, the step 3) specifically includes:
step 3-1) given a supervised common sense question-answering dataset in which each sample consists of a question and an answer, traversing each sample and concatenating the question and the answer to obtain an input sequence;
step 3-2) encoding the input sequence with the common sense concept enhanced pre-trained language model M' to obtain the hidden state representation of the input sequence;
step 3-3) feeding the sequence hidden state representation into an MLP layer to obtain the probability distribution of the common sense question answering;
step 3-4) calculating the cross entropy loss function between the predicted probability distribution and the true answer label, fine-tuning the language model M' based on the loss function, and obtaining the common sense question-answering model M* after fine-tuning;
step 3-5) predicting with the model M* on the common sense question-answering test dataset, calculating the prediction accuracy against the true labels, and evaluating the effect of the pre-training method on the common sense question-answering task.
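A hedged PyTorch-style sketch of the fine-tuning head of steps 3-2) to 3-4), assuming a BERT-like encoder whose [CLS] hidden state is classified into two classes (answer correct / incorrect); the encoder output here is a random stand-in and the label convention is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonsenseQAHead(nn.Module):
    """Maps the [CLS] hidden state of an encoded (question, answer) pair to a
    two-way distribution Y_q: answer candidate is correct / incorrect."""
    def __init__(self, d: int = 768):
        super().__init__()
        self.W_q = nn.Linear(d, 2)

    def forward(self, H_K: torch.Tensor) -> torch.Tensor:    # H_K: (batch, seq, d)
        h_cls = H_K[:, 0]                                     # hidden state at the [CLS] position
        return self.W_q(h_cls)                                # logits over {incorrect, correct}

# stand-in for the encoder output of "<[CLS]> question <[SEP]> answer <[SEP]>"
H_K = torch.randn(4, 32, 768)
labels = torch.tensor([1, 0, 0, 1])           # 1 = correct answer candidate

head = CommonsenseQAHead()
logits = head(H_K)
loss = F.cross_entropy(logits, labels)        # fine-tuning loss for M'
loss.backward()
accuracy = (logits.argmax(dim=-1) == labels).float().mean()   # step 3-5) accuracy
```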
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Examples
As shown in FIG. 1, the embodiment of the present invention provides a pre-training method of a common sense concept enhanced language model, which mainly comprises the following steps:
Step 1) constructing an unsupervised corpus set based on common sense concepts, which specifically includes: firstly, a given common sense knowledge graph G is traversed to obtain a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts, wherein c_i is the i-th common sense concept and n is the number of common sense concepts; the collected unlabeled corpus set T = {t_1, …, t_j, …, t_m} is then traversed, wherein t_j is the j-th sentence and m is the number of sentences; each sentence t_j in the corpus set is hard-matched against each common sense concept c_i in the common sense concept list C to obtain the set of all common sense concepts appearing in the sentence together with the start position and end position of each common sense concept in the sentence, thereby obtaining a single training sample for the sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j = {c_j^1, …, c_j^r} is the set of common sense concepts appearing in the sentence, r is the number of common sense concepts in the set, and S_j = {s_j^1, …, s_j^r} and E_j = {e_j^1, …, e_j^r} respectively denote the set of start positions and the set of end positions of the common sense concept set in the j-th sentence.
Repeating the above steps over all sentences yields the unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts. Taking the sentence "location: I am wearing a wrist watch" as an example, after the above processing the three common sense concepts "location", "wearing" and "wrist watch" and their position information in the original text are obtained by matching, as detailed in Table 1.
TABLE 1 Hard-matching example of common sense concepts
FIG. 2 is a schematic diagram of a network structure of a pre-trained language model with common sense concept enhancement.
Step 2) pre-training the common sense concept enhanced language model, which specifically includes: traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample; then, for sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the common sense concept c_j^i in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining the masked sentence x_j. Specifically, with a probability of 80% the word is replaced by the special identifier [MASK], with a probability of 10% it is left unchanged, and with a probability of 10% it is replaced by a random other word. Taking the sentence in Table 1 as an example, the masked sentence may be "location: I am wearing a [MASK] [MASK]."
After this processing, the masked sentence is concatenated with the special symbols [CLS] and [SEP] as separators to obtain the input sequence <[CLS], Tok_1, Tok_2, …, Tok_l, [SEP]>, and the pre-trained word embedding representation H_0 of the sequence is then obtained, where H_0 is the sum of the token embedding, position embedding and segment embedding.
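A minimal PyTorch sketch of building H_0 as the sum of token, position and segment embeddings (vocabulary size, maximum length, hidden size and the token ids are illustrative):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """H_0 = token embedding + position embedding + segment embedding."""
    def __init__(self, vocab_size: int = 30522, max_len: int = 512,
                 n_segments: int = 2, d: int = 768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)
        self.seg = nn.Embedding(n_segments, d)

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)

# <[CLS], Tok_1, ..., Tok_l, [SEP]> for one masked sentence (token ids are placeholders)
token_ids = torch.tensor([[101, 2345, 1045, 2572, 103, 103, 102]])
segment_ids = torch.zeros_like(token_ids)           # single-segment input
H_0 = InputEmbedding()(token_ids, segment_ids)      # shape (1, l + 2, 768)
```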
Subsequently, the parameters of the model M to be trained are initialized with an existing pre-trained language model such as BERT, which encodes the embedding representation H_0 of the input sequence layer by layer with K pre-trained Transformer blocks:
H_k = TransformerBlock(H_{k-1}), k = 1, …, K
wherein H_k denotes the hidden state representation output by the k-th Transformer block, TransformerBlock() denotes the Transformer block function, and K denotes the number of Transformer blocks; H_K has dimension (l+2) × d, wherein l denotes the input sentence length and d denotes the hidden state dimension.
The hidden state representation H_K of the input sequence is fed into the multi-layer perceptron layer to obtain the language modeling probability distribution Y_p:
Y_p = softmax(W_p H_K)
Y_p is a probability distribution of dimension (l+2) × v, wherein v denotes the vocabulary size and W_p is a trainable parameter of dimension v × d; decoding this probability distribution yields the prediction sequence of language modeling.
The cross entropy loss function between the predicted probability distribution Y_p and the ground-truth label sequence Y_g is calculated, and the language model M is trained based on this loss function; after training, the common sense concept enhanced pre-trained language model M' is obtained.
The Transformer block function receives the hidden state representation H_{k-1} of the previous layer and outputs the encoded hidden state representation H_k of the current layer; its structure is shown in FIG. 3. Specifically, the hidden state representation H_{k-1} of the previous Transformer block first passes through a multi-head attention layer to capture the interaction information between words:
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
head_j = Attention(Q_j, K_j, V_j)
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
wherein Concat denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention is the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
the intermediate representation Z_{k-1} is fed into the first residual connection layer, so that gradients can propagate, obtaining the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d;
the feed-forward layer and the second residual connection layer encode Z'_{k-1} to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
FIG. 4 is a schematic diagram of a common sense question-answering model network structure.
Step 3) verifying the effectiveness of the pre-training method on a common sense question-answering dataset, which specifically includes: first, a common sense question-answering dataset X = {x_1, …, x_z} is given, wherein x_i is the i-th sample and z is the number of samples; each sample x_i = (q_i, a_i) is traversed, wherein q_i is the question of the sample and a_i is the corresponding answer; concatenating q_i and a_i yields the input sequence <[CLS], Tok_1, …, Tok_n, [SEP], Tok_1, …, Tok_m, [SEP]>, wherein the question contains n characters Tok_1, …, Tok_n and the answer contains m characters Tok_1, …, Tok_m; the word embedding representation H_0 of the input sequence is then obtained.
The input sequence is then encoded with the common sense concept enhanced pre-trained language model M', which encodes the embedding representation H_0 of the input sequence layer by layer with K pre-trained Transformer blocks:
H_k = TransformerBlock(H_{k-1}), k = 1, …, K
wherein H_k is the sequence hidden state representation output by the k-th Transformer block.
The hidden state representation h_[CLS] at the [CLS] position of the output H_K of the K-th Transformer block is fed into the multi-layer perceptron layer to obtain the probability distribution Y_q of the common sense question answering:
Y_q = softmax(W_q h_[CLS])
Y_q is a probability distribution of dimension 2; decoding this probability distribution indicates whether the input question-answer pair (q_i, a_i) is predicted to be correct.
The cross entropy loss function between the predicted probability distribution Y_q and the true answer label is calculated, the language model M' is fine-tuned based on this loss function, and the common sense question-answering model M* is obtained after fine-tuning.
The model M* is used to predict on the common sense question-answering test dataset, the prediction accuracy is calculated against the true labels, and the effect of the pre-training method on the common sense question-answering task is evaluated. In the pre-training experiments, the invention uses the Open Mind Common Sense (OMCS) corpus (see document [2] Havasi C, Speer R, Arnold K, et al. Open Mind Common Sense: Crowd-sourcing for Common Sense. AAAI 2010.) as the unlabeled corpus, which contains 820,000 sentences involving a number of common sense concepts, and uses ConceptNet (see document [3] Speer R, Chin J, Havasi C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. AAAI 2017.) as the common sense knowledge graph. In the fine-tuning experiments, the CommonsenseQA dataset (see document [4] Talmor A, Herzig J, Lourie N, et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL 2019.), which contains 9741 training samples and 1221 test samples, is used as the evaluation dataset for the fine-tuning stage. After fine-tuning the common sense concept enhanced pre-trained language model on the training set, the question-answering accuracy on the test set reaches 84.1%, an improvement of 2.4 percentage points over the performance obtained by fine-tuning the baseline pre-trained language model, verifying the effectiveness of the method.
The innovation points of the invention mainly comprise:
the invention designs a language model pre-training method with enhanced common sense concept, and the common sense comprehension capability of the model can be effectively improved by pre-training the language model by using the method; the method is specifically characterized in that the language model is pre-trained by using the method, and after the model is subjected to fine tuning on a downstream common knowledge question-answering task, the question-answering accuracy of the model can be remarkably improved.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. A method of pre-training a common sense concept enhanced language model, the method comprising:
step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence;
step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model;
and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model.
2. The common sense concept-enhanced language model pre-training method according to claim 1, wherein the step 1) specifically comprises:
step 1-1) obtaining a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts based on an existing common sense knowledge graph G, wherein c_i is the i-th common sense concept and n is the number of common sense concepts;
step 1-2) collecting a plurality of texts to form a corpus set T = {t_1, …, t_j, …, t_m}, wherein t_j is the j-th sentence and m is the number of sentences; hard-matching each sentence in the corpus set against each common sense concept in the common sense concept list to obtain a single training sample of the j-th sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j is the set of common sense concepts appearing in the j-th sentence, and S_j and E_j respectively denote the set of start positions and the set of end positions of the common sense concept set C_j in the j-th sentence;
step 1-3) obtaining an unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts from the single training samples of all sentences in the corpus set.
3. The common sense concept-enhanced language model pre-training method according to claim 2, wherein the step 2) specifically comprises:
step 2-1) traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample;
step 2-2) for each training sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the i-th common sense concept in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching the maximum number of masked words o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining a masked sentence x_j;
concatenating the masked sentence x_j with special symbols as separators to obtain an input sequence, and obtaining the pre-trained word embedding representation H_0 of the sequence as the sum of the token embedding, position embedding and segment embedding;
step 2-3) initializing the parameters of the model M to be trained with an existing pre-trained language model;
step 2-4) encoding the masked sentence with the model M to be trained to obtain the hidden state representation H_K of the input sequence, wherein H_K has dimension (l+2) × d, l denotes the input sentence length and d denotes the hidden state dimension;
step 2-5) feeding the hidden state representation H_K of the input sequence into a multi-layer perceptron MLP to obtain the language modeling probability distribution Y_p:
Y_p = softmax(W_p H_K)
wherein Y_p is a probability distribution of dimension (l+2) × v, v denotes the vocabulary size, W_p is a trainable parameter of dimension v × d, and softmax() denotes the normalized exponential function;
step 2-6) calculating the cross entropy loss function between the predicted probability distribution Y_p and the ground-truth label sequence Y_g, and training the language model M based on the loss function until a preset condition is met, thereby completing training and obtaining the common sense concept enhanced pre-trained language model M'.
4. The common sense concept enhanced language model pre-training method as claimed in claim 3, wherein the model M to be trained in step 2-3) comprises K sequentially connected pre-trained Transformer blocks with identical structure, which encode the pre-trained word embedding representation H_0 layer by layer to finally obtain the hidden state representation H_K, wherein each Transformer block receives the hidden state representation H_{k-1} of the previous Transformer block and outputs the encoded hidden state representation H_k of the current Transformer block, satisfying the following formula:
H_k = TransformerBlock(H_{k-1})
wherein TransformerBlock() denotes the Transformer block function.
5. The method of claim 4, wherein each Transformer block comprises a multi-head attention layer, a first residual connection layer, a feed-forward layer and a second residual connection layer connected in series, and the processing of a Transformer block specifically comprises:
passing the hidden state representation H_{k-1} of the previous Transformer block through the multi-head attention layer to capture the interaction information between words, obtaining the intermediate representation Z_{k-1}:
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
head_j = Attention(Q_j, K_j, V_j)
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
wherein Concat() denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention() denotes the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
feeding the intermediate representation Z_{k-1} into the first residual connection layer, so that gradients can propagate, to obtain the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm() denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d; the feed-forward layer and the second residual connection layer encode Z'_{k-1} to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
6. The method for pre-training the common sense concept enhanced language model according to claim 1, further comprising performing fine-tuning training of the common sense concept enhanced pre-trained language model M' on a downstream common sense question-answering task to obtain a common sense question-answering model for verifying the effectiveness of M'.
CN202111375338.1A 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method Pending CN114238649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111375338.1A CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111375338.1A CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Publications (1)

Publication Number Publication Date
CN114238649A true CN114238649A (en) 2022-03-25

Family

ID=80750180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111375338.1A Pending CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Country Status (1)

Country Link
CN (1) CN114238649A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024087754A1 (en) * 2022-10-27 2024-05-02 中国电子科技集团公司第十研究所 Multi-dimensional comprehensive text identification method


Similar Documents

Publication Publication Date Title
CN111626063B (en) Text intention identification method and system based on projection gradient descent and label smoothing
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN109918671A (en) Electronic health record entity relation extraction method based on convolution loop neural network
CN112733533B (en) Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109635124A (en) A kind of remote supervisory Relation extraction method of combination background knowledge
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN114926150A (en) Digital intelligent auditing method and device for transformer technology conformance assessment
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
Mozafari et al. BAS: an answer selection method using BERT language model
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN115048511A (en) Bert-based passport layout analysis method
CN115496072A (en) Relation extraction method based on comparison learning
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN114238649A (en) Common sense concept enhanced language model pre-training method
CN117113937A (en) Electric power field reading and understanding method and system based on large-scale language model
CN114881038B (en) Chinese entity and relation extraction method and device based on span and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination