CN114238649A - Common sense concept enhanced language model pre-training method - Google Patents


Info

Publication number
CN114238649A
Authority
CN
China
Prior art keywords
common sense
training
language model
concept
sentence
Prior art date
Legal status
Pending
Application number
CN202111375338.1A
Other languages
Chinese (zh)
Inventor
胡明昊
罗威
罗准辰
谭玉珊
叶宇铭
田昌海
宋宇
毛彬
周纤
Current Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date
Filing date
Publication date
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority to CN202111375338.1A
Publication of CN114238649A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a common sense concept enhanced language model pre-training method, which comprises the following steps: step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence; step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model; and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model. The invention effectively strengthens the common sense comprehension capability of the pre-trained language model, and experiments show that fine-tuning the common sense concept enhanced pre-trained language model on a common sense question-answering task significantly improves the question-answering accuracy.

Description

Common sense concept enhanced language model pre-training method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a common sense concept enhanced language model pre-training method.
Background
In recent years, pre-trained language model technology has developed rapidly in the field of natural language processing, forming a new-generation natural language processing paradigm represented by pre-training and fine-tuning. The technique performs unsupervised pre-training with a Masked Language Modeling (MLM) loss function on large-scale unlabeled corpora and then performs supervised fine-tuning on small-scale annotated datasets. Taking "location: I am wearing a wrist watch" as an example, the input sequence "location: [MASK] am wearing [MASK] wrist watch" can be obtained after randomly masking certain words, and the pre-training goal of the model is to predict the masked words "I" and "a". Currently, methods that fine-tune pre-trained language models have achieved leading performance on most natural language processing tasks, such as text classification, named entity recognition, machine reading comprehension, machine translation and text summarization.
However, despite great advances, pre-trained language models remain deficient in common sense understanding. In particular, these models neglect explicit modeling of common sense concept information during pre-training, resulting in poor performance on tasks involving common sense understanding, such as common sense question answering. For example, in the above example, using the MLM loss function may lead the model to mask and predict simple words such as "I" and "a", which somewhat reduces the difficulty of the pre-training task. On the other hand, the sentence involves several common sense concepts such as "wearing" and "wrist watch"; masking and predicting these words not only increases the task difficulty, but also enables the model to learn more prior common sense knowledge. However, the current MLM pre-training objective cannot explicitly learn such common sense concept knowledge.
To address the above challenges, some techniques have been proposed to improve the MLM pre-training method. For example, researchers have proposed a knowledge-enhanced pre-training method (see document [1] Sun Y, Wang S, Li Y, et al. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint, 2019.), which adds a phrase-level masking task and an entity-level masking task on the basis of the original MLM task, i.e., masking phrases or entities appearing in sentences, so as to increase the difficulty of the pre-training task and enable the model to learn more prior knowledge. However, this approach does not mask the common sense concepts that may appear in sentences, so the resulting models still face suboptimal performance on common sense understanding tasks.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a language model pre-training method based on common sense concepts. The invention aims to solve the problem that common sense concept information cannot be explicitly modeled in existing pre-trained language models, and provides a common sense concept enhanced language model pre-training method.
In order to achieve the above object, the present invention provides a common sense concept enhanced language model pre-training method, which comprises:
step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence;
step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model;
and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) obtaining a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts based on an existing common sense knowledge graph G, wherein c_i is the i-th common sense concept and n is the number of common sense concepts;
step 1-2) collecting a plurality of texts to form a corpus set T = {t_1, …, t_j, …, t_m}, wherein t_j is the j-th sentence and m is the number of sentences; each sentence in the corpus set is hard-matched against each common sense concept in the common sense concept list to obtain a single training sample of the j-th sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j is the set of common sense concepts appearing in the j-th sentence, and S_j and E_j respectively denote the set of start positions and the set of end positions of the common sense concept set C_j in the j-th sentence;
step 1-3) obtaining an unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts from the single training samples of all sentences in the corpus set.
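For illustration only, a minimal Python sketch of the hard-matching in steps 1-1) to 1-3), assuming whitespace-tokenized sentences and word-level start/end positions (the concept list and sentence below are toy stand-ins, not the patent's data):

```python
from typing import List, Tuple

def build_unsupervised_corpus(concepts: List[str], sentences: List[str]) -> List[Tuple]:
    """Hard-match every concept against every sentence and record word-level
    start/end positions, yielding samples u_j = (t_j, C_j, S_j, E_j)."""
    concept_tokens = [c.split() for c in concepts]
    samples = []
    for sent in sentences:
        words = sent.split()
        C_j, S_j, E_j = [], [], []
        for concept, ctoks in zip(concepts, concept_tokens):
            span = len(ctoks)
            for start in range(len(words) - span + 1):
                if words[start:start + span] == ctoks:   # exact ("hard") match
                    C_j.append(concept)
                    S_j.append(start)
                    E_j.append(start + span - 1)
        samples.append((sent, C_j, S_j, E_j))
    return samples

# toy usage with the running example sentence
U = build_unsupervised_corpus(["wearing", "wrist watch"],
                              ["I am wearing a wrist watch"])
print(U[0])   # ('I am wearing a wrist watch', ['wearing', 'wrist watch'], [2, 4], [2, 5])
```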
As an improvement of the above method, the step 2) specifically includes:
step 2-1) traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample;
step 2-2) for each training sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the i-th common sense concept in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching the maximum number of masked words o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining a masked sentence x_j;
concatenating the masked sentence x_j with special symbols as separators to obtain an input sequence, and obtaining the pre-trained word embedding representation H_0 of the sequence as the sum of the token embedding, position embedding and segment embedding;
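For illustration, a Python sketch of the concept-first masking of step 2-2), assuming word-level concept spans from the hard-matching step; the budget o_j = p * w_j is rounded down, and every selected position is simply replaced by "[MASK]" here (the 80/10/10 replacement rule described later in the embodiment is omitted for brevity):

```python
import random

MASK = "[MASK]"

def mask_sample(sentence: str, concept_spans, p: float = 0.15, seed: int = 0) -> str:
    """Mask whole common sense concepts first, then pad with random words until
    the budget o_j is used up. concept_spans holds (start, end) word indices,
    inclusive, as produced by the hard-matching step."""
    rng = random.Random(seed)
    words = sentence.split()
    budget = max(1, int(p * len(words)))      # o_j = p * w_j, rounded down
    masked = list(words)
    covered = set()

    spans = list(concept_spans)
    rng.shuffle(spans)                        # random traversal of C_j
    for start, end in spans:
        if len(covered) >= budget:
            break
        for i in range(start, end + 1):
            if len(covered) < budget:
                masked[i] = MASK
                covered.add(i)

    # concept set exhausted but budget not reached: mask random remaining words
    remaining = [i for i in range(len(words)) if i not in covered]
    rng.shuffle(remaining)
    for i in remaining[: budget - len(covered)]:
        masked[i] = MASK

    return " ".join(masked)

print(mask_sample("I am wearing a wrist watch", [(2, 2), (4, 5)], p=0.5))
```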
Step 2-3) initializing the parameters of the model M to be trained by using the existing pre-training language model;
step 2-4) using a model M to be trained to encode the covered sentences to obtain a hidden state representation H of the input sequenceKWherein H isKThe dimension of (l +2) x d, wherein l represents the length of an input sentence, and d represents a hidden state dimension;
step 2-5) representing the hidden state of the input sequence by HKInputting into a multi-layer perceptron MLP to obtain the probability distribution Y of language modelingP
Yp=softmax(WpHK)
YPIs a probability distribution with one dimension of (l +2) x v, v representing the size of the dictionary, WpIs a trainable parameter with dimension v × d, softmax () represents a normalized exponential function;
step (ii) of2-6) calculating the predicted probability distribution YPWith the authentic tag sequence YgCross entropy loss function between
Figure RE-GDA0003485644900000031
Based on the loss function
Figure RE-GDA0003485644900000032
And training the language model M until the preset condition is met, and finishing training to obtain the pre-training language model M' with the enhanced common sense concept.
As an improvement of the above method, the model M to be trained in step 2-3) comprises K sequentially connected pre-trained Transformer blocks with identical structure, which encode the pre-trained word embedding representation H_0 layer by layer to finally obtain the hidden state representation H_K, wherein each Transformer block receives the hidden state representation H_{k-1} of the previous Transformer block and outputs the encoded hidden state representation H_k of the current Transformer block, satisfying the following formula:
H_k = TransformerBlock(H_{k-1})
wherein TransformerBlock() denotes the Transformer block function.
As an improvement of the above method, each Transformer block comprises a multi-head attention layer, a first residual connection layer, a feed-forward layer and a second residual connection layer connected in series, and the processing of a Transformer block specifically includes:
passing the hidden state representation H_{k-1} of the previous Transformer block through the multi-head attention layer to capture the interaction information between words, obtaining the intermediate representation Z_{k-1}:
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
head_j = Attention(Q_j, K_j, V_j)
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
wherein Concat() denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention() denotes the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
feeding the intermediate representation Z_{k-1} into the first residual connection layer, so that gradients can propagate, to obtain the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm() denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d;
encoding Z'_{k-1} with the feed-forward layer and the second residual connection layer to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
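A compact PyTorch sketch of one such Transformer block in the simplified form given above (single weight matrices W_Z, W_O, W_G without biases); it is a sketch of the described structure, not the exact BERT implementation:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """One block: multi-head attention -> first residual/LayerNorm ->
    feed-forward (W_O, gelu, W_G) -> second residual/LayerNorm."""
    def __init__(self, d: int = 768, h: int = 12):
        super().__init__()
        assert d % h == 0
        self.h, self.d_head = h, d // h
        self.W_q = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^Q
        self.W_k = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^K
        self.W_v = nn.Linear(d, d, bias=False)   # stacks the h matrices W_j^V
        self.W_z = nn.Linear(d, d, bias=False)   # W_Z
        self.W_o = nn.Linear(d, d, bias=False)   # W_O
        self.W_g = nn.Linear(d, d, bias=False)   # W_G
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, H_prev: torch.Tensor) -> torch.Tensor:    # H_{k-1}: (batch, seq, d)
        b, l, d = H_prev.shape
        def split(x):    # (batch, seq, d) -> (batch, h, seq, d_head)
            return x.view(b, l, self.h, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(H_prev)), split(self.W_k(H_prev)), split(self.W_v(H_prev))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = F.softmax(scores, dim=-1) @ V                    # Attention(Q_j, K_j, V_j)
        Z = heads.transpose(1, 2).reshape(b, l, d)               # Concat(head_1, ..., head_h)
        Z_res = self.ln1(self.W_z(Z) + H_prev)                   # first residual connection
        return self.ln2(self.W_g(F.gelu(self.W_o(Z_res))) + Z_res)  # feed-forward + second residual

H_prev = torch.randn(1, 8, 768)
H_next = TransformerBlock()(H_prev)      # H_k = TransformerBlock(H_{k-1})
```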
As an improvement of the above method, the method further comprises performing fine-tuning training of the common sense concept enhanced pre-trained language model M' on a downstream common sense question-answering task to obtain a common sense question-answering model for verifying the effectiveness of M'.
Compared with the prior art, the invention has the advantages that:
1. Aiming at the problem that the common sense comprehension capability of current pre-trained language models is weak, the invention provides a language model pre-training method that explicitly models common sense concepts: an unsupervised corpus set based on common sense concepts is constructed from a common sense knowledge graph, and common sense concept masking pre-training is performed on this corpus set to obtain a common sense concept enhanced pre-trained language model, which effectively strengthens the common sense comprehension capability of the pre-trained language model;
2. The method alleviates the poor common sense comprehension capability of pre-trained language models. Experiments show that fine-tuning the common sense concept enhanced pre-trained language model on a common sense question-answering task significantly improves the question-answering accuracy: on the CommonsenseQA common sense question-answering evaluation task, the method reaches an accuracy of 84.1%, 2.4 percentage points higher than the performance of the baseline model.
Drawings
FIG. 1 is a flow chart of a common sense concept enhanced language model pre-training method of the present invention;
FIG. 2 is a schematic diagram of a network structure of a common sense concept enhanced pre-trained language model of the present invention;
FIG. 3 is a diagram of the internal network architecture of a Transformer block;
FIG. 4 is a schematic diagram of a common sense question-answering model network structure.
Detailed Description
The invention provides a common sense concept enhanced language model pre-training method, which comprises the following steps:
step 1) constructing an unsupervised corpus set based on common sense concepts: given a common sense knowledge graph and a text corpus set, extracting the common sense concepts appearing in the corpus, so that the obtained corpus set comprises a series of sentences, each sentence containing a plurality of common sense concepts, and obtaining the positions of the common sense concepts in the original sentence by hard matching;
step 2) based on this corpus set, randomly masking the common sense concepts to form training samples as the input of a language model, and pre-training the language model with the training objective of predicting the masked words, obtaining a common sense concept enhanced pre-trained language model;
and step 3) performing fine-tuning training of the pre-trained language model on a downstream common sense question-answering task to obtain a common sense question-answering model, in order to verify the effectiveness of the pre-training method.
In the above technical solution, the step 1) specifically includes:
step 1-1) traversing a common sense knowledge graph to obtain a list containing all common sense concepts;
step 1-2) traversing the collected unlabeled corpus set, and hard-matching each sentence in the corpus set against each common sense concept in the common sense concept list to obtain the set of all common sense concepts appearing in the sentence together with the start position and end position of each common sense concept in the sentence, finally obtaining a single training sample;
step 1-3) traversing all sentences in the corpus set to obtain an unsupervised corpus set based on common sense concepts.
in the above technical solution, the step 2) specifically includes:
step 2-1) traversing each sample in the corpus set, and calculating the maximum number of masked words according to the predefined word masking probability and the number of words of the sentence in the sample;
step 2-2) randomly traversing the common sense concept set of each sample, and masking the words at the corresponding positions in the sentence according to the position information of the common sense concepts until the maximum number of masked words is reached; if the common sense concept set has been fully traversed without reaching the maximum number of masked words, randomly masking the still unmasked words in the sentence until the maximum number of masked words is reached, finally obtaining the masked sentence;
step 2-3) initializing the parameters of the model M to be trained with an existing pre-trained language model such as BERT;
step 2-4) encoding the masked sentence with the model to obtain the hidden state representation of the input sequence;
step 2-5) feeding the sequence hidden state representation into an MLP (multi-layer perceptron) layer to obtain the language modeling probability distribution;
and step 2-6) calculating the cross entropy loss function between the predicted probability distribution and the ground-truth label sequence, training the language model M based on the loss function, and obtaining the common sense concept enhanced pre-trained language model M' after training.
The masking of a word in step 2-2) is specifically implemented as follows:
step 2-2-1) with a probability of 80%, the word is replaced by the special identifier [MASK];
step 2-2-2) with a probability of 10%, the word is left unchanged;
step 2-2-3) with a probability of 10%, the word is replaced by a random other word.
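A small Python sketch of this replacement rule for one position chosen for masking (the toy vocabulary is an assumption):

```python
import random

def replace_masked_word(word: str, vocab, rng: random.Random) -> str:
    """Apply the 80% / 10% / 10% replacement rule to a word chosen for masking."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"            # 80%: replace with the special identifier
    if r < 0.9:
        return word                # 10%: keep the word unchanged
    return rng.choice(vocab)       # 10%: replace with a random other word

rng = random.Random(0)
vocab = ["watch", "shirt", "book", "shoe"]
print([replace_masked_word("wrist", vocab, rng) for _ in range(5)])
```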
In the above technical solution, the step 3) specifically includes:
step 3-1) given a supervised common sense question-answering dataset in which each sample consists of a question and an answer, traversing each sample and concatenating the question and the answer to obtain an input sequence;
step 3-2) encoding the input sequence with the common sense concept enhanced pre-trained language model M' to obtain the hidden state representation of the input sequence;
step 3-3) feeding the sequence hidden state representation into an MLP layer to obtain the probability distribution of the common sense question answering;
step 3-4) calculating the cross entropy loss function between the predicted probability distribution and the true answer label, fine-tuning the language model M' based on the loss function, and obtaining the common sense question-answering model M* after fine-tuning;
step 3-5) predicting with the model M* on the common sense question-answering test dataset, calculating the prediction accuracy against the true labels, and evaluating the effect of the pre-training method on the common sense question-answering task.
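A hedged PyTorch-style sketch of the fine-tuning head of steps 3-2) to 3-4), assuming a BERT-like encoder whose [CLS] hidden state is classified into two classes (answer correct / incorrect); the encoder output here is a random stand-in and the label convention is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonsenseQAHead(nn.Module):
    """Maps the [CLS] hidden state of an encoded (question, answer) pair to a
    two-way distribution Y_q: answer candidate is correct / incorrect."""
    def __init__(self, d: int = 768):
        super().__init__()
        self.W_q = nn.Linear(d, 2)

    def forward(self, H_K: torch.Tensor) -> torch.Tensor:    # H_K: (batch, seq, d)
        h_cls = H_K[:, 0]                                     # hidden state at the [CLS] position
        return self.W_q(h_cls)                                # logits over {incorrect, correct}

# stand-in for the encoder output of "<[CLS]> question <[SEP]> answer <[SEP]>"
H_K = torch.randn(4, 32, 768)
labels = torch.tensor([1, 0, 0, 1])           # 1 = correct answer candidate

head = CommonsenseQAHead()
logits = head(H_K)
loss = F.cross_entropy(logits, labels)        # fine-tuning loss for M'
loss.backward()
accuracy = (logits.argmax(dim=-1) == labels).float().mean()   # step 3-5) accuracy
```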
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Examples
As shown in FIG. 1, the embodiment of the present invention provides a pre-training method of a common sense concept enhanced language model, which mainly comprises the following steps:
Step 1) constructing an unsupervised corpus set based on common sense concepts, which specifically includes: firstly, a given common sense knowledge graph G is traversed to obtain a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts, wherein c_i is the i-th common sense concept and n is the number of common sense concepts; the collected unlabeled corpus set T = {t_1, …, t_j, …, t_m} is then traversed, wherein t_j is the j-th sentence and m is the number of sentences; each sentence t_j in the corpus set is hard-matched against each common sense concept c_i in the common sense concept list C to obtain the set of all common sense concepts appearing in the sentence together with the start position and end position of each common sense concept in the sentence, thereby obtaining a single training sample for the sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j = {c_j^1, …, c_j^r} is the set of common sense concepts appearing in the sentence, r is the number of common sense concepts in the set, and S_j = {s_j^1, …, s_j^r} and E_j = {e_j^1, …, e_j^r} respectively denote the set of start positions and the set of end positions of the common sense concept set in the j-th sentence.
Repeating the above steps over all sentences yields the unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts. Taking the sentence "location: I am wearing a wrist watch" as an example, after the above processing the three common sense concepts "location", "wearing" and "wrist watch" and their position information in the original text are obtained by matching, as detailed in Table 1.
TABLE 1 Hard-matching example of common sense concepts
FIG. 2 is a schematic diagram of a network structure of a pre-trained language model with common sense concept enhancement.
Step 2) pre-training the common sense concept enhanced language model, which specifically includes: traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample; then, for sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the common sense concept c_j^i in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining the masked sentence x_j. Specifically, with a probability of 80% the word is replaced by the special identifier [MASK], with a probability of 10% it is left unchanged, and with a probability of 10% it is replaced by a random other word. Taking the sentence in Table 1 as an example, the masked sentence may be "location: I am wearing a [MASK] [MASK]."
After this processing, the masked sentence is concatenated with the special symbols [CLS] and [SEP] as separators to obtain the input sequence <[CLS], Tok_1, Tok_2, …, Tok_l, [SEP]>, and the pre-trained word embedding representation H_0 of the sequence is then obtained, where H_0 is the sum of the token embedding, position embedding and segment embedding.
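A minimal PyTorch sketch of building H_0 as the sum of token, position and segment embeddings (vocabulary size, maximum length, hidden size and the token ids are illustrative):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """H_0 = token embedding + position embedding + segment embedding."""
    def __init__(self, vocab_size: int = 30522, max_len: int = 512,
                 n_segments: int = 2, d: int = 768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)
        self.seg = nn.Embedding(n_segments, d)

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)

# <[CLS], Tok_1, ..., Tok_l, [SEP]> for one masked sentence (token ids are placeholders)
token_ids = torch.tensor([[101, 2345, 1045, 2572, 103, 103, 102]])
segment_ids = torch.zeros_like(token_ids)           # single-segment input
H_0 = InputEmbedding()(token_ids, segment_ids)      # shape (1, l + 2, 768)
```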
Subsequently, the parameters of the model M to be trained are initialized with an existing pre-trained language model such as BERT, which encodes the embedding representation H_0 of the input sequence layer by layer with K pre-trained Transformer blocks:
H_k = TransformerBlock(H_{k-1}), k = 1, …, K
wherein H_k denotes the hidden state representation output by the k-th Transformer block, TransformerBlock() denotes the Transformer block function, and K denotes the number of Transformer blocks; H_K has dimension (l+2) × d, wherein l denotes the input sentence length and d denotes the hidden state dimension.
The hidden state representation H_K of the input sequence is fed into the multi-layer perceptron layer to obtain the language modeling probability distribution Y_p:
Y_p = softmax(W_p H_K)
Y_p is a probability distribution of dimension (l+2) × v, wherein v denotes the vocabulary size and W_p is a trainable parameter of dimension v × d; decoding this probability distribution yields the prediction sequence of language modeling.
The cross entropy loss function between the predicted probability distribution Y_p and the ground-truth label sequence Y_g is calculated, and the language model M is trained based on this loss function; after training, the common sense concept enhanced pre-trained language model M' is obtained.
The Transformer block function receives the hidden state representation H_{k-1} of the previous layer and outputs the encoded hidden state representation H_k of the current layer; its structure is shown in FIG. 3. Specifically, the hidden state representation H_{k-1} of the previous Transformer block first passes through a multi-head attention layer to capture the interaction information between words:
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
head_j = Attention(Q_j, K_j, V_j)
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
wherein Concat denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention is the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
the intermediate representation Z_{k-1} is fed into the first residual connection layer, so that gradients can propagate, obtaining the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d;
the feed-forward layer and the second residual connection layer encode Z'_{k-1} to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
FIG. 4 is a schematic diagram of a common sense question-answering model network structure.
Step 3) verifying the effectiveness of the pre-training method on a common sense question-answering dataset, which specifically includes: first, a common sense question-answering dataset X = {x_1, …, x_z} is given, wherein x_i is the i-th sample and z is the number of samples; each sample x_i = (q_i, a_i) is traversed, wherein q_i is the question of the sample and a_i is the corresponding answer; concatenating q_i and a_i yields the input sequence <[CLS], Tok_1, …, Tok_n, [SEP], Tok_1, …, Tok_m, [SEP]>, wherein the question contains n characters Tok_1, …, Tok_n and the answer contains m characters Tok_1, …, Tok_m; the word embedding representation H_0 of the input sequence is then obtained.
The input sequence is then encoded with the common sense concept enhanced pre-trained language model M', which encodes the embedding representation H_0 of the input sequence layer by layer with K pre-trained Transformer blocks:
H_k = TransformerBlock(H_{k-1}), k = 1, …, K
wherein H_k is the sequence hidden state representation output by the k-th Transformer block.
The hidden state representation h_[CLS] at the [CLS] position of the output H_K of the K-th Transformer block is fed into the multi-layer perceptron layer to obtain the probability distribution Y_q of the common sense question answering:
Y_q = softmax(W_q h_[CLS])
Y_q is a probability distribution of dimension 2; decoding this probability distribution indicates whether the input question-answer pair (q_i, a_i) is predicted to be correct.
The cross entropy loss function between the predicted probability distribution Y_q and the true answer label is calculated, the language model M' is fine-tuned based on this loss function, and the common sense question-answering model M* is obtained after fine-tuning.
The model M* is used to predict on the common sense question-answering test dataset, the prediction accuracy is calculated against the true labels, and the effect of the pre-training method on the common sense question-answering task is evaluated. In the pre-training experiments, the invention uses the Open Mind Common Sense (OMCS) corpus (see document [2] Havasi C, Speer R, Arnold K, et al. Open Mind Common Sense: Crowd-sourcing for Common Sense. AAAI 2010.) as the unlabeled corpus, which contains 820,000 sentences involving a number of common sense concepts, and uses ConceptNet (see document [3] Speer R, Chin J, Havasi C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. AAAI 2017.) as the common sense knowledge graph. In the fine-tuning experiments, the CommonsenseQA dataset (see document [4] Talmor A, Herzig J, Lourie N, et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL 2019.), which contains 9741 training samples and 1221 test samples, is used as the evaluation dataset for the fine-tuning stage. After fine-tuning the common sense concept enhanced pre-trained language model on the training set, the question-answering accuracy on the test set reaches 84.1%, an improvement of 2.4 percentage points over the performance obtained by fine-tuning the baseline pre-trained language model, verifying the effectiveness of the method.
The innovation points of the invention mainly comprise:
the invention designs a language model pre-training method with enhanced common sense concept, and the common sense comprehension capability of the model can be effectively improved by pre-training the language model by using the method; the method is specifically characterized in that the language model is pre-trained by using the method, and after the model is subjected to fine tuning on a downstream common knowledge question-answering task, the question-answering accuracy of the model can be remarkably improved.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. A method of pre-training a common sense concept enhanced language model, the method comprising:
step 1) collecting a corpus and constructing an unsupervised corpus set based on common sense concepts, wherein the unsupervised corpus set comprises a plurality of sentences, and each sentence comprises a plurality of common sense concepts together with the position of each common sense concept in the sentence;
step 2) based on the unsupervised corpus set, randomly masking the common sense concepts to form training samples and inputting them into a pre-established language model for training, wherein the training objective is to predict the masked common sense concepts, thereby obtaining a common sense concept enhanced pre-trained language model;
and step 3) obtaining a language modeling prediction sequence by using the common sense concept enhanced pre-trained language model.
2. The common sense concept-enhanced language model pre-training method according to claim 1, wherein the step 1) specifically comprises:
step 1-1) obtaining a list C = {c_1, …, c_i, …, c_n} containing all common sense concepts based on an existing common sense knowledge graph G, wherein c_i is the i-th common sense concept and n is the number of common sense concepts;
step 1-2) collecting a plurality of texts to form a corpus set T = {t_1, …, t_j, …, t_m}, wherein t_j is the j-th sentence and m is the number of sentences; hard-matching each sentence in the corpus set against each common sense concept in the common sense concept list to obtain a single training sample of the j-th sentence, u_j = (t_j, C_j, S_j, E_j), wherein C_j is the set of common sense concepts appearing in the j-th sentence, and S_j and E_j respectively denote the set of start positions and the set of end positions of the common sense concept set C_j in the j-th sentence;
step 1-3) obtaining an unsupervised corpus set U = {u_1, …, u_j, …, u_m} based on common sense concepts from the single training samples of all sentences in the corpus set.
3. The common sense concept-enhanced language model pre-training method according to claim 2, wherein the step 2) specifically comprises:
step 2-1) traversing each training sample u_j in the unsupervised corpus set U, and calculating the maximum number of masked words o_j = p * w_j according to a predefined word masking probability p and the number of words w_j of the sentence in the sample;
step 2-2) for each training sample u_j, randomly traversing its common sense concept set C_j, and masking the words at the corresponding positions of sentence t_j based on the position information s_j^i and e_j^i of the i-th common sense concept in the sample, until the maximum number of masked words o_j is reached; if the traversal of the common sense concept set C_j ends without reaching the maximum number of masked words o_j, randomly masking the still unmasked words of sentence t_j until o_j is reached, thereby obtaining a masked sentence x_j;
concatenating the masked sentence x_j with special symbols as separators to obtain an input sequence, and obtaining the pre-trained word embedding representation H_0 of the sequence as the sum of the token embedding, position embedding and segment embedding;
step 2-3) initializing the parameters of the model M to be trained with an existing pre-trained language model;
step 2-4) encoding the masked sentence with the model M to be trained to obtain the hidden state representation H_K of the input sequence, wherein H_K has dimension (l+2) × d, l denotes the input sentence length and d denotes the hidden state dimension;
step 2-5) feeding the hidden state representation H_K of the input sequence into a multi-layer perceptron MLP to obtain the language modeling probability distribution Y_p:
Y_p = softmax(W_p H_K)
wherein Y_p is a probability distribution of dimension (l+2) × v, v denotes the vocabulary size, W_p is a trainable parameter of dimension v × d, and softmax() denotes the normalized exponential function;
step 2-6) calculating the cross entropy loss function between the predicted probability distribution Y_p and the ground-truth label sequence Y_g, and training the language model M based on the loss function until a preset condition is met, thereby completing training and obtaining the common sense concept enhanced pre-trained language model M'.
4. The common sense concept enhanced language model pre-training method as claimed in claim 3, wherein the model M to be trained in step 2-3) comprises K sequentially connected pre-trained Transformer blocks with identical structure, which encode the pre-trained word embedding representation H_0 layer by layer to finally obtain the hidden state representation H_K, wherein each Transformer block receives the hidden state representation H_{k-1} of the previous Transformer block and outputs the encoded hidden state representation H_k of the current Transformer block, satisfying the following formula:
H_k = TransformerBlock(H_{k-1})
wherein TransformerBlock() denotes the Transformer block function.
5. The method of claim 4, wherein each Transformer block comprises a multi-head attention layer, a first residual connection layer, a feed-forward layer and a second residual connection layer connected in series, and the processing of a Transformer block specifically comprises:
passing the hidden state representation H_{k-1} of the previous Transformer block through the multi-head attention layer to capture the interaction information between words, obtaining the intermediate representation Z_{k-1}:
Z_{k-1} = Concat(head_1, …, head_j, …, head_h)
head_j = Attention(Q_j, K_j, V_j)
Q_j = W_j^Q H_{k-1},  K_j = W_j^K H_{k-1},  V_j = W_j^V H_{k-1}
wherein Concat() denotes the concatenation operation, head_j denotes the output vector of the j-th attention head, 1 ≤ j ≤ h, h is the number of attention heads, and W_j^Q, W_j^K and W_j^V are parameter matrices of dimension d_head × d; Attention() denotes the attention function, which aggregates text information based on the similarity between words and satisfies the following equation:
Attention(Q_j, K_j, V_j) = softmax(Q_j K_j^T / sqrt(d_head)) V_j
wherein T denotes the transpose;
feeding the intermediate representation Z_{k-1} into the first residual connection layer, so that gradients can propagate, to obtain the output Z'_{k-1} of the first residual connection layer:
Z'_{k-1} = LayerNorm(W_Z Z_{k-1} + H_{k-1})
wherein LayerNorm() denotes the layer normalization function and W_Z is a parameter matrix of dimension d × d; the feed-forward layer and the second residual connection layer encode Z'_{k-1} to obtain the hidden state representation H_k of the current Transformer block:
H_k = LayerNorm(W_G gelu(W_O Z'_{k-1}) + Z'_{k-1})
wherein W_G and W_O are parameter matrices of dimension d × d and gelu denotes the activation function.
6. The method for pre-training the common sense concept enhanced language model according to claim 1, further comprising performing fine-tuning training of the common sense concept enhanced pre-trained language model M' on a downstream common sense question-answering task to obtain a common sense question-answering model for verifying the effectiveness of M'.
CN202111375338.1A 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method Pending CN114238649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111375338.1A CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111375338.1A CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Publications (1)

Publication Number Publication Date
CN114238649A true CN114238649A (en) 2022-03-25

Family

ID=80750180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111375338.1A Pending CN114238649A (en) 2021-11-19 2021-11-19 Common sense concept enhanced language model pre-training method

Country Status (1)

Country Link
CN (1) CN114238649A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024087754A1 (en) * 2022-10-27 2024-05-02 中国电子科技集团公司第十研究所 Multi-dimensional comprehensive text identification method


Similar Documents

Publication Publication Date Title
CN111626063B (en) Text intention identification method and system based on projection gradient descent and label smoothing
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN109918671A (en) Electronic health record entity relation extraction method based on convolution loop neural network
CN112733533B (en) Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109635124A (en) A kind of remote supervisory Relation extraction method of combination background knowledge
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN114926150A (en) Digital intelligent auditing method and device for transformer technology conformance assessment
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
Mozafari et al. BAS: an answer selection method using BERT language model
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN115048511A (en) Bert-based passport layout analysis method
CN115496072A (en) Relation extraction method based on comparison learning
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN114238649A (en) Common sense concept enhanced language model pre-training method
CN117113937A (en) Electric power field reading and understanding method and system based on large-scale language model
CN114881038B (en) Chinese entity and relation extraction method and device based on span and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination