CN110543639A - English sentence simplification algorithm based on pre-training Transformer language model - Google Patents


Info

Publication number
CN110543639A
CN110543639A
Authority
CN
China
Prior art keywords
word
words
sentence
csi
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910863529.9A
Other languages
Chinese (zh)
Other versions
CN110543639B (en)
Inventor
强继朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910863529.9A priority Critical patent/CN110543639B/en
Publication of CN110543639A publication Critical patent/CN110543639A/en
Application granted granted Critical
Publication of CN110543639B publication Critical patent/CN110543639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an English sentence simplification algorithm based on a pre-trained Transformer language model, which is carried out according to the following steps: step 1, utilizing the public Wikipedia corpus to count word frequencies; step 2, obtaining vector representations of words by using a public pre-trained word embedding model; step 3, preprocessing the sentence to be simplified to obtain its content words; step 4, acquiring a candidate substitutive word set for a content word in the sentence by utilizing the public pre-trained Transformer language model Bert; step 5, sorting the candidate substitutive word set of each content word by using a plurality of characteristics; step 6, comparing the word frequency of the top-ranked candidate word with the word frequency of the original content word to determine the final substitutive word; and step 7, processing the other content words in the sentence according to steps 4 to 6 in sequence to obtain the final simplified sentence.

Description

English sentence simplification algorithm based on pre-training Transformer language model
Technical Field
The invention relates to the field of English text simplification, in particular to an English sentence simplification algorithm based on a pre-trained Transformer language model.
Background
In recent years, more and more English material has become available on the Internet; for example, many professional papers are written in English and published in English journals, and many readers prefer to read the English material directly rather than rely on translations into Chinese. However, because English is not their native language, an insufficient vocabulary severely hampers their understanding of English material. Many studies have confirmed that if 90% of the words in a text can be understood, the meaning of the text is much easier to grasp, even for a long and complicated text. In addition, English text simplification also helps native English speakers, especially people with low literacy, cognitive or language impairments, or limited knowledge of the text's language.
The vocabulary simplification step in sentence simplification aims to replace complex words in a sentence with simple synonyms, which greatly lowers the vocabulary demands placed on the reader. Existing vocabulary simplification algorithms can be roughly divided into the following stages: complex word identification, generation of candidate substitute words for the complex words, ranking of the candidate substitutes, and selection of the candidate substitutes. According to how the candidate substitutes are generated, vocabulary simplification algorithms fall into three broad categories. The first category is dictionary-based: a dictionary (such as WordNet) is used to produce synonyms of the complex word as candidate substitutes. The second category is based on parallel corpora; the most common parallel corpus is built from normal English Wikipedia and its simplified, child-oriented counterpart, where a matching algorithm selects sentences from the two Wikipedias as parallel sentence pairs, simplification rules are extracted from these pairs, and the rules are used to generate candidate substitutes for complex words. The third category is based on word embedding models: vector representations of words are obtained from a word embedding model, and a word similarity measure is used to find the set of words most similar to the complex word as candidate substitutes. The first two categories are strongly constrained: building a dictionary is expensive, extracting high-quality parallel corpora is very difficult, and their coverage of complex words is limited. The most important problem shared by all three categories is that, when generating candidate words, they consider only the complex word itself and ignore its context, which inevitably produces many unsuitable candidates and strongly interferes with the later steps of the system.
Disclosure of Invention
The invention aims to overcome the defect that existing vocabulary simplification algorithms generate candidate substitute words by using only the complex word itself and ignore the context of the complex word, and provides an English sentence simplification algorithm based on a pre-trained Transformer language model.
The purpose of the invention is achieved as follows: an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model adopting a word vector model fastText for pre-training; by utilizing the word embedding model, the vector representation vw of the word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on the s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }; setting the initial value of i as 1;
Step 4, acquiring a candidate substitutive word set CSi of a content word wi (1 ≤ i ≤ n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics; selecting the top-ranked candidate word ci by averaging the plurality of ranking results;
Step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
As a further limitation of the present invention, step 4 specifically includes:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s by using the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
Step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and its morphological derivatives, and selecting the 10 highest-scoring words from SC as the candidate substitutive word set CSi.
as a further limitation of the present invention, step 5 specifically comprises:
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, respectively calculating the sequence probability after each word in the CSi replaces the original word w by using the mask language model of Bert, acquiring rank2, and adding it to the set all_ranks; selecting the context of the content word w from the sentence s to constitute a new sequence W = w-m, …, w-1, w, w1, …, wm; letting the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, sequencing all words in the CSi by using the word frequency characteristics; acquiring the frequencies { f (c1), f (c2),.., f (c10) } of all words in the CSi by using the word frequencies acquired in the step 1; sorting according to word frequency to obtain rank4, wherein the largest value is sorted first, and so on; obtaining rank4 and adding to the set all _ ranks;
Step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word ci.
as a further limitation of the present invention, step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w-m, …, w-1, cj, w1, …, wm;
step 5.3.2, sequentially hiding each word of W' from front to back, and calculating the cross-entropy loss value of the hidden sequence by using the mask language model of Bert; finally, computing the mean lossj of the cross-entropy loss values over all the words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
as a further limitation of the present invention, step 5.4 specifically comprises:
Step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
Step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by using the cosine similarity measure:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention streamlines the pipeline: it does not perform complex word identification, but simply removes stop words from the sentence to be simplified and segments it into words; every content word (noun, verb, adjective and adverb) in the sentence is treated as a complex word, and candidate substitute words are then generated and selected for each of them. The original word is replaced only when the frequency of the finally selected substitute word is greater than that of the original word. This simplifies the vocabulary simplification steps and improves the efficiency of the model.
2. The method generates candidate words using the pre-trained Transformer language model Bert. Bert is trained with Masked Language Modeling (MLM) on a massive text corpus: the MLM randomly masks a small portion of the words in a sentence, predicts the masked words, and optimizes the model accordingly. For the vocabulary simplification algorithm, the MLM masks the complex word, predicts the probability of every word in the vocabulary at the masked position, and the words with the highest probabilities are selected as candidate substitutes. Compared with existing algorithms, the method generates candidate substitutes for a complex word from the original sentence rather than from the complex word alone, and can thus obtain better candidate substitutes, overcoming the defect that traditional methods generate candidates for the complex word in isolation.
3. Because the candidate substitute words are generated by Bert, they already take the context of the complex word into account; since the candidates are generated with the surrounding language environment in mind, the ordering of the candidate words and the morphological adaptation of the substitute words can be omitted, which greatly simplifies the sentence simplification algorithm.
4. The method selects candidate words using the Bert output, the Bert mask language model, word frequency and semantic similarity features; it considers not only the relatedness between the candidate word and the complex word and the coherence of the candidate word with the original context, but also the simplicity of the candidate word, and can therefore find the most suitable substitute more accurately.
Detailed Description
The present invention is further illustrated by the following specific examples.
an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
Step 1, utilizing a public English Wikipedia corpus D, downloaded from https://dumps.wikimedia.org/enwiki/, counting the frequency f(w) of each word w, wherein f(w) represents the number of occurrences of the word w in D; in the field of text simplification, word complexity is commonly measured by word frequency: in general, the higher the frequency of a word, the easier it is to understand; thus, word frequency can be used to find the most easily understood word from a set of highly similar words.
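A minimal sketch of this frequency counting is given below; the plain-text file name enwiki.txt and the whitespace tokenization are illustrative assumptions, since the patent does not fix how the dump is preprocessed:

from collections import Counter

def build_word_frequency(corpus_path="enwiki.txt"):
    # f(w): number of occurrences of word w in the corpus D
    freq = Counter()
    with open(corpus_path, encoding="utf-8") as fh:
        for line in fh:
            freq.update(line.lower().split())
    return freq

f = build_word_frequency()   # e.g. f["the"] gives the count of "the" in D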
Step 2, obtaining a public word embedding model pre-trained with the word vector model fastText, which can be downloaded from https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip; fastText is an open-source algorithm for training word embedding models, described in the article "Enriching Word Vectors with Subword Information" by Bojanowski et al., published in 2017; using this word embedding model, the vector representation vw of a word w can be obtained, where each vector is 300-dimensional.
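One possible way to load the downloaded vectors is sketched below; the use of gensim and the unpacked file name crawl-300d-2M-subword.vec are assumptions for illustration:

from gensim.models import KeyedVectors

# load the pre-trained fastText vectors released in word2vec text format
word_vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M-subword.vec")
v_w = word_vectors["simplification"]   # 300-dimensional vector vw for a word w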
Step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, then performing word segmentation and part-of-speech tagging on s by using a word segmentation tool to obtain the set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }, wherein the stop-word list and the English tokenizer come from the nltk package for Python; the initial value of i is set to 1.
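This preprocessing can be sketched with nltk as follows (assuming the nltk stopword, tokenizer and tagger data have been downloaded beforehand with nltk.download; the tag-prefix filter is one simple way of keeping nouns, verbs, adjectives and adverbs):

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

def content_words(sentence):
    stops = set(stopwords.words("english"))
    # remove stop words, then tokenize and part-of-speech tag the sentence
    tokens = [t for t in word_tokenize(sentence) if t.lower() not in stops]
    tagged = pos_tag(tokens)
    # keep nouns (NN*), verbs (VB*), adjectives (JJ*) and adverbs (RB*)
    return [w for w, tag in tagged if tag[:2] in ("NN", "VB", "JJ", "RB")]

w1_to_wn = content_words("John composed these verses in an exceedingly obscure style")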
Step 4, acquiring a candidate substitutive word set CSi of a content word wi (1 ≤ i ≤ n) in a sentence s by using the public pre-training Transformer language model Bert; Bert is a pre-trained Transformer language model described in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al., published in 2018; Bert is trained through a mask language model (MLM) on a public massive text corpus; the MLM randomly hides a small portion of the words in a text, predicts them, and performs optimization training; for the vocabulary simplification step in text simplification, the MLM hides the complex word, predicts the probability of every word in the vocabulary at the hidden position, and then the words with the highest probability are selected as candidate substitutes;
step 4.1, obtaining a public pre-trained Transformer language model Bert, wherein a Bert implementation in pytorch is selected, and the pre-trained model "BERT-Large, Uncased (Whole Word Masking)" can be downloaded from https://github.com/google-research/bert;
step 4.2, replacing the content word wi in the sentence s with the symbol "[MASK]", and defining the replaced sentence as s'; "[MASK]" is the hiding symbol, and the MLM optimizes the Bert model by predicting this symbol and comparing the predicted value with the original word;
Step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S; "[CLS]" and "[SEP]" are two special symbols in Bert: "[CLS]" is generally added at the beginning and used as a classification token, and "[SEP]" is used as a sentence separator; here s' is not used on its own but paired with s, which has two benefits: first, the original complex word in s can influence the prediction at the "[MASK]" position, so the predicted candidates stay close to it in meaning; second, Bert is good at handling sentence pairs, since it is also optimized with a Next Sentence Prediction objective;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, converting the T into a corresponding ID characteristic by using the BertTokenizer, wherein the ID characteristic is a number corresponding to each word in the Bert;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics; mask features are used for identifying the position of useful information;
step 4.7, defining an array of length len, wherein the content before the position of the first symbol "[SEP]" is assigned 0 and the remaining content is assigned 1, which is called the Type feature; the Type feature is used to distinguish the two sentences;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and its morphological derivatives, and selecting the 10 highest-scoring words from SC as the candidate substitutive word set CSi.
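Steps 4.1 to 4.9 can be sketched as follows, with the HuggingFace transformers implementation of Bert standing in for the pytorch port named above; the checkpoint name bert-large-uncased-whole-word-masking and the crude substring test for morphological derivatives are illustrative assumptions, and the tokenizer builds the ID, Mask and Type features of steps 4.5 to 4.7 internally:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertForMaskedLM.from_pretrained("bert-large-uncased-whole-word-masking")
model.eval()

def candidate_substitutes(words, i, top_k=10):
    s = " ".join(words)                                                   # original sentence s
    s_prime = " ".join("[MASK]" if k == i else t for k, t in enumerate(words))   # step 4.2
    # steps 4.3-4.7: "[CLS] s [SEP] s' [SEP]" with ID, Mask and Type features
    enc = tokenizer(s, s_prime, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**enc).logits                                      # step 4.8: scores SC
    top_ids = torch.topk(logits[0, mask_pos], top_k + 10).indices.tolist()
    original = words[i].lower()
    candidates = [tokenizer.convert_ids_to_tokens(t) for t in top_ids]
    # step 4.9: drop the original word and (crudely) its morphological derivatives
    candidates = [c for c in candidates if c not in original and original not in c]
    return candidates[:top_k]

CSi = candidate_substitutes(["John", "composed", "these", "verses"], 1)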
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics, namely the Bert output, the Bert mask language model, word frequency and semantic similarity; selecting the top-ranked candidate word ci by averaging the plurality of ranking results;
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; the Bert output characteristics contain the relation between the candidate words, the original complex words and the context; adding rank1 to the set all _ ranks;
Step 5.3, respectively calculating the sequence probability after each word in the CSi replaces the original word w by using the mask language model of Bert, acquiring rank2, and adding it to the set all_ranks; this feature mainly examines the contextual coherence between the candidate word and the complex word; selecting the context of the content word w from the sentence s to form a new sequence W = w-m, …, w-1, w, w1, …, wm, where m is set to 5, i.e. at most 5 words are taken before and after the complex word; letting the initial value of j be 1;
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w-m, …, w-1, cj, w1, …, wm;
Step 5.3.2, sequentially hiding each word of W' from front to back, and calculating the cross-entropy loss value of the hidden sequence by using the mask language model of Bert; finally, computing the mean lossj of the cross-entropy loss values over all the words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
Step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
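The loss computation of steps 5.3.1 to 5.3.4 can be sketched as follows; it reuses the tokenizer and model objects from the step-4 sketch, and W is assumed to be the context window w-m, …, w, …, wm as a list of strings with pos the index of the complex word in it:

import torch
import torch.nn.functional as F

def mean_sequence_loss(W, pos, candidate):
    W_prime = W[:pos] + [candidate] + W[pos + 1:]                  # step 5.3.1
    ids = tokenizer(" ".join(W_prime), return_tensors="pt")["input_ids"][0]
    losses = []
    for k in range(1, len(ids) - 1):                               # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[k] = tokenizer.mask_token_id                        # step 5.3.2: hide word k
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, k]
        losses.append(F.cross_entropy(logits.unsqueeze(0), ids[k].unsqueeze(0)).item())
    return sum(losses) / len(losses)                               # mean loss for this candidate

# step 5.3.4: the candidate with the smallest mean loss ranks first
# rank2 = sorted(CSi, key=lambda c: mean_sequence_loss(W, pos, c))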
step 5.4, sorting all words in the CSi by adopting the semantic similarity characteristic to obtain rank3, and adding it to the set all_ranks; the higher the similarity, the closer the candidate word is to the original complex word in meaning; letting the initial value of j be 1;
Step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by using the cosine similarity measure:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
Step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
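Formula (1) and the ranking of step 5.4 can be sketched as follows (word_vectors is the fastText model loaded in the step-2 sketch; handling of candidates missing from its vocabulary is omitted):

import numpy as np

def cosine(v_cj, v_w):
    # formula (1): dot product of the two g-dimensional vectors over the product of their norms
    return float(np.dot(v_cj, v_w) / (np.linalg.norm(v_cj) * np.linalg.norm(v_w)))

# step 5.4.4: the candidate with the largest similarity ranks first
# rank3 = sorted(CSi, key=lambda c: cosine(word_vectors[c], word_vectors[w]), reverse=True)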
step 5.5, sorting all words in the CSi by using the word frequency characteristic; acquiring the frequencies { f(c1), f(c2), …, f(c10) } of all words in the CSi by using the word frequencies obtained in step 1; sorting according to word frequency to obtain rank4, wherein the largest value is ranked first, and so on; adding rank4 to the set all_ranks; the word frequency feature is used here again: the higher a word's frequency, the more often it is used and the easier it is to understand;
Step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word ci.
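The rank averaging of step 5.6 can be sketched as follows (all_ranks is assumed to hold the four orderings rank1 to rank4, each as a list of the words of CSi sorted best-first):

def best_by_average_rank(CSi, all_ranks):
    # average position (1-based) of each candidate across the four feature rankings
    avg_rank = {c: sum(r.index(c) + 1 for r in all_ranks) / len(all_ranks) for c in CSi}
    return min(avg_rank, key=avg_rank.get)   # the candidate ci with the best average rank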
step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept; the frequency of the words is also utilized here.
Step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
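Finally, one way of putting steps 4 to 7 together is sketched below; it reuses candidate_substitutes, mean_sequence_loss, cosine, best_by_average_rank, word_vectors and the frequency counter f from the earlier sketches, and glosses over out-of-vocabulary handling and morphological adjustment:

def simplify(words, content_positions, m=5):
    out = list(words)
    for i in content_positions:                          # step 7: one content word at a time
        CSi = candidate_substitutes(words, i)            # step 4
        W = words[max(0, i - m): i + m + 1]              # context window of step 5.3
        pos = i - max(0, i - m)
        all_ranks = [                                    # step 5: the four feature rankings
            CSi,                                                                    # rank1: Bert output order
            sorted(CSi, key=lambda c: mean_sequence_loss(W, pos, c)),               # rank2: sequence loss
            sorted(CSi, key=lambda c: cosine(word_vectors[c], word_vectors[words[i]]), reverse=True),  # rank3
            sorted(CSi, key=lambda c: f[c], reverse=True),                          # rank4: word frequency
        ]
        ci = best_by_average_rank(CSi, all_ranks)
        if f[ci] > f[words[i].lower()]:                  # step 6: keep the original unless simpler
            out[i] = ci
    return " ".join(out)                                 # the simplified sentence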
By utilizing the Bert model pre-trained on massive data and combining a plurality of effective characteristics, this embodiment can effectively obtain synonymous simpler words for complex words, thereby achieving the purpose of sentence simplification.
The present invention is not limited to the above embodiment; based on the technical solutions disclosed in the present invention, those skilled in the art can make substitutions and modifications to some technical features without creative effort according to the disclosed technical content, and these substitutions and modifications all fall within the protection scope of the present invention.

Claims (5)

1. An English sentence simplification algorithm based on a pre-training Transformer language model is characterized by comprising the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model adopting a word vector model fastText for pre-training; by utilizing the word embedding model, the vector representation vw of the word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on the s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }; setting the initial value of i as 1;
step 4, acquiring a candidate substitutive word set CSi of a content word wi (i is more than or equal to 1 and less than or equal to n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics; selecting the candidate word ci ranked most front by averaging the plurality of sorting results;
step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
2. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 4 specifically comprises:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s by using the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
Step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and the corresponding morphological derivative words, and selecting the 10 highest-scoring words from the SC as a candidate substitutive word set CSi.
3. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 5 specifically comprises:
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
Step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, respectively calculating the sequence probability after the word in the CSi replaces the original word w by using a mask language model of Bert, acquiring rank2, and adding the rank to a set all _ ranks; selecting a context of a content word W from a sentence s, constituting a new sequence W = W-m, …, W-1, W, W1, …, wm; let the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, sequencing all words in the CSi by using the word frequency characteristics; acquiring the frequencies { f (c1), f (c2),.., f (c10) } of all words in the CSi by using the word frequencies acquired in the step 1; sorting according to word frequency to obtain rank4, wherein the largest value is sorted first, and so on; obtaining rank4 and adding to the set all _ ranks;
step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word.
4. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.3 specifically comprises:
step 5.3.1, replacing the content word W in W with cj to form a new sequence W' = W-m, …, W-1, cj, W1, …, wm;
step 5.3.2, sequentially hiding each word of the W' from front to back, and calculating the cross entropy loss value of the sequence after hiding by using a mask language model of Bert; finally, solving the mean lossi of the cross entropy loss values of all the words W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
5. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
Step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by adopting a cosine similarity calculation method:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
Step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
CN201910863529.9A 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model Active CN110543639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Publications (2)

Publication Number Publication Date
CN110543639A true CN110543639A (en) 2019-12-06
CN110543639B CN110543639B (en) 2023-06-02

Family

ID=68713486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863529.9A Active CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Country Status (1)

Country Link
CN (1) CN110543639B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125350A (en) * 2019-12-17 2020-05-08 语联网(武汉)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN111651986A (en) * 2020-04-28 2020-09-11 银江股份有限公司 Event keyword extraction method, device, equipment and medium
CN111768001A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN112528669A (en) * 2020-12-01 2021-03-19 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and readable storage medium
CN112949284A (en) * 2019-12-11 2021-06-11 上海大学 Text semantic similarity prediction method based on Transformer model
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN113177402A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Word replacement method and device, electronic equipment and storage medium
WO2021218028A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Artificial intelligence-based interview content refining method, apparatus and device, and medium
WO2022098719A1 (en) * 2020-11-03 2022-05-12 Salesforce.Com, Inc. System and methods for training task-oriented dialogue (tod) language models
WO2022174804A1 (en) * 2021-02-20 2022-08-25 北京有竹居网络技术有限公司 Text simplification method and apparatus, and device and storage medium
CN115329784A (en) * 2022-10-12 2022-11-11 之江实验室 Sentence rephrasing generation system based on pre-training model
CN116227484A (en) * 2023-05-09 2023-06-06 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949284A (en) * 2019-12-11 2021-06-11 上海大学 Text semantic similarity prediction method based on Transformer model
CN111125350B (en) * 2019-12-17 2023-05-12 传神联合(北京)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111125350A (en) * 2019-12-17 2020-05-08 语联网(武汉)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111144131B (en) * 2019-12-25 2021-04-30 北京中科研究院 Network rumor detection method based on pre-training language model
CN111611790B (en) * 2020-04-13 2022-09-16 华为技术有限公司 Data processing method and device
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN111651986B (en) * 2020-04-28 2024-04-02 银江技术股份有限公司 Event keyword extraction method, device, equipment and medium
CN111651986A (en) * 2020-04-28 2020-09-11 银江股份有限公司 Event keyword extraction method, device, equipment and medium
WO2021218028A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166B (en) * 2020-05-28 2024-02-13 浙江学海教育科技有限公司 Pre-training model method for classifying mathematical problems
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN111768001A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN111768001B (en) * 2020-06-30 2024-01-23 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
WO2022098719A1 (en) * 2020-11-03 2022-05-12 Salesforce.Com, Inc. System and methods for training task-oriented dialogue (tod) language models
US11749264B2 (en) 2020-11-03 2023-09-05 Salesforce, Inc. System and methods for training task-oriented dialogue (TOD) language models
CN112528669B (en) * 2020-12-01 2023-08-11 北京百度网讯科技有限公司 Training method and device for multilingual model, electronic equipment and readable storage medium
CN112528669A (en) * 2020-12-01 2021-03-19 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and readable storage medium
WO2022174804A1 (en) * 2021-02-20 2022-08-25 北京有竹居网络技术有限公司 Text simplification method and apparatus, and device and storage medium
WO2022227166A1 (en) * 2021-04-26 2022-11-03 平安科技(深圳)有限公司 Word replacement method and apparatus, electronic device, and storage medium
CN113177402A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Word replacement method and device, electronic equipment and storage medium
CN113177402B (en) * 2021-04-26 2024-03-01 平安科技(深圳)有限公司 Word replacement method, device, electronic equipment and storage medium
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN115329784A (en) * 2022-10-12 2022-11-11 之江实验室 Sentence rephrasing generation system based on pre-training model
CN116227484A (en) * 2023-05-09 2023-06-06 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product

Also Published As

Publication number Publication date
CN110543639B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN110543639A (en) english sentence simplification algorithm based on pre-training Transformer language model
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US10614106B2 (en) Automated tool for question generation
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN110674252A (en) High-precision semantic search system for judicial domain
JP2006244262A (en) Retrieval system, method and program for answer to question
CN110209818B (en) Semantic sensitive word and sentence oriented analysis method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN110909116B (en) Entity set expansion method and system for social media
CN112270188A (en) Questioning type analysis path recommendation method, system and storage medium
Huda et al. A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer
CN112214989A (en) Chinese sentence simplification method based on BERT
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN111428031A (en) Graph model filtering method fusing shallow semantic information
Guo et al. Selective text augmentation with word roles for low-resource text classification
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
Smadja et al. Translating collocations for use in bilingual lexicons
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN116070620A (en) Information processing method and system based on big data
Prelevikj et al. Multilingual named entity recognition and matching using BERT and dedupe for Slavic languages
CN114969324A (en) Chinese news title classification method based on subject word feature expansion
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
KR20050033852A (en) Apparatus, method, and program for text classification using frozen pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant