CN110543639B - English sentence simplification algorithm based on a pre-trained Transformer language model - Google Patents

English sentence simplification algorithm based on a pre-trained Transformer language model Download PDF

Info

Publication number
CN110543639B
CN110543639B (application CN201910863529.9A)
Authority
CN
China
Prior art keywords
word
words
sentence
content
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910863529.9A
Other languages
Chinese (zh)
Other versions
CN110543639A (en)
Inventor
Jipeng Qiang (强继朋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910863529.9A priority Critical patent/CN110543639B/en
Publication of CN110543639A publication Critical patent/CN110543639A/en
Application granted granted Critical
Publication of CN110543639B publication Critical patent/CN110543639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an English sentence simplification algorithm based on a pre-trained Transformer language model, comprising the following steps: step 1, counting word frequencies over the public Wikipedia corpus; step 2, obtaining vector representations of words from a publicly available pre-trained word embedding model; step 3, preprocessing the sentence to be simplified to obtain its content words; step 4, obtaining a candidate substitution word set for each content word in the sentence using the public pre-trained Transformer language model BERT; step 5, ranking the candidate substitution words of each content word using several features; step 6, comparing the word frequency of the top-ranked candidate with that of the original content word to determine the final substitution; and step 7, processing the remaining content words in the sentence by repeating steps 4 to 6, thereby obtaining the final simplified sentence.

Description

English sentence simplification algorithm based on a pre-trained Transformer language model
Technical Field
The invention relates to the field of English text simplification, in particular to an English sentence simplification algorithm based on a pre-trained Transformer language model.
Background
In recent years, more and more material on the Internet is written in English; for example, many professional papers are published in English-language journals, and many readers now prefer to read English material directly rather than first translating it into Chinese. Many studies have shown that if 90% of the words in a text can be understood, its meaning can be grasped relatively easily, even for long and complex texts.
The lexical simplification algorithm within sentence simplification aims to replace complex words in a sentence with simpler synonyms, which greatly lowers the vocabulary demands placed on the reader. The steps of existing lexical simplification algorithms can roughly be divided into: complex word identification, candidate substitution generation for the complex words, candidate substitution ranking, and candidate substitution selection. According to how the candidate substitutions are generated, lexical simplification algorithms fall broadly into three classes. The first class is dictionary-based and mainly uses a dictionary (such as WordNet) to produce synonyms of the complex word as candidate substitutions. The second class is based on parallel corpora: the most common parallel corpus is built from normal English Wikipedia and its simplified version for children (Simple English Wikipedia); sentences are selected from the two Wikipedias and paired by a matching algorithm, and rules extracted from these parallel sentence pairs are then used to generate candidate substitutions for complex words. The third class is based on word embedding models: vector representations of words are obtained from the embedding model, and a word similarity measure is used to find the words most similar to the complex word as candidate substitutions. The first two classes suffer from serious limitations: building a dictionary is costly, extracting high-quality parallel corpora is very difficult, and both classes cover only a limited range of complex words. The main problem shared by all three classes is that, when generating candidates, only the complex word itself is considered and its context is ignored, so many unsuitable candidates are inevitably produced, which greatly interferes with the later steps of the system.
Disclosure of Invention
The invention aims to overcome the shortcoming of existing lexical simplification algorithms, which generate candidate substitutions using only the complex word itself and ignore its context, and provides an English sentence simplification algorithm based on a pre-trained Transformer language model.
The purpose of the invention is achieved in the following way: an English sentence simplification algorithm based on a pre-trained Transformer language model is carried out according to the following steps:
step 1, counting the frequency f(w) of each word w over the public English Wikipedia corpus D, where f(w) denotes the number of times word w occurs in D;
step 2, acquiring a publicly available word embedding model pre-trained with the fastText word vector model; using this model, the vector representation v_w of a word w can be obtained;
step 3, letting s be the sentence to be simplified; first removing the stop words in s, then segmenting s with a tokenizer and tagging parts of speech, obtaining the set of content words (nouns, verbs, adjectives and adverbs) {w_1, …, w_i, …, w_n}; the initial value of i is 1;
step 4, using the public pre-trained Transformer language model BERT, obtaining the candidate substitution word set CS_i of the content word w_i (1 ≤ i ≤ n) in sentence s;
step 5, ranking the candidate words in CS_i using several features; selecting the top-ranked candidate word c_i by averaging the individual rankings;
step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, selecting c_i as the substitute; otherwise keeping the original content word w_i;
step 7, letting i = i + 1 and repeating steps 4 to 6; once all content words in sentence s have been processed and the chosen substitutions applied, the simplified version of sentence s is obtained.
As a further definition of the present invention, step 4 specifically includes:
step 4.1, obtaining the publicly released pre-trained Transformer language model BERT;
step 4.2, replacing the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denoting the combined sequence S;
step 4.4, tokenizing S with the BertTokenizer of BERT, the resulting token set being denoted T;
step 4.5, then converting T into the corresponding ID features using the BertTokenizer;
step 4.6, obtaining the length len of the set T and defining an array of length len with all values set to 1, called the Mask feature;
step 4.7, defining another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature;
step 4.8, feeding the three features (ID feature, Mask feature and Type feature) to BERT's masked language model (Masked Language Model) and obtaining the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
step 4.9, excluding the original content word w_i and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
As a further definition of the present invention, step 5 specifically includes:
step 5.1, ranking each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; defining a variable all_ranks with the empty set as its initial value; assuming CS_i = {c_1, c_2, …, c_j, …, c_10};
step 5.2, the BERT output feature being the score SC of all words, ordering the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; adding rank_1 to the set all_ranks;
step 5.3, using BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; selecting the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; the initial value of j is 1;
step 5.4, ranking all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; the initial value of j is 1;
step 5.5, ranking all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtaining the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; ordering them by frequency to obtain rank_4, where the largest value ranks first, and so on; adding rank_4 to the set all_ranks;
step 5.6, using the four feature rankings in all_ranks, computing the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and selecting the top-ranked word as the candidate word.
As a further definition of the invention, step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
step 5.3.2, masking each word of W' in turn from front to back and computing the cross-entropy loss of the masked sequence with BERT's masked language model; finally computing loss_j, the average of the cross-entropy loss values over all words of W';
step 5.3.3, letting j = j + 1 and repeating steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
step 5.3.4, ordering all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on.
As a further definition of the invention, step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
step 5.4.2, computing the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
step 5.4.3, letting j = j + 1 and repeating steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
step 5.4.4, ordering all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
Compared with the prior art, the invention has the beneficial effects that:
1. The method streamlines the pipeline: it does not perform complex word identification and only applies simple stop-word removal and tokenization to the sentence to be simplified; every content word (noun, verb, adjective or adverb) in the sentence is treated as a complex word, and candidate substitutions are then generated and selected for each of them. The original word is replaced only if the frequency of the finally selected substitution is higher than that of the original word. This simplifies the lexical simplification steps and improves the efficiency of the model.
2. The invention generates candidate substitution words with the pre-trained Transformer language model BERT. BERT is trained on a massive text corpus with a masked language model (Masked Language Model, MLM). The MLM is optimized by randomly masking a small portion of the words in a sentence and predicting the masked words. For lexical simplification, the MLM masks the complex word, predicts the probability of every vocabulary word at the masked position, and the highest-probability words are selected as candidate substitutions. Unlike existing algorithms, the method does not rely on the complex word alone but generates its candidate substitutions on the basis of the original sentence, so better candidates can be obtained, overcoming the drawback of traditional methods that generate candidates from the complex word only.
3. Because the candidate substitutions generated by BERT already take the context of the complex word into account, the generation step also respects the surrounding language environment; the step of adapting the morphology of the substitution word can be omitted and candidate ranking is made easier, which greatly simplifies the sentence simplification algorithm.
4. The invention selects candidate words using four features: the BERT output, the BERT masked language model, word frequency and semantic similarity. This considers not only the relatedness of the candidate to the complex word and the consistency of the candidate with the original context, but also the simplicity of the candidate, so the most suitable substitution can be found more accurately.
Detailed Description
The invention will be further illustrated with reference to specific examples.
An English sentence simplification algorithm based on a pre-trained Transformer language model is carried out according to the following steps:
Step 1, using the public English Wikipedia corpus D, downloaded from "https://dumps.wikimedia.org/enwiki/", count the frequency f(w) of each word w, where f(w) denotes the number of times word w occurs in D. In the field of text simplification, word frequency is a common measure of word complexity; in general, the higher the frequency of a word, the easier it is to understand. Word frequency can therefore be used to pick the easiest-to-understand word from a set of words highly similar to a word t.
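The frequency table of step 1 can be built with a few lines of Python; the sketch below assumes the Wikipedia dump has already been extracted to a plain-text file (the file name enwiki.txt and the purely alphabetic tokenization are illustrative assumptions, not specified by the embodiment):

import re
from collections import Counter

def build_frequency_table(corpus_path):
    """Count f(w) for every word w in a plain-text corpus D."""
    freq = Counter()
    with open(corpus_path, encoding="utf-8") as fh:
        for line in fh:
            # lower-case the line and keep alphabetic tokens only
            freq.update(re.findall(r"[a-z]+", line.lower()))
    return freq

# usage: f = build_frequency_table("enwiki.txt"); f["simple"] gives f(simple)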
Step 2, obtain a publicly available word embedding model pre-trained with the fastText word vector model, which can be downloaded from "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip"; fastText is an open-source algorithm for training word embedding models, described in the paper "Enriching Word Vectors with Subword Information" by Bojanowski et al., published in 2017. Using this word embedding model, the vector representation v_w of a word w can be obtained, where each vector has 300 dimensions.
Step 3, let s be the sentence to be simplified; first remove the stop words in s, then segment s with a tokenizer and tag parts of speech, obtaining the set of content words (nouns, verbs, adjectives and adverbs) {w_1, …, w_i, …, w_n}; here both the stop-word list and the English tokenization use the nltk package of the Python language; the initial value of i is 1.
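Step 3 can be realized with the nltk package mentioned above; in the sketch below, treating the Penn Treebank tag prefixes NN, VB, JJ and RB as the content-word filter is an assumption about the implementation:

import nltk
from nltk.corpus import stopwords

# one-time resources: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

def extract_content_words(sentence):
    """Return the content words (nouns, verbs, adjectives, adverbs) of a sentence."""
    stop = set(stopwords.words("english"))
    tokens = [t for t in nltk.word_tokenize(sentence) if t.lower() not in stop]
    tagged = nltk.pos_tag(tokens)
    return [w for w, tag in tagged if tag.startswith(("NN", "VB", "JJ", "RB"))]

print(extract_content_words("The committee will scrutinize the proposal carefully."))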
Step 4, using the public pre-trained Transformer language model BERT, obtain the candidate substitution word set CS_i of the content word w_i (1 ≤ i ≤ n) in sentence s. BERT is a pre-trained Transformer language model described in the paper "BERT: Pre-training of deep bidirectional transformers for language understanding" by Devlin et al., published in 2018. BERT is trained on a public massive text corpus with a masked language model (Masked Language Model, MLM): the MLM randomly masks a small portion of the words in the text and is optimized by predicting the masked words. For the lexical simplification algorithm in text simplification, the MLM masks the complex word, predicts the probability that each vocabulary word is the masked word, and the highest-probability words are selected as candidate substitutions.
Step 4.1, obtain the publicly released pre-trained Transformer language model BERT; here a PyTorch implementation of the BERT algorithm is chosen, and the pre-trained model BERT-Large Uncased (Whole Word Masking) can be downloaded from https://github.com/google-research/bert;
Step 4.2, replace the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s'; here "[MASK]" is the masking symbol, and the MLM optimizes the BERT model by predicting this symbol and comparing the prediction with the original word;
Step 4.3, concatenate the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denote the combined sequence S; "[CLS]" and "[SEP]" are two special symbols in BERT: "[CLS]" is normally placed at the very front and serves as the classification token, while "[SEP]" serves as a sentence separator. Using both s and s' rather than s' alone has two benefits: first, the influence of the complex word itself on the prediction at the "[MASK]" position is taken into account; second, BERT is good at handling sentence-pair problems, since it is also optimized with next sentence prediction;
Step 4.4, tokenize S with the BertTokenizer provided with BERT, the resulting token set being denoted T;
Step 4.5, convert T into the corresponding ID features using the BertTokenizer, where the ID features are the numerical indices that BERT assigns to each token;
Step 4.6, obtain the length len of the set T and define an array of length len with all values set to 1, called the Mask feature; the Mask feature marks the positions that carry useful information;
Step 4.7, define another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature; the Type feature distinguishes the two sentences;
Step 4.8, feed the three features (ID feature, Mask feature and Type feature) to BERT's masked language model (Masked Language Model) and obtain the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
Step 4.9, excluding the original content word w_i and its morphological derivatives, select the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
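For illustration, steps 4.2 to 4.9 can be reproduced with the Hugging Face transformers port of BERT rather than the code used by the embodiment; in the sketch below, the model identifier, the use of the tokenizer to build the ID, Mask and Type features in a single call, and the simple prefix test used to exclude morphological derivatives of the complex word are all assumptions:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertForMaskedLM.from_pretrained("bert-large-uncased-whole-word-masking")
model.eval()

def candidate_substitutions(sentence, complex_word, k=10):
    """Mask the complex word, feed "[CLS] s [SEP] s' [SEP]" to the masked
    language model and return the k highest-scoring replacement words."""
    masked = sentence.replace(complex_word, tokenizer.mask_token, 1)      # s'
    # the tokenizer builds the ID (input_ids), Mask (attention_mask) and
    # Type (token_type_ids) features for the sentence pair (s, s')
    enc = tokenizer(sentence, masked, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        scores = model(**enc).logits[0, mask_pos]                         # SC
    candidates = []
    for idx in torch.argsort(scores, descending=True):
        word = tokenizer.convert_ids_to_tokens(idx.item())
        # crude filter: drop word pieces and apparent morphological variants
        if word.startswith("##") or word.startswith(complex_word[:4].lower()):
            continue
        candidates.append(word)
        if len(candidates) == k:
            break
    return candidates

print(candidate_substitutions("John composed these verses.", "composed"))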
Step 5, rank the candidate words in CS_i using several features, namely the BERT output, BERT's masked language model, word frequency and semantic similarity; select the top-ranked candidate word c_i by averaging the individual rankings.
Step 5.1, rank each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; define a variable all_ranks with the empty set as its initial value; assume CS_i = {c_1, c_2, …, c_j, …, c_10};
Step 5.2, the BERT output feature is the score SC of all words; order the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; the BERT output feature itself already reflects the relationship between the candidate word, the original complex word and the context; add rank_1 to the set all_ranks;
Step 5.3, use BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; this feature mainly measures how well a candidate fits the context of the complex word; select the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m, where m is set to 5, i.e. at most 5 words before and after the complex word are taken; the initial value of j is 1;
Step 5.3.1, replace the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
Step 5.3.2, mask each word of W' in turn from front to back and compute the cross-entropy loss of the masked sequence with BERT's masked language model; finally compute loss_j, the average of the cross-entropy loss values over all words of W';
Step 5.3.3, let j = j + 1 and repeat steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
Step 5.3.4, order all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on;
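A sketch of steps 5.3.1 to 5.3.4 is given below, reusing the model and tokenizer from the step-4 sketch; masking one whole word at a time (ignoring sub-word pieces) is a simplification of the embodiment:

import torch
import torch.nn.functional as F

def context_loss(window_tokens, model, tokenizer):
    """Mask each word of W' in turn and average the cross-entropy loss that
    the masked language model assigns to the hidden word."""
    losses = []
    for pos, target in enumerate(window_tokens):
        tokens = list(window_tokens)
        tokens[pos] = tokenizer.mask_token
        enc = tokenizer(" ".join(tokens), return_tensors="pt")
        mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
        target_id = tokenizer.convert_tokens_to_ids(target)
        with torch.no_grad():
            logits = model(**enc).logits[0, mask_pos]
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.tensor([target_id])).item())
    return sum(losses) / len(losses)

# rank_2: compute loss_j for each candidate window W' and sort ascending,
# e.g. loss_j = context_loss(left_context + [c_j] + right_context, model, tokenizer)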
Step 5.4, rank all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; a higher similarity means the candidate word is semantically closer to the original complex word; the initial value of j is 1;
Step 5.4.1, obtain the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
Step 5.4.2, compute the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
Step 5.4.3, let j = j + 1 and repeat steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
Step 5.4.4, order all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
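Formula (1) and the ranking of step 5.4.4 amount to the following NumPy sketch, where vectors stands for the fastText model loaded in step 2 (indexable by word, an assumption about the interface):

import numpy as np

def cosine_similarity(v_c, v_w):
    """Formula (1): cosine similarity of a candidate vector and the original word vector."""
    return float(np.dot(v_c, v_w) / (np.linalg.norm(v_c) * np.linalg.norm(v_w)))

def rank_by_similarity(candidates, original, vectors):
    """rank_3: the candidate with the largest similarity value ranks first."""
    sims = {c: cosine_similarity(vectors[c], vectors[original]) for c in candidates}
    return sorted(candidates, key=lambda c: sims[c], reverse=True)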
Step 5.5, rank all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtain the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; order them by frequency to obtain rank_4, where the largest value ranks first, and so on; add rank_4 to the set all_ranks; word frequency is used here because the higher a word's frequency, the more often it is used and the easier it is to understand;
Step 5.6, using the four feature rankings in all_ranks, compute the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and select the top-ranked word as the candidate word c_i.
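The average-rank aggregation of step 5.6 can be sketched as follows; representing each of the four feature rankings as an ordered list of candidates (best first) is an assumption about the data structure:

def average_rank(all_ranks):
    """Average each candidate's position over the feature rankings in all_ranks
    and return the candidates ordered from best to worst average rank."""
    totals = {}
    for ranking in all_ranks:                       # each ranking: best word first
        for position, word in enumerate(ranking, start=1):
            totals[word] = totals.get(word, 0) + position
    return sorted(totals, key=lambda w: totals[w] / len(all_ranks))

# all_ranks = [rank_1, rank_2, rank_3, rank_4]; c_i = average_rank(all_ranks)[0]
# step 6 then keeps c_i only if f(c_i) > f(w_i), otherwise w_i is retained.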
Step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, select c_i as the substitute; otherwise keep the original content word w_i; word frequency is used here once more.
Step 7, let i = i + 1 and repeat steps 4 to 6; once all content words in sentence s have been processed and the chosen substitutions applied, the simplified version of sentence s is obtained. By combining the BERT model pre-trained on large-scale data with several effective features, this embodiment can effectively obtain simpler synonyms of complex words and thus achieve the purpose of sentence simplification.
The invention is not limited to the above embodiments; based on the technical solution disclosed by the invention, a person skilled in the art may, without creative effort, substitute or modify some of its technical features according to the disclosed technical content, and all such substitutions and modifications fall within the protection scope of the invention.

Claims (5)

1. A method for simplifying an English sentence based on a pre-trained Transformer language model, characterized by comprising the following steps:
step 1, counting the frequency f(w) of each word w over the public English Wikipedia corpus D, where f(w) denotes the number of times word w occurs in D;
step 2, acquiring a publicly available word embedding model pre-trained with the fastText word vector model; obtaining the vector representation v_w of a word w using the word embedding model;
step 3, letting s be the sentence to be simplified; first removing the stop words in s, then segmenting s with a tokenizer and tagging parts of speech to obtain the content words, which comprise nouns, verbs, adjectives and adverbs, forming the set {w_1, …, w_i, …, w_n}; the initial value of i is 1;
step 4, obtaining the candidate substitution word set CS_i of the content word w_i in sentence s, where 1 ≤ i ≤ n, using the public pre-trained Transformer language model BERT;
step 5, ranking the candidate words in CS_i using several features; selecting the top-ranked candidate word c_i by averaging the individual rankings;
step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, selecting candidate word c_i as the substitute; otherwise keeping the original content word w_i;
step 7, letting i = i + 1 and repeating steps 4 to 6; when all content words in sentence s have been processed and the selected substitutions applied, a simplified version of sentence s is obtained.
2. The method of claim 1, wherein step 4 specifically comprises:
step 4.1, obtaining the publicly released pre-trained Transformer language model BERT;
step 4.2, replacing the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denoting the combined sequence S;
step 4.4, tokenizing S with the BertTokenizer of BERT, the resulting token set being denoted T;
step 4.5, then converting T into the corresponding ID features using the BertTokenizer;
step 4.6, obtaining the length len of the set T and defining an array of length len with all values set to 1, called the Mask feature;
step 4.7, defining another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature;
step 4.8, feeding the ID feature, the Mask feature and the Type feature to BERT's masked language model and obtaining the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
step 4.9, excluding the original content word w_i and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
3. The method of claim 1, wherein step 5 specifically comprises:
step 5.1, ranking each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; defining a variable all_ranks with the empty set as its initial value; assuming CS_i = {c_1, c_2, …, c_j, …, c_10};
step 5.2, the BERT output feature being the score SC of all words, ordering the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; adding rank_1 to the set all_ranks;
step 5.3, using BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; selecting the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; the initial value of j is 1;
step 5.4, ranking all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; the initial value of j is 1;
step 5.5, ranking all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtaining the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; ordering them by frequency to obtain rank_4, where the largest value ranks first, and so on; adding rank_4 to the set all_ranks;
step 5.6, using the four feature rankings in all_ranks, computing the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and selecting the top-ranked word as the candidate word.
4. The method according to claim 3, wherein step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
step 5.3.2, masking each word of W' in turn from front to back and computing the cross-entropy loss of the masked sequence with BERT's masked language model; finally computing loss_j, the average of the cross-entropy loss values over all words of W';
step 5.3.3, letting j = j + 1 and repeating steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
step 5.3.4, ordering all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on.
5. The method according to claim 3, wherein step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
step 5.4.2, computing the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
step 5.4.3, letting j = j + 1 and repeating steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
step 5.4.4, ordering all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
CN201910863529.9A 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model Active CN110543639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Publications (2)

Publication Number Publication Date
CN110543639A CN110543639A (en) 2019-12-06
CN110543639B true CN110543639B (en) 2023-06-02

Family

ID=68713486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863529.9A Active CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Country Status (1)

Country Link
CN (1) CN110543639B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949284B (en) * 2019-12-11 2022-11-04 上海大学 Text semantic similarity prediction method based on Transformer model
CN111125350B (en) * 2019-12-17 2023-05-12 传神联合(北京)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131B (en) * 2019-12-25 2021-04-30 北京中科研究院 Network rumor detection method based on pre-training language model
CN111611790B (en) * 2020-04-13 2022-09-16 华为技术有限公司 Data processing method and device
CN111651986B (en) * 2020-04-28 2024-04-02 银江技术股份有限公司 Event keyword extraction method, device, equipment and medium
CN111695338A (en) * 2020-04-29 2020-09-22 平安科技(深圳)有限公司 Interview content refining method, device, equipment and medium based on artificial intelligence
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166B (en) * 2020-05-28 2024-02-13 浙江学海教育科技有限公司 Pre-training model method for classifying mathematical problems
CN111768001B (en) * 2020-06-30 2024-01-23 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
US11749264B2 (en) 2020-11-03 2023-09-05 Salesforce, Inc. System and methods for training task-oriented dialogue (TOD) language models
CN112528669B (en) * 2020-12-01 2023-08-11 北京百度网讯科技有限公司 Training method and device for multilingual model, electronic equipment and readable storage medium
CN112906372A (en) * 2021-02-20 2021-06-04 北京有竹居网络技术有限公司 Text simplification method, device, equipment and storage medium
CN113177402B (en) * 2021-04-26 2024-03-01 平安科技(深圳)有限公司 Word replacement method, device, electronic equipment and storage medium
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN114330276B (en) * 2022-01-04 2024-06-25 四川新网银行股份有限公司 Deep learning-based short message template generation method and system and electronic device
CN115329784B (en) * 2022-10-12 2023-04-07 之江实验室 Sentence repeat generating system based on pre-training model
CN116227484B (en) * 2023-05-09 2023-07-28 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product
CN117556814A (en) * 2023-07-26 2024-02-13 西藏大学 Tibetan word segmentation and part-of-speech tagging integrated method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441B (en) * 2016-06-30 2019-03-15 电子科技大学 A kind of training method of the skip language model based on word frequency
CN108509474B (en) * 2017-09-15 2022-01-07 腾讯科技(深圳)有限公司 Synonym expansion method and device for search information

Also Published As

Publication number Publication date
CN110543639A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
CN110543639B (en) English sentence simplification algorithm based on a pre-trained Transformer language model
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN108287822B (en) Chinese similarity problem generation system and method
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
JP3768205B2 (en) Morphological analyzer, morphological analysis method, and morphological analysis program
CN109960804B (en) Method and device for generating topic text sentence vector
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN110413768B (en) Automatic generation method of article titles
CN110164447B (en) Spoken language scoring method and device
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
Ferreira et al. Zero-shot semantic parser for spoken language understanding.
CN107870901A (en) Similar literary method, program, device and system are generated from translation source original text
JP2006244262A (en) Retrieval system, method and program for answer to question
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN112214989A (en) Chinese sentence simplification method based on BERT
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN114781651A (en) Small sample learning robustness improving method based on contrast learning
CN114579729B (en) FAQ question-answer matching method and system fusing multi-algorithm models
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
WO2022148467A1 (en) Cross-language data enhancement-based word segmentation method and apparatus
Guo et al. Selective text augmentation with word roles for low-resource text classification
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
Smadja et al. Translating collocations for use in bilingual lexicons

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant