CN110543639A - English sentence simplification algorithm based on pre-training Transformer language model - Google Patents


Info

Publication number
CN110543639A
CN110543639A
Authority
CN
China
Prior art keywords
word
words
sentence
csi
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910863529.9A
Other languages
Chinese (zh)
Other versions
CN110543639B (en)
Inventor
强继朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910863529.9A priority Critical patent/CN110543639B/en
Publication of CN110543639A publication Critical patent/CN110543639A/en
Application granted granted Critical
Publication of CN110543639B publication Critical patent/CN110543639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an English sentence simplification algorithm based on a pre-trained Transformer language model, which is carried out according to the following steps: step 1, utilizing the public Wikipedia corpus to count word frequencies; step 2, obtaining vector representations of words by using a public pre-trained word embedding model; step 3, preprocessing the sentence to be simplified to obtain its content words; step 4, acquiring a candidate substitutive word set for a content word in the sentence by utilizing the public pre-trained Transformer language model Bert; step 5, sorting the candidate substitutive word set of each content word by using a plurality of characteristics; step 6, comparing the word frequency of the top-ranked candidate word with the word frequency of the original content word to determine the final substitutive word; and step 7, processing the other content words in the sentence according to steps 4 to 6 in sequence to obtain the final simplified sentence.

Description

English sentence simplification algorithm based on pre-training Transformer language model
Technical Field
The invention relates to the field of English text simplification, in particular to an English sentence simplification algorithm based on a pre-trained Transformer language model.
Background
In recent years, more and more English material has become available on the Internet; for example, many professional papers are written in English and published in English journals, and many readers prefer to read the English material directly rather than rely on translations into Chinese. However, because English is not their native language, an insufficient vocabulary severely hampers their understanding of English material. Many studies have confirmed that if 90% of the words in a text can be understood, the meaning of the text is much easier to grasp, even for a long and complicated text. In addition, English text simplification also helps native English speakers, especially people with low literacy, cognitive or language impairments, or limited knowledge of the text's language.
The vocabulary simplification step in sentence simplification aims to replace complex words in a sentence with simple synonyms, which greatly lowers the vocabulary demands placed on the reader. Existing vocabulary simplification algorithms can be roughly divided into the following stages: complex word identification, generation of candidate substitute words for the complex words, ranking of the candidate substitutes, and selection of the candidate substitutes. According to how the candidate substitutes are generated, vocabulary simplification algorithms fall into three broad categories. The first category is dictionary-based: a dictionary (such as WordNet) is used to produce synonyms of the complex word as candidate substitutes. The second category is based on parallel corpora; the most common parallel corpus is built from normal English Wikipedia and its simplified, child-oriented counterpart, where a matching algorithm selects sentences from the two Wikipedias as parallel sentence pairs, simplification rules are extracted from these pairs, and the rules are used to generate candidate substitutes for complex words. The third category is based on word embedding models: vector representations of words are obtained from a word embedding model, and a word similarity measure is used to find the set of words most similar to the complex word as candidate substitutes. The first two categories are strongly constrained: building a dictionary is expensive, extracting high-quality parallel corpora is very difficult, and their coverage of complex words is limited. The most important problem shared by all three categories is that, when generating candidate words, they consider only the complex word itself and ignore its context, which inevitably produces many unsuitable candidates and strongly interferes with the later steps of the system.
Disclosure of Invention
The invention aims to overcome the defect that existing vocabulary simplification algorithms generate candidate substitute words by using only the complex word itself and ignore the context of the complex word, and provides an English sentence simplification algorithm based on a pre-trained Transformer language model.
The purpose of the invention is achieved as follows: an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model adopting a word vector model fastText for pre-training; by utilizing the word embedding model, the vector representation vw of the word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on the s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }; setting the initial value of i as 1;
Step 4, acquiring a candidate substitutive word set CSi of a content word wi (1 ≤ i ≤ n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics; selecting the top-ranked candidate word ci by averaging the plurality of ranking results;
Step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
As a further limitation of the present invention, step 4 specifically includes:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s by using the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
Step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and its morphological derivatives, and selecting the 10 highest-scoring words from SC as the candidate substitutive word set CSi.
as a further limitation of the present invention, step 5 specifically comprises:
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, respectively calculating the sequence probability after each word in the CSi replaces the original word w by using the mask language model of Bert, acquiring rank2, and adding it to the set all_ranks; selecting the context of the content word w from the sentence s to constitute a new sequence W = w-m, …, w-1, w, w1, …, wm; letting the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, sequencing all words in the CSi by using the word frequency characteristics; acquiring the frequencies { f (c1), f (c2),.., f (c10) } of all words in the CSi by using the word frequencies acquired in the step 1; sorting according to word frequency to obtain rank4, wherein the largest value is sorted first, and so on; obtaining rank4 and adding to the set all _ ranks;
Step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word ci.
as a further limitation of the present invention, step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w-m, …, w-1, cj, w1, …, wm;
step 5.3.2, sequentially hiding each word of W' from front to back, and calculating the cross-entropy loss value of the hidden sequence by using the mask language model of Bert; finally, computing the mean lossj of the cross-entropy loss values over all the words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
as a further limitation of the present invention, step 5.4 specifically comprises:
Step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
Step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by using the cosine similarity measure:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention streamlines the pipeline: it does not perform complex word identification, but simply removes stop words from the sentence to be simplified and segments it into words; every content word (noun, verb, adjective and adverb) in the sentence is treated as a complex word, and candidate substitute words are then generated and selected for each of them. The original word is replaced only when the frequency of the finally selected substitute word is greater than that of the original word. This simplifies the vocabulary simplification steps and improves the efficiency of the model.
2. The method generates candidate words using the pre-trained Transformer language model Bert. Bert is trained with Masked Language Modeling (MLM) on a massive text corpus: the MLM randomly masks a small portion of the words in a sentence, predicts the masked words, and optimizes the model accordingly. For the vocabulary simplification algorithm, the MLM masks the complex word, predicts the probability of every word in the vocabulary at the masked position, and the words with the highest probabilities are selected as candidate substitutes. Compared with existing algorithms, the method generates candidate substitutes for a complex word from the original sentence rather than from the complex word alone, and can thus obtain better candidate substitutes, overcoming the defect that traditional methods generate candidates for the complex word in isolation.
3. Because the candidate substitute words are generated by Bert, they already take the context of the complex word into account; since the candidates are generated with the surrounding language environment in mind, the ordering of the candidate words and the morphological adaptation of the substitute words can be omitted, which greatly simplifies the sentence simplification algorithm.
4. The method selects candidate words using the Bert output, the Bert mask language model, word frequency and semantic similarity features; it considers not only the relatedness between the candidate word and the complex word and the coherence of the candidate word with the original context, but also the simplicity of the candidate word, and can therefore find the most suitable substitute more accurately.
Detailed Description
The present invention is further illustrated by the following specific examples.
an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
Step 1, utilizing a public English Wikipedia corpus D, downloaded from https://dumps.wikimedia.org/enwiki/, counting the frequency f(w) of each word w, wherein f(w) represents the number of occurrences of the word w in D; in the field of text simplification, word complexity is commonly measured by word frequency: in general, the higher the frequency of a word, the easier it is to understand; thus, word frequency can be used to find the most easily understood word from a set of highly similar words.
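A minimal sketch of this frequency counting is given below; the plain-text file name enwiki.txt and the whitespace tokenization are illustrative assumptions, since the patent does not fix how the dump is preprocessed:

from collections import Counter

def build_word_frequency(corpus_path="enwiki.txt"):
    # f(w): number of occurrences of word w in the corpus D
    freq = Counter()
    with open(corpus_path, encoding="utf-8") as fh:
        for line in fh:
            freq.update(line.lower().split())
    return freq

f = build_word_frequency()   # e.g. f["the"] gives the count of "the" in D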
Step 2, obtaining a public word embedding model pre-trained with the word vector model fastText, which can be downloaded from https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip; fastText is an open-source algorithm for training word embedding models, described in the article "Enriching Word Vectors with Subword Information" by Bojanowski et al., published in 2017; using this word embedding model, the vector representation vw of a word w can be obtained, where each vector is 300-dimensional.
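One possible way to load the downloaded vectors is sketched below; the use of gensim and the unpacked file name crawl-300d-2M-subword.vec are assumptions for illustration:

from gensim.models import KeyedVectors

# load the pre-trained fastText vectors released in word2vec text format
word_vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M-subword.vec")
v_w = word_vectors["simplification"]   # 300-dimensional vector vw for a word w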
Step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, then performing word segmentation and part-of-speech tagging on s by using a word segmentation tool to obtain the set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }, wherein the stop-word list and the English tokenizer come from the nltk package for Python; the initial value of i is set to 1.
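This preprocessing can be sketched with nltk as follows (assuming the nltk stopword, tokenizer and tagger data have been downloaded beforehand with nltk.download; the tag-prefix filter is one simple way of keeping nouns, verbs, adjectives and adverbs):

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

def content_words(sentence):
    stops = set(stopwords.words("english"))
    # remove stop words, then tokenize and part-of-speech tag the sentence
    tokens = [t for t in word_tokenize(sentence) if t.lower() not in stops]
    tagged = pos_tag(tokens)
    # keep nouns (NN*), verbs (VB*), adjectives (JJ*) and adverbs (RB*)
    return [w for w, tag in tagged if tag[:2] in ("NN", "VB", "JJ", "RB")]

w1_to_wn = content_words("John composed these verses in an exceedingly obscure style")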
Step 4, acquiring a candidate substitutive word set CSi of a content word wi (1 ≤ i ≤ n) in a sentence s by using the public pre-training Transformer language model Bert; Bert is a pre-trained Transformer language model described in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al., published in 2018; Bert is trained through a mask language model (MLM) on a public massive text corpus; the MLM randomly hides a small portion of the words in a text, predicts them, and performs optimization training; for the vocabulary simplification step in text simplification, the MLM hides the complex word, predicts the probability of every word in the vocabulary at the hidden position, and then the words with the highest probability are selected as candidate substitutes;
step 4.1, obtaining a public pre-trained Transformer language model Bert, wherein a Bert implementation in pytorch is selected, and the pre-trained model "BERT-Large, Uncased (Whole Word Masking)" can be downloaded from https://github.com/google-research/bert;
step 4.2, replacing the content word wi in the sentence s with the symbol "[MASK]", and defining the replaced sentence as s'; "[MASK]" is the hiding symbol, and the MLM optimizes the Bert model by predicting this symbol and comparing the predicted value with the original word;
Step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S; "[CLS]" and "[SEP]" are two special symbols in Bert: "[CLS]" is generally added at the beginning and used as a classification token, and "[SEP]" is used as a sentence separator; here s' is not used on its own but paired with s, which has two benefits: first, the original complex word in s can influence the prediction at the "[MASK]" position, so the predicted candidates stay close to it in meaning; second, Bert is good at handling sentence pairs, since it is also optimized with a Next Sentence Prediction objective;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, converting the T into a corresponding ID characteristic by using the BertTokenizer, wherein the ID characteristic is a number corresponding to each word in the Bert;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics; mask features are used for identifying the position of useful information;
step 4.7, defining an array of length len, wherein the content before the position of the first symbol "[SEP]" is assigned 0 and the remaining content is assigned 1, which is called the Type feature; the Type feature is used to distinguish the two sentences;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and its morphological derivatives, and selecting the 10 highest-scoring words from SC as the candidate substitutive word set CSi.
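Steps 4.1 to 4.9 can be sketched as follows, with the HuggingFace transformers implementation of Bert standing in for the pytorch port named above; the checkpoint name bert-large-uncased-whole-word-masking and the crude substring test for morphological derivatives are illustrative assumptions, and the tokenizer builds the ID, Mask and Type features of steps 4.5 to 4.7 internally:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertForMaskedLM.from_pretrained("bert-large-uncased-whole-word-masking")
model.eval()

def candidate_substitutes(words, i, top_k=10):
    s = " ".join(words)                                                   # original sentence s
    s_prime = " ".join("[MASK]" if k == i else t for k, t in enumerate(words))   # step 4.2
    # steps 4.3-4.7: "[CLS] s [SEP] s' [SEP]" with ID, Mask and Type features
    enc = tokenizer(s, s_prime, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**enc).logits                                      # step 4.8: scores SC
    top_ids = torch.topk(logits[0, mask_pos], top_k + 10).indices.tolist()
    original = words[i].lower()
    candidates = [tokenizer.convert_ids_to_tokens(t) for t in top_ids]
    # step 4.9: drop the original word and (crudely) its morphological derivatives
    candidates = [c for c in candidates if c not in original and original not in c]
    return candidates[:top_k]

CSi = candidate_substitutes(["John", "composed", "these", "verses"], 1)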
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics, namely the Bert output, the Bert mask language model, word frequency and semantic similarity; selecting the top-ranked candidate word ci by averaging the plurality of ranking results;
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; the Bert output characteristics contain the relation between the candidate words, the original complex words and the context; adding rank1 to the set all _ ranks;
Step 5.3, respectively calculating the sequence probability after each word in the CSi replaces the original word w by using the mask language model of Bert, acquiring rank2, and adding it to the set all_ranks; this feature mainly examines the contextual coherence between the candidate word and the complex word; selecting the context of the content word w from the sentence s to form a new sequence W = w-m, …, w-1, w, w1, …, wm, where m is set to 5, i.e. at most 5 words are taken before and after the complex word; letting the initial value of j be 1;
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w-m, …, w-1, cj, w1, …, wm;
Step 5.3.2, sequentially hiding each word of W' from front to back, and calculating the cross-entropy loss value of the hidden sequence by using the mask language model of Bert; finally, computing the mean lossj of the cross-entropy loss values over all the words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
Step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
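The loss computation of steps 5.3.1 to 5.3.4 can be sketched as follows; it reuses the tokenizer and model objects from the step-4 sketch, and W is assumed to be the context window w-m, …, w, …, wm as a list of strings with pos the index of the complex word in it:

import torch
import torch.nn.functional as F

def mean_sequence_loss(W, pos, candidate):
    W_prime = W[:pos] + [candidate] + W[pos + 1:]                  # step 5.3.1
    ids = tokenizer(" ".join(W_prime), return_tensors="pt")["input_ids"][0]
    losses = []
    for k in range(1, len(ids) - 1):                               # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[k] = tokenizer.mask_token_id                        # step 5.3.2: hide word k
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, k]
        losses.append(F.cross_entropy(logits.unsqueeze(0), ids[k].unsqueeze(0)).item())
    return sum(losses) / len(losses)                               # mean loss for this candidate

# step 5.3.4: the candidate with the smallest mean loss ranks first
# rank2 = sorted(CSi, key=lambda c: mean_sequence_loss(W, pos, c))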
step 5.4, sorting all words in the CSi by adopting the semantic similarity characteristic to obtain rank3, and adding it to the set all_ranks; the higher the similarity, the closer the candidate word is to the original complex word in meaning; letting the initial value of j be 1;
Step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by using the cosine similarity measure:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
Step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
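Formula (1) and the ranking of step 5.4 can be sketched as follows (word_vectors is the fastText model loaded in the step-2 sketch; handling of candidates missing from its vocabulary is omitted):

import numpy as np

def cosine(v_cj, v_w):
    # formula (1): dot product of the two g-dimensional vectors over the product of their norms
    return float(np.dot(v_cj, v_w) / (np.linalg.norm(v_cj) * np.linalg.norm(v_w)))

# step 5.4.4: the candidate with the largest similarity ranks first
# rank3 = sorted(CSi, key=lambda c: cosine(word_vectors[c], word_vectors[w]), reverse=True)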
step 5.5, sorting all words in the CSi by using the word frequency characteristic; acquiring the frequencies { f(c1), f(c2), …, f(c10) } of all words in the CSi by using the word frequencies obtained in step 1; sorting according to word frequency to obtain rank4, wherein the largest value is ranked first, and so on; adding rank4 to the set all_ranks; the word frequency feature is used here again: the higher a word's frequency, the more often it is used and the easier it is to understand;
Step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word ci.
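The rank averaging of step 5.6 can be sketched as follows (all_ranks is assumed to hold the four orderings rank1 to rank4, each as a list of the words of CSi sorted best-first):

def best_by_average_rank(CSi, all_ranks):
    # average position (1-based) of each candidate across the four feature rankings
    avg_rank = {c: sum(r.index(c) + 1 for r in all_ranks) / len(all_ranks) for c in CSi}
    return min(avg_rank, key=avg_rank.get)   # the candidate ci with the best average rank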
step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept; the frequency of the words is also utilized here.
Step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
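Finally, one way of putting steps 4 to 7 together is sketched below; it reuses candidate_substitutes, mean_sequence_loss, cosine, best_by_average_rank, word_vectors and the frequency counter f from the earlier sketches, and glosses over out-of-vocabulary handling and morphological adjustment:

def simplify(words, content_positions, m=5):
    out = list(words)
    for i in content_positions:                          # step 7: one content word at a time
        CSi = candidate_substitutes(words, i)            # step 4
        W = words[max(0, i - m): i + m + 1]              # context window of step 5.3
        pos = i - max(0, i - m)
        all_ranks = [                                    # step 5: the four feature rankings
            CSi,                                                                    # rank1: Bert output order
            sorted(CSi, key=lambda c: mean_sequence_loss(W, pos, c)),               # rank2: sequence loss
            sorted(CSi, key=lambda c: cosine(word_vectors[c], word_vectors[words[i]]), reverse=True),  # rank3
            sorted(CSi, key=lambda c: f[c], reverse=True),                          # rank4: word frequency
        ]
        ci = best_by_average_rank(CSi, all_ranks)
        if f[ci] > f[words[i].lower()]:                  # step 6: keep the original unless simpler
            out[i] = ci
    return " ".join(out)                                 # the simplified sentence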
By utilizing the Bert model pre-trained on massive data and combining a plurality of effective characteristics, this embodiment can effectively obtain synonymous simpler words for complex words, thereby achieving the purpose of sentence simplification.
The present invention is not limited to the above embodiment; based on the technical solutions disclosed in the present invention, those skilled in the art can make substitutions and modifications to some technical features without creative effort according to the disclosed technical content, and these substitutions and modifications all fall within the protection scope of the present invention.

Claims (5)

1. An English sentence simplification algorithm based on a pre-training Transformer language model is characterized by comprising the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model adopting a word vector model fastText for pre-training; by utilizing the word embedding model, the vector representation vw of the word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on the s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }; setting the initial value of i as 1;
step 4, acquiring a candidate substitutive word set CSi of a content word wi (i is more than or equal to 1 and less than or equal to n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics; selecting the candidate word ci ranked most front by averaging the plurality of sorting results;
step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
2. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 4 specifically comprises:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s by using the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, connecting the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
Step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, transmitting the three characteristics (the ID characteristic, the Mask characteristic and the Type characteristic) to a Mask Language Model (Masked Language Model) of the Bert, and acquiring scores SC of all words in a vocabulary table corresponding to the symbol "[ MASK ]";
step 4.9, excluding the original content word wi and the corresponding morphological derivative words, and selecting the 10 highest-scoring words from the SC as a candidate substitutive word set CSi.
3. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 5 specifically comprises:
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
Step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, respectively calculating the sequence probability after the word in the CSi replaces the original word w by using a mask language model of Bert, acquiring rank2, and adding the rank to a set all _ ranks; selecting a context of a content word W from a sentence s, constituting a new sequence W = W-m, …, W-1, W, W1, …, wm; let the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, sequencing all words in the CSi by using the word frequency characteristics; acquiring the frequencies { f (c1), f (c2),.., f (c10) } of all words in the CSi by using the word frequencies acquired in the step 1; sorting according to word frequency to obtain rank4, wherein the largest value is sorted first, and so on; obtaining rank4 and adding to the set all _ ranks;
step 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10 } by using the rankings of the four features in all_ranks, and selecting the word with the best average ranking as the candidate word.
4. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.3 specifically comprises:
step 5.3.1, replacing the content word W in W with cj to form a new sequence W' = W-m, …, W-1, cj, W1, …, wm;
step 5.3.2, sequentially hiding each word of the W' from front to back, and calculating the cross entropy loss value of the sequence after hiding by using a mask language model of Bert; finally, solving the mean lossi of the cross entropy loss values of all the words W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
step 5.3.4, sorting all loss values { loss1, loss2, …, loss10 } to obtain rank2, wherein the smallest value is ranked first, and so on.
5. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations vcj and vw of cj and w from the word vector model;
Step 5.4.2, calculating the similarity value cosj = cosine(vcj, vw) by adopting a cosine similarity calculation method:
cosine(vcj, vw) = (Σk=1..g vcj,k · vw,k) / (√(Σk=1..g vcj,k²) · √(Σk=1..g vw,k²))    (1)
in formula (1), g is the dimension of the vectors in the word vector model;
Step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
step 5.4.4, sorting all similarity values { cos1, cos2, …, cos10 } to obtain rank3, wherein the largest value is ranked first, and so on.
CN201910863529.9A 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model Active CN110543639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Publications (2)

Publication Number Publication Date
CN110543639A true CN110543639A (en) 2019-12-06
CN110543639B CN110543639B (en) 2023-06-02

Family

ID=68713486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863529.9A Active CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on pre-training Transformer language model

Country Status (1)

Country Link
CN (1) CN110543639B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125350A (en) * 2019-12-17 2020-05-08 语联网(武汉)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN111651986A (en) * 2020-04-28 2020-09-11 银江股份有限公司 Event keyword extraction method, device, equipment and medium
CN111768001A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN112528669A (en) * 2020-12-01 2021-03-19 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and readable storage medium
CN112949284A (en) * 2019-12-11 2021-06-11 上海大学 Text semantic similarity prediction method based on Transformer model
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN113177402A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Word replacement method and device, electronic equipment and storage medium
WO2021218028A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Artificial intelligence-based interview content refining method, apparatus and device, and medium
WO2022098719A1 (en) * 2020-11-03 2022-05-12 Salesforce.Com, Inc. System and methods for training task-oriented dialogue (tod) language models
WO2022174804A1 (en) * 2021-02-20 2022-08-25 北京有竹居网络技术有限公司 Text simplification method and apparatus, and device and storage medium
CN115329784A (en) * 2022-10-12 2022-11-11 之江实验室 Sentence rephrasing generation system based on pre-training model
CN116227484A (en) * 2023-05-09 2023-06-06 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949284A (en) * 2019-12-11 2021-06-11 上海大学 Text semantic similarity prediction method based on Transformer model
CN111125350B (en) * 2019-12-17 2023-05-12 传神联合(北京)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111125350A (en) * 2019-12-17 2020-05-08 语联网(武汉)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111144131B (en) * 2019-12-25 2021-04-30 北京中科研究院 Network rumor detection method based on pre-training language model
CN111611790B (en) * 2020-04-13 2022-09-16 华为技术有限公司 Data processing method and device
CN111611790A (en) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and device
CN111651986B (en) * 2020-04-28 2024-04-02 银江技术股份有限公司 Event keyword extraction method, device, equipment and medium
CN111651986A (en) * 2020-04-28 2020-09-11 银江股份有限公司 Event keyword extraction method, device, equipment and medium
WO2021218028A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166B (en) * 2020-05-28 2024-02-13 浙江学海教育科技有限公司 Pre-training model method for classifying mathematical problems
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN111768001A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN111768001B (en) * 2020-06-30 2024-01-23 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Pre-training model obtaining method, disease entity labeling method, device and storage medium
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
WO2022098719A1 (en) * 2020-11-03 2022-05-12 Salesforce.Com, Inc. System and methods for training task-oriented dialogue (tod) language models
US11749264B2 (en) 2020-11-03 2023-09-05 Salesforce, Inc. System and methods for training task-oriented dialogue (TOD) language models
CN112528669B (en) * 2020-12-01 2023-08-11 北京百度网讯科技有限公司 Training method and device for multilingual model, electronic equipment and readable storage medium
CN112528669A (en) * 2020-12-01 2021-03-19 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and readable storage medium
WO2022174804A1 (en) * 2021-02-20 2022-08-25 北京有竹居网络技术有限公司 Text simplification method and apparatus, and device and storage medium
WO2022227166A1 (en) * 2021-04-26 2022-11-03 平安科技(深圳)有限公司 Word replacement method and apparatus, electronic device, and storage medium
CN113177402A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Word replacement method and device, electronic equipment and storage medium
CN113177402B (en) * 2021-04-26 2024-03-01 平安科技(深圳)有限公司 Word replacement method, device, electronic equipment and storage medium
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN115329784A (en) * 2022-10-12 2022-11-11 之江实验室 Sentence rephrasing generation system based on pre-training model
CN116227484A (en) * 2023-05-09 2023-06-06 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product

Also Published As

Publication number Publication date
CN110543639B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN110543639A (en) english sentence simplification algorithm based on pre-training Transformer language model
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US10614106B2 (en) Automated tool for question generation
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN110674252A (en) High-precision semantic search system for judicial domain
JP2006244262A (en) Retrieval system, method and program for answer to question
CN110209818B (en) Semantic sensitive word and sentence oriented analysis method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN110909116B (en) Entity set expansion method and system for social media
CN112270188A (en) Questioning type analysis path recommendation method, system and storage medium
Huda et al. A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer
CN112214989A (en) Chinese sentence simplification method based on BERT
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN111428031A (en) Graph model filtering method fusing shallow semantic information
Guo et al. Selective text augmentation with word roles for low-resource text classification
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
Smadja et al. Translating collocations for use in bilingual lexicons
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN116070620A (en) Information processing method and system based on big data
Prelevikj et al. Multilingual named entity recognition and matching using BERT and dedupe for Slavic languages
CN114969324A (en) Chinese news title classification method based on subject word feature expansion
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
KR20050033852A (en) Apparatus, method, and program for text classification using frozen pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant