CN110543639A - english sentence simplification algorithm based on pre-training Transformer language model - Google Patents
- Publication number
- CN110543639A (application CN201910863529.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- words
- sentence
- csi
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an English sentence simplification algorithm based on a pre-trained Transformer language model, carried out according to the following steps: step 1, count word frequencies using the public Wikipedia corpus; step 2, obtain vector representations of words using a public pre-trained word embedding model; step 3, preprocess the sentence to be simplified to obtain its content words; step 4, obtain a set of candidate substitute words for a content word in the sentence using the public pre-trained Transformer language model BERT; step 5, rank the candidate substitute set of each content word using several features; step 6, compare the frequency of the top-ranked candidate with the frequency of the original content word to determine the final substitute; and step 7, process the remaining content words in the sentence by repeating steps 4 to 6, obtaining the final simplified sentence.
Description
Technical Field
The invention relates to the field of English text simplification, in particular to an English sentence simplification algorithm based on a pre-trained Transformer language model.
Background
In recent years, more and more English material has become available on the Internet; for example, many professional papers are written in English and published in English journals, and many people prefer to read English materials directly rather than reading Chinese translations. However, for non-native speakers of English, an insufficient vocabulary severely impairs the understanding of English material. Many studies have confirmed that if 90% of the words in a text can be understood, its meaning is much easier to grasp, even for long and complicated texts. In addition, English text simplification also helps native English speakers, especially people with low literacy, cognitive or language disorders, or limited knowledge of the text's language.
The lexical simplification algorithm in sentence simplification aims to replace complex words in a sentence with simple synonyms, which greatly reduces the vocabulary demanded of the reader. Existing lexical simplification algorithms can be roughly divided into the following steps: complex word identification, candidate substitute generation for the complex words, candidate substitute ranking, and candidate substitute selection. According to how candidate substitutes are generated, lexical simplification algorithms fall into three categories. The first is dictionary-based: a dictionary (such as WordNet) is used to generate synonyms of the complex word as candidate substitutes. The second is based on parallel corpora; the most common parallel corpus is built from normal English Wikipedia and Simple English Wikipedia, where a matching algorithm selects sentences from the two Wikipedias as parallel sentence pairs, rules are then extracted from these pairs, and the rules are used to generate candidate substitutes for complex words. The third is based on word embedding models: vector representations of words are obtained from the word embedding model, and a word similarity measure is used to find the set of words most similar to the complex word as candidate substitutes. The first two categories are severely constrained: building a dictionary is costly, extracting high-quality parallel corpora is very difficult, and the coverage of complex words by these two categories is limited.
The most important problem shared by all three categories is that candidate generation considers only the complex word itself and ignores its context; many unsuitable candidates are inevitably generated, which greatly interferes with the later steps of the system.
Disclosure of Invention
The invention aims to overcome the defect that existing lexical simplification algorithms generate candidate substitutes using only the complex word itself while ignoring its context, and provides an English sentence simplification algorithm based on a pre-trained Transformer language model.
the purpose of the invention is realized as follows: an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model pre-trained with the fastText word vector model; using this word embedding model, the vector representation vw of a word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on the s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) { w1, …, wi, …, wn }; setting the initial value of i as 1;
Step 4, acquiring a candidate substitutive word set CSi of a content word wi (i is more than or equal to 1 and less than or equal to n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, ranking the candidate words in the CSi using several features; selecting the top-ranked candidate word ci by averaging the ranking results;
Step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
As a further limitation of the present invention, step 4 specifically includes:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s with the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
Step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, feeding the three features (the ID feature, the Mask feature and the Type feature) into BERT's masked language model (MLM), and obtaining the scores SC of all words in the vocabulary at the position of the "[MASK]" symbol;
and 4.9, excluding the original content word wi and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitute set CSi.
as a further limitation of the present invention, step 5 specifically comprises:
step 5.1, ranking the candidate substitutes in CSi using four features, namely the BERT output, the language model feature, semantic similarity and word frequency; defining a variable all_ranks, wherein the initial value is the empty set; letting CSi = {c1, c2, …, cj, …, c10};
step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, using BERT's masked language model to compute the sequence probability after each word in CSi replaces the original word w, obtaining rank2, and adding it to the set all_ranks; selecting the context of the content word w from the sentence s, constituting a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; letting the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, ranking all words in CSi using the word frequency feature; using the word frequencies obtained in step 1, acquiring the frequencies {f(c1), f(c2), …, f(c10)} of all words in CSi; sorting by word frequency to obtain rank4, wherein the largest value is ranked first, and so on; adding rank4 to the set all_ranks;
And 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10} by using the ranking of the four features in all _ ranks, and selecting the word with the highest ranking as the candidate word.
as a further limitation of the present invention, step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w_{-m}, …, w_{-1}, cj, w_1, …, w_m;
step 5.3.2, masking each word of W' in turn from front to back, and computing the cross-entropy loss of each masked sequence using BERT's masked language model; finally, taking the mean loss_j of the cross-entropy loss values over all words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
and 5.3.4, sorting all loss values {loss_1, loss_2, …, loss_10} to obtain rank2, wherein the smallest value is ranked first, and so on.
as a further limitation of the present invention, step 5.4 specifically comprises:
Step 5.4.1, obtaining the vector representations v_cj and v_w of cj and w from the word vector model;
Step 5.4.2, calculating the similarity value cos_j = cosine(v_cj, v_w) using the cosine similarity measure:
cosine(v_cj, v_w) = ( Σ_{k=1..g} v_cj[k] · v_w[k] ) / ( sqrt(Σ_{k=1..g} v_cj[k]²) · sqrt(Σ_{k=1..g} v_w[k]²) )    (1)
In the formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
and 5.4.4, sorting all similarity values {cos_1, cos_2, …, cos_10} to obtain rank3, wherein the largest value is ranked first, and so on.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention simplifies the pipeline: it does not perform complex word identification, but merely removes stop words from the sentence to be simplified and segments it; every content word (noun, verb, adjective and adverb) in the sentence is treated as a complex word, and candidate substitutes are then generated and selected for each of them. The original word is replaced only if the frequency of the finally chosen substitute is greater than that of the original word. This simplifies the lexical simplification steps and improves the efficiency of the model.
2. The method generates candidate words using the pre-trained Transformer language model BERT. BERT is trained with masked language modeling (MLM) on a massive text corpus: the MLM randomly masks a small fraction of the words in a sentence, predicts the masked words, and optimizes on that objective. For the lexical simplification algorithm, the MLM masks the complex word, predicts the probability of every word in the vocabulary at the masked position, and the highest-probability words are selected as candidate substitutes. Compared with existing algorithms, the method generates candidate substitutes from the original sentence rather than from the complex word alone, so it obtains better candidates for the complex word and overcomes the defect of traditional methods that generate candidates from the complex word only.
3. Because the candidate substitutes generated by BERT already take the context of the complex word into account, ranking by contextual fit and adjusting the morphological inflection of the substitute can largely be omitted, which greatly simplifies the sentence simplification algorithm.
4. The method selects candidate words using four features, namely the BERT output, the BERT masked language model, word frequency and semantic similarity; it considers not only the relatedness of the candidate to the complex word and its coherence with the original context, but also the simplicity of the candidate, so the most suitable substitute can be found more accurately.
Detailed Description
The present invention is further illustrated by the following specific examples.
an English sentence simplification algorithm based on a pre-training Transformer language model is carried out according to the following steps:
Step 1, using the public English Wikipedia corpus D, downloadable from https://dumps.wikimedia.org/enwiki/, count the frequency f(w) of each word w, where f(w) is the number of occurrences of the word w in D. In the field of text simplification, word complexity is measured by word frequency; in general, the higher the frequency of a word, the easier it is to understand. Word frequency can therefore be used to find the most easily understood word among a set of highly similar words.
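Step 1 amounts to a simple frequency count over the corpus. A minimal stdlib-only sketch is shown below; the function name and the toy two-sentence corpus are illustrative only, and a real implementation would stream the Wikipedia dump rather than hold it in a list:

```python
import re
from collections import Counter

def count_word_frequencies(lines):
    """Count how often each lowercased word occurs in an iterable of text lines."""
    freq = Counter()
    for line in lines:
        # crude tokenization; a real pipeline would use a proper tokenizer
        freq.update(re.findall(r"[a-z']+", line.lower()))
    return freq

corpus = ["The cat sat on the mat.", "The dog chased the cat."]
f = count_word_frequencies(corpus)  # f("the") = 4, f("cat") = 2, ...
```

The resulting table plays the role of f(w) in steps 1, 5.5 and 6.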
Step 2, obtain a public word embedding model pre-trained with the fastText word vector model, downloadable from https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip. fastText is an open-source algorithm for training word embedding models; see the article "Enriching Word Vectors with Subword Information" by Bojanowski et al., published in 2017. Using this word embedding model, a vector representation vw of each word w can be obtained, where each vector is 300-dimensional.
Step 3, suppose the sentence to be simplified is s; remove the stop words in s, then perform word segmentation and part-of-speech tagging on s with a segmentation tool to obtain the set of content words (nouns, verbs, adjectives and adverbs) {w1, …, wi, …, wn}; the stop word list and the English tokenizer are taken from the Python nltk package; the initial value given to i is 1.
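The content-word filter of step 3 keeps only nouns, verbs, adjectives and adverbs. The sketch below assumes Penn Treebank tags such as those produced by nltk.pos_tag (the tagged input and the small stop word set are stand-ins for the nltk resources the patent actually uses):

```python
# Tiny stand-in for nltk's stop word list; the real list is much longer.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in"}

def content_words(tagged_tokens):
    """Keep nouns (NN*), verbs (VB*), adjectives (JJ*) and adverbs (RB*),
    dropping stop words."""
    keep = ("NN", "VB", "JJ", "RB")
    return [w for w, tag in tagged_tokens
            if tag.startswith(keep) and w.lower() not in STOPWORDS]

# Hypothetical output of a POS tagger for one sentence.
tagged = [("The", "DT"), ("committee", "NN"), ("ratified", "VBD"),
          ("the", "DT"), ("controversial", "JJ"), ("amendment", "NN")]
words = content_words(tagged)
```

Each word in `words` is then treated as a complex word wi for steps 4 to 6.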
Step 4, acquire the candidate substitute set CSi of a content word wi (1 ≤ i ≤ n) in sentence s using the public pre-trained Transformer language model BERT. BERT is a pre-trained Transformer language model; see the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al., published in 2018. BERT is trained on a public massive text corpus with a masked language model (MLM); the MLM randomly masks a small fraction of the words in a text, predicts them, and optimizes on that objective. For the lexical simplification algorithm in text simplification, the MLM masks the complex word, predicts the probability of every word in the vocabulary at the masked position, and selects the highest-probability words as candidate substitutes;
step 4.1, obtain a public pre-trained Transformer language model BERT; here the PyTorch implementation of BERT is selected, and the pre-trained model "BERT-Large, Uncased (Whole Word Masking)" can be downloaded from https://github.com/google-research/bert;
step 4.2, replace the content word wi in sentence s with the "[MASK]" symbol, and define the replaced sentence as s'; "[MASK]" is the masking symbol, and the MLM optimizes the BERT model by predicting this symbol and comparing the prediction with the original word;
Step 4.3, concatenate the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]"; the combined sequence is defined as S. "[CLS]" and "[SEP]" are two special symbols in BERT: "[CLS]" is usually added at the front as a classification marker, and "[SEP]" serves as a sentence separator. Here s' is not used alone; the pair of s and s' is used instead, with two benefits: the first benefit is that the original complex word can still influence the prediction at "[MASK]"; the second benefit is that BERT is good at two-sentence inputs, since BERT is also optimized with next sentence prediction;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
step 4.5, convert T into the corresponding ID feature using the BertTokenizer, where the ID feature is the number corresponding to each token in BERT's vocabulary;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics; mask features are used for identifying the position of useful information;
step 4.7, define an array of length len, wherein positions up to the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1; this is called the Type feature and is used to distinguish the two sentences;
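Steps 4.2 to 4.7 can be illustrated with a small stdlib-only sketch that builds the combined sequence plus the Mask and Type arrays. This is a toy stand-in for the BertTokenizer: real code would produce WordPiece IDs rather than word strings, and the Type convention shown (sentence A including its "[SEP]" gets type 0) follows BERT's usual segment encoding, which is an assumption here:

```python
def build_features(sent_tokens, masked_tokens):
    """Assemble the [CLS] s [SEP] s' [SEP] sequence plus the Mask and Type
    arrays described in steps 4.3, 4.6 and 4.7 (toy word-level version)."""
    seq = ["[CLS]"] + sent_tokens + ["[SEP]"] + masked_tokens + ["[SEP]"]
    mask = [1] * len(seq)  # every position holds useful input
    first_sep = seq.index("[SEP]")
    # positions up to and including the first [SEP] belong to sentence A (type 0)
    type_ids = [0 if i <= first_sep else 1 for i in range(len(seq))]
    return seq, mask, type_ids

s = ["the", "cat", "perched", "on", "the", "mat"]
s_masked = ["the", "cat", "[MASK]", "on", "the", "mat"]
seq, mask, type_ids = build_features(s, s_masked)
```

These three arrays correspond to the ID, Mask and Type features that step 4.8 passes to the masked language model.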
Step 4.8, feed the three features (the ID feature, the Mask feature and the Type feature) into BERT's masked language model, and obtain the scores SC of all words in the vocabulary at the position of the "[MASK]" symbol;
and 4.9, excluding the original content word wi and its morphological derivatives, select the 10 highest-scoring words from SC as the candidate substitute set CSi.
Step 5, rank the candidate words in CSi using several features, namely the BERT output, the BERT masked language model, word frequency and semantic similarity; select the top-ranked candidate ci by averaging the ranking results;
step 5.1, rank the candidate substitutes in CSi using four features, namely the BERT output, the language model feature, semantic similarity and word frequency; define a variable all_ranks, wherein the initial value is the empty set; let CSi = {c1, c2, …, cj, …, c10};
step 5.2, the BERT output feature consists of the scores SC of all words; sort the words in CSi by score, i.e. rank1 = {1, 2, …, 10}; the BERT output feature captures the relation between the candidate word, the original complex word and the context; add rank1 to the set all_ranks;
Step 5.3, use BERT's masked language model to compute the sequence probability after each word in CSi replaces the original word w, obtain rank2, and add it to the set all_ranks; this mainly examines the contextual coherence of the candidate with respect to the complex word. Select the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m, where m is set to 5, i.e. at most 5 words are taken before and after the complex word; let the initial value of j be 1;
step 5.3.1, replace the content word w in W with cj, forming a new sequence W' = w_{-m}, …, w_{-1}, cj, w_1, …, w_m;
Step 5.3.2, mask each word of W' in turn from front to back, and compute the cross-entropy loss of each masked sequence using BERT's masked language model; finally, take the mean loss_j of the cross-entropy loss values over all words of W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
And 5.3.4, sort all loss values {loss_1, loss_2, …, loss_10} to obtain rank2, wherein the smallest value is ranked first, and so on.
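The loss-to-rank conversion of step 5.3.4 (smallest loss ranked first) is a generic helper; a minimal sketch with made-up loss values follows (the function name is illustrative):

```python
def rank_ascending(values):
    """Rank positions by value, smallest first: returns the rank (1 = best)
    of each position, as in step 5.3.4."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Made-up mean cross-entropy losses loss_j for three candidates.
losses = [2.1, 0.7, 1.4]
rank2 = rank_ascending(losses)  # candidate with loss 0.7 gets rank 1
```

The same helper, applied to negated scores, similarities or frequencies, yields rank1, rank3 and rank4 (largest first).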
Step 5.4, rank all words in CSi using the semantic similarity feature to obtain rank3, and add it to the set all_ranks; the higher the similarity, the closer the candidate's meaning is to that of the original complex word; let the initial value of j be 1;
Step 5.4.1, obtain the vector representations v_cj and v_w of cj and w from the word vector model;
Step 5.4.2, compute the similarity value cos_j = cosine(v_cj, v_w) using the cosine similarity measure:
cosine(v_cj, v_w) = ( Σ_{k=1..g} v_cj[k] · v_w[k] ) / ( sqrt(Σ_{k=1..g} v_cj[k]²) · sqrt(Σ_{k=1..g} v_w[k]²) )    (1)
In the formula (1), g is the dimension of the vectors in the word vector model;
step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
And 5.4.4, sort all similarity values {cos_1, cos_2, …, cos_10} to obtain rank3, wherein the largest value is ranked first, and so on.
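Formula (1) is the standard cosine similarity over the g-dimensional fastText vectors. A direct stdlib implementation, with toy 3-dimensional vectors standing in for the 300-dimensional embeddings:

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two vectors of dimension g, as in formula (1)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy stand-ins for v_cj and v_w; real vectors are 300-dimensional.
v_c = [1.0, 2.0, 3.0]
v_w = [2.0, 4.0, 6.0]  # parallel to v_c, so similarity is exactly 1.0
sim = cosine(v_c, v_w)
```

Each candidate's cos_j is computed this way, and rank3 orders the candidates by descending cos_j.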
Step 5.5, rank all words in CSi using the word frequency feature; using the word frequencies obtained in step 1, acquire the frequencies {f(c1), f(c2), …, f(c10)} of all words in CSi; sort by word frequency to obtain rank4, wherein the largest value is ranked first, and so on; add rank4 to the set all_ranks; word frequency is used here again: the higher a word's frequency, the more often it is used and the easier it is to understand;
And 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10} by using the ranking of the four features in all _ ranks, and selecting the word with the highest ranking as the candidate word.
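Step 5.6 averages each candidate's position across the four rankings and picks the best (smallest) average. A minimal sketch with made-up candidates and rank lists (the function name is illustrative):

```python
def best_by_average_rank(candidates, all_ranks):
    """Average each candidate's rank over the feature rankings in all_ranks
    and return the candidate with the smallest (best) average, as in step 5.6."""
    n = len(candidates)
    avg = [sum(ranks[i] for ranks in all_ranks) / len(all_ranks)
           for i in range(n)]
    best = min(range(n), key=lambda i: avg[i])
    return candidates[best], avg

cands = ["large", "big", "vast"]
all_ranks = [
    [2, 1, 3],  # rank1: BERT output score
    [2, 1, 3],  # rank2: masked-LM sequence loss
    [1, 2, 3],  # rank3: cosine similarity
    [2, 1, 3],  # rank4: word frequency
]
choice, avgs = best_by_average_rank(cands, all_ranks)  # "big" wins on average
```

Note that ties are broken by list order here; the patent does not specify a tie-breaking rule.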
Step 6, if the frequency f(ci) of the candidate word ci is greater than the frequency f(wi) of the original content word wi, select the candidate word ci as the substitute; otherwise, keep the original content word wi; the word frequencies are used again here.
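The step 6 decision rule is a one-line frequency comparison. A sketch with a made-up frequency table (the table would come from step 1 in the real pipeline):

```python
def choose_substitute(original, candidate, freq):
    """Replace the original word only when the candidate is strictly more
    frequent (and thus presumed simpler); otherwise keep the original."""
    return candidate if freq.get(candidate, 0) > freq.get(original, 0) else original

freq = {"big": 5000, "capacious": 12}  # made-up corpus frequencies
result = choose_substitute("capacious", "big", freq)   # substituted
kept = choose_substitute("big", "capacious", freq)     # original kept
```

This rule guarantees a substitution never makes the sentence harder in frequency terms.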
Step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
By using the BERT model pre-trained on big data and combining several effective features, this embodiment can effectively obtain synonymous simplified words for complex words, thereby achieving the goal of sentence simplification.
The present invention is not limited to the above embodiments; based on the technical solutions disclosed herein, those skilled in the art can substitute and modify some technical features without creative effort according to the disclosed technical content, and such substitutions and modifications all fall within the protection scope of the present invention.
Claims (5)
1. An English sentence simplification algorithm based on a pre-training Transformer language model is characterized by comprising the following steps:
step 1, utilizing a public English Wikipedia corpus D to count the frequency f (w) of each word w, wherein f (w) represents the occurrence frequency of the word w in D;
step 2, obtaining a public word embedding model adopting a word vector model fastText for pre-training; by utilizing the word embedding model, the vector representation vw of the word w can be obtained;
step 3, supposing that the sentence needing to be simplified is s, removing stop words in the sentence s, and then performing word segmentation and part-of-speech tagging on s by using a word segmentation tool to obtain a set of content words (nouns, verbs, adjectives and adverbs) {w1, …, wi, …, wn}; setting the initial value of i as 1;
step 4, acquiring a candidate substitutive word set CSi of a content word wi (i is more than or equal to 1 and less than or equal to n) in a sentence s by using a public pre-training Transformer language model Bert;
Step 5, sorting the candidate words in the CSi by adopting a plurality of characteristics; selecting the candidate word ci ranked most front by averaging the plurality of sorting results;
step 6, if the frequency f (ci) of the candidate word ci is greater than the frequency f (wi) of the original content word wi, selecting the candidate word ci as a substitute word; otherwise, the original content word wi is still kept;
step 7, letting i = i +1, and executing steps 4 to 6 in sequence; and when all the content words in the sentence s are processed, replacing the original content words to obtain a simplified sentence of the sentence s.
2. the English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 4 specifically comprises:
step 4.1, obtaining a public pre-training Transformer language model Bert;
step 4.2, replacing the content word wi in the sentence s by using the "[MASK]" symbol, wherein the replaced sentence is defined as s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]", wherein the combined sequence is defined as S;
step 4.4, utilizing a word splitter BertTokenizer in the Bert to perform word splitting on the S, wherein a set after word splitting is called T;
Step 4.5, converting T into corresponding ID characteristics by using a BertTokenizer;
Step 4.6, acquiring the length len of the set T, defining an array with the length len, wherein all values are 1 and are called Mask characteristics;
step 4.7, defining an array with length len, wherein the content before the corresponding position of the first symbol "[ SEP ]" is assigned as 0, and the rest content is assigned as 1, which is called Type feature;
Step 4.8, feeding the three features (the ID feature, the Mask feature and the Type feature) into BERT's masked language model, and acquiring the scores SC of all words in the vocabulary corresponding to the symbol "[MASK]";
and 4.9, excluding the original content word wi and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitute set CSi.
3. The English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 1, wherein the step 5 specifically comprises:
step 5.1, sequencing each candidate substitutive word CSi by adopting four characteristics, namely Bert output, language model characteristics, semantic similarity and word frequency characteristics; defining a variable all _ ranks, wherein the initial value is an empty set; let CSi = { c1, c2, …, cj, …, c10 };
Step 5.2, the output characteristics of the Bert include scores SC of all words, and the words in the CSi are sorted according to the scores, namely rank1= {1,2, …,10 }; adding rank1 to the set all _ ranks;
step 5.3, respectively calculating the sequence probability after each word in CSi replaces the original word w by using BERT's masked language model, acquiring rank2, and adding it to the set all_ranks; selecting the context of the content word w from the sentence s, constituting a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; letting the initial value of j be 1;
step 5.4, sequencing all words in the CSi by adopting semantic similarity characteristics to obtain rank3, and adding the rank to a set all _ ranks; let the initial value of j be 1;
step 5.5, ranking all words in CSi by using the word frequency feature; acquiring the frequencies {f(c1), f(c2), …, f(c10)} of all words in CSi by using the word frequencies acquired in step 1; sorting according to word frequency to obtain rank4, wherein the largest value is ranked first, and so on; adding rank4 to the set all_ranks;
and 5.6, calculating the average ranking value of each word in CSi = { c1, c2, …, cj, …, c10} by using the ranking of the four features in all _ ranks, and selecting the word with the highest ranking as the candidate word.
4. the English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with cj to form a new sequence W' = w_{-m}, …, w_{-1}, cj, w_1, …, w_m;
step 5.3.2, sequentially hiding each word of the W' from front to back, and calculating the cross entropy loss value of the sequence after hiding by using a mask language model of Bert; finally, solving the mean lossi of the cross entropy loss values of all the words W';
step 5.3.3, j = j +1, and repeating steps 5.3.1 and 5.3.2 until all words in the CSi are calculated;
and 5.3.4, sorting all loss values {loss_1, loss_2, …, loss_10} to obtain rank2, wherein the smallest value is ranked first, and so on.
5. the English sentence simplification algorithm based on the pre-trained Transformer language model according to claim 3, wherein the step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations v_cj and v_w of cj and w from the word vector model;
step 5.4.2, calculating the similarity value cos_j = cosine(v_cj, v_w) by adopting the cosine similarity measure:
cosine(v_cj, v_w) = ( Σ_{k=1..g} v_cj[k] · v_w[k] ) / ( sqrt(Σ_{k=1..g} v_cj[k]²) · sqrt(Σ_{k=1..g} v_w[k]²) )    (1)
in the formula (1), g is the dimension of the vectors in the word vector model;
Step 5.4.3, j = j +1, and repeating steps 5.4.1 and 5.4.2 until all words in the CSi are calculated;
step 5.4.4, ranking all similarity values {cos1, cos2, …, cos10} to obtain rank3, wherein the largest value is ranked first, and so on.
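Formula (1) and the ranking of step 5.4.4 can be sketched as follows. The 3-dimensional vectors are toy stand-ins for a real word-embedding model (in practice g is, e.g., 300); the names `cosine`, `v_w`, and `candidates` are illustrative.

```python
import math

# Cosine similarity between a candidate embedding and the original word's
# embedding, per formula (1); then rank candidates, largest similarity first.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

v_w = [1.0, 2.0, 3.0]                 # embedding of the original word w
candidates = {"c1": [1.0, 2.0, 2.9],  # near-synonym: almost parallel to v_w
              "c2": [-1.0, 0.5, 0.0]} # unrelated word: nearly orthogonal
rank3 = sorted(candidates, key=lambda c: cosine(candidates[c], v_w),
               reverse=True)
print(rank3)  # the candidate closest in embedding space is ranked first
```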
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910863529.9A CN110543639B (en) | 2019-09-12 | 2019-09-12 | English sentence simplification algorithm based on pre-training transducer language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910863529.9A CN110543639B (en) | 2019-09-12 | 2019-09-12 | English sentence simplification algorithm based on pre-training transducer language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110543639A true CN110543639A (en) | 2019-12-06 |
CN110543639B CN110543639B (en) | 2023-06-02 |
Family
ID=68713486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910863529.9A Active CN110543639B (en) | 2019-09-12 | 2019-09-12 | English sentence simplification algorithm based on pre-training transducer language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110543639B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125350A (en) * | 2019-12-17 | 2020-05-08 | 语联网(武汉)信息技术有限公司 | Method and device for generating LDA topic model based on bilingual parallel corpus |
CN111144131A (en) * | 2019-12-25 | 2020-05-12 | 北京中科研究院 | Network rumor detection method based on pre-training language model |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111563166A (en) * | 2020-05-28 | 2020-08-21 | 浙江学海教育科技有限公司 | Pre-training model method for mathematical problem classification |
CN111611790A (en) * | 2020-04-13 | 2020-09-01 | 华为技术有限公司 | Data processing method and device |
CN111651986A (en) * | 2020-04-28 | 2020-09-11 | 银江股份有限公司 | Event keyword extraction method, device, equipment and medium |
CN111768001A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Language model training method and device and computer equipment |
CN112016319A (en) * | 2020-09-08 | 2020-12-01 | 平安科技(深圳)有限公司 | Pre-training model obtaining method, disease entity labeling method, device and storage medium |
CN112528669A (en) * | 2020-12-01 | 2021-03-19 | 北京百度网讯科技有限公司 | Multi-language model training method and device, electronic equipment and readable storage medium |
CN112949284A (en) * | 2019-12-11 | 2021-06-11 | 上海大学 | Text semantic similarity prediction method based on Transformer model |
CN113158695A (en) * | 2021-05-06 | 2021-07-23 | 上海极链网络科技有限公司 | Semantic auditing method and system for multi-language mixed text |
CN113177402A (en) * | 2021-04-26 | 2021-07-27 | 平安科技(深圳)有限公司 | Word replacement method and device, electronic equipment and storage medium |
WO2021218028A1 (en) * | 2020-04-29 | 2021-11-04 | 平安科技(深圳)有限公司 | Artificial intelligence-based interview content refining method, apparatus and device, and medium |
WO2022098719A1 (en) * | 2020-11-03 | 2022-05-12 | Salesforce.Com, Inc. | System and methods for training task-oriented dialogue (tod) language models |
WO2022174804A1 (en) * | 2021-02-20 | 2022-08-25 | 北京有竹居网络技术有限公司 | Text simplification method and apparatus, and device and storage medium |
CN115329784A (en) * | 2022-10-12 | 2022-11-11 | 之江实验室 | Sentence rephrasing generation system based on pre-training model |
CN116227484A (en) * | 2023-05-09 | 2023-06-06 | 腾讯科技(深圳)有限公司 | Model training method, apparatus, device, storage medium and computer program product |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106257441A (en) * | 2016-06-30 | 2016-12-28 | 电子科技大学 | A kind of training method of skip language model based on word frequency |
CN108509474A (en) * | 2017-09-15 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Search for the synonym extended method and device of information |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949284A (en) * | 2019-12-11 | 2021-06-11 | 上海大学 | Text semantic similarity prediction method based on Transformer model |
CN111125350B (en) * | 2019-12-17 | 2023-05-12 | 传神联合(北京)信息技术有限公司 | Method and device for generating LDA topic model based on bilingual parallel corpus |
CN111125350A (en) * | 2019-12-17 | 2020-05-08 | 语联网(武汉)信息技术有限公司 | Method and device for generating LDA topic model based on bilingual parallel corpus |
CN111144131A (en) * | 2019-12-25 | 2020-05-12 | 北京中科研究院 | Network rumor detection method based on pre-training language model |
CN111144131B (en) * | 2019-12-25 | 2021-04-30 | 北京中科研究院 | Network rumor detection method based on pre-training language model |
CN111611790B (en) * | 2020-04-13 | 2022-09-16 | 华为技术有限公司 | Data processing method and device |
CN111611790A (en) * | 2020-04-13 | 2020-09-01 | 华为技术有限公司 | Data processing method and device |
CN111651986B (en) * | 2020-04-28 | 2024-04-02 | 银江技术股份有限公司 | Event keyword extraction method, device, equipment and medium |
CN111651986A (en) * | 2020-04-28 | 2020-09-11 | 银江股份有限公司 | Event keyword extraction method, device, equipment and medium |
WO2021218028A1 (en) * | 2020-04-29 | 2021-11-04 | 平安科技(深圳)有限公司 | Artificial intelligence-based interview content refining method, apparatus and device, and medium |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111444721B (en) * | 2020-05-27 | 2022-09-23 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111563166B (en) * | 2020-05-28 | 2024-02-13 | 浙江学海教育科技有限公司 | Pre-training model method for classifying mathematical problems |
CN111563166A (en) * | 2020-05-28 | 2020-08-21 | 浙江学海教育科技有限公司 | Pre-training model method for mathematical problem classification |
CN111768001A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Language model training method and device and computer equipment |
CN111768001B (en) * | 2020-06-30 | 2024-01-23 | 平安国际智慧城市科技股份有限公司 | Language model training method and device and computer equipment |
CN112016319A (en) * | 2020-09-08 | 2020-12-01 | 平安科技(深圳)有限公司 | Pre-training model obtaining method, disease entity labeling method, device and storage medium |
CN112016319B (en) * | 2020-09-08 | 2023-12-15 | 平安科技(深圳)有限公司 | Pre-training model acquisition and disease entity labeling method, device and storage medium |
WO2022098719A1 (en) * | 2020-11-03 | 2022-05-12 | Salesforce.Com, Inc. | System and methods for training task-oriented dialogue (tod) language models |
US11749264B2 (en) | 2020-11-03 | 2023-09-05 | Salesforce, Inc. | System and methods for training task-oriented dialogue (TOD) language models |
CN112528669B (en) * | 2020-12-01 | 2023-08-11 | 北京百度网讯科技有限公司 | Training method and device for multilingual model, electronic equipment and readable storage medium |
CN112528669A (en) * | 2020-12-01 | 2021-03-19 | 北京百度网讯科技有限公司 | Multi-language model training method and device, electronic equipment and readable storage medium |
WO2022174804A1 (en) * | 2021-02-20 | 2022-08-25 | 北京有竹居网络技术有限公司 | Text simplification method and apparatus, and device and storage medium |
WO2022227166A1 (en) * | 2021-04-26 | 2022-11-03 | 平安科技(深圳)有限公司 | Word replacement method and apparatus, electronic device, and storage medium |
CN113177402A (en) * | 2021-04-26 | 2021-07-27 | 平安科技(深圳)有限公司 | Word replacement method and device, electronic equipment and storage medium |
CN113177402B (en) * | 2021-04-26 | 2024-03-01 | 平安科技(深圳)有限公司 | Word replacement method, device, electronic equipment and storage medium |
CN113158695A (en) * | 2021-05-06 | 2021-07-23 | 上海极链网络科技有限公司 | Semantic auditing method and system for multi-language mixed text |
CN115329784A (en) * | 2022-10-12 | 2022-11-11 | 之江实验室 | Sentence rephrasing generation system based on pre-training model |
CN116227484A (en) * | 2023-05-09 | 2023-06-06 | 腾讯科技(深圳)有限公司 | Model training method, apparatus, device, storage medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN110543639B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543639A (en) | english sentence simplification algorithm based on pre-training Transformer language model | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
US10614106B2 (en) | Automated tool for question generation | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN111125349A (en) | Graph model text abstract generation method based on word frequency and semantics | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
JP2006244262A (en) | Retrieval system, method and program for answer to question | |
CN110209818B (en) | Semantic sensitive word and sentence oriented analysis method | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN110909116B (en) | Entity set expansion method and system for social media | |
CN112270188A (en) | Questioning type analysis path recommendation method, system and storage medium | |
Huda et al. | A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer | |
CN112214989A (en) | Chinese sentence simplification method based on BERT | |
Nugraha et al. | Typographic-based data augmentation to improve a question retrieval in short dialogue system | |
CN111428031A (en) | Graph model filtering method fusing shallow semantic information | |
Guo et al. | Selective text augmentation with word roles for low-resource text classification | |
CN112711944B (en) | Word segmentation method and system, and word segmentation device generation method and system | |
Smadja et al. | Translating collocations for use in bilingual lexicons | |
CN115204143B (en) | Method and system for calculating text similarity based on prompt | |
CN116070620A (en) | Information processing method and system based on big data | |
Prelevikj et al. | Multilingual named entity recognition and matching using BERT and dedupe for Slavic languages | |
CN114969324A (en) | Chinese news title classification method based on subject word feature expansion | |
CN107729509A (en) | The chapter similarity decision method represented based on recessive higher-dimension distributed nature | |
KR20050033852A (en) | Apparatus, method, and program for text classification using frozen pattern |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||