CN110096705A

CN110096705A - An unsupervised algorithm for automatic simplification of English sentences

Info

Publication number: CN110096705A
Application number: CN201910354246.1A
Authority: CN
Inventors: 强继朋; 李云; 袁运浩
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2019-08-06
Anticipated expiration: 2039-04-29
Also published as: CN110096705B

Abstract

The invention discloses an unsupervised algorithm for automatically simplifying English sentences, in the field of Internet text processing. It proceeds in seven steps: step 1, train vector representations of words; step 2, obtain word frequencies; step 3, obtain a simplified sentence set and a complex sentence set; step 4, populate the phrase table; step 5, train a simplified-sentence language model and a complex-sentence language model; step 6, build a phrase-based sentence simplification model; step 7, iteratively apply a back-translation strategy to train a better sentence simplification model. Without using any annotated parallel corpus, the invention makes full use of the English Wikipedia corpus and effectively improves the accuracy of English sentence simplification.

Description

An Unsupervised Algorithm for Automatic Simplification of English Sentences

Technical Field

The invention relates to an algorithm for Internet text, and in particular to an unsupervised algorithm for automatically simplifying English sentences.

Background Art

In recent years, text on the Internet has provided a great deal of useful knowledge and information to an ever wider range of users. For many people, however, the way online text is written, including its vocabulary and syntactic structure, can be hard to read and understand, especially for those with low literacy, cognitive or language impairments, or limited knowledge of the language of the text. Text containing uncommon words or long, complex sentences is difficult not only for people to read and understand but also for machines to analyze. Automatic text simplification simplifies the content of the original text as far as possible while preserving its information, so that it can be read and understood more easily by a wider audience.

Existing text simplification algorithms borrow machine translation techniques and learn to produce simplified sentences from parallel pairs of complex and simplified sentences in the same language. Such simplification is a supervised learning task whose effectiveness depends heavily on large parallel simplification corpora. The existing English parallel simplification corpora, however, are drawn mainly from the ordinary English Wikipedia and the children's edition of the English Wikipedia, with a matching algorithm selecting sentences from the two Wikipedias to form parallel sentence pairs. The parallel simplification corpora currently available are not only small but also contain many pairs that are not simplifications and many pairs that are simply wrong. The main reason is that the children's edition is written by non-professionals and does not correspond one-to-one with the ordinary Wikipedia, which makes it hard to choose a suitable sentence matching algorithm. Because of these problems with the parallel simplification corpora, existing text simplification algorithms do not perform well.

Summary of the Invention

The purpose of the invention is to provide an unsupervised algorithm for automatically simplifying English sentences. It requires no parallel simplification corpus and uses only the publicly downloadable Wikipedia corpus, making English sentences easier for users to read and understand, especially users with cognitive or language impairments.

This purpose is achieved by an unsupervised algorithm for automatically simplifying English sentences that proceeds as follows:

Step 1. Take the public English Wikipedia corpus D as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Word2vec is trained with the Skip-Gram model. Given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t. The log-probability of observing the context word set is defined as follows:
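Formula (1) itself does not survive in this text. Assuming the standard Skip-Gram softmax, which is consistent with the symbols v_t, v'_w, W_t and V used in this step, it would take the following form. This is a reconstruction under that assumption, not the patent's verbatim formula.

```latex
\log p(W_t \mid t) \;=\; \sum_{w \in W_t} \log
  \frac{\exp\!\left(v_t^{\top} v'_w\right)}
       {\sum_{w' \in V} \exp\!\left(v_t^{\top} v'_{w'}\right)}
\tag{1}
```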

In formula (1), v'_w is the context vector representation of word w, and V is the vocabulary of D. The overall objective function of Skip-Gram is then defined as follows:
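Formula (2) is likewise missing from this text. Assuming the standard Skip-Gram objective, it would sum the context log-probability over every word occurrence t in D, as sketched below. The symbol J is introduced here only for illustration.

```latex
J \;=\; \sum_{t \in D} \sum_{w \in W_t} \log
  \frac{\exp\!\left(v_t^{\top} v'_w\right)}
       {\sum_{w' \in V} \exp\!\left(v_t^{\top} v'_{w'}\right)}
\tag{2}
```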

In formula (2), the vector representations of words are learned by maximizing this objective function;

Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;

Step 3. Using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;

Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which gives the probability of translating one word into another. In PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

In formula (4), cos denotes cosine similarity;

Step 5. Train the language model KenLM separately on the simplified sentence set S and on the complex sentence set C, obtaining the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged throughout the subsequent iterative learning;

Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to build an initial algorithm that simplifies complex sentences into simplified sentences. Given a complex sentence c, the algorithm uses formula (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$ s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5} $$

In formula (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S;

Step 7. Starting from the initial PBMT algorithm, iteratively apply the back-translation strategy to produce a better text simplification algorithm.

As a further refinement of the invention, step 3 specifically comprises:

Step 3.1. Score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences by score from high to low;

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
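Formula (3) is not reproduced in this text. The standard Flesch Reading Ease score is 206.835 minus 1.015 times the average number of words per sentence minus 84.6 times the average number of syllables per word. Applied to a single sentence and written with tw(s) and ts(s), it would read as follows. This is a reconstruction under the assumption that the patent uses the standard coefficients.

```latex
\mathrm{FRE}(s) \;=\; 206.835 \;-\; 1.015 \cdot tw(s) \;-\; 84.6 \cdot \frac{ts(s)}{tw(s)}
\tag{3}
```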

Step 3.2. Discard the sentences scoring above 100, the sentences scoring below 20, and the sentences with intermediate scores. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C.

As a further refinement of the invention, step 7 specifically comprises:

Step 7.1. First use the initial PBMT algorithm to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_0. Then execute steps 7.2 to 7.5 in a loop, with the iteration counter i running from 1 to N;

Step 7.2. Using the synthesized parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences;

Step 7.3. Use the simplified-to-complex PBMT algorithm from step 7.2 to translate the simplified sentence set S, obtaining a newly synthesized complex sentence set C_i;

Step 7.4. Using the synthesized parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences;

Step 7.5. Use the complex-to-simplified PBMT algorithm from step 7.4 to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_i. Return to step 7.2 and repeat until N iterations have been completed.

Compared with the prior art, the invention has the following beneficial effects:

1. When populating the phrase table, the invention combines word vector representations and word frequencies obtained from the Wikipedia corpus, capturing both the semantic information of words and how often they are used. This removes the need, in conventional phrase-based machine translation (PBMT), for a parallel corpus to populate the phrase table;

2. The invention treats the Wikipedia corpus as a knowledge base and scores sentences with the Flesch Reading Ease (FRE) algorithm to obtain the simplified and complex sentence sets, so that the complex-sentence and simplified-sentence language models can be trained more accurately;

3. Using the resulting phrase table, complex-sentence language model and simplified-sentence language model, the invention builds an initial unsupervised text simplification algorithm on top of PBMT. The algorithm is not only unsupervised but also simple, easy to interpret, and fast to train;

4. After building the initial simplification algorithm, the invention uses it to generate a parallel corpus and then optimizes the text simplification model through back-translation, correcting possibly wrong entries in the initial phrase table and further improving the algorithm's performance.

Detailed Description

The invention is further described below with reference to a specific embodiment.

An unsupervised algorithm for automatically simplifying English sentences proceeds as follows:

Step 1. Take the public English Wikipedia corpus D, which can be downloaded from "https://dumps.wikimedia.org/enwiki/", as the training corpus, and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Once the vectors are available, the similarity between words can be computed, which helps find the set of highly similar words for each word. In this embodiment, each vector has 300 dimensions and Word2vec is trained with the Skip-Gram model. Given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the window covers the 5 words before t and the 5 words after it. The log-probability of observing the context word set is defined as follows:

In formula (2), the overall Skip-Gram objective function, the vector representations of words are learned by maximizing the objective with stochastic gradient descent and negative sampling.
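To make the sliding window of step 1 concrete, the following minimal sketch enumerates (center word, context word) pairs with 5 words on each side. The function name `context_pairs` and the toy sentence are hypothetical; a real run would pass the tokenized corpus to a Word2vec implementation rather than build pairs by hand.

```python
from typing import Iterator, List, Tuple


def context_pairs(tokens: List[str], window: int = 5) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for a symmetric sliding window.

    `window` is the number of words taken on each side of the center word
    (5 in the embodiment described above).
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not part of its own context
                yield center, tokens[j]


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for center, ctx in context_pairs(sentence, window=2):
        print(center, "->", ctx)
```

Each pair produced this way contributes one log-probability term to the Skip-Gram objective.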

Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D. In text simplification, measures of word complexity usually take word frequency into account: in general, the more frequent a word is, the easier it is to understand. Word frequency can therefore be used to pick the easiest word from the set of words highly similar to t.
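The counting in step 2 can be sketched as follows. This is a minimal illustration that assumes the corpus is already split into token lists; the names `word_frequencies` and `easiest_word` are hypothetical helpers, not part of the patent.

```python
from collections import Counter
from typing import Iterable, List


def word_frequencies(sentences: Iterable[List[str]]) -> Counter:
    """Return f(t): the number of occurrences of each word t in the corpus."""
    freq: Counter = Counter()
    for tokens in sentences:
        freq.update(tokens)
    return freq


def easiest_word(candidates: Iterable[str], freq: Counter) -> str:
    """Pick the most frequent candidate, taken here as the easiest to understand."""
    return max(candidates, key=lambda w: freq[w])


if __name__ == "__main__":
    corpus = [["the", "big", "dog"], ["the", "large", "dog"], ["a", "big", "house"]]
    f = word_frequencies(corpus)
    print(f["big"], f["large"])
    print(easiest_word(["large", "big"], f))
```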

Step 3. The Wikipedia corpus D is a very large corpus containing large numbers of both complex and simple sentences. Using D, obtain a simplified sentence set S and a complex sentence set C;

Step 3.1. Score each sentence s in the Wikipedia corpus D with the FRE (Flesch Reading Ease) algorithm, as in formula (3), and sort the sentences by score from high to low. A higher score means a simpler sentence; a lower score means a more difficult one;

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s. FRE is commonly used to evaluate the quality of the final output of text simplification models;

Step 3.2. Discard the sentences scoring above 100, the sentences scoring below 20, and the sentences with intermediate scores. The highest- and lowest-scoring sentences are discarded to eliminate extreme cases; the intermediate ones are discarded to establish a clear boundary between S and C. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C. In this embodiment, 10 million sentences were selected for each of S and C.
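Steps 3.1 and 3.2 can be sketched as below. Several parts are assumptions made only for illustration: the score uses the standard Flesch Reading Ease coefficients applied to one sentence, the syllable counter is a rough vowel-group heuristic, and the cut-offs `simple_min` and `complex_max` that define the discarded middle band are hypothetical, since the text gives only the outer limits of 100 and 20.

```python
import re
from typing import Iterable, List, Tuple

_VOWEL_GROUP = re.compile(r"[aeiouy]+")
_WORD = re.compile(r"[A-Za-z']+")


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and drop a trailing silent 'e'."""
    w = word.lower()
    n = len(_VOWEL_GROUP.findall(w))
    if w.endswith("e") and not w.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def fre(sentence: str) -> float:
    """Flesch Reading Ease of a single sentence (standard coefficients assumed)."""
    words = _WORD.findall(sentence)
    if not words:
        raise ValueError("sentence contains no words")
    tw = len(words)                              # tw(s): number of words
    ts = sum(count_syllables(w) for w in words)  # ts(s): number of syllables
    return 206.835 - 1.015 * tw - 84.6 * ts / tw


def split_by_fre(
    sentences: Iterable[str],
    simple_min: float = 80.0,   # hypothetical lower bound of the simple band
    complex_max: float = 40.0,  # hypothetical upper bound of the complex band
    low: float = 20.0,
    high: float = 100.0,
) -> Tuple[List[str], List[str]]:
    """Return (S, C), dropping extreme and intermediate sentences."""
    simple, complex_ = [], []
    for s in sentences:
        score = fre(s)
        if score > high or score < low:
            continue  # extreme sentence
        if score >= simple_min:
            simple.append(s)
        elif score <= complex_max:
            complex_.append(s)
        # anything in between is an intermediate sentence and is dropped
    return simple, complex_
```

Leaving a gap between the two bands is what keeps S and C clearly separated; how wide that gap should be is a tuning choice that the text does not specify.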

Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which gives the probability of translating one word into another. In PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

In formula (4), cos denotes cosine similarity. Because learning translation probabilities for all words is infeasible, this embodiment selects the 300,000 most frequent words and computes, for each of them, probabilities only to its 200 most similar words. For proper nouns, only the probability of translating to the word itself is computed.
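The following sketch shows one way a phrase table of this shape could be populated. Formula (4) is not reproduced in this text, so the scoring used here is an assumption: cosine similarities of the top-k neighbours are simply normalized to sum to 1, and word frequency is used only to select the vocabulary. Whatever frequency term formula (4) contains is not modelled. The function `build_phrase_table` and its parameters are hypothetical.

```python
import math
from typing import Dict, Iterable, Mapping, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def build_phrase_table(
    vectors: Mapping[str, Sequence[float]],
    freq: Mapping[str, int],
    vocab_size: int = 300_000,
    top_k: int = 200,
    proper_nouns: Iterable[str] = (),
) -> Dict[str, Dict[str, float]]:
    """Return table[t_i][t_j], an assumed stand-in for p(t_j | t_i)."""
    proper = set(proper_nouns)
    # Keep only the most frequent words that also have a vector.
    ranked = sorted(freq, key=lambda w: freq[w], reverse=True)
    vocab = [w for w in ranked[:vocab_size] if w in vectors]
    table: Dict[str, Dict[str, float]] = {}
    for ti in vocab:
        if ti in proper:
            table[ti] = {ti: 1.0}  # proper nouns translate only to themselves
            continue
        sims = [(tj, cosine(vectors[ti], vectors[tj])) for tj in vocab]
        sims = [(tj, s) for tj, s in sims if s > 0.0]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        top = sims[:top_k]
        total = sum(s for _, s in top)
        table[ti] = {tj: s / total for tj, s in top} if total > 0.0 else {ti: 1.0}
    return table
```

The brute-force neighbour search is quadratic in the vocabulary size; at 300,000 words an approximate nearest-neighbour index would be needed in practice.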

Step 5. Train the language model KenLM separately on the simplified sentence set S and the complex sentence set C obtained in step 3, obtaining the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged throughout the subsequent iterative learning. A language model computes the probability of a given word sequence. By computing the probabilities of word sequences, the simplified and complex language models help improve the quality of the simplification model through local substitution and word reordering.
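To illustrate what the two language models contribute, the sketch below trains a toy add-one-smoothed bigram model on each sentence set and compares the scores they give to the same sentence. This is a deliberately simplified stand-in and not KenLM, which builds higher-order n-gram models with modified Kneser-Ney smoothing. The class name `BigramLM` is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class BigramLM:
    """Add-one smoothed bigram language model (toy stand-in for KenLM)."""

    def __init__(self, sentences: Iterable[List[str]]):
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        self.vocab = {"</s>"}
        for tokens in sentences:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, tokens: List[str]) -> float:
        """Natural-log probability of a token sequence under the model."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot reserves mass for unseen words
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigram_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + v
            total += math.log(num / den)
        return total


if __name__ == "__main__":
    lm_s = BigramLM([["the", "dog", "ran"], ["the", "cat", "ran"]])
    lm_c = BigramLM([["the", "canine", "proceeded"]])
    sentence = ["the", "dog", "ran"]
    print(lm_s.log_prob(sentence), lm_c.log_prob(sentence))
```

A sentence built from words and word orders typical of S scores higher under the model trained on S than under the model trained on C, which is the signal the simplification step relies on.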

Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to build an initial algorithm that simplifies complex sentences into simplified sentences. PBMT was first proposed in "Statistical phrase-based translation" (2003) for machine translation between two languages. Given a complex sentence c, the algorithm uses formula (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$ s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5} $$

In formula (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S.
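The scoring rule of formula (5) can be illustrated with a small exhaustive search. This is a sketch under stated assumptions, not a PBMT decoder: it substitutes word for word without reordering, treats p(c|s) as a product of per-word phrase table entries, and enumerates every combination, whereas real decoders such as Moses use beam search. The function `simplify`, the table layout `phrase_table[s][c] = p(c|s)` and the toy numbers are all hypothetical.

```python
import itertools
import math
from typing import Callable, Dict, List, Sequence


def simplify(
    complex_tokens: Sequence[str],
    phrase_table: Dict[str, Dict[str, float]],
    lm_log_prob: Callable[[List[str]], float],
) -> List[str]:
    """Return the candidate s maximizing log p(c|s) + log p(s)."""
    options = []
    for c in complex_tokens:
        # Candidate simple words for c: every s with p(c | s) > 0.
        cands = [(s, probs[c]) for s, probs in phrase_table.items()
                 if probs.get(c, 0.0) > 0.0]
        if not cands:
            cands = [(c, 1.0)]  # unknown word: copy it through unchanged
        options.append(cands)

    best: List[str] = list(complex_tokens)
    best_score = -math.inf
    for combo in itertools.product(*options):
        tokens = [s for s, _ in combo]
        score = sum(math.log(p) for _, p in combo) + lm_log_prob(tokens)
        if score > best_score:
            best, best_score = tokens, score
    return best


if __name__ == "__main__":
    table = {
        "he": {"he": 1.0},
        "buy": {"buy": 0.6, "purchase": 0.4},
        "purchase": {"purchase": 0.9, "buy": 0.1},
        "car": {"car": 0.7, "automobile": 0.3},
        "automobile": {"automobile": 0.9, "car": 0.1},
    }
    unigram = {"he": -1.0, "buy": -1.0, "car": -1.0,
               "purchase": -5.0, "automobile": -5.0}
    lm = lambda toks: sum(unigram.get(t, -10.0) for t in toks)
    print(simplify(["he", "purchase", "automobile"], table, lm))
```

The phrase table alone would prefer to keep each word unchanged; the language model term is what tips the choice toward the simpler wording.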

Step 7. Because only non-parallel corpora are available, the initial PBMT algorithm is used to iteratively apply the back-translation strategy, which turns a very difficult unsupervised learning problem into a supervised learning task and thereby produces a better text simplification algorithm;

Step 7.1. First use the initial PBMT algorithm to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_0. Then execute steps 7.2 to 7.5 in a loop, with the iteration counter i running from 1 to N;

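Step 7.2. Using the synthesized parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences;

Step 7.3. Use the simplified-to-complex PBMT algorithm from step 7.2 to translate the simplified sentence set S, obtaining a newly synthesized complex sentence set C_i;

Step 7.4. Using the synthesized parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences;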

Step 7.5. Use the complex-to-simplified PBMT algorithm from step 7.4 to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_i. Return to step 7.2 and repeat until N iterations have been completed. In this embodiment, N is set to 3.
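The control flow of steps 7.1 to 7.5 can be sketched as a short loop. The trainers are passed in as callables because the actual PBMT training procedure is outside the scope of this sketch; `back_translate`, `train_s2c` and `train_c2s` are hypothetical names, and the fixed language models LM_S and LM_C are assumed to be captured inside those trainers.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Trainer = Callable[[List[Tuple[str, str]]], Translator]


def back_translate(
    initial_c2s: Translator,
    train_s2c: Trainer,
    train_c2s: Trainer,
    simple_set: List[str],
    complex_set: List[str],
    n_iters: int = 3,
) -> Translator:
    """Iterative back-translation; returns the final complex-to-simple model."""
    # Step 7.1: translate C with the initial model to synthesize S_0.
    synth_simple = [initial_c2s(c) for c in complex_set]
    c2s = initial_c2s
    for _ in range(n_iters):
        # Step 7.2: train simple-to-complex on the pairs (S_{i-1}, C).
        s2c = train_s2c(list(zip(synth_simple, complex_set)))
        # Step 7.3: translate the real simple set S to synthesize C_i.
        synth_complex = [s2c(s) for s in simple_set]
        # Step 7.4: train complex-to-simple on the pairs (C_i, S).
        c2s = train_c2s(list(zip(synth_complex, simple_set)))
        # Step 7.5: translate C again to synthesize S_i for the next round.
        synth_simple = [c2s(c) for c in complex_set]
    return c2s
```

In each round the target side of the training pairs is always real text (C in step 7.2, S in step 7.4) and only the source side is synthetic, which is what lets the fixed language models and real sentences pull the translation models toward better output.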

Intuitively, because the input to the PBMT algorithm is noisy, many entries in the phrase table are incorrect. Even so, the language model can help correct some of these errors while simplified sentences are being generated. As long as this happens, both the phrase table and the translation algorithm improve as the iterations proceed; as more phrase table entries are corrected, the PBMT algorithm becomes increasingly strong.

The invention is not limited to the embodiment above. On the basis of the disclosed technical solution, those skilled in the art can, without creative effort, substitute or modify some of the technical features according to the disclosed technical content, and all such substitutions and modifications fall within the scope of protection of the invention.

Claims

1. An unsupervised algorithm for automatically simplifying English sentences, characterized in that it proceeds as follows:

Step 1. Take the public English Wikipedia corpus D as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Word2vec is trained with the Skip-Gram model. Given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t. The log-probability of observing the context word set is defined as follows:

In formula (1), v'_w is the context vector representation of word w, and V is the vocabulary of D. The overall objective function of Skip-Gram is then defined as follows:

In formula (2), the vector representations of words are learned by maximizing this objective function;

Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;

Step 3. Using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;

Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which gives the probability of translating one word into another. In PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

In formula (4), cos denotes cosine similarity;

Step 5. Train the language model KenLM separately on the simplified sentence set S and on the complex sentence set C, obtaining the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged throughout the subsequent iterative learning;

Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to build an initial algorithm that simplifies complex sentences into simplified sentences. Given a complex sentence c, the algorithm uses formula (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$ s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5} $$

In formula (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S;

Step 7. Starting from the initial PBMT algorithm, iteratively apply the back-translation strategy to produce a better text simplification algorithm.

2. The unsupervised algorithm for automatically simplifying English sentences according to claim 1, characterized in that step 3 specifically comprises:

Step 3.1. Score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences by score from high to low;

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;

Step 3.2. Discard the sentences scoring above 100, the sentences scoring below 20, and the sentences with intermediate scores. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C.

3. The unsupervised algorithm for automatically simplifying English sentences according to claim 1, characterized in that step 7 specifically comprises:

Step 7.1. First use the initial PBMT algorithm to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_0. Then execute steps 7.2 to 7.5 in a loop, with the iteration counter i running from 1 to N;

Step 7.2. Using the synthesized parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences;

Step 7.3. Use the simplified-to-complex PBMT algorithm from step 7.2 to translate the simplified sentence set S, obtaining a newly synthesized complex sentence set C_i;

Step 7.4. Using the synthesized parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences;

Step 7.5. Use the complex-to-simplified PBMT algorithm from step 7.4 to translate the complex sentence set C, obtaining a newly synthesized simplified sentence set S_i. Return to step 7.2 and repeat until N iterations have been completed.