CN110096705A - An unsupervised automatic simplification algorithm for English sentences - Google Patents

Info

Publication number
CN110096705A
Authority
CN
China
Prior art keywords
sentence
word
algorithm
complex
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910354246.1A
Other languages
Chinese (zh)
Other versions
CN110096705B (en)
Inventor
强继朋
李云
袁运浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University
Priority to CN201910354246.1A
Publication of CN110096705A
Application granted
Publication of CN110096705B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an unsupervised algorithm for automatically simplifying English sentences, in the Internet field. It proceeds as follows: Step 1, train vector representations of words; Step 2, obtain word frequencies; Step 3, obtain a set of simplified sentences and a set of complex sentences; Step 4, populate the phrase table; Step 5, train a simplified-sentence language model and a complex-sentence language model; Step 6, build a phrase-based sentence simplification model; Step 7, iteratively apply a back-translation strategy to train a better sentence simplification model. Without using any annotated parallel corpus, the invention makes full use of the English Wikipedia corpus and effectively improves the accuracy of English sentence simplification.

Description

An Unsupervised Algorithm for Automatic Simplification of English Sentences

Technical Field

The invention relates to an Internet text algorithm, and in particular to an unsupervised algorithm for automatically simplifying English sentences.

Background

In recent years, text on the Internet has provided a great deal of useful knowledge and information to an ever wider range of users. However, for many people the way online text is written, in both vocabulary and syntactic structure, can be hard to read and understand, especially for people with low literacy, cognitive or language impairments, or limited knowledge of the language of the text. Text that contains uncommon words or long, complex sentences is hard not only for people to read and understand but also for machines to analyze. Automatic text simplification simplifies the content of a text as far as possible while preserving its information, so that it can be read and understood by a wider audience.

Existing text simplification algorithms borrow machine translation techniques and learn to simplify from parallel corpora of complex and simplified sentences in the same language. Such algorithms are supervised learning tasks, and their effectiveness depends heavily on large parallel simplification corpora. However, the existing English parallel simplification corpora are obtained mainly from the ordinary English Wikipedia and the children's version of the English Wikipedia, with a matching algorithm selecting sentences from the two Wikipedias to form parallel sentence pairs. The parallel simplification corpora currently available are small, and they also contain many non-simplified and incorrect sentence pairs. The main reason is that the children's Wikipedia is written by non-professionals and does not correspond one-to-one with the ordinary Wikipedia, which makes it hard to choose a suitable sentence matching algorithm. Because of these problems with the parallel simplification corpora, existing text simplification algorithms do not perform well.

Summary of the Invention

The object of the invention is to provide an unsupervised algorithm for automatically simplifying English sentences. It requires no parallel simplification corpus and uses only the publicly downloadable Wikipedia corpus, so that users, especially people with cognitive or language impairments, can read and understand English sentences more easily.

The object of the invention is achieved as follows: an unsupervised algorithm for automatically simplifying English sentences, which proceeds in the following steps:

Step 1. Take the public English Wikipedia corpus D as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. The Word2vec embeddings are learned with the Skip-Gram model: given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words that appear in the context window of t. The log-probability of observing the context word set is defined as:

$$\log p(W_t \mid t) = \sum_{w \in W_t} \log \frac{\exp\left({v'_w}^{\top} v_t\right)}{\sum_{w' \in V} \exp\left({v'_{w'}}^{\top} v_t\right)} \tag{1}$$

In Equation (1), v'_w is the context vector representation of word w, and V is the vocabulary of D. The overall Skip-Gram objective function is then defined as:

$$\sum_{t \in D} \log p(W_t \mid t) \tag{2}$$

The word vector representations are learned by maximizing the objective function in Equation (2).

Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D.

Step 3. Using the Wikipedia corpus D, obtain the simplified sentence set S and the complex sentence set C.

Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which stores the probability of translating one word into another. In PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by Equation (4), in which cos denotes cosine similarity.

Step 5. Train a language model with the KenLM algorithm on the simplified sentence set S and on the complex sentence set C separately, obtaining the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged throughout the subsequent iterative learning.

Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to construct the initial algorithm that simplifies complex sentences into simplified sentences. Given a complex sentence c, the algorithm uses Equation (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5}$$

In Equation (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S.

Step 7. Starting from the initial PBMT algorithm, iteratively apply the back-translation strategy to produce a better text simplification algorithm.

As a further limitation of the invention, Step 3 specifically comprises:

Step 3.1. Score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in Equation (3), and sort the sentences by score from high to low:

$$\mathrm{FRE}(s) = 206.835 - 1.015\,\mathrm{tw}(s) - 84.6\,\frac{\mathrm{ts}(s)}{\mathrm{tw}(s)} \tag{3}$$

In Equation (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s.

Step 3.2. Remove the sentences that score above 100, remove the sentences that score below 20, and remove the sentences with intermediate scores. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C.

As a further limitation of the invention, Step 7 specifically comprises:

Step 7.1. First use the initial complex-to-simplified PBMT algorithm to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_0. Then execute Steps 7.2 to 7.5 in a loop, with the iteration index i running from 1 to N.

Step 7.2. Using the synthetic parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences.

Step 7.3. Use the simplified-to-complex algorithm from Step 7.2 to translate the simplified sentence set S, obtaining a new synthetic complex sentence set C_i.

Step 7.4. Using the synthetic parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences.

Step 7.5. Use the complex-to-simplified algorithm from Step 7.4 to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_i. Return to Step 7.2 and repeat until N iterations have been completed.

Compared with the prior art, the invention has the following beneficial effects:

1. When populating the phrase table, the invention combines the word vector representations and the word frequencies obtained from the Wikipedia corpus, capturing both the semantics of words and how often they are used. This removes the need of traditional phrase-based machine translation (PBMT) for a parallel corpus to populate the phrase table.

2. The invention treats the Wikipedia corpus as a knowledge base and scores its sentences with the Flesch Reading Ease (FRE) algorithm to obtain the simplified and complex sentence sets, which allows the complex-sentence and simplified-sentence language models to be trained more accurately.

3. Using the resulting phrase table, complex-sentence language model and simplified-sentence language model, the invention builds an initial unsupervised text simplification algorithm based on PBMT. The algorithm is not only unsupervised but also simple, easy to interpret and fast to train.

4. After building the initial simplification algorithm, the invention uses it to generate a parallel corpus and then optimizes the text simplification model with a back-translation strategy. This corrects entries in the initial phrase table that may be wrong and further improves the performance of the algorithm.

Detailed Description of the Embodiments

The invention is further described below with reference to a specific embodiment.

An unsupervised algorithm for automatically simplifying English sentences proceeds in the following steps:

Step 1. Take the public English Wikipedia corpus D, which can be downloaded from "https://dumps.wikimedia.org/enwiki/", as the training corpus, and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Once the vectors are available, the similarity between words can be computed, which helps to find the set of highly similar words for each word. In this embodiment the dimension of each vector is set to 300, and the Word2vec embeddings are learned with the Skip-Gram model. Given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words that appear in the context window of t. The sliding window covers the 5 words before t and the 5 words after t. The log-probability of observing the context word set is defined as:

$$\log p(W_t \mid t) = \sum_{w \in W_t} \log \frac{\exp\left({v'_w}^{\top} v_t\right)}{\sum_{w' \in V} \exp\left({v'_{w'}}^{\top} v_t\right)} \tag{1}$$

In Equation (1), v'_w is the context vector representation of word w, and V is the vocabulary of D. The overall Skip-Gram objective function is then defined as:

$$\sum_{t \in D} \log p(W_t \mid t) \tag{2}$$

The word vector representations are learned by maximizing the objective function in Equation (2), using stochastic gradient descent and negative sampling.
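The quantity described for Equation (1) can be illustrated with a small pure-Python sketch. It assumes the standard full-softmax form of the Skip-Gram log-probability; the function names `context_log_prob` and `window` are illustrative and do not come from the patent. Actual Word2vec training would use negative sampling, as the embodiment states, rather than the exact normaliser computed here.

```python
import math

def context_log_prob(v_t, context_words, ctx_vectors):
    """Log-probability of the context word set W_t given the centre word t.

    v_t           -- embedding of the centre word t
    context_words -- words observed in the window around t (W_t)
    ctx_vectors   -- dict mapping every vocabulary word w to its context vector v'_w
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Softmax normaliser over the whole vocabulary V (exact, hence slow for large V).
    log_z = math.log(sum(math.exp(dot(vec, v_t)) for vec in ctx_vectors.values()))
    return sum(dot(ctx_vectors[w], v_t) - log_z for w in context_words)

def window(tokens, i, size=5):
    """Context words up to `size` positions before and after position i."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
```

Summing `context_log_prob` over every word position in the corpus gives the objective that Step 1 maximizes; the default `size=5` mirrors the window setting of this embodiment.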

Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D. In text simplification, measures of word complexity usually take word frequency into account: in general, the more frequent a word is, the easier it is to understand. Word frequency can therefore be used to pick the most easily understood word from the set of words highly similar to t.
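As a minimal sketch of Step 2, word frequencies can be collected with a counter. The tokenisation rule (lower-casing and keeping alphabetic runs) and the helper `most_frequent` are assumptions made for illustration; the patent does not specify them.

```python
import re
from collections import Counter

def word_frequencies(sentences):
    """Return f(t): the number of occurrences of each word t in the corpus."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenisation: lower-case alphabetic runs (apostrophes kept).
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts

def most_frequent(candidates, freq):
    """Pick the candidate with the highest corpus frequency (assumed easiest)."""
    return max(candidates, key=lambda w: freq[w])
```

For example, given the candidate set of words similar to t, `most_frequent` returns the one that occurs most often in D.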

Step 3. The Wikipedia corpus D is a very large corpus that contains large numbers of both complex and simple sentences. Using D, obtain the simplified sentence set S and the complex sentence set C.

Step 3.1. Score each sentence s in the Wikipedia corpus D with the FRE (Flesch Reading Ease) algorithm, as in Equation (3), and sort the sentences by score from high to low. A higher score means a simpler sentence, and a lower score means a more difficult one.

$$\mathrm{FRE}(s) = 206.835 - 1.015\,\mathrm{tw}(s) - 84.6\,\frac{\mathrm{ts}(s)}{\mathrm{tw}(s)} \tag{3}$$

In Equation (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s. FRE is commonly used to evaluate the quality of the final output of text simplification models.

Step 3.2. Remove the sentences that score above 100, remove the sentences that score below 20, and remove the sentences with intermediate scores. The very high-scoring and very low-scoring sentences are removed to discard extreme cases; the intermediate sentences are removed to establish a clear boundary between S and C. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C. In this embodiment, 10 million sentences are selected for each of S and C.
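Steps 3.1 and 3.2 can be sketched as follows. The sketch assumes the standard Flesch Reading Ease constants applied to a single sentence, a rough vowel-group syllable counter, and a parameter `k` for how many sentences to keep at each end; all three are illustrative assumptions, not details given in the patent.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (an assumed heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fre(sentence):
    """Flesch Reading Ease of one sentence, using the standard FRE constants."""
    words = re.findall(r"[A-Za-z]+", sentence)
    tw = len(words)                                # tw(s): number of words
    ts = sum(count_syllables(w) for w in words)    # ts(s): number of syllables
    return 206.835 - 1.015 * tw - 84.6 * ts / tw

def split_simple_complex(sentences, k, low=20.0, high=100.0):
    """Drop extreme scores, then keep the k simplest (S) and k hardest (C)."""
    scored = sorted(
        ((fre(s), s) for s in sentences if re.search(r"[A-Za-z]", s)),
        reverse=True,
    )
    kept = [s for score, s in scored if low <= score <= high]
    k = min(k, len(kept) // 2)                     # never let S and C overlap
    return kept[:k], kept[len(kept) - k:]
```

Everything between the top `k` and bottom `k` sentences is discarded, which corresponds to removing the intermediate scores so that S and C are clearly separated.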

Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which stores the probability of translating one word into another. In PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by Equation (4), in which cos denotes cosine similarity. Because learning translation probabilities between all pairs of words is infeasible, this embodiment selects the 300,000 most frequent words and, for each of them, computes probabilities only to its 200 most similar words. For proper nouns, only the probability of translating the word to itself is computed.
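Equation (4) itself is not reproduced in this text, so the sketch below is only one plausible way to combine the two ingredients that Step 4 names. It assumes that the score of a target word is its cosine similarity to the source word multiplied by its corpus frequency, normalised over the nearest neighbours. This scoring rule and the function names are hypothetical and should not be read as the patented formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def phrase_table_row(word, vectors, freq, top_k=200):
    """Translation probabilities from `word` to its top_k most similar words.

    ASSUMPTION: score = max(cosine, 0) * frequency, normalised to sum to 1.
    """
    sims = {w: cosine(vectors[word], vec) for w, vec in vectors.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:top_k]
    scores = {w: max(sims[w], 0.0) * freq[w] for w in nearest}
    total = sum(scores.values())
    return {w: sc / total for w, sc in scores.items()} if total else {}
```

The default `top_k=200` mirrors the embodiment's choice of the 200 most similar words; restricting the rows to the 300,000 most frequent words would be done by the caller.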

Step 5. Train a language model with the KenLM algorithm on the simplified sentence set S and on the complex sentence set C obtained in Step 3, producing the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged throughout the subsequent iterative learning. A language model assigns a probability to a given word sequence. By scoring word sequences, the simplified and complex language models improve the quality of the simplification model in two ways: they guide local substitutions and they guide word reordering.
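To show what a language model contributes in Step 5, here is a toy bigram model with add-one smoothing. It is a stand-in written for illustration only: KenLM trains higher-order n-gram models with modified Kneser-Ney smoothing, which this sketch does not implement, and the class name `BigramLM` is an assumption.

```python
import math
from collections import Counter

class BigramLM:
    """Toy add-one-smoothed bigram model; a stand-in for KenLM, not its algorithm."""

    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for s in sentences:
            tokens = ["<s>"] + s.lower().split() + ["</s>"]
            self.unigrams.update(tokens[:-1])            # counts of history words
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(set(self.unigrams) | {"</s>"})

    def log_prob(self, sentence):
        """Log-probability of a whitespace-tokenised sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )
```

Training one instance on S and another on C gives the analogue of LM_S and LM_C; a fluent word order scores higher than a scrambled one, which is how the model can guide reordering.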

Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to construct the initial algorithm that simplifies complex sentences into simplified sentences. PBMT was first proposed in the 2003 paper "Statistical phrase-based translation" for machine translation between two languages. Given a complex sentence c, the algorithm uses Equation (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5}$$

In Equation (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S.
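The selection rule in Equation (5) can be sketched with an exhaustive word-by-word search. This is a simplification under stated assumptions: a real PBMT decoder uses multi-word phrases, reordering and beam search, whereas this sketch enumerates every single-word substitution. The layout `phrase_table[s_word][c_word]` and the function name `simplify` are illustrative.

```python
import math
from itertools import product

def simplify(complex_sentence, phrase_table, lm_log_prob):
    """Pick s' = argmax_s p(c|s) p(s) over word-by-word substitutions.

    phrase_table[s_word][c_word] -- assumed p(c_word | s_word)
    lm_log_prob(sentence)        -- log p(s) from the simplified language model
    """
    options = []
    for c in complex_sentence.split():
        # Simple words that can generate this complex word; keep the word if none.
        cands = [(s, probs[c]) for s, probs in phrase_table.items() if probs.get(c, 0) > 0]
        options.append(cands or [(c, 1.0)])

    best, best_score = None, -math.inf
    for combo in product(*options):
        sentence = " ".join(word for word, _ in combo)
        # log p(c|s) as a sum of per-word log-probabilities, plus log p(s).
        score = sum(math.log(p) for _, p in combo) + lm_log_prob(sentence)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

Because the search is exponential in sentence length, it is only suitable for short examples; it is meant to make the two factors of Equation (5) concrete, not to replace a decoder.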

Step 7. Because only non-parallel corpora are available, start from the initial PBMT algorithm and iteratively apply the back-translation strategy. This turns a very hard unsupervised learning problem into a supervised learning task and produces a better text simplification algorithm.

Step 7.1. First use the initial complex-to-simplified PBMT algorithm to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_0. Then execute Steps 7.2 to 7.5 in a loop, with the iteration index i running from 1 to N.

Step 7.2. Using the synthetic parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences.

Step 7.3. Use the simplified-to-complex algorithm from Step 7.2 to translate the simplified sentence set S, obtaining a new synthetic complex sentence set C_i.

Step 7.4. Using the synthetic parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences.

Step 7.5. Use the complex-to-simplified algorithm from Step 7.4 to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_i. Return to Step 7.2 and repeat until N iterations have been completed. In this embodiment, N is set to 3.

Intuitively, because the input to the PBMT algorithm is noisy, many entries in the phrase table are incorrect. Even so, the language model helps to correct some of these errors while simplified sentences are being generated. Whenever that happens, both the phrase table and the translation algorithm improve as the iterations continue, and as more phrase-table entries are corrected, the PBMT algorithm becomes increasingly strong.
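The control flow of Steps 7.1 to 7.5 can be sketched as a short loop. The trainer is passed in as a callable because the patent does not name a concrete training routine: `train_pbmt` and `initial_c2s` are hypothetical placeholders, and in a real system the trainer would also receive the fixed language models LM_S and LM_C.

```python
def back_translation(C, S, initial_c2s, train_pbmt, n_iters=3):
    """Iterative back-translation over non-parallel sets C (complex) and S (simple).

    initial_c2s -- callable mapping a complex sentence to a simple one (Step 6)
    train_pbmt  -- hypothetical trainer: takes (source, target) sentence lists
                   and returns a callable that translates source -> target
    """
    synthetic_S = [initial_c2s(c) for c in C]      # Step 7.1: S_0
    c2s = initial_c2s
    for _ in range(n_iters):
        s2c = train_pbmt(synthetic_S, C)           # Step 7.2: simple -> complex
        synthetic_C = [s2c(s) for s in S]          # Step 7.3: C_i
        c2s = train_pbmt(synthetic_C, S)           # Step 7.4: complex -> simple
        synthetic_S = [c2s(c) for c in C]          # Step 7.5: S_i
    return c2s
```

In each round the real sentences are always used as the target side of the synthetic pairs, so the noise sits on the source side only; the default `n_iters=3` mirrors the value of N used in this embodiment.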

The invention is not limited to the embodiment described above. On the basis of the disclosed technical solution, those skilled in the art can substitute or modify some of its technical features without creative effort, and all such substitutions and modifications fall within the scope of protection of the invention.

Claims (3)

1. An unsupervised algorithm for automatically simplifying English sentences, characterized in that it proceeds in the following steps:
Step 1. Take the public English Wikipedia corpus D as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t; word vectors obtained with Word2vec capture the semantic features of words well; the Word2vec embeddings are learned with the Skip-Gram model; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words that appear in the context window of t; the log-probability of observing the context word set is defined as:

$$\log p(W_t \mid t) = \sum_{w \in W_t} \log \frac{\exp\left({v'_w}^{\top} v_t\right)}{\sum_{w' \in V} \exp\left({v'_{w'}}^{\top} v_t\right)} \tag{1}$$

In Equation (1), v'_w is the context vector representation of word w, and V is the vocabulary of D; the overall Skip-Gram objective function is then defined as:

$$\sum_{t \in D} \log p(W_t \mid t) \tag{2}$$

The word vector representations are learned by maximizing the objective function in Equation (2);
Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3. Using the Wikipedia corpus D, obtain the simplified sentence set S and the complex sentence set C;
Step 4. Using the word vector representations and the word frequencies, populate the phrase table PT (Phrase Table), which stores the probability of translating one word into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by Equation (4), in which cos denotes cosine similarity;
Step 5. Train a language model with the KenLM algorithm on the simplified sentence set S and on the complex sentence set C separately, obtaining the simplified language model LM_S and the complex language model LM_C; LM_S and LM_C remain unchanged throughout the subsequent iterative learning;
Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to construct the initial algorithm that simplifies complex sentences into simplified sentences; given a complex sentence c, the algorithm uses Equation (5) to score each candidate sentence s formed from different combinations of words, and selects the highest-scoring sentence s' as the simplified sentence:

$$s' = \arg\max_{s}\; p(c \mid s)\, p(s) \tag{5}$$

In Equation (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S;
Step 7. Starting from the initial PBMT algorithm, iteratively apply the back-translation strategy to produce a better text simplification algorithm.
2. The unsupervised algorithm for automatically simplifying English sentences according to claim 1, characterized in that Step 3 specifically comprises:
Step 3.1. Score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in Equation (3), and sort the sentences by score from high to low:

$$\mathrm{FRE}(s) = 206.835 - 1.015\,\mathrm{tw}(s) - 84.6\,\frac{\mathrm{ts}(s)}{\mathrm{tw}(s)} \tag{3}$$

In Equation (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
Step 3.2. Remove the sentences that score above 100, remove the sentences that score below 20, and remove the sentences with intermediate scores; finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C.
3. The unsupervised algorithm for automatically simplifying English sentences according to claim 1, characterized in that Step 7 specifically comprises:
Step 7.1. First use the initial complex-to-simplified PBMT algorithm to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_0; then execute Steps 7.2 to 7.5 in a loop, with the iteration index i running from 1 to N;
Step 7.2. Using the synthetic parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates simplified sentences into complex sentences;
Step 7.3. Use the simplified-to-complex algorithm from Step 7.2 to translate the simplified sentence set S, obtaining a new synthetic complex sentence set C_i;
Step 7.4. Using the synthetic parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm that translates complex sentences into simplified sentences;
Step 7.5. Use the complex-to-simplified algorithm from Step 7.4 to translate the complex sentence set C, obtaining a new synthetic simplified sentence set S_i; return to Step 7.2 and repeat until N iterations have been completed.
CN201910354246.1A 2019-04-29 2019-04-29 An unsupervised automatic simplification algorithm for English sentences Active CN110096705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 An unsupervised automatic simplification algorithm for English sentences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 An unsupervised automatic simplification algorithm for English sentences

Publications (2)

Publication Number Publication Date
CN110096705A true CN110096705A (en) 2019-08-06
CN110096705B CN110096705B (en) 2023-09-08

Family

ID=67446309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354246.1A Active CN110096705B (en) 2019-04-29 2019-04-29 An unsupervised automatic simplification algorithm for English sentences

Country Status (1)

Country Link
CN (1) CN110096705B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 A method for automatic extraction of document summaries based on word vectors
CN105447206A (en) * 2016-01-05 2016-03-30 深圳市中易科技有限责任公司 New comment object identifying method and system based on word2vec algorithm
CN108334495A (en) * 2018-01-30 2018-07-27 国家计算机网络与信息安全管理中心 Short text similarity calculating method and system
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Automatic keyword extraction method based on gravitational model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Takumi Maruyama et al.: "Sentence simplification with core vocabulary", 2017 International Conference on Asian Language Processing (IALP) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110427629B (en) * 2019-08-13 2024-02-06 思必驰科技股份有限公司 Semi-supervised text simplified model training method and system
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113807098A (en) * 2021-08-26 2021-12-17 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN113807098B (en) * 2021-08-26 2023-01-10 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN117808124A (en) * 2024-02-29 2024-04-02 云南师范大学 Llama 2-based text simplification method
CN117808124B (en) * 2024-02-29 2024-05-03 云南师范大学 A text simplification method based on Llama2

Also Published As

Publication number Publication date
CN110096705B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN110852117B (en) Effective data enhancement method for improving translation effect of neural machine
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN110096705B (en) An unsupervised automatic simplification algorithm for English sentences
CN106484681B (en) A kind of method, apparatus and electronic equipment generating candidate translation
CN109858042B (en) Translation quality determining method and device
US9176936B2 (en) Transliteration pair matching
CN107480144B (en) Image natural language description generation method and device with cross-language learning ability
US11669695B2 (en) Translation method, learning method, and non-transitory computer-readable storage medium for storing translation program to translate a named entity based on an attention score using neural network
CN103853710A (en) Coordinated training-based dual-language named entity identification method
CN103678285A (en) Machine translation method and machine translation system
JP2009140503A (en) Method and apparatus for translating speech
CN102799579A (en) Statistical machine translation method with error self-diagnosis and self-correction functions
CN108460027A (en) A kind of spoken language instant translation method and system
CN106156013B (en) A two-stage machine translation method with fixed collocation type phrase priority
Liu et al. Morphological segmentation for Seneca
CN103810993A (en) Text phonetic notation method and device
CN107239449A (en) A kind of English recognition methods and interpretation method
CN119204046B (en) Medical machine translation self-learning method, device and electronic equipment
JP2016224483A (en) Model learning device, method and program
CN113822053B (en) Syntax error detection method, device, electronic device and storage medium
CN117149987B (en) Training method and device for multilingual dialogue state tracking model
CN117910483A (en) Translation method, translation device, electronic equipment and storage medium
CN109446537B (en) A translation evaluation method and device for machine translation
Singh et al. Urdu to Punjabi machine translation: an incremental training approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant