CN110096705A - An unsupervised automatic simplification algorithm for English sentences
- Publication number
- CN110096705A (application number CN201910354246.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- word
- algorithm
- complex
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an unsupervised automatic simplification algorithm for English sentences in the Internet field, carried out in the following steps: step 1, train vector representations of words; step 2, obtain word frequencies; step 3, obtain a set of simplified sentences and a set of complex sentences; step 4, fill the phrase table; step 5, train a simplified-sentence language model and a complex-sentence language model; step 6, build a phrase-based sentence simplification model; step 7, iteratively apply a back-translation strategy to train a better sentence simplification model. Without using any annotated parallel corpus, the invention makes full use of the English Wikipedia corpus and effectively improves the accuracy of English sentence simplification.
Description
Technical Field
The invention relates to an Internet text algorithm, and in particular to an unsupervised automatic simplification algorithm for English sentences.
Background
In recent years, text on the Internet has provided a great deal of useful knowledge and information to an ever wider range of users. However, the way online text is written, in both vocabulary and syntactic structure, can make it hard for many people to read and understand, especially people with low literacy, cognitive or language impairments, or limited knowledge of the language of the text. Text containing uncommon words or long, complex sentences is difficult not only for people to read and understand but also for machines to analyze. Automatic text simplification simplifies the content of the original text as far as possible while preserving its information, so that it can be read and understood more easily by a wider audience.
Existing text simplification algorithms borrow machine translation methods and learn to simplify from parallel corpora of complex and simplified sentences in the same language. Such algorithms are supervised learning tasks whose effectiveness depends heavily on large parallel simplification corpora. However, the existing English parallel simplification corpora are drawn mainly from the ordinary English Wikipedia and the children's edition of the English Wikipedia, with a matching algorithm selecting sentences from the two Wikipedias to form parallel sentence pairs. The parallel simplification corpora currently available are small and contain many pairs that are not simplifications as well as incorrect pairs, mainly because the children's edition is written by non-professionals and does not correspond one-to-one with the ordinary Wikipedia, which makes it hard to choose a suitable sentence matching algorithm. Because of these problems with the parallel corpora, existing text simplification algorithms do not perform very well.
Summary of the Invention
The object of the invention is to provide an unsupervised automatic simplification algorithm for English sentences that needs no parallel simplification corpus and uses only the publicly downloadable Wikipedia corpus to simplify English sentences automatically, so that users, in particular people with cognitive or language impairments, can read and understand English sentences more easily.
This object is achieved by an unsupervised automatic simplification algorithm for English sentences that proceeds as follows:
Step 1. Take the public English Wikipedia corpus D as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Word2vec is trained with the Skip-Gram model: given the corpus D and a word t, consider a sliding window centered on t and let W_t denote the set of words appearing in the context window of t. The log probability of observing the set of context words is defined as:
$$\log p(W_t \mid t) = \sum_{w \in W_t} \log \frac{\exp(v_t^{\top} v'_w)}{\sum_{w' \in V} \exp(v_t^{\top} v'_{w'})} \tag{1}$$
In formula (1), v'_w is the context vector of word w and V is the vocabulary of D. The overall Skip-Gram objective function is then defined as:
$$\sum_{t \in D} \log p(W_t \mid t) \tag{2}$$
The word vectors are learned by maximizing the objective in formula (2).
Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D.
Step 3. Using the Wikipedia corpus D, obtain the simplified sentence set S and the complex sentence set C.
Step 4. Using the word vectors and word frequencies, fill the phrase table PT (Phrase Table), which holds the probability of one word being translated into another. In PT, the translation probability p(t_j | t_i) from word t_i to word t_j is given by formula (4).
[Formula (4) appears as an image in the original publication and is not reproduced in this text.]
In formula (4), cos denotes cosine similarity.
Step 5. Train the language model algorithm KenLM separately on the simplified sentence set S and the complex sentence set C to obtain the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged during the later iterative learning.
Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to build a simplification algorithm from complex sentences to simplified sentences. Given a complex sentence c, the algorithm uses formula (5) to score each candidate sentence s formed from different word combinations, and the highest-scoring sentence s' is taken as the simplified sentence:
$$s' = \arg\max_{s} \; p(c \mid s)\, p(s) \tag{5}$$
In formula (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S.
Step 7. Starting from the initial PBMT algorithm, iteratively apply a back-translation strategy to produce a better text simplification algorithm.
As a further limitation of the invention, step 3 specifically comprises:
Step 3.1. Score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences by score from high to low:
$$\mathrm{FRE}(s) = 206.835 - 1.015\,\mathrm{tw}(s) - 84.6\,\frac{\mathrm{ts}(s)}{\mathrm{tw}(s)} \tag{3}$$
In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s.
Step 3.2. Discard the sentences scoring above 100, the sentences scoring below 20, and the sentences with intermediate scores. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C.
As a further limitation of the invention, step 7 specifically comprises:
Step 7.1. First use the initial complex-to-simplified PBMT algorithm to translate the complex sentence set C, giving a synthetic simplified sentence set S_0. Then repeat steps 7.2 to 7.5 for iterations i = 1 to N.
Step 7.2. Using the synthetic parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm from simplified sentences to complex sentences.
Step 7.3. Use this simplified-to-complex algorithm to translate the simplified sentence set S, giving a new synthetic complex sentence set C_i.
Step 7.4. Using the synthetic parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm from complex sentences to simplified sentences.
Step 7.5. Use this complex-to-simplified algorithm to translate the complex sentence set C, giving a new synthetic simplified sentence set S_i. Return to step 7.2 and repeat until N iterations have been completed.
Compared with the prior art, the invention has the following beneficial effects:
1. When filling the phrase table, the invention combines word vectors and word frequencies obtained from the Wikipedia corpus, capturing both the semantics of words and how often they are used. This removes the need of traditional phrase-based machine translation (PBMT) for a parallel corpus to fill the phrase table.
2. The invention treats the Wikipedia corpus as a knowledge base and scores its sentences with the Flesch Reading Ease (FRE) algorithm to obtain the simplified and complex sentence sets, so that the complex-sentence and simplified-sentence language models can be trained more accurately.
3. Using the resulting phrase table, complex-sentence language model and simplified-sentence language model, the invention builds an initial unsupervised text simplification algorithm on PBMT. Besides being unsupervised, the algorithm is simple, easy to interpret, and fast to train.
4. After building the initial simplification algorithm, the invention uses it to generate parallel corpora and then optimizes the text simplification model by back-translation, correcting possibly wrong entries in the initial phrase table and further improving performance.
Detailed Description
The invention is further described below with reference to a specific embodiment.
An unsupervised automatic simplification algorithm for English sentences proceeds as follows:
Step 1. Take the public English Wikipedia corpus D, which can be downloaded from "https://dumps.wikimedia.org/enwiki/", as the training corpus and use the word embedding algorithm Word2vec to obtain the vector representation v_t of each word t. Word vectors obtained with Word2vec capture the semantic features of words well. Once the vectors are available, the similarity between words can be computed, which helps find the set of highly similar words for each word. In this embodiment each vector has 300 dimensions and Word2vec is trained with the Skip-Gram model. Given the corpus D and a word t, consider a sliding window centered on t and let W_t denote the set of words appearing in the context window of t; the window covers the 5 words before t and the 5 words after it. The log probability of observing the set of context words is defined as:
$$\log p(W_t \mid t) = \sum_{w \in W_t} \log \frac{\exp(v_t^{\top} v'_w)}{\sum_{w' \in V} \exp(v_t^{\top} v'_{w'})} \tag{1}$$
In formula (1), v'_w is the context vector of word w and V is the vocabulary of D. The overall Skip-Gram objective function is then defined as:
$$\sum_{t \in D} \log p(W_t \mid t) \tag{2}$$
The word vectors are learned by maximizing the objective in formula (2) with stochastic gradient descent and negative sampling.
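To make the quantity in formula (1) concrete, the following is a minimal sketch in plain Python that evaluates the log probability of a context set under a full-softmax Skip-Gram model. It assumes the standard softmax form of the Skip-Gram objective; the toy vectors and the helper names `context_window` and `context_log_prob` are illustrative and do not come from the patent. Real training would use negative sampling rather than the full softmax shown here.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def context_window(tokens, i, size=5):
    """Words within `size` positions before and after position i (the set W_t)."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

def context_log_prob(center, context_words, word_vecs, ctx_vecs):
    """log p(W_t | t): sum over context words of a softmax log-probability."""
    v_t = word_vecs[center]
    # Score of the center word against every context vector in the vocabulary V.
    scores = {w: dot(v_t, v_ctx) for w, v_ctx in ctx_vecs.items()}
    # Log of the softmax denominator, computed stably.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return sum(scores[w] - log_z for w in context_words)

# Toy 2-dimensional vectors (hypothetical values, for illustration only).
word_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "tax": [0.0, 1.0]}
ctx_vecs = {"cat": [0.8, 0.1], "dog": [1.0, 0.0], "tax": [0.0, 1.2]}
```

Summing `context_log_prob` over every position in the corpus gives the objective of formula (2).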
Step 2. Using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D. In text simplification, measures of word complexity usually take word frequency into account: in general, the more frequent a word is, the easier it is to understand. Word frequency can therefore be used to pick the easiest word from the set of words highly similar to t.
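The counting in step 2 can be sketched as follows. This is an illustrative example only: the tokenization rule (lowercasing and keeping runs of letters and apostrophes) and the function names are assumptions, since the patent does not say how words are tokenized.

```python
import re
from collections import Counter

def word_frequencies(sentences):
    """Return f(t): the number of occurrences of each word t in the corpus."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts

def simplest_candidate(candidates, freq):
    """Pick the most frequent word among similar candidates (unseen words count as 0)."""
    return max(candidates, key=lambda w: freq[w])
```

For example, with a frequency table built from a corpus in which "cat" is common and "feline" is rare, `simplest_candidate(["feline", "cat"], freq)` selects "cat".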
Step 3. The Wikipedia corpus D is a very large corpus containing large numbers of both complex and simple sentences. Using D, obtain the simplified sentence set S and the complex sentence set C.
Step 3.1. Score each sentence s in the Wikipedia corpus D with the FRE (Flesch Reading Ease) algorithm, as in formula (3), and sort the sentences by score from high to low. A higher score means a simpler sentence and a lower score means a harder one.
$$\mathrm{FRE}(s) = 206.835 - 1.015\,\mathrm{tw}(s) - 84.6\,\frac{\mathrm{ts}(s)}{\mathrm{tw}(s)} \tag{3}$$
In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s. FRE is commonly used to evaluate the quality of the final output of text simplification models.
Step 3.2. Discard the sentences scoring above 100, the sentences scoring below 20, and the sentences with intermediate scores. Sentences with very high or very low scores are removed to exclude extreme cases; sentences with intermediate scores are removed to create a clear boundary between S and C. Finally, take the high-scoring sentences as the simplified sentence set S and the low-scoring sentences as the complex sentence set C. In this embodiment, S and C each contain 10 million sentences.
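Steps 3.1 and 3.2 can be sketched as below, assuming formula (3) is the standard Flesch Reading Ease formula applied to a single sentence. The syllable counter is a rough vowel-group heuristic, and the band limits `simple_min` and `complex_max` are hypothetical parameters: the patent states the outer cut-offs (100 and 20) but not where the discarded middle band begins and ends.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping one for a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def fre(sentence):
    """Flesch Reading Ease of one sentence: 206.835 - 1.015*tw - 84.6*ts/tw."""
    words = re.findall(r"[A-Za-z']+", sentence)
    tw = len(words)
    if tw == 0:
        raise ValueError("sentence contains no words")
    ts = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * tw - 84.6 * ts / tw

def split_corpus(sentences, score=fre, low=20.0, high=100.0,
                 simple_min=80.0, complex_max=40.0):
    """Sort by score (high to low) and keep only the two outer bands."""
    ranked = sorted(sentences, key=score, reverse=True)
    simple = [s for s in ranked if simple_min <= score(s) <= high]
    complex_ = [s for s in ranked if low <= score(s) <= complex_max]
    return simple, complex_
```

Passing the scoring function as a parameter keeps the split logic independent of the syllable heuristic, which is the least reliable part of this sketch.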
Step 4. Using the word vectors and word frequencies, fill the phrase table PT (Phrase Table), which holds the probability of one word being translated into another. In PT, the translation probability p(t_j | t_i) from word t_i to word t_j is given by formula (4).
[Formula (4) appears as an image in the original publication and is not reproduced in this text.]
In formula (4), cos denotes cosine similarity. Because learning translation probabilities between all words is infeasible, this embodiment selects the 300,000 most frequent words and, for each of them, computes probabilities only to its 200 most similar words. For proper nouns, only the probability of translating to the word itself is computed.
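The following sketch shows one way a phrase table could be filled from word vectors and frequencies. Formula (4) is not available in this text, so the weighting used here, frequency times cosine similarity normalized over the nearest neighbours, is an assumed form and not the patent's formula. The function names and the `proper_nouns` argument are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def fill_phrase_table(vectors, freq, top_k=200, proper_nouns=frozenset()):
    """Build table[t_i][t_j] = p(t_j | t_i) under an ASSUMED weighting scheme."""
    table = {}
    for ti, vi in vectors.items():
        if ti in proper_nouns:
            # Proper nouns only translate to themselves.
            table[ti] = {ti: 1.0}
            continue
        # Keep only the top_k most similar words (200 in the embodiment).
        sims = sorted(((cosine(vi, vj), tj) for tj, vj in vectors.items()),
                      reverse=True)[:top_k]
        # Assumed weight: frequency times non-negative cosine similarity.
        weights = {tj: freq.get(tj, 0) * max(sim, 0.0) for sim, tj in sims}
        total = sum(weights.values())
        if total == 0:
            table[ti] = {ti: 1.0}
        else:
            table[ti] = {tj: w / total for tj, w in weights.items() if w > 0}
    return table
```

Under this assumed weighting, a rare word whose nearest neighbour is frequent receives most of its probability mass on that frequent neighbour, which matches the stated intent of combining semantic similarity with ease of understanding.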
Step 5. Train the language model algorithm KenLM separately on the simplified sentence set S and the complex sentence set C obtained in step 3 to obtain the simplified language model LM_S and the complex language model LM_C. LM_S and LM_C remain unchanged during the later iterative learning. A language model assigns a probability to a given word sequence. By scoring word sequences, the simplified and complex language models help improve the quality of the simplification model through local substitutions and word reordering.
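To illustrate what a language model contributes here, the sketch below trains a tiny bigram model with add-one smoothing. This is a stand-in for illustration, not KenLM: KenLM builds higher-order n-gram models with modified Kneser-Ney smoothing, and the class name `BigramLM` and its interface are assumptions made for this example.

```python
import math
from collections import Counter

class BigramLM:
    """Add-one smoothed bigram language model (illustrative stand-in for KenLM)."""

    def __init__(self, sentences):
        self.history = Counter()   # counts of each word as a bigram history
        self.bigrams = Counter()   # counts of each (previous, current) pair
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            vocab.update(tokens)
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(self, sentence):
        """Natural-log probability of a whitespace-tokenized sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.history[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )
```

Training one such model on S and another on C yields two scorers that play the roles of LM_S and LM_C: a fluent word order scores higher than a scrambled one, which is what lets the model prefer good local substitutions and reorderings.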
Step 6. Using the phrase table PT, the simplified language model LM_S and the complex language model LM_C, apply the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation) to build a simplification algorithm from complex sentences to simplified sentences. PBMT was first proposed in "Statistical phrase-based translation" (2003) for machine translation between two languages. Given a complex sentence c, the algorithm uses formula (5) to score each candidate sentence s formed from different word combinations, and the highest-scoring sentence s' is taken as the simplified sentence:
$$s' = \arg\max_{s} \; p(c \mid s)\, p(s) \tag{5}$$
In formula (5), the PBMT algorithm decomposes p(c|s) as an inner product over the phrase table PT, and p(s) is the probability of sentence s, obtained from the language model LM_S.
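The argmax in formula (5) can be sketched with a toy word-by-word decoder. This is a simplification under stated assumptions: it substitutes single words only, treats p(c|s) as a product of per-word phrase-table probabilities, and enumerates every combination exhaustively, whereas practical PBMT decoders use beam search and also handle reordering. The table layout `phrase_table[s][c] = p(c | s)` and the function name `simplify` are hypothetical.

```python
import itertools
import math

def simplify(complex_sentence, phrase_table, lm_log_prob):
    """Return the candidate s maximizing log p(c|s) + log p(s)."""
    words = complex_sentence.split()
    options = []
    for c in words:
        # Simple words s that can generate the complex word c, with p(c | s).
        cands = [(s, probs[c]) for s, probs in phrase_table.items()
                 if probs.get(c, 0.0) > 0.0]
        # Words with no table entry are copied unchanged with probability 1.
        options.append(cands or [(c, 1.0)])
    best, best_score = None, -math.inf
    for combo in itertools.product(*options):
        candidate = " ".join(s for s, _ in combo)
        score = sum(math.log(p) for _, p in combo) + lm_log_prob(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Here `lm_log_prob` stands for any function returning the log probability of a sentence under the simplified language model. The exhaustive search grows exponentially with sentence length, so it is suitable only for illustrating the scoring rule.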
Step 7. Since only non-parallel corpora are available, the initial PBMT algorithm is used to apply a back-translation strategy iteratively. This turns a very hard unsupervised learning problem into a supervised learning task and so produces a better text simplification algorithm.
Step 7.1. First use the initial complex-to-simplified PBMT algorithm to translate the complex sentence set C, giving a synthetic simplified sentence set S_0. Then repeat steps 7.2 to 7.5 for iterations i = 1 to N.
Step 7.2. Using the synthetic parallel corpus (S_{i-1}, C), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm from simplified sentences to complex sentences.
Step 7.3. Use this simplified-to-complex algorithm to translate the simplified sentence set S, giving a new synthetic complex sentence set C_i.
Step 7.4. Using the synthetic parallel corpus (C_i, S), the simplified language model LM_S and the complex language model LM_C, train a new PBMT algorithm from complex sentences to simplified sentences.
Step 7.5. Use this complex-to-simplified algorithm to translate the complex sentence set C, giving a new synthetic simplified sentence set S_i. Return to step 7.2 and repeat until N iterations have been completed. In this embodiment, N is set to 3.
Intuitively, because the input to the PBMT algorithm is noisy, many entries in the phrase table are incorrect. Even so, the language model can help correct some of these errors while simplified sentences are generated. Whenever that happens, both the phrase table and the translation algorithm improve as the iterations continue, and as more phrase-table entries are corrected the PBMT algorithm becomes stronger.
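The control flow of steps 7.1 to 7.5 can be sketched as a short loop. The sketch assumes a hypothetical callable `train_pbmt(source, target)` that trains a PBMT model on a synthetic parallel corpus and returns a function translating one sentence; the fixed language models LM_S and LM_C are taken to be captured inside that callable. None of these names come from the patent.

```python
def iterative_back_translation(C, S, initial_c2s, train_pbmt, n_iters=3):
    """Refine a complex-to-simple translator by iterative back-translation.

    C, S        : lists of complex and simple sentences (non-parallel).
    initial_c2s : initial complex-to-simple translation function (step 6).
    train_pbmt  : hypothetical trainer, (source, target) -> translate function.
    """
    c2s = initial_c2s
    s_syn = [c2s(c) for c in C]            # step 7.1: synthetic S_0
    for _ in range(n_iters):
        s2c = train_pbmt(s_syn, C)         # step 7.2: train on (S_{i-1}, C)
        c_syn = [s2c(s) for s in S]        # step 7.3: synthetic C_i
        c2s = train_pbmt(c_syn, S)         # step 7.4: train on (C_i, S)
        s_syn = [c2s(c) for c in C]        # step 7.5: synthetic S_i
    return c2s
```

In each synthetic pair, one side is machine-generated and the other is genuine text, so every retraining round sees clean target sentences even when the source side is noisy.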
The invention is not limited to the embodiment described above. On the basis of the disclosed technical solution, a person skilled in the art can, without creative effort, substitute or modify some of the technical features according to the disclosed content, and all such substitutions and modifications fall within the scope of protection of the invention.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354246.1A CN110096705B (en) | 2019-04-29 | 2019-04-29 | An unsupervised automatic simplification algorithm for English sentences |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354246.1A CN110096705B (en) | 2019-04-29 | 2019-04-29 | An unsupervised automatic simplification algorithm for English sentences |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110096705A (en) | 2019-08-06 |
CN110096705B (en) | 2023-09-08 |
Family
ID=67446309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910354246.1A Active CN110096705B (en) | 2019-04-29 | 2019-04-29 | An unsupervised automatic simplification algorithm for English sentences |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096705B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427629A (en) * | 2019-08-13 | 2019-11-08 | 苏州思必驰信息科技有限公司 | Semi-supervised text simplified model training method and system |
CN112612892A (en) * | 2020-12-29 | 2021-04-06 | 达而观数据(成都)有限公司 | Special field corpus model construction method, computer equipment and storage medium |
CN113807098A (en) * | 2021-08-26 | 2021-12-17 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN117808124A (en) * | 2024-02-29 | 2024-04-02 | 云南师范大学 | Llama 2-based text simplification method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279478A (en) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Method for extracting features based on distributed mutual information documents |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | A method for automatic extraction of document summaries based on word vectors |
CN105447206A (en) * | 2016-01-05 | 2016-03-30 | 深圳市中易科技有限责任公司 | New comment object identifying method and system based on word2vec algorithm |
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Automatic keyword extraction method based on gravitational model |
Non-Patent Citations (1)
Title |
---|
TAKUMI MARUYAMA等: "Sentence simplification with core vocabulary", 《 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP)》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427629A (en) * | 2019-08-13 | 2019-11-08 | 苏州思必驰信息科技有限公司 | Semi-supervised text simplified model training method and system |
CN110427629B (en) * | 2019-08-13 | 2024-02-06 | 思必驰科技股份有限公司 | Semi-supervised text simplified model training method and system |
CN112612892A (en) * | 2020-12-29 | 2021-04-06 | 达而观数据(成都)有限公司 | Special field corpus model construction method, computer equipment and storage medium |
CN112612892B (en) * | 2020-12-29 | 2022-11-01 | 达而观数据(成都)有限公司 | Special field corpus model construction method, computer equipment and storage medium |
CN113807098A (en) * | 2021-08-26 | 2021-12-17 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN113807098B (en) * | 2021-08-26 | 2023-01-10 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN117808124A (en) * | 2024-02-29 | 2024-04-02 | 云南师范大学 | Llama 2-based text simplification method |
CN117808124B (en) * | 2024-02-29 | 2024-05-03 | 云南师范大学 | A text simplification method based on Llama2 |
Also Published As
Publication number | Publication date |
---|---|
CN110096705B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110852117B (en) | Effective data enhancement method for improving translation effect of neural machine | |
CN109359294B (en) | Ancient Chinese translation method based on neural machine translation | |
CN110597997B (en) | Military scenario text event extraction corpus iterative construction method and device | |
CN110096705B (en) | An unsupervised automatic simplification algorithm for English sentences | |
CN106484681B (en) | A kind of method, apparatus and electronic equipment generating candidate translation | |
CN109858042B (en) | Translation quality determining method and device | |
US9176936B2 (en) | Transliteration pair matching | |
CN107480144B (en) | Image natural language description generation method and device with cross-language learning ability | |
US11669695B2 (en) | Translation method, learning method, and non-transitory computer-readable storage medium for storing translation program to translate a named entity based on an attention score using neural network | |
CN103853710A (en) | Coordinated training-based dual-language named entity identification method | |
CN103678285A (en) | Machine translation method and machine translation system | |
JP2009140503A (en) | Method and apparatus for translating speech | |
CN102799579A (en) | Statistical machine translation method with error self-diagnosis and self-correction functions | |
CN108460027A (en) | A kind of spoken language instant translation method and system | |
CN106156013B (en) | A two-stage machine translation method with fixed collocation type phrase priority | |
Liu et al. | Morphological segmentation for Seneca | |
CN103810993A (en) | Text phonetic notation method and device | |
CN107239449A (en) | A kind of English recognition methods and interpretation method | |
CN119204046B (en) | Medical machine translation self-learning method, device and electronic equipment | |
JP2016224483A (en) | Model learning device, method and program | |
CN113822053B (en) | Syntax error detection method, device, electronic device and storage medium | |
CN117149987B (en) | Training method and device for multilingual dialogue state tracking model | |
CN117910483A (en) | Translation method, translation device, electronic equipment and storage medium | |
CN109446537B (en) | A translation evaluation method and device for machine translation | |
Singh et al. | Urdu to Punjabi machine translation: an incremental training approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |