CN110096705A - Unsupervised automatic English sentence simplification algorithm - Google Patents

Unsupervised automatic English sentence simplification algorithm

Info

Publication number
CN110096705A
CN110096705A
Authority
CN
China
Prior art keywords
sentence
word
algorithm
complex
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910354246.1A
Other languages
Chinese (zh)
Other versions
CN110096705B (en)
Inventor
强继朋 (Qiang Jipeng)
李云 (Li Yun)
袁运浩 (Yuan Yunhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910354246.1A priority Critical patent/CN110096705B/en
Publication of CN110096705A publication Critical patent/CN110096705A/en
Application granted granted Critical
Publication of CN110096705B publication Critical patent/CN110096705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an unsupervised automatic English sentence simplification algorithm in the Internet field, carried out in the following steps: step 1, train vector representations of words; step 2, obtain word frequencies; step 3, obtain a simplified sentence set and a complex sentence set; step 4, fill the phrase table; step 5, train a simplified-sentence language model and a complex-sentence language model; step 6, build a phrase-based sentence simplification model; step 7, iteratively execute the back-translation strategy to train a better sentence simplification model. The present invention uses no annotated parallel corpus at all and makes full use of the English Wikipedia corpus, effectively improving the accuracy of English sentence simplification.

Description

Unsupervised automatic English sentence simplification algorithm
Technical field
The present invention relates to an Internet text processing algorithm, and in particular to an unsupervised automatic English sentence simplification algorithm.
Background technique
In recent years, text on the Internet has brought useful knowledge and information to an ever wider audience. However, for many people the writing style of online text, such as its vocabulary and syntactic structure, can be difficult to read and understand, especially for people with low literacy, with cognitive or language impairments, or with limited knowledge of the text's language. Text containing uncommon words or long, complex sentences is not only hard for people to read and understand but equally hard for machines to analyze. Automatic text simplification aims to simplify the content of a text as much as possible while retaining its original information, so that a wider audience can read and understand it more easily.
Existing text simplification algorithms borrow from machine translation: they learn to simplify from a parallel corpus of complex sentences and their simplified counterparts in the same language. Such text simplification is a supervised learning task whose effectiveness depends heavily on a large parallel simplification corpus. The English parallel simplification corpora available today are obtained mainly from ordinary English Wikipedia and its children's edition, using matching algorithms to select sentence pairs across the two Wikipedias as parallel pairs. The parallel simplified corpora currently obtainable are not only small but also contain many sentence pairs that are not simplifications or that are erroneous, chiefly because the children's edition is written by non-experts and does not correspond one-to-one with ordinary Wikipedia, which makes it hard to devise a suitable sentence matching algorithm. These problems with parallel simplification corpora leave existing text simplification algorithms performing poorly.
Summary of the invention
The object of the present invention is to provide an unsupervised automatic English sentence simplification algorithm that needs no parallel simplification corpus whatsoever and uses only the openly downloadable Wikipedia corpus to simplify English sentences automatically, so that users, especially people with cognitive or language impairments, can read and understand English sentences more easily.
The object of the present invention is achieved as follows: an unsupervised automatic English sentence simplification algorithm, carried out in the following steps:
Step 1: use the publicly available English Wikipedia corpus D as the training corpus and obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors learned by Word2vec capture the semantics of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:

log p(W_t | t) = Σ_{w ∈ W_t} log p(w | t), where p(w | t) = exp(v'_w · v_t) / Σ_{w' ∈ V} exp(v'_{w'} · v_t)   (1)
In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

Σ_{t ∈ D} Σ_{w ∈ W_t} log p(w | t)   (2)
In formula (2), the vector representations of words are learned by maximizing this objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;
Step 4: using the word vectors and the word frequencies, fill a phrase table PT (Phrase Table) that records the probability of translating one word into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by formula (4) from the cosine similarity of the word vectors;
In formula (4), cos denotes the cosine similarity;
Step 5: for the simplified sentence set S and the complex sentence set C, train language models with the KenLM algorithm to obtain the simplified-sentence language model LM_S and the complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, build a simplification algorithm from complex to simplified sentences (written here as P_{c→s}) with the phrase-based machine translation algorithm PBMT (Phrase-Based Machine Translation); given a complex sentence c, P_{c→s} uses formula (5) to score the candidate sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase translation probabilities from the phrase table PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: starting from the initial PBMT model P_{c→s}, iteratively execute the back-translation (Back-translation) strategy to produce a better sentence simplification model.
As a further refinement of the present invention, step 3 specifically includes:
Step 3.1: score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences by score from high to low:

FRE(s) = 206.835 - 1.015 × tw(s) - 84.6 × ts(s) / tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
Step 3.2: remove the sentences scoring above 100, the sentences scoring below 20, and the sentences with middle scores; finally, select the set of high-scoring sentences as the simplified sentence set S and the set of low-scoring sentences as the complex sentence set C.
As a further refinement of the present invention, step 7 specifically includes:
Step 7.1: first translate the complex sentence set C with the P_{c→s} algorithm to obtain a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthetic parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from simplified to complex sentences (written here as P^i_{s→c});
Step 7.3: translate the simplified sentence set S with P^i_{s→c} to obtain a newly synthesized complex sentence set C_i;
Step 7.4: using the synthetic parallel corpus (C_i, S), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from complex to simplified sentences (written here as P^i_{c→s});
Step 7.5: translate the complex sentence set C with P^i_{c→s} to obtain a newly synthesized simplified sentence set S_i; then return to step 7.2 and repeat until N iterations have been completed.
Compared with the prior art, the beneficial effects of the present invention are:
1. While filling the phrase table, the present invention combines the word vector representations learned from the Wikipedia corpus with word frequencies, capturing both the semantics of words and how often they are used; this removes the need of the traditional phrase-based machine translation (PBMT) algorithm for a parallel corpus to fill the phrase table;
2. Using the Wikipedia corpus as the knowledge base, the present invention scores sentences with the Flesch Reading Ease (FRE) algorithm to obtain a simplified sentence set and a complex sentence set, so that the complex-sentence language model and the simplified-sentence language model can be trained more accurately;
3. Using the phrase table, the complex-sentence language model, and the simplified-sentence language model thus obtained, the present invention builds an initial unsupervised text simplification algorithm on top of PBMT; this algorithm is not only unsupervised but also simple, easy to interpret, and fast to train;
4. After building the initial simplification algorithm, the present invention uses it to generate synthetic parallel corpora and applies the back-translation strategy to optimize the existing text simplification model, correcting entries in the initial phrase table that may be wrong and further improving the model's performance.
Specific embodiment
The present invention is described further below with reference to a specific embodiment.
An unsupervised automatic English sentence simplification algorithm is carried out in the following steps:
Step 1: use the publicly available English Wikipedia corpus D, downloadable from "https://dumps.wikimedia.org/enwiki/", as the training corpus, and obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec. The word vectors learned by Word2vec capture the semantics of words well; once the vectors are available, the similarity between words can be computed, which helps find the set of highly similar words for each word. In this example the dimension of each vector is set to 300, and the Skip-Gram model is used to learn the Word2vec embeddings. Given the corpus D and a word t, consider a sliding window centered on t and let W_t denote the set of words appearing in the context window of t; the window is set to the 5 words before and the 5 words after t. The log probability of observing the context word set is defined as follows:

log p(W_t | t) = Σ_{w ∈ W_t} log p(w | t), where p(w | t) = exp(v'_w · v_t) / Σ_{w' ∈ V} exp(v'_{w'} · v_t)   (1)
In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

Σ_{t ∈ D} Σ_{w ∈ W_t} log p(w | t)   (2)
In formula (2), the vector representations of words are learned by maximizing this objective function with stochastic gradient descent and negative sampling.
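The embedding training in step 1 can be reproduced with off-the-shelf tooling. The following is a minimal sketch using the gensim library, assuming the Wikipedia dump has already been extracted to plain text with one sentence per line (the file name is a placeholder, not part of the patent):

```python
from gensim.models import Word2Vec

class SentenceIterator:
    """Stream tokenized sentences so the full corpus never sits in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(
    sentences=SentenceIterator("enwiki_sentences.txt"),
    vector_size=300,  # vector dimension, as set in this embodiment
    window=5,         # 5 words before and 5 after t, as set in this embodiment
    sg=1,             # sg=1 selects the Skip-Gram objective of formulas (1)-(2)
    negative=5,       # negative sampling, as described for formula (2)
    min_count=5,
)
v_t = model.wv["simple"]  # the vector representation v_t of the word t = "simple"
```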
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D. In the field of text simplification, the complexity of a word can be measured by its frequency: in general, the more frequent a word, the easier it is to understand. Word frequency can therefore be used to find the easiest word within the set of words highly similar to a word t.
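Counting f(t) is a single pass over the corpus; a minimal sketch, reusing the placeholder file name from the sketch above:

```python
from collections import Counter

f = Counter()  # f[t] is the number of occurrences of word t in D
with open("enwiki_sentences.txt", encoding="utf-8") as corpus:
    for line in corpus:
        f.update(line.lower().split())

# Among the nearest neighbours of a complex word, the most frequent
# candidate is taken to be the easiest to understand.
```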
Step 3: the Wikipedia corpus D is a very large corpus containing a great number of both complex and simple sentences; using D, obtain a simplified sentence set S and a complex sentence set C;
Step 3.1: score each sentence s in the Wikipedia corpus D with the FRE (Flesch Reading Ease) algorithm, as in formula (3), and sort the sentences by score from high to low; a higher score means a simpler sentence, and a lower score a more difficult one:

FRE(s) = 206.835 - 1.015 × tw(s) - 84.6 × ts(s) / tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s; the FRE algorithm is also commonly used to evaluate the quality of the final output of text simplification models;
Step 3.2: remove the sentences scoring above 100, the sentences scoring below 20, and the sentences with middle scores; removing the high- and low-scoring sentences discards extreme cases, while removing the middle scores establishes a clear boundary between S and C. Finally, select the set of high-scoring sentences as the simplified sentence set S and the set of low-scoring sentences as the complex sentence set C. In this example, S and C each contain 10 million sentences.
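A minimal sketch of the scoring and splitting, assuming the standard Flesch Reading Ease constants and a naive vowel-group syllable counter (a real implementation would use a pronunciation dictionary); the 80/40 band cut-offs are illustrative assumptions, since the patent only fixes the outer 100/20 limits:

```python
import re

def count_syllables(word):
    # Approximation: each maximal vowel group counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fre(sentence):
    words = sentence.split()
    tw = len(words)                              # tw(s): words in s
    ts = sum(count_syllables(w) for w in words)  # ts(s): syllables in s
    return 206.835 - 1.015 * tw - 84.6 * ts / tw

with open("enwiki_sentences.txt", encoding="utf-8") as corpus:
    scored = [(fre(line), line.strip()) for line in corpus if line.strip()]

# Discard extreme sentences (>100 or <20) and the middle band, then keep the
# high-scoring band as S and the low-scoring band as C.
kept = [(score, s) for score, s in scored if 20 <= score <= 100]
S = [s for score, s in kept if score >= 80]
C = [s for score, s in kept if score <= 40]
```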
Step 4: using the word vectors and the word frequencies, fill a phrase table PT (Phrase Table) that records the probability of translating one word into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by formula (4) from the cosine similarity of the word vectors.
In formula (4), cos denotes the cosine similarity. Since learning translation probabilities for all words is infeasible, in this example the 300,000 most frequent words are selected, and probabilities are computed only for the 200 most similar words of each; a proper noun is given a translation probability only to itself.
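A sketch of the phrase-table filling, reusing the gensim model and the Counter from the sketches above; the exact normalisation of the cosine similarities is an assumption, since only the use of cos in formula (4) is stated here:

```python
from collections import Counter
from gensim.models import Word2Vec

def build_phrase_table(model: Word2Vec, freq: Counter,
                       vocab_size: int = 300_000, n_candidates: int = 200):
    """Fill PT with p(t_j | t_i) for the most frequent words."""
    pt = {}
    frequent = [w for w, _ in freq.most_common(vocab_size) if w in model.wv]
    for t_i in frequent:
        # gensim returns (word, cosine similarity) pairs, highest first.
        neighbours = model.wv.most_similar(t_i, topn=n_candidates)
        positive = [(t_j, sim) for t_j, sim in neighbours if sim > 0]
        total = sum(sim for _, sim in positive)
        if total > 0:
            pt[t_i] = {t_j: sim / total for t_j, sim in positive}
    return pt

phrase_table = build_phrase_table(model, f)
```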
Step 5: for the simplified sentence set S and the complex sentence set C obtained in step 3, train language models with the KenLM algorithm to obtain the simplified-sentence language model LM_S and the complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning below. A language model computes the probability of a given word sequence over a corpus; by scoring word sequences, the simplified and complex language models help improve the quality of the simplification model when local substitutions and word reorderings are performed.
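KenLM models are trained with the lmplz command-line tool and queried from Python with the kenlm package; the file names below are placeholders:

```python
# Train the two n-gram models from the sentence sets of step 3, e.g.:
#   lmplz -o 5 < simple_sentences.txt  > lm_s.arpa
#   lmplz -o 5 < complex_sentences.txt > lm_c.arpa
import kenlm

lm_s = kenlm.Model("lm_s.arpa")  # simplified-sentence language model LM_S
lm_c = kenlm.Model("lm_c.arpa")  # complex-sentence language model LM_C

# Total log10 probability of a sentence, with begin/end-of-sentence markers.
log_p = lm_s.score("the cat sat on the mat", bos=True, eos=True)
```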
Step 6: using the phrase table PT, the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, build a simplification algorithm from complex to simplified sentences (written here as P_{c→s}) with the phrase-based machine translation algorithm PBMT (Phrase-Based Machine Translation); PBMT was first proposed in "Statistical phrase-based translation" and has been widely used for bilingual machine translation. Given a complex sentence c, P_{c→s} uses formula (5) to score the candidate sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase translation probabilities from the phrase table PT, and p(s), the probability of sentence s, is obtained from the language model LM_S.
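A full implementation of formula (5) would use a PBMT decoder such as Moses; the following toy sketch only makes greedy single-word substitutions, scoring each candidate by its phrase-table probability plus the language-model score, to illustrate the noisy-channel idea:

```python
import math

def simplify(c_sentence, phrase_table, lm):
    """Greedy word-by-word approximation of s' = argmax_s p(c|s) p(s)."""
    out = c_sentence.split()
    for i, w in enumerate(list(out)):
        # Keeping w itself corresponds to p = 1, i.e. log p = 0.
        best_word = w
        best_score = lm.score(" ".join(out), bos=True, eos=True)
        for cand, p in phrase_table.get(w, {}).items():
            trial = out[:i] + [cand] + out[i + 1:]
            # kenlm scores are log10, so the substitution probability is too.
            score = math.log10(p) + lm.score(" ".join(trial), bos=True, eos=True)
            if score > best_score:
                best_word, best_score = cand, score
        out[i] = best_word
    return " ".join(out)

simple_sentence = simplify("the committee endeavoured to ascertain the facts",
                           phrase_table, lm_s)
```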
Step 7: since only non-parallel corpora are available, starting from the initial PBMT model P_{c→s} and iteratively executing the back-translation (Back-translation) strategy converts a very difficult unsupervised learning problem into a supervised learning task, producing a better text simplification algorithm;
Step 7.1: first translate the complex sentence set C with the P_{c→s} algorithm to obtain a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthetic parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from simplified to complex sentences (written here as P^i_{s→c});
Step 7.3: translate the simplified sentence set S with P^i_{s→c} to obtain a newly synthesized complex sentence set C_i;
Step 7.4: using the synthetic parallel corpus (C_i, S), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from complex to simplified sentences (written here as P^i_{c→s});
Step 7.5: translate the complex sentence set C with P^i_{c→s} to obtain a newly synthesized simplified sentence set S_i; then return to step 7.2 and repeat until N iterations have been completed. In this example, N is set to 3.
Intuitively, because the input to the PBMT algorithm contains noise, many entries in the phrase table are incorrect. Nevertheless, the language model can help correct some of these mistakes while simplified sentences are being generated; whenever this happens, the phrase table and the translation algorithm are both strengthened as the iterations continue. As more entries in the phrase table are repaired, the PBMT algorithm grows stronger and stronger.
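The whole of step 7 can be summarised in a short loop. In the sketch below, train_pbmt and the initial model p_cs_0 are placeholder callables standing in for a real PBMT toolkit such as Moses; they are not APIs defined by the patent:

```python
def iterative_back_translation(p_cs_0, S, C, lm_s, lm_c, n_iters=3):
    S_i = [p_cs_0(c) for c in C]  # step 7.1: synthetic simplified set S_0
    p_cs = p_cs_0
    for i in range(1, n_iters + 1):
        # Step 7.2: simplified -> complex model from synthetic pairs (S_{i-1}, C).
        p_sc = train_pbmt(src=S_i, tgt=C, target_lm=lm_c)
        # Step 7.3: translate S to obtain the synthetic complex set C_i.
        C_i = [p_sc(s) for s in S]
        # Step 7.4: complex -> simplified model from synthetic pairs (C_i, S).
        p_cs = train_pbmt(src=C_i, tgt=S, target_lm=lm_s)
        # Step 7.5: re-translate C to refresh the synthetic simplified set S_i.
        S_i = [p_cs(c) for c in C]
    return p_cs
```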
The present invention is not limited to the above embodiment. On the basis of the technical solution disclosed by the present invention, a person skilled in the art can, without creative labor, make replacements and variations of some of the technical features according to the disclosed technical content, and such replacements and variations all fall within the scope of the present invention.

Claims (3)

1. An unsupervised automatic English sentence simplification algorithm, characterized in that it is carried out in the following steps:
Step 1: use the publicly available English Wikipedia corpus D as the training corpus and obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors learned by Word2vec capture the semantics of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:

log p(W_t | t) = Σ_{w ∈ W_t} log p(w | t), where p(w | t) = exp(v'_w · v_t) / Σ_{w' ∈ V} exp(v'_{w'} · v_t)   (1)
In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

Σ_{t ∈ D} Σ_{w ∈ W_t} log p(w | t)   (2)
In formula (2), the vector representations of words are learned by maximizing this objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;
Step 4: using the word vectors and the word frequencies, fill a phrase table PT (Phrase Table) that records the probability of translating one word into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is computed by formula (4) from the cosine similarity of the word vectors;
In formula (4), cos denotes the cosine similarity;
Step 5: for the simplified sentence set S and the complex sentence set C, train language models with the KenLM algorithm to obtain the simplified-sentence language model LM_S and the complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, build a simplification algorithm from complex to simplified sentences (written here as P_{c→s}) with the phrase-based machine translation algorithm PBMT (Phrase-Based Machine Translation); given a complex sentence c, P_{c→s} uses formula (5) to score the candidate sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase translation probabilities from the phrase table PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: starting from the initial PBMT model P_{c→s}, iteratively execute the back-translation (Back-translation) strategy to produce a better text simplification algorithm.
2. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 3 specifically includes:
Step 3.1: score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences by score from high to low:

FRE(s) = 206.835 - 1.015 × tw(s) - 84.6 × ts(s) / tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
Step 3.2: remove the sentences scoring above 100, the sentences scoring below 20, and the sentences with middle scores; finally, select the set of high-scoring sentences as the simplified sentence set S and the set of low-scoring sentences as the complex sentence set C.
3. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 7 specifically includes:
Step 7.1: first translate the complex sentence set C with the P_{c→s} algorithm to obtain a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthetic parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from simplified to complex sentences (written here as P^i_{s→c});
Step 7.3: translate the simplified sentence set S with P^i_{s→c} to obtain a newly synthesized complex sentence set C_i;
Step 7.4: using the synthetic parallel corpus (C_i, S), the simplified-sentence language model LM_S, and the complex-sentence language model LM_C, train a new PBMT model from complex to simplified sentences (written here as P^i_{c→s});
Step 7.5: translate the complex sentence set C with P^i_{c→s} to obtain a newly synthesized simplified sentence set S_i; then return to step 7.2 and repeat until N iterations have been completed.
CN201910354246.1A 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm Active CN110096705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Publications (2)

Publication Number Publication Date
CN110096705A true CN110096705A (en) 2019-08-06
CN110096705B CN110096705B (en) 2023-09-08

Family

ID=67446309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354246.1A Active CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Country Status (1)

Country Link
CN (1) CN110096705B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN105447206A (en) * 2016-01-05 2016-03-30 深圳市中易科技有限责任公司 New comment object identifying method and system based on word2vec algorithm
CN108334495A (en) * 2018-01-30 2018-07-27 国家计算机网络与信息安全管理中心 Short text similarity calculating method and system
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Takumi Maruyama et al.: "Sentence simplification with core vocabulary", 2017 International Conference on Asian Language Processing (IALP) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110427629B (en) * 2019-08-13 2024-02-06 思必驰科技股份有限公司 Semi-supervised text simplified model training method and system
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113807098A (en) * 2021-08-26 2021-12-17 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN113807098B (en) * 2021-08-26 2023-01-10 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN117808124A (en) * 2024-02-29 2024-04-02 云南师范大学 Llama 2-based text simplification method
CN117808124B (en) * 2024-02-29 2024-05-03 云南师范大学 Llama 2-based text simplification method

Also Published As

Publication number Publication date
CN110096705B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN110096705A (en) A kind of unsupervised english sentence simplifies algorithm automatically
CN110543639B (en) English sentence simplification algorithm based on pre-training transducer language model
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN107273355A (en) A kind of Chinese word vector generation method based on words joint training
McMahon et al. Language classification by numbers
Brodsky et al. Characterizing motherese: On the computational structure of child-directed language
US6188976B1 (en) Apparatus and method for building domain-specific language models
US20070174040A1 (en) Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN109858042B (en) Translation quality determining method and device
CN105068997A (en) Parallel corpus construction method and device
CN106202030A (en) A kind of rapid serial mask method based on isomery labeled data and device
CN106156013B (en) A kind of two-part machine translation method that regular collocation type phrase is preferential
CN106649289A (en) Realization method and realization system for simultaneously identifying bilingual terms and word alignment
Alqudsi et al. A hybrid rules and statistical method for Arabic to English machine translation
CN103810993B (en) Text phonetic notation method and device
Kondrak Identification of cognates and recurrent sound correspondences in word lists
CN106502988B (en) A kind of method and apparatus that objective attribute target attribute extracts
CN109376347A (en) A kind of HSK composition generation method based on topic model
CN111767743B (en) Machine intelligent evaluation method and system for translation test questions
CN102156692A (en) Forest-based system combination method for counting machine translation
CN106484670A (en) A kind of Chinese word segmentation error correction method, off-line training device and online treatment device
JP5555542B2 (en) Automatic word association apparatus, method and program thereof
JP2010027020A (en) Word alignment apparatus and program
JP5295037B2 (en) Learning device using Conditional Random Fields or Global Conditional Log-linearModels, and parameter learning method and program in the learning device
CN109446537B (en) Translation evaluation method and device for machine translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant