CN110096705A - Unsupervised automatic English sentence simplification algorithm - Google Patents
Unsupervised automatic English sentence simplification algorithm
- Publication number: CN110096705A (application CN201910354246.1A)
- Authority: CN (China)
- Prior art keywords: sentence, word, algorithm, complex, language model
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Granted
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/20—Natural language analysis › G06F40/279—Recognition of textual entities › G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/30—Semantic analysis
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/40—Processing or translation of natural language › G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS › Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE › Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE › Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an unsupervised automatic English sentence simplification algorithm in the internet field, carried out as follows: step 1, train vector representations of words; step 2, obtain word frequencies; step 3, obtain a simplified-sentence set and a complex-sentence set; step 4, fill the phrase table; step 5, train a simplified-sentence language model and a complex-sentence language model; step 6, build a phrase-based sentence simplification model; step 7, iteratively apply a back-translation strategy to train a better sentence simplification model. Without using any labeled parallel corpus, the invention makes full use of the English Wikipedia corpus and effectively improves the accuracy of English sentence simplification.
Description
Technical field
The present invention relates to an internet text algorithm, and in particular to an unsupervised automatic English sentence simplification algorithm.
Background art
In recent years, text on the internet has provided useful knowledge and information to an ever wider audience. However, for many people the writing style of online text, such as its vocabulary and syntactic structure, can be difficult to read and understand; this is especially true for people with low literacy, cognitive or language impairments, or limited knowledge of the language of the text. Text containing uncommon words or long, complex sentences is hard not only for people to read and understand but also for machines to analyze. Automatic text simplification aims to simplify the content of a text as much as possible while preserving its original information, so that a wider audience can read and understand it more easily.
Existing text simplification algorithms borrow from machine translation and learn to simplify from parallel pairs of complex and simplified sentences in one language. Such text simplification is a supervised learning task, and its effectiveness depends heavily on a large parallel simplification corpus. However, the English parallel simplification corpora available today are obtained mainly from ordinary English Wikipedia and the children's edition of English Wikipedia, using matching algorithms that select sentence pairs across the two different Wikipedias. The parallel simplification corpora that can currently be obtained are not only small but also contain many sentence pairs that are not simplifications at all, or that are simply wrong. This is mainly because the children's edition of Wikipedia is written by non-experts and does not correspond one-to-one with ordinary Wikipedia, which makes it difficult to choose a suitable sentence matching algorithm. Because of these problems with parallel simplification corpora, existing text simplification algorithms do not perform very well.
Summary of the invention
The object of the present invention is to provide an unsupervised automatic English sentence simplification algorithm that requires no parallel simplification corpus and uses only the openly downloadable Wikipedia corpus to simplify English sentences automatically, so that users, especially people with cognitive or language impairments, can read and understand English sentences more easily.
The object of the present invention is achieved as follows: an unsupervised automatic English sentence simplification algorithm, carried out as follows:
Step 1: using the publicly available English Wikipedia corpus D as the training corpus, obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors learned by Word2vec capture the semantic features of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:
log p(W_t | t) = Σ_{w∈W_t} log ( exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t) )   (1)
In formula (1), v'_w is the context vector representation of word w, and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:
J = Σ_{t∈D} log p(W_t | t)   (2)
In formula (2), the vector representations of words are learned by maximizing the objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified-sentence set S and a complex-sentence set C;
Step 4: using the vector representations and frequencies of words, fill the phrase table PT (Phrase Table), which gives the probability that one word is translated into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is calculated as follows:
p(t_j | t_i) = cos(v_{t_i}, v_{t_j}) / Σ_{t_k} cos(v_{t_i}, v_{t_k})   (4)
In formula (4), cos denotes the cosine similarity;
Step 5: for the simplified-sentence set S and the complex-sentence set C, train language models with the KenLM algorithm, obtaining a simplified-sentence language model LM_S and a complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning procedure below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation); given a complex sentence c, the algorithm P_{c→s} uses formula (5) to score the sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase-table entries from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: using the initial PBMT algorithm P_{c→s}^{(0)}, iteratively apply the back-translation strategy to produce a better text simplification algorithm.
As a further limitation of the invention, step 3 specifically includes:
Step 3.1: for each sentence s in the Wikipedia corpus D, score it with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences from high to low by score:
FRE(s) = 206.835 − 1.015 · tw(s) − 84.6 · ts(s) / tw(s)   (3)
In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
Step 3.2: remove the sentences scoring above 100 and those scoring below 20, and remove the sentences with intermediate scores; finally, select the remaining high-scoring sentences as the simplified-sentence set S and the remaining low-scoring sentences as the complex-sentence set C.
As a further limitation of the invention, step 7 specifically includes:
Step 7.1: first translate the complex-sentence set C with the algorithm P_{c→s}^{(0)} to obtain a newly synthesized simplified-sentence set S_0; then loop over steps 7.2 to 7.5 for iterations i = 1 to N;
Step 7.2: using the synthesized parallel corpus (S_{i−1}, C) and the language models LM_S and LM_C, train a new PBMT algorithm P_{s→c}^{(i)} from simplified sentences to complex sentences;
Step 7.3: translate the simplified-sentence set S with P_{s→c}^{(i)} to obtain a newly synthesized complex-sentence set C_i;
Step 7.4: using the synthesized parallel corpus (C_i, S) and the language models LM_S and LM_C, train a new PBMT algorithm P_{c→s}^{(i)} from complex sentences to simplified sentences;
Step 7.5: translate the complex-sentence set C with P_{c→s}^{(i)} to obtain a newly synthesized simplified-sentence set S_i; return to step 7.2 and repeat until N iterations are complete.
Compared with the prior art, the beneficial effects of the present invention are:
1. When filling the phrase table, the present invention combines the word vector representations learned from the Wikipedia corpus with word frequencies, capturing both the semantic information of words and how commonly they are used; this overcomes the traditional phrase-based machine translation (PBMT) algorithm's need for a parallel corpus to fill the phrase table.
2. Using the Wikipedia corpus as the knowledge base, the present invention scores sentences with the Flesch Reading Ease (FRE) algorithm to obtain a simplified-sentence set and a complex-sentence set, so that the complex-sentence language model and the simplified-sentence language model can be trained more accurately.
3. Using the obtained phrase table, complex-sentence language model and simplified-sentence language model, the present invention builds an initial unsupervised text simplification algorithm based on PBMT; this algorithm is not only unsupervised but also simple, easy to interpret and fast to train.
4. After building the initial simplification algorithm, the present invention uses it to generate parallel corpora and then optimizes the existing text simplification model with a back-translation strategy, correcting possibly erroneous entries in the initial phrase table and further improving the model's performance.
Specific embodiment
The present invention is further described below in combination with specific embodiments.
An unsupervised automatic English sentence simplification algorithm is carried out as follows:
Step 1: use the publicly available English Wikipedia corpus D, downloadable from "https://dumps.wikimedia.org/enwiki/", as the training corpus, and obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec. The word vectors learned by Word2vec capture the semantic features of words well; once the word vectors are obtained, word similarities can be computed, which helps find the set of words highly similar to each word. In this example, the dimension of each vector is set to 300, and the Skip-Gram model is used to learn the Word2vec embeddings. Given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the window is set to the 5 words before and the 5 words after t. The log probability of observing the context word set is defined as follows:
log p(W_t | t) = Σ_{w∈W_t} log ( exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t) )   (1)
In formula (1), v'_w is the context vector representation of word w, and V is the vocabulary of D. The overall objective function of Skip-Gram is then defined as follows:
J = Σ_{t∈D} log p(W_t | t)   (2)
In formula (2), the vector representations of words are learned by maximizing the objective function with stochastic gradient descent and negative sampling.
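The softmax log-probability of formula (1) can be computed directly. The sketch below uses a tiny toy vocabulary with random 8-dimensional vectors in place of the trained 300-dimensional Wikipedia embeddings; the vocabulary and the function name `log_p_context` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with randomly initialized vectors, standing in for the
# 300-dimensional Word2vec embeddings trained on the Wikipedia corpus D.
vocab = ["the", "cat", "sat", "on", "mat"]
idx = {t: i for i, t in enumerate(vocab)}
v = rng.normal(size=(len(vocab), 8))      # word vectors v_t
v_ctx = rng.normal(size=(len(vocab), 8))  # context vectors v'_w

def log_p_context(center, context_words):
    """Formula (1): sum over the context window W_t of the softmax
    log-probability log p(w | t) of each context word w given center t."""
    scores = v_ctx @ v[idx[center]]           # v'_w . v_t for every w in V
    log_z = np.log(np.exp(scores).sum())      # log of the softmax denominator
    return float(sum(scores[idx[w]] - log_z for w in context_words))

lp = log_p_context("cat", ["the", "sat"])
```

Maximizing the sum of such terms over all center words in D is exactly objective (2); in practice the denominator is approximated by negative sampling rather than computed over the full vocabulary.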
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D. In the field of text simplification, the complexity of a word can be measured by its frequency: in general, the higher a word's frequency, the easier it is to understand. Word frequency can therefore be used to find, within the set of words highly similar to a word t, the word that is easiest to understand.
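This frequency-based choice can be sketched with a standard counter; the toy corpus and the hand-picked similar-word set below are stand-ins for the Wikipedia data and the Word2vec similarities of step 1:

```python
from collections import Counter

# Tiny stand-in for the Wikipedia corpus D.
corpus = "the quick dog saw the rapid fox and the quick cat".split()
f = Counter(corpus)  # f(t): number of occurrences of word t in D

# Among the words similar to "rapid" (hand-picked here; in the algorithm
# they come from word-vector similarity), take the most frequent word
# as the easiest to understand.
similar_to_rapid = ["quick", "fast", "rapid"]
easiest = max(similar_to_rapid, key=lambda t: f[t])
```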
Step 3: the Wikipedia corpus D is extremely large and contains a great many complex sentences and simple sentences. Using D, obtain the simplified-sentence set S and the complex-sentence set C.
Step 3.1: for each sentence s in D, score it with the FRE (Flesch Reading Ease) algorithm, as in formula (3), and sort the sentences from high to low by score; a higher score means a simpler sentence, and a lower score means a more difficult one:
FRE(s) = 206.835 − 1.015 · tw(s) − 84.6 · ts(s) / tw(s)   (3)
In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s. The FRE algorithm is also commonly used to evaluate the quality of the final output of text simplification models.
Step 3.2: remove the sentences scoring above 100 and those scoring below 20, and remove the sentences with intermediate scores. Removing the very high and very low scores discards extreme sentences; removing the intermediate scores establishes a clear boundary between S and C. Finally, select the remaining high-scoring sentences as the simplified-sentence set S and the remaining low-scoring sentences as the complex-sentence set C. In this example, S and C each contain 10 million selected sentences.
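The scoring and banding of step 3 can be sketched as follows. The syllable counter is a rough vowel-group heuristic of our own (the patent does not specify one), and the per-sentence form of formula (3) is the standard Flesch formula applied to a single sentence:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (assumption;
    the patent does not specify a syllable-counting method)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fre(sentence):
    # FRE(s) = 206.835 - 1.015*tw(s) - 84.6*ts(s)/tw(s), i.e. formula (3)
    # with tw = word count and ts = syllable count of the sentence.
    words = sentence.split()
    tw = len(words)
    ts = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * tw - 84.6 * ts / tw

sentences = [
    "The cat sat",
    "Notwithstanding considerable institutional impediments the committee persevered",
]
# Sort from high (simple) to low (complex); the high band feeds S, the low band C.
scored = sorted(sentences, key=fre, reverse=True)
```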
Step 4: using the vector representations and frequencies of words, fill the phrase table PT (Phrase Table), which gives the probability that one word is translated into another. In PT, the translation probability p(t_j | t_i) from word t_i to word t_j is calculated as follows:
p(t_j | t_i) = cos(v_{t_i}, v_{t_j}) / Σ_{t_k} cos(v_{t_i}, v_{t_k})   (4)
In formula (4), cos denotes the cosine similarity. Since learning translation probabilities for all words is infeasible, in this example the 300,000 most frequent words are selected, and probabilities are computed only for the 200 most similar words of each; a proper noun is assigned probability only to itself.
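Filling one row of the phrase table can be sketched as below. The formula in the original is an image, so normalized cosine similarity is one plausible reading of formula (4); the actual formula may additionally weight by the frequency f(t_j). The 2-dimensional vectors are hand-made stand-ins for the trained embeddings:

```python
import numpy as np

# Hand-made 2-d stand-ins for the word vectors trained in step 1.
vecs = {
    "big":      np.array([1.0, 0.2]),
    "large":    np.array([0.9, 0.3]),
    "huge":     np.array([0.8, 0.5]),
    "gigantic": np.array([0.7, 0.6]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def phrase_table_row(t_i, candidates):
    """One row of PT: cosine similarities of t_i to its candidate words,
    normalized into translation probabilities p(t_j | t_i)."""
    sims = {t_j: cos(vecs[t_i], vecs[t_j]) for t_j in candidates}
    z = sum(sims.values())
    return {t_j: s / z for t_j, s in sims.items()}

row = phrase_table_row("gigantic", ["big", "large", "huge"])
```

In the full system each of the 300,000 frequent words gets such a row over its 200 nearest neighbors.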
Step 5: for the simplified-sentence set S and the complex-sentence set C obtained in step 3, train language models with the KenLM algorithm, obtaining a simplified-sentence language model LM_S and a complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning procedure below. A language model computes the probability of a given word sequence over its corpus. By computing word-sequence probabilities, the simplified-sentence and complex-sentence language models help improve the quality of the simplification model in two ways: guiding local word substitutions and guiding word-order rearrangement.
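The patent trains LM_S and LM_C with KenLM; purely to illustrate what "computing the probability of a word sequence" means, here is a toy add-one-smoothed bigram model (the function names and smoothing choice are our own, not KenLM's):

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Tiny add-one-smoothed bigram language model, a stand-in for KenLM,
    shown only to illustrate scoring word sequences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])          # history counts
        bigrams.update(zip(toks, toks[1:]))  # adjacent-pair counts
    vocab_size = len(set(unigrams) | {"</s>"})
    def log_p(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(toks, toks[1:]))
    return log_p

lm_s = train_bigram_lm(["the cat sat", "the dog ran"])
```

A fluent simplified sentence scores higher under LM_S than a scrambled one, which is what lets the language model veto bad substitutions and bad word orders.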
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation). The PBMT algorithm was first proposed in "Statistical Phrase-Based Translation" (2003) for bilingual machine translation. Given a complex sentence c, the algorithm P_{c→s} uses formula (5) to score the sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase-table entries from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S.
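A real implementation would run a full PBMT decoder (phrase segmentation, reordering, beam search, e.g. Moses); the sketch below is only a greedy word-by-word caricature of formula (5), with a hand-made phrase table and a unigram stand-in for LM_S. All entries and probabilities are hypothetical:

```python
# Candidate simplified words s for each complex word c, with channel
# probability p(c | s) - a tiny hand-made slice of the phrase table PT.
PT = {
    "gigantic": {"big": 0.6, "gigantic": 0.4},
    "feline":   {"cat": 0.7, "feline": 0.3},
}
# Unigram stand-in for the simplified-sentence language model LM_S.
LM_S = {"big": 0.04, "cat": 0.05, "gigantic": 0.001, "feline": 0.001, "a": 0.08}

def simplify(complex_sentence):
    """Greedy word-by-word version of s' = argmax_s p(c|s) p(s)."""
    out = []
    for c in complex_sentence.split():
        candidates = PT.get(c, {c: 1.0})  # unknown words translate to themselves
        best = max(candidates, key=lambda s: candidates[s] * LM_S.get(s, 1e-6))
        out.append(best)
    return " ".join(out)

simplified = simplify("a gigantic feline")
```

Here the language-model term tips the choice toward the frequent words "big" and "cat" even though the identity substitutions also have channel probability mass.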
Step 7: since only non-parallel corpora are available, use the initial PBMT algorithm P_{c→s}^{(0)} and iteratively apply the back-translation strategy, which converts a very difficult unsupervised learning problem into supervised learning tasks, to produce a better text simplification algorithm.
Step 7.1: first translate the complex-sentence set C with P_{c→s}^{(0)} to obtain a newly synthesized simplified-sentence set S_0; then loop over steps 7.2 to 7.5 for iterations i = 1 to N.
Step 7.2: using the synthesized parallel corpus (S_{i−1}, C) and the language models LM_S and LM_C, train a new PBMT algorithm P_{s→c}^{(i)} from simplified sentences to complex sentences.
Step 7.3: translate the simplified-sentence set S with P_{s→c}^{(i)} to obtain a newly synthesized complex-sentence set C_i.
Step 7.4: using the synthesized parallel corpus (C_i, S) and the language models LM_S and LM_C, train a new PBMT algorithm P_{c→s}^{(i)} from complex sentences to simplified sentences.
Step 7.5: translate the complex-sentence set C with P_{c→s}^{(i)} to obtain a newly synthesized simplified-sentence set S_i; return to step 7.2 and repeat until N iterations are complete. In this example, N is set to 3.
Intuitively, since the input to the PBMT algorithm contains noise, many entries in the phrase table are incorrect. Nevertheless, the language model helps correct some of these mistakes while generating simplified sentences, and whenever that happens, the continuing iterations strengthen both the phrase table and the translation algorithm: more and more entries in the phrase table are repaired, and the PBMT algorithm becomes stronger and stronger.
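Structurally, the iteration of steps 7.1 to 7.5 is the loop below. The `train_pbmt` function is a stub that merely memorizes sentence pairs (a real system would retrain the PBMT model on each synthesized parallel corpus), so this shows only the data flow, not real learning:

```python
def train_pbmt(parallel_corpus):
    """Stub for PBMT training: memorize source->target pairs and fall
    back to the identity for unseen sentences."""
    table = dict(parallel_corpus)
    return lambda sent: table.get(sent, sent)

def back_translation(C, S, p_c2s, n_iters=3):
    s_synth = [p_c2s(c) for c in C]                  # step 7.1: synthesize S_0
    for _ in range(n_iters):
        p_s2c = train_pbmt(zip(s_synth, C))          # step 7.2: train on (S_{i-1}, C)
        c_synth = [p_s2c(s) for s in S]              # step 7.3: synthesize C_i
        p_c2s = train_pbmt(zip(c_synth, S))          # step 7.4: train on (C_i, S)
        s_synth = [p_c2s(c) for c in C]              # step 7.5: synthesize S_i
    return p_c2s

C = ["a gigantic feline sat"]
S = ["a big cat sat"]
initial = lambda c: c.replace("gigantic", "big")     # stand-in for P_c2s^(0)
model = back_translation(C, S, initial, n_iters=3)
```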
The present invention is not limited to the above embodiments. On the basis of the technical solution disclosed by the invention, those skilled in the art can, without creative labor, make replacements and variations of some of the technical features according to the disclosed technical content, and these replacements and variations all fall within the scope of the invention.
Claims (3)
1. An unsupervised automatic English sentence simplification algorithm, characterized in that it is carried out as follows:
Step 1: using the publicly available English Wikipedia corpus D as the training corpus, obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors learned by Word2vec capture the semantic features of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:
log p(W_t | t) = Σ_{w∈W_t} log ( exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t) )   (1)
In formula (1), v'_w is the context vector representation of word w, and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:
J = Σ_{t∈D} log p(W_t | t)   (2)
In formula (2), the vector representations of words are learned by maximizing the objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified-sentence set S and a complex-sentence set C;
Step 4: using the vector representations and frequencies of words, fill the phrase table PT (Phrase Table), which gives the probability that one word is translated into another; in PT, the translation probability p(t_j | t_i) from word t_i to word t_j is calculated as follows:
p(t_j | t_i) = cos(v_{t_i}, v_{t_j}) / Σ_{t_k} cos(v_{t_i}, v_{t_k})   (4)
In formula (4), cos denotes the cosine similarity;
Step 5: for the simplified-sentence set S and the complex-sentence set C, train language models with the KenLM algorithm, obtaining a simplified-sentence language model LM_S and a complex-sentence language model LM_C; LM_S and LM_C remain unchanged throughout the iterative learning procedure below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation); given a complex sentence c, the algorithm P_{c→s} uses formula (5) to score the sentences s formed from different word combinations and finally selects the highest-scoring sentence s' as the simplified sentence:
s' = argmax_s p(c | s) p(s)   (5)
In formula (5), the PBMT algorithm decomposes p(c | s) into a product of phrase-table entries from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: using the initial PBMT algorithm P_{c→s}^{(0)}, iteratively apply the back-translation strategy to produce a better text simplification algorithm.
2. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 3 specifically includes:
Step 3.1: for each sentence s in the Wikipedia corpus D, score it with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences from high to low by score:
FRE(s) = 206.835 − 1.015 · tw(s) − 84.6 · ts(s) / tw(s)   (3)
In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in s, and ts(s) is the number of syllables in s;
Step 3.2: remove the sentences scoring above 100 and those scoring below 20, and remove the sentences with intermediate scores; finally, select the remaining high-scoring sentences as the simplified-sentence set S and the remaining low-scoring sentences as the complex-sentence set C.
3. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 7 specifically includes:
Step 7.1: first translate the complex-sentence set C with the algorithm P_{c→s}^{(0)} to obtain a newly synthesized simplified-sentence set S_0; then loop over steps 7.2 to 7.5 for iterations i = 1 to N;
Step 7.2: using the synthesized parallel corpus (S_{i−1}, C) and the language models LM_S and LM_C, train a new PBMT algorithm P_{s→c}^{(i)} from simplified sentences to complex sentences;
Step 7.3: translate the simplified-sentence set S with P_{s→c}^{(i)} to obtain a newly synthesized complex-sentence set C_i;
Step 7.4: using the synthesized parallel corpus (C_i, S) and the language models LM_S and LM_C, train a new PBMT algorithm P_{c→s}^{(i)} from complex sentences to simplified sentences;
Step 7.5: translate the complex-sentence set C with P_{c→s}^{(i)} to obtain a newly synthesized simplified-sentence set S_i; return to step 7.2 and repeat until N iterations are complete.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201910354246.1A (CN110096705B) | 2019-04-29 | 2019-04-29 | Unsupervised English sentence automatic simplification algorithm |
Publications (2)
| Publication Number | Publication Date |
| --- | --- |
| CN110096705A (en) | 2019-08-06 |
| CN110096705B (en) | 2023-09-08 |
Family
ID=67446309
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN201910354246.1A (CN110096705B, active) | Unsupervised English sentence automatic simplification algorithm | 2019-04-29 | 2019-04-29 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN110096705B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279478A (en) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Method for extracting features based on distributed mutual information documents |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
CN105447206A (en) * | 2016-01-05 | 2016-03-30 | 深圳市中易科技有限责任公司 | New comment object identifying method and system based on word2vec algorithm |
CN108334495A (en) * | 2018-01-30 | 2018-07-27 | 国家计算机网络与信息安全管理中心 | Short text similarity calculating method and system |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
Non-Patent Citations (1)
| Title |
| --- |
| Takumi Maruyama et al.: "Sentence simplification with core vocabulary", 2017 International Conference on Asian Language Processing (IALP) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427629A (en) * | 2019-08-13 | 2019-11-08 | 苏州思必驰信息科技有限公司 | Semi-supervised text simplified model training method and system |
CN110427629B (en) * | 2019-08-13 | 2024-02-06 | 思必驰科技股份有限公司 | Semi-supervised text simplified model training method and system |
CN112612892A (en) * | 2020-12-29 | 2021-04-06 | 达而观数据(成都)有限公司 | Special field corpus model construction method, computer equipment and storage medium |
CN112612892B (en) * | 2020-12-29 | 2022-11-01 | 达而观数据(成都)有限公司 | Special field corpus model construction method, computer equipment and storage medium |
CN113807098A (en) * | 2021-08-26 | 2021-12-17 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN113807098B (en) * | 2021-08-26 | 2023-01-10 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN117808124A (en) * | 2024-02-29 | 2024-04-02 | 云南师范大学 | Llama 2-based text simplification method |
CN117808124B (en) * | 2024-02-29 | 2024-05-03 | 云南师范大学 | Llama 2-based text simplification method |
Also Published As
Publication number | Publication date |
---|---|
CN110096705B (en) | 2023-09-08 |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |