CN103955456A - Sentence length penalty factor-based selection method for sentence rich in information amount - Google Patents

Info

Publication number
CN103955456A
Authority
CN
China
Prior art keywords
sentence
information
rich
language
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410168282.6A
Other languages
Chinese (zh)
Inventor
杜金华
张萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an bonny Translation Co., Ltd.
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201410168282.6A priority Critical patent/CN103955456A/en
Publication of CN103955456A publication Critical patent/CN103955456A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method, based on a sentence-length penalty factor, for selecting information-rich sentences. The method comprises: step 1, building an initial statistical machine translation system; step 2, building the set X of information quantization units and computing the information content; step 3, having the selected sentences professionally translated to obtain a set of parallel sentence pairs; step 4, updating the corpora; step 5, retraining the statistical machine translation system; step 6, iterating the process and evaluating the algorithm. The method effectively computes the information content of monolingual source-language sentences, balances the absolute information content of a selected sentence against its length, and selects the most informative sentences accurately, thereby maximizing the value of manual translation and the utility of limited data.

Description

Method for selecting information-rich sentences based on a sentence-length penalty factor
Technical field
The invention belongs to the technical field of computational linguistics and statistical machine translation, and relates to a method for selecting information-rich sentences based on a sentence-length penalty factor.
Background
Machine translation based on statistical methods or on corpora is, in essence, data-driven, so the scale and quality of the data have a decisive impact on translation performance. A high-quality statistical machine translation system usually requires a large-scale bilingual parallel corpus, which at present is unavailable for many languages. Many methods can alleviate this problem, for example paraphrasing techniques or comparable corpora. For resource-poor languages, however, data scale remains the bottleneck of statistical machine translation research and one of the key problems awaiting a solution.
Most of the world's languages are "low-density" languages: few people use them, or, even when millions of people speak them, the available digitized parallel corpora are still very scarce. For example, China has many ethnic minorities, and with economic development, research on and applications of minority-language information processing have become an important means of revitalizing regional economies, promoting regional development, and facilitating cultural exchange. Against this background, the need for high-quality statistical machine translation systems for low-density languages is particularly urgent. Generally speaking, two approaches can meet this need: (1) build a large-scale bilingual parallel corpus; (2) starting from an existing bilingual parallel corpus of a certain size, build a large-scale monolingual corpus and use an efficient method to generate bilingual data from it, extending the utility of the data.
In practice, building a large-scale bilingual parallel corpus is a systems-engineering effort that requires large investments of manpower, materials, and money; a high-quality corpus in particular takes considerable time to mature. Meanwhile, information technology changes rapidly, and new words and new knowledge keep emerging. It is therefore important and urgent to find a feasible way to use the bilingual parallel corpora and monolingual corpora already built to adapt to new knowledge and improve statistical machine translation quality for resource-poor languages.
The main problem with existing general algorithms for selecting information-rich sentences is the following. When sentence lengths in the large-scale monolingual corpus vary widely, existing methods tend to select shorter sentences. As a result, the bilingual corpus generated from the selected sentences performs poorly both in its coverage of the test set and in the probability estimates of the translation engine's phrase table. A machine translation system trained on the bilingual corpus produced by the selection algorithm then performs worse than one trained on a bilingual corpus produced by random selection, which makes selecting information-rich sentences pointless.
Summary of the invention
The object of the invention is to provide a method for selecting information-rich sentences based on a sentence-length penalty factor. It solves the problem in the prior art that selection tends to favor shorter sentences, so that the bilingual corpus generated from the selected sentences performs poorly both in test-set coverage and in the probability estimates of the translation engine's phrase table, which degrades machine translation performance.
The technical solution adopted by the invention is a method for selecting information-rich sentences based on a sentence-length penalty factor, implemented according to the following steps:
Step 1, build the initial statistical machine translation system
Train a statistical machine translation system on the initial bilingual parallel corpus L = {(f_i, e_i)}, where L denotes the initial bilingual parallel corpus and (f_i, e_i) denotes the i-th parallel sentence pair in L, i.e. the i-th Chinese sentence and the i-th English sentence, i = 1, ..., N;
Step 2, build the set X of information quantization units and compute the information content
Based on the defined information unit x, select a sentence set U_n from the large-scale monolingual corpus U = {f_j}. The algorithm for selecting information-rich sentences with a sentence-length penalty factor is given by formula (1) (the formula appears as an image in the original publication and is not reproduced in this text).
Here, U denotes the large-scale monolingual corpus, U_n denotes the monolingual subset formed by the selected sentences, and BP is the sentence-length penalty factor. Whether a penalty is applied is determined by the ratio of the average sentence length of the monolingual corpus to the length of the candidate sentence: if the ratio is greater than 1, the sentence is penalized; if the ratio is less than 1, it is not;
X_s^m is the set of information quantization units, i.e. the set of phrases, available in sentence s; P(x|U) and P(x|L) denote the probability of a phrase x in the large-scale monolingual corpus U and in the bilingual parallel corpus L respectively, and are computed as follows:
$$P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_U^m} \mathrm{Count}(x) + \epsilon} \tag{2}$$
$$P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_L^m} \mathrm{Count}(x) + \epsilon} \tag{3}$$
where ε is the smoothing factor, X_U^m is the set of possible phrases in the large-scale monolingual corpus U that can be used to compute sentence-level information content, X_L^m is the corresponding set of possible phrases in the bilingual parallel corpus L, and Count(x) is the number of times (the frequency) that phrase x occurs;
Step 3, carry out professional translation
Hand U_n to human translators for professional translation; U_n and the resulting set of translations form a set of mutually translated parallel sentence pairs;
Step 4, update the corpora
Remove from U the n monolingual sentences U_n selected in the previous step, and add U_n together with its translations to the bilingual parallel corpus L;
Step 5, retrain the statistical machine translation system
Retrain the statistical machine translation system on the newly formed bilingual corpus, and use the resulting translation engine to decode the test set again; then score the translations of the test set with the BLEU metric and compare the improvement in score against the random method to measure the performance of the sentence selection algorithm, where a higher score indicates better performance;
Step 6, iterate the process and evaluate the algorithm
Iterate from step 2: in each iteration, select a new set of information-rich sentences, update the data in U and L, and measure the performance of the sentence selection algorithm.
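The six steps can be read as a simple active-learning loop. The following is a minimal sketch under stated assumptions, not the patent's implementation: `score`, `translate`, and `train_and_bleu` are hypothetical callables standing in for the selection formula, the human translators, and SMT training plus BLEU scoring, and the default values n = 200 and 25 iterations are the ones reported in the example later in this document.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def active_learning_loop(
    parallel: List[Pair],
    monolingual: List[str],
    score: Callable[[str, List[str], List[Pair]], float],  # hypothetical: informativeness of a sentence
    translate: Callable[[str], str],                       # hypothetical: stands in for human translation
    train_and_bleu: Callable[[List[Pair]], float],         # hypothetical: retrain SMT, return BLEU
    n: int = 200,
    iterations: int = 25,
) -> List[float]:
    """Repeat steps 2 to 5 and return the BLEU score obtained after each iteration."""
    L = list(parallel)      # copies, so the caller's corpora are not mutated
    U = list(monolingual)
    bleu_history: List[float] = []
    for _ in range(iterations):
        if not U:
            break
        # Step 2: rank the candidates by informativeness and keep the top n.
        ranked = sorted(U, key=lambda s: score(s, U, L), reverse=True)
        selected = ranked[:n]
        # Step 3: obtain translations for the selected sentences.
        new_pairs = [(s, translate(s)) for s in selected]
        # Step 4: move the selected sentences from U into L.
        chosen = set(selected)
        U = [s for s in U if s not in chosen]
        L.extend(new_pairs)
        # Step 5: retrain on the enlarged corpus and record the score.
        bleu_history.append(train_and_bleu(L))
    return bleu_history
```

Passing the scoring function in as a parameter makes it straightforward to run the same loop with a random baseline and with a length-penalized selector, which is the comparison the method relies on.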
The beneficial effects of the invention are as follows. By computing the information content of each sentence in the monolingual corpus, the most informative sentences are extracted and given to the limited human translation capacity, so that high-quality bilingual parallel data are obtained and statistical machine translation quality is improved for languages with scarce bilingual resources. The method balances the conflict between sentence length and information content and extracts sentences that are genuinely rich in information; it improves test-set coverage while also keeping the phrase-table probability estimates of the translation engine accurate, extending the utility of existing data resources. For resource-poor machine translation systems, such as those between Chinese and minority languages or between Chinese and other less widely used languages, the length-penalized selection algorithm proposed by the invention can effectively obtain high-quality bilingual parallel data from a large-scale monolingual corpus.
Brief description of the drawings
Fig. 1 is a schematic diagram of the process of generating bilingual data for resource-poor statistical machine translation under the active learning framework of the method of the invention;
Fig. 2 compares the performance of the statistical machine translation systems built with the existing geometric phrase method and with the random method at each iteration;
Fig. 3 shows the average length of the information-rich sentences selected at each iteration by the existing geometric phrase method and by the random selection method;
Fig. 4 shows the coverage of the different test sets by the bilingual data generated at each iteration by the existing geometric phrase method and by the random selection method;
Fig. 5 shows the average length of the information-rich sentences selected at each iteration by the method of the invention and by the two existing methods;
Fig. 6 compares the statistical machine translation performance of the method of the invention with that of the two existing methods.
Detailed description
The invention is described in detail below with reference to the drawings and a specific embodiment.
The method of the invention for selecting information-rich sentences based on a sentence-length penalty factor takes into account the effect on translation performance of the bilingual parallel data generated from information-rich monolingual sentences. It selects the most informative sentences from a large-scale monolingual corpus, hands them to human translators for professional translation, and generates high-quality bilingual parallel data, with the aim of improving statistical machine translation quality for languages with scarce bilingual resources.
Fig. 1 shows the iterative process by which the method of the invention, under the active learning framework, selects the most informative sentences and generates a bilingual parallel corpus from them. The bilingual parallel data generated in each iteration are added to the parallel corpus used to train the previous translation engine, and a new statistical machine translation system is then trained. The new system decodes the same test set, and the BLEU metric (the precision of matching contiguous word sequences, with scores between 0 and 1) is used to measure the quality of each newly added batch of bilingual data. This indirectly measures the quality of the selected set of information-rich sentences and thus reflects the effectiveness of the sentence selection algorithm. The process has two preconditions: 1) a small initial parallel corpus for building the baseline statistical machine translation system (this system is existing technology); 2) a large-scale monolingual corpus from which bilingual data are generated under the framework of the invention. The key question in this process is how to design an effective method for selecting information-rich sentences, which is the core of the invention.
Following the above principle, the method of the invention for selecting information-rich sentences based on a sentence-length penalty factor is implemented according to the following steps:
Step 1, build the initial statistical machine translation system
Train a statistical machine translation system on a small, resource-limited initial bilingual parallel corpus L = {(f_i, e_i)}, where L denotes the initial bilingual parallel corpus and (f_i, e_i) denotes the i-th parallel sentence pair in L, i.e. the i-th Chinese sentence and the i-th English sentence, i = 1, ..., N;
Step 2, build the set X of information quantization units and compute the information content
Based on the defined information unit x, select a sentence set U_n from the large-scale monolingual corpus U = {f_j}. The algorithm for selecting information-rich sentences with a sentence-length penalty factor is given by formula (1) (the formula appears as an image in the original publication and is not reproduced in this text).
Here, U denotes the large-scale monolingual corpus, U_n denotes the monolingual subset formed by the selected sentences, and BP is the sentence-length penalty factor of the method of the invention. The basic idea is to decide whether to apply a penalty according to the ratio of the average sentence length of the monolingual corpus to the length of the candidate sentence: if the ratio is greater than 1, the sentence is penalized; if the ratio is less than 1, it is not. Introducing this penalty factor is the inventive core of the method;
x is the information unit used to quantify sentence-level information, and X is the set of such units; in the invention, x is defined as a word sequence (phrase) of the kind used by a phrase-based statistical machine translation system;
X_s^m is the set of information quantization units, i.e. the set of phrases, available in sentence s; P(x|U) and P(x|L) denote the probability of a phrase x in the large-scale monolingual corpus U and in the bilingual parallel corpus L respectively, and are computed as follows:
$$P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_U^m} \mathrm{Count}(x) + \epsilon} \tag{2}$$
$$P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_L^m} \mathrm{Count}(x) + \epsilon} \tag{3}$$
where ε is the smoothing factor, X_U^m is the set of possible phrases in the large-scale monolingual corpus U that can be used to compute sentence-level information content, X_L^m is the corresponding set of possible phrases in the bilingual parallel corpus L, and Count(x) is the number of times (the frequency) that phrase x occurs;
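The smoothed probabilities of formulas (2) and (3) can be illustrated with a short sketch. Two things here are assumptions rather than part of the patent: word n-grams of up to three words on whitespace-tokenized text stand in for the phrases of a phrase-based system, and ε is added once to the denominator, which is the literal reading of the formulas (adding ε once per phrase type would be another plausible reading).

```python
from collections import Counter
from typing import Iterable, List


def phrases(sentence: str, max_len: int = 3) -> List[str]:
    """All word n-grams up to max_len words; an assumed stand-in for phrase extraction."""
    words = sentence.split()
    return [
        " ".join(words[i:i + k])
        for k in range(1, max_len + 1)
        for i in range(len(words) - k + 1)
    ]


def phrase_counts(corpus: Iterable[str], max_len: int = 3) -> Counter:
    """Count(x) for every phrase x found in the corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(phrases(sentence, max_len))
    return counts


def smoothed_prob(x: str, counts: Counter, eps: float = 0.5) -> float:
    """(Count(x) + eps) / (sum of all counts + eps); eps added once (assumption)."""
    total = sum(counts.values()) + eps
    return (counts[x] + eps) / total
```

Calling `smoothed_prob` with counts built from U gives P(x|U), and with counts built from the source side of L gives P(x|L). A phrase absent from a corpus has a count of zero, so smoothing keeps its probability positive and a later log-ratio well defined.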
Step 3, carry out professional translation
Hand U_n to human translators for professional translation; U_n and the resulting set of translations form a set of mutually translated parallel sentence pairs;
Step 4, update the corpora
Remove from U the n monolingual sentences U_n selected in the previous step, and add U_n together with its translations to the bilingual parallel corpus L;
Step 5, retrain the statistical machine translation system
Retrain the statistical machine translation system on the newly formed bilingual corpus, and use the resulting translation engine to decode the test set again; then score the translations of the test set with the BLEU metric (the precision of matching contiguous word sequences, with scores between 0 and 1) and compare the improvement in score against the random method to measure the performance of the sentence selection algorithm, where a higher score indicates better performance.
Step 6, iterate the process and evaluate the algorithm
Iterate from step 2: in each iteration, select a new set of information-rich sentences, update the data in U and L, and measure the performance of the sentence selection algorithm; stop after 25 iterations. Compute the average BLEU score over the 25 iterations and compare it with the average for the random method to measure the performance of the sentence selection algorithm. The higher the average, the better the algorithm: it is better at selecting highly informative sentences and contributes more to machine translation performance.
Example
The Chinese-English language pair and translation direction are taken as the object of study. From the Chinese-English FBIS bilingual parallel corpus provided for the open evaluations of the National Institute of Standards and Technology (NIST), 5K sentence pairs were randomly selected as the initial parallel corpus (average length of the Chinese side: 36.5 words), and 20K sentences as the simulated monolingual (Chinese) corpus (average Chinese sentence length: 36.4 words). The NIST 2006 test set serves as the development set for tuning the model parameters of the statistical machine translation system in this experiment (1,664 sentences, each with four reference translations). The NIST 2005 and NIST 2008 test sets are used to test translation performance; the former contains 1,083 sentences and the latter 1,357 sentences, each with four reference translations. Human translation in the active learning framework is simulated by the English side of the simulated monolingual data (which is in fact a sentence-aligned bilingual parallel corpus).
The number of iterations in the active learning framework is set to 25, and each iteration selects n = 200 information-rich sentences from U; the smoothing factor ε in formulas (2) and (3) is set to 0.5.
Fig. 2 compares an existing algorithm for selecting information-rich sentences (the geometric phrase method, taken as an example) with the random selection method. The basic formula of the geometric phrase method is:
$$\phi(s) := \frac{1}{|X_s^m|} \sum_{x \in X_s^m} \log \frac{P(x \mid U)}{P(x \mid L)} \tag{4}$$
The parameters in formula (4) are the same as in formula (1).
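As an illustration of the geometric phrase score, the sketch below averages the log-ratio of the two smoothed probabilities over the distinct units of one sentence. It assumes that the ratio compares P(x|U) with P(x|L), as the surrounding definitions suggest, and that ε is added once to each denominator; the function name and the count containers are hypothetical.

```python
import math
from collections import Counter
from typing import List


def geom_phrase_score(
    units: List[str],
    counts_u: Counter,
    counts_l: Counter,
    eps: float = 0.5,
) -> float:
    """Average of log(P(x|U) / P(x|L)) over the distinct units of one sentence."""
    distinct = set(units)  # X_s^m is a set, so repeated units count once
    if not distinct:
        return 0.0
    total_u = sum(counts_u.values()) + eps
    total_l = sum(counts_l.values()) + eps
    log_ratios = [
        math.log(((counts_u[x] + eps) / total_u) / ((counts_l[x] + eps) / total_l))
        for x in distinct
    ]
    return sum(log_ratios) / len(distinct)
```

A unit that is frequent in U but rare in L contributes a positive term, so sentences made of such units score highest. Because the sum is divided by the number of units, the score is a per-unit average, which is why short sentences are not disadvantaged by it.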
Each line in Fig. 2 shows the BLEU score of the statistical machine translation system after the bilingual parallel data generated in each iteration of the active learning framework have been added. Each node on a line represents one iteration, and each iteration is a complete cycle of selecting information-rich sentences, human translation, merging the new parallel data, training the machine translation model, and decoding. The two upper lines in Fig. 2 are measured on the NIST 2005 test set, and the two lower lines on the NIST 2008 test set. In each pair of intertwined lines, one represents the machine translation system built with the existing geometric phrase method and the other the system built with the random method. The comparison shows that the existing geometric phrase method (Geom phrase NIST05 and Geom phrase NIST08 in the figure) performs no better than the random method (random NIST05 and random NIST08 in the figure).
A statistical analysis was carried out on the existing geometric phrase method and the random method. Fig. 3 shows the average length of the sentences chosen by each method at each iteration. In every iteration, the sentences picked by the geometric phrase method are shorter on average than those picked by the random method. Over the 25 iterations, the mean of the average sentence length is 27.7 with a variance of 5.93 for the geometric phrase method, and 36.5 with a variance of 1.23 for the random method. The geometric phrase method is thus less stable than the random method, both in average sentence length and in dynamic range.
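The per-iteration length statistics described here are simple to compute. The sketch below assumes whitespace-tokenized sentences and uses the population variance; the patent does not say which variance estimator was used, and the function names are hypothetical.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def average_length(sentences: List[str]) -> float:
    """Average number of whitespace-separated words per sentence in one batch."""
    return mean(len(s.split()) for s in sentences)


def length_stability(batches: List[List[str]]) -> Tuple[float, float]:
    """Mean and population variance of the per-iteration average sentence length."""
    per_iteration = [average_length(batch) for batch in batches]
    return mean(per_iteration), pvariance(per_iteration)
```

Each element of `batches` would hold the sentences selected in one iteration, so 25 batches of 200 sentences in the setting described above.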
In general, for sentences in a large-scale monolingual corpus, the more new words a sentence contains (new relative to the existing bilingual parallel corpus), the higher its information content; the more old words it contains (words already in the existing bilingual parallel corpus), the lower its information content, although the phrase probability estimates become more accurate. A statistical analysis was therefore made of how well the 200 sentences selected in each iteration of the active learning framework cover the test sets. Fig. 4 shows the coverage of the different test sets by the bilingual data generated in each iteration. In every iteration, the bilingual parallel data generated by the existing geometric phrase method have higher coverage on all three test sets (NIST05, NIST06 and NIST08) than those of the random method, which means that the sentences picked by the geometric phrase method contain more new words. Nevertheless, the higher coverage does not improve translation quality. Combining the sentence-length statistics with the coverage statistics, the root cause of the geometric phrase method performing no better than the random method is that the information content of the sentences it selects is only "relatively maximal": the quantification of information is not normalized with respect to sentence length. Although the selected sentences contain many new words, they are short, so their absolute information content is insufficient, and translation performance falls below that obtained with the bilingual data generated by the random method.
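Coverage can be illustrated with a small sketch. The patent does not define the measure precisely, so the definition used here is an assumption: the fraction of distinct test-set words that also occur on the source side of the training data, with whitespace tokenization.

```python
from typing import Iterable


def coverage(train_sentences: Iterable[str], test_sentences: Iterable[str]) -> float:
    """Fraction of distinct test-set words that also appear in the training sentences."""
    train_vocab = {w for s in train_sentences for w in s.split()}
    test_vocab = {w for s in test_sentences for w in s.split()}
    if not test_vocab:
        return 0.0
    return len(test_vocab & train_vocab) / len(test_vocab)
```

Under this definition a batch of short sentences packed with unseen words can raise coverage while adding few training tokens overall, which matches the observation that higher coverage alone did not improve translation quality.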
Based on the above statistical analysis, the invention proposes a geometric-phrase-based method for selecting information-rich sentences that adds a sentence-length penalty factor. It converts relative information content into absolute information content and selects the monolingual sentences with the largest absolute information content for human translation.
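The effect of a sentence-length penalty can be sketched as follows. The exact form of BP in formula (1) is not reproduced in this text, so the exponential form below, borrowed from the brevity penalty of BLEU, is purely an assumption, as are the function names and the multiplicative combination with the base score. The sketch only respects the stated rule: penalize when the ratio of average corpus sentence length to candidate length exceeds 1, otherwise leave the score unchanged.

```python
import math


def length_penalty(sentence_len: int, avg_corpus_len: float) -> float:
    """Hypothetical BP: 1.0 for sentences of at least average length, below 1.0 for shorter ones."""
    if sentence_len <= 0:
        return 0.0
    ratio = avg_corpus_len / sentence_len
    if ratio > 1.0:
        # Assumed BLEU-style form; the patent's own formula is not available here.
        return math.exp(1.0 - ratio)
    return 1.0


def penalized_score(base_score: float, sentence_len: int, avg_corpus_len: float) -> float:
    """Assumed combination: multiply the base informativeness score by BP."""
    return length_penalty(sentence_len, avg_corpus_len) * base_score
```

With an average corpus length of 36.4 words, a 10-word candidate would have its score scaled by exp(1 - 3.64) under this assumed form, while a 40-word candidate keeps its full score. A multiplicative penalty only lowers scores that are positive, so a real implementation would need to follow formula (1) for how BP and the base score are combined.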
Fig. 5 compares the average length of the sentences selected at each iteration by the method of the invention (phrase-pen), the existing geometric phrase method (phrase), and the random method (random). As Fig. 5 shows, once the sentence-length penalty factor is added and the sentence lengths of the corpus are taken into account, the average length of each selected set of sentences increases and the fluctuation in sentence length decreases.
Fig. 6 compares the statistical machine translation performance (BLEU score) at each iteration of the method of the invention (Phrase-penalty), the existing geometric phrase method (Phrase), and the random method (Random). As Fig. 6 shows, after the sentence-length penalty factor is added, the method of the invention significantly improves machine translation performance compared with the random method. The concrete figures for this example are given in Table 1.
Table 1. Translation performance of the method of the invention compared with the other two methods
Table 1 compares the statistical machine translation performance, averaged over the 25 iterations, of the method of the invention, the existing geometric phrase method, and the random method. In the table, the "BLEU" column gives the evaluation of the translation results with the 4-gram BLEU automatic evaluation metric; values lie between 0 and 100%, and larger values indicate better performance. "Phrase" denotes the existing geometric phrase method, "Random" denotes the random method, and "Phrase-penalty" denotes the selection method with a sentence-length penalty factor proposed by the invention.
As Table 1 shows, compared with the random method and the existing geometric phrase method, the proposed method with a sentence-length penalty factor improves translation performance by 0.33 and 0.14 points respectively on the NIST05 data set, and by 0.45 and 0.33 points respectively on the NIST08 data set, a significant improvement in the translation performance of the system. The proposed method is therefore effective and feasible.

Claims (2)

1. A method for selecting information-rich sentences based on a sentence-length penalty factor, characterized in that it is implemented according to the following steps:
Step 1, build the initial statistical machine translation system
Train a statistical machine translation system on the initial bilingual parallel corpus L = {(f_i, e_i)}, where L denotes the initial bilingual parallel corpus and (f_i, e_i) denotes the i-th parallel sentence pair in L, i.e. the i-th Chinese sentence and the i-th English sentence, i = 1, ..., N;
Step 2, build the set X of information quantization units and compute the information content
Based on the defined information unit x, select a sentence set U_n from the large-scale monolingual corpus U = {f_j}. The algorithm for selecting information-rich sentences with a sentence-length penalty factor is given by formula (1) (the formula appears as an image in the original publication and is not reproduced in this text).
Here, U denotes the large-scale monolingual corpus, U_n denotes the monolingual subset formed by the selected sentences, and BP is the sentence-length penalty factor. Whether a penalty is applied is determined by the ratio of the average sentence length of the monolingual corpus to the length of the candidate sentence: if the ratio is greater than 1, the sentence is penalized; if the ratio is less than 1, it is not;
X_s^m is the set of information quantization units, i.e. the set of phrases, available in sentence s; P(x|U) and P(x|L) denote the probability of a phrase x in the large-scale monolingual corpus U and in the bilingual parallel corpus L respectively, and are computed as follows:
$$P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_U^m} \mathrm{Count}(x) + \epsilon} \tag{2}$$
$$P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_L^m} \mathrm{Count}(x) + \epsilon} \tag{3}$$
where ε is the smoothing factor, X_U^m is the set of possible phrases in the large-scale monolingual corpus U that can be used to compute sentence-level information content, X_L^m is the corresponding set of possible phrases in the bilingual parallel corpus L, and Count(x) is the number of times (the frequency) that phrase x occurs;
Step 3, carry out professional translation
Hand U_n to human translators for professional translation; U_n and the resulting set of translations form a set of mutually translated parallel sentence pairs;
Step 4, update the corpora
Remove from U the n monolingual sentences U_n selected in the previous step, and add U_n together with its translations to the bilingual parallel corpus L;
Step 5, retrain the statistical machine translation system
Retrain the statistical machine translation system on the newly formed bilingual corpus, and use the resulting translation engine to decode the test set again; then score the translations of the test set with the BLEU metric and compare the improvement in score against the random method to measure the performance of the sentence selection algorithm, where a higher score indicates better performance;
Step 6, iterate the process and evaluate the algorithm
Iterate from step 2: in each iteration, select a new set of information-rich sentences, update the data in U and L, and measure the performance of the sentence selection algorithm.
2. The method for selecting information-rich sentences based on a sentence-length penalty factor according to claim 1, characterized in that, in said step 6, the iteration stops after 25 iterations; the average BLEU score over the 25 iterations is computed and compared with the average for the random method to measure the performance of the sentence selection algorithm, where a higher average indicates a better algorithm.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410168282.6A CN103955456A (en) 2014-04-23 2014-04-23 Sentence length penalty factor-based selection method for sentence rich in information amount

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410168282.6A CN103955456A (en) 2014-04-23 2014-04-23 Sentence length penalty factor-based selection method for sentence rich in information amount

Publications (1)

Publication Number Publication Date
CN103955456A true CN103955456A (en) 2014-07-30

Family

ID=51332731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410168282.6A Pending CN103955456A (en) 2014-04-23 2014-04-23 Sentence length penalty factor-based selection method for sentence rich in information amount

Country Status (1)

Country Link
CN (1) CN103955456A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110022381A1 (en) * 2009-07-21 2011-01-27 International Business Machines Corporation Active learning systems and methods for rapid porting of machine translation systems to new language pairs or new domains
CN102855263A (en) * 2011-06-30 2013-01-02 富士通株式会社 Method and device for aligning sentences in bilingual corpus
CN103092831A (en) * 2013-01-25 2013-05-08 哈尔滨工业大学 Parameter adjustment method used for counting machine translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINHUA DU ET AL.: "Findings and Considerations in Active Learning based Framework for Resource-poor SMT", 《2013 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING(IALP 2013)》 *
朱俊国 等: "基于译文加权的BLEU改进方法", 《中文信息处理前沿进展—中国中文信息学会二十五周年学术会议论文集》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110427629B (en) * 2019-08-13 2024-02-06 思必驰科技股份有限公司 Semi-supervised text simplified model training method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160214

Address after: 710065 Shaanxi high tech Zone in Xi'an City, New Road No. 5 letter wealth center room B601

Applicant after: Xi'an bonny Translation Co., Ltd.

Address before: 710048 Shaanxi city of Xi'an Province Jinhua Road No. 5

Applicant before: Xi'an University of Technology

RJ01 Rejection of invention patent application after publication

Application publication date: 20140730
