CN105183722A - Chinese-English bilingual translation corpus alignment method - Google Patents

Publication number: CN105183722A
Authority: CN (China)
Legal status: Pending
Application number: CN201510592410.4A
Other languages: Chinese (zh)
Inventors: 郝瑞, 张马成, 王兴强
Current Assignee: CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Original Assignee: CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-09-17
Filing date: 2015-09-17
Publication date: 2015-12-23
Application filed by CHENGDU URELITE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510592410.4A
Publication of CN105183722A

Abstract

The invention discloses a Chinese-English bilingual translation corpus alignment method. The method comprises the steps of 1, acquiring an original file and a corresponding translated file; 2, dividing the original file and the translated file by paragraph; 3, dividing any paragraph of the original file and the corresponding paragraph of the translated file by sentence, conducting rule alignment on sentences obtained through division, and establishing correlation; 4, traversing the paragraph, directly executing the step 6 if the number of sentences in the paragraph of the original file is identical with the number of sentences in the corresponding paragraph of the translated file, and directly executing the step 5 if the number of sentences in the paragraph of the original file is different from the number of sentences in the corresponding paragraph of the translated file; 5, combining certain sentences in the original file or/and the translated file; 6, selecting any undivided paragraph in the original file and the corresponding paragraph of the translated file again, and executing the step 3 to the step 5 on the paragraph; 7, executing the step 6 till all paragraphs are processed; 8, exporting an aligned file. By the adoption of the method, repeated use of existing translated files can be achieved.

Description

A Chinese-English bilingual translation corpus alignment method
Technical field
The present invention relates to the field of translation technology, and more specifically to a method for aligning a Chinese-English bilingual translation corpus.
Background
With the continuous progress of science and technology, international exchange is becoming more frequent, the world economy more open, and globalization deeper. Translation of documents between languages is increasing accordingly, especially between English and Chinese. Translated documents touch every aspect of life, in fields such as trade, law, electronics, communications, computing, machinery, chemical engineering, petroleum, medicine, and food.
Translation is a service industry, and a service industry must always be customer-oriented. Now that translation volumes and document word counts keep growing, improving translation speed to meet client demand is very important. The spread of computer-assisted translation (CAT) technology has greatly increased translation speed. Existing methods for splitting and distributing translation files can avoid translating identical paragraphs repeatedly, which improves efficiency to some extent. However, such methods only remove repeated paragraphs within a single document, and because few paragraphs repeat within one document, they do not truly improve translation efficiency. Meanwhile, the stock of existing translated documents keeps growing, and so does the number of paragraphs repeated across them. Reusing existing translated documents is therefore important for improving translation speed.
Summary of the invention
To solve the above technical problem, the present invention provides a method for aligning a Chinese-English bilingual translation corpus that enables existing translated documents to be reused and improves translation efficiency.
The technical solution adopted by the present invention to solve the above problem is:
A method for aligning a Chinese-English bilingual translation corpus, characterized in that it comprises:
Step 1, obtain the original document and the corresponding translated document;
Step 2, split the original document and the translated document into paragraphs;
Step 3, split any paragraph of the original document and the corresponding paragraph of the translated document into sentences, align the resulting sentences according to rules, and associate them;
Step 4, traverse the paragraph; if the original and the translated document contain the same number of sentences in this paragraph, go directly to step 6; if the numbers differ, go directly to step 5;
Step 5, merge certain sentences in the original document and/or the translated document;
Step 6, select another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and process it according to steps 3 to 5;
Step 7, repeat step 6 until all paragraphs have been processed;
Step 8, export the aligned file.
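The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the patent does not specify its splitting rules, so the sketch assumes one paragraph per line and sentence boundaries at Chinese or English end punctuation. The names `split_paragraphs`, `split_sentences`, `align_documents`, and the `merge` callback are hypothetical.

```python
import re

# Assumed sentence pattern: a run of text followed by Chinese or English
# end punctuation. The patent does not specify its splitting rules.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_paragraphs(text):
    """Step 2: treat each non-empty line as one paragraph (assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_sentences(paragraph):
    """Step 3: split a paragraph into sentences, keeping end punctuation."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def align_documents(original, translated, merge=None):
    """Steps 2 to 7: return a list of (source, target) sentence pairs.

    `merge` is a hypothetical callback standing in for step 5. It receives
    the two sentence lists of a paragraph whose counts differ and must
    return two lists of equal length.
    """
    src_pars = split_paragraphs(original)
    tgt_pars = split_paragraphs(translated)
    if len(src_pars) != len(tgt_pars):
        raise ValueError("documents must have the same number of paragraphs")

    pairs = []
    for src_par, tgt_par in zip(src_pars, tgt_pars):
        src = split_sentences(src_par)
        tgt = split_sentences(tgt_par)
        if len(src) != len(tgt):                 # step 4: counts differ
            if merge is None:
                raise ValueError("sentence counts differ and no merge given")
            src, tgt = merge(src, tgt)           # step 5
            if len(src) != len(tgt):
                raise ValueError("merge did not equalize sentence counts")
        pairs.extend(zip(src, tgt))              # associate by position
    return pairs
```

Pairing by position after the counts match is the simplest reading of "rule alignment"; a real tool would probably also let an operator review each pair before export (step 8).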
The present invention analyzes and splits documents on the basis of existing completed translations, and exports the original document and its translation in a regular form to generate a standard corpus document. This solves the problem of reusing corpus content, improves translation speed, greatly speeds up corpus production, shortens the time needed to generate corpus files, and improves efficiency. After being split into paragraphs and sentences, the original document and its corresponding translated document yield associated sentence pairs, which are exported in a regular form so that later translation work can reuse them. In Chinese-English translation, one sentence may correspond to two, or two to one; when sentences are associated, such sentences are merged so that the result meets translation requirements, which strengthens the sentence associations and improves translation quality. The method of the present invention turns raw material into processed material: the original document and its corresponding translated document are converted into a TMX file that can be used directly. The smallest unit in the translation process is the sentence, not the paragraph. When processing a corpus, it is therefore preferable to work at the sentence level, so that a later translation can reuse a sentence directly or with only simple modification.
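As an illustration of the TMX export mentioned above, the following sketch writes aligned sentence pairs as a TMX 1.4 document using Python's standard `xml.etree.ElementTree`. The patent does not describe its export code; the function name `build_tmx`, the header values, and the language codes `zh-CN` and `en-US` are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="zh-CN", tgt_lang="en-US"):
    """Return a TMX 1.4 document (as a string) for (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",                # one sentence per segment
        "o-tmf": "plain-text",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")        # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Setting `segtype` to `sentence` reflects the point made above that the sentence, not the paragraph, is the unit of reuse.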
Preferably, to handle the "two-to-one" case that arises in Chinese-English translation, step 5 comprises: searching the original document and the translated document to find sentences in a "two-to-one" correspondence, that is, two sentences of the original that correspond to a single sentence of the translation; merging those two sentences in the original; and, after the merge, moving the subsequent split sentences of the original up in order and adjusting the associations.
Preferably, to handle the "one-to-two" case that arises in Chinese-English translation, step 5 comprises: searching the original document and the translated document to find sentences in a "one-to-two" correspondence, that is, one sentence of the original that corresponds to two sentences of the translation; merging those two sentences in the translated document; and, after the merge, moving the subsequent split sentences of the translated document up in order and adjusting the associations.
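The two merge cases can be sketched as a single list operation. The patent does not say how the sentences to merge are identified (it may be an operator's decision), so this sketch assumes the caller already knows the index of the first sentence of each pair to merge. The names `merge_adjacent` and `realign` are hypothetical.

```python
def merge_adjacent(sentences, index, joiner=""):
    """Merge sentences[index] with its successor and return a new list.

    Every later sentence moves up by one position, which is the "move up
    in order" adjustment described above.
    """
    if not 0 <= index < len(sentences) - 1:
        raise IndexError("index must point at a sentence with a successor")
    merged = sentences[index] + joiner + sentences[index + 1]
    return sentences[:index] + [merged] + sentences[index + 2:]


def realign(src, tgt, src_merges=(), tgt_merges=()):
    """Apply the requested merges, then re-associate sentences by position.

    src_merges handles the two-to-one case (merge in the original);
    tgt_merges handles the one-to-two case (merge in the translation).
    Indices refer to the unmerged lists, so they are applied from the
    highest down to keep the remaining indices valid.
    """
    for i in sorted(src_merges, reverse=True):
        src = merge_adjacent(src, i)               # Chinese: no space needed
    for i in sorted(tgt_merges, reverse=True):
        tgt = merge_adjacent(tgt, i, joiner=" ")   # English: join with a space
    if len(src) != len(tgt):
        raise ValueError("sentence counts still differ after merging")
    return list(zip(src, tgt))
```

For example, `realign(["s1", "s2", "s3"], ["t1", "t2"], src_merges=[0])` pairs the merged `"s1s2"` with `"t1"` and the shifted `"s3"` with `"t2"`.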
Preferably, step 4 further comprises: traversing the whole paragraph, searching for specific terms, and checking how each one is translated; if the translation of a term is not the designated translation for that term, it is replaced with the designated translation.
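A terminology check of this kind might look like the sketch below. The patent does not describe how the translation of a term is located inside the target sentence, so the sketch assumes a glossary that lists, for each source term, the designated translation and the known non-designated variants to replace. The function name `enforce_terminology` and the glossary layout are hypothetical.

```python
import re


def enforce_terminology(pairs, glossary):
    """Replace non-designated term translations in aligned sentence pairs.

    glossary maps a source term to (designated, variants), where `variants`
    lists translations that should be replaced by `designated`.
    """
    fixed = []
    for src, tgt in pairs:
        for term, (designated, variants) in glossary.items():
            # Skip if the term is absent or already translated as designated.
            if term not in src or designated.lower() in tgt.lower():
                continue
            for variant in variants:
                pattern = re.compile(
                    r"\b" + re.escape(variant) + r"\b", re.IGNORECASE
                )
                tgt, count = pattern.subn(designated, tgt)
                if count:
                    break                      # term fixed; stop trying variants
        fixed.append((src, tgt))
    return fixed
```

Listing variants explicitly keeps the replacement predictable; locating an arbitrary mistranslation automatically would need word alignment, which is outside what this sketch attempts.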
In summary, the beneficial effects of the present invention are:
On the basis of existing completed translations, the method of the present invention exports the original document and its corresponding translated document in a regular form and converts them into a TMX file that can be used directly, generating a standard corpus document. This solves the problem of reusing corpus content, improves translation speed, greatly speeds up corpus production, shortens the time needed to generate corpus files, and improves efficiency.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to an embodiment. Obviously, the described embodiment is only one of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.
Embodiment
A method for aligning a Chinese-English bilingual translation corpus, comprising:
Step 1, obtain the original document and the corresponding translated document;
Step 2, split the original document and the translated document into paragraphs;
Step 3, split any paragraph of the original document and the corresponding paragraph of the translated document into sentences, align the resulting sentences according to rules, and associate them;
Step 4, traverse the paragraph; if the original and the translated document contain the same number of sentences in this paragraph, go directly to step 6; if the numbers differ, go directly to step 5;
Step 5, merge certain sentences in the original document and/or the translated document;
Step 6, select another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and process it according to steps 3 to 5;
Step 7, repeat step 6 until all paragraphs have been processed;
Step 8, export the aligned file.
Step 5 comprises: searching the original document and the translated document to find sentences in a "two-to-one" correspondence; merging the two corresponding sentences in the original; and, after the merge, moving the subsequent split sentences of the original up in order and adjusting the associations.
Step 5 comprises: searching the original document and the translated document to find sentences in a "one-to-two" correspondence; merging the two corresponding sentences in the translated document; and, after the merge, moving the subsequent split sentences of the translated document up in order and adjusting the associations.
Step 4 further comprises: traversing the whole paragraph, searching for specific terms, and checking how each one is translated; if the translation of a term is not the designated translation for that term, it is replaced with the designated translation.
On this basis, the method is illustrated below with a concrete example.
The Chinese original (rendered here in English) is:
I like doing housework, and I especially love washing the dishes. But once I broke a bowl.
While cleaning up the broken pieces, I accidentally cut my finger.
The English translated document is:
I like to do housework, especially love to wash dishes. But once I broke my bowl.
When I was in cleaning up crumbs, accidentally cut my finger.
During alignment, the original and the translated document are first split according to their paragraph structure. The original is divided into two paragraphs. The first paragraph is: I like doing housework, and I especially love washing the dishes. But once I broke a bowl. The corresponding first paragraph of the translated document is: I like to do housework, especially love to wash dishes. But once I broke my bowl.
The second paragraph of the original is: While cleaning up the broken pieces, I accidentally cut my finger. The corresponding second paragraph of the translated document is: When I was in cleaning up crumbs, accidentally cut my finger.
Taking the first paragraph as an example, the first paragraphs of the original and the translated document are split into sentences and associated. In the original, the sentence "I like doing housework, and I especially love washing the dishes." corresponds to the translated sentence "I like to do housework, especially love to wash dishes.", and "But once I broke a bowl." corresponds to "But once I broke my bowl." The sentence correspondences are then checked: this paragraph contains no one-to-two or two-to-one case, so the second paragraph is processed directly. Finally, the aligned corpus is exported.
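The worked example can be reproduced in a few lines of Python. The translated sentences are the ones given in the example; the source strings are English glosses standing in for the Chinese original, which is an assumption made only so that the sketch is readable. The splitter is likewise an assumed rule based on end punctuation, and `split_sentences` is a hypothetical name.

```python
import re


def split_sentences(paragraph):
    """Split at Chinese or English end punctuation (assumed rule)."""
    found = re.findall(r"[^。！？.!?]+[。！？.!?]*", paragraph)
    return [s.strip() for s in found if s.strip()]


# English glosses standing in for the two Chinese paragraphs (assumption).
original = [
    "I like doing housework, and I especially love washing the dishes. "
    "But once I broke a bowl.",
    "While cleaning up the broken pieces, I accidentally cut my finger.",
]
# The two paragraphs of the translated document from the example.
translation = [
    "I like to do housework, especially love to wash dishes. "
    "But once I broke my bowl.",
    "When I was in cleaning up crumbs, accidentally cut my finger.",
]

aligned = []
for src_par, tgt_par in zip(original, translation):
    src = split_sentences(src_par)
    tgt = split_sentences(tgt_par)
    # Step 4: the counts match in both paragraphs, so step 5 is not needed.
    if len(src) != len(tgt):
        raise ValueError("sentence counts differ; a merge step would be needed")
    aligned.extend(zip(src, tgt))

for src, tgt in aligned:
    print(src, "<->", tgt)
```

Both paragraphs pass the count check (two sentences each in the first, one each in the second), so the result is three sentence pairs with no merging.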
Once translated files have been aligned into a corpus, the translation process only needs to run a similarity search for each sentence to be translated, which makes it easy to call up the aligned corpus and improves translation efficiency. For example, suppose the sentence to be translated is: "I like doing housework, and I especially love washing the dishes." During translation, the parallel corpus is searched directly by degree of match. When the search finds a complete match for this sentence in the parallel corpus, its translation is called up directly, which greatly shortens the translation cycle.
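A lookup by degree of match could be sketched as below. The patent does not name a similarity measure, so this example substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher`; the function name `best_match` and the 0.7 threshold are assumptions.

```python
from difflib import SequenceMatcher


def best_match(query, corpus, threshold=0.7):
    """Return (source, target, score) for the closest source sentence.

    `corpus` is a list of (source, target) pairs from an aligned corpus.
    Returns None when no source sentence reaches the threshold.
    """
    best_pair = None
    best_score = 0.0
    for src, tgt in corpus:
        score = SequenceMatcher(None, query, src).ratio()
        if score > best_score:
            best_pair, best_score = (src, tgt), score
    if best_pair is None or best_score < threshold:
        return None
    return best_pair[0], best_pair[1], best_score
```

A score of 1.0 corresponds to the complete match described above, where the stored translation can be reused as is; lower scores above the threshold would return a near match for the translator to edit.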
In steps 3 to 7, paragraphs may be split and aligned in arbitrary order, or sequentially paragraph by paragraph, that is, the first paragraph, then the second, then the third, and so on until the last.
As described above, the present invention can be implemented well.

Claims (4)

1. A method for aligning a Chinese-English bilingual translation corpus, characterized in that it comprises:
Step 1, obtain the original document and the corresponding translated document;
Step 2, split the original document and the translated document into paragraphs;
Step 3, split any paragraph of the original document and the corresponding paragraph of the translated document into sentences, align the resulting sentences according to rules, and associate them;
Step 4, traverse the paragraph; if the original and the translated document contain the same number of sentences in this paragraph, go directly to step 6; if the numbers differ, go directly to step 5;
Step 5, merge certain sentences in the original document and/or the translated document;
Step 6, select another paragraph of the original document that has not yet been split, together with the corresponding paragraph of the translated document, and process it according to steps 3 to 5;
Step 7, repeat step 6 until all paragraphs have been processed;
Step 8, export the aligned file.
2. The method for aligning a Chinese-English bilingual translation corpus according to claim 1, characterized in that step 5 comprises: searching the original document and the translated document to find sentences in a "two-to-one" correspondence; merging the two corresponding sentences in the original; and, after the merge, moving the subsequent split sentences of the original up in order and adjusting the associations.
3. The method for aligning a Chinese-English bilingual translation corpus according to claim 1, characterized in that step 5 comprises: searching the original document and the translated document to find sentences in a "one-to-two" correspondence; merging the two corresponding sentences in the translated document; and, after the merge, moving the subsequent split sentences of the translated document up in order and adjusting the associations.
4. The method for aligning a Chinese-English bilingual translation corpus according to claim 1, characterized in that step 4 further comprises: traversing the whole paragraph, searching for specific terms, and checking how each one is translated; if the translation of a term is not the designated translation for that term, it is replaced with the designated translation.
CN201510592410.4A 2015-09-17 2015-09-17 Chinese-English bilingual translation corpus alignment method Pending CN105183722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510592410.4A CN105183722A (en) 2015-09-17 2015-09-17 Chinese-English bilingual translation corpus alignment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510592410.4A CN105183722A (en) 2015-09-17 2015-09-17 Chinese-English bilingual translation corpus alignment method

Publications (1)

Publication Number Publication Date
CN105183722A true CN105183722A (en) 2015-12-23

Family

ID=54905811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510592410.4A Pending CN105183722A (en) 2015-09-17 2015-09-17 Chinese-English bilingual translation corpus alignment method

Country Status (1)

Country Link
CN (1) CN105183722A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002001401A1 (en) * 2000-06-26 2002-01-03 Onerealm Inc. Method and apparatus for normalizing and converting structured content
CN101706777A (en) * 2009-11-10 2010-05-12 中国科学院计算技术研究所 Method and system for extracting resequencing template in machine translation
CN102043773A (en) * 2009-10-20 2011-05-04 张龙哺 Method and device for forming modularized bilingual sentence pairs
CN102622340A (en) * 2012-03-28 2012-08-01 成都优译信息技术有限公司 Translated file splitting and distributing method
CN103530284A (en) * 2013-09-22 2014-01-22 中国专利信息中心 Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method
CN104360996A (en) * 2014-11-27 2015-02-18 武汉传神信息技术有限公司 Sentence alignment method of bilingual text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
薛松: "汉英平行语料库中名词短语对齐算法的研究", 《中国优秀硕博士学位论文全文数据库(硕士)信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436865A (en) * 2016-05-25 2017-12-05 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN107436865B (en) * 2016-05-25 2020-10-16 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN106802753A (en) * 2016-12-21 2017-06-06 语联网(武汉)信息技术有限公司 Corpus alignment method and system
CN106775338A (en) * 2016-12-23 2017-05-31 语联网(武汉)信息技术有限公司 Method and system for aligning corpora by dragging
CN106775339A (en) * 2016-12-26 2017-05-31 语联网(武汉)信息技术有限公司 Method and system for adjusting corpus position by dragging and clicking
CN110807337A (en) * 2019-11-01 2020-02-18 北京中献电子技术开发有限公司 Patent double sentence pair processing method and system
CN110807337B (en) * 2019-11-01 2021-11-12 北京中献电子技术开发有限公司 Patent double sentence pair processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 610000 B, building 4, building 200, Tianfu five street, Chengdu hi tech Zone, Sichuan,

Applicant after: Chengdu excellent translation information technology Limited by Share Ltd

Address before: 610000, No. 1, building 107, 1 West Bauhinia Road, Chengdu hi tech Zone, Sichuan, 6

Applicant before: Chengdu Urelite Information technology Co., Ltd.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20151223